March 1, 2018

Goals

  • Foster interest in CS 285, MAT 470, and Data Science Minor (all new)
  • Demonstrate machine learning and data visualization techniques
  • Share some of what I'm interested in

    • Slides will be available on http://jpreszler.rbind.io

Background:
Classification and Class Imbalance

Classification

  • Classification problems predict which category an item belongs to.
  • Examples:

    • Is email spam?
    • Which of 5 people wrote this paper?
    • Is this transaction fraudulent?
  • This is one of the pillars of machine learning.

Class Imbalance

  • When the distribution of categories is highly skewed, we have class imbalance

  • This makes classification harder.

  • Our problem: given data on an irreducible cubic polynomial \(f(x)\), will \(f\circ f(x)\) be irreducible?

  • Data: over \(200\) million irreducible cubics, only \(75\) of which have reducible iterates.
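
To make the skew concrete, the positive-class rate is at most

\[ \frac{75}{2\times 10^{8}} \approx 3.8\times 10^{-7}, \]

i.e., roughly one case of emergent reducibility per \(2.7\) million irreducible cubics.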

Machine Learning Process

Machine Learning Workflow

  • Get data: C with FLINT and OpenMP to build the data set.
  • Build a training set (typically \(60\% - 80\%\) of the data)
  • Build a test set (the rest of the data)

  • Use the training set to build model(s), then measure performance with the test set (a generic split sketch follows).
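
A minimal sketch of such a split in R, with a placeholder data frame df and a 70% rate (the project-specific splitting code appears under My Process):

  # random 70/30 train-test split of a data frame df
  idx   <- sample(1:nrow(df), floor(0.7 * nrow(df)), replace = FALSE)
  train <- df[idx, ]
  test  <- df[-idx, ]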

Typical Imbalance Solution

  • Rebalance by inflating the rate of minority-class cases in the training set (one common recipe is sketched below).
  • Keep the test set's class distribution similar to the real-world one.
  • But how much should we adjust the class distribution?
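
One standard recipe, though not the one used below (which instead varies the majority-class count), is caret's upSample, which duplicates minority-class rows until the classes balance. A sketch assuming a training frame train with a factor column Class:

  library(caret)

  # duplicate minority-class rows until both classes have equal counts
  balanced <- upSample(x = train[, setdiff(names(train), "Class")],
                       y = train$Class, yname = "Class")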

My Process

Build Sets

  • Read data in with data.table
  • Remove duplicates
  • Build 21 training sets
  • Each has the same 52 ER (emergent reducibility) cases
  • Number of non-ER cases varies from 500 to 2500 by 100
  • Non-ER cases are sampled from the main dataset
  • One test set: the remaining 23 ER cases and 8000 non-ER

Building NER Sets

mkNER.R:

  library(data.table)   # fread
  library(purrr)        # map

  # read the full data set (bigFile: path set elsewhere) and keep
  # unique cubics whose iterate is irreducible (numFact == 1, non-ER)
  bigNER <- fread(bigFile, header = TRUE, sep = ",")
  bigNER <- bigNER[!duplicated(bigNER) & bigNER$numFact == 1, ]

  # non-ER training-set sizes: 500 to 2500 by 100 (21 sets)
  nerSize <- seq(500, 2500, by = 100)
  n <- nrow(bigNER)

  # sample row indices for each size (lapply keeps a list), extract rows
  samps <- lapply(nerSize, function(x) sample(1:n, x, replace = FALSE))
  nerss <- map(samps, function(x) bigNER[x, ])

  for (i in 1:length(nerSize)) {
    trsName <- paste(paste("NERtrain", nerSize[i], sep = "-"),
                     "csv", sep = ".")
    write.csv(nerss[[i]], trsName, row.names = FALSE)
  }

Building Training Sets

  library(tidyr)   # separate
  library(dplyr)   # select, %>%

  # hold out the same 70% of ER cases for every training set
  # (floor(0.7 * 75) = 52 ER training cases)
  erIDX <- sample(1:length(er$cube), floor(0.7 * length(er$cube)),
                  replace = FALSE)

  for (i in nerTrainFiles) {
    ner <- loadTT(i)   # helper (defined elsewhere) that reads a NERtrain csv
    # split the polynomial string into one column per coefficient
    ner <- separate(ner, poly,
                    into = c("len", "const", "lin", "quad", "cube"),
                    sep = "[[ ]]+") %>%
      dplyr::select(c(-len, -content))
    tr <- rbind.data.frame(ner, er[erIDX, ])
    write.csv(tr, paste(paste("train", length(ner$cube), sep = "-"),
                        "csv", sep = "."), row.names = FALSE)
    rm(ner); rm(tr)
  }

Model Building

For each of the 21 training sets, we'll build 9 models:

  • 3 logistic regressions with regularization (glmnet)
  • 4 random forests
  • naive Bayes and kNN
  • That's 189 models!

  • Each model is built using 10-fold cross-validation and the "Kappa" error metric

  • Parallelization is needed to train multiple models at once and to run CV folds concurrently

CV and Kappa

  • Cross-validation:

    • split the training set into mini training/test set pairs

    • build and check models on the mini-sets with different hyperparameter values

    • build the model on the full training set using the hyperparameters with the "best" error metric

  • Kappa:

    • A standard error metric for imbalanced classifiers

    • Compares observed accuracy with what's expected from random chance (definition below).
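
For reference, this is Cohen's kappa (a standard definition, not specific to this project), comparing the observed accuracy \(p_o\) to the accuracy \(p_e\) expected from guessing at the class rates:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

\(\kappa = 1\) is perfect agreement, while \(\kappa \approx 0\) means no better than chance, which is exactly the failure mode that raw accuracy hides under heavy imbalance.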

Model Building Code

One of the models:

  library(caret)
  library(doParallel)

  # spread the CV folds across all available cores
  cl <- makeCluster(detectCores())
  registerDoParallel(cl)

  # random forest on training set 1 (trs1), tuned on Kappa via 10-fold CV
  tr1.rfs <- train(numFact ~ const + lin + quad + cube + nSign + pSign + sigReal,
                   data = trs1, method = "rf", metric = "Kappa",
                   trControl = trainControl(method = "cv",
                                            number = 10, allowParallel = TRUE))

  # predicted probability of the second class (two factors, i.e. ER)
  tst$rfs <- predict(tr1.rfs, tst, type = "prob")[, 2]

  stopCluster(cl)

Model Performance

Confusion Matrices

  Predicted vs. Actual   Act. 1   Act. 2
  Pred. 1                  TN       FN
  Pred. 2                  FP       TP

  • Assign the predicted class from the probability \(p\) of having two factors by checking \(p \ge \theta\) (a counting sketch follows).
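
A minimal sketch of that thresholding step, assuming (as in the earlier code) that tst$numFact holds the true factor count coded 1/2 and a column like tst$rfs holds a model's predicted probability of two factors:

  # count TN/TP/FP/FN for one probability column at one threshold
  confusionCounts <- function(prob, actual, theta) {
    pred <- ifelse(prob >= theta, 2, 1)
    data.frame(TN = sum(pred == 1 & actual == 1),
               TP = sum(pred == 2 & actual == 2),
               FP = sum(pred == 2 & actual == 1),
               FN = sum(pred == 1 & actual == 2))
  }

  confusionCounts(tst$rfs, tst$numFact, theta = 0.5)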

Confusion Data Frame

Sample of the data frame with all 1134 confusion matrices (9 models \(\times\) 21 training sets \(\times\) 6 thresholds)!

  mdl     TN   TP   FP  FN   ner  theta
  lrs   7862    1  138  22   600   0.50
  lrp   7382   13  618  10   700   0.20
  lrsq  7900    5  100  18  2100   0.30
  knn   7876   17  124   6   800   0.50
  lrs   7761    8  239  15  1600   0.15
  rfpp  7943   10   57  13  2000   0.40

Visualizing Performance

ROC: Receiver Operating Characteristic, plotting true positive rate \(TP/(TP+FN)\) against false positive rate \(FP/(FP+TN)\)

ROC: Fix ner, vary \(\theta\)

ROC: Fix \(\theta\), vary ner
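
A hedged sketch of how these curves can be computed from the confusion data frame above (assumed here to be named confDF; the 600 is just an example value of ner):

  library(ggplot2)
  library(dplyr)

  # true and false positive rates from the confusion counts
  rocDF <- confDF %>%
    mutate(TPR = TP / (TP + FN),
           FPR = FP / (FP + TN))

  # fix ner, vary theta: one ROC curve per model
  rocDF %>%
    filter(ner == 600) %>%
    arrange(theta) %>%
    ggplot(aes(x = FPR, y = TPR, colour = mdl)) +
    geom_point() +
    geom_path(aes(group = mdl))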

Animated ROC

[Animated ROC figures: aniroc and anirocft]

ROC Summary

  • Generally kNN, followed by some of the random forest models, finds the most cases of emergent reducibility
  • Higher \(\theta\) and higher class imbalance generally cause models to perform worse
  • Exact changes depend on the model
  • Logistic regression is the most susceptible to noise
  • See the post on http://jpreszler.rbind.io for a comparison with unregularized logistic regression
  • ROC helps compare models, the threshold \(\theta\), and class imbalance
  • But which polynomials are found by each model?

Heatmaps

Animated Heatmaps
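
As a hedged sketch of how such an animated heatmap could be built with gganimate, assuming a hypothetical data frame foundDF with columns mdl, poly, ner, and a logical found (whether that model found that ER polynomial):

  library(ggplot2)
  library(gganimate)

  # one tile per (model, ER polynomial), animated over the class imbalance
  ggplot(foundDF, aes(x = mdl, y = poly, fill = found)) +
    geom_tile() +
    transition_states(ner) +
    labs(title = "non-ER training cases: {closest_state}",
         x = "model", y = "ER polynomial")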

Future Plans

  • Model Critique:

    • Why are certain ER polynomials missed?
    • Why are others found by certain models?
    • Interpret kNN and RF models in context

  • Additional Models:

    • GAMs
    • SVM
    • xgboost

  • Use the best models to improve the search for ER and make conjectures