- Foster interest in CS 285, MAT 470, and Data Science Minor (all new)
- Demonstrate machine learning and data visualization techniques
- Share some of what I'm interested in
- Slides will be available at http://jpreszler.rbind.io
March 1, 2018
Examples:
This is one of the pillars of machine learning.
When the distribution of categories is highly skewed, we have class imbalance.
This makes classification harder: a model can look very accurate simply by always predicting the majority class.
Our problem: given data on an irreducible cubic polynomial \(f(x)\), will \(f\circ f(x)\) be irreducible?
Data: over \(200\) million irreducible cubics, of which only \(75\) have reducible iterates.
Build training set(s) from a sample of the data; build the test set from the rest.
Use the training set(s) to build model(s), and measure performance on the test set.
mkNER.R:
```r
library(data.table)   # fread
library(purrr)        # map

# bigFile (path) and nerSize (vector of sample sizes) are set earlier in mkNER.R.
# Read the big file of non-ER (NER) cubics; keep unique rows whose iterate is irreducible
bigNER <- fread(bigFile, header = TRUE, sep = ",")
bigNER <- bigNER[!duplicated(bigNER) & bigNER$numFact == 1, ]

# Draw one random sample of row indices for each requested NER training-set size
samps <- lapply(nerSize, function(x) sample(1:nrow(bigNER), x, replace = FALSE))
nerss <- map(samps, function(x) bigNER[x, ])

# Write each sample to its own file, e.g. NERtrain-600.csv
for (i in seq_along(nerSize)) {
  trsName <- paste(paste("NERtrain", nerSize[i], sep = "-"), "csv", sep = ".")
  write.csv(nerss[[i]], trsName, row.names = FALSE)
}
```
```r
library(tidyr)   # separate
library(dplyr)

# Put 70% of the rare ER polynomials into every training set (the rest are held out)
erIDX <- sample(1:length(er$cube), floor(0.7 * length(er$cube)), replace = FALSE)

for (i in nerTrainFiles) {
  ner <- loadTT(i)   # helper (defined elsewhere) that loads one NERtrain-*.csv file
  # Split the polynomial string into coefficient columns and drop unused fields
  ner <- separate(ner, poly, into = c("len", "const", "lin", "quad", "cube"),
                  sep = "[[ ]]+") %>%
    dplyr::select(c(-len, -content))
  # Combine the NER sample with the ER training rows and write train-<size>.csv
  tr <- rbind.data.frame(ner, er[erIDX, ])
  write.csv(tr, paste(paste("train", length(ner$cube), sep = "-"), "csv", sep = "."),
            row.names = FALSE)
  rm(ner); rm(tr)
}
```
For each of the 21 training sets, we'll build 9 models
That's 189 models!
Each model is built using 10-fold cross-validation and the "Kappa" error metric
We need parallelization to train multiple models at once and to run the cross-validation resamples concurrently
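To make the bookkeeping concrete, a loop like the following would fit every method to every training set. This is a minimal sketch, not the actual script: the `train-*.csv` file pattern, the shortened `methods` vector (three of caret's method codes standing in for the nine used), and the `fits` list are assumptions.

```r
library(caret)
library(doParallel)

# Hypothetical setup: the 21 training files written earlier and a few caret method codes
trainFiles <- list.files(pattern = "^train-.*\\.csv$")
methods <- c("glm", "knn", "rf")   # placeholder for the 9 methods used in the talk

cl <- makeCluster(detectCores())
registerDoParallel(cl)

fits <- list()
for (f in trainFiles) {
  trs <- read.csv(f)
  trs$numFact <- factor(trs$numFact)   # ensure a classification (not regression) fit
  for (m in methods) {
    # One 10-fold CV fit per (training set, method) pair: 21 x 9 = 189 models
    fits[[paste(f, m, sep = "_")]] <- train(
      numFact ~ const + lin + quad + cube + nSign + pSign + sigReal,
      data = trs, method = m, metric = "Kappa",
      trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE)
    )
  }
}
stopCluster(cl)
```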
Cross-validation:
- Split the training set into mini training/test set pairs
- Build and check models on the mini-sets with different hyperparameter values
- Build the final model on the full training set using the hyperparameters with the "best" error metric
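In caret, this hyperparameter search over the mini-sets is expressed with `trainControl()` and a tuning grid. A rough sketch for the random-forest case (the `mtry` values are illustrative, not the grid actually used):

```r
library(caret)

# Candidate values of the random forest's single tuning parameter (mtry);
# caret evaluates each value on every CV fold and keeps the best by Kappa
rfGrid <- expand.grid(mtry = c(2, 3, 5, 7))

fit <- train(numFact ~ const + lin + quad + cube + nSign + pSign + sigReal,
             data = trs1, method = "rf", metric = "Kappa",
             tuneGrid = rfGrid,
             trControl = trainControl(method = "cv", number = 10))

fit$bestTune   # hyperparameter value with the best cross-validated Kappa
```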
Kappa:
A standard error metric for imbalanced classifiers
Compares observed accuracy with the accuracy expected from random chance
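Concretely, this is Cohen's kappa: \(\kappa = \frac{p_o - p_e}{1 - p_e}\), where \(p_o\) is the observed accuracy and \(p_e\) is the accuracy expected by chance from the marginal class frequencies; \(\kappa \approx 0\) means no better than chance and \(\kappa = 1\) means perfect agreement.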
One of the models:
```r
library(caret)
library(doParallel)

# Register a parallel backend so the cross-validation folds run on all cores
cl <- makeCluster(detectCores())
registerDoParallel(cl)

# Random forest on training set 1, tuned by 10-fold CV with Kappa as the metric
tr1.rfs <- train(numFact ~ const + lin + quad + cube + nSign + pSign + sigReal,
                 data = trs1, method = "rf", metric = "Kappa",
                 trControl = trainControl(method = "cv", number = 10,
                                          allowParallel = TRUE))

# Predicted probability of class 2 (reducible iterate) for each test-set polynomial
tst$rfs <- predict(tr1.rfs, tst, type = "prob")[, 2]
stopCluster(cl)
```
Predicted vs. Actual | Act. 1 | Act. 2 |
---|---|---|
Pred. 1 | TN | FN |
Pred. 2 | FP | TP |
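As a sketch of how each set of counts is produced (assuming, per the earlier slide, that `tst$rfs` holds the predicted probability of class 2 and `tst$numFact` the true class, coded 1 = irreducible iterate, 2 = reducible iterate):

```r
theta <- 0.5                            # probability threshold for predicting class 2
pred  <- ifelse(tst$rfs >= theta, 2, 1)

TN <- sum(pred == 1 & tst$numFact == 1)   # correctly left alone
FN <- sum(pred == 1 & tst$numFact == 2)   # reducible iterates we missed
FP <- sum(pred == 2 & tst$numFact == 1)   # false alarms
TP <- sum(pred == 2 & tst$numFact == 2)   # reducible iterates we found
```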
Sample of the data frame with 1134 confusion matrices (one row per model, NER training-set size, and probability threshold \(\theta\))!
mdl | TN | TP | FP | FN | ner | theta |
---|---|---|---|---|---|---|
lrs | 7862 | 1 | 138 | 22 | 600 | 0.50 |
lrp | 7382 | 13 | 618 | 10 | 700 | 0.20 |
lrsq | 7900 | 5 | 100 | 18 | 2100 | 0.30 |
knn | 7876 | 17 | 124 | 6 | 800 | 0.50 |
lrs | 7761 | 8 | 239 | 15 | 1600 | 0.15 |
rfpp | 7943 | 10 | 57 | 13 | 2000 | 0.40 |
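Once the counts live in one data frame, model comparison is ordinary data manipulation. A minimal sketch, assuming the data frame above is called `cmDF` (a hypothetical name) with the columns shown:

```r
library(dplyr)

cmSummary <- cmDF %>%
  mutate(recall    = TP / (TP + FN),      # share of ER polynomials that were found
         precision = TP / (TP + FP)) %>%  # share of flagged polynomials that are ER
  arrange(desc(recall), desc(precision))

head(cmSummary)
```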
- Model Critique:
  - Why are certain ER polynomials missed?
  - Why are others found by certain models?
  - Interpret the KNN and RF models in context
- Additional Models:
  - GAMs
  - SVM
  - xgboost
- Use the best models to improve the search for ER polynomials and make conjectures