--- title: "Visualizing Classifier Performance" author: "Jason Preszler" date: "March 1, 2018" output: ioslides_presentation: incremental: true widescreen: true logo: CofI-vert.png --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) ``` ## Goals - Foster interest in CS 285, MAT 470, and Data Science Minor (all new) - Demonstrate machine learning and data visualization techniques - Share some of what I'm interested in -Slides will be available on http://jpreszler.rbind.io # Background:
Classification and Class Imbalance ## Classification - Classification problems predict which category an item belongs to. - Examples: - *Is email spam?* - *Which of 5 people wrote this paper?* - *Is this transaction fraudulent?* - This is one of the pillars of machine learning. ## Class Imbalance - When distribution of categories is highly skewed,
we have **class imbalance** - This makes classification harder. - Our problem: *given data on irreducible cubic polynomial $f(x)$, will $f\circ f(x)$ be irreducible?* - Data: over $200$ million irreducible cubics, $75$ have reducible iterates. # Machine Learning Process ## Machine Learning Workflow - Get data: C with FLiNT and OpenMP to build data set. - Build Training Set (typically $60\% - 80\%$ of data) - Build Test Set (the rest of data) - Use training set to build model(s), measure performance using test set. ## Typical Imbalance Solution - Rebalance by inflating rate of low-class cases in training set. - Keep test set class distribution similar to real-world. - But how much should we adjust class distributions by? # My Process ## Build Sets - Read data in with data.table - Remove duplicates - Build 21 training sets - Each has same 52 ER cases - Number of non-ER varies from 500 to 2500 by 100 - Non-ER cases are sampled from main dataset - One test set, 23 ER cases and 8000 non-ER ## Build NER R code mkNER.R: ```{r setBuild, echo=TRUE, eval=FALSE} bigNER <- fread(bigFile, header=TRUE, sep=",") bigNER <- bigNER[!duplicated(bigNER) & bigNER$numFact==1,] samps <- sapply(nerSize, function(x) sample(1:n, x, replace = FALSE)) nerss <- map(samps, function(x) bigNER[x,]) for(i in 1:length(nerSize)){ trsName <- paste(paste("NERtrain",nerSize[i], sep = "-"),"csv",sep=".") write.csv(nerss[[i]],trsName, row.names = FALSE) } ``` ## Building Training Sets ```{r buildTrain, eval=FALSE, echo=TRUE} erIDX <- sample(1:length(er$cube), .7*length(er$cube), replace=FALSE) for(i in nerTrainFiles){ ner <- loadTT(i) ner <- separate(ner, poly, into=c("len","const","lin","quad","cube"), sep="[[ ]]+") %>% dplyr::select(c(-len,-content)) tr <- rbind.data.frame(ner, er[erIDX,]) write.csv(tr, paste(paste("train", length(ner$cube), sep="-"), "csv",sep="."),row.names=FALSE) rm(ner) rm(tr) } ``` ## Model Building For each of the 21 training sets, we'll build 9 models - 3 logistic regression with **regularization** (glmnet) - 4 random forests - naive bayes, knn - That's 189 models! - Each model build using 10-fold cross validation and "Kappa" error metric - Need Parallelization to train multiple models at once, and multiple CV runs ## CV and Kappa - Cross-validation: - split training set into mini-training/test set pairs - build model and check model on mini-sets with different hyperparameter values - build model on full training set using hyperparameters with "best" error metric - Kappa: - Standard error metric for imbalanced classifiers - Compares observed accuracy with what's expected from random chance. ## Model Building Code One of the models: ```{r modelBuild, echo=TRUE, eval=FALSE} library(caret) library(doParallel) cl <- makeCluster(detectCores()) registerDoParallel(cl) tr1.rfs <- train(numFact~const+lin+quad+cube+nSign+pSign+sigReal, data=trs1, method="rf", metric = "Kappa", trControl = trainControl(method="cv", number = 10, allowParallel = TRUE)) tst$rfs <- predict(tr1.rfs, tst, type = "prob")[,2] stopCluster(cl) ``` # Model Performance ## Confusion Matrices ```{r confmatDef, echo=FALSE, message=FALSE, warning=FALSE, results='asis'} confMat <- " | Predicted vs. Actual | Act. 1 | Act. 2 | |:------------------:|:------:|:------:| | Pred. 1 | TN | FN | | Pred. 2 | FP | TP | " cat(confMat) ``` - Assign prediction class from probabilities $p$ of having 2 factors by checking $p \ge \theta$. ## Confusion Data Frame Sample of Data Frame with 1134 confusion matrices! ```{r confDF, echo=FALSE, warning=FALSE, message=FALSE} library(readr) pred_confMatrix <- read_csv("pred-confMatrix.csv") library(knitr) kable(pred_confMatrix[sample(1:length(pred_confMatrix$mdl), 6, replace=FALSE),]) ``` # Visualizing Performance ## ROC: Receiver Operating Characteristic ```{r rocEX, echo=FALSE, warning=FALSE, message=FALSE} library(ggplot2) library(plotROC) D.ex <- rbinom(200 , size=1,prob=.5) M <- rnorm(200, mean=D.ex, sd=.65) df <- data.frame(D = D.ex, M=M) ggplot(df, aes(d=D, m=M))+geom_roc(n.cuts=0) ``` ## ROC: Fix ner, vary $\theta$ ```{r rocTheta, echo=FALSE, warning=FALSE, message=FALSE} library(dplyr) cbbPalette <- c("#000000", "#009E73", "#e79f00", "#9ad0f3", "#0072B2", "#D55E00", "#CC79A7", "#F0E442" ) cb11<-c('#000000','#1f78b4','#0000ff','#33a02c','#fb9a99','#e31a1c','#fdbf6f','#551151','#cab2d6','#6a3d9a','#ffff99') c9<-c('#000000','#800000','#f58231','#e6194b','#aa6e28','#3cb44b','#0082c8','#0000ff','#911eb4') filter(pred_confMatrix, ner==800) %>% ggplot(aes(y= TP/(TP+FN),x=FP/(TN+FP), col=mdl,shape=as.factor(theta)))+geom_point(size=3, position="jitter")+ ggtitle("ROC Plot with ner==800")+scale_color_manual(values=c9) ``` ## ROC: Fix $\theta$, vary ner ```{r rocNER, echo=FALSE, warning=FALSE, message=FALSE} filter(pred_confMatrix, theta==.25) %>% ggplot(aes(y= TP/(TP+FN),x=FP/(TN+FP), col=mdl))+geom_line()+ggtitle('ROC Plot with theta = .25')+scale_color_manual(values=c9) ``` ## Animated ROC
```{r aniROC, echo=FALSE, warning=FALSE, message=FALSE} library(gganimate) g <- ggplot(pred_confMatrix, aes(x = FP/(TN+FP), y= TP/(TP+FN), col=mdl, frame=ner))+geom_path(aes(cumulative=TRUE, group=mdl))+ggtitle("ROC paths")+scale_color_manual(values=c9)+facet_wrap(~as.factor(theta)) gganimate(g, "aniROC.gif") ``` ![aniroc](aniROC.gif) ```{r aniROCft, message=FALSE, warning=FALSE, echo=FALSE} g <- filter(pred_confMatrix, theta==.25)%>%ggplot(aes(x = FP/(TN+FP), y= TP/(TP+FN), col=mdl, frame=ner))+geom_path(aes(cumulative=TRUE, group=mdl))+ggtitle("ROC paths, theta .25 ")+scale_color_manual(values=c9) gganimate(g, "aniROCft.gif") ``` ![anirocft](aniROCft.gif)
## ROC Summary - Generally knn followed by some of the random forest models find the most cases of emergent reducibility - Higher theta, and higher class imbalance causes models to generally perform worse - exact changes depend on the model - Logistic Regression most susceptible to noise - see post on http://jpreszler.rbind.io to see comparison without regularization on logistic regression. - ROC helps compare models, theta threshold, and class imbalance - But which polynomials are found by each model? ## Heatmaps ```{r heatSetup, echo=FALSE, warning=FALSE, message=FALSE} library(tidyr) predDF <- read.csv("pred-theta25.csv", header=TRUE) hmPred <- filter(predDF, numFact==2) %>% dplyr::select(-ner, -theta, -numFact) %>% gather(key=model,value=prediction, -poly) %>% group_by(poly, model) %>% summarise(predCNT = (sum(prediction-1)), predMean = sum(prediction)/n(), nerCnt = n()) hmPred$model <- factor(hmPred$model) ``` ```{r heatStatic, echo=FALSE, warning=FALSE, message=FALSE, fig.width=8} #ggplot(hmPred, aes(x=model, y=poly, fill=predCNT))+geom_tile()+ggtitle("Predictions Summed over NER ratio")#+scale_fill_gradientn(colors=rainbow(5)) ggplot(hmPred, aes(y=model, x=poly, fill=predMean))+geom_tile(alpha = .6)+ggtitle("Prediction Mean over NER ratio") + theme(axis.text.x = element_text(angle = 90, hjust = 1))+ guides(colour = guide_legend(override.aes = list(alpha = .6))) ``` ## Animated Heatmaps ```{r aniHeat, echo=FALSE, warning=FALSE, message=FALSE, fig.width=9} library(tidyr) predByner<- predDF %>% filter(numFact==2) %>% select(-numFact, -theta) %>% gather(key=model, value=prediction, -c(poly,ner)) predByner$model <- factor(predByner$model) predByner$prediction <- as.factor(predByner$prediction) ahm <- ggplot(predByner, aes(y=model, x=poly, fill=prediction, frame=ner))+geom_tile(alpha=.4)+ggtitle("Animated Heatmap by NER ratio")+theme(axis.text.x = element_text(angle = 90, hjust = 1)) + coord_fixed() gganimate(ahm, "ahm.gif", ani.width=1000) ``` ![](ahm.gif) ## Future Plans -Model Critique: -Why are certain ER polynomials missed? -Why are others found by certain models? -Interpret KNN and RF models in context -Additional Models: -GAMs -SVM -xgboost -Use best models to improve search for ER and make conjectures