---
title: "Visualizing Classifier Performance"
author: "Jason Preszler"
date: "March 1, 2018"
output:
  ioslides_presentation:
    incremental: true
    widescreen: true
    logo: CofI-vert.png
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
## Goals
- Foster interest in CS 285, MAT 470, and Data Science Minor (all new)
- Demonstrate machine learning and data visualization techniques
- Share some of what I'm interested in
- Slides will be available at http://jpreszler.rbind.io
# Background: Classification and Class Imbalance
## Classification
- Classification problems predict which category an item belongs to.
- Examples:
    - *Is this email spam?*
    - *Which of 5 people wrote this paper?*
    - *Is this transaction fraudulent?*
- This is one of the pillars of machine learning.
## Class Imbalance
- When the distribution of categories is highly skewed, we have **class imbalance**
- This makes classification harder.
- Our problem: *given data on irreducible cubic polynomial $f(x)$, will $f\circ f(x)$ be irreducible?*
- Data: over $200$ million irreducible cubics, $75$ have reducible iterates.
# Machine Learning Process
## Machine Learning Workflow
- Get data: C with FLINT and OpenMP to build the data set.
- Build Training Set (typically $60\% - 80\%$ of data)
- Build Test Set (the rest of data)
- Use training set to build model(s), measure performance using test set.
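The split step above can be sketched on a toy data frame (`df`, `idx`, and the 70/30 ratio here are illustrative, not the data used in the talk):

```r
# A minimal sketch of a 70/30 train/test split; df is a toy stand-in
# for the real polynomial data set.
set.seed(1)
df <- data.frame(x = rnorm(100), y = sample(1:2, 100, replace = TRUE))
idx <- sample(nrow(df), 0.7 * nrow(df))   # 70% of the row indices
train <- df[idx, ]                        # training set
test  <- df[-idx, ]                       # the remaining 30%
```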
## Typical Imbalance Solution
- Rebalance by inflating rate of low-class cases in training set.
- Keep test set class distribution similar to real-world.
- But by how much should we adjust the class distributions?
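One way to inflate the minority-class rate, as a minimal sketch (the `rebalance` function and its `ratio` argument are illustrative, not the code used for this talk): keep every minority-class row and sample the majority class down to a fixed multiple of the minority count.

```r
# Sketch: rebalance by keeping all minority rows and sampling the
# majority class down to ratio * (number of minority rows).
rebalance <- function(df, class_col, minority, ratio = 10) {
  min_rows <- df[df[[class_col]] == minority, ]
  maj_rows <- df[df[[class_col]] != minority, ]
  keep <- maj_rows[sample(nrow(maj_rows), ratio * nrow(min_rows)), ]
  rbind(min_rows, keep)
}
```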
# My Process
## Build Sets
- Read data in with data.table
- Remove duplicates
- Build 21 training sets:
    - Each has the same 52 ER cases
    - Number of non-ER cases varies from 500 to 2500 by 100
    - Non-ER cases are sampled from the main data set
- One test set: 23 ER cases and 8000 non-ER
## Build NER R code
mkNER.R:
```{r setBuild, echo=TRUE, eval=FALSE}
library(data.table)  # fread()
library(purrr)       # map()

nerSize <- seq(500, 2500, by = 100)  # non-ER training-set sizes

# bigFile holds the path to the full non-ER data set
bigNER <- fread(bigFile, header = TRUE, sep = ",")
bigNER <- bigNER[!duplicated(bigNER) & bigNER$numFact == 1, ]
n <- nrow(bigNER)

# Draw one index sample per training-set size, then subset
samps <- lapply(nerSize, function(x) sample(1:n, x, replace = FALSE))
nerss <- map(samps, function(x) bigNER[x, ])

# Write each non-ER subset to its own CSV, e.g. NERtrain-500.csv
for (i in seq_along(nerSize)) {
  trsName <- paste(paste("NERtrain", nerSize[i], sep = "-"),
                   "csv", sep = ".")
  write.csv(nerss[[i]], trsName, row.names = FALSE)
}
```
## Building Training Sets
```{r buildTrain, eval=FALSE, echo=TRUE}
library(dplyr)
library(tidyr)  # separate()

# Hold out 70% of the ER cases for training; the rest go to the test set
erIDX <- sample(1:length(er$cube), 0.7 * length(er$cube), replace = FALSE)

for (i in nerTrainFiles) {
  ner <- loadTT(i)  # helper that reads one non-ER training file
  # Split the polynomial string into coefficient columns
  ner <- separate(ner, poly,
                  into = c("len", "const", "lin", "quad", "cube"),
                  sep = "[[ ]]+") %>%
    dplyr::select(c(-len, -content))
  tr <- rbind.data.frame(ner, er[erIDX, ])
  write.csv(tr, paste(paste("train", length(ner$cube), sep = "-"),
                      "csv", sep = "."), row.names = FALSE)
  rm(ner)
  rm(tr)
}
```
## Model Building
For each of the 21 training sets, we'll build 9 models:

- 3 logistic regressions with **regularization** (glmnet)
- 4 random forests
- naive Bayes, kNN
- That's 189 models!
- Each model is built using 10-fold cross-validation and the "Kappa" error metric
- Need parallelization to train multiple models at once, and multiple CV runs
## CV and Kappa
- Cross-validation:
    - split the training set into mini-training/test set pairs
    - build and check models on the mini-sets with different hyperparameter values
    - build a model on the full training set using the hyperparameters with the "best" error metric
- Kappa:
    - Standard error metric for imbalanced classifiers
    - Compares observed accuracy with what's expected from random chance
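Kappa can be computed directly from a confusion matrix; here is a minimal sketch (the function name is illustrative; caret computes this metric internally):

```r
# Cohen's Kappa: (observed accuracy - chance accuracy) / (1 - chance accuracy)
# cm is a 2x2 confusion matrix with rows = predicted, columns = actual.
kappa_stat <- function(cm) {
  n  <- sum(cm)
  po <- sum(diag(cm)) / n                     # observed accuracy
  pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # accuracy expected by chance
  (po - pe) / (1 - pe)
}

kappa_stat(matrix(c(40, 10, 5, 45), nrow = 2))  # 0.7
```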
## Model Building Code
One of the models:
```{r modelBuild, echo=TRUE, eval=FALSE}
library(caret)
library(doParallel)

# Spin up one worker per core so caret can parallelize the CV folds
cl <- makeCluster(detectCores())
registerDoParallel(cl)

tr1.rfs <- train(numFact ~ const + lin + quad + cube + nSign + pSign + sigReal,
                 data = trs1, method = "rf", metric = "Kappa",
                 trControl = trainControl(method = "cv",
                                          number = 10, allowParallel = TRUE))

# Store the predicted probability of the second class (2 factors)
tst$rfs <- predict(tr1.rfs, tst, type = "prob")[, 2]
stopCluster(cl)
```
# Model Performance
## Confusion Matrices
```{r confmatDef, echo=FALSE, message=FALSE, warning=FALSE, results='asis'}
confMat <- "
| Predicted vs. Actual | Act. 1 | Act. 2 |
|:------------------:|:------:|:------:|
| Pred. 1 | TN | FN |
| Pred. 2 | FP | TP |
"
cat(confMat)
```
- Assign prediction class from probabilities $p$ of having 2 factors by checking $p \ge \theta$.
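Thresholding a toy probability vector shows how each $\theta$ yields one confusion matrix (the probabilities and actual classes below are illustrative):

```r
# Sketch: convert class-2 probabilities into predicted classes at theta,
# then cross-tabulate against the actual classes.
theta  <- 0.5
p      <- c(0.9, 0.2, 0.7, 0.4)           # toy P(2 factors) values
actual <- factor(c(2, 1, 1, 2), levels = c(1, 2))
pred   <- factor(ifelse(p >= theta, 2, 1), levels = c(1, 2))
table(Predicted = pred, Actual = actual)
```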
## Confusion Data Frame
A sample from the data frame of 1134 confusion matrices!
```{r confDF, echo=FALSE, warning=FALSE, message=FALSE}
library(readr)
pred_confMatrix <- read_csv("pred-confMatrix.csv")
library(knitr)
kable(pred_confMatrix[sample(nrow(pred_confMatrix), 6, replace = FALSE), ])
```
# Visualizing Performance
## ROC: Receiver Operating Characteristic
```{r rocEX, echo=FALSE, warning=FALSE, message=FALSE}
library(ggplot2)
library(plotROC)
D.ex <- rbinom(200, size = 1, prob = 0.5)
M <- rnorm(200, mean = D.ex, sd = 0.65)
df <- data.frame(D = D.ex, M = M)
ggplot(df, aes(d = D, m = M)) + geom_roc(n.cuts = 0)
```
## ROC: Fix ner, vary $\theta$
```{r rocTheta, echo=FALSE, warning=FALSE, message=FALSE}
library(dplyr)
c9 <- c('#000000','#800000','#f58231','#e6194b','#aa6e28',
        '#3cb44b','#0082c8','#0000ff','#911eb4')
filter(pred_confMatrix, ner == 800) %>%
  ggplot(aes(y = TP/(TP+FN), x = FP/(TN+FP),
             col = mdl, shape = as.factor(theta))) +
  geom_point(size = 3, position = "jitter") +
  ggtitle("ROC Plot with ner == 800") +
  scale_color_manual(values = c9)
```
## ROC: Fix $\theta$, vary ner
```{r rocNER, echo=FALSE, warning=FALSE, message=FALSE}
filter(pred_confMatrix, theta == .25) %>%
  ggplot(aes(y = TP/(TP+FN), x = FP/(TN+FP), col = mdl)) +
  geom_line() +
  ggtitle("ROC Plot with theta = .25") +
  scale_color_manual(values = c9)
```
## Animated ROC