- Foster interest in CS 285, MAT 470, and Data Science Minor (all new)
- Demonstrate machine learning and data visualization techniques
- Share some of what I'm interested in
- Slides will be available at http://jpreszler.rbind.io
Jason Preszler
March 1, 2018
This is one of the pillars of machine learning.
When the distribution of categories is highly skewed, we have class imbalance.
This makes classification harder.
Our problem: given data on irreducible cubic polynomial f(x), will f∘f(x) be irreducible?
Data: over 200 million irreducible cubics, 75 have reducible iterates.
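To make the scale of the imbalance concrete, here is a rough sketch of the arithmetic. The slide only says "over 200 million", so the round figure `2e8` is an assumption used as a lower bound; the variable names are illustrative.

```r
# Assumed count: the slide says "over 200 million", so 2e8 is a round lower bound.
n_total <- 2e8
# Cubics with a reducible iterate, as stated on the slide.
n_reducible <- 75

# Share of the data that belongs to the rare class.
minority_share <- n_reducible / n_total

# Accuracy of a trivial classifier that always predicts "irreducible".
baseline_accuracy <- 1 - minority_share

cat(sprintf("minority share:    %.2e\n", minority_share))
cat(sprintf("baseline accuracy: %.7f\n", baseline_accuracy))
```

A classifier that never finds a single reducible iterate would still look nearly perfect by plain accuracy, which is why accuracy alone is a poor guide here.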
Build test set (the rest of the data)
Use training set to build model(s), measure performance using test set.
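    library(data.table)  # fread
    library(purrr)       # map

    # Read the data, drop duplicate rows, and keep rows with numFact == 1.
    bigNER <- fread(bigFile, header=TRUE, sep=",")
    bigNER <- bigNER[!duplicated(bigNER) & bigNER$numFact==1,]

    # Draw one random sample of row indices per training-set size.
    # nerSize and n are defined elsewhere (n is presumably nrow(bigNER)).
    samps <- sapply(nerSize, function(x) sample(1:n, x, replace = FALSE))
    nerss <- map(samps, function(x) bigNER[x,])

    # Write each sample to NERtrain-<size>.csv.
    for(i in seq_along(nerSize)){
      trsName <- paste(paste("NERtrain",nerSize[i], sep = "-"),"csv",sep=".")
      write.csv(nerss[[i]],trsName, row.names = FALSE)
    }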
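    library(tidyr)  # separate
    library(dplyr)  # select, %>%

    # Use 70% of the ER polynomials in every training set.
    erIDX <- sample(1:length(er$cube), .7*length(er$cube), replace=FALSE)

    # Combine each NER sample with the ER training rows.
    # loadTT is a helper defined elsewhere.
    for(i in nerTrainFiles){
      ner <- loadTT(i)
      # Split the coefficient string into one column per coefficient.
      ner <- separate(ner, poly, into=c("len","const","lin","quad","cube"), sep="[[ ]]+") %>%
        dplyr::select(c(-len,-content))
      tr <- rbind.data.frame(ner, er[erIDX,])
      write.csv(tr, paste(paste("train", length(ner$cube), sep="-"), "csv",sep="."),row.names=FALSE)
      rm(ner)
      rm(tr)
    }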
For each of the 21 training sets, we'll build 9 models
That's 189 models!
Each model is built using 10-fold cross-validation and the "Kappa" error metric
We need parallelization to train multiple models at once and to run multiple CV folds
- Split the training set into mini-training/test set pairs
- Build the model and check it on the mini-sets with different hyperparameter values
- Build the model on the full training set using the hyperparameters with the "best" error metric
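The three cross-validation steps can be sketched by hand in base R. This is only an illustration under stated assumptions: the data are synthetic, the model is a logistic regression whose hyperparameter is a polynomial degree, and plain accuracy stands in for Kappa to keep the code short. None of this is the talk's actual setup, where `caret::train` handles these steps.

```r
set.seed(1)

# Synthetic two-class data (hypothetical, not the polynomial data from the talk).
n <- 200
x <- rnorm(n)
y <- factor(ifelse(x^2 + rnorm(n, sd = 0.5) > 1, "yes", "no"))
dat <- data.frame(x = x, y = y)

# Step 1: split the training set into k mini-training/test pairs.
k <- 10
fold <- sample(rep(1:k, length.out = n))

# Candidate hyperparameter values: degree of the polynomial in x.
degrees <- 1:3

# Step 2: for each value, fit on k-1 folds and score on the held-out fold.
cv_acc <- sapply(degrees, function(d) {
  mean(sapply(1:k, function(f) {
    fit <- glm(y ~ poly(x, d), data = dat[fold != f, ], family = binomial)
    p <- predict(fit, newdata = dat[fold == f, ], type = "response")
    pred <- ifelse(p > 0.5, "yes", "no")
    mean(pred == dat$y[fold == f])
  }))
})

# Step 3: refit on the full training set with the best-scoring value.
best_degree <- degrees[which.max(cv_acc)]
final_fit <- glm(y ~ poly(x, best_degree), data = dat, family = binomial)

print(data.frame(degree = degrees, cv_accuracy = cv_acc))
```

Each of the k inner fits is independent of the others, which is what makes cross-validation easy to run in parallel.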
Standard error metric for imbalanced classifiers
Compares observed accuracy with what's expected from random chance.
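As a sketch of what Kappa measures, the function below computes Cohen's kappa from the four cells of a 2x2 confusion matrix. The function name `cohen_kappa` and the example counts are illustrative assumptions, not taken from the talk's code.

```r
# Cohen's kappa from the four cells of a 2x2 confusion matrix.
cohen_kappa <- function(TN, TP, FP, FN) {
  n <- TN + TP + FP + FN
  # Observed agreement is the ordinary accuracy.
  observed <- (TN + TP) / n
  # Chance agreement: product of actual and predicted marginals, summed over classes.
  expected <- ((TN + FP) * (TN + FN) + (TP + FN) * (TP + FP)) / n^2
  (observed - expected) / (1 - expected)
}

# Illustrative counts: a classifier that always predicts the majority class.
always_majority <- cohen_kappa(TN = 8000, TP = 0, FP = 0, FN = 23)

# Illustrative counts: a classifier that gets every case right.
perfect <- cohen_kappa(TN = 8000, TP = 23, FP = 0, FN = 0)

print(c(always_majority = always_majority, perfect = perfect))
```

The always-majority classifier has very high accuracy but a kappa of zero, because all of its agreement is what chance alone would produce.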
One of the models:
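    library(caret)
    library(doParallel)

    # Start one worker per core and register the cluster as the parallel backend.
    cl <- makeCluster(detectCores())
    registerDoParallel(cl)

    # Random forest tuned by 10-fold cross-validation, selecting on Kappa.
    tr1.rfs <- train(numFact~const+lin+quad+cube+nSign+pSign+sigReal,
                     data=trs1, method="rf", metric = "Kappa",
                     trControl = trainControl(method="cv", number = 10,
                                              allowParallel = TRUE))

    # Store the predicted probability of the second class for each test row.
    tst$rfs <- predict(tr1.rfs, tst, type = "prob")[,2]

    stopCluster(cl)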
| Predicted vs. Actual | Act. 1 | Act. 2 |
|---|---|---|
| Pred. 1 | TN | FN |
| Pred. 2 | FP | TP |
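A predicted probability only becomes a confusion matrix once a cutoff is chosen. The sketch below assumes that the `theta` column in the results table is such a cutoff on the class 2 probability and that the comparison is `>=`; both are assumptions. The probabilities, labels, and the function name `confusion_counts` are hypothetical.

```r
# Hypothetical class 2 probabilities and true labels (illustrative only).
prob2  <- c(0.05, 0.10, 0.35, 0.60, 0.20, 0.80)
actual <- c(1, 1, 1, 1, 2, 2)

# Predict class 2 when the probability reaches theta, then count the four cells.
confusion_counts <- function(prob2, actual, theta) {
  pred <- ifelse(prob2 >= theta, 2, 1)
  c(TN = sum(pred == 1 & actual == 1),
    TP = sum(pred == 2 & actual == 2),
    FP = sum(pred == 2 & actual == 1),
    FN = sum(pred == 1 & actual == 2))
}

at_half  <- confusion_counts(prob2, actual, theta = 0.5)
at_fifth <- confusion_counts(prob2, actual, theta = 0.2)

print(rbind(at_half, at_fifth))
```

Lowering the cutoff trades false negatives for false positives, so one fitted model yields a different confusion matrix for each value of theta.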
Sample of Data Frame with 1134 confusion matrices!
| mdl | TN | TP | FP | FN | ner | theta |
|---|---|---|---|---|---|---|
| lrs | 7862 | 1 | 138 | 22 | 600 | 0.50 |
| lrp | 7382 | 13 | 618 | 10 | 700 | 0.20 |
| lrsq | 7900 | 5 | 100 | 18 | 2100 | 0.30 |
| knn | 7876 | 17 | 124 | 6 | 800 | 0.50 |
| lrs | 7761 | 8 | 239 | 15 | 1600 | 0.15 |
| rfpp | 7943 | 10 | 57 | 13 | 2000 | 0.40 |
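With so few positives, recall and precision are easier to read than raw counts. The sketch below rebuilds the six sample rows as a data frame and adds both rates; the data frame name `conf` and the derived column names are illustrative choices, not from the talk's code.

```r
# The six sample rows shown above.
conf <- data.frame(
  mdl   = c("lrs", "lrp", "lrsq", "knn", "lrs", "rfpp"),
  TN    = c(7862, 7382, 7900, 7876, 7761, 7943),
  TP    = c(1, 13, 5, 17, 8, 10),
  FP    = c(138, 618, 100, 124, 239, 57),
  FN    = c(22, 10, 18, 6, 15, 13),
  ner   = c(600, 700, 2100, 800, 1600, 2000),
  theta = c(0.50, 0.20, 0.30, 0.50, 0.15, 0.40)
)

# Recall: share of actual positives that were found.
conf$recall <- conf$TP / (conf$TP + conf$FN)
# Precision: share of predicted positives that are real.
conf$precision <- conf$TP / (conf$TP + conf$FP)

# Rows sorted from highest to lowest recall.
print(conf[order(-conf$recall), c("mdl", "ner", "theta", "recall", "precision")])
```

Every row has the same totals, 8000 actual negatives and 23 actual positives, so the rows are directly comparable. Among these six, the knn row finds the most positives (17 of 23), while the rfpp row has the fewest false positives.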
- Model critique:
  - Why are certain ER polynomials missed?
  - Why are others found by certain models?
  - Interpret KNN and RF models in context
- Additional models:
  - GAMs
  - SVM
  - xgboost
- Use the best models to improve the search for ER polynomials and make conjectures