- Foster interest in CS 285, MAT 470, and Data Science Minor (all new)
- Demonstrate machine learning and data visualization techniques
- Share some of what I'm interested in
- Slides will be available at http://jpreszler.rbind.io
March 1, 2018
Examples:
This is one of the pillars of machine learning.
When the distribution of categories is highly skewed, we have class imbalance.
This makes classification harder: a model can look very accurate simply by always predicting the majority class.
Our problem: given data on an irreducible cubic polynomial \(f(x)\), will \(f\circ f(x)\) be irreducible?
Data: over \(200\) million irreducible cubics, of which only \(75\) have reducible iterates.
Build training set(s) from a sample of the data; build the test set from the rest.
Use the training set(s) to build model(s), and measure performance on the test set.
mkNER.R:
```r
library(data.table)   # fread
library(purrr)        # map

# bigFile (path) and nerSize (vector of sample sizes) are set earlier in mkNER.R.
# Read the big file of non-ER (NER) cubics; keep unique rows whose iterate is irreducible
bigNER <- fread(bigFile, header = TRUE, sep = ",")
bigNER <- bigNER[!duplicated(bigNER) & bigNER$numFact == 1, ]

# Draw one random sample of row indices for each requested NER training-set size
samps <- lapply(nerSize, function(x) sample(1:nrow(bigNER), x, replace = FALSE))
nerss <- map(samps, function(x) bigNER[x, ])

# Write each sample to its own file, e.g. NERtrain-600.csv
for (i in seq_along(nerSize)) {
  trsName <- paste(paste("NERtrain", nerSize[i], sep = "-"), "csv", sep = ".")
  write.csv(nerss[[i]], trsName, row.names = FALSE)
}
```
```r
library(tidyr)   # separate
library(dplyr)

# Put 70% of the rare ER polynomials into every training set (the rest are held out)
erIDX <- sample(1:length(er$cube), floor(0.7 * length(er$cube)), replace = FALSE)

for (i in nerTrainFiles) {
  ner <- loadTT(i)   # helper (defined elsewhere) that loads one NERtrain-*.csv file
  # Split the polynomial string into coefficient columns and drop unused fields
  ner <- separate(ner, poly, into = c("len", "const", "lin", "quad", "cube"),
                  sep = "[[ ]]+") %>%
    dplyr::select(c(-len, -content))
  # Combine the NER sample with the ER training rows and write train-<size>.csv
  tr <- rbind.data.frame(ner, er[erIDX, ])
  write.csv(tr, paste(paste("train", length(ner$cube), sep = "-"), "csv", sep = "."),
            row.names = FALSE)
  rm(ner); rm(tr)
}
```
For each of the 21 training sets, we'll build 9 models
That's 189 models!
Each model is built using 10-fold cross-validation and the "Kappa" error metric
We need parallelization to train multiple models at once and to run the cross-validation resamples concurrently
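To make the bookkeeping concrete, a loop like the following would fit every method to every training set. This is a minimal sketch, not the actual script: the `train-*.csv` file pattern, the shortened `methods` vector (three of caret's method codes standing in for the nine used), and the `fits` list are assumptions.

```r
library(caret)
library(doParallel)

# Hypothetical setup: the 21 training files written earlier and a few caret method codes
trainFiles <- list.files(pattern = "^train-.*\\.csv$")
methods <- c("glm", "knn", "rf")   # placeholder for the 9 methods used in the talk

cl <- makeCluster(detectCores())
registerDoParallel(cl)

fits <- list()
for (f in trainFiles) {
  trs <- read.csv(f)
  trs$numFact <- factor(trs$numFact)   # ensure a classification (not regression) fit
  for (m in methods) {
    # One 10-fold CV fit per (training set, method) pair: 21 x 9 = 189 models
    fits[[paste(f, m, sep = "_")]] <- train(
      numFact ~ const + lin + quad + cube + nSign + pSign + sigReal,
      data = trs, method = m, metric = "Kappa",
      trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE)
    )
  }
}
stopCluster(cl)
```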
Cross-validation:
- Split the training set into mini training/test set pairs
- Build and check models on the mini-sets with different hyperparameter values
- Build the final model on the full training set using the hyperparameters with the "best" error metric
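In caret, this hyperparameter search over the mini-sets is expressed with `trainControl()` and a tuning grid. A rough sketch for the random-forest case (the `mtry` values are illustrative, not the grid actually used):

```r
library(caret)

# Candidate values of the random forest's single tuning parameter (mtry);
# caret evaluates each value on every CV fold and keeps the best by Kappa
rfGrid <- expand.grid(mtry = c(2, 3, 5, 7))

fit <- train(numFact ~ const + lin + quad + cube + nSign + pSign + sigReal,
             data = trs1, method = "rf", metric = "Kappa",
             tuneGrid = rfGrid,
             trControl = trainControl(method = "cv", number = 10))

fit$bestTune   # hyperparameter value with the best cross-validated Kappa
```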
Kappa:
A standard error metric for imbalanced classifiers
Compares observed accuracy with the accuracy expected from random chance
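Concretely, this is Cohen's kappa: \(\kappa = \frac{p_o - p_e}{1 - p_e}\), where \(p_o\) is the observed accuracy and \(p_e\) is the accuracy expected by chance from the marginal class frequencies; \(\kappa \approx 0\) means no better than chance and \(\kappa = 1\) means perfect agreement.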
One of the models:
```r
library(caret)
library(doParallel)

# Register a parallel backend so the cross-validation folds run on all cores
cl <- makeCluster(detectCores())
registerDoParallel(cl)

# Random forest on training set 1, tuned by 10-fold CV with Kappa as the metric
tr1.rfs <- train(numFact ~ const + lin + quad + cube + nSign + pSign + sigReal,
                 data = trs1, method = "rf", metric = "Kappa",
                 trControl = trainControl(method = "cv", number = 10,
                                          allowParallel = TRUE))

# Predicted probability of class 2 (reducible iterate) for each test-set polynomial
tst$rfs <- predict(tr1.rfs, tst, type = "prob")[, 2]
stopCluster(cl)
```
Predicted vs. Actual | Act. 1 | Act. 2 |
---|---|---|
Pred. 1 | TN | FN |
Pred. 2 | FP | TP |
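As a sketch of how each set of counts is produced (assuming, per the earlier slide, that `tst$rfs` holds the predicted probability of class 2 and `tst$numFact` the true class, coded 1 = irreducible iterate, 2 = reducible iterate):

```r
theta <- 0.5                            # probability threshold for predicting class 2
pred  <- ifelse(tst$rfs >= theta, 2, 1)

TN <- sum(pred == 1 & tst$numFact == 1)   # correctly left alone
FN <- sum(pred == 1 & tst$numFact == 2)   # reducible iterates we missed
FP <- sum(pred == 2 & tst$numFact == 1)   # false alarms
TP <- sum(pred == 2 & tst$numFact == 2)   # reducible iterates we found
```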
Sample of the data frame with 1134 confusion matrices (one row per model, NER training-set size, and probability threshold \(\theta\))!
mdl | TN | TP | FP | FN | ner | theta |
---|---|---|---|---|---|---|
lrs | 7862 | 1 | 138 | 22 | 600 | 0.50 |
lrp | 7382 | 13 | 618 | 10 | 700 | 0.20 |
lrsq | 7900 | 5 | 100 | 18 | 2100 | 0.30 |
knn | 7876 | 17 | 124 | 6 | 800 | 0.50 |
lrs | 7761 | 8 | 239 | 15 | 1600 | 0.15 |
rfpp | 7943 | 10 | 57 | 13 | 2000 | 0.40 |
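Once the counts live in one data frame, model comparison is ordinary data manipulation. A minimal sketch, assuming the data frame above is called `cmDF` (a hypothetical name) with the columns shown:

```r
library(dplyr)

cmSummary <- cmDF %>%
  mutate(recall    = TP / (TP + FN),      # share of ER polynomials that were found
         precision = TP / (TP + FP)) %>%  # share of flagged polynomials that are ER
  arrange(desc(recall), desc(precision))

head(cmSummary)
```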
- Model Critique:
  - Why are certain ER polynomials missed?
  - Why are others found by certain models?
  - Interpret the KNN and RF models in context
- Additional Models:
  - GAMs
  - SVM
  - xgboost
- Use the best models to improve the search for ER polynomials and make conjectures