I am a Data Science Fellow at Insight Data Science in Seattle. My project has been to develop RealAllocator, a web app for portfolio optimization and backtesting of an investment portfolio moving from only stocks and bonds to stocks, bonds, and direct real estate investment. This was done in consultation with RealCrowd.
Other recent data science projects have been:
An investigation into using various (interpretable) binary classifiers to better understand an imbalanced classification problem in my number theory research (emergent reducibility).
Working to identify and reduce curriculum inefficiencies at the College of Idaho.
Developing usage and staffing models for a large drop-in tutoring center at James Madison University.
For ten years prior to joining Insight, I was in academia teaching math, statistics, and computer science at various colleges and universities including:
One of the first things I learned studying abstract mathematics was that having non-examples is as important as having examples of an abstract structure. For instance, the definition of a group and some examples of groups are really only part of the picture - having examples of sets that satisfy all but one of the requirements to be a group allows you to see how the pieces fit together and lays a foundation for more solid intuition for further study.
I’ve done some analysis of my GPS running data before, but mostly just some mapping. I’ve always wanted to bring in some more sophisticated analysis such as identifying runs with similar geographic features (e.g. track workouts) or identifying, categorizing, and comparing hills. To really get into either of these things, I first needed good elevation data, which isn’t provided by my Forerunner 220. In this post I’ll show some of the problems with the elevation data coming from my Garmin 220, show how to get elevation data from the RaceMap API (and compare it with a few other elevation APIs), and then examine how good the new elevation data is.
While continuing to work through BDA3, I decided to revisit some of the earlier exercises that I had done in R. Problem 9 of chapter 1 asks us to simulate a medical clinic with 3 doctors, where patients arrive between 9AM and 4PM with exponentially distributed times between arrivals (mean 10 minutes) and each patient needs an appointment whose length is uniformly distributed between 5 and 10 minutes. We are interested in things like the number of patients seen, the average wait time, the number of patients who had to wait, and when the clinic closes, based on 1 simulated day and 100 simulated days (with intervals for each summary).
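A minimal sketch of such a simulation, using only the Python standard library. It is an illustration under stated assumptions, not the code from the post: the 10 minutes is read as the mean gap between arrivals, patients are seen in arrival order by whichever doctor frees up first, and the clinic closes at 4PM or when the last appointment ends, whichever is later. The function name `simulate_clinic` and its parameter names are made up for this example.

```python
import heapq
import random
import statistics


def simulate_clinic(rng, n_doctors=3, open_minutes=420.0, mean_gap=10.0,
                    appt_min=5.0, appt_max=10.0):
    """Simulate one clinic day; all times are minutes after the 9AM opening."""
    # Arrival times: cumulative sums of exponential gaps, cut off at 4PM.
    arrivals = []
    t = rng.expovariate(1.0 / mean_gap)
    while t < open_minutes:
        arrivals.append(t)
        t += rng.expovariate(1.0 / mean_gap)

    free_at = [0.0] * n_doctors  # min-heap of the times each doctor is next free
    waits = []
    close_time = open_minutes
    for arrival in arrivals:
        # The next patient in line is seen by the doctor who frees up first.
        start = max(arrival, heapq.heappop(free_at))
        finish = start + rng.uniform(appt_min, appt_max)
        heapq.heappush(free_at, finish)
        waits.append(start - arrival)
        close_time = max(close_time, finish)

    return {
        "n_patients": len(arrivals),
        "n_waited": sum(w > 0 for w in waits),
        "mean_wait": statistics.fmean(waits) if waits else 0.0,
        "close_time": close_time,
    }


rng = random.Random(2019)  # arbitrary seed, for reproducibility only
days = [simulate_clinic(rng) for _ in range(100)]
for key in days[0]:
    q1, median, q3 = statistics.quantiles([d[key] for d in days], n=4)
    print(f"{key}: median {median:.1f}, 50% interval ({q1:.1f}, {q3:.1f})")
```

Here `mean_wait` averages over all patients, including those who did not wait; averaging only over patients who waited is an equally reasonable reading of the exercise.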
Every time there’s news about a mass shooting I feel like doing some type of data analysis about gun violence. With the shootings in Dayton and El Paso, as well as news of several likely shootings being prevented, I thought I would actually follow through with some analysis. Having been a senior in high school (in California) when the Columbine shooting took place, and also living in Salt Lake during the Trolley Square shooting, I’ve seen the impacts of these tragedies and feel as though they are happening more frequently.
During my Applied Databases course in Spring 2019, I gave my students a choice of which language to use to interact with SQL and relational databases. They had already learned core SQL and I only gave them 3 options: C++, R, and Python. The choices of R and Python are natural given my data science interests and experience. Last year I just showed them R and got some complaints on evaluations (some people don’t think R is a “real language”).
I was putting some data together about previous catalogs for student projects in my Applied Databases course and realized that I was missing something. I had course info (subject, number, title, and URL) for the last 4 catalog years at the College of Idaho, but I didn’t have course descriptions! What a great chance to do some simple web scraping in Python.
Data Import and Cleaning

Since I have a CSV file for each catalog year with a link to each course, I just needed to read the URLs, extract the description from each page, and save the results.
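The read, extract, and save loop might look roughly like the sketch below. It uses only the standard library, which is a choice made for this example rather than necessarily what the post uses. The column name `url`, the CSS class `course-description`, and the names `DescriptionParser`, `extract_description`, and `add_descriptions` are all hypothetical; the real catalog pages would need to be inspected to find where the description actually lives.

```python
import csv
from html.parser import HTMLParser
from urllib.request import urlopen

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DescriptionParser(HTMLParser):
    """Collect the text inside the first element carrying target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0     # > 0 while inside the target element
        self.done = False  # True once the target element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS or self.done:
            return
        if self.depth > 0:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_description(html, target_class="course-description"):
    """Return the description text with runs of whitespace collapsed."""
    parser = DescriptionParser(target_class)
    parser.feed(html)
    return " ".join("".join(parser.chunks).split())


def add_descriptions(in_csv, out_csv, url_column="url"):
    """Copy in_csv to out_csv, adding a 'description' column per course."""
    with open(in_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames + ["description"])
        writer.writeheader()
        for row in reader:
            with urlopen(row[url_column]) as response:
                html = response.read().decode("utf-8", errors="replace")
            row["description"] = extract_description(html)
            writer.writerow(row)
```

Keeping `extract_description` separate from the download loop means the parsing can be checked against a saved page before any requests are sent.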
I’ve been focusing on Python recently to become a bilingual data scientist. Probably my least favorite thing about Python is its plotting libraries - there are too many options built on top of matplotlib, which pre-dates pandas dataframes. This makes for some clunky code and blurry boundaries (both “is that a seaborn, pandas, or matplotlib function?” and situations with 3 solutions that are equally messy but in very different ways). In my opinion, ggplot2’s deep interplay with dataframes makes a lot more sense, and ggplot’s layers make it easy to change plot type (just switch the geom_), add facets, and tweak aesthetics.
As part of some clustering work and learning about hidden Markov models, I’ve been doing some reading about the EM algorithm and its applications. It’s a pretty neat algorithm (I love iterative algorithms like Newton’s method and the Euclidean algorithm) so I thought I’d illustrate how it works.
I’ve also been doing a bit more Python recently, so I thought I would do all this in Python rather than R.
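To give a flavor of the iteration, here is a minimal sketch of EM for a two-component, one-dimensional Gaussian mixture in plain Python. The choice of model, the starting values, and the names `normal_pdf` and `em_two_gaussians` are assumptions made for this example and are not necessarily what the post itself uses.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture with a fixed number of EM steps."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    # Crude starting values: means at the extremes, shared spread, equal weights.
    mu = [min(data), max(data)]
    sigma = [overall_sd, overall_sd]
    weight = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: probability that each point came from each component.
        resp = []
        for x in data:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append((p[0] / total, p[1] / total))

        # M-step: responsibility-weighted updates of the parameters.
        for k in range(2):
            n_k = sum(r[k] for r in resp)
            weight[k] = n_k / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / n_k
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / n_k
            sigma[k] = max(math.sqrt(var), 1e-6)  # guard against collapse

    return weight, mu, sigma


rng = random.Random(42)  # arbitrary seed, for reproducibility only
sample = [rng.gauss(0.0, 1.0) for _ in range(300)]
sample += [rng.gauss(6.0, 1.0) for _ in range(300)]
weights, means, sds = em_two_gaussians(sample)
print("weights:", weights)
print("means:", means)
print("standard deviations:", sds)
```

A fixed iteration count keeps the sketch short; a fuller version would track the log-likelihood and stop once it no longer improves.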
In my data visualization class I had the students get a book from Project Gutenberg using the gutenbergr package and build a word cloud using tidytext and wordcloud. It’s much easier than the “old” corpus/text mapping approach, and when the students were sharing their clouds they started showing the cloud and having other students try to guess the book. This made me think of using a Shiny runtime to make a little word cloud guessing game.
Update 7/23/2019: Various package updates have created problems with showing more than one JavaScript plot on a post. I’ve added calls to htmlwidgets::onRender to get at least one plot displayed. I may revisit this, but the interaction between hugo, blogdown, and various JavaScript libraries (chorddiag, networkD3, D3, data tables, etc.) is more than I’m able to dive into at the moment.
This post is about a type of visualization that will hopefully help us see how students “flow” through college.
A polynomial $f(x)$ has emergent reducibility at depth $n$ if $f^{\circ k}(x)$ is irreducible for $0 \leq k \leq n - 1$ but $f^{\circ n}(x)$ is reducible. In this paper we prove that there are infinitely many irreducible cubics $f \in \mathbb{Z}[x]$ with $f \circ f$ reducible by exhibiting a one-parameter family with this property.