Recent Posts

One of the first things I learned studying abstract mathematics was that having non-examples were as important as examples of abstract structure. For instance, the definition of a group and some examples of groups it really only part of the picture - having examples of sets that satisfy all but one of the requirements to be a group allows you to see how the pieces fit together and lays a foundation for more solid intuition for further study.

CONTINUE READING

I’ve done some analysis of my gps running data before, but mostly just some mapping. I’ve always wanted to bring in some more sophisticated analysis such as identifying runs with similar geographic features (e.g. track workouts) or identifying, categorizing, and comparing hills. To really get into either of these things, I first needed good elevation data which isn’t provided by my forerunner 220. In this post I’ll show some of the problems with the elevation data coming from my garmin 220, how to get elevation data from the RaceMap API (and compare a few other elevation api’s), and then examine how good the new elevation data is.

CONTINUE READING

While continuing to work through BDA3, and decided to revisit some of the earlier exercises that I had done in R. Problem 9 of chapter 1 asks to simulate a medical clinic with 3 doctors, patients arriving according to an exponential distribution with rate 10 minutes between 9AM and 4PM and each patient needing an appointment length uniformly distributed between 5 and 10 minutes. We are interested in things like the number of patients seen, average wait time, number of patients who had to wait, and when the clinic closes based on 1 simulated day and 100 simulated days (with intervals of each aggregation).

CONTINUE READING

Every time there’s news about a mass shooting I feel like doing some type of data analysis about gun violence. With the shootings in Dayton and El Paso, as well as news of several likely shootings being prevented, I thought I would actually follow through with some analysis. Having been a senior in high school (in California) when the Columbine shooting took place, and also living in Salt Lake during the Trolley Square shooting I’ve seen the impacts of these tragedies and feel as though they are happening more frequently.

CONTINUE READING

During my Applied Databases course in Spring 2019, I gave my students a choice of which language to use to interact with SQL and relational databases. They had already learned core SQL and I only gave them 3 options: C++, R, and Python. The choices of R and python are natural given my data science interests and experience. Last year I just showed them R and got some complaints on evaluations (some people don’t think R is a “real language”).

CONTINUE READING

I was putting some data together about previous catalogs for students for projects in my Applied Databases course and realized that I was missing something. I had course info (subject, number, title and url) for the last 4 catalog years at the College of Idaho, but I didn’t have course descriptions! What a great chance to do some simple web scraping in python. Data Import and Cleaning Since I have a csv file for each catalog year with a link to each course, I just needed to read the urls, extract the description from the page, and save the results.

CONTINUE READING

I’ve been focusing on python recently to become a bi-lingual data scientist. Probably my least favorite thing about python is its plotting libraries - there are too many options built on top of matplotlib which pre-dates pandas dataframes. This makes for some clunky code and blurry boundaries (both “is that a seaborn, pandas, or matplotlib function?” and situations with 3 equally messy solutions but in very different ways). In my opinion, ggplot2’s deep interplay with dataframes makes a lot more sense and ggplot’s layers make it easy to change plot type (just switch the geom_), add facets, and tweak aesthetics.

CONTINUE READING

As part of some clustering work and learning about hidden Markov models, I’ve been doing some reading about the EM algorithm and it’s applications. It’s a pretty neat algorithm (I love iterative algorithms like Newton’s method and the Euclidean algorithm) so I thought I’d illustrate how it works. I’ve also been doing a bit more python recently, so I thought I would do all this in python rather than R.

CONTINUE READING

In my data visualization class I had the students get a book from Project Gutenberg using the gutenbergr package and build a word cloud using tidytext and wordcloud. It’s much easier that the “old” corpus/text mapping approach, and when the students were sharing their clouds they started showing the cloud and having students try to guess the book. This made me think of using a Shiny runtime to make a little word cloud guessing game.

CONTINUE READING

Update 7/23/2019 Various package updates have created problems with showing more than one javascript plot on a post. I’ve added calls to htlwidgets::onRender to get at least one plot displayed. I may revisit this, but the interaction between hugo, blogdown, and various javascript libraries (chorddiag, networkD3, D3, data tables, etc) is more than I’m able to dive into at the moment. This post is about a type of visualization the will hopefully help see how students “flow” through college.

CONTINUE READING