R

Tidy Clouds

In my data visualization class I had the students get a book from Project Gutenberg using the gutenbergr package and build a word cloud using tidytext and wordcloud. It’s much easier than the “old” corpus/text mapping approach. When the students were sharing their clouds, they started showing the cloud and having the other students try to guess the book. This made me think of using a Shiny runtime to make a little word cloud guessing game.

Sankey Diagram

Update 7/23/2019: Various package updates have created problems with showing more than one JavaScript plot on a post. I’ve added calls to htmlwidgets::onRender to get at least one plot displayed. I may revisit this, but the interaction between hugo, blogdown, and various JavaScript libraries (chorddiag, networkD3, D3, data tables, etc.) is more than I’m able to dive into at the moment. This post is about a type of visualization that will hopefully help show how students “flow” through college.

PCA Overview

This post is primarily meant to give a basic overview of principal components analysis (PCA) for dimensionality reduction and regression. I wanted to create it as a guide for my regression students who may find it useful for their projects. First, let’s note the two main times that you may want to use PCA - dimensionality reduction (reducing the number of variables in a dataset) and removing collinearity issues. These are not mutually exclusive problems; often you want to do both.
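As a rough illustration of those two uses, here is a minimal sketch in base R. It is not the code from the post: the built-in mtcars data, the choice of predictors, and the 90% variance cutoff are all assumptions made for the example.

```r
# Assumed example data: built-in mtcars, with mpg as the response and
# five correlated predictors (not the dataset used in the post).
X <- mtcars[, c("disp", "hp", "wt", "drat", "qsec")]

# PCA on centered and scaled predictors
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Dimensionality reduction: keep the fewest components that reach
# 90% cumulative variance (the cutoff is an arbitrary choice here)
k <- which(cumsum(var_explained) >= 0.90)[1]

# Collinearity: the component scores are uncorrelated by construction,
# so regressing on them avoids collinear predictors
scores <- as.data.frame(pca$x[, 1:k, drop = FALSE])
scores$mpg <- mtcars$mpg
fit <- lm(mpg ~ ., data = scores)

# Off-diagonal correlations between the scores are (numerically) zero
max_offdiag <- max(abs(cor(pca$x)[upper.tri(cor(pca$x))]))
```

Calling `summary(pca)` and `summary(fit)` shows the variance breakdown and the principal components regression, respectively.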

Reticulated Mixture Models

Clearly, there’s no such thing as a “reticulated mixture model”, but if you create one I’ll gladly take credit for the name. Instead, this post is a demonstration of using mixture models for clustering and the interplay of R and Python via RStudio’s reticulate package. Mixture Model Basics: The idea behind mixture models is that you have data containing information from two (or more) subgroups and you want to uncover the structure of the subgroups.

Ridges of Normality

One of the classic assumptions of the linear regression model is that, conditioned on the explanatory variables, the response variable should be normally distributed. While teaching this the other day, I had a flash of insight into how to visualize this - ridge-line plots! Data: I’ve been using Matloff’s Statistical Regression and Classification book, which uses the mlb dataset from his freqparcoord package. This has data on heights, weights, ages, positions, and teams of over 1000 major league baseball players.

What a Tangled Web We Weave...

Update 7/23/2019: Various package updates have created problems with showing more than one JavaScript plot on a post. I’ve added calls to htmlwidgets::onRender to get at least one plot displayed. I may revisit this, but the interaction between hugo, blogdown, and various JavaScript libraries (chorddiag, networkD3, D3, data tables, etc.) is more than I’m able to dive into at the moment. cd <- chorddiag( xtabs(~MAJOR+minor, data = mmhl[mmhl$Grad.

Lesser Known Verbs: top_n

I’ve been using R since 2006. That predates RStudio and the tidyverse. I remember the struggle of keeping track of the variants of apply and often fiddling with them to get code to work. Then came plyr and then dplyr, and my life has never been the same. The major verbs of dplyr include select, filter, mutate, group_by, summarise, and arrange; if you are doing data analysis in R, then you should be fluent in them.
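For readers who want a quick reminder of how these verbs chain together, here is a small sketch, assuming the dplyr package is installed and using the built-in mtcars data (an illustrative choice, not the data from the post). Note that top_n keeps ties, and that in dplyr 1.0.0 and later it is superseded by slice_max.

```r
library(dplyr)

# The major verbs in one pipeline, followed by the lesser known top_n
top_cars <- mtcars %>%
  select(mpg, cyl, wt) %>%        # keep three columns
  filter(wt < 4) %>%              # drop the heaviest cars
  mutate(kpl = mpg * 0.425) %>%   # approximate km per litre
  group_by(cyl) %>%               # one group per cylinder count
  top_n(2, mpg) %>%               # top 2 rows by mpg per group (ties kept)
  arrange(cyl, desc(mpg))

# summarise collapses each group to a single row
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n = n())
```

Because ties are kept, `top_cars` can hold more than two rows for a given value of `cyl`.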

Re-Counting Classes

Edit 7/27/2018: I realized that MFL’s name change to WLC didn’t change the prefix of their courses; this broke my scraper. Below is an updated post that deals with this. Back in early May, I wrote a post about scraping the College of Idaho catalog: Counting Classes. Below is the same post (boring…) except that the “current catalog” has been updated. This is really a demonstration of reproducibility: the upstream data has changed, and ideally all my code still works.

DT: When Tables are the Product

In RMarkdown documents I often have a need to display tables, which I usually try to keep small with only the most useful information displayed. However, a recent project made me look for a better way to share tabular data with non-data-scientists. The answer was R’s DT package, which allows for very powerful displays of tabular data. Today’s data will be a summary of enrollment data from the College of Idaho:

Maps Majors in Neo4J

UPDATE (6/20/2018) The Cypher query for Table 3 only used components with “optional” courses, so the capstone and topics components of the Math/CS major weren’t included in Table 3. UPDATE (6/19/2018) The original version of this post used incorrectly loaded data that caused the “Core” of every major to have the same classes attached to it. This was noticed by my colleague Dave Rosoff and has been corrected.