Update 7/23/2019 Various package updates have created problems with showing more than one javascript plot on a post. I’ve added calls to htlwidgets::onRender
to get at least one plot displayed. I may revisit this, but the interaction between hugo, blogdown, and various javascript libraries (chorddiag, networkD3, D3, data tables, etc) is more than I’m able to dive into at the moment.
This post is about a type of visualization the will hopefully help see how students “flow” through college. The data is an anonymized selection of Math and Math-Computer Science majors at the College of Idaho, and for simplicity we’ll only be using the math and computer science courses. Out goal is to produce the following Sankey Diagram
, which is really just a graph (in the discrete math sense - nodes connected by edges) where the edges are scaled by weight, in this case the number of students taking course A then B will be reflected in the width of the link between A and B. The sankeyNetwork
command also adjust the node layout to minimize edge crossing and have a general “left to right” aspect.
sn <- sankeyNetwork(Links = edgeDFTemp, Nodes = nodeDFTemp, Source = "source",
Target = "target", Value = "value", NodeID = "id",
nodeWidth = 20, fontSize = 8, units = "students")
htmlwidgets::onRender(sn, 'document.getElementsByTagName("svg")[0].setAttribute("viewBox", "")')
The Data
Following general practice, let’s first look at the data that we’ll be using. I’ve said this is some anonymized student data consisting of an id
(hashed, not actual student id’s), a course prefix and number crs
, and a standardized year value std.year
. This std.year
variable indicates when a student is taking the course during their “college career” with 0.0
meaning first fall semester, 1.4
meaning spring of their second year. The first digit is basically the number of years of college completed (\(0,1,2,\) or \(3\)) and the second codes the semester type (\(0\) is fall, \(2\) is winter, \(4\) is spring and \(6\) is summer). These are numeric so I could do arithmetic and standardize things. The point is to help us see students taking a course like CSC-150
(Intro to CS) in their freshman versus junior years, how they got there, and what they do next.
Now that you understand the variables, here’s a glimpse of the data (and the packages I need for the post):
library(dplyr)
library(ggplot2)
library(stringr)
library(tidyr)
library(DT)
library(networkD3)
students <- read.csv("../../static/files/math-major-anon.csv", header=TRUE)
students %>% head() %>% knitr::kable()
id | crs | std.year |
---|---|---|
8b551f6ba68aef2c9da1bf682e44716c | CSC-150 | 0.4 |
8b551f6ba68aef2c9da1bf682e44716c | MAT-252 | 0.4 |
8b551f6ba68aef2c9da1bf682e44716c | MAT-175 | 0.0 |
5069338819f368fb8960772475069d02 | CSC-150 | 0.4 |
5069338819f368fb8960772475069d02 | MAT-175 | 0.4 |
5069338819f368fb8960772475069d02 | MAT-150 | 0.0 |
Let’s also explore the data a little. Here’s the distribution of std.year
’s:
ggplot(students, aes(x=as.factor(std.year)))+geom_bar()
Clearly, most students are early in college and few students take summer math or CS courses (we don’t offer them, except internships and occasionally intro. stats).
Next, we’ll look at the courses by popularity (within our dataset):
library(widgetframe)
students %>% group_by(crs) %>%
summarise(Count = n_distinct(std.year),
Num.Students = n_distinct(id)) %>%
arrange(desc(Num.Students), desc(Count)) %>%
datatable() %>%
frameWidget(height = 550, width = "100%")
Since these are math and math/cs majors, it’s not surprising that single variable calculus and intro cs are the two most common courses (they’re required of all students in this dataset). What may be more interesting is that students have completed single variable calculus (MAT-175) at 4 different points in their trajectory towards a math or math/cs major. This means that we will expect 4 nodes for MAT-175 in our sankey diagram (this may get messy).
There is one thing we can do to make things a little cleaner. The MAT-28x intro to proofs courses should be combined since students usually only take MAT-280, MAT-281, MAT-282, etc. depending on what’s offered the year they need to take it.
students$crs <- str_replace(students$crs, "MAT-28[:digit:]{1}", "MAT-28X")
Building Nodes and Edges
The sankeyNetwork
command, like most networkD3
commands, needs to be given a graph (nodes and edges). Unfortunately, it’s not smart enough to build it from our students
data frame, we need to build it.
The nodes are easier, so we’ll start there. We want an id
for each node that combines the crs
and std.year
variables, so I’ll keep those and make a new id
column (note this will lose all student info).
nodes <- group_by(students, crs, std.year) %>%
summarise(cnt = n()) %>% ungroup() %>%
select(name=crs, std.year) %>%
unite(id, name, std.year, sep="_", remove=FALSE)
nodes %>% head() %>% knitr::kable()
id | name | std.year |
---|---|---|
CSC-150_0 | CSC-150 | 0.0 |
CSC-150_0.4 | CSC-150 | 0.4 |
CSC-150_1 | CSC-150 | 1.0 |
CSC-150_1.4 | CSC-150 | 1.4 |
CSC-150_2.4 | CSC-150 | 2.4 |
CSC-150_3 | CSC-150 | 3.0 |
The group_by
, summarise
, and ungroup
sequence is just a way to collapse down to each distinct crs
, std.year
pair that occurs in the data. We now have 89 nodes that will appear in the diagram.
Now for the edges. We’re going to create a data frame of “source” nodes and another of “target” nodes (remember we’re building a directed graph). A full join of the two will give all possible edges, so we’ll then remove those that aren’t needed. Most of the extra edges will be those connecting what a student did during fall of freshmen year to all subsequent courses, not just the immediate “next” course that we want plotted. These seem far easier to remove then to selectively build a much more careful join (only joining pairs with “adjacent” standard years).
Here’s the code to build the first pass of an edge list data frame:
#build source data
s.crs <- select(students, s.name=crs, s.std.year=std.year, id)
s.crs <- s.crs[!duplicated(s.crs),]
#build target data
t.crs <- select(students, t.name=crs, t.std.year=std.year, id)
t.crs <- t.crs[!duplicated(t.crs),]
#join and remove self-loops and backward edges
edgePerStudent <- full_join(s.crs, t.crs) %>%
filter(s.name!=t.name, s.std.year<t.std.year) %>%
arrange(id, s.std.year, t.std.year)
edgePerStudent %>% head() %>% knitr::kable()
s.name | s.std.year | id | t.name | t.std.year |
---|---|---|---|---|
CSC-150 | 0.4 | 09d4fb320db773a123eb97e7260caba1 | MAT-275 | 1.0 |
MAT-175 | 0.4 | 09d4fb320db773a123eb97e7260caba1 | MAT-275 | 1.0 |
CSC-150 | 0.4 | 09d4fb320db773a123eb97e7260caba1 | MAT-28X | 1.2 |
MAT-175 | 0.4 | 09d4fb320db773a123eb97e7260caba1 | MAT-28X | 1.2 |
CSC-150 | 0.4 | 09d4fb320db773a123eb97e7260caba1 | MAT-352 | 1.4 |
MAT-175 | 0.4 | 09d4fb320db773a123eb97e7260caba1 | MAT-352 | 1.4 |
As the name implies, we now have a data frame where each row is an edge from a distinct student. It’s important to note that I’ve arranged this in order for what comes next: id, then source standard year, and then target standard year. This data frame has 1112 rows, which is too many. We have extra edges formed by paths of what we want to keep and we need to collapse to have an edge with a weight equal to number of students taking that sequence of courses. The second part is easy (we could do a group_by
and summarise
now, but we would over count). The first part is possibly the ugliest R code I’ve written in a while, but it works and without messy lags and functional trickery is fairly straightforward. I initially tried to use a “run-length encoding”, but the resulting objects became to cumbersome (an rle
object isn’t a data frame or a list or anything else “nice”), so here’s a for
loop in R:
#build edges
edgeDF <- edgePerStudent[1,] #got to start somewhere
for(i in 2:nrow(edgePerStudent)){
if(edgeDF[nrow(edgeDF),]$id != edgePerStudent[i,]$id){
#different id's count
edgeDF <- rbind.data.frame(edgeDF, edgePerStudent[i,])
}
else if(edgeDF[nrow(edgeDF),]$s.std.year == edgePerStudent[i,]$s.std.year & edgeDF[nrow(edgeDF),]$t.std.year == edgePerStudent[i,]$t.std.year){
#same source and target as counting counts
edgeDF <- rbind.data.frame(edgeDF, edgePerStudent[i,])
}
else if(edgeDF[nrow(edgeDF),]$t.std.year == edgePerStudent[i,]$s.std.year){
#if last counted target is current source, it counts
edgeDF <- rbind.data.frame(edgeDF, edgePerStudent[i,])
}
}
#now count edge weights
edgeDF <- group_by(edgeDF, s.name, s.std.year, t.name,t.std.year) %>%
summarise(value=n_distinct(id))
edgeDF <- unite(edgeDF, s.id, s.name,s.std.year, sep="_", remove=FALSE) %>%
unite(t.id, t.name,t.std.year, sep="_", remove = FALSE)
edgeDF %>% head() %>% knitr::kable()
s.id | s.name | s.std.year | t.id | t.name | t.std.year | value |
---|---|---|---|---|---|---|
CSC-150_0 | CSC-150 | 0.0 | CSC-160_0.4 | CSC-160 | 0.4 | 1 |
CSC-150_0 | CSC-150 | 0.0 | MAT-175_0.4 | MAT-175 | 0.4 | 1 |
CSC-150_0 | CSC-150 | 0.0 | MAT-212_0.2 | MAT-212 | 0.2 | 1 |
CSC-150_0 | CSC-150 | 0.0 | MAT-28X_0.2 | MAT-28X | 0.2 | 3 |
CSC-150_0.4 | CSC-150 | 0.4 | CSC-152_1 | CSC-152 | 1.0 | 8 |
CSC-150_0.4 | CSC-150 | 0.4 | MAT-175_1 | MAT-175 | 1.0 | 3 |
Now we have 198 edges, with the desired weights for the sankey diagram.
Making the Diagram
Now we just call sankeyNetwork
and we’re done right? No, because we have data in R that we have to send to JavaScript to do the plotting so there’s a little book-keeping left to do. First, we need numeric node id’s and we need those id’s in the edge data.
#replace character s.id and t.id with numbers
nodes$node.id <- 1:nrow(nodes)
edgeDF <- inner_join(edgeDF, nodes, by=c("s.id"="id"))
edgeDF <- select(edgeDF, -s.id, source=node.id)
edgeDF <- inner_join(edgeDF, nodes, by=c("t.id"="id"))
edgeDF <- select(edgeDF, -t.id, target=node.id)
Next is the big conflict between R and almost every other programming language: indexing. R starts counting at 1, but JavaScript starts at 0 (as does Python, C/C++, Java, …) so we’ll have to re-index our node id’s. I’ll also replace everything after the first digit of std.year
info in the node id
variable, with a string indicating the semester.
#switch to zero indexing for javascript
edgeDFTemp <- mutate(edgeDF, source=source-1, target = target-1)
nodeDFTemp <- mutate(nodes, nid = node.id-1, id =
ifelse(str_detect(id, "\\.6"),
str_replace(id, "\\.6","SU"),
ifelse(str_detect(id, "\\.4"),
str_replace(id, "\\.4", "SP"),
ifelse(str_detect(id, "\\.2"),
str_replace(id, "\\.2", "W"),
str_c(id, "FA")))))
sn2 <- sankeyNetwork(Links = edgeDFTemp, Nodes = nodeDFTemp, Source = "source",
Target = "target", Value = "value", NodeID = "id",
nodeWidth = 20, fontSize = 8, units = "students")
htmlwidgets::onRender(sn2, 'document.getElementsByTagName("svg")[2].setAttribute("viewBox", "")')