Sankey Diagram

Update 7/23/2019 Various package updates have created problems with showing more than one javascript plot on a post. I’ve added calls to htlwidgets::onRender to get at least one plot displayed. I may revisit this, but the interaction between hugo, blogdown, and various javascript libraries (chorddiag, networkD3, D3, data tables, etc) is more than I’m able to dive into at the moment.

This post is about a type of visualization the will hopefully help see how students “flow” through college. The data is an anonymized selection of Math and Math-Computer Science majors at the College of Idaho, and for simplicity we’ll only be using the math and computer science courses. Out goal is to produce the following Sankey Diagram, which is really just a graph (in the discrete math sense - nodes connected by edges) where the edges are scaled by weight, in this case the number of students taking course A then B will be reflected in the width of the link between A and B. The sankeyNetwork command also adjust the node layout to minimize edge crossing and have a general “left to right” aspect.

sn <- sankeyNetwork(Links = edgeDFTemp, Nodes = nodeDFTemp, Source = "source", 
              Target = "target", Value = "value", NodeID = "id", 
              nodeWidth = 20, fontSize = 8, units = "students")

htmlwidgets::onRender(sn, 'document.getElementsByTagName("svg")[0].setAttribute("viewBox", "")')
MAT-275_1FA → MAT-28X_1W
14 students
MAT-150_0FA → MAT-175_0SP
13 students
CSC-150_0SP → MAT-275_1FA
10 students
MAT-150_0FA → CSC-150_0SP
9 students
CSC-150_0SP → CSC-152_1FA
8 students
MAT-175_0FA → CSC-150_0SP
8 students
MAT-28X_1W → MAT-352_1SP
8 students
CSC-152_1FA → MAT-28X_1W
6 students
MAT-175_0SP → MAT-275_1FA
6 students
MAT-175_0FA → MAT-212_0W
5 students
MAT-175_0FA → MAT-28X_0W
5 students
MAT-28X_1W → CSC-270_1SP
5 students
MAT-175_0SP → CSC-152_1FA
4 students
MAT-28X_0W → MAT-252_0SP
4 students
MAT-28X_1W → CSC-340_1SP
4 students
MAT-28X_1W → MAT-252_1SP
4 students
MAT-352_1SP → MAT-311_2FA
4 students
CSC-150_0FA → MAT-28X_0W
3 students
CSC-150_0SP → MAT-175_1FA
3 students
MAT-212_0W → CSC-150_0SP
3 students
MAT-252_0SP → MAT-275_1FA
3 students
MAT-28X_0W → CSC-150_0SP
3 students
MAT-28X_1W → CSC-150_1SP
3 students
MAT-28X_1W → MAT-361_1SP
3 students
CSC-150_1SP → MAT-361_2SP
2 students
CSC-152_1FA → CSC-285_1W
2 students
CSC-160_0SP → CSC-152_1FA
2 students
CSC-160_0SP → MAT-275_1FA
2 students
CSC-285_1W → CSC-340_1SP
2 students
CSC-285_1W → MAT-252_1SP
2 students
MAT-101_0FA → MAT-102_0W
2 students
MAT-102_0W → MAT-150_0SP
2 students
MAT-130_0SP → MAT-175_1FA
2 students
MAT-150_0FA → MAT-130_0SP
2 students
MAT-150_0FA → MAT-212_0W
2 students
MAT-175_0FA → MAT-199_0SP
2 students
MAT-175_0FA → MAT-212_0SP
2 students
MAT-175_0FA → MAT-252_0SP
2 students
MAT-212_0W → MAT-175_0SP
2 students
MAT-275_1FA → MAT-352_1SP
2 students
MAT-275_2FA → MAT-352_2SP
2 students
MAT-352_2SP → MAT-311_3FA
2 students
MAT-361_1SP → MAT-275_2FA
2 students
MAT-361_1SP → MAT-311_2FA
2 students
MAT-361_2SP → MAT-311_3FA
2 students
MAT-370_1W → MAT-361_1SP
2 students
CSC-150_0FA → CSC-160_0SP
1 students
CSC-150_0FA → MAT-175_0SP
1 students
CSC-150_0FA → MAT-212_0W
1 students
CSC-150_0SP → MAT-199_1FA
1 students
CSC-150_1FA → CSC-270_1SP
1 students
CSC-150_1FA → MAT-175_1SP
1 students
CSC-150_1FA → MAT-252_1SP
1 students
CSC-150_1FA → MAT-28X_1W
1 students
CSC-150_1FA → MAT-370_1W
1 students
CSC-150_1SP → MAT-275_2FA
1 students
CSC-150_1SP → MAT-352_2SP
1 students
CSC-150_2SP → MAT-199_3FA
1 students
CSC-150_2SP → MAT-311_3FA
1 students
CSC-150_3FA → MAT-431_3SP
1 students
CSC-150_3FA → MAT-494_3W
1 students
CSC-152_0FA → CSC-270_0SP
1 students
CSC-152_0FA → MAT-150_0SP
1 students
CSC-152_0FA → MAT-212_0SP
1 students
CSC-152_0FA → MAT-252_0SP
1 students
CSC-152_1FA → MAT-352_1SP
1 students
CSC-152_1FA → MAT-361_1SP
1 students
CSC-152_1FA → MAT-370_1W
1 students
CSC-152_2FA → CSC-285_2W
1 students
CSC-160_0SP → MAT-311_1FA
1 students
CSC-160_1SP → CSC-152_2FA
1 students
CSC-160_1SP → CSC-497_1SU
1 students
CSC-160_1SP → MAT-311_2FA
1 students
CSC-160_2SP → CSC-494_3FA
1 students
CSC-235_0SP → CSC-152_1FA
1 students
CSC-235_0SP → MAT-311_1FA
1 students
CSC-235_2SP → CSC-494_3FA
1 students
CSC-270_1SP → CSC-350_2FA
1 students
CSC-270_1SP → CSC-497_1SU
1 students
CSC-285_1W → CSC-270_1SP
1 students
CSC-285_2W → CSC-270_2SP
1 students
CSC-285_2W → MAT-252_2SP
1 students
CSC-285_2W → MAT-431_2SP
1 students
CSC-340_1SP → CSC-350_2FA
1 students
CSC-350_2FA → CSC-160_2SP
1 students
CSC-350_2FA → CSC-235_2SP
1 students
CSC-350_2FA → MAT-361_2SP
1 students
CSC-494_3FA → MAT-372_3W
1 students
CSC-497_1SU → CSC-494_2FA
1 students
MAT-102_0FA → CSC-150_0SP
1 students
MAT-102_0FA → MAT-150_0SP
1 students
MAT-102_0SP → MAT-150_1FA
1 students
MAT-102_1FA → MAT-175_1SP
1 students
MAT-125_0FA → MAT-111_0W
1 students
MAT-125_0FA → MAT-150_0SP
1 students
MAT-130_0FA → MAT-150_0SP
1 students
MAT-130_0W → CSC-150_0SP
1 students
MAT-130_0W → MAT-175_0SP
1 students
MAT-130_0SP → CSC-150_1FA
1 students
MAT-130_0SP → CSC-152_1FA
1 students
MAT-130_1FA → MAT-175_1SP
1 students
MAT-130_1W → MAT-361_1SP
1 students
MAT-150_0FA → MAT-130_0W
1 students
MAT-150_0FA → MAT-28X_0W
1 students
MAT-150_0FA → MAT-372_0W
1 students
MAT-150_0SP → CSC-150_1FA
1 students
MAT-150_0SP → MAT-102_1FA
1 students
MAT-150_0SP → MAT-130_1FA
1 students
MAT-150_1FA → CSC-150_1SP
1 students
MAT-150_1FA → MAT-175_1SP
1 students
MAT-175_0FA → CSC-160_0SP
1 students
MAT-175_0FA → CSC-270_0SP
1 students
MAT-175_0FA → MAT-199T_0SP
1 students
MAT-175_0SP → MAT-199_1FA
1 students
MAT-175_0SP → MAT-28X_1W
1 students
MAT-175_1FA → CSC-270_1SP
1 students
MAT-175_1FA → MAT-252_1SP
1 students
MAT-175_1FA → MAT-370_1W
1 students
MAT-175_1SP → MAT-275_2FA
1 students
MAT-199_0FA → CSC-150_0SP
1 students
MAT-199_0FA → MAT-150_0SP
1 students
MAT-199_0FA → MAT-175_0SP
1 students
MAT-199_1FA → MAT-130_1W
1 students
MAT-199_1FA → MAT-28X_1W
1 students
MAT-199_2FA → MAT-352_2SP
1 students
MAT-199_3FA → MAT-494_3W
1 students
MAT-199T_0SP → MAT-275_1FA
1 students
MAT-212_0W → CSC-160_0SP
1 students
MAT-212_0W → MAT-252_0SP
1 students
MAT-212_0W → MAT-361_0SP
1 students
MAT-212_0SP → CSC-150_1FA
1 students
MAT-212_0SP → MAT-275_1FA
1 students
MAT-212_1W → MAT-361_1SP
1 students
MAT-212_1SP → MAT-361_2SP
1 students
MAT-212_2FA → CSC-285_2W
1 students
MAT-252_0SP → CSC-150_1FA
1 students
MAT-252_0SP → CSC-152_1FA
1 students
MAT-252_1SP → CSC-350_2FA
1 students
MAT-275_1FA → CSC-150_1SP
1 students
MAT-275_1FA → CSC-285_1W
1 students
MAT-275_1FA → MAT-130_1W
1 students
MAT-275_1FA → MAT-212_1W
1 students
MAT-275_1FA → MAT-361_1SP
1 students
MAT-275_1FA → MAT-370_1W
1 students
MAT-275_1FA → MAT-372_1W
1 students
MAT-275_2FA → MAT-199_2SP
1 students
MAT-275_2FA → MAT-28X_2W
1 students
MAT-28X_0W → CSC-160_0SP
1 students
MAT-28X_0W → CSC-235_0SP
1 students
MAT-28X_0W → CSC-270_0SP
1 students
MAT-28X_0W → MAT-199_0SP
1 students
MAT-28X_0W → MAT-361_0SP
1 students
MAT-28X_0W → MAT-431_0SP
1 students
MAT-28X_1W → CSC-160_1SP
1 students
MAT-28X_1W → MAT-199_2SP
1 students
MAT-28X_1W → MAT-212_1SP
1 students
MAT-28X_1W → MAT-352_2SP
1 students
MAT-28X_2W → MAT-352_2SP
1 students
MAT-294_1SP → CSC-152_2FA
1 students
MAT-294_1SP → MAT-311_2FA
1 students
MAT-311_1FA → CSC-285_1W
1 students
MAT-311_2FA → CSC-285_2W
1 students
MAT-311_2FA → MAT-352_2SP
1 students
MAT-311_2FA → MAT-361_2SP
1 students
MAT-311_2FA → MAT-372_2W
1 students
MAT-311_3FA → MAT-212_3SP
1 students
MAT-311_3FA → MAT-431_3SP
1 students
MAT-311_3FA → MAT-494_3W
1 students
MAT-352_1SP → CSC-152_2FA
1 students
MAT-352_1SP → MAT-212_2FA
1 students
MAT-352_1SP → MAT-361_2SP
1 students
MAT-352_1SP → MAT-399T_2FA
1 students
MAT-361_0SP → CSC-150_1FA
1 students
MAT-361_0SP → MAT-275_1FA
1 students
MAT-361_1SP → CSC-152_2FA
1 students
MAT-361_1SP → MAT-199_2FA
1 students
MAT-361_1SP → MAT-212_2FA
1 students
MAT-361_1SP → MAT-451_2SP
1 students
MAT-361_2SP → CSC-150_3FA
1 students
MAT-361_2SP → CSC-494_3FA
1 students
MAT-361_2SP → MAT-494_3FA
1 students
MAT-370_1W → CSC-160_1SP
1 students
MAT-370_1W → MAT-294_1SP
1 students
MAT-370_1W → MAT-352_1SP
1 students
MAT-372_0W → CSC-150_0SP
1 students
MAT-372_1W → CSC-150_1SP
1 students
MAT-372_2W → MAT-431_2SP
1 students
MAT-372_3W → MAT-431_3SP
1 students
MAT-399T_2FA → CSC-150_2SP
1 students
MAT-399T_2FA → MAT-451_2SP
1 students
MAT-451_2SP → CSC-150_3FA
1 students
MAT-451_2SP → MAT-199_3FA
1 students
MAT-451_2SP → MAT-311_3FA
1 students
MAT-451_2SP → MAT-494_3FA
1 students
MAT-494_3FA → MAT-372_3W
1 students
MAT-494_3W → CSC-270_3SP
1 students
MAT-494_3W → MAT-361_3SP
1 students
MAT-494_3W → MAT-431_3SP
1 students
CSC-150_0FA
6 students
CSC-150_0FA
CSC-150_0SP
27 students
CSC-150_0SP
CSC-150_1FA
5 students
CSC-150_1FA
CSC-150_1SP
6 students
CSC-150_1SP
CSC-150_2SP
2 students
CSC-150_2SP
CSC-150_3FA
2 students
CSC-150_3FA
CSC-152_0FA
4 students
CSC-152_0FA
CSC-152_1FA
17 students
CSC-152_1FA
CSC-152_2FA
4 students
CSC-152_2FA
CSC-160_0SP
5 students
CSC-160_0SP
CSC-160_1SP
3 students
CSC-160_1SP
CSC-160_2SP
1 students
CSC-160_2SP
CSC-235_0SP
2 students
CSC-235_0SP
CSC-235_2SP
1 students
CSC-235_2SP
CSC-270_0SP
3 students
CSC-270_0SP
CSC-270_1SP
8 students
CSC-270_1SP
CSC-270_2SP
1 students
CSC-270_2SP
CSC-270_3SP
1 students
CSC-270_3SP
CSC-285_1W
5 students
CSC-285_1W
CSC-285_2W
3 students
CSC-285_2W
CSC-340_1SP
6 students
CSC-340_1SP
CSC-350_2FA
3 students
CSC-350_2FA
CSC-494_2FA
1 students
CSC-494_2FA
CSC-494_3FA
3 students
CSC-494_3FA
CSC-497_1SU
2 students
CSC-497_1SU
CSC-497_2FA
0 students
CSC-497_2FA
MAT-101_0FA
2 students
MAT-101_0FA
MAT-102_0FA
2 students
MAT-102_0FA
MAT-102_0W
2 students
MAT-102_0W
MAT-102_0SP
1 students
MAT-102_0SP
MAT-102_1FA
1 students
MAT-102_1FA
MAT-111_0W
1 students
MAT-111_0W
MAT-125_0FA
2 students
MAT-125_0FA
MAT-130_0FA
1 students
MAT-130_0FA
MAT-130_0W
2 students
MAT-130_0W
MAT-130_0SP
4 students
MAT-130_0SP
MAT-130_1FA
1 students
MAT-130_1FA
MAT-130_1W
2 students
MAT-130_1W
MAT-150_0FA
29 students
MAT-150_0FA
MAT-150_0SP
7 students
MAT-150_0SP
MAT-150_1FA
2 students
MAT-150_1FA
MAT-175_0FA
27 students
MAT-175_0FA
MAT-175_0SP
18 students
MAT-175_0SP
MAT-175_1FA
5 students
MAT-175_1FA
MAT-175_1SP
4 students
MAT-175_1SP
MAT-199_0FA
3 students
MAT-199_0FA
MAT-199_0SP
3 students
MAT-199_0SP
MAT-199_1FA
2 students
MAT-199_1FA
MAT-199_2FA
1 students
MAT-199_2FA
MAT-199_2SP
2 students
MAT-199_2SP
MAT-199_3FA
2 students
MAT-199_3FA
MAT-199T_0SP
1 students
MAT-199T_0SP
MAT-212_0FA
0 students
MAT-212_0FA
MAT-212_0W
8 students
MAT-212_0W
MAT-212_0SP
3 students
MAT-212_0SP
MAT-212_1W
1 students
MAT-212_1W
MAT-212_1SP
1 students
MAT-212_1SP
MAT-212_2FA
2 students
MAT-212_2FA
MAT-212_3SP
1 students
MAT-212_3SP
MAT-252_0SP
8 students
MAT-252_0SP
MAT-252_1SP
8 students
MAT-252_1SP
MAT-252_2SP
1 students
MAT-252_2SP
MAT-275_1FA
24 students
MAT-275_1FA
MAT-275_2FA
4 students
MAT-275_2FA
MAT-28X_0W
13 students
MAT-28X_0W
MAT-28X_1W
31 students
MAT-28X_1W
MAT-28X_2W
1 students
MAT-28X_2W
MAT-294_1SP
2 students
MAT-294_1SP
MAT-311_1FA
2 students
MAT-311_1FA
MAT-311_2FA
8 students
MAT-311_2FA
MAT-311_3FA
6 students
MAT-311_3FA
MAT-352_1SP
12 students
MAT-352_1SP
MAT-352_2SP
7 students
MAT-352_2SP
MAT-361_0SP
2 students
MAT-361_0SP
MAT-361_1SP
9 students
MAT-361_1SP
MAT-361_2SP
6 students
MAT-361_2SP
MAT-361_3SP
1 students
MAT-361_3SP
MAT-370_1W
5 students
MAT-370_1W
MAT-372_0W
1 students
MAT-372_0W
MAT-372_1W
1 students
MAT-372_1W
MAT-372_2W
1 students
MAT-372_2W
MAT-372_3W
2 students
MAT-372_3W
MAT-399T_2FA
2 students
MAT-399T_2FA
MAT-431_0SP
1 students
MAT-431_0SP
MAT-431_2SP
2 students
MAT-431_2SP
MAT-431_3SP
4 students
MAT-431_3SP
MAT-451_2SP
4 students
MAT-451_2SP
MAT-494_3FA
2 students
MAT-494_3FA
MAT-494_3W
3 students
MAT-494_3W

The Data

Following general practice, let’s first look at the data that we’ll be using. I’ve said this is some anonymized student data consisting of an id (hashed, not actual student id’s), a course prefix and number crs, and a standardized year value std.year. This std.year variable indicates when a student is taking the course during their “college career” with 0.0 meaning first fall semester, 1.4 meaning spring of their second year. The first digit is basically the number of years of college completed (\(0,1,2,\) or \(3\)) and the second codes the semester type (\(0\) is fall, \(2\) is winter, \(4\) is spring and \(6\) is summer). These are numeric so I could do arithmetic and standardize things. The point is to help us see students taking a course like CSC-150 (Intro to CS) in their freshman versus junior years, how they got there, and what they do next.

Now that you understand the variables, here’s a glimpse of the data (and the packages I need for the post):

library(dplyr)
library(ggplot2)
library(stringr)
library(tidyr)
library(DT)
library(networkD3)

students <- read.csv("../../static/files/math-major-anon.csv", header=TRUE)
students %>% head() %>% knitr::kable()
id crs std.year
8b551f6ba68aef2c9da1bf682e44716c CSC-150 0.4
8b551f6ba68aef2c9da1bf682e44716c MAT-252 0.4
8b551f6ba68aef2c9da1bf682e44716c MAT-175 0.0
5069338819f368fb8960772475069d02 CSC-150 0.4
5069338819f368fb8960772475069d02 MAT-175 0.4
5069338819f368fb8960772475069d02 MAT-150 0.0

Let’s also explore the data a little. Here’s the distribution of std.year’s:

ggplot(students, aes(x=as.factor(std.year)))+geom_bar()

Clearly, most students are early in college and few students take summer math or CS courses (we don’t offer them, except internships and occasionally intro. stats).

Next, we’ll look at the courses by popularity (within our dataset):

library(widgetframe)

students %>% group_by(crs) %>% 
  summarise(Count = n_distinct(std.year), 
            Num.Students = n_distinct(id)) %>% 
  arrange(desc(Num.Students), desc(Count)) %>% 
  datatable() %>%
  frameWidget(height = 550, width = "100%")

Since these are math and math/cs majors, it’s not surprising that single variable calculus and intro cs are the two most common courses (they’re required of all students in this dataset). What may be more interesting is that students have completed single variable calculus (MAT-175) at 4 different points in their trajectory towards a math or math/cs major. This means that we will expect 4 nodes for MAT-175 in our sankey diagram (this may get messy).

There is one thing we can do to make things a little cleaner. The MAT-28x intro to proofs courses should be combined since students usually only take MAT-280, MAT-281, MAT-282, etc. depending on what’s offered the year they need to take it.

students$crs <- str_replace(students$crs, "MAT-28[:digit:]{1}", "MAT-28X")

Building Nodes and Edges

The sankeyNetwork command, like most networkD3 commands, needs to be given a graph (nodes and edges). Unfortunately, it’s not smart enough to build it from our students data frame, we need to build it.

The nodes are easier, so we’ll start there. We want an id for each node that combines the crs and std.year variables, so I’ll keep those and make a new id column (note this will lose all student info).

nodes <- group_by(students, crs, std.year) %>% 
  summarise(cnt = n()) %>% ungroup() %>%
  select(name=crs, std.year) %>% 
  unite(id, name, std.year, sep="_", remove=FALSE)

nodes %>% head() %>% knitr::kable()
id name std.year
CSC-150_0 CSC-150 0.0
CSC-150_0.4 CSC-150 0.4
CSC-150_1 CSC-150 1.0
CSC-150_1.4 CSC-150 1.4
CSC-150_2.4 CSC-150 2.4
CSC-150_3 CSC-150 3.0

The group_by, summarise, and ungroup sequence is just a way to collapse down to each distinct crs, std.year pair that occurs in the data. We now have 89 nodes that will appear in the diagram.

Now for the edges. We’re going to create a data frame of “source” nodes and another of “target” nodes (remember we’re building a directed graph). A full join of the two will give all possible edges, so we’ll then remove those that aren’t needed. Most of the extra edges will be those connecting what a student did during fall of freshmen year to all subsequent courses, not just the immediate “next” course that we want plotted. These seem far easier to remove then to selectively build a much more careful join (only joining pairs with “adjacent” standard years).

Here’s the code to build the first pass of an edge list data frame:

#build source data
s.crs <- select(students, s.name=crs, s.std.year=std.year, id)
s.crs <- s.crs[!duplicated(s.crs),]
#build target data
t.crs <- select(students, t.name=crs, t.std.year=std.year, id)
t.crs <- t.crs[!duplicated(t.crs),]
#join and remove self-loops and backward edges
edgePerStudent <- full_join(s.crs, t.crs) %>% 
  filter(s.name!=t.name, s.std.year<t.std.year) %>%
  arrange(id, s.std.year, t.std.year)

edgePerStudent %>% head() %>% knitr::kable()
s.name s.std.year id t.name t.std.year
CSC-150 0.4 09d4fb320db773a123eb97e7260caba1 MAT-275 1.0
MAT-175 0.4 09d4fb320db773a123eb97e7260caba1 MAT-275 1.0
CSC-150 0.4 09d4fb320db773a123eb97e7260caba1 MAT-28X 1.2
MAT-175 0.4 09d4fb320db773a123eb97e7260caba1 MAT-28X 1.2
CSC-150 0.4 09d4fb320db773a123eb97e7260caba1 MAT-352 1.4
MAT-175 0.4 09d4fb320db773a123eb97e7260caba1 MAT-352 1.4

As the name implies, we now have a data frame where each row is an edge from a distinct student. It’s important to note that I’ve arranged this in order for what comes next: id, then source standard year, and then target standard year. This data frame has 1112 rows, which is too many. We have extra edges formed by paths of what we want to keep and we need to collapse to have an edge with a weight equal to number of students taking that sequence of courses. The second part is easy (we could do a group_by and summarise now, but we would over count). The first part is possibly the ugliest R code I’ve written in a while, but it works and without messy lags and functional trickery is fairly straightforward. I initially tried to use a “run-length encoding”, but the resulting objects became to cumbersome (an rle object isn’t a data frame or a list or anything else “nice”), so here’s a for loop in R:

#build edges
edgeDF <- edgePerStudent[1,] #got to start somewhere
for(i in 2:nrow(edgePerStudent)){
  if(edgeDF[nrow(edgeDF),]$id != edgePerStudent[i,]$id){
    #different id's count
    edgeDF <- rbind.data.frame(edgeDF, edgePerStudent[i,])
  }
  else if(edgeDF[nrow(edgeDF),]$s.std.year == edgePerStudent[i,]$s.std.year & edgeDF[nrow(edgeDF),]$t.std.year == edgePerStudent[i,]$t.std.year){
    #same source and target as counting counts
    edgeDF <- rbind.data.frame(edgeDF, edgePerStudent[i,])
  }
  else if(edgeDF[nrow(edgeDF),]$t.std.year == edgePerStudent[i,]$s.std.year){
    #if last counted target is current source, it counts
    edgeDF <- rbind.data.frame(edgeDF, edgePerStudent[i,])
  }
  
}

#now count edge weights
edgeDF <- group_by(edgeDF, s.name, s.std.year, t.name,t.std.year) %>% 
  summarise(value=n_distinct(id))
edgeDF <- unite(edgeDF, s.id, s.name,s.std.year, sep="_", remove=FALSE) %>%
  unite(t.id, t.name,t.std.year, sep="_", remove = FALSE)

edgeDF %>% head() %>% knitr::kable()
s.id s.name s.std.year t.id t.name t.std.year value
CSC-150_0 CSC-150 0.0 CSC-160_0.4 CSC-160 0.4 1
CSC-150_0 CSC-150 0.0 MAT-175_0.4 MAT-175 0.4 1
CSC-150_0 CSC-150 0.0 MAT-212_0.2 MAT-212 0.2 1
CSC-150_0 CSC-150 0.0 MAT-28X_0.2 MAT-28X 0.2 3
CSC-150_0.4 CSC-150 0.4 CSC-152_1 CSC-152 1.0 8
CSC-150_0.4 CSC-150 0.4 MAT-175_1 MAT-175 1.0 3

Now we have 198 edges, with the desired weights for the sankey diagram.

Making the Diagram

Now we just call sankeyNetwork and we’re done right? No, because we have data in R that we have to send to JavaScript to do the plotting so there’s a little book-keeping left to do. First, we need numeric node id’s and we need those id’s in the edge data.

#replace character s.id and t.id with numbers
nodes$node.id <- 1:nrow(nodes)
edgeDF <- inner_join(edgeDF, nodes, by=c("s.id"="id"))
edgeDF <- select(edgeDF, -s.id, source=node.id)
edgeDF <- inner_join(edgeDF, nodes, by=c("t.id"="id"))
edgeDF <- select(edgeDF, -t.id, target=node.id)

Next is the big conflict between R and almost every other programming language: indexing. R starts counting at 1, but JavaScript starts at 0 (as does Python, C/C++, Java, …) so we’ll have to re-index our node id’s. I’ll also replace everything after the first digit of std.year info in the node id variable, with a string indicating the semester.

#switch to zero indexing for javascript
edgeDFTemp <- mutate(edgeDF, source=source-1, target = target-1)
nodeDFTemp <- mutate(nodes, nid = node.id-1, id = 
                       ifelse(str_detect(id, "\\.6"), 
                              str_replace(id, "\\.6","SU"),
                              ifelse(str_detect(id, "\\.4"),
                                     str_replace(id, "\\.4", "SP"),
                                     ifelse(str_detect(id, "\\.2"),
                                            str_replace(id, "\\.2", "W"),
                                            str_c(id, "FA")))))

sn2 <- sankeyNetwork(Links = edgeDFTemp, Nodes = nodeDFTemp, Source = "source", 
              Target = "target", Value = "value", NodeID = "id", 
              nodeWidth = 20, fontSize = 8, units = "students")
htmlwidgets::onRender(sn2, 'document.getElementsByTagName("svg")[2].setAttribute("viewBox", "")')

Related