At the College of Idaho, there’s been discussion about visualizing the curriculum as well as understanding the curriculum. Naturally this interests me as a chance to wallow in some complicated data (students are required to complete a major and 3 minors across 4 “peaks” rather than complete courses from a traditional “core”). I thought using R and a Neo4j graph database would be useful (something to look forward to) - but first I needed to get data from the catalogue!
Web Scraping
Web scraping is one the essential data science skills that I’m not as fluent in as I would like to be, so getting the course data from the catalogue is a nice chance to hone my skills. The rvest
package (from Hadley Wickham) is a great tool for web scraping in R and combines well with stringr
text manipulation functions and the SelectorGadget
chrome extension.
We’ll start by getting a list of all subjects in the catalogue, with links to the subject page which has the class list in that subject.
base_url <- "http://collegeofidaho.smartcatalogiq.com"
base_url_ext <- "/en/2017-2018/Undergraduate-Catalog/Courses"
#read base page
base_html <- read_html(paste0(base_url,base_url_ext))
#extract links from base page
subjectLinks <- html_nodes(base_html, 'a')
#convert links to text
subjectText <- html_text(subjectLinks)
subjectText <- subjectText[78:120]
#keep url for each subject also
sub_url <- html_attr(subjectLinks, 'href')[78:120]
subDF <- str_split(subjectText,pattern= '-', n=2,
simplify=TRUE) %>% as.data.frame()
names(subDF) <- c("sub", "subject")
subDF <-mutate(subDF, sub = str_trim(sub, side = "both"),
subject = str_trim(subject, side = "both"),
url=paste0(base_url,sub_url))
kable(head(subDF),"html") %>%
kable_styling( bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
The kable_styling
function is from the kableExtra
package and provides some nice features to control kable style. Now we need to follow the link for each subject and get a dataframe of course numbers, names and urls for description, credit, and prerequisite info. The nice way of doing this is use purrr’s map_dfr
function which is a more user friendly version of *apply
and rbind.data.frame
. As with apply, we’ll need a function to call on each subject.
get_class_list <- function(i){
#get list of links on subject page
class_links <- html_nodes(read_html(subDF$url[i]), 'a')
#turn links to text
class_list <- html_text(class_links)
class_url <- html_attr(class_links, 'href')
classDF <- data.frame(list=class_list, url=class_url)
#only keep links for classes, each subject has
#classes starting in a different position
classDF <- classDF %>% filter(str_detect(list,
paste0(subDF$sub[i],'-')))
#We'll split on 1st space, discard everything after it and use
#what's before it to build the required DF
classDF <- separate(classDF, list,
into=c("id", "name"), sep = " ", extra = "merge")
#two theater classes have typos -THE-###
#this is solely dealing with that
classDF$id <- str_replace(classDF$id, "-THE", "THE")
#back to normal
classDF <- separate(classDF, id, into =
c("sub", "number"), sep="-")
#the id field has the last part of the new url, we need the
#subject url with the course level (100,200,etc) then id
classDF <- mutate(classDF, url=paste0(base_url,url))
#several class names have '\n' in them, let's remove that now
classDF <- mutate(classDF, name=str_replace_all(name, '\n',' '))
return(classDF)
}
classes <- map_dfr(1:length(subDF$sub), get_class_list)
#add level variable (100,200, etc.)
classes <- classes %>% mutate(level=paste0(
str_extract(number, '[:digit:]'),"00"))
kable(head(classes),"html") %>%
kable_styling( bootstrap_options =
c("striped", "hover", "condensed",
"responsive"), full_width=FALSE)
sub | number | name | url | level |
---|---|---|---|---|
ACC | 221 | Financial Accounting | http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/200/ACC-221 | 200 |
ACC | 222 | Managerial Accounting | http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/200/ACC-222 | 200 |
ACC | 318 | Intermediate Accounting I | http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/300/ACC-318 | 300 |
ACC | 320 | Intermediate Accounting II | http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/300/ACC-320 | 300 |
ACC | 323 | Electronic Accounting & Analysis | http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/300/ACC-323 | 300 |
ACC | 423 | Auditing | http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/400/ACC-423 | 400 |
Now we have the basic class info, on to some descriptive analysis.
The Counts
I should mention that this data is all from the current (2017-2018) catalogue, so it doesn’t reflect courses added, removed, or changed during this academic year. The most basic question is how many classes does CofI offer? Well, the 43 subjects offer 1029 courses, but how are they distributed across subject and level?
First, let’s group by subject to see which subjects offer the most and least courses.
group_by(classes, sub) %>%
summarise(Count = n()) %>%
arrange(desc(Count)) %>% head() %>%
kable("html", caption =
"Subjects with the Most Courses") %>%
kable_styling( bootstrap_options = c("striped",
"hover", "condensed", "responsive"), full_width=FALSE)
sub | Count |
---|---|
HIS | 94 |
BIO | 92 |
ENG | 61 |
MUS | 54 |
PSY | 54 |
POE | 51 |
group_by(classes, sub) %>% summarise(Count = n()) %>%
arrange(desc(Count)) %>% tail() %>%
kable("html", caption = "Subjects with the Least Courses") %>%
kable_styling( bootstrap_options = c("striped", "hover",
"condensed", "responsive"), full_width=FALSE)
sub | Count |
---|---|
ASN | 3 |
EDUSPA | 3 |
ESL | 2 |
FYS | 2 |
ECN | 1 |
STS | 1 |
Clearly, History doesn’t teach all of those classes every semester. It would be interesting to incorporate frequency of classes into this analysis, but that requires data from elsewhere since CofI doesn’t typically put course frequency into the catalogue.
Second, we can ignore subjects and look at the distribution of courses at different levels:
group_by(classes, level) %>% summarise(Count = n()) %>%
kable("html", caption = "Number of Courses at each Level") %>%
kable_styling( bootstrap_options = c("striped", "hover",
"condensed", "responsive"), full_width=FALSE)
level | Count |
---|---|
000 | 1 |
100 | 173 |
200 | 222 |
300 | 413 |
400 | 220 |
Finally, we can look at the distribution grouped by subject and level but because of the range of course counts at each subject, it’s probably better to work with proportions rather than counts. I’ll also order by decreasing proportion and note that if a subject (like accounting) doesn’t teach any classes of a particular level, we won’t see zeros in the list.
subCnt <- group_by(classes, sub) %>% summarise(Count_s = n())
sublevDF <- group_by(classes, sub, level) %>%
summarise(Count_sl = n())
sublevDF <- inner_join(sublevDF, subCnt, by="sub") %>%
mutate(sub.prop = round(Count_sl/Count_s, 4)) %>%
dplyr::select(sub, level, sub.prop) %>%
arrange(desc(sub.prop))
kable(sublevDF, "html", caption = "Proportion of Subject's
Courses at each Level") %>%
kable_styling( bootstrap_options = c("striped", "hover",
"condensed", "responsive"), full_width=FALSE) %>%
scroll_box(width="100%", height = "250px")
sub | level | sub.prop |
---|---|---|
ECN | 200 | 1.0000 |
EDUSPA | 100 | 1.0000 |
ESL | 100 | 1.0000 |
FYS | 100 | 1.0000 |
STS | 100 | 1.0000 |
ATHSOC | 400 | 0.8333 |
HHPA | 100 | 0.8276 |
MFL | 400 | 0.7273 |
HIS | 300 | 0.6702 |
ASN | 300 | 0.6667 |
IND | 300 | 0.6250 |
HSC | 300 | 0.6000 |
ACC | 400 | 0.5833 |
CHI | 100 | 0.5714 |
GER | 200 | 0.5714 |
BIO | 300 | 0.5652 |
POE | 300 | 0.5294 |
SOC | 300 | 0.5294 |
PHI | 300 | 0.5263 |
PHY | 200 | 0.5161 |
EDU | 400 | 0.5000 |
LAS | 400 | 0.5000 |
LSP | 200 | 0.5000 |
SPA | 300 | 0.5000 |
SPE | 300 | 0.5000 |
PSY | 300 | 0.4815 |
ENV | 300 | 0.4762 |
CSC | 400 | 0.4667 |
ENG | 300 | 0.4590 |
CHE | 400 | 0.4324 |
BUS | 400 | 0.4146 |
ENG | 200 | 0.4098 |
FRE | 200 | 0.4000 |
JOURN | 400 | 0.4000 |
MUSAP | 400 | 0.4000 |
REL | 300 | 0.4000 |
THE | 300 | 0.4000 |
BUS | 300 | 0.3902 |
HHP | 300 | 0.3864 |
EDU | 300 | 0.3750 |
GEO | 100 | 0.3750 |
GEO | 400 | 0.3750 |
REL | 200 | 0.3667 |
MS | 200 | 0.3636 |
MUS | 100 | 0.3519 |
ART | 300 | 0.3415 |
ASN | 400 | 0.3333 |
CHE | 300 | 0.3243 |
HHP | 400 | 0.3182 |
THE | 200 | 0.3111 |
FRE | 100 | 0.3000 |
FRE | 300 | 0.3000 |
JOURN | 200 | 0.3000 |
JOURN | 300 | 0.3000 |
ATH | 200 | 0.2941 |
ATH | 300 | 0.2941 |
MAT | 200 | 0.2941 |
CHI | 200 | 0.2857 |
GER | 300 | 0.2857 |
MS | 300 | 0.2727 |
ART | 100 | 0.2683 |
MAT | 100 | 0.2647 |
PHI | 200 | 0.2632 |
MUS | 300 | 0.2593 |
ACC | 300 | 0.2500 |
GEO | 300 | 0.2500 |
LAS | 300 | 0.2500 |
LSP | 400 | 0.2500 |
SPE | 100 | 0.2500 |
SPE | 200 | 0.2500 |
PSY | 400 | 0.2407 |
ATH | 400 | 0.2353 |
MAT | 400 | 0.2353 |
SPA | 400 | 0.2333 |
MAT | 300 | 0.2059 |
CSC | 100 | 0.2000 |
CSC | 200 | 0.2000 |
HSC | 100 | 0.2000 |
HSC | 400 | 0.2000 |
MUSAP | 100 | 0.2000 |
MUSAP | 200 | 0.2000 |
MUSAP | 300 | 0.2000 |
BIO | 100 | 0.1957 |
ART | 200 | 0.1951 |
ART | 400 | 0.1951 |
PHY | 400 | 0.1935 |
ENV | 200 | 0.1905 |
ENV | 400 | 0.1905 |
IND | 200 | 0.1875 |
MUS | 200 | 0.1852 |
MUS | 400 | 0.1852 |
MFL | 200 | 0.1818 |
MS | 100 | 0.1818 |
MS | 400 | 0.1818 |
THE | 400 | 0.1778 |
ATH | 100 | 0.1765 |
POE | 100 | 0.1765 |
POE | 200 | 0.1765 |
SOC | 200 | 0.1765 |
SOC | 400 | 0.1765 |
ACC | 200 | 0.1667 |
ATHSOC | 200 | 0.1667 |
PSY | 200 | 0.1667 |
SPA | 200 | 0.1667 |
PHY | 100 | 0.1613 |
HHP | 200 | 0.1591 |
PHI | 400 | 0.1579 |
BIO | 200 | 0.1522 |
HIS | 200 | 0.1489 |
CHI | 400 | 0.1429 |
ENV | 100 | 0.1429 |
GER | 400 | 0.1429 |
HIS | 400 | 0.1383 |
HHPA | 200 | 0.1379 |
HHP | 100 | 0.1364 |
CHE | 100 | 0.1351 |
CSC | 300 | 0.1333 |
REL | 400 | 0.1333 |
PHY | 300 | 0.1290 |
EDU | 200 | 0.1250 |
IND | 100 | 0.1250 |
LAS | 100 | 0.1250 |
LAS | 200 | 0.1250 |
LSP | 100 | 0.1250 |
LSP | 300 | 0.1250 |
POE | 400 | 0.1176 |
SOC | 100 | 0.1176 |
ENG | 400 | 0.1148 |
PSY | 100 | 0.1111 |
THE | 100 | 0.1111 |
CHE | 200 | 0.1081 |
REL | 100 | 0.1000 |
SPA | 100 | 0.1000 |
BUS | 100 | 0.0976 |
BUS | 200 | 0.0976 |
MFL | 300 | 0.0909 |
BIO | 400 | 0.0870 |
IND | 400 | 0.0625 |
PHI | 100 | 0.0526 |
HIS | 100 | 0.0426 |
HHPA | 300 | 0.0345 |
MUS | 000 | 0.0185 |
ENG | 100 | 0.0164 |
Course Name Word Cloud
Now for a different type of counting, let’s make a word cloud of words in course names. I would rather use the wordcloud2
package which allows for some interesting visualizations, but it relies on javascript which doesn’t play well with a static site generator (but a shiny app would work). We’ll need to load some additional packages and split up the name data into words and frequencies, after doing a little cleaning of the names.
library(devtools)
library(tm)
library(SnowballC)
library(wordcloud)
#Load and clean
wordFreq <- Corpus(VectorSource(classes$name))
wordFreq <- tm_map(wordFreq, PlainTextDocument)
wordFreq <- tm_map(wordFreq, content_transformer(tolower))
wordFreq <- tm_map(wordFreq, removePunctuation)
wordFreq <- tm_map(wordFreq, removeWords, c("a", "the", "and", "for"))
wordFreq <- tm_map(wordFreq, stripWhitespace)
#build TDM and DF of words and frequencies
wordTDM <- TermDocumentMatrix(wordFreq)
wordTDM<- wordTDM %>% as.matrix() %>% rowSums() %>% sort()
wordDF <- data.frame(word=names(wordTDM), freq=wordTDM)
pal <- brewer.pal(9, "Purples")[-(1:3)]
wordcloud(words=wordDF$word, freq=wordDF$freq,
min.freq = 6, random.order = FALSE, rot.per = .25,
scale = c(3,.5), colors = pal)
The special topics and independent study classes that almost all subjects have along with lab and introduction are not too surprising. History, with it’s high course count and the fact that it appears in other subjects (History of Math or Music History) is also not surprising. Beyond that, the wordcloud shows a nice diversity of disciplines and unifying ideas.
Clearly this is just scratching the surface of exploring a college’s curriculum and it would be interesting to compare similar schools in addition to exploring deeper into course descriptions and the links formed by prerequisites and major/minor requirements.