Counting Classes: The Basics

At the College of Idaho, there’s been discussion about visualizing the curriculum as well as understanding the curriculum. Naturally this interests me as a chance to wallow in some complicated data (students are required to complete a major and 3 minors across 4 “peaks” rather than complete courses from a traditional “core”). I thought using R and a Neo4j graph database would be useful (something to look forward to) - but first I needed to get data from the catalogue!

Web Scraping

Web scraping is one the essential data science skills that I’m not as fluent in as I would like to be, so getting the course data from the catalogue is a nice chance to hone my skills. The rvest package (from Hadley Wickham) is a great tool for web scraping in R and combines well with stringr text manipulation functions and the SelectorGadget chrome extension.

We’ll start by getting a list of all subjects in the catalogue, with links to the subject page which has the class list in that subject.

base_url <- "http://collegeofidaho.smartcatalogiq.com"
base_url_ext <- "/en/2017-2018/Undergraduate-Catalog/Courses"

#read base page
base_html <- read_html(paste0(base_url,base_url_ext))
#extract links from base page
subjectLinks <- html_nodes(base_html, 'a')
#convert links to text
subjectText <- html_text(subjectLinks)
subjectText <- subjectText[78:120]
#keep url for each subject also
sub_url <- html_attr(subjectLinks, 'href')[78:120]

subDF <- str_split(subjectText,pattern= '-', n=2, 
                   simplify=TRUE) %>% as.data.frame()
names(subDF) <- c("sub", "subject")
subDF <-mutate(subDF, sub = str_trim(sub, side = "both"), 
               subject = str_trim(subject, side = "both"), 
               url=paste0(base_url,sub_url))

kable(head(subDF),"html") %>%  
  kable_styling( bootstrap_options = c("striped", "hover",                                   "condensed", "responsive"), full_width=FALSE) 
sub subject url
ACC Accounting http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting
ART Art http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ART-Art
ASN Asian Studies http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ASN-Asian-Studies
ATH Anthropology http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ATH-Anthropology
ATHSOC Anthropology Sociology http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ATHSOC-Anthropology-Sociology
BIO Biology http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/BIO-Biology

The kable_styling function is from the kableExtra package and provides some nice features to control kable style. Now we need to follow the link for each subject and get a dataframe of course numbers, names and urls for description, credit, and prerequisite info. The nice way of doing this is use purrr’s map_dfr function which is a more user friendly version of *apply and rbind.data.frame. As with apply, we’ll need a function to call on each subject.

get_class_list <- function(i){
  #get list of links on subject page
  class_links <- html_nodes(read_html(subDF$url[i]), 'a')
  #turn links to text
  class_list <- html_text(class_links)
  class_url <- html_attr(class_links, 'href')
  classDF <- data.frame(list=class_list, url=class_url)
  
  #only keep links for classes, each subject has 
  #classes starting in a different position
  classDF <- classDF %>% filter(str_detect(list,
                                  paste0(subDF$sub[i],'-')))

    #We'll split on 1st space, discard everything after it and use
  #what's before it to build the required DF
  classDF <- separate(classDF, list, 
                  into=c("id", "name"), sep = " ", extra = "merge")
  
  #two theater classes have typos -THE-###
  #this is solely dealing with that
  classDF$id <- str_replace(classDF$id, "-THE", "THE")

  #back to normal  
  classDF <- separate(classDF, id, into = 
                        c("sub", "number"), sep="-")
  
  #the id field has the last part of the new url, we need the 
  #subject url with the course level (100,200,etc) then id
  classDF <- mutate(classDF, url=paste0(base_url,url))
  
  #several class names have '\n' in them, let's remove that now
  classDF <- mutate(classDF, name=str_replace_all(name, '\n',' '))
  
  return(classDF)
}

classes <- map_dfr(1:length(subDF$sub), get_class_list)
#add level variable (100,200, etc.)
classes <- classes %>% mutate(level=paste0(
          str_extract(number, '[:digit:]'),"00"))

kable(head(classes),"html") %>%  
  kable_styling( bootstrap_options = 
               c("striped", "hover", "condensed", 
                 "responsive"), full_width=FALSE) 
sub number name url level
ACC 221 Financial Accounting http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/200/ACC-221 200
ACC 222 Managerial Accounting http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/200/ACC-222 200
ACC 318 Intermediate Accounting I http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/300/ACC-318 300
ACC 320 Intermediate Accounting II http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/300/ACC-320 300
ACC 323 Electronic Accounting & Analysis http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/300/ACC-323 300
ACC 423 Auditing http://collegeofidaho.smartcatalogiq.com/en/2017-2018/Undergraduate-Catalog/Courses/ACC-Accounting/400/ACC-423 400

Now we have the basic class info, on to some descriptive analysis.

The Counts

I should mention that this data is all from the current (2017-2018) catalogue, so it doesn’t reflect courses added, removed, or changed during this academic year. The most basic question is how many classes does CofI offer? Well, the 43 subjects offer 1029 courses, but how are they distributed across subject and level?

First, let’s group by subject to see which subjects offer the most and least courses.

group_by(classes, sub) %>% 
  summarise(Count = n()) %>% 
  arrange(desc(Count)) %>% head() %>% 
  kable("html", caption = 
          "Subjects with the Most Courses") %>% 
  kable_styling( bootstrap_options = c("striped", 
                  "hover", "condensed", "responsive"), full_width=FALSE) 
Table 1: Subjects with the Most Courses
sub Count
HIS 94
BIO 92
ENG 61
MUS 54
PSY 54
POE 51
group_by(classes, sub) %>% summarise(Count = n()) %>% 
  arrange(desc(Count)) %>% tail() %>% 
  kable("html", caption = "Subjects with the Least Courses") %>%
  kable_styling( bootstrap_options = c("striped", "hover", 
                        "condensed", "responsive"), full_width=FALSE) 
Table 2: Subjects with the Least Courses
sub Count
ASN 3
EDUSPA 3
ESL 2
FYS 2
ECN 1
STS 1

Clearly, History doesn’t teach all of those classes every semester. It would be interesting to incorporate frequency of classes into this analysis, but that requires data from elsewhere since CofI doesn’t typically put course frequency into the catalogue.

Second, we can ignore subjects and look at the distribution of courses at different levels:

group_by(classes, level) %>% summarise(Count = n()) %>% 
  kable("html", caption = "Number of Courses at each Level") %>%
  kable_styling( bootstrap_options = c("striped", "hover", 
                          "condensed", "responsive"), full_width=FALSE) 
Table 3: Number of Courses at each Level
level Count
000 1
100 173
200 222
300 413
400 220

Finally, we can look at the distribution grouped by subject and level but because of the range of course counts at each subject, it’s probably better to work with proportions rather than counts. I’ll also order by decreasing proportion and note that if a subject (like accounting) doesn’t teach any classes of a particular level, we won’t see zeros in the list.

subCnt <- group_by(classes, sub) %>% summarise(Count_s = n())
sublevDF <- group_by(classes, sub, level) %>% 
  summarise(Count_sl = n())

sublevDF <- inner_join(sublevDF, subCnt, by="sub") %>% 
  mutate(sub.prop = round(Count_sl/Count_s, 4)) %>% 
  dplyr::select(sub, level, sub.prop) %>% 
  arrange(desc(sub.prop))

kable(sublevDF, "html", caption = "Proportion of Subject's 
      Courses at each Level") %>%  
  kable_styling( bootstrap_options = c("striped", "hover", 
          "condensed", "responsive"), full_width=FALSE) %>%
  scroll_box(width="100%", height = "250px")
Table 4: Proportion of Subject’s Courses at each Level
sub level sub.prop
ECN 200 1.0000
EDUSPA 100 1.0000
ESL 100 1.0000
FYS 100 1.0000
STS 100 1.0000
ATHSOC 400 0.8333
HHPA 100 0.8276
MFL 400 0.7273
HIS 300 0.6702
ASN 300 0.6667
IND 300 0.6250
HSC 300 0.6000
ACC 400 0.5833
CHI 100 0.5714
GER 200 0.5714
BIO 300 0.5652
POE 300 0.5294
SOC 300 0.5294
PHI 300 0.5263
PHY 200 0.5161
EDU 400 0.5000
LAS 400 0.5000
LSP 200 0.5000
SPA 300 0.5000
SPE 300 0.5000
PSY 300 0.4815
ENV 300 0.4762
CSC 400 0.4667
ENG 300 0.4590
CHE 400 0.4324
BUS 400 0.4146
ENG 200 0.4098
FRE 200 0.4000
JOURN 400 0.4000
MUSAP 400 0.4000
REL 300 0.4000
THE 300 0.4000
BUS 300 0.3902
HHP 300 0.3864
EDU 300 0.3750
GEO 100 0.3750
GEO 400 0.3750
REL 200 0.3667
MS 200 0.3636
MUS 100 0.3519
ART 300 0.3415
ASN 400 0.3333
CHE 300 0.3243
HHP 400 0.3182
THE 200 0.3111
FRE 100 0.3000
FRE 300 0.3000
JOURN 200 0.3000
JOURN 300 0.3000
ATH 200 0.2941
ATH 300 0.2941
MAT 200 0.2941
CHI 200 0.2857
GER 300 0.2857
MS 300 0.2727
ART 100 0.2683
MAT 100 0.2647
PHI 200 0.2632
MUS 300 0.2593
ACC 300 0.2500
GEO 300 0.2500
LAS 300 0.2500
LSP 400 0.2500
SPE 100 0.2500
SPE 200 0.2500
PSY 400 0.2407
ATH 400 0.2353
MAT 400 0.2353
SPA 400 0.2333
MAT 300 0.2059
CSC 100 0.2000
CSC 200 0.2000
HSC 100 0.2000
HSC 400 0.2000
MUSAP 100 0.2000
MUSAP 200 0.2000
MUSAP 300 0.2000
BIO 100 0.1957
ART 200 0.1951
ART 400 0.1951
PHY 400 0.1935
ENV 200 0.1905
ENV 400 0.1905
IND 200 0.1875
MUS 200 0.1852
MUS 400 0.1852
MFL 200 0.1818
MS 100 0.1818
MS 400 0.1818
THE 400 0.1778
ATH 100 0.1765
POE 100 0.1765
POE 200 0.1765
SOC 200 0.1765
SOC 400 0.1765
ACC 200 0.1667
ATHSOC 200 0.1667
PSY 200 0.1667
SPA 200 0.1667
PHY 100 0.1613
HHP 200 0.1591
PHI 400 0.1579
BIO 200 0.1522
HIS 200 0.1489
CHI 400 0.1429
ENV 100 0.1429
GER 400 0.1429
HIS 400 0.1383
HHPA 200 0.1379
HHP 100 0.1364
CHE 100 0.1351
CSC 300 0.1333
REL 400 0.1333
PHY 300 0.1290
EDU 200 0.1250
IND 100 0.1250
LAS 100 0.1250
LAS 200 0.1250
LSP 100 0.1250
LSP 300 0.1250
POE 400 0.1176
SOC 100 0.1176
ENG 400 0.1148
PSY 100 0.1111
THE 100 0.1111
CHE 200 0.1081
REL 100 0.1000
SPA 100 0.1000
BUS 100 0.0976
BUS 200 0.0976
MFL 300 0.0909
BIO 400 0.0870
IND 400 0.0625
PHI 100 0.0526
HIS 100 0.0426
HHPA 300 0.0345
MUS 000 0.0185
ENG 100 0.0164

Course Name Word Cloud

Now for a different type of counting, let’s make a word cloud of words in course names. I would rather use the wordcloud2 package which allows for some interesting visualizations, but it relies on javascript which doesn’t play well with a static site generator (but a shiny app would work). We’ll need to load some additional packages and split up the name data into words and frequencies, after doing a little cleaning of the names.

library(devtools)
library(tm)
library(SnowballC)
library(wordcloud)

#Load and clean
wordFreq <- Corpus(VectorSource(classes$name))
wordFreq <- tm_map(wordFreq, PlainTextDocument)
wordFreq <- tm_map(wordFreq, content_transformer(tolower))
wordFreq <- tm_map(wordFreq, removePunctuation)
wordFreq <- tm_map(wordFreq, removeWords, c("a", "the", "and", "for"))
wordFreq <- tm_map(wordFreq, stripWhitespace)

#build TDM and DF of words and frequencies
wordTDM <- TermDocumentMatrix(wordFreq)
wordTDM<- wordTDM %>% as.matrix() %>% rowSums() %>% sort()
wordDF <- data.frame(word=names(wordTDM), freq=wordTDM)

pal <- brewer.pal(9, "Purples")[-(1:3)]

wordcloud(words=wordDF$word, freq=wordDF$freq, 
          min.freq = 6,  random.order = FALSE, rot.per = .25, 
          scale = c(3,.5), colors = pal)

The special topics and independent study classes that almost all subjects have along with lab and introduction are not too surprising. History, with it’s high course count and the fact that it appears in other subjects (History of Math or Music History) is also not surprising. Beyond that, the wordcloud shows a nice diversity of disciplines and unifying ideas.

Clearly this is just scratching the surface of exploring a college’s curriculum and it would be interesting to compare similar schools in addition to exploring deeper into course descriptions and the links formed by prerequisites and major/minor requirements.

Related