A college curriculum seems like something that is a natural fit for a graph database. My last post collected data from the College of Idaho’s online catalog, using that and some information about majors and minors I’ve populated a graph database in Neo4j. In this post I’ll show how to do some basic queries that return tabular data as well as graph data using .
Graph DB Basics
For those who haven’t had much discrete math or computer science, a graph is a collection of nodes (aka vertices) and edges that connect nodes. A graph database is is a form of NoSQL databases (broadly defined) that stores data as a graph where nodes and edges can have various properties attached to them. Graph databases like Neo4j excel at working with highly connected data since relationships (edges) are put on an equal footing with objects (nodes).
To model a college curriculum as a graph, any “things” such as courses, minors, majors, or groups of classes are stored as nodes of different types and different types of nodes have different properties. For example course nodes with have subject, number, url (for catalog description), description, credits, etc.. The relationships between the nodes are the edges such as prerequisites (links between courses), “Satisfies” (courses can satisfy a requirement of a minor/major), and “Part_Of” (groups of courses are a part of a minor/major). The introduction of what I refer to as “Component” nodes (components of majors or minors) are part of the graph model to split up the large number of courses that may be options for a minor or major. This is done to improve visualization as well as database performance.
Connect to the DB
A neo4j database needs to be running locally with data in it. We can connect to it in R via:
library(RNeo4j)
#data is already loaded into database
gdb <- startGraph("http://localhost:7474/db/data", user="neo4j",
password = "maps") #only the password is DB specific here
Note the last comment. The default user is neo4j and the port and path are all defaults. Only the password is mine, and since this is a local DB with no sensitive info, I’m not concerned about revealing the password in a blog post, or trying to pick a secure password.
Basic Counting
We’ll start with some basic counting of courses and minor structure. For each minor, we list the number of courses required by that minor. This count only includes specific courses required such as MAT-175, if the minor requires either MAT-175 or MAT-275 that is an option group (or component) counted later.
query <- 'MATCH (c:Course)-[:Satisfies{type:"required"}]->(mc:Component{name:"req"})-[:Part_Of]->(m:Minor)
RETURN m.name AS Minor, count(c) AS NumberClasses, sum(toInt(c.minCredits)) AS CreditMin'
gq <- cypher(gdb,query) %>% as.data.frame()
kable(arrange(gq, Minor, desc(NumberClasses)), caption = "Number of courses required for each minor")
Minor | NumberClasses | CreditMin |
---|---|---|
Applied Math | 1 | 4 |
Computer Science | 4 | 16 |
Computer Studies | 2 | 8 |
Geosciences | 1 | 3 |
Physical Science | 1 | 4 |
Physics | 4 | 16 |
Next we look at the number of “option groups” for each minor. These are the number of groups of courses from which a selection of credits or courses must be completed. We then look at the number of courses and minimum credits available for each option group in each minor.
query <- 'MATCH (c:Component)-[p:Part_Of]->(m:Minor) WHERE c.name<>"req"
RETURN m.name AS Minor, c.name AS OptName'
gq <- cypher(gdb,query) %>% as.data.frame()
gq <- group_by(gq, Minor) %>% summarise(Number.Option.Groups = n_distinct(OptName))
kable(arrange(gq, Minor, desc(Number.Option.Groups)), caption = "Number of Option Groups required for each minor")
Minor | Number.Option.Groups |
---|---|
Applied Math | 3 |
Computer Science | 2 |
Computer Studies | 2 |
Geosciences | 3 |
Mathematics | 4 |
Physical Science | 2 |
Physics | 2 |
query <- 'MATCH (c:Course)-[s:Satisfies{type:"option"}]->(comp:Component)-[p:Part_Of]->(m:Minor)
RETURN m.name AS Minor, comp.name AS OptionGroup, count(c) AS NumberClasses, sum(toInt(c.minCredits)) AS CreditMin'
gq <- cypher(gdb,query) %>% as.data.frame()
kable(arrange(gq, Minor, OptionGroup, desc(NumberClasses)), caption = "Number of courses for each option group of each minor")
Minor | OptionGroup | NumberClasses | CreditMin |
---|---|---|---|
Applied Math | Calculus | 3 | 12 |
Applied Math | Data | 2 | 8 |
Applied Math | Lab | 1 | 0 |
Computer Science | Calculus | 2 | 8 |
Computer Science | IntermediateCSC | 2 | 8 |
Computer Studies | Calculus | 2 | 8 |
Computer Studies | Lab | 1 | 0 |
Geosciences | Core | 2 | 6 |
Geosciences | Math | 3 | 12 |
Geosciences | takeTwo | 5 | 16 |
Mathematics | Calculus | 2 | 8 |
Mathematics | Lab | 1 | 0 |
Mathematics | Proofs | 4 | 16 |
Mathematics | sixCredits | 15 | 60 |
Physical Science | takeTwo | 10 | 40 |
Physical Science | threeCredit | 76 | 206 |
Physics | sixCredit | 10 | 40 |
Physics | threeCredit | 27 | 108 |
Minor Visuals
Now let’s use graphs to visualize the minors in the MAPS department. I’m going to look at each minor separately for now. What follows is a function that takes the minor name, queries the neo4j database and uses visNetwork to show the resulting graph. VisNetork requires a dataframe of nodes and edges as input which we gather from separate queries - one for each type of node/edge.
library(visNetwork)
minorVis <- function(minorName){
MinorNodeQ <- paste0('MATCH (m:Minor {name:"',minorName,'"})
RETURN m.name AS id, m.name AS label, LABELS(m)[0] AS group')
ComponentNodeQ <- paste0('MATCH (c:Component)-[:Part_Of]->(m:Minor{name:"', minorName,'"})
RETURN c.name AS id, c.name AS label, LABELS(c)[0] AS group')
CourseNodeQ <- paste0('MATCH (c:Course)-[:Satisfies]->(:Component)-[:Part_Of]->(m:Minor{name:"',minorName,'"})
RETURN c.id AS id, c.id AS label, LABELS(c)[0] AS group')
nodes <- rbind.data.frame(cypher(gdb, MinorNodeQ),
cypher(gdb, CourseNodeQ))
nodes <- rbind.data.frame(nodes, cypher(gdb, ComponentNodeQ))
nodes <- nodes[!duplicated(nodes),]
edgeSatQ <- paste0('MATCH (c:Course)-[r:Satisfies]->(co:Component)-[:Part_Of]->(m:Minor {name:"',minorName,'"})
RETURN c.id AS from, co.name AS to, TYPE(r) AS label')
edgePOQ <- paste0('MATCH (c:Component)-[r:Part_Of]->(m:Minor {name:"',minorName,'"})
RETURN c.name AS from, m.name AS to, TYPE(r) AS label')
edges <- rbind.data.frame(cypher(gdb, edgeSatQ),cypher(gdb, edgePOQ))
visNetwork(nodes, edges)
}
Next we’ll call this function for each of the minors in MAPS. VisNetwork produces JavaScript graphs, so these may take some time to load. Also, you can drag nodes and edges around some (visNetwork bounces them back unless they move “enough”) as well as zoom in and out.
Several minors require an “approved lab course” which I’ve just put a single dummy course in for rather than showing over a dozen courses (most of which are outside MAPS) linked to the “Lab” component of these minors.
Applied Math
Computer Science
Computer Studies
Geosciences
Mathematics
Physics
Physical Science
Closing Thoughts
If it seems like the graph database is unnecessary overhead so far - you’re right! I could have stored everything in a node and edge dataframe, but so far we’re just showing how to pull data out and visualize it. The power of the graph DB comes in as curricula evolve and as we apply more sophisticated graph theory ideas to the data - such as finding articulation points, shortest paths, or computing node/edge metrics and using them in our analysis.
Additionally, the last two minors are “messy” with a huge cloud of courses linked to certain nodes. While this mess doesn’t pose a problem for students (finding six credits in the cloud in a given semester/year should be easy), closer inspection shows that many of the courses probably aren’t feasible options for most students due to prerequisites. This indicates lazy catalog copy - the minor should specify the only the first courses meeting the requirements in a chain of prerequisites rather than all courses. For example don’t include everything in the Chemistry department, only include the general chemistry course that is required for everything else or it will seem that students have more options than they do - or could allow cheats through the minor.
Clearly, I haven’t included any prerequisite links or used them to impose additional structure. Again, this is just the beginning of an ongoing project.