Maps Minors in Neo4J

A college curriculum seems like something that is a natural fit for a graph database. My last post collected data from the College of Idaho’s online catalog, using that and some information about majors and minors I’ve populated a graph database in Neo4j. In this post I’ll show how to do some basic queries that return tabular data as well as graph data using .

Graph DB Basics

For those who haven’t had much discrete math or computer science, a graph is a collection of nodes (aka vertices) and edges that connect nodes. A graph database is is a form of NoSQL databases (broadly defined) that stores data as a graph where nodes and edges can have various properties attached to them. Graph databases like Neo4j excel at working with highly connected data since relationships (edges) are put on an equal footing with objects (nodes).

To model a college curriculum as a graph, any “things” such as courses, minors, majors, or groups of classes are stored as nodes of different types and different types of nodes have different properties. For example course nodes with have subject, number, url (for catalog description), description, credits, etc.. The relationships between the nodes are the edges such as prerequisites (links between courses), “Satisfies” (courses can satisfy a requirement of a minor/major), and “Part_Of” (groups of courses are a part of a minor/major). The introduction of what I refer to as “Component” nodes (components of majors or minors) are part of the graph model to split up the large number of courses that may be options for a minor or major. This is done to improve visualization as well as database performance.

Connect to the DB

A neo4j database needs to be running locally with data in it. We can connect to it in R via:

library(RNeo4j)

#data is already loaded into database
gdb <- startGraph("http://localhost:7474/db/data", user="neo4j", 
                  password = "maps") #only the password is DB specific here

Note the last comment. The default user is neo4j and the port and path are all defaults. Only the password is mine, and since this is a local DB with no sensitive info, I’m not concerned about revealing the password in a blog post, or trying to pick a secure password.

Basic Counting

We’ll start with some basic counting of courses and minor structure. For each minor, we list the number of courses required by that minor. This count only includes specific courses required such as MAT-175, if the minor requires either MAT-175 or MAT-275 that is an option group (or component) counted later.

query <- 'MATCH (c:Course)-[:Satisfies{type:"required"}]->(mc:Component{name:"req"})-[:Part_Of]->(m:Minor) 
RETURN m.name AS Minor, count(c) AS NumberClasses, sum(toInt(c.minCredits)) AS CreditMin'
gq <- cypher(gdb,query) %>% as.data.frame()

kable(arrange(gq, Minor, desc(NumberClasses)), caption = "Number of courses required for each minor")
Table 1: Number of courses required for each minor
Minor NumberClasses CreditMin
Applied Math 1 4
Computer Science 4 16
Computer Studies 2 8
Geosciences 1 3
Physical Science 1 4
Physics 4 16

Next we look at the number of “option groups” for each minor. These are the number of groups of courses from which a selection of credits or courses must be completed. We then look at the number of courses and minimum credits available for each option group in each minor.

query <- 'MATCH (c:Component)-[p:Part_Of]->(m:Minor) WHERE c.name<>"req"
RETURN m.name AS Minor, c.name AS OptName'
gq <- cypher(gdb,query) %>% as.data.frame()

gq <- group_by(gq, Minor) %>% summarise(Number.Option.Groups = n_distinct(OptName))
kable(arrange(gq, Minor, desc(Number.Option.Groups)), caption = "Number of Option Groups required for each minor")
Table 2: Number of Option Groups required for each minor
Minor Number.Option.Groups
Applied Math 3
Computer Science 2
Computer Studies 2
Geosciences 3
Mathematics 4
Physical Science 2
Physics 2
query <- 'MATCH (c:Course)-[s:Satisfies{type:"option"}]->(comp:Component)-[p:Part_Of]->(m:Minor) 
RETURN m.name AS Minor, comp.name AS OptionGroup, count(c) AS NumberClasses, sum(toInt(c.minCredits)) AS CreditMin'

gq <- cypher(gdb,query) %>% as.data.frame()

kable(arrange(gq, Minor, OptionGroup, desc(NumberClasses)), caption = "Number of courses for each option group of each minor")
Table 3: Number of courses for each option group of each minor
Minor OptionGroup NumberClasses CreditMin
Applied Math Calculus 3 12
Applied Math Data 2 8
Applied Math Lab 1 0
Computer Science Calculus 2 8
Computer Science IntermediateCSC 2 8
Computer Studies Calculus 2 8
Computer Studies Lab 1 0
Geosciences Core 2 6
Geosciences Math 3 12
Geosciences takeTwo 5 16
Mathematics Calculus 2 8
Mathematics Lab 1 0
Mathematics Proofs 4 16
Mathematics sixCredits 15 60
Physical Science takeTwo 10 40
Physical Science threeCredit 76 206
Physics sixCredit 10 40
Physics threeCredit 27 108

Minor Visuals

Now let’s use graphs to visualize the minors in the MAPS department. I’m going to look at each minor separately for now. What follows is a function that takes the minor name, queries the neo4j database and uses visNetwork to show the resulting graph. VisNetork requires a dataframe of nodes and edges as input which we gather from separate queries - one for each type of node/edge.

library(visNetwork)
minorVis <- function(minorName){
  
  MinorNodeQ <- paste0('MATCH (m:Minor {name:"',minorName,'"}) 
                       RETURN m.name AS id, m.name AS label, LABELS(m)[0] AS group')

  ComponentNodeQ <- paste0('MATCH (c:Component)-[:Part_Of]->(m:Minor{name:"', minorName,'"}) 
                           RETURN c.name AS id, c.name AS label, LABELS(c)[0] AS group')

  CourseNodeQ <- paste0('MATCH (c:Course)-[:Satisfies]->(:Component)-[:Part_Of]->(m:Minor{name:"',minorName,'"}) 
                        RETURN c.id AS id, c.id AS label, LABELS(c)[0] AS group')

  nodes <- rbind.data.frame(cypher(gdb, MinorNodeQ), 
                            cypher(gdb, CourseNodeQ))
  nodes <- rbind.data.frame(nodes, cypher(gdb, ComponentNodeQ))
  nodes <- nodes[!duplicated(nodes),]

  edgeSatQ <- paste0('MATCH (c:Course)-[r:Satisfies]->(co:Component)-[:Part_Of]->(m:Minor {name:"',minorName,'"}) 
                     RETURN c.id AS from, co.name AS to, TYPE(r) AS label')

  edgePOQ <- paste0('MATCH (c:Component)-[r:Part_Of]->(m:Minor {name:"',minorName,'"}) 
                    RETURN c.name AS from, m.name AS to, TYPE(r) AS label')

  edges <- rbind.data.frame(cypher(gdb, edgeSatQ),cypher(gdb, edgePOQ))

  visNetwork(nodes, edges)
}

Next we’ll call this function for each of the minors in MAPS. VisNetwork produces JavaScript graphs, so these may take some time to load. Also, you can drag nodes and edges around some (visNetwork bounces them back unless they move “enough”) as well as zoom in and out.

Several minors require an “approved lab course” which I’ve just put a single dummy course in for rather than showing over a dozen courses (most of which are outside MAPS) linked to the “Lab” component of these minors.

Applied Math

Computer Science

Computer Studies

Geosciences

Mathematics

Physics

Physical Science

Closing Thoughts

If it seems like the graph database is unnecessary overhead so far - you’re right! I could have stored everything in a node and edge dataframe, but so far we’re just showing how to pull data out and visualize it. The power of the graph DB comes in as curricula evolve and as we apply more sophisticated graph theory ideas to the data - such as finding articulation points, shortest paths, or computing node/edge metrics and using them in our analysis.

Additionally, the last two minors are “messy” with a huge cloud of courses linked to certain nodes. While this mess doesn’t pose a problem for students (finding six credits in the cloud in a given semester/year should be easy), closer inspection shows that many of the courses probably aren’t feasible options for most students due to prerequisites. This indicates lazy catalog copy - the minor should specify the only the first courses meeting the requirements in a chain of prerequisites rather than all courses. For example don’t include everything in the Chemistry department, only include the general chemistry course that is required for everything else or it will seem that students have more options than they do - or could allow cheats through the minor.

Clearly, I haven’t included any prerequisite links or used them to impose additional structure. Again, this is just the beginning of an ongoing project.

Related