Lesser Known Verbs: top_n

I’ve been using R since 2006. That predates RStudio and the tidyverse. I remember the struggle of keeping track of the variants of apply and often fiddling with them to get code to work.

Then came plyr and the dplyr and my life has never been the same. The major verbs of dplyr include select, filter, mutate, group_by, summarise, and arrange; and if you are doing data analysis in R then you should be fluent in them. These are far from an exhaustive list of dplyr’s verbs though, and I recently discovered one that solves a problem I’ve encountered a few times: top_n.

The Problem

Recently I was analyzing some student data on an exam that had students from multiple section and their performance on on the exam. Since students could take the course multiple times, I only wanted to include the row with the best score. Clearly, grouping by student was needed, and summarise can get the max of one column but it won’t leave the other columns unchanged (it would apply and aggreation function to them). Instead top_n saves the day.

An Example

To avoid voilating FERPA, I’ll manufacture some data:

student_id <- c(1:6, seq(1,10, by=2))
section <- sample(c("A","B","C"), length(student_id), replace=TRUE)
score <- round(rnorm(11, 25, 5), 2)
question1 <- round(runif(1,0,5),0)
question2 <- round(runif(1,0,5),0)
question3 <- round(runif(1,0,5),0)
question4 <- round(runif(1,0,5),0)
question5 <- round(runif(1,0,5),0)

dat <- data.frame(student_id, section, score, question1, question2, question3, question4, question5)

kable(dat)
student_id section score question1 question2 question3 question4 question5
1 B 28.17 0 2 2 2 2
2 B 29.44 0 2 2 2 2
3 B 21.84 0 2 2 2 2
4 B 30.19 0 2 2 2 2
5 B 35.93 0 2 2 2 2
6 C 23.20 0 2 2 2 2
1 A 27.40 0 2 2 2 2
3 B 35.43 0 2 2 2 2
5 C 30.70 0 2 2 2 2
7 A 22.07 0 2 2 2 2
9 B 23.03 0 2 2 2 2

So, we have a small dataset with repeated student_id and we want to have no repeated students and keep the row (including question data) for the maximum score of any repeated student.

Here’s the dplyr one-liner:

dat %>% group_by(student_id) %>%
  top_n(1, score) %>% kable()
student_id section score question1 question2 question3 question4 question5
1 B 28.17 0 2 2 2 2
2 B 29.44 0 2 2 2 2
4 B 30.19 0 2 2 2 2
5 B 35.93 0 2 2 2 2
6 C 23.20 0 2 2 2 2
3 B 35.43 0 2 2 2 2
7 A 22.07 0 2 2 2 2
9 B 23.03 0 2 2 2 2

Conclusion

As mentioned in top_n documentation, it’s just a wrapper for filter and min_rank, so like most (all?) of dplyr it’s possible to avoid, but once you know about it why would you?

Related