Reticulated Mixture Models

Nov 10, 2018

Clearly, there’s no such thing as a “reticulated mixture model” but if you create one I’ll gladly take credit for the name. Instead this post is a demonstration of using mixture models for clustering and the interplay of R and Python via RStudio’s reticulate package.

Mixture Model Basics

The idea behind mixture models is that you have data containing information from two (or more) subgroups and you want to uncover structure of the subgroups. A classic example is you have a bunch of people’s height data and you would like to figure out which are likely to be from men and which are from women. If the data set is labeled with gender the problem is trivial, but if it’s not then it seems reasonable to think we’re looking at data sampled from 2 different normal distributions and we would like to use our data to get an idea what those distributions are. Of course, there’s no reason why we need to limit to only 2 groups or normal distributions, but we will here so we don’t overcomplicate the process.

Our Data

So show the mixture model process, I’m going to manufacture some data out of two bi-variate normal distributions, and I them to have different covariance matrices.

library(mvtnorm) #gets rmvnorm function

#function to make random covariance matrices
randCov <- function(n=2, k=1){
  mat <- matrix(runif(n^2)*k, ncol=n)
  return(t(mat)%*%mat) #make mat symmetric and return
}

cv1 <- randCov(2,2.5)
cv2 <- randCov(2,1.25)
A <- rmvnorm(100, mean=c(20,75), sigma = cv1)
B <- rmvnorm(100, mean=c(18,69), sigma = cv2)
df <- rbind.data.frame(as.data.frame(A), as.data.frame(B))
df$V3 <- c(rep("A",100),rep("B",100))

Here I’ve labeled the data so we can check how our mixture model performed. Let’s look at our data with and with-out using the labels:

library(ggplot2)
library(patchwork)
gNoLab <- ggplot(df, aes(x=V1, y=V2))+geom_point()+ggtitle("No Labels")
gLab <- ggplot(df, aes(x=V1, y=V2, col=V3))+geom_point()+ggtitle("True Labels")

gNoLab+gLab

Now our goal will be to recover the labels is we start with the data in the left graph.

Passing Data to Python

R has the functionality to build a gaussian mixture model, but I’ve been working with python some and want to use reticulate’s ability to pass data and results between R and python. First, let’s get R ready:

library(reticulate)
use_python("/usr/bin/python") # I'm using python 3.7.1 in Arch linux

Now in a python code chunk, we can access R objects.

import numpy as np
import pandas as pd
print(r.df.head())

##           V1         V2 V3
## 0  22.593247  75.423297  A
## 1  20.012685  74.874178  A
## 2  20.074491  75.357672  A
## 3  20.541347  75.030955  A
## 4  24.746648  76.296463  A

Mixture Model in Python

Now that we can get our data from R into python, we’ll use SciKit Learn to build a Gaussian Mixture model. We’ll need to give two parameters, the number of components we think the mixture has and a parameter about how the covariances may vary. We also have to copy the dataframe from R into a pandas dataframe so we can add a new column.

from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2, covariance_type='full')
pydf = r.df
pydf['gml']=gmm.fit_predict(pydf[['V1','V2']])
print(pydf.head())

##           V1         V2 V3  gml
## 0  22.593247  75.423297  A    1
## 1  20.012685  74.874178  A    1
## 2  20.074491  75.357672  A    1
## 3  20.541347  75.030955  A    1
## 4  24.746648  76.296463  A    1

Check Results

We can take advantage of ggplot2 to visualize the mixture model labels now. I’ll reproduce the graph above, but now the left side will be colored by the labels from the mixture model while the right is still colored with the true labels.

py$pydf$gml <- ifelse(py$pydf$gml==0, "A","B")
gMMLab <- ggplot(py$pydf, aes(x=V1, y=V2, col=gml))+geom_point()+ggtitle("Labeled by GMM")

gMMLab+gLab

That looks pretty successful! Obviously, the more mixed the data is, the hard it is for the mixture model to correctly identify the boundary. Also, if we have the wrong number of mixture components, the model labels will muddle the components. Sklearn provides a BayesianGaussianMixture that can identify less than the provided number of components. Perhaps that can be a post in the near future.