This post gives a basic overview of principal components analysis (PCA) for dimensionality reduction and regression. I created it as a guide for my regression students, who may find it useful for their projects. First, note the two main reasons you may want to use PCA: dimensionality reduction (reducing the number of variables in a dataset) and removing collinearity issues. These are not exclusive problems; often you want to do both. However, depending on the data, PCA will guarantee a lack of collinearity among the principal components but may not let you use fewer variables in subsequent models.
Basic Idea
Before getting into real examples, let’s look at what PCA does in two dimensions. I’ll generate some highly correlated data in a way that makes the principal components easy to predict, then compute them. My data will be related by \(y=3x+1+\epsilon\), where \(\epsilon\) is normally distributed random error. This means that the greatest variation in my data should be along the line \(y=3x+1\), which should give the first principal component. The second (and final, in the 2-D case) will be along the perpendicular line \(y=\frac{-1}{3}x+b\).
library(dplyr) #pipes and df manipulation
library(ggplot2) #graphing
library(patchwork) #graph layout
x <- seq(from=1, to=10, length.out = 100)
rn <- rnorm(100, mean=0, sd=1.25) #random error
y <- 3*x+1+rn
df <- data.frame(x, y)
#compute principal components
pcdf <- prcomp(df)
#graph data and pc's
data.graph <- ggplot(df, aes(x=x, y=y))+geom_point()+
  ggtitle("Original Data")
pc.graph <- ggplot(as.data.frame(pcdf$x), aes(x=PC1, y=PC2))+
  geom_point()+ggtitle("Data in PC-space")
data.graph+pc.graph
Notice how the correlation between \(x\) and \(y\) vanishes when looking at the data with axes aligned along the principal components: PC1 and PC2 now provide non-collinear data for regression. Furthermore, the sdev element of pcdf gives the standard deviation of each component, so squaring it tells us how much of the variance each component explains:
pcdf$sdev^2/sum(pcdf$sdev^2)
## [1] 0.997359167 0.002640833
So PC1 accounts for almost all of the variance in the original data. This isn’t surprising given how the data was made: it is so highly correlated that it is basically one-dimensional, and PCA has found that. With higher-dimensional data, a scree plot is useful to see how additional components explain more variance:
ggplot(data.frame(
    component = 1:length(pcdf$sdev),
    explained.var.prop = pcdf$sdev^2/sum(pcdf$sdev^2)
  ),
  aes(x=component, y=cumsum(explained.var.prop)))+
  geom_line()+ylab("Cumulative Proportion of Variance Explained")
Now, what about the relationship in our data (\(y=3x+1\)) and the principal components? The rotation element of pcdf gives us a matrix of eigenvectors that tells us how to turn a point in the original \(xy\)-plane into a point in the \(PC1PC2\)-plane. The second row of the rotation matrix divided by the first (\(y/x\)) gives us slopes of almost \(3\) and \(\frac{-1}{3}\) (the difference is due to the random error I’ve added to the data). The principal components are just a new basis (in the linear algebra sense): each column is a unit vector and the columns are orthogonal to each other, so in two dimensions the slope determines a unit vector. In higher dimensions this gets more complicated, but the columns of the rotation matrix still give us the direction vectors of the principal components. If you remember multivariate calculus, you can turn a direction vector into a line in higher dimensions.
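We can check the slopes directly; a quick sketch (the exact values will vary slightly from run to run because of the random error):
#columns of the rotation matrix are the PC direction vectors
pcdf$rotation
#second row over first row gives the slope (y/x) of each component
pcdf$rotation[2,]/pcdf$rotation[1,]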
A Real Example
I’ll load data from sb3tnzv, which contains measurements of certain molecules in a species of sagebrush (this is related to a collaboration with a biochemist).
sb <- read.csv("../../static/files/sb3tnzv.csv", header=TRUE)
knitr::kable(head(sb))
id | species | browsed | CYP1A.grouse.micr | CYP1A.human.micr | SB02 | SB03 | SB05 | SB07 | SB09 | SB10 | SB11 | SB12 | SB13 | SB14 | SB15 | SB16 | SB17 | SB18 | SB19 | SB20 | SB22 | SB23 | SB24 | SB26 | SB28 | SB29 | SB30 | SB31 | SB32 | SB34 | SB36 | SB37 | SB38 | SB39 | SB40 | SB41 | SB42 | SB43 | SB44 | SB45 | SB45s | SB46 | SB47 | SB48 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
105 | 3T | NB | NA | NA | 0.0000 | 629.0214 | 565.9454 | 0 | 0.000 | 1510.473 | 618.7710 | 1389.2164 | 719.0859 | 370.6543 | 0 | 10076.76 | 0.000 | 0 | 26818.27 | 55521.95 | 7762.168 | 7988.484 | 0 | 29838.72 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0 | 0.000 | 0.00 | 0.000 | 0.000 | 0.000 | 70267.00 | 109853.00 | 0.00 | 60935.48 | 0.00 | 0.000 | 5583.644 | 4029.331 | 76519.95 |
134 | 3T | B | NA | NA | 418.1162 | 787.7897 | 649.8425 | 0 | 1825.224 | 0.000 | 749.1660 | 1629.1661 | 815.1082 | 495.5262 | 0 | 13573.21 | 1406.067 | 0 | 29028.81 | 68483.62 | 9038.913 | 9783.465 | 0 | 30501.25 | 10599.994 | 7101.127 | 8305.438 | 0.000 | 4244.354 | 0 | 2289.997 | 13235.83 | 3617.598 | 5641.736 | 5766.007 | 67198.16 | 127896.20 | 71841.00 | 66638.00 | 17905.29 | 0.000 | 4631.608 | 7263.780 | 86716.27 |
154 | 3T | NB | 137.2 | 64.3 | 403.1469 | 654.3164 | 565.2982 | 0 | 0.000 | 1808.972 | 650.8421 | 1560.4817 | 1266.7593 | 0.0000 | 0 | 12916.54 | 0.000 | 0 | 37969.62 | 66222.91 | 8011.746 | 8014.458 | 0 | 25759.11 | 6783.765 | 6627.330 | 6373.151 | 0.000 | 4767.672 | 0 | 3078.736 | 15643.33 | 4111.693 | 5938.325 | 16812.998 | 59342.97 | 93693.29 | 69981.00 | 64628.00 | 18887.17 | 0.000 | 5405.461 | 4248.908 | 67234.19 |
182 | 3T | NB | 123.5 | 62.9 | 558.8956 | 1044.3494 | 0.0000 | 0 | 2590.590 | 0.000 | 1161.7732 | 1744.6504 | 902.3607 | 0.0000 | 0 | 10598.65 | 1635.713 | 0 | 32982.45 | 67699.61 | 8178.428 | 6533.004 | 0 | 27824.57 | 12031.102 | 6346.032 | 9447.529 | 0.000 | 5135.961 | 0 | 2169.135 | 0.00 | 6862.923 | 6174.927 | 23795.510 | 14001.83 | 20861.51 | 131440.00 | 147611.00 | 20330.75 | 0.000 | 4810.313 | 8282.686 | 80970.20 |
222 | 3T | NB | 161.0 | 74.1 | 254.2615 | 431.4813 | 445.9520 | 0 | 0.000 | 1078.684 | 447.0190 | 895.7186 | 594.2125 | 0.0000 | 0 | 16864.11 | 0.000 | 0 | 23013.96 | 47683.07 | 5734.898 | 7296.023 | 0 | 12866.58 | 3783.557 | 4826.480 | 5690.679 | 0.000 | 3626.880 | 0 | 0.000 | 11667.33 | 3201.437 | 5310.534 | 8928.760 | 21563.69 | 44012.07 | 20978.52 | 34732.02 | 0.00 | 7606.205 | 0.000 | 11376.954 | 25991.20 |
238 | 3T | B | 132.4 | 72.9 | 0.0000 | 668.8263 | 0.0000 | 0 | 0.000 | 854.496 | 332.0903 | 652.6357 | 99.0683 | 0.0000 | 0 | 22382.95 | 0.000 | 0 | 25243.57 | 68834.52 | 9652.101 | 6982.520 | 0 | 42749.88 | 0.000 | 4809.935 | 6726.340 | 4758.589 | 0.000 | 0 | 2101.150 | 0.00 | 3842.604 | 4641.517 | 15705.387 | 51669.16 | 90192.88 | 63978.74 | 88519.23 | 24585.34 | 0.000 | 5925.326 | 3597.326 | 88216.69 |
Each of the SB variables tells you how much of a given molecule is in the sample; each number corresponds to a different molecule. Hopefully PCA will help reduce the number of variables. Let’s perform the PC computation and look at a scree plot.
sbpc <- prcomp(sb[,6:45], center = TRUE, scale. = TRUE)
ggplot(data.frame(
    component = 1:length(sbpc$sdev),
    explained.var.prop = sbpc$sdev^2/sum(sbpc$sdev^2)
  ),
  aes(x=component, y=cumsum(explained.var.prop)))+
  geom_line()+ylab("Cumulative Proportion of Variance Explained")+
  ggtitle("Scree Plot of Sagebrush Data")
This shows that we can explain most of the variation in our data with far fewer variables than all of the SB’s. It’s worth noting that I removed all variables with zero variance beforehand, and that I am scaling and centering the data prior to performing the PCA computation; this is needed whenever different variables have vastly different scales.
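As a quick illustration, we can count how many components are needed to reach a given share of the variance; the 90% cutoff below is an arbitrary choice of mine:
cum.var <- cumsum(sbpc$sdev^2/sum(sbpc$sdev^2))
#smallest number of components explaining at least 90% of the variance
min(which(cum.var >= 0.9))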
We can now go about building models using the principal components instead of the original SB variables, without worrying about collinearity. Furthermore, the components are ordered by decreasing variance explained, so we would build models using the PC’s in order (i.e., a model that included other PC’s but not PC1 would be strange). The SB variables lack this ordering, but are more interpretable.
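To make the last point concrete, here is a minimal sketch of a regression on the leading components. Picking CYP1A.grouse.micr as the response and using the first three components are my illustrative choices, not part of the original analysis:
#scores of each sample on the first three components
pc.scores <- as.data.frame(sbpc$x[,1:3])
pc.scores$CYP1A.grouse.micr <- sb$CYP1A.grouse.micr
#lm() drops rows where the response is missing (NA) by default
pc.model <- lm(CYP1A.grouse.micr ~ PC1 + PC2 + PC3, data = pc.scores)
summary(pc.model)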