PCA Overview

This post gives a basic overview of principal component analysis (PCA) for dimensionality reduction and regression. I wanted to create it as a guide for my regression students who may find it useful for their projects. First, let's note the two main situations where you may want to use PCA: dimensionality reduction (reducing the number of variables in a dataset) and removing collinearity issues. These are not exclusive problems; often you want to solve both. However, depending on the data, PCA will always remove collinearity among the principal components, but it may not let you use fewer variables in subsequent models.

Basic Idea

Before getting into a real example, let's look at what PCA does in 2 dimensions. I'll generate some highly correlated data and compute its principal components; the data is constructed so that it is easy to predict what the components should be. The data follows \(y = 3x + 1 + \epsilon\), where \(\epsilon\) is normally distributed random error. This means that the greatest variation in the data should be along the line \(y = 3x + 1\), which should give the first principal component. The second (and final, in the 2-D case) should be along a perpendicular line \(y = -\frac{1}{3}x + b\).

library(dplyr) #pipes and df manipulation
library(ggplot2) #graphing
library(patchwork) #graph layout

x <- seq(from=1, to=10, length.out = 100)
rn <- rnorm(100, mean=0, sd=1.25) #random error
y <- 3*x+1+rn 
df <- data.frame(x, y)

#compute principal components
pcdf <- prcomp(df)

#graph data and pc's
data.graph <- ggplot(df, aes(x=x,y=y))+geom_point()+
  ggtitle("Original Data")
pc.graph <- ggplot(as.data.frame(pcdf$x), aes(x=PC1, y=PC2))+
  geom_point()+ggtitle("Data in PC-space")

data.graph+pc.graph
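
Since the scores (the data expressed in PC coordinates) are stored in pcdf$x, we can also compare the correlations directly as a quick numerical check to go along with the plots:

#correlation of the original variables vs. the principal component scores
cor(df$x, df$y)                      #close to 1
cor(pcdf$x[,"PC1"], pcdf$x[,"PC2"])  #essentially 0 (the scores are orthogonal by construction)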

Notice how the correlation between \(x\) and \(y\) vanishes when we look at the data with axes aligned along the principal components: PC1 and PC2 now give us predictors with no collinearity to use in regression. Furthermore, the sdev element of pcdf tells us how much of the standard deviation (and hence variance) is explained by each component:

pcdf$sdev^2/sum(pcdf$sdev^2)
## [1] 0.997359167 0.002640833

So PC1 accounts for almost all of the variance seen in the original data. This isn't surprising given how the data was made: it is so highly correlated that it is essentially one-dimensional, and PCA has found that. With higher-dimensional data, a scree plot is useful for seeing how much additional variance each component explains:

ggplot(data.frame(
  component = 1:length(pcdf$sdev), 
  explained.var = pcdf$sdev^2/sum(pcdf$sdev^2) #proportion of variance per component
  ), 
  aes(x=component, y=cumsum(explained.var)))+
    geom_line()+ylab("Cumulative Proportion of Variance Explained")

Now, what about the relationship between our data (\(y = 3x + 1\)) and the principal components? The rotation element of pcdf gives us a matrix of eigenvectors that tells us how to turn a point in the original \(xy\)-plane into a point in the \(PC_1PC_2\)-plane. The second row of the rotation matrix divided by the first (\(y/x\)) gives us slopes of almost \(3\) and \(-\frac{1}{3}\) (the difference is due to the random error I've added to the data). The principal components are just a new basis (in the linear algebra sense): each column is a unit vector and the columns are orthogonal to each other, so in two dimensions the slope determines the unit vector (up to sign). In higher dimensions this gets more complicated, but the columns of the rotation matrix still give the direction vectors of the principal components. If you remember multivariate calculus, you can turn a direction vector into a line in higher dimensions.
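
To see this concretely with the object computed above (the exact values will vary slightly with the simulated error):

#the columns of the rotation matrix are the unit direction vectors of the PC's
pcdf$rotation

#slope of each component in the original xy-plane: y-component over x-component
pcdf$rotation[2,]/pcdf$rotation[1,] #roughly 3 for PC1 and -1/3 for PC2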

A Real Example

I'll load the sb3tnzv data, which records the content of certain molecules in a certain species of sagebrush (this is related to a collaboration with a biochemist).

sb <- read.csv("../../static/files/sb3tnzv.csv", header=TRUE)
knitr::kable(head(sb))
(First six rows of sb: columns id, species, browsed, CYP1A.grouse.micr, CYP1A.human.micr, and the 40 molecule-content columns SB02 through SB48.)

Each SB variable tells you how much of a particular molecule is in the sample; the number in the variable name identifies the molecule. Hopefully PCA will help reduce the number of variables. Let's perform the PC computation and look at a scree plot.

sbpc <- prcomp(sb[,6:45], center = TRUE, scale. = TRUE)

ggplot(data.frame(
  component = 1:length(sbpc$sdev), 
  explained.var = sbpc$sdev^2/sum(sbpc$sdev^2) #proportion of variance per component
  ), 
  aes(x=component, y=cumsum(explained.var)))+
    geom_line()+ylab("Cumulative Proportion of Variance Explained")+
    ggtitle("Scree Plot of Sagebrush Data")

This shows that we can explain most of the variation in our data with far fewer variables than all of the SB's. It's worth noting that I've already removed all variables with zero variance (scaling a constant column to unit variance would fail) and am centering and scaling the data prior to the PCA computation; scaling is needed whenever different variables have vastly different scales.
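
To make "far fewer" concrete, we can count how many components are needed to reach some chosen threshold of total variance (the 90% used below is an arbitrary choice, not a rule):

#proportion of variance explained by each component
explained.var <- sbpc$sdev^2/sum(sbpc$sdev^2)

#number of components needed to capture 90% of the total variance
which(cumsum(explained.var) >= 0.90)[1]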

We can now go about building models using the principal components instead of the original SB variables, and we don't have to worry about collinearity. Furthermore, the components are ordered by decreasing variance explained, so we would add the PC's to models in that order (i.e., a model without PC1 but with later PC's would be strange). The SB variables lack this natural ordering, but they are more interpretable.
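
As a minimal sketch of what that could look like (the response variable and the choice of five components here are purely for illustration, not part of the actual analysis):

#scores for each sample are stored in sbpc$x; combine a few of them with a response
sb.scores <- data.frame(CYP1A.grouse.micr = sb$CYP1A.grouse.micr, sbpc$x[, 1:5])

#regress the (illustrative) response on the first five components
fit <- lm(CYP1A.grouse.micr ~ PC1 + PC2 + PC3 + PC4 + PC5, data = sb.scores)
summary(fit)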
