Principal Component Analysis

Created by: ferran.muinos@irbbarcelona.org

How this html document was created

The source of this document is an R markdown (.Rmd) file. In such a file you will find two types of content:

  • Markdown-formatted text;
  • chunks of R code.

The idea is that by compiling it, you can put together in a single document:

  • html-formatted, human-readable text;
  • the printouts of your R code, including the plotting figures.

Specifically, to compile this document you can run the following steps from the command line:

$ R -e "install.packages('rmarkdown', repos='http://cran.us.r-project.org')"
$ R -e "install.packages('MASS', repos='http://cran.us.r-project.org')"
$ R -e "rmarkdown::render('pca_practical.Rmd')"
$ google-chrome pca_practical.html

For more on “Knitr with R Markdown”, see: https://kbroman.org/knitr_knutshell/pages/Rmarkdown.html

Goal

We will go through a simple data analysis scenario where linear algebra happens to play a prominent role: Principal Component Analysis, a.k.a. PCA.

Principal Component Analysis in a Nutshell

Principal Component Analysis (PCA) is a dimensionality reduction technique. Given a dataset with samples and a vector of features per sample, it allows us to create new features to describe the samples in such a way that:

  • the new features are linear combinations of the old ones;
  • the new features are not correlated with each other (see the quick sketch right after this list).
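As a quick preview, here is a minimal sketch using R’s built-in prcomp on a small made-up two-feature matrix (the variables x1 and x2 are ours, purely for illustration); the new features come out uncorrelated, which is exactly what we will reconstruct by hand in the rest of this document:

x1 = rnorm(100)
x2 = x1 + rnorm(100)           # two correlated features
pca = prcomp(cbind(x1, x2))    # new features = linear combinations of x1 and x2
round(cor(pca$x), 10)          # off-diagonal entries should be (numerically) zero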

The Data

Let’s first play with a small “synthetic” dataset representing night and day temperatures across 1000 weather stations. The samples here are the weather stations and there are two features, night and day. We randomly generate the data so that there is some correlation between the two.

N = 1000  # number of weather stations
mu = 15   # mean night temperature
s = 3     # standard deviation
night = rnorm(N, mean=mu, sd=s)
day = night + rnorm(N, mean=0, sd=s)  # day temperatures correlated with night
temps = data.frame(station=1:N,
                   night=night,
                   day=day)
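The data are random, so your numbers will differ from the printouts below; if you want a reproducible run, you could fix the random seed before the chunk above (a minimal sketch, the seed value being arbitrary):

set.seed(42)  # makes the rnorm() draws reproducible across runs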

Print the head of the table to check how it looks:

head(temps)
##   station    night      day
## 1       1 17.73508 14.84177
## 2       2 13.87412 18.28400
## 3       3 15.92668 12.15302
## 4       4 14.92396 18.30325
## 5       5 13.47541 10.34268
## 6       6 17.30327 21.42547

Scatter plot the full dataset to appraise the shape of the data:

library(MASS)  # provides eqscplot(), a scatter plot with equal scales on both axes
eqscplot(night, day, col="orange")

Covariance Matrix

Let’s go further into the analysis of the data. The next thing we will do is compute the covariance matrix of the joint distribution of night and day:

Matrix with feature columns

m = as.matrix(temps[, 2:3])

Centering matrix

ones = as.matrix(rep(1, N))                   # N x 1 column vector of ones
centering = diag(N) - 1/N * ones %*% t(ones)  # N x N centering matrix

Matrix with centered feature columns

z = centering %*% m  # apply centering to the columns of m
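Building the full \(N \times N\) centering matrix is instructive, but memory-hungry for large \(N\). As a sketch, R’s built-in scale() should produce the same centered matrix up to floating-point error:

z_alt = scale(m, center = TRUE, scale = FALSE)  # subtract the column means
max(abs(z - z_alt))                             # should be ~0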

Scatter plot the centered data

eqscplot(z[,1], z[,2], col="orange", xlab="centered night", ylab="centered day")

Compute the covariance matrix

\[ \Omega = \frac{1}{N} Z^t Z \]

where \(Z\) is the centered data matrix for night and day.

omega = 1/N * t(z) %*% z
omega
##          night       day
## night 8.457855  8.341328
## day   8.341328 16.912694

\[ \Omega=\frac{1}{N} Z^t Z= \left[ {\begin{array}{cc} 8.46 & 8.34 \\ 8.34 & 16.91 \\ \end{array} } \right]. \]
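Note that R’s built-in cov() divides by \(N-1\) rather than \(N\), so as a quick sketch it should match \(\Omega\) after rescaling:

omega_check = cov(z) * (N - 1) / N  # rescale the unbiased estimate to the 1/N convention
omega_check                         # should match omega above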

Side note

From this matrix we can readily compute a couple of common statistics related to the joint distribution of the features “day” and “night”: the total residual sum of squares (total RSS) and Pearson’s correlation.

Recall that the total RSS is: \[ \textrm{RSS}(X, Y) = \textrm{Var}(X) + \textrm{Var}(Y) = \Omega_{11} + \Omega_{22} = \textrm{trace}(\Omega) \]

totalRSS = omega[1,1] + omega[2,2]
totalRSS
## [1] 25.37055

Recall that Pearson’s correlation is: \[\rho(X, Y) = \frac{\textrm{cov}(X, Y)}{\sqrt{\textrm{Var}(X)\,\textrm{Var}(Y)}}\]

rho = omega[1,2] / sqrt(omega[1,1] * omega[2,2])
rho
## [1] 0.6974269
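As a sanity check, the built-in cor() should return the same value:

cor(temps$night, temps$day)  # should match rho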

Eigenvectors of the covariance matrix

Question: can we replace the column vectors of “day” and “night” by linear combinations of them, so that the new vectors are no longer correlated, i.e., can we make the covariance matrix diagonal?

To find the solution we “diagonalize” the covariance matrix, i.e., we look for a basis of eigenvectors. Since the covariance matrix is symmetric, such a basis (in fact an orthonormal one) always exists.

ev = eigen(omega)
ev
## eigen() decomposition
## $values
## [1] 22.036682  3.333867
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.5234203 -0.8520746
## [2,] 0.8520746  0.5234203
p = ev$vectors
v1 = p[,1]  # eigenvector with eigenvalue ev$values[1]
v2 = p[,2]  # eigenvector with eigenvalue ev$values[2]

The vectors \[ v_1= \left[ {\begin{array}{c} 0.52 \\ 0.85 \\ \end{array} } \right] \;\; v_2= \left[ {\begin{array}{c} -0.85 \\ 0.52 \\ \end{array} } \right] \]

are eigenvectors of \(\Omega\) with eigenvalues \(\lambda_1 = 22.04\) and \(\lambda_2 = 3.33\), respectively.

In other words, the matrix \(P\) with \(v_1, v_2\) as column vectors,

\[ P= \left[ {\begin{array}{cc} 0.52 & -0.85 \\ 0.85 & 0.52 \\ \end{array} } \right], \]

satisfies \(D = P^{-1} \Omega P,\) where

\[ D = \left[ {\begin{array}{cc} 22.04 & 0 \\ 0 & 3.33 \\ \end{array} } \right]. \]

The inverse of \(P\) is easy to compute

Notice that \(P\) is an orthogonal matrix, i.e., its column vectors are orthogonal to each other and have length \(1\). For such matrices the inverse is easy to compute: \(P^{-1} = P^t\). Then

\[ D = P^t \Omega P. \]
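A quick sketch to verify both facts numerically (rounding to suppress floating-point noise):

round(t(p) %*% p, 10)            # should be the 2 x 2 identity matrix
round(t(p) %*% omega %*% p, 10)  # should be diagonal, with the eigenvalues on the diagonal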

From Eigenvectors to Principal Components

The matrix \(P\) contains all the information we need to compute the new linear combinations of the night and day features.

The principal components

We define new data columns for our dataset as follows:

\[ \textrm{PC}_1 = 0.52 \; \textrm{centered night} + 0.85 \; \textrm{centered day} \\ \textrm{PC}_2 = -0.85 \; \textrm{centered night} + 0.52 \; \textrm{centered day}. \]

The new data matrix with columns \(\textrm{PC}_1, \textrm{PC}_2\) is precisely \(\hat{Z} = ZP\).

The covariance matrix of the principal components is diagonal

\[ \frac{1}{N} \hat{Z}^t \hat{Z} = \frac{1}{N} (ZP)^t (ZP) = \frac{1}{N} P^t(Z^tZ)P = P^t(\frac{1}{N}Z^tZ)P = P^t \Omega P = D. \]

Here, we are using the following general fact about the transpose: \((AB)^t = B^tA^t\).

Verify the diagonalization formula

new_z = z %*% ev$vectors
new_omega = 1/N * t(new_z) %*% new_z
new_omega
##              [,1]         [,2]
## [1,] 2.203668e+01 4.689582e-15
## [2,] 4.689582e-15 3.333867e+00

\[ \Omega_{\textrm{PCA}}= \left[ {\begin{array}{cc} 22.04 & 0 \\ 0 & 3.33 \\ \end{array} } \right]. \]

Proportion of Variance Explained

The proportion of the total variance explained by each principal component is its eigenvalue divided by the total variance, i.e., \(\lambda_i / \textrm{trace}(\Omega)\):

evalue.1 = ev$values[1]
ve.ratio.pc1 = evalue.1 / totalRSS  # fraction of total variance explained by PC1
ve.ratio.pc1
## [1] 0.868593
ve.ratio.1.2 = sum(ev$values) / totalRSS  # PC1 and PC2 together explain everything
ve.ratio.1.2
## [1] 1

Principal Components Graphically

eqscplot(z[,1], z[,2], col="orange", xlab="centered night", ylab="centered day")
alpha=5  # stretching factor to make arrows larger
arrows(0, 0, alpha*v1[1], alpha*v1[2], lwd=2)
text((alpha+1)*v1[1], (alpha+1)*v1[2], label="v1")
arrows(0, 0, alpha*v2[1], alpha*v2[2], lwd=2)
text((alpha+1)*v2[1], (alpha+1)*v2[2], label="v2")

Plot the data points with the new features a.k.a. principal components:

eqscplot(new_z[,1], new_z[,2], col="red", xlab="PC1", ylab="PC2")
alpha=5  # stretching factor to make arrows larger
arrows(0, 0, alpha, 0, lwd=2)
text((alpha+1), 0, label="v1")
arrows(0, 0, 0, alpha, lwd=2)
text(0, (alpha+1), label="v2")

Epilogue: Idiomatic PCA

There is no need to repeat this process each time you want to compute a PCA: R provides built-in functions that bundle all the computations needed to carry out a proper PCA.

temp.pca = princomp(m)

Have a look at the summary:

summary(temp.pca)
## Importance of components:
##                          Comp.1   Comp.2
## Standard deviation     4.694324 1.825888
## Proportion of Variance 0.868593 0.131407
## Cumulative Proportion  0.868593 1.000000
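The same proportions can be recovered directly from the standard deviations stored in the fitted object (a quick sketch):

temp.pca$sdev^2 / sum(temp.pca$sdev^2)  # proportion of variance per component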

This is in agreement with our previous computations. Check what additional info we can retrieve from the \(\texttt{temp.pca}\) object:

str(temp.pca)
## List of 7
##  $ sdev    : Named num [1:2] 4.69 1.83
##   ..- attr(*, "names")= chr [1:2] "Comp.1" "Comp.2"
##  $ loadings: 'loadings' num [1:2, 1:2] 0.523 0.852 0.852 -0.523
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "night" "day"
##   .. ..$ : chr [1:2] "Comp.1" "Comp.2"
##  $ center  : Named num [1:2] 15 15
##   ..- attr(*, "names")= chr [1:2] "night" "day"
##  $ scale   : Named num [1:2] 1 1
##   ..- attr(*, "names")= chr [1:2] "night" "day"
##  $ n.obs   : int 1000
##  $ scores  : num [1:1000, 1:2] 1.28 2.20 -1.95 2.76 -4.78 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:2] "Comp.1" "Comp.2"
##  $ call    : language princomp(x = m)
##  - attr(*, "class")= chr "princomp"

Most of the entries are intuitively clear. Notice an important one with the cryptic name “loadings”:

loadings(temp.pca)[]
##          Comp.1     Comp.2
## night 0.5234203  0.8520746
## day   0.8520746 -0.5234203

Indeed, these are the coefficients of the change of coordinates: an alternative orthogonal basis (here the second column is our \(v_2\) up to a sign flip) that also solves the PCA problem of finding linear combinations of the original feature columns that are uncorrelated.
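For completeness, here is a minimal sketch with prcomp, the other standard built-in (it accepts center/scale. arguments and divides by \(N-1\) instead of \(N\)); it should recover essentially the same directions, up to sign:

temp.pca2 = prcomp(m, center = TRUE, scale. = FALSE)
temp.pca2$rotation  # same loadings as above, up to sign
summary(temp.pca2)  # proportions of variance should agree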