Created by: ferran.muinos@irbbarcelona.org
The source of this document is an R Markdown (.Rmd) file. In such a file you will find two types of content: plain text written in Markdown and chunks of R source code. The idea is that by compiling it, you can put together in a single document the narrative text, the code, and the output (results and figures) that the code produces.
Specifically, to compile this website you can run the following steps in the command line:
$ R -e "install.packages('rmarkdown', repos='http://cran.us.r-project.org')"
$ R -e "install.packages('MASS', repos='http://cran.us.r-project.org')"
$ R -e "rmarkdown::render('pca_practical.Rmd')"
$ google-chrome pca_practical.html
For more on knitr with R Markdown, see: https://kbroman.org/knitr_knutshell/pages/Rmarkdown.html
We will go through a simple data analysis scenario where linear algebra happens to play a prominent role: Principal Component Analysis, a.k.a. PCA.
Principal Component Analysis (PCA) is a dimensionality reduction technique. Given a dataset with samples and a vector of features per sample, it allows us to create new features to describe the samples in such a way that the new features are pairwise uncorrelated and are sorted by the amount of variance they capture, so that the first few already retain most of the information in the data.
Let’s first play with a small “synthetic” dataset representing night and day temperatures across 1000 weather stations. The samples here are the weather stations. There are two features, night and day, and we randomly generate the data so that there is some correlation between them.
N = 1000   # number of weather stations
mu = 15    # mean night temperature
s = 3      # standard deviation of the temperatures
night = rnorm(N, mean=mu, sd=s)
day = night + rnorm(N, mean=0, sd=s)   # day temperatures correlate with night
temps = data.frame(station=1:N,
                   night=night,
                   day=day)
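Note that rnorm draws fresh random numbers on every run, so your numbers will differ from the ones shown below. To make the dataset reproducible, you can fix the random seed before generating it; the seed value here is arbitrary:
set.seed(42)   # arbitrary seed: fixes the random draws so results are reproducible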
Print the head of the table to check how it looks:
head(temps)
##   station    night      day
## 1       1 17.73508 14.84177
## 2       2 13.87412 18.28400
## 3       3 15.92668 12.15302
## 4       4 14.92396 18.30325
## 5       5 13.47541 10.34268
## 6       6 17.30327 21.42547
Scatter plot the full dataset to appraise the shape of the data:
library(MASS)   # provides eqscplot: a scatter plot with equal scales on both axes
eqscplot(night, day, col="orange")
Let’s go further into the analysis of the data. The next thing we will do is compute the covariance matrix of the joint distribution of night and day:
m = as.matrix(temps[, 2:3])                    # data matrix with columns night, day
ones = as.matrix(rep(1, N))                    # column vector of ones
centering = diag(N) - 1/N * ones %*% t(ones)   # centering matrix
z = centering %*% m                            # apply centering to the columns of m
eqscplot(z[,1], z[,2], col="orange", xlab="centered night", ylab="centered day")
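As a sanity check, the same centering can be obtained with R’s built-in scale function:
z_alt = scale(m, center = TRUE, scale = FALSE)   # subtract the column means
max(abs(z - z_alt))                              # ~0 up to floating-point error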
\[ \Omega = \frac{1}{N} Z^t Z \]
where \(Z\) is the centered data matrix for night and day.
omega = 1/N * t(z) %*% z
omega
##          night       day
## night 8.457855  8.341328
## day   8.341328 16.912694
\[ \Omega=\frac{1}{N} Z^t Z= \left[ {\begin{array}{cc} 8.46 & 8.34 \\ 8.34 & 16.91 \\ \end{array} } \right]. \]
From this matrix we can readily compute a couple of common statistics related to the joint distribution of the features “day” and “night”: the total residual sum of squares (total RSS) and Pearson’s correlation.
Recall that the total RSS is: \[ \textrm{RSS}(X, Y) = \textrm{Var}(X) + \textrm{Var}(Y) = \Omega_{11} + \Omega_{22} = \textrm{trace}(\Omega) \]
totalRSS = omega[1,1] + omega[2,2]
totalRSS
## [1] 25.37055
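Equivalently, since the total RSS is the trace of \(\Omega\), we can compute it in one line:
sum(diag(omega))   # trace of the covariance matrix; same value as totalRSS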
Recall that Pearson’s correlation is: \[\rho(X, Y) = \frac{\textrm{cov(X, Y)}}{\sqrt{\textrm{Var}(X)\textrm{Var}(Y)}}\]
rho = omega[1,2] / sqrt(omega[1,1] * omega[2,2])
rho
## [1] 0.6974269
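As a sanity check, the built-in cor function gives the same value, because the \(1/N\) versus \(1/(N-1)\) normalization factors cancel in the ratio:
cor(temps$night, temps$day)   # built-in Pearson correlation; matches rho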
Question: can we replace the column vectors of “day” and “night” by linear combinations of them, so that the new vectors are no longer correlated, i.e., can we make the covariance matrix diagonal?
To find the solution we “diagonalize” the covariance matrix. Since the covariance matrix is symmetric, the spectral theorem guarantees that we can always find such a basis of eigenvectors.
ev = eigen(omega)
ev
## eigen() decomposition
## $values
## [1] 22.036682  3.333867
##
## $vectors
##           [,1]       [,2]
## [1,] 0.5234203 -0.8520746
## [2,] 0.8520746  0.5234203
p = ev$vectors
v1 = p[,1] # eigenvector with eigenvalue ev$values[1]
v2 = p[,2] # eigenvector with eigenvalue ev$values[2]
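As a quick numerical check, we can verify the eigenvector equation \(\Omega v_1 = \lambda_1 v_1\):
max(abs(omega %*% v1 - ev$values[1] * v1))   # ~0 up to floating-point error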
The vectors \[ v_1= \left[ {\begin{array}{c} 0.52 \\ 0.85 \\ \end{array} } \right] \;\; v_2= \left[ {\begin{array}{c} -0.85 \\ 0.52 \\ \end{array} } \right] \]
are eigenvectors of \(\Omega\) with eigenvalues \(\lambda_1 = 22.04\) and \(\lambda_2 = 3.33\), respectively.
In other words, the matrix \(P\) with \(v_1, v_2\) as column vectors,
\[ P= \left[ {\begin{array}{cc} 0.52 & -0.85 \\ 0.85 & 0.52 \\ \end{array} } \right], \]
satisfies \(D = P^{-1} \Omega P,\) where
\[ D = \left[ {\begin{array}{cc} 22.04 & 0 \\ 0 & 3.33 \\ \end{array} } \right]. \]
Notice that \(P\) is an orthogonal matrix, i.e., its column vectors are mutually orthogonal and have length \(1\). For such matrices the inverse is easy to compute: \(P^{-1} = P^t\). Then
\[ D = P^t \Omega P. \]
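Both facts are easy to verify numerically:
max(abs(t(p) %*% p - diag(2)))                     # P^t P is the identity, up to rounding
max(abs(t(p) %*% omega %*% p - diag(ev$values)))   # P^t Omega P is the diagonal matrix D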
The matrix \(P\) contains all the information we require to compute the sought linear combinations of the day and night features.
We define new data columns for our dataset as follows:
\[ \textrm{PC}_1 = 0.52 \; \textrm{centered night} + 0.85 \; \textrm{centered day} \\ \textrm{PC}_2 = -0.85 \; \textrm{centered night} + 0.52 \; \textrm{centered day}. \]
The new data matrix with columns \(\textrm{PC}_1, \textrm{PC}_2\) is precisely \(\hat{Z} = ZP\), and its covariance matrix is diagonal:
\[ \frac{1}{N} \hat{Z}^t \hat{Z} = \frac{1}{N} (ZP)^t (ZP) = \frac{1}{N} P^t(Z^tZ)P = P^t(\frac{1}{N}Z^tZ)P = P^t \Omega P = D. \]
Here, we are using the following general fact about the transpose: \((AB)^t = B^tA^t\).
new_z = z %*% ev$vectors               # change of basis: columns are PC1 and PC2
new_omega = 1/N * t(new_z) %*% new_z   # covariance matrix of the new features
new_omega
##              [,1]         [,2]
## [1,] 2.203668e+01 4.689582e-15
## [2,] 4.689582e-15 3.333867e+00
Up to floating-point error (the off-diagonal entries are of order \(10^{-15}\)), the new covariance matrix is indeed diagonal: \[ \Omega_{\textrm{PCA}}= \left[ {\begin{array}{cc} 22.04 & 0 \\ 0 & 3.33 \\ \end{array} } \right]. \]
The eigenvalues tell us how much of the total RSS each principal component captures. The proportion of variance explained by the first component is:
evalue.1 = ev$values[1]
ve.ratio.pc1 = evalue.1 / totalRSS
ve.ratio.pc1
## [1] 0.868593
Together, the two components account for all of it:
ve.ratio.1.2 = sum(ev$values) / totalRSS
ve.ratio.1.2
## [1] 1
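Equivalently, the cumulative proportions can be read off in one line:
cumsum(ev$values) / totalRSS   # cumulative proportion of variance explained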
Plot the centered data together with the directions of the eigenvectors:
eqscplot(z[,1], z[,2], col="orange", xlab="centered night", ylab="centered day")
alpha=5 # stretching factor to make arrows larger
arrows(0, 0, alpha*v1[1], alpha*v1[2], lwd=2)
text((alpha+1)*v1[1], (alpha+1)*v1[2], label="v1")
arrows(0, 0, alpha*v2[1], alpha*v2[2], lwd=2)
text((alpha+1)*v2[1], (alpha+1)*v2[2], label="v2")
Plot the data points with the new features a.k.a. principal components:
eqscplot(new_z[,1], new_z[,2], col="red", xlab="PC1", ylab="PC2")
alpha=5 # stretching factor to make arrows larger
arrows(0, 0, alpha, 0, lwd=2)
text((alpha+1), 0, label="v1")
arrows(0, 0, 0, alpha, lwd=2)
text(0, (alpha+1), label="v2")
There is no need to repeat this process each time you want to compute a PCA: R provides built-in functions that carry out all the computations necessary for a proper PCA.
temp.pca = princomp(m)   # princomp centers the data by default
Note that \(\texttt{center}\) and \(\texttt{scale.}\) are arguments of \(\texttt{prcomp}\), not of \(\texttt{princomp}\); passing them to \(\texttt{princomp}\) merely triggers a warning that they will be disregarded.
Have a look at the summary:
summary(temp.pca)
## Importance of components:
##                          Comp.1   Comp.2
## Standard deviation     4.694324 1.825888
## Proportion of Variance 0.868593 0.131407
## Cumulative Proportion  0.868593 1.000000
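Squaring the reported standard deviations recovers our eigenvalues, since \(\texttt{princomp}\) uses the \(1/N\) divisor for the covariance matrix, matching our definition of \(\Omega\):
temp.pca$sdev^2   # should match ev$values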
This is in agreement with our previous computations. Check what additional info we can retrieve from the \(\texttt{temp.pca}\) instance:
str(temp.pca)
## List of 7
## $ sdev : Named num [1:2] 4.69 1.83
## ..- attr(*, "names")= chr [1:2] "Comp.1" "Comp.2"
## $ loadings: 'loadings' num [1:2, 1:2] 0.523 0.852 0.852 -0.523
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "night" "day"
## .. ..$ : chr [1:2] "Comp.1" "Comp.2"
## $ center : Named num [1:2] 15 15
## ..- attr(*, "names")= chr [1:2] "night" "day"
## $ scale : Named num [1:2] 1 1
## ..- attr(*, "names")= chr [1:2] "night" "day"
## $ n.obs : int 1000
## $ scores : num [1:1000, 1:2] 1.28 2.20 -1.95 2.76 -4.78 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:2] "Comp.1" "Comp.2"
## $ call : language princomp(x = m)
## - attr(*, "class")= chr "princomp"
Most of the entries are intuitively clear. One important entry carries the somewhat cryptic name “loadings”:
loadings(temp.pca)[]
##          Comp.1     Comp.2
## night 0.5234203  0.8520746
## day   0.8520746 -0.5234203
Indeed, this is an alternative (orthogonal) change of coordinates that also solves the PCA problem of finding orthogonal linear combinations of the original feature columns that are uncorrelated: the second loading vector is \(-v_2\), reflecting the fact that eigenvectors are only determined up to sign.
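As a final sanity check, the scores computed by \(\texttt{princomp}\) match our hand-rolled \(\texttt{new\_z}\), up to the sign flip in the second component:
max(abs(temp.pca$scores[,1] - new_z[,1]))   # PC1 scores agree up to rounding
max(abs(temp.pca$scores[,2] + new_z[,2]))   # PC2 scores agree up to the sign flip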