# Dimensionality reduction with PCA and SVD

### Intuition and mathematics behind PCA¶

Principal Component Analysis (PCA) is a commonly used method for dimensionality reduction. It is closely related to Singular Value Decomposition (SVD). The aim of this post is to give an intuition on how PCA works, go through the linear algebra behind it, and to illustrate some key properties of the transform.

Data sets often have many (hundreds or even thousands) features or dimensions, typically with many redundancies (i.e.: correlating variables). PCA is a technique based on linear algebra that aims to represent the data as compactly as possible, by introducing a "change of basis". The following text was largely based on this article by Jonathan Shlens. I will use a somewhat different notation/convention though:

• A data matrix has a shape nxD (or nxR), with n the number of observations, D (R) the number of (reduced) dimensions
• The transformed data matrix is called T

Furthermore, note that an unbiased estimator for the variance has $n-1$ in the denominator (in short; because we don't use the true mean but the empirical mean, we underestimate the variance by $(n-1)/n$. I will nevertheless use $n$ to keep the notation short).

The main concepts behind PCA are:

• Significance of the covariance matrix

When our data is centered (in our $nxD$ convention, meaning that the columns have zero mean), we can calculate the covariance matrix $C_x \equiv 1/n \cdot X^T X$. Large off-diagonal values in $C_x$ signal redundancy. Think for instance of a two-dimensional, diagonal scatter around the line $x=y$: the covariance matrix will have off-diagonal components nearly as large as diagonal components. By rotating our basis by 45°, the off-diagonal components will disappear and we can describe our data quite well with only one component (you have surely seen sketches of this).

This illustrates the main aim of PCA: to transform our data such that the covariance matrix of our transformed data is diagonal. Finally, we want our entries on the diagonal to be ordered from large to smaller, such that we can cut off our data (reduce $D$ to $R$) with as little loss of signal as possible.

• Covariance matrix of transformed data

Introducing the transformation $T = X \cdot P$, the covariance matrix of our transformed data $T$ is:

$\begin{eqnarray} C_T & \equiv & 1/n \cdot (T^T T) \\ &=& 1/n \cdot (X P)^T (XP) \\ &=& 1/n \cdot P^T X^T X P \\ &=& P^T C_X P \end{eqnarray}$

Which is a simple manipulation of our original covariance matrix $C_X$.

• Diagonalizing the covariance matrix

A covariance matrix is symmetric: the covariance operator $1/n <x, y>$ does not care about the order of $x$ and $y$. As such we can flip the row and column indices. Any symmetric matrix is diagonalized by a matrix consisting of its eigenvalues, and these eigenvalues are by necessity orthogonal. Because the proof that demonstrates orthogonality is nice and simple, I included it below. The diagonalization of a symmetric matrix $A$ is as follows:

$A = E D E^T$, with $D$ diagonal, and $E$ containing the orthonormal (orthogonal, and unit length) eigenvectors, column-wise. Note that, because $E$ consists of orthonormal vectors, $E E^T = I$, so $E^{-1} = E^T$.

Now the question is: what transformation matrix $P$ will result in $C_T$ being a diagonal matrix? The solution is: choose $P = E$, so the columns in $P$ are also the eigenvectors of $X^TX$:

$\begin{eqnarray} C_T &=& P^T C_X P \\ & =& P^T E D E^T P \\ & =& E^T E D E^T E \\& =& D \end{eqnarray}$

So, concluding:

• We center our data set (column means are zero), to obtain $X$
• We compute the covariance matrix $C_x$, its eigenvectors ${\mathbf{u_1}, \mathbf{u_2}, ... \mathbf{u_D}}$ corresponding to the descendingly ordered eigenvalues $\lambda_1, \lambda_2, ..\lambda_D$.
• We multiply our centered data matrix $X$ ($nxD$) by a $DxD$ matrix $P$: $[\mathbf{u_1}, \mathbf{u_2}, ... \mathbf{u_D}]$ to get the measurements in our new space $T$.
• The $DxD$ covariance matrix of our data matrix $T$ is diagonal, containing $\lambda_1, \lambda_2, ..\lambda_D$

So far, it seems we haven't gained much. However, now we can reduce the dimensionality of our data as much as we like, throwing away as little information as (linearly) possible, by not considering all columns of $T$. Now, we are compressing the data in a lossy fashion. Because of the ordering of the principal components according to the eigenvalues, we keep as much information as possible by discarding the columns starting from the right. What does it mean for our original data? Instead of being exact, and writing:

$X = T \ \ P^T$, we can approximate:

$X \approx T_r \ \ P_r^T$ , with $T_r$ $(nxR)$, $P_r$ $(RxR)$, and $R < D$.

A concrete example to illustrate this: a hypothetical data set which could be the result of 10 anonymous people rating 6 movies.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
ratings = pd.DataFrame(columns=['The Matrix', 'Star Wars', 'Monsters Inc.', 'Finding Nemo', 'Wall-E', 'FastFurious 8'])
ratings.loc['A', :] = [9, 8, 4, 5, 7, 2]
ratings.loc['B', :] = [3, 2, 8, 8, 6, 1]
ratings.loc['C', :] = [2, 3, 8, 9, 5, 3]
ratings.loc['D', :] = [8, 10, 3, 3, 6, 2]
ratings.loc['E', :] = [9, 7, 2, 1, 5, 2]
ratings.loc['F', :] = [2, 2, 10, 10, 6, 3]
ratings.loc['G', :] = [2, 1, 9, 10, 5, 2]
ratings.loc['H', :] = [7, 9, 1, 1, 5, 2]
ratings.loc['I', :] = [2, 3, 2, 4, 3, 9]
ratings.loc['J', :] = [3, 2, 3, 2, 2, 10]
# show it
ratings

Out[2]:
The Matrix Star Wars Monsters Inc. Finding Nemo Wall-E FastFurious 8
A 9.0 8.0 4.0 5.0 7.0 2.0
B 3.0 2.0 8.0 8.0 6.0 1.0
C 2.0 3.0 8.0 9.0 5.0 3.0
D 8.0 10.0 3.0 3.0 6.0 2.0
E 9.0 7.0 2.0 1.0 5.0 2.0
F 2.0 2.0 10.0 10.0 6.0 3.0
G 2.0 1.0 9.0 10.0 5.0 2.0
H 7.0 9.0 1.0 1.0 5.0 2.0
I 2.0 3.0 2.0 4.0 3.0 9.0
J 3.0 2.0 3.0 2.0 2.0 10.0
In [3]:
sns.set(style="white")
def show_correlationmatrix_df(df, title):
cov_df = df.cov()
plt.title(title, fontsize=16)
plt.show()
show_correlationmatrix_df(ratings, title='Covariance of ratings along movies')


There are several strong off-diagonal covariances: if you like The Matrix, you probably also like Star Wars (sci-fi/adventure). And if you like Monsters Inc., you probably also like Finding Nemo (animation). Wall-E is a bit of both. Fast and the Furious is a different category altogether, judging by the ratings.

Let's continue with numpy, and do PCA by eigenvalue decomposition of the covariance matrix.

In [4]:
# Obtain a numpy array from the DataFrame
X = ratings.values
def pca_by_eigdecomp(X, ddof=1):
"""
Does PCA on the data X (n observations x D features) by eigenvalue decomposition
of the correlation matrix of X.
Returns the transformed data T, the principal components P,
the eigenvalues, and the mean X_mean
ddof = 1 results in an unbiased estimator for the covariance
"""
# Subtract the mean of the original data (and return it for later use)
X_mean = X.mean(axis=0)
X_centered = X - X_mean[np.newaxis, :]
X_c = 1/(X.shape[0] - ddof) * np.dot(np.transpose(X_centered), X_centered)
# Get the eigenvalues (lambda) and eigenvectors of X_c
lambdas, P = np.linalg.eig(X_c)
# Note that the eigenvalues are not necessarily sorted. Sort them, and the corresponding ev's
idx_sort = np.argsort(-lambdas) # minus to accomplish descending order
lambdas = lambdas[idx_sort]
P = P[:, idx_sort]

# Use P to transform X
T = np.dot(X_centered, P)
return T, P, lambdas, X_mean

T, P, lambdas, X_mean = pca_by_eigdecomp(X)

In [5]:
# Check that P D P^T indeed yields X_c
np.set_printoptions(precision=3, suppress=True)
print(np.dot(np.dot(P, np.diag(lambdas)), np.transpose(P)) - np.cov(np.transpose(X)))

[[ 0.  0.  0.  0.  0. -0.]
[ 0. -0.  0.  0.  0. -0.]
[ 0.  0.  0.  0.  0. -0.]
[ 0.  0.  0.  0.  0. -0.]
[ 0.  0.  0.  0.  0. -0.]
[-0. -0. -0. -0. -0.  0.]]

In [6]:
# Verify that the covariance matrix of the transformed data is diagonal
print(np.cov(np.transpose(T))) # NB: np takes the covariance along the columns. Therefore, transpose
show_correlationmatrix_df(pd.DataFrame(T), title= 'Covariance of transformed data')

[[ 37.514  -0.     -0.      0.      0.     -0.   ]
[ -0.     18.192  -0.     -0.      0.     -0.   ]
[ -0.     -0.      1.273  -0.     -0.     -0.   ]
[  0.     -0.     -0.      0.926   0.      0.   ]
[  0.      0.     -0.      0.      0.296  -0.   ]
[ -0.     -0.     -0.      0.     -0.      0.098]]

In [7]:
# Verify the reconstruction error
def reconstruct_X(T, P, X_mean, n_components=None):
"""
Reconstructs X from the scores T, the principal components P, the mean
value of X X_mean, taking into account n_components
When n_components is None, the full reconstruction is done
"""
if not n_components:
return X_mean + np.dot(T, np.transpose(P))
else:
return X_mean + np.dot(T[:, :n_components], np.transpose(P[:, :n_components]))

In [8]:
# Calculate the reconstruction error as the variance of the difference between the actual and reconstructed X,
# for all possible number of pca components
rec_error_n_components = [np.var(X - reconstruct_X(T, P, X_mean, i), ddof=1) for i in range(1, 7)]

In [9]:
plt.plot(range(1, 7), rec_error_n_components/np.var(X), label='variance of the error')
plt.plot(range(1, 7), 1- (np.cumsum(lambdas) / np.sum(lambdas)), 's', label='1 - normalized cum.sum of lambda')
plt.xlabel('Number of components', fontsize=14)
plt.ylabel('Normalized reconstruction error', fontsize=14)
plt.legend(fontsize=14)
plt.show()


We have ascertained that:

• We can exactly reconstruct X in our transformed space P using the new feature matrix T
• We can approximate X by reducing the number of columns of T
• The reconstruction error (variance of the difference) quickly tends to zero, in a convex fashion
• The reconstruction error can also be directly determined from the $\lambda$'s

Let's look into a final aspect: the variance of the reconstructed data, as a function of the number of components that are considered. We have already seen that these are related to the eigenvalues of $X^T X$.

The total variance is the sum of the diagonal elements of the covariance matrix. Since

$C_X = P C_T P^T = C_X = \Sigma_{i=0}^R \lambda_i u_i u_i^T$ (this is known as a "spectral decomposition"), the sum of the diagonal values are simply $\Sigma_{i=0}^R \lambda_i u_i u_i^T$

we can rewrite the sum of the diagonal, summing over the $R$ components:

$\begin{eqnarray} \mathrm{Var}(X) & = & \Sigma_{i=0}^D (C_x[i, i]) \\ & = & \Sigma_{i=0}^D C_x[i, i] \\ & = & \Sigma_{i=0}^R \lambda_i <u_i, u_i> \\ & = & \Sigma_{i=0}^R \lambda_i \end{eqnarray}$,

because the dot product of the orthonormal eigenvectors with themselves is 1.

Let us verify this result with our toy data:

In [10]:
var_x_reconstruct = np.array([np.sum(np.var(reconstruct_X(T, P, X_mean, i), axis=0, ddof=1))
for i in range(1, 7) ]) # NB: ddof = 1 makes the denominator equal to n-1 (unbiased)

In [11]:
print('Reconstructed variance: \t {}'.format(var_x_reconstruct) )
print("Cumulative sum of lambda's: \t{}".format(np.cumsum(lambdas)) )

Reconstructed variance: 	 [ 37.514  55.706  56.98   57.906  58.202  58.3  ]
Cumulative sum of lambda's: 	[ 37.514  55.706  56.98   57.906  58.202  58.3  ]


We can therefore look at the lambda's to determine how many components we need to reconstruct the original data.

Sklearn has a pca implementation, such that we don't have to deal with the linear algebra. It is also more efficient for large matrices, as it uses Singular Value decomposition. As a final check, let's compare our results to sklearn's implementation

In [12]:
from sklearn.decomposition import PCA

In [13]:
pca = PCA(n_components=2)
print(pca.fit_transform(X))
# verify that this is indeed identical
print("\n")
print(T[:, :2])
# Note that the sign is reversed of the second column.
# This is because the principal component has the opposite direction

[[-4.193 -3.892]
[ 5.089 -2.419]
[ 5.597 -0.915]
[-6.349 -3.031]
[-6.936 -1.212]
[ 7.657 -1.78 ]
[ 7.637 -1.564]
[-7.539 -0.926]
[-0.204  7.203]
[-0.759  8.537]]

[[-4.193  3.892]
[ 5.089  2.419]
[ 5.597  0.915]
[-6.349  3.031]
[-6.936  1.212]
[ 7.657  1.78 ]
[ 7.637  1.564]
[-7.539  0.926]
[-0.204 -7.203]
[-0.759 -8.537]]

In [14]:
print(pca.explained_variance_)
print(lambdas[:2]*(9/10))
# NB: sklearn's PCA does not use the n-1 normalization, but n.
# Therefore, multiply by (n-1)/n to get identical results

[ 33.763  16.373]
[ 33.763  16.373]


Summarizing, we have looked at some key elements of PCA:

• By projecting our data on the principal components (the eigenvectors of the covariance matrix of X), we transformed our data X into a new matrix T
• Using all principal components, T has the same dimension as X, and the reconstruction is exact
• The covariance matrix of T is diagonal (i.e., the features in the transformed data do not correlate)
• The eigenvalues of the covariance matrix of X tell us directly how much of the original signal (variance) we get by including the corresponding principal component
• By leaving away the right column(s) of T, we can reduce the dimensionality of our data with minimal loss of signal
• In our toy example, we could express roughly 95% of the signal using 2 instead of 6 dimensions

The exact same result was achieved with sklearn's PCA (but internally, the sklearn implementation uses SVD instead of eigendecomposition of the covariance matrix. Note also that sklearn's PCA uses the biased estimator of the covariance - which does not result in significant differences: neither P nor the normalized eigenvalues (normalized by their sum) are affected).