---
title: Principal Component Analysis Part 2 - Formal Theory
showthedate: false
math: true
---

## Properties of PCA

There are a number of ways to maximize the variance of a principal component, so to arrive at a unique solution we impose a constraint: the sum of the squares of the coefficients must equal 1. In vector notation, this is

$$
a_i^Ta_i = 1
$$

Each subsequent principal component is orthogonal to all of the principal components before it.

$$
a_j^Ta_i = 0, \quad i < j
$$

The total variance of the $q$ principal components equals the total variance of the original variables

$$
\sum_{i = 1}^q {\lambda_i} = trace(S)
$$

where $S$ is the sample covariance matrix.

The proportion of variation accounted for by the $j$th principal component is

$$
P_j = \frac{\lambda_j}{trace(S)}
$$

From this, we can generalize to the first $m$ principal components, where $m < q$, and find the proportion of variation $P^{(m)}$ that they account for:

$$
P^{(m)} = \frac{\sum_{i = 1}^m{\lambda_i}}{trace(S)}
$$

You can think of the first principal component as the line of best fit that minimizes the residuals orthogonal to it.

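As a quick numerical illustration of these properties (not part of the original lecture), here is a minimal NumPy sketch on made-up data: it extracts the eigenvalues of a sample covariance matrix, checks that they sum to the trace, and reports the proportions of variation $P_j$ and $P^{(m)}$.

```python
import numpy as np

# Illustrative data: 200 observations of q = 3 variables (values are arbitrary)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.7]])

S = np.cov(X, rowvar=False)                  # sample covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(S)
eigenvalues = eigenvalues[::-1]              # sort from largest to smallest

# The eigenvectors satisfy a_i^T a_i = 1 and a_j^T a_i = 0
print(np.allclose(eigenvectors.T @ eigenvectors, np.eye(3)))   # True

# The total variance of the principal components equals trace(S)
print(np.isclose(eigenvalues.sum(), np.trace(S)))              # True

# Proportion of variation accounted for by each component, and cumulatively
P = eigenvalues / np.trace(S)
print(P)            # P_j for each component
print(P.cumsum())   # P^(m) for the first m components
```
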
### What to watch out for

As a reminder from the last lecture, *PCA is not scale-invariant*. Transformations applied to the dataset before performing PCA will often lead to different results, and possibly different conclusions, than PCA on the untransformed data.

Additionally, if there are large differences between the variances of the original variables, then the variables whose variances are largest will tend to dominate the early components.

Therefore, principal components should only be extracted from the sample covariance matrix when all of the original variables have roughly the **same scale**.

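To make the scale issue concrete, here is a small sketch of my own (again with arbitrary made-up data): rescaling a single variable, say converting it from meters to centimeters, changes which direction the first principal component points in.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=500)

def first_pc(data):
    """Return the leading eigenvector of the sample covariance matrix."""
    S = np.cov(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(S)
    return eigenvectors[:, -1]        # np.linalg.eigh sorts eigenvalues ascending

print(first_pc(X))                    # roughly (0.71, 0.71)

# Rescale the first variable (e.g. meters -> centimeters)
X_scaled = X.copy()
X_scaled[:, 0] *= 100
print(first_pc(X_scaled))             # now almost entirely (1, 0): the rescaled
                                      # variable dominates the first component
```
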
### Alternatives to using the Covariance Matrix

In practice, however, it is rare for all of the variables to be on the same scale. Therefore, principal components are typically extracted from the **correlation matrix** $R$.

Choosing to work with the correlation matrix rather than the covariance matrix treats the variables as equally important when performing PCA.

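The sketch below (my own illustration, not from the lecture) shows the equivalence this relies on: the eigenvalues of the correlation matrix are the same as those of the covariance matrix of the standardized variables.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0, 0, 0],
                            [[4.0, 1.2, 0.5],
                             [1.2, 1.0, 0.3],
                             [0.5, 0.3, 0.25]], size=1000)

# Eigenvalues of the correlation matrix R
R = np.corrcoef(X, rowvar=False)
eig_R = np.linalg.eigvalsh(R)

# Eigenvalues of the covariance matrix of the standardized variables
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eig_Z = np.linalg.eigvalsh(np.cov(Z, rowvar=False))

print(np.allclose(eig_R, eig_Z))      # True: the two approaches agree
```
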
## Example Derivation: Bivariate Data

Let $R$ be the correlation matrix

$$
R = \begin{pmatrix}
1 & r \\
r & 1
\end{pmatrix}
$$

Let us find the eigenvalues and eigenvectors of the correlation matrix

$$
det(R - \lambda I) = 0
$$

$$
(1-\lambda)^2 - r^2 = 0
$$

$$
\lambda_1 = 1 + r, \quad \lambda_2 = 1 - r
$$

Let us remember to check the condition that the sum of the eigenvalues equals the trace of the correlation matrix:

$$
\lambda_1 + \lambda_2 = 1 + r + (1 - r) = 2 = trace(R)
$$

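As a quick sanity check on this derivation (my own addition), we can pick an arbitrary value of $r$ and confirm numerically that the eigenvalues of $R$ are $1 + r$ and $1 - r$:

```python
import numpy as np

r = 0.4                                  # arbitrary illustrative correlation
R = np.array([[1.0, r],
              [r, 1.0]])

eigenvalues = np.linalg.eigvalsh(R)      # returned in ascending order
print(eigenvalues)                       # [0.6, 1.4], i.e. 1 - r and 1 + r
print(np.isclose(eigenvalues.sum(), np.trace(R)))   # True: they sum to trace(R) = 2
```
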
### Finding the First Eigenvector

Looking back at the eigenvalue equation

$$
Ra_1 = \lambda a_1
$$

and substituting $\lambda_1 = 1 + r$, we get the following two equations

$$
a_{11} + ra_{12} = (1+r)a_{11} \tag{1}
$$

$$
ra_{11} + a_{12} = (1 + r)a_{12} \tag{2}
$$

Now let us find out what $a_{11}$ and $a_{12}$ equal. First, let us solve for $a_{11}$ using equation $(1)$

$$
ra_{12} = (1+r)a_{11} - a_{11}
$$

$$
ra_{12} = a_{11}(1 + r - 1)
$$

$$
ra_{12} = ra_{11}
$$

$$
a_{12} = a_{11}
$$

where $r$ does not equal $0$.

Now we must apply the sum-of-squares constraint

$$
a_1^Ta_1 = 1
$$

$$
a_{11}^2 + a_{12}^2 = 1
$$

Recall that $a_{12} = a_{11}$

$$
2a_{11}^2 = 1
$$

$$
a_{11}^2 = \frac{1}{2}
$$

$$
a_{11} = \pm \frac{1}{\sqrt{2}}
$$

For the sake of choosing a value, let us take the positive root and say $a_{11} = \frac{1}{\sqrt{2}}$

### Finding the Second Eigenvector

Recall that each subsequent eigenvector is orthogonal to the ones before it. This means

$$
a_{11}a_{21} + a_{12}a_{22} = 0
$$

Substituting the values for $a_{11}$ and $a_{12}$ calculated in the previous section,

$$
\frac{1}{\sqrt{2}}a_{21} + \frac{1}{\sqrt{2}}a_{22} = 0
$$

$$
a_{21} + a_{22} = 0
$$

$$
a_{21} = -a_{22}
$$

Since this eigenvector also needs to satisfy the sum-of-squares constraint, we get the following values

$$
a_{21} = \frac{1}{\sqrt{2}}, \quad a_{22} = \frac{-1}{\sqrt{2}}
$$

### Conclusion of Example

From this, we can say that the two principal components are given by

$$
y_1 = \frac{1}{\sqrt{2}}(x_1 + x_2), \quad y_2 = \frac{1}{\sqrt{2}}(x_1-x_2)
$$

with the variance of the first principal component given by $(1+r)$ and the variance of the second by $(1-r)$.

Because of this, as $r$ increases, so does the variance explained by the first principal component, which in turn lowers the variance explained by the second principal component.

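Here is a short NumPy check (my own addition) that the loadings derived above, $\frac{1}{\sqrt{2}}(1, 1)$ and $\frac{1}{\sqrt{2}}(1, -1)$, really are eigenvectors of the bivariate correlation matrix with eigenvalues $1 + r$ and $1 - r$:

```python
import numpy as np

r = 0.4                                   # same illustrative correlation as before
R = np.array([[1.0, r],
              [r, 1.0]])

a1 = np.array([1.0, 1.0]) / np.sqrt(2)    # first principal component loadings
a2 = np.array([1.0, -1.0]) / np.sqrt(2)   # second principal component loadings

print(np.allclose(R @ a1, (1 + r) * a1))  # True: R a_1 = (1 + r) a_1
print(np.allclose(R @ a2, (1 - r) * a2))  # True: R a_2 = (1 - r) a_2
print(a1 @ a1, a2 @ a2, a1 @ a2)          # approximately 1, 1, 0: the constraints hold
```
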
## Choosing a Number of Principal Components

Principal Component Analysis is typically used in dimensionality reduction efforts. Therefore, there are several strategies for picking the right number of principal components to keep. Here are a few (see the sketch after this list):

- Retain enough principal components to account for 70%-90% of the variation
- Exclude principal components whose eigenvalues are less than the average eigenvalue
- Exclude principal components whose eigenvalues are less than one
- Generate a Scree Plot
  - Stop when the plot goes from "steep" to "shallow"
  - Stop when it essentially becomes a straight line

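Here is a minimal sketch of how the first few rules could look in code (my own illustration, assuming NumPy, matplotlib, and an arbitrary made-up data matrix `X`): it keeps enough components to reach a chosen variance threshold, applies the average-eigenvalue rule, and draws a scree plot for visual inspection.

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data matrix: 200 observations of 6 variables
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))

R = np.corrcoef(X, rowvar=False)              # work from the correlation matrix
eigenvalues = np.linalg.eigvalsh(R)[::-1]     # sort from largest to smallest

# Rule 1: retain enough components to account for, say, 80% of the variation
proportion = np.cumsum(eigenvalues) / np.trace(R)
m_variance = int(np.searchsorted(proportion, 0.80)) + 1

# Rule 2: keep components whose eigenvalue exceeds the average eigenvalue
# (for a correlation matrix the average eigenvalue is 1, so this coincides
#  with the "eigenvalues less than one" rule)
m_average = int(np.sum(eigenvalues > eigenvalues.mean()))

print(m_variance, m_average)

# Rule 3: scree plot -- stop where the curve goes from "steep" to "shallow"
plt.plot(np.arange(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.xlabel("Component number")
plt.ylabel("Eigenvalue")
plt.title("Scree plot")
plt.show()
```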