<h1>Principal Component Analysis Part 2: Formal Theory</h1>
<h2>Properties of PCA</h2>
<p>Without a restriction on the coefficients, the variance of a principal component can be made arbitrarily large simply by scaling the coefficients up. To obtain a unique solution we impose a constraint: the sum of the squared coefficients must equal 1. In vector notation this is
$$
a_i^Ta_i = 1
$$
Each subsequent principal component is required to have a coefficient vector orthogonal to those of all the principal components before it:
$$
a_j^Ta_i = 0, \quad i < j
$$
The total variance of the $q$ principal components equals the total variance of the original variables:
$$
\sum_{i = 1}^q {\lambda_i} = trace(S)
$$
where $S$ is the sample covariance matrix and $\lambda_i$ is the variance of the $i$-th principal component.</p>
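<p>To see why the total variance is preserved, here is a short sketch. It assumes the spectral decomposition $S = A \Lambda A^T$, where the columns of $A$ are the coefficient vectors $a_i$ and $\Lambda$ is the diagonal matrix of the $\lambda_i$ (the names $A$ and $\Lambda$ are introduced here only for illustration). Since $A^TA = I$ and the trace is unchanged by cyclic permutation of a product,
$$
trace(S) = trace(A \Lambda A^T) = trace(\Lambda A^T A) = trace(\Lambda) = \sum_{i = 1}^q {\lambda_i}
$$</p>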
<p>The proportion of the total variation accounted for by the $j$-th principal component is
$$
P_j = \frac{\lambda_j}{trace(S)}
$$
From this, we can generalize to the first $m$ principal components, where $m < q$, and find the proportion $P^{(m)}$ of variation they account for:
$$
P^{(m)} = \frac{\sum_{i = 1}^m {\lambda_i}}{trace(S)}
$$
You can think of the first principal component as the line of best fit that minimizes the sum of squared residuals measured orthogonally to it.</p>
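<p>As a small worked example with made-up numbers (not taken from any dataset), suppose $q = 3$ and the eigenvalues are $\lambda_1 = 2.5$, $\lambda_2 = 1.0$, $\lambda_3 = 0.5$, so that $trace(S) = 4.0$. Then
$$
P_1 = \frac{2.5}{4.0} = 0.625, \qquad P^{(2)} = \frac{2.5 + 1.0}{4.0} = 0.875
$$
so under these assumed values the first two components together would account for 87.5% of the total variation.</p>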
<h3>What to watch out for</h3>
<p>As a reminder from the last lecture, <em>PCA is not scale-invariant</em>. Rescaling the variables before running PCA therefore generally gives different components than applying PCA to the original data, and may lead to different conclusions.</p>
<p>Additionally, if there are large differences between the variances of the original variables, then those whose variances are largest will tend to dominate the early components.</p>
<p>Therefore, principal components should only be extracted from the sample covariance matrix when all of the original variables have roughly the <strong>same scale</strong>.</p>
<h3>Alternatives to using the Covariance Matrix</h3>
<p>In practice, however, it is rare for all of the variables to be on the same scale. Therefore, principal components are typically extracted from the <strong>correlation matrix</strong> $R$.</p>
<p>Choosing to work with the correlation matrix rather than the covariance matrix treats the variables as all equally important when performing PCA.</p>
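<p>The link between the two matrices can be sketched as follows. Writing $s_{kk}$ for the sample variance of the $k$-th variable and $D = diag(s_{11}, \ldots, s_{qq})$ (notation introduced here only for illustration), the correlation matrix is
$$
R = D^{-1/2} S D^{-1/2}
$$
which is the covariance matrix of the standardized variables $z_k = (x_k - \bar{x}_k) / \sqrt{s_{kk}}$. Each standardized variable has variance 1, so $trace(R) = q$ and no variable can dominate the early components through its units alone.</p>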
<h2>Example Derivation: Bivariate Data</h2>
<p>Let $R$ be the correlation matrix
$$
R = \begin{pmatrix}
1 & r \\
r & 1
\end{pmatrix}
$$
Let us find the eigenvectors and eigenvalues of the correlation matrix
$$
det(R - \lambda I) = 0
$$</p>
<p>$$
(1-\lambda)^2 - r^2 = 0
$$</p>
<p>$$
\lambda_1 = 1 + r, \lambda_2 = 1 - r
$$</p>
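<p>The corresponding eigenvectors follow from solving $(R - \lambda_i I)a_i = 0$ under the unit-length constraint. A short sketch, assuming $r \neq 0$ and fixing the sign so that the first entry is positive:
$$
a_1 = \begin{pmatrix} \frac{1}{\sqrt{2}} \\ \frac{1}{\sqrt{2}} \end{pmatrix}, \qquad
a_2 = \begin{pmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{pmatrix}
$$
Both satisfy $a_i^Ta_i = 1$ and $a_2^Ta_1 = 0$, and neither depends on the value of $r$.</p>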
<p>Let us remember to check the condition that the sum of the variances of the principal components equals the trace of the correlation matrix. The variance of the first principal component is $(1+r)$ and that of the second is $(1-r)$, so
$$
\lambda_1 + \lambda_2 = (1 + r) + (1 - r) = 2 = trace(R)
$$</p>
<p>Consequently, for positive $r$, the variance explained by the first principal component grows as $r$ increases, which in turn lowers the variance explained by the second principal component.</p>
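<p>For instance, taking an illustrative value of $r = 0.8$ (chosen here only as an example):
$$
\lambda_1 = 1.8, \quad \lambda_2 = 0.2, \qquad P_1 = \frac{1.8}{2} = 0.9, \quad P_2 = \frac{0.2}{2} = 0.1
$$
so the first principal component alone would account for 90% of the total variation.</p>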
<h2>Choosing a Number of Principal Components</h2>
<p>Principal Component Analysis is typically used for dimensionality reduction, so we need a way to decide how many principal components to keep. Here are a few common strategies:</p>
<ul>
<li>Retain enough principal components to account for 70%-90% of the total variation</li>
<li>Exclude principal components whose eigenvalues are less than the average eigenvalue</li>
<li>Exclude principal components whose eigenvalues are less than one (this applies when the components are extracted from the correlation matrix, where the average eigenvalue is 1)</li>
<li>Generate a Scree Plot
<ul>
<li>Stop when the plot goes from "steep" to "shallow"</li>
<li>Stop when it essentially becomes a straight line.</li>