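====== Principal component analysis ======

Principal component analysis (PCA) boils down data with many dimensions (aka columns) into a few dimensions while keeping **most** of the information, measured as variance.

Given $n$ $m$-dimensional vectors $x_1, ..., x_n$, the steps to find the top $k$ principal components are:

  - Calculate the component-wise average of all of the vectors: $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$
  - Form the $m \times m$ covariance matrix $S = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$
  - Calculate the $m$-dimensional unit eigenvectors $v_1, ..., v_k$ associated with the $k$ largest eigenvalues $\lambda_1 \ge ... \ge \lambda_k$ of $S$
  - The $k$-dimensional representation of $x_i$ is then $\hat{x}_i = ((x_i - \bar{x})^T v_1, ..., (x_i - \bar{x})^T v_k)$. If the data are already centered ($\bar{x} = 0$), this is simply $(x_i^T v_1, ..., x_i^T v_k)$.

Another way to state the objective: over all choices of $k$ orthonormal directions $v_1, ..., v_k$, PCA minimizes the reconstruction error, where $\tilde{x}_i = \bar{x} + \sum_{j=1}^k \hat{x}_{ij} v_j$ is $\hat{x}_i$ mapped back into the original $m$-dimensional space:

$$ \min \sum_{i=1}^n || x_i - \tilde{x}_i ||^2 $$

This is equivalent to maximizing the variance retained by the projection:

$$ \max \sum_{i=1}^n || \hat{x}_i ||^2 $$

The two are equivalent because, for centered data, $|| x_i - \bar{x} ||^2 = || x_i - \tilde{x}_i ||^2 + || \hat{x}_i ||^2$, and the left-hand side does not depend on the chosen directions.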
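
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name ''pca'' and the variables ''X'', ''Z'', ''V'' are hypothetical, the vectors are assumed to be stored as the rows of an $n \times m$ array, and the projection is assumed to use the centered data $x_i - \bar{x}$. The last lines check numerically that the reconstruction error and the retained variance add up to the total variance, which is why minimizing one is the same as maximizing the other.

```python
import numpy as np


def pca(X, k):
    """Return (Z, V, eigvals, mean) for the top-k principal components.

    X is an (n, m) array whose rows are the vectors x_i (an assumption of
    this sketch). Z is the (n, k) reduced representation, V the (m, k)
    matrix whose columns are the eigenvectors v_1..v_k.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Step 1: component-wise average of all vectors.
    mean = X.mean(axis=0)
    Xc = X - mean

    # Step 2: m x m covariance matrix S = (1/n) * sum (x_i - mean)(x_i - mean)^T.
    S = Xc.T @ Xc / n

    # Step 3: eigenvectors of the symmetric matrix S. eigh returns
    # eigenvalues in ascending order, so pick the k largest explicitly.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:k]
    V = eigvecs[:, order]

    # Step 4: project the centered vectors onto the k directions.
    Z = Xc @ V
    return Z, V, eigvals[order], mean


# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

Z, V, lam, mean = pca(X, 2)

# Map the reduced vectors back to m dimensions and compare the objectives.
X_rec = mean + Z @ V.T
recon_err = np.sum((X - X_rec) ** 2)   # quantity being minimized
retained = np.sum(Z ** 2)              # quantity being maximized
total = np.sum((X - mean) ** 2)        # fixed, independent of V
print(np.isclose(recon_err + retained, total))
```

''np.linalg.eigh'' is used rather than ''np.linalg.eig'' because $S$ is symmetric, so the eigenvalues come back real and the eigenvectors orthonormal. For large $m$, a singular value decomposition of the centered data is a common alternative to forming $S$ explicitly.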