🌱
PCA is a technique for analyzing large, high-dimensional datasets (i.e. datasets with many features). It increases interpretability while seeking to preserve as much information as possible. We can think of it as compressing the data into something that captures its original essence (e.g. compressing 10,000-d data down to 3-d or 2-d).
When dealing with very high-dimensional data, there are different things we might want to do with it, such as:

- Visualizing the data
- Extracting features from the data
- Clustering the data
- Compressing the data

PCA helps with each of these:

- Visualization: select the most “important” principal components and use those to graph the data.
- Feature extraction: PCA lets us see which directions have the most variance, so we can select those as the most distinct and distinguishing features of the data.
- Clustering: similar to the previous point; the high-variance directions separate the data best.
- Compression: we can get rid of principal components that have minimal variance (these components don’t really tell us much, so we can afford to shave them off). This can be thought of as analogous to lossy compression.
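As a quick illustration of that variance-based trimming, here is a sketch using scikit-learn’s `PCA` on the Iris dataset; both the library and the dataset are my own choice for the example, not something the notes prescribe:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Load a small example dataset (150 samples, 4 features).
X = load_iris().data

# Fit PCA keeping all components so we can inspect the variance split.
# Note: scikit-learn's PCA centers the data but does not divide by the
# standard deviation; standardize first if features are on very
# different scales.
pca = PCA().fit(X)

# Fraction of total variance captured by each principal component.
# Components with a tiny share are the ones we can afford to shave off
# (the lossy-compression analogy above).
print(pca.explained_variance_ratio_)

# Keep only the top 2 components, e.g. for a 2-d scatter plot.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (150, 2)
```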
So given an example dataset, you would go through the following steps:

1. Standardize the Data: for each feature, subtract its mean and divide by its standard deviation.
2. Create the Covariance Matrix of the standardized features.
3. Calculate Eigenvalues and Eigenvectors: perform an eigendecomposition on the covariance matrix.
4. Choose Principal Components: sort the eigenvalues in descending order and pick the eigenvectors corresponding to the top eigenvalues. How many you pick determines the dimensionality of your reduced data.
5. Transform the Original Dataset: form a feature vector by stacking the top eigenvectors side by side as columns (and make sure they’re normalized too!). Multiply the standardized dataset by this matrix. This gives us our final dataset with reduced dimensions.
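Here is a minimal from-scratch sketch of those five steps in NumPy; the `pca` function name and the random toy data are just for illustration:

```python
import numpy as np

def pca(X, k):
    """Reduce X (n_samples x n_features) to k dimensions via PCA."""
    # 1. Standardize: per feature, subtract the mean and divide by the std.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # 2. Covariance matrix of the standardized features (features x features).
    cov = np.cov(X_std, rowvar=False)

    # 3. Eigendecomposition. eigh suits the symmetric covariance matrix,
    #    and it returns unit-normalized eigenvectors as columns.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # 4. Sort eigenvalues in descending order and keep the top-k eigenvectors.
    order = np.argsort(eigvals)[::-1]
    W = eigvecs[:, order[:k]]  # feature vector: top eigenvectors as columns

    # 5. Project the standardized data onto the chosen principal components.
    return X_std @ W

# Example: compress 5-feature toy data down to 2 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X_reduced = pca(X, k=2)
print(X_reduced.shape)  # (100, 2)
```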