This post was originally published by John Clements at Towards Data Science
PCA in 2 Dimensions; All images are generated by the author
What it is, Why it’s useful, and How to use it
In real-world data sets, many of our variables are unimportant or correlated with each other. If we are doing a supervised machine learning task, leaving in variables unrelated to our target variable lowers the signal-to-noise ratio in our training data, making it harder to construct a good model. Leaving correlated variables in our training data can cause issues with multicollinearity, depending on the modeling method used.
One possible way around these issues is to remove the irrelevant variables. The problem is deciding which variables are irrelevant. You could use a technique like best subset selection via the Akaike Information Criterion, or recursive feature elimination (but NEVER step-wise regression), but using your training data to select which variables to keep can introduce bias into your estimates of mean squared error and other important measures of model quality.
Another approach to this problem is dimensionality reduction: finding a lower-dimensional representation of the data. If we can find a subspace of our original data that retains almost all of the relevant information, we can have an easier time estimating models. If the subspace is small enough, it can make clustering, or models dependent on distance metrics, a feasible option because we avoid the curse of dimensionality.
The most popular method for finding a lower-dimensional representation of a data set is Principal Components Analysis (PCA), which finds linear combinations of the variables in a training data set that maximize the variance explained by each linear combination, subject to some constraints. We go deeper in the next section.
Suppose we have a matrix, X, with n rows (observations) and k columns (variables). We want to find linear combinations of the form:

zᵢ = Xaᵢ
where aᵢ is a column vector of weights, for i = 1, 2, …, k. These linear combinations should explain the maximum amount of variability in X, and each should be orthogonal (at a right angle) to the others. We want the linear combinations to be orthogonal to each other so each principal component picks up different information.
To find the linear combination of X's columns that maximizes the variance of the linear combination, we need to solve:

maximize Var(Xa₁) = a₁ᵀSa₁

where S is the sample covariance matrix of X.
One way to maximize the variance of the linear combination is to make the weight on each column infinite. To keep every element of a₁ from being infinite, we need to impose a constraint on a₁. The way we do this in PCA is to force the sum of the squared elements in a₁ to be 1. Therefore, to find the first principal component, we need to solve:

maximize a₁ᵀSa₁ subject to a₁ᵀa₁ = 1
To find the second principal component, we need to impose the additional constraint that a₂ is orthogonal to a₁ (that is, a₂ᵀa₁ = 0).
For each additional component until we reach k, we add orthogonality constraints for all the preceding components.
Instead of needing to iteratively solve these constrained maximization problems, it turns out that the eigenvectors of the covariance matrix of the data are the principal components, so we can just solve for those. We order the eigenvectors of the sample covariance matrix by their corresponding eigenvalues (largest to smallest) to get the ordered principal components. The eigenvalues are the variance explained by their corresponding eigenvectors.
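The eigendecomposition route can be sketched in a few lines of NumPy. The data below is a synthetic stand-in generated just for illustration:

```python
import numpy as np

# Toy data: 100 observations, 3 variables, with column 2 correlated with column 0
# (synthetic stand-in, generated only for this illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
x[:, 2] = x[:, 0] + 0.1 * rng.normal(size=100)

# Standardize each column, then form the sample covariance matrix.
x_std = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
cov = np.cov(x_std, rowvar=False)

# The eigenvectors of the covariance matrix are the principal components.
# Order them by eigenvalue, largest to smallest.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvalue is the variance explained by its eigenvector.
print(eigvals)
```

Because the data is standardized, the eigenvalues sum to the number of variables (here, 3), and the first one should be large since two columns are nearly redundant.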
I’m assuming you’ve taken a linear algebra course, so I won’t cover how to find eigenvectors and eigenvalues here, but you can find an example here for a refresher on how to do it by hand. I also highly recommend 3Blue1Brown’s video for a graphical interpretation. If your linear algebra class was heavy on memorization like mine was, I recommend you watch 3Blue1Brown’s entire Essence of Linear Algebra series. The visualizations in these videos really helped me understand concepts that I had previously just memorized.
It is important to re-scale all the columns in X so each variable is on the same scale using standardization. Getting all the variables on the same scale ensures that one variable won’t dominate a principal component due to the units in which it is measured.
Suppose we have a matrix, X, with n rows (observations) and k columns (variables), where each column has been scaled to have a mean of 0 and a standard deviation of 1. We compute the sample covariance matrix of X, find its eigenvalues and eigenvectors, and order them from largest to smallest eigenvalue. A scree plot of the eigenvalues helps us choose how many components to keep: a common rule of thumb is to keep the components before the "elbow" in the plot.
Example Scree Plot with line at elbow
We are now ready to use Z = XA, the matrix of what are called principal component scores (where the columns of A are the retained eigenvectors), for the purposes of clustering, regression, or whatever we set out to do with our original data set.
I’ve built a Python class below that is able to perform PCA, set the number of components, and transform data.
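The original class isn't reproduced here, but a minimal sketch of such a class, following the eigendecomposition recipe above (names and method signatures are my own, not necessarily the author's), might look like:

```python
import numpy as np

class PCA:
    """Minimal PCA via eigendecomposition of the sample covariance matrix."""

    def fit(self, x):
        # Standardize so no variable dominates a component due to its units.
        self.mean_ = x.mean(axis=0)
        self.std_ = x.std(axis=0, ddof=1)
        x_std = (x - self.mean_) / self.std_
        # Eigendecomposition of the covariance matrix, sorted by eigenvalue
        # (largest first). Eigenvalues are the variance explained.
        eigvals, eigvecs = np.linalg.eigh(np.cov(x_std, rowvar=False))
        order = np.argsort(eigvals)[::-1]
        self.explained_variance_ = eigvals[order]
        self.components_ = eigvecs[:, order]
        self.n_components_ = x.shape[1]
        return self

    def set_n_components(self, m):
        # Choose how many components to keep when transforming.
        self.n_components_ = m
        return self

    def transform(self, x):
        # Project standardized data onto the retained components: Z = X A.
        x_std = (x - self.mean_) / self.std_
        return x_std @ self.components_[:, : self.n_components_]
```

A production implementation (e.g. scikit-learn's) uses the singular value decomposition instead, which is faster and more numerically stable, but the result is the same subspace.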
To help visualize what PCA is doing, I generated some random data with 2 dimensions where there is a linear relationship between the variables. After generating the data, I standardized it for plotting and fit the PCA object to the data. Then I plotted the standardized data along with the 2 principal components. The first one is purple and the second is blue. As you can see, the first principal component points in the direction of the most variation in the data. The second is orthogonal to the first and points in the direction of the second most variation in the data. I overlaid a unit circle to emphasize that the length of each vector is 1. Now let's apply PCA to a real problem.
PCA in 2 Dimensions
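The key facts from that plot can be verified numerically. A rough reconstruction of the setup (using scikit-learn's PCA here, with the plotting omitted; the data-generating parameters are my own choices):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random 2-D data with a linear relationship between the variables.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.3 * rng.normal(size=200)
x = np.column_stack([x1, x2])

# Standardize, then fit PCA.
x_std = (x - x.mean(axis=0)) / x.std(axis=0)
pca = PCA(n_components=2).fit(x_std)

# Each row of components_ is one principal component: a unit-length
# direction, orthogonal to the other.
v1, v2 = pca.components_
print(np.linalg.norm(v1), np.linalg.norm(v2))  # both 1.0
print(np.dot(v1, v2))                          # ~0.0
```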
To give a practical example of PCA, I used some data I have from another project. It contains 2015 census tract demographic data aggregated up to the county level. I then added the vote share for Hillary Clinton and Donald Trump in each county and from that figured out who won the county. If you want the raw data and code I used to wrangle the data, send me a message and I'll be happy to share it.
The variables in the data set include total population; the proportions by gender, race, and ethnic background; how people commute and the commute duration; the proportions of people who are publicly employed, privately employed, self-employed, or unemployed; the proportions employed in various industries; and per capita income.
As you may suspect, the proportions of publicly employed, privately employed, self-employed, and unemployed people should sum to 1. A couple of other sets of variables in this data set fit the same pattern, meaning some variables are redundant. We will see this in the PCA. If we weren't doing PCA, we would drop one variable from each set of linearly dependent columns, but we'll leave them in here.
We will be comparing the performance of a K-Nearest Neighbors (K-NN) Classifier on the un-transformed data vs. the transformed data in predicting whether Trump or Clinton won more votes in the county. This isn't really a good prediction problem, since the 2016 US election is over and there will never be new observations to predict. This data set would be better suited for data exploration and identifying which demographic characteristics correlate with a given electoral outcome. That being said, exploring demographic variables and the 2016 US election isn't the point of this article; how to apply PCA is the point of this section.
I chose the K-NN classifier because it relies on distance to find the closest points to an out-of-sample observation. Data points get farther and farther apart as the number of dimensions increases (see: Curse of Dimensionality). For this reason, we should expect a K-NN classifier trained on a small number of linear combinations of the data to outperform one trained on the raw data set, provided that the linear combinations capture meaningful relationships.
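The train/test split itself isn't shown here; assuming the demographic variables live in a feature matrix `X` with the county winners in a label vector `y` (names are illustrative, and the data below is a random stand-in for the county data), a standard split looks like:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative stand-ins for the county data: 31 demographic columns
# and a binary label for who won each county.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 31))
y = rng.integers(0, 2, size=500)

# Hold out 25% of the counties for testing, stratified by the label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)  # (375, 31) (125, 31)
```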
Now that we have all of our data separated, we can perform PCA on the training demographic variables. Using the rule of thumb for the scree plot, we would keep the first 6 principal components of the data.
Stopping when the variance explained by a component drops below the average variance explained would lead us to keep the first 9 principal components. Because the data is standardized and there are 31 variables, the total variance is 31. Divide total variance by the number of variables and you get 1. The 10th principal component explains about 0.9 out of the total variance of 31, so it is below that cut-off.
Because points are farther apart in higher dimensions, I will go with the first 6 principal components, instead of the first 9, for the purposes of modeling.
Scree Plot and Cumulative Variance Explained for the County Demographic Data
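Both stopping rules can be checked directly from the eigenvalues. With 31 standardized variables the total variance is 31, so the average-variance cut-off works out to exactly 1. A sketch using synthetic stand-in data in place of the county demographics:

```python
import numpy as np

# Synthetic stand-in for the 31 standardized demographic variables.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 31))
x_std = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# Eigenvalues of the covariance matrix, largest first.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(x_std, rowvar=False)))[::-1]

# With standardized data, total variance equals the number of variables,
# so the average variance per component is 1.
cutoff = eigvals.sum() / len(eigvals)
n_keep = int((eigvals > cutoff).sum())  # components above the cut-off
print(cutoff, n_keep)
```

A scree plot is just these eigenvalues plotted against the component index; the elbow rule keeps components to the left of the bend, while the rule here keeps those above the cut-off line.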
I printed out the variance explained by the last 6 principal components below, along with the ratio of total variance explained and the cumulative sum of that ratio. Above, I said that some of the variables were inherently linear combinations of the others. This shows up in the last 5 principal components: they explain approximately 0 variance, suggesting there are 5 variables that are linear combinations of the other variables.
These groups of variables that should sum to 1 include the gender proportions, the race and ethnic background proportions, the industry employment proportions, the main commute method proportions, and the employment status proportions. So although there are 31 variables in the un-transformed data, there are really only 26 independent ones.
Variance Explained by the Last 6 Principal Components
Now let’s only keep the first 6 principal components and transform the testing demographic variables.
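With scikit-learn this step reduces to fitting on the training data only and projecting both sets into the 6-dimensional space (the matrices below are random stand-ins for the standardized demographic data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-ins for the standardized training and testing demographic matrices.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 31))
X_test = rng.normal(size=(100, 31))

# Fit on the training data only, keep the first 6 components,
# then project both sets onto those components.
pca = PCA(n_components=6).fit(X_train)
Z_train = pca.transform(X_train)
Z_test = pca.transform(X_test)
print(Z_train.shape, Z_test.shape)  # (400, 6) (100, 6)
```

Fitting PCA on the training data alone matters: transforming the test data with components estimated from it would leak information into the evaluation.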
Although we are keeping 6 principal components, we can visualize the first 3 and add color-coded labels (blue for a Clinton win and red for a Trump win). By doing so, we can see that Clinton counties and Trump counties are not fully separable in the space spanned by the first 3 principal components, but there are clearly differences in demography between those counties.
Plot of the counties in the space of the first 3 principal components; Red is a county where Trump won more votes and Blue is a county where Clinton won more votes
Now it is time to train and test the K-NN classifier on un-transformed data, and the lower dimensional representation of the data using PCA.
Here we create an instance of sklearn’s K-NN classifier, using the 5 nearest neighbors to classify observations. The distance between points is measured by Euclidean distance. I used inverse distance weighting instead of uniform distance weighting.
Now let’s train the two models and get predictions for the testing data.
Now let’s compare the accuracy of each model on the test sets.
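The whole comparison fits in a short script. A sketch using synthetic stand-in data (so the accuracies printed here won't match the article's numbers), with the same classifier settings described above: 5 neighbors, Euclidean distance, inverse-distance weighting.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the county data: the label depends on a few columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 31))
y = (X[:, :3].sum(axis=1) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 5 nearest neighbors, Euclidean distance, inverse-distance weighting.
knn_raw = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn_pca = KNeighborsClassifier(n_neighbors=5, weights="distance")

# One model on the raw variables, one on the first 6 principal components.
pca = PCA(n_components=6).fit(X_train)
knn_raw.fit(X_train, y_train)
knn_pca.fit(pca.transform(X_train), y_train)

acc_raw = knn_raw.score(X_test, y_test)
acc_pca = knn_pca.score(pca.transform(X_test), y_test)
print(acc_raw, acc_pca)
```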
Here we can see that the K-NN classifier trained on the data transformed by the first 6 principal components has an out-of-sample accuracy that is 4 percentage points higher than the model trained on the un-transformed data!
By finding a few linear combinations of the variables, we were able to capture important information in a lower dimensional space where data points are closer together, thus improving the K-NN model’s accuracy!
I hope you now have an intuitive understanding of what PCA does, why it’s useful, how to estimate it (if for some reason you don’t want to use a fast and stable implementation from a source like scikit-learn), and how to apply it to a real problem. Thanks for reading to the end!
You can find the notebook used to make this article here.