Demonstrating PCA on K-Means Clustering

Principal Component Analysis (PCA) is a method used to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which are ordered so that the first few retain most of the variation present in all of the original variables. Source.
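To make the definition concrete, here is a minimal NumPy sketch of the idea, not the code used in this post. The data is randomly generated for illustration, and the variable names are my own. It checks the two properties described above: the components are uncorrelated, and they are ordered by how much variation they retain.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples of 3 variables, two of which are interrelated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data and compute its covariance matrix.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)

# Eigenvectors of the covariance matrix are the principal directions.
# eigh returns eigenvalues in ascending order, so reorder to descending.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the centered data onto the principal directions.
scores = Xc @ eigvecs

# Share of the total variance retained by each component.
explained_ratio = eigvals / eigvals.sum()

# Covariance of the projected data: off-diagonal entries are (numerically) zero.
pc_cov = np.cov(scores, rowvar=False)

print(explained_ratio)
print(np.round(pc_cov, 6))
```

The off-diagonal zeros in the printed covariance matrix show that the components are uncorrelated, and the ratios appear from largest to smallest.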

Data Collection:

To demonstrate this, I used a dataset from Kaggle and focused only on its color variables. After importing and loading the dataset, I viewed the first five rows. Next, I viewed its information and summary statistics.
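The loading and inspection steps might look roughly like the sketch below. The dataset file and its columns are not named here, so the small in-memory data frame and its column names (`primary_color`, `secondary_color`, `accent_color`, `weight`) are hypothetical stand-ins. With the real file, the data frame would typically come from `pd.read_csv` instead.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle data; with the real file this would
# typically be something like: df = pd.read_csv("<path to the downloaded CSV>")
df = pd.DataFrame({
    "primary_color": ["red", "blue", "green", "red", "blue", "green"],
    "secondary_color": ["white", "black", "white", "gray", "black", "gray"],
    "accent_color": ["yellow", "yellow", "pink", "pink", "yellow", "pink"],
    "weight": [1.2, 0.8, 1.5, 1.1, 0.9, 1.4],
})

# First five rows.
first_rows = df.head()
print(first_rows)

# Column types and non-null counts (printed directly by pandas).
df.info()

# Summary statistics; include="all" also covers the categorical columns.
summary = df.describe(include="all")
print(summary)
```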


Data Preprocessing:

Since I used only the colors, I extracted the color columns into a new data frame and then checked the result.
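One way to do this extraction is sketched below. It assumes the color columns can be recognized by having "color" in their names, which may not hold for the actual dataset. The data frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the loaded dataset.
df = pd.DataFrame({
    "primary_color": ["red", "blue", "green", "red", "blue", "green"],
    "secondary_color": ["white", "black", "white", "gray", "black", "gray"],
    "accent_color": ["yellow", "yellow", "pink", "pink", "yellow", "pink"],
    "weight": [1.2, 0.8, 1.5, 1.1, 0.9, 1.4],
})

# Assumption: color columns are the ones whose names contain "color".
color_cols = [col for col in df.columns if "color" in col]

# Copy so later changes do not affect the original data frame.
colors = df[color_cols].copy()

# Check the created data frame.
print(colors.head())
```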

Then, using the scikit-learn library, I encoded the categorical variables, defined a dictionary to store the encoding, and generated an encoded data frame. I then computed the correlation of the encoded data and visualized it.
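A sketch of this step follows. It assumes scikit-learn's `LabelEncoder` was the encoder, which the text above does not confirm. The data and the names `colors`, `encodings`, and `encoded` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the extracted color columns.
colors = pd.DataFrame({
    "primary_color": ["red", "blue", "green", "red", "blue", "green"],
    "secondary_color": ["white", "black", "white", "gray", "black", "gray"],
    "accent_color": ["yellow", "yellow", "pink", "pink", "yellow", "pink"],
})

encodings = {}  # column name -> {category: integer code}
encoded = pd.DataFrame(index=colors.index)

for col in colors.columns:
    encoder = LabelEncoder()
    encoded[col] = encoder.fit_transform(colors[col])
    # LabelEncoder assigns codes by the sorted order of the categories.
    encodings[col] = {cls: code for code, cls in enumerate(encoder.classes_)}

# Pairwise Pearson correlation of the encoded columns.
corr = encoded.corr()

print(encodings)
print(encoded.head())
print(corr)
```

Note that label encoding assigns integers in alphabetical order, so the numeric order of the codes carries no real meaning for colors, and correlations computed on them should be read with that in mind. The correlation matrix can then be visualized, for example as a heatmap.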


Principal Component Analysis:

I instantiated PCA with three components, fit the model, and reduced the color values to three dimensions.
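With scikit-learn, this step can be sketched as follows. The input here is a randomly generated integer array standing in for the encoded color data frame, so the shapes and values are illustrative only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the encoded colors: 60 rows, 5 integer-coded columns.
rng = np.random.default_rng(42)
encoded = rng.integers(0, 6, size=(60, 5))

# Instantiate PCA with three components, then fit and transform in one call.
pca = PCA(n_components=3)
reduced = pca.fit_transform(encoded)

print(reduced.shape)                  # one row per sample, three columns
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The `explained_variance_ratio_` attribute is a quick way to see how much of the original variation the three components retain.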

K-Means:

The final phase was to apply a simple K-Means model with three clusters, fit it, and visualize the clusters.
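A minimal sketch of the clustering step is shown below, assuming K-Means is fit on the three-dimensional PCA output. The `reduced` array is synthetic (three well-separated groups of points), and `n_init` and `random_state` are my own choices for reproducibility rather than settings taken from this post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the PCA output: three groups of points in 3-D.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 0.0], [0.0, 5.0, 5.0]])
reduced = np.vstack([c + rng.normal(scale=0.5, size=(30, 3)) for c in centers])

# K-Means with three clusters; fit_predict fits the model and returns a label per row.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)

print(labels[:10])
print(kmeans.cluster_centers_)
```

To visualize the result, one option is a scatter plot of the first two components, colored by `labels`.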

Conclusion:

This is a simple illustration of Principal Component Analysis on K-Means clustering. The code is in my repo, and you can connect with me on LinkedIn for corrections or suggestions. Thank you for reading.