Principal Component Analysis (PCA) is a method for reducing the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated and ordered so that the first few retain most of the variation present in all of the original variables. Source.
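To make the two properties in that definition concrete (uncorrelated components, ordered by variance), here is a minimal NumPy sketch. The data is synthetic and purely illustrative, not the Kaggle dataset used below, and the eigendecomposition route is just one common way to compute PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples of 3 strongly correlated variables
latent = rng.normal(size=(200, 1))
X = np.hstack([latent + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

# Center the data, then eigen-decompose its covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns eigenvalues in ascending order

# Reorder so the component with the largest variance comes first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project the centered data onto the principal components
scores = Xc @ eigvecs

explained_ratio = eigvals / eigvals.sum()   # share of variance per component
score_cov = np.cov(scores, rowvar=False)    # should be (nearly) diagonal

print(explained_ratio)
print(np.round(score_cov, 6))
```

The off-diagonal entries of `score_cov` come out at essentially zero, which is the "uncorrelated" part, and `explained_ratio` is sorted from largest to smallest, which is the "ordered" part.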
Data Collection:
To demonstrate this, I used a dataset from Kaggle and focused only on its color variable. After importing and loading the dataset, I viewed the first five rows. Next, I viewed its summary information and descriptive statistics.
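A minimal sketch of this inspection step with pandas. The small data frame and its column names (`primary_color`, `price`, and so on) are hypothetical stand-ins, since the Kaggle file is not reproduced here. With the real data you would load it with `pd.read_csv` instead.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle dataset (column names are assumptions)
df = pd.DataFrame({
    "primary_color": ["red", "blue", "green", "red", "blue", "green", "red", "blue"],
    "secondary_color": ["white", "black", "white", "grey", "black", "grey", "white", "black"],
    "accent_color": ["gold", "silver", "gold", "gold", "silver", "silver", "gold", "silver"],
    "trim_color": ["black", "black", "white", "white", "black", "white", "black", "white"],
    "price": [10.0, 12.5, 9.0, 11.0, 13.0, 8.5, 10.5, 12.0],
})

# First few rows (head() returns five by default)
first_rows = df.head()
print(first_rows)

# Column types and non-null counts
df.info()

# Descriptive statistics for numeric and categorical columns alike
summary = df.describe(include="all")
print(summary)
```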
Data Preprocessing:
Since I only used the colors, I extracted them into a new data frame and then checked the result.
Then, using the scikit-learn library, I encoded the categorical variables, defined a dictionary to store the encoding, and finally generated an encoded data frame. I then computed the correlation matrix of the encoded data and visualized it.
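The preprocessing could look roughly like the sketch below. It assumes `LabelEncoder` was the scikit-learn encoder used (the post does not say which one) and again uses hypothetical color columns. The dictionary here maps each label to its integer code per column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical color columns standing in for the extracted data frame
colors = pd.DataFrame({
    "primary_color": ["red", "blue", "green", "red", "blue", "green", "red", "blue"],
    "secondary_color": ["white", "black", "white", "grey", "black", "grey", "white", "black"],
    "accent_color": ["gold", "silver", "gold", "gold", "silver", "silver", "gold", "silver"],
    "trim_color": ["black", "black", "white", "white", "black", "white", "black", "white"],
})
color_cols = list(colors.columns)

encodings = {}                                # column name -> {label: integer code}
encoded = pd.DataFrame(index=colors.index)    # encoded data frame

for col in color_cols:
    le = LabelEncoder()
    encoded[col] = le.fit_transform(colors[col])
    # Store the mapping so the integer codes can be traced back to color names
    encodings[col] = {label: int(code)
                      for label, code in zip(le.classes_, le.transform(le.classes_))}

# Pairwise correlation of the encoded columns
corr = encoded.corr()
print(encodings["primary_color"])
print(corr)
```

Keep in mind that label encoding imposes an arbitrary order on the colors (alphabetical here), so the correlations describe the codes rather than any natural ordering of the colors. For the visualization, a heatmap of `corr` is a common choice.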
Principal Component Analysis:
I instantiated PCA with three components, fit the model, and reduced the dimensionality of the color values to that number of components.
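As a rough sketch of this step with scikit-learn's `PCA`: the `encoded` frame below is random integer data standing in for the encoded colors, and its six columns are an assumption made only so that reducing to three components is visible.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical stand-in for the encoded data frame: 100 rows, 6 encoded color columns
encoded = pd.DataFrame(
    rng.integers(0, 5, size=(100, 6)),
    columns=[f"color_{i}" for i in range(6)],
)

# Instantiate PCA with three components, fit it, and project the data
pca = PCA(n_components=3)
reduced = pca.fit_transform(encoded)

print(reduced.shape)                   # rows are kept, columns drop to 3
print(pca.explained_variance_ratio_)   # share of variance kept by each component
```

Checking `explained_variance_ratio_` is a quick way to see how much of the original variation the three components actually retain.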
K-Means:
The final phase was to apply a simple K-Means model with three clusters, fit it, and visualize the resulting clusters.
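A minimal sketch of the clustering step, assuming K-Means is fit on the three-component PCA output. The `reduced` array here is synthetic (three well-separated blobs) so the example runs on its own, and `n_init` and `random_state` are my choices for reproducibility, not settings taken from the original code.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical stand-in for the PCA output: three blobs of 30 points in 3-D
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
reduced = np.vstack([c + rng.normal(scale=0.5, size=(30, 3)) for c in centers])

# K-Means with three clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)

print(np.bincount(labels))       # number of points per cluster
print(kmeans.cluster_centers_)   # one 3-D centroid per cluster
```

To visualize the result, one option is a scatter plot of the first two components colored by `labels`.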
Conclusion:
This is a simple illustration of using Principal Component Analysis together with K-Means clustering. The code is in my repo, and you can connect with me on LinkedIn for corrections or suggestions. Thank you for reading.