Case Study:
A retail analytics company wants to perform market segmentation of its clients using 2.5 years of curated data.
Task:
Create a targeted ad marketing campaign by dividing the company's customers into at least three distinct groups.
Data:
The data was sourced from Kaggle.
Dataset Summary:
I checked the dataset for null values and discovered some columns/variables had null values.
I used the .drop function to drop those columns/variables.
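As a minimal sketch of this step, assuming pandas and a small toy data frame (the column names below are hypothetical stand-ins, not the actual dataset), the null check and column drop could look like this:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the sales data; column names are hypothetical.
df = pd.DataFrame({
    "QUANTITYORDERED": [30, 45, 22],
    "SALES": [2871.0, 4100.5, 1500.0],
    "ADDRESSLINE2": [np.nan, "Suite 2", np.nan],
    "STATE": ["NY", np.nan, "CA"],
})

# Count the missing values in each column.
null_counts = df.isnull().sum()
print(null_counts)

# Collect every column with at least one missing value and drop it.
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
df_clean = df.drop(columns=cols_with_nulls)
print(df_clean.columns.tolist())
```

Dropping whole columns is the simplest option when the affected variables are not needed for clustering; imputing or dropping rows would be alternatives if they were.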
Data Visualization:
I used the barplot.visualization function to visualize the country and status variables.
I had to drop the status variable because of data imbalance.
I replaced the categorical variables in the dataset (country, product line and deal size) with dummy variables, and then grouped the dataset by order date.
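The dummy-variable step can be sketched roughly as follows. This assumes pandas' get_dummies; the helper name add_dummies, the column names and the choice of sum as the aggregation are illustrative assumptions rather than the exact code from the project.

```python
import pandas as pd

# Toy frame; column names and values are hypothetical.
df = pd.DataFrame({
    "ORDERDATE": pd.to_datetime(
        ["2004-11-05", "2004-11-05", "2004-12-01", "2005-02-10"]
    ),
    "SALES": [2000.0, 3500.0, 5000.0, 1200.0],
    "COUNTRY": ["USA", "France", "USA", "Spain"],
    "DEALSIZE": ["Small", "Medium", "Large", "Small"],
})


def add_dummies(frame, column):
    """Replace one categorical column with 0/1 indicator columns."""
    dummies = pd.get_dummies(frame[column], prefix=column, dtype=int)
    return pd.concat([frame.drop(columns=column), dummies], axis=1)


for col in ["COUNTRY", "DEALSIZE"]:
    df = add_dummies(df, col)

# Group by order date; summing per date is an assumption made here.
daily = df.groupby("ORDERDATE").sum()
print(daily)
```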
Then I visualized the peak sales periods.
Looking at the above plot, the peak periods were between the months of November and December.
The next step was to plot the correlation map.
The map showed that the quarter IDs and the month IDs were highly correlated, so I dropped the quarter IDs and re-plotted the map.
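To illustrate why a quarter ID and a month ID end up highly correlated, here is a small sketch on synthetic data. The column names, the 0.9 threshold and the random values are assumptions for the example only; the project itself relied on reading the plotted map.

```python
import numpy as np
import pandas as pd

# Synthetic data: the quarter is derived from the month, so the two
# columns carry nearly the same information.
rng = np.random.default_rng(0)
month = rng.integers(1, 13, size=200)
df = pd.DataFrame({
    "MONTH_ID": month,
    "QTR_ID": (month - 1) // 3 + 1,
    "SALES": rng.normal(3500, 1000, size=200),
})

corr = df.corr()
print(corr.round(2))

# Keep only the upper triangle so each pair is checked once.
threshold = 0.9  # illustrative cut-off
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.abs().where(mask)
high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)

# Drop one column from the redundant pair and recompute the map.
df_reduced = df.drop(columns=["QTR_ID"])
print(df_reduced.corr().round(2))
```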
K-Means Clustering:
I used the elbow method to find the optimal number of clusters. More information about this method can be found here.
Visualizing it, I got this graph.
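The elbow method fits K-Means for a range of cluster counts and records the within-cluster sum of squares (inertia) for each; the "elbow" is where the curve stops dropping sharply. The sketch below uses scikit-learn on synthetic blobs. The data, the range of 1 to 10 clusters and the scaling step are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared sales features.
X, _ = make_blobs(n_samples=300, centers=4, n_features=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for each candidate k and record the inertia.
k_values = list(range(1, 11))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(X_scaled)
    inertias.append(model.inertia_)

# Plotting k against inertia gives the elbow curve.
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```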
I clustered the data using K-Means and visualized the cluster centers.
I also performed an inverse transformation to get this.
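The clustering and inverse transformation can be sketched as below. This assumes the features were standardized with scikit-learn's StandardScaler before clustering, which is why the cluster centers need to be mapped back to the original units. The feature names and random data are hypothetical; the five clusters match the five groups described in this post.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random stand-in data; feature names are hypothetical.
rng = np.random.default_rng(1)
features = ["QUANTITYORDERED", "PRICEEACH", "SALES", "MSRP"]
df = pd.DataFrame({
    "QUANTITYORDERED": rng.integers(10, 60, size=500),
    "PRICEEACH": rng.uniform(30, 100, size=500),
    "MSRP": rng.uniform(40, 200, size=500),
})
df["SALES"] = df["QUANTITYORDERED"] * df["PRICEEACH"]
df = df[features]

# Standardize so no single feature dominates the distance metric.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_scaled)

# Centers in scaled units are hard to interpret directly.
centers_scaled = pd.DataFrame(kmeans.cluster_centers_, columns=features)

# Map the centers back to the original units.
centers_original = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=features
)
print(centers_original.round(1))
```

Reading the centers in original units is what makes per-cluster descriptions of quantity, price, sales and MSRP possible.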
Cluster 0 (highest) - This group represents customers who buy in high quantities, centered around ~47. They buy items in all price ranges, leaning towards high-priced items of ~99. They also account for the highest total sales, ~8296, and they are active throughout the year. They are the biggest buyers of products with a high MSRP of ~158.
Cluster 1 - This group represents customers who buy in varying quantities of ~35. They tend to buy high-priced items of ~96. Their sales are a bit better than average at ~4435, and they buy products with the second-highest MSRP of ~133.
Cluster 2 (lowest) - This group represents customers who buy in low quantities of ~30. They tend to buy low-priced items of ~68. Their sales of ~2044 are lower than those of the other clusters, and they are extremely active around the holiday season. They buy products with a low MSRP of ~75.
Cluster 3 - This group represents customers who are only active during the holidays. They buy in lower quantities of ~35, but they tend to buy average-priced items of ~86. They also account for lower total sales of ~3673, and they tend to buy items with an MSRP of ~102.
Cluster 4 - This group represents customers who buy in varying quantities of ~39. They tend to buy average-priced items of ~94. Their sales are ~4280.
Visualizing these clusters
I performed dimensionality reduction using principal component analysis (PCA). I reduced the dataset to 3 components so the variables could be visualized, and concatenated the cluster labels to the data frame.
This is the scatterplot of the variables in 3D.
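A rough sketch of the PCA step, assuming scikit-learn and synthetic data: the features are projected onto 3 principal components and the K-Means labels are attached as an extra column, ready to be used as the color in a 3D scatterplot. The column names pca1, pca2, pca3 and cluster are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled sales features.
X, _ = make_blobs(n_samples=300, centers=5, n_features=8, random_state=7)
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=5, n_init=10, random_state=7).fit_predict(X_scaled)

# Project the features onto the first 3 principal components.
pca = PCA(n_components=3)
components = pca.fit_transform(X_scaled)

# Attach the cluster labels so each point can be colored by cluster.
pca_df = pd.DataFrame(components, columns=["pca1", "pca2", "pca3"])
pca_df = pd.concat([pca_df, pd.DataFrame({"cluster": labels})], axis=1)
print(pca_df.head())

# Share of the total variance kept by the 3 components.
print(pca.explained_variance_ratio_.sum())
```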
I also tried reducing the dimensionality using autoencoders, fitting the autoencoder with verbose equals 3, batch_size equals 128 and 500 epochs.
This is the graph of the score.
Conclusion:
This project is one of the projects from my certification courses on Udemy. The repo for this project is here. For any questions, suggestions or accolades, I can be reached through my LinkedIn profile.