Project 6: Credit-Card-Customer-Segmentation
Project Overview
- Objective : To do Customer Segmentation on Credit Card Data
- Clustering Model
- Data cleaning and Data preprocessing
- Exploratory Data Analysis
- K-means Clustering
- PCA used for Model Visualization
- Autoencoders Model used for Dimensionality Reduction to improve Cluster Analysis

About Project
We are doing customer segmentation on credit card customer dataset so that we can analyse their customers and identify their needs.
As we know, marketing is very important for growth and development of any company. By using customer segmentation, they can target specific customers according to their need which helps the company to increase their sales and revenue.
Code and Resources used
- Python version: 3.7.6
- Packages: Pandas, Numpy, Seaborn, Matplotlib, Scikit, Tensorflow and Keras
- Resources used:
- https://medium.com/analytics-vidhya/how-to-determine-the-optimal-k-for-k-means-708505d204eb
- https://towardsdatascience.com/dimensionality-reduction-pca-versus-autoencoders-338fcaf3297d
- https://economictimes.indiatimes.com/wealth/borrow/how-to-get-a-credit-card-if-you-dont-have-a-job/articleshow/70769990.cms
Web Scraping
Dataset URL: https://www.kaggle.com/arjunbhasin2013/ccdata
The sample Dataset summarizes the usage behavior of about 9000 active credit card holders during the last 6 months. The file is at a customer level with 18 behavioral variables. Following is the Data Dictionary for Credit Card dataset :-
- CUSTID : Identification of Credit Card holder (Categorical)
- BALANCE : Balance amount left in their account to make purchases (
- BALANCEFREQUENCY : How frequently the Balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASES : Amount of purchases made from account
- ONEOFFPURCHASES : Maximum purchase amount done in one-go
- INSTALLMENTSPURCHASES : Amount of purchase done in installment
- CASHADVANCE : Cash in advance given by the user
- PURCHASESFREQUENCY : How frequently the Purchases are being made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFFPURCHASESFREQUENCY : How frequently Purchases are happening in one-go (1 = frequently purchased, 0 = not frequently purchased)
- PURCHASESINSTALLMENTSFREQUENCY : How frequently purchases in installments are being done (1 = frequently done, 0 = not frequently done)
- CASHADVANCEFREQUENCY : How frequently the cash in advance being paid
- CASHADVANCETRX : Number of Transactions made with “Cash in Advanced”
- PURCHASESTRX : Numbe of purchase transactions made
- CREDITLIMIT : Limit of Credit Card for user
- PAYMENTS : Amount of Payment done by user
- MINIMUM_PAYMENTS : Minimum amount of payments made by user
- PRCFULLPAYMENT : Percent of full payment paid by user
- TENURE : Tenure of credit card service for user
Data Cleaning
I made the following changes and created the following variables:
- Set NULL values for “minimum payments” and “credit limit” with mean data
- Remove column “customer ID”
EDA
I looked at the distributions of the data and the value counts for the various numerical variables. Below are a few highlights:

Model Builiding
K-means
The K-Means is an unsupervised machine learning algorithm. It works by grouping data points with similar attribute values together by measuring the euclidean distances between them. It is simple and perhaps the most commonly used algorithm for clustering.
The basic idea behind k-means consists of defining k clusters such that total within-cluster variation (or error) is minimum.
Two methods that can be useful to find this mysterious k in k-Means are:
- The Elbow Method
- The Silhouette Method
The Elbow Method
Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of k, and choose the k for which WSS becomes first starts to diminish. In the plot of WSS-versus-k, this is visible as an elbow.
Within-Cluster-Sum of Squared Errors sounds a bit complex. Let’s break it down:
- The Squared Error for each point is the square of the distance of the point from its representation i.e. its predicted cluster center.
- The WSS score is the sum of these Squared Errors for all the points.
- Any distance metric like the Euclidean Distance or the Manhattan Distance can be used
The Silhouette Method
The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).
The range of the Silhouette value is between +1 and -1. A high value is desirable and indicates that the point is placed in the correct cluster. If many points have a negative Silhouette value, it may indicate that we have created too many or too few clusters.
The Silhouette Score reaches its global maximum at the optimal k.
Dimesionality Reduction
Autoencoders
Autoencoder is an unsupervised artificial neural network that learns how to efficiently compress and encode data then learns how to reconstruct the data back from the reduced encoded representation to a representation that is as close to the original input as possible.
Autoencoder, by design, reduces data dimensions by learning how to ignore the noise in the data.

Autoencoder Components:
Autoencoders consists of 4 main parts:
1- Encoder: In which the model learns how to reduce the input dimensions and compress the input data into an encoded representation.
2- Bottleneck: which is the layer that contains the compressed representation of the input data. This is the lowest possible dimensions of the input data.
3- Decoder: In which the model learns how to reconstruct the data from the encoded representation to be as close to the original input as possible.
4- Reconstruction Loss: This is the method that measures measure how well the decoder is performing and how close the output is to the original input.

Model Performance
WCSS : Within-Cluster-Sum of Squared Errors And k : No. of clusters
- Green Line: WCSS Vs K
- Red Line: WCSS of Autoencoder model Vs k

It is clear WCSS Error is reduced after dimensionality reduction by encoders.
Model Prediction
Customer Segmentation by k-means clustering

Conclusion
After analyzing whole credit card dataset, we finally done customer segmentation by dividing customers into four clusters which can be used to target specific customers as their needs by marketers to provide offers, loans etc.
Further Improvements
-
Not to avoid outliers as there are many outliers in dataset
-
Try other clustering strategies