Credit-Card-Customer-Segmentation

Project 6: Credit-Card-Customer-Segmentation


Project Overview


About Project

We perform customer segmentation on a credit card customer dataset so that a company can analyse its customers and identify their needs.

Marketing is very important for the growth and development of any company. With customer segmentation, a company can target specific customers according to their needs, which helps increase its sales and revenue.


Code and Resources used


Web Scraping

Dataset URL: https://www.kaggle.com/arjunbhasin2013/ccdata

The sample dataset summarizes the usage behavior of about 9,000 active credit card holders over the last 6 months. The file is at the customer level, with 18 behavioral variables. The data dictionary for the credit card dataset is as follows:


Data Cleaning

I made the following changes and created the following variables:


EDA

I looked at the distributions of the data and the value counts for the various numerical variables. Below are a few highlights:


Model Building

K-means

K-means is an unsupervised machine learning algorithm. It groups data points with similar attribute values by assigning each point to the nearest cluster center, measured by Euclidean distance. It is simple and perhaps the most commonly used clustering algorithm.

The basic idea behind k-means is to define k clusters such that the total within-cluster variation (or error) is minimized.
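As a rough illustration of that idea, here is a minimal sketch of the standard k-means loop (Lloyd's algorithm) in plain Python. It is not the project's code: the function name `kmeans`, the `init` parameter, and the toy points are assumptions made for this example, and a real analysis would more likely use a library implementation.

```python
import math
import random


def kmeans(points, k, init=None, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate an assignment step and a center update step."""
    rng = random.Random(seed)
    # Start from the given centers, or from k randomly chosen data points.
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center (Euclidean distance).
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of the points assigned to it.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center in place
        if new_centers == centers:  # no center moved, so the assignment is stable
            break
        centers = new_centers
    return labels, centers


# Hypothetical toy data: two well-separated groups of two-feature customers.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

# Deliberately poor start: both initial centers come from the same group.
labels, centers = kmeans(points, k=2, init=[points[0], points[1]])
print(labels)
print(centers)
```

Even with both initial centers taken from the same group, the alternating steps separate the two groups after a few iterations on this toy data. On real data the result can depend on the starting centers, which is why library implementations usually try several initializations.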

Two methods that can be useful to find this mysterious k in k-Means are:

The Elbow Method

Calculate the Within-Cluster Sum of Squared Errors (WSS) for different values of k, and choose the k at which the decrease in WSS first starts to level off. In the plot of WSS versus k, this is visible as an elbow.

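Within-Cluster Sum of Squared Errors sounds a bit complex. Let’s break it down: the squared error for a point is the squared distance between the point and the center of the cluster it is assigned to, and the WSS is the sum of these squared errors over all points.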
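One way to make the WSS concrete is to compute it by hand for a few cluster assignments. The sketch below is only an illustration: the toy points and the label lists for k = 1, 2, 3 are assigned by hand for this example rather than produced by a fitted model, and the helper names `centroid` and `wss` are hypothetical.

```python
import math


def centroid(members):
    """Mean of a list of equal-length tuples, coordinate by coordinate."""
    return tuple(sum(coord) / len(members) for coord in zip(*members))


def wss(points, labels):
    """Within-cluster sum of squared errors for a given cluster assignment."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        center = centroid(members)
        # Squared error of a point = squared Euclidean distance to its cluster center.
        total += sum(math.dist(p, center) ** 2 for p in members)
    return total


# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]

# Hand-assigned labels for k = 1, 2 and 3 (illustrative, not fitted).
assignments = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 0, 1, 1, 2],
}
wss_by_k = {k: wss(points, labels) for k, labels in assignments.items()}
for k, value in wss_by_k.items():
    print(f"k={k}: WSS={value:.3f}")
```

On this toy data the WSS falls sharply from k = 1 to k = 2 and only slightly from k = 2 to k = 3, which is the elbow shape the method looks for.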

The Silhouette Method

The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).

The range of the Silhouette value is between +1 and -1. A high value is desirable and indicates that the point is placed in the correct cluster. If many points have a negative Silhouette value, it may indicate that we have created too many or too few clusters.

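The Silhouette Score is the average silhouette value over all points. It is computed for each candidate k, and the k at which it reaches its maximum is taken as the optimal number of clusters.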
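The silhouette value of a point is commonly defined as s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes it in plain Python under that definition; the toy points, the hand-assigned label lists, and the helper names are assumptions for illustration, and at least two clusters are required.

```python
import math


def silhouette_values(points, labels):
    """Per-point silhouette values s = (b - a) / max(a, b)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    values = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            values.append(0.0)  # common convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster (cohesion).
        # The point itself contributes distance 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster (separation).
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        values.append((b - a) / max(a, b))
    return values


def silhouette_score(points, labels):
    """Average silhouette value over all points."""
    values = silhouette_values(points, labels)
    return sum(values) / len(values)


# Hypothetical toy data and hand-assigned labels for k = 2 and k = 3.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
score_k2 = silhouette_score(points, [0, 0, 0, 1, 1, 1])
score_k3 = silhouette_score(points, [0, 0, 0, 1, 1, 2])
print(f"k=2: {score_k2:.3f}")
print(f"k=3: {score_k3:.3f}")
```

Here the two-cluster assignment scores higher than the three-cluster one, because splitting a tight group lowers the separation term b for its points.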


Dimensionality Reduction

Autoencoders

An autoencoder is an unsupervised artificial neural network that learns how to efficiently compress and encode data, then learns how to reconstruct the data from the reduced encoded representation to a representation that is as close to the original input as possible.

An autoencoder, by design, reduces data dimensions by learning how to ignore the noise in the data.

Autoencoder for MNIST

Autoencoder Components:

An autoencoder consists of four main parts:

1- Encoder: the part in which the model learns how to reduce the input dimensions and compress the input data into an encoded representation.

2- Bottleneck: the layer that contains the compressed representation of the input data. This is the lowest-dimensional representation of the input in the network.

3- Decoder: the part in which the model learns how to reconstruct the data from the encoded representation so that it is as close to the original input as possible.

4- Reconstruction Loss: the method that measures how well the decoder is performing and how close the output is to the original input.
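The four parts can be seen together in a deliberately tiny example. The sketch below is a linear autoencoder written in plain Python that compresses 2 inputs into a 1-unit bottleneck and trains with full-batch gradient descent on a mean squared reconstruction loss. It is an assumption-laden toy, not the network used in this project: the data, the starting weights, the learning rate, and the function name `train_autoencoder` are all chosen for illustration, and a practical autoencoder would use nonlinear layers in a deep learning framework.

```python
def train_autoencoder(data, lr=0.05, epochs=500):
    """Train a linear 2 -> 1 -> 2 autoencoder with full-batch gradient descent."""
    w_enc = [0.1, 0.1]  # encoder weights: 2 inputs -> 1 bottleneck unit
    w_dec = [0.1, 0.1]  # decoder weights: 1 bottleneck unit -> 2 outputs
    n = len(data)
    history = []
    for _ in range(epochs):
        grad_enc = [0.0, 0.0]
        grad_dec = [0.0, 0.0]
        loss = 0.0
        for x in data:
            z = w_enc[0] * x[0] + w_enc[1] * x[1]    # encoder output (bottleneck code)
            x_hat = [w_dec[0] * z, w_dec[1] * z]     # decoder output (reconstruction)
            err = [x_hat[0] - x[0], x_hat[1] - x[1]]
            loss += (err[0] ** 2 + err[1] ** 2) / n  # reconstruction loss (squared error)
            # Backpropagate the squared error through the decoder to the bottleneck.
            dz = 2 * (err[0] * w_dec[0] + err[1] * w_dec[1])
            for j in range(2):
                grad_dec[j] += 2 * err[j] * z / n
                grad_enc[j] += dz * x[j] / n
        for j in range(2):
            w_enc[j] -= lr * grad_enc[j]
            w_dec[j] -= lr * grad_dec[j]
        history.append(loss)
    return w_enc, w_dec, history


# Hypothetical toy data: 2-D points lying on the line y = 2x, so one
# bottleneck dimension is enough to reconstruct them.
data = [(t, 2 * t) for t in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w_enc, w_dec, history = train_autoencoder(data)

code = w_enc[0] * 1.0 + w_enc[1] * 2.0            # encode the point (1, 2)
reconstruction = [w_dec[0] * code, w_dec[1] * code]
print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.2e}")
print(f"code: {code:.4f}, reconstruction: {reconstruction}")
```

After training, the one-number code is the reduced representation that would be passed to a clustering step in place of the original features.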


Model Performance

WCSS: Within-Cluster Sum of Squared Errors (the WSS described under the Elbow Method); k: number of clusters.

It is clear that the WCSS is lower after dimensionality reduction by the autoencoder.


Model Prediction

Customer Segmentation by k-means clustering


Conclusion

After analyzing the whole credit card dataset, we segmented the customers into four clusters, which marketers can use to target specific customers according to their needs with offers, loans, etc.


Further Improvements