Prerequisites: K-Means Clustering
In the clustering algorithms we have studied before, we used compactness (the distance between data points) as the characteristic for clustering our data points. However, we can also use the connectivity between data points as a feature for clustering them. Using connectivity, we can place two data points in the same cluster even if the distance between them is large.
Spectral Clustering
Spectral Clustering is a clustering algorithm that uses the connectivity between data points to form clusters. It uses the eigenvalues and eigenvectors of a matrix derived from the data (the graph Laplacian of a similarity graph) to project the data into a lower-dimensional space, in which the data points are then clustered. It is based on a graph representation of the data, where the data points are represented as nodes and the similarity between two data points is represented by an edge.
Steps performed for spectral Clustering
Building the Similarity Graph Of The Data: This step builds the Similarity Graph in the form of an adjacency matrix which is represented by A. The adjacency matrix can be built in the following manners:
- Epsilon-neighbourhood Graph: A parameter epsilon is fixed beforehand. Then, each point is connected to all the points that lie within its epsilon-radius. Since the distances between connected points are all of a similar scale (at most epsilon), the edge weights, i.e. the distances between the points, are typically not stored because they provide little additional information. Thus, in this case, the graph built is an undirected and unweighted graph.
- K-Nearest Neighbours: A parameter k is fixed beforehand. Then, for two vertices u and v, an edge is directed from u to v only if v is among the k-nearest neighbours of u. Note that this leads to a directed graph (optionally weighted by the similarity of the endpoints), because v being one of the k-nearest neighbours of u does not guarantee that u is among the k-nearest neighbours of v. To make this graph undirected, one of the following approaches is followed:
- Connect u and v with an undirected edge if either v is among the k-nearest neighbours of u OR u is among the k-nearest neighbours of v.
- Connect u and v with an undirected edge only if v is among the k-nearest neighbours of u AND u is among the k-nearest neighbours of v.
- Fully-Connected Graph: To build this graph, each point is connected to every other point by an undirected edge weighted by the similarity between the two points. Since this approach is meant to model local neighbourhood relationships, the Gaussian similarity function $s(x_i, x_j) = \exp(-\lVert x_i - x_j \rVert^2 / (2\sigma^2))$ is typically used to compute the weights, where $\sigma$ controls the width of the neighbourhoods.
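The three graph constructions above can be sketched in a few lines of NumPy. This is a minimal sketch rather than a reference implementation: the helper names (`pairwise_distances`, `epsilon_graph`, `knn_graph`, `gaussian_graph`), the five sample points, and the parameter values are all assumptions chosen for illustration, and the dense pairwise-distance matrix is only practical for small datasets.

```python
import numpy as np

def pairwise_distances(X):
    """Dense matrix of Euclidean distances between all rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def epsilon_graph(X, eps):
    """Unweighted, undirected graph: connect points at distance <= eps."""
    A = (pairwise_distances(X) <= eps).astype(int)
    np.fill_diagonal(A, 0)  # no self-loops
    return A

def knn_graph(X, k, mutual=False):
    """Undirected k-nearest-neighbour graph (OR rule, or AND rule if mutual)."""
    dist = pairwise_distances(X)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    n = X.shape[0]
    directed = np.zeros((n, n), dtype=bool)
    directed[np.repeat(np.arange(n), k), nearest.ravel()] = True
    undirected = directed & directed.T if mutual else directed | directed.T
    return undirected.astype(int)

def gaussian_graph(X, sigma):
    """Fully connected graph weighted by the Gaussian similarity."""
    W = np.exp(-pairwise_distances(X) ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

# Hypothetical sample: three points near the origin, two points far away
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]])

A_eps = epsilon_graph(points, eps=1.5)
A_knn = knn_graph(points, k=2)
A_mutual = knn_graph(points, k=2, mutual=True)
W = gaussian_graph(points, sigma=1.0)

print(A_eps)
print(A_knn)
print(A_mutual)
print(W.round(3))
```

The AND rule (`mutual=True`) can only remove edges relative to the OR rule, so it tends to produce a sparser graph.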
Projecting the data onto a lower Dimensional Space: This step accounts for the possibility that members of the same cluster may be far apart in the original space. The dimensionality is therefore reduced so that those points lie closer together in the reduced space and can be clustered together by a traditional clustering algorithm. This is done by computing the Graph Laplacian Matrix.
Python Code For Graph Laplacian Matrix
To compute it, the degree of a node must first be defined. The degree of the ith node is given by $d_i = \sum_{j=1}^{n} w_{ij}$. Note that $w_{ij}$ is the weight of the edge between the nodes i and j, as defined in the adjacency matrix above.
Python3
# Defining the adjacency matrix
import numpy as np

A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])
The degree matrix is defined as follows:-
Python3
# Degree matrix: the row sums of A placed on the diagonal
D = np.diag(A.sum(axis=1))
print(D)
Thus the Graph Laplacian Matrix is defined as:-
Python3
# Graph Laplacian: degree matrix minus adjacency matrix
L = D - A
print(L)
This matrix is then normalized for mathematical efficiency. To reduce the dimensions, the eigenvalues and their corresponding eigenvectors are calculated first. If the number of clusters is k, the eigenvectors belonging to the k smallest eigenvalues are taken and stacked into a matrix such that the eigenvectors are its columns.
Code For Calculating eigenvalues and eigenvector of the matrix in Python
Python3
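# Find the eigenvalues and eigenvectors of the graph Laplacian L.
# L is symmetric, so np.linalg.eigh can be used; it returns the
# eigenvalues in ascending order with the eigenvectors as columns.
vals, vecs = np.linalg.eigh(L)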
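The normalization and the selection of the first k eigenvectors can be sketched as follows. This is only a sketch under stated assumptions: the text does not say which normalization is meant, so the symmetric form $L_{sym} = I - D^{-1/2} A D^{-1/2}$ is assumed here as one common choice, and the 6-node graph and the names `L_sym` and `U` are hypothetical.

```python
import numpy as np

# Hypothetical 6-node graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# Assumed normalization: L_sym = I - D^(-1/2) A D^(-1/2)
degrees = A.sum(axis=1)
d_inv_sqrt = 1.0 / np.sqrt(degrees)  # assumes every node has at least one edge
L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

# eigh returns eigenvalues in ascending order, eigenvectors as columns
vals, vecs = np.linalg.eigh(L_sym)

# Keep the eigenvectors of the k smallest eigenvalues as the columns of U
k = 2
U = vecs[:, :k]

print(vals.round(3))
print(U.shape)
```

Each row of `U` is the low-dimensional representation of one node, which is what the clustering step operates on.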
Clustering the Data: This process mainly involves clustering the reduced data using any traditional clustering technique, typically K-Means Clustering. First, each node is assigned the corresponding row of the eigenvector matrix built from the normalized Graph Laplacian Matrix. Then this data is clustered using any traditional technique. To map the clustering result back to the original data points, the node identifier is retained.
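Putting the steps together on the 10-node adjacency matrix defined earlier gives the following minimal sketch. It assumes the unnormalized Laplacian L = D - A for simplicity and uses scikit-learn's `KMeans` for the final step; the choice of `k = 2` and the `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# The 10-node adjacency matrix used in the sections above
A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

# Unnormalized graph Laplacian (an assumption made to keep the sketch short)
D = np.diag(A.sum(axis=1))
L = D - A

# Eigenvectors of the k smallest eigenvalues form the reduced representation
k = 2
vals, vecs = np.linalg.eigh(L)
U = vecs[:, :k]

# Row i of U represents node i; cluster the rows with K-Means
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = kmeans.fit_predict(U)

# The row index is the node identifier, so labels[i] belongs to node i
for node, label in enumerate(labels):
    print(f"node {node} -> cluster {label}")
```

This graph has two connected components, nodes {0, 1, 2, 8, 9} and nodes {3, 4, 5, 6, 7}, so each component should end up in its own cluster. Which component receives label 0 and which receives label 1 is arbitrary.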
Properties:
- Assumption-Less: Unlike other traditional techniques, this clustering technique does not assume that the data follows some property, such as a particular cluster shape. This lets it address a more generic class of clustering problems.
- Ease of implementation and Speed: The algorithm is straightforward to implement with standard linear algebra routines, and it is fast on small or sparse datasets because it mainly consists of matrix computations.
- Not-Scalable: Since it involves building matrices and computing eigenvalues and eigenvectors, it is time-consuming for large, dense datasets.
- Dimensionality Reduction: The algorithm uses eigenvalue decomposition to reduce the dimensionality of the data, making it easier to visualize and analyze.
- Cluster Shape: This technique can handle non-linear cluster shapes, making it suitable for a wide range of applications.
- Noise Sensitivity: It is sensitive to noise and outliers, which may affect the quality of the resulting clusters.
- Number of Clusters: The algorithm requires the user to specify the number of clusters beforehand, which can be challenging in some cases.
- Memory Requirements: The algorithm requires significant memory to store the similarity matrix, which can be a limitation for large datasets.
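The memory requirement listed above can be made concrete with some quick arithmetic. The sketch below assumes a dense similarity matrix stored as 64-bit floats (8 bytes per entry); the function name `dense_matrix_bytes` and the sample sizes are hypothetical.

```python
def dense_matrix_bytes(n_samples, bytes_per_entry=8):
    """Memory needed for a dense n x n similarity matrix."""
    return n_samples * n_samples * bytes_per_entry

# Memory grows quadratically with the number of samples
for n in (1_000, 10_000, 100_000):
    gib = dense_matrix_bytes(n) / 2**30
    print(f"{n:>7} samples -> {gib:,.2f} GiB")
```

Under these assumptions, 10,000 samples need about 0.75 GiB and 100,000 samples need about 74.5 GiB, which is why sparse graphs such as the k-nearest-neighbour graph are attractive for larger datasets.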
Credit Card Data Clustering Using Spectral Clustering
The below steps demonstrate how to implement Spectral Clustering using Sklearn. The data for the following steps is the Credit Card Data which can be downloaded from Kaggle.
Step 1: Importing the required libraries
We will first import all the libraries that are needed for this project
Python3
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import SpectralClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
Step 2: Loading and Cleaning the Data
Python3
# Changing the working location to the location of the data
# (os.chdir works in a plain script; the bare `cd` command only works in IPython/Jupyter)
import os
os.chdir(r"C:\Users\Dev\Desktop\Kaggle\Credit_Card")

# Loading the data
X = pd.read_csv('CC_GENERAL.csv')

# Dropping the CUST_ID column from the data
X = X.drop('CUST_ID', axis=1)

# Handling the missing values, if any, by forward-filling
# (fillna(method='ffill') is deprecated in recent pandas versions)
X.ffill(inplace=True)

X.head()
Step 3: Preprocessing the data to make the data visualizable
Python3
# Preprocessing the data to make it visualizable

# Scaling the Data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Normalizing the Data
X_normalized = normalize(X_scaled)

# Converting the numpy array into a pandas DataFrame
X_normalized = pd.DataFrame(X_normalized)

# Reducing the dimensions of the data
pca = PCA(n_components=2)
X_principal = pca.fit_transform(X_normalized)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']

X_principal.head()
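To see what each preprocessing step does without the Kaggle file, the same pipeline can be run on synthetic data. This is a small sketch: the array `X_demo`, its shape, and the feature scales are made-up stand-ins, not the credit card data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 100 rows, 5 features on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5)) * np.array([1, 10, 100, 1000, 10000])

# StandardScaler works per column: each feature gets zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_demo)

# normalize works per row: each sample is rescaled to unit L2 norm by default
X_normalized = normalize(X_scaled)

# PCA keeps the two directions of largest variance for plotting
X_2d = PCA(n_components=2).fit_transform(X_normalized)

row_norms = np.linalg.norm(X_normalized, axis=1)
print(X_scaled.mean(axis=0).round(6))
print(row_norms[:5])
print(X_2d.shape)
```

Note that scaling acts on columns while normalizing acts on rows, so the two steps are not redundant.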
Step 4: Building the Clustering models and Visualizing the Clustering
In the steps below, two Spectral Clustering models are built with different values for the parameter ‘affinity’. The SpectralClustering class is described in the scikit-learn documentation.
a) affinity = ‘rbf’
Python3
# Building the clustering model
spectral_model_rbf = SpectralClustering(n_clusters=2, affinity='rbf')

# Training the model and Storing the predicted cluster labels
labels_rbf = spectral_model_rbf.fit_predict(X_principal)
Python3
# Building the label to colour mapping
colours = {0: 'b', 1: 'y'}

# Creating the figure first so that everything is drawn on the same axes
plt.figure(figsize=(9, 9))

# Plotting one scatter series per cluster so the legend matches the colours
for label, colour in colours.items():
    mask = labels_rbf == label
    plt.scatter(X_principal.loc[mask, 'P1'], X_principal.loc[mask, 'P2'],
                color=colour, label=f'Label {label}')

plt.legend()
plt.show()
b) affinity = ‘nearest_neighbors’
Python3
# Building the clustering model
spectral_model_nn = SpectralClustering(n_clusters=2, affinity='nearest_neighbors')

# Training the model and Storing the predicted cluster labels
labels_nn = spectral_model_nn.fit_predict(X_principal)
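The idea of connectivity versus compactness is easier to see on a toy dataset than on the credit card data. The sketch below is an illustration only: it uses scikit-learn's `make_moons` generator (not part of the original walkthrough) and compares K-Means with a nearest-neighbours Spectral Clustering model; the sample size, noise level, `n_neighbors`, and `random_state` values are assumptions.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: connected shapes that are not compact blobs
X_moons, y_true = make_moons(n_samples=200, noise=0.05, random_state=0)

# K-Means groups points by distance to a centroid (compactness)
labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# Spectral Clustering on a k-nearest-neighbour graph groups by connectivity
labels_sc = SpectralClustering(
    n_clusters=2,
    affinity='nearest_neighbors',
    n_neighbors=10,
    random_state=0,
).fit_predict(X_moons)

# Adjusted Rand Index: 1.0 means perfect agreement with the true grouping
ari_km = adjusted_rand_score(y_true, labels_km)
ari_sc = adjusted_rand_score(y_true, labels_sc)
print(f"K-Means ARI:  {ari_km:.3f}")
print(f"Spectral ARI: {ari_sc:.3f}")
```

On data shaped like this, one would generally expect the connectivity-based model to follow the two half-circles more closely than K-Means, but the exact scores depend on the parameters chosen.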
Step 5: Evaluating the performances
Python3
# List of different values of affinity (used as labels on the bar chart)
affinity = ['rbf', 'nearest-neighbours']

# List of Silhouette Scores
s_scores = []

# Evaluating the performance on X_principal, the data the models were fitted on
s_scores.append(silhouette_score(X_principal, labels_rbf))
s_scores.append(silhouette_score(X_principal, labels_nn))

print(s_scores)
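As a quick sanity check on how to read the Silhouette Score, the sketch below computes it for a good and a poor labelling of the same synthetic data. The two blobs and both label vectors are made up for illustration; the score lies between -1 and 1, and higher values indicate tighter, better-separated clusters.

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical data: two tight blobs centred at (0, 0) and (5, 5)
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
X_demo = np.vstack([blob_a, blob_b])

# A labelling that matches the blobs, and one that alternates 0, 1, 0, 1, ...
good_labels = np.array([0] * 50 + [1] * 50)
bad_labels = np.arange(100) % 2

score_good = silhouette_score(X_demo, good_labels)
score_bad = silhouette_score(X_demo, bad_labels)

print(f"matching labels:    {score_good:.3f}")
print(f"alternating labels: {score_bad:.3f}")
```

The labelling that matches the blobs scores close to 1, while the alternating labelling, which mixes both blobs into each cluster, scores close to 0.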
Step 6: Comparing the performances
Python3
# Plotting a Bar Graph to compare the models
plt.bar(affinity, s_scores)
plt.xlabel('Affinity')
plt.ylabel('Silhouette Score')
plt.title('Comparison of different Clustering Models')
plt.show()
Spectral Clustering is a type of clustering algorithm in machine learning that uses eigenvectors of a similarity matrix to divide a set of data points into clusters. The basic idea behind spectral clustering is to use the eigenvectors of the Laplacian matrix of a graph to represent the data points and find clusters by applying k-means or another clustering algorithm to the eigenvectors.
Advantages of Spectral Clustering:
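- Scalability: Spectral clustering can handle high-dimensional data, as it reduces the dimensionality of the data before clustering. Handling large datasets is harder and typically depends on using a sparse similarity graph, because the cost of the similarity matrix and the eigendecomposition grows quickly with the number of points.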
- Flexibility: Spectral clustering can be applied to non-linearly separable data, as it does not rely on traditional distance-based clustering methods.
- Robustness: Spectral clustering can be more robust to noise and outliers in the data, as it considers the global structure of the data, rather than just local distances between data points.
Disadvantages of Spectral Clustering:
- Complexity: Spectral clustering can be computationally expensive, especially for large datasets, as it requires the calculation of eigenvectors and eigenvalues.
- Model selection: Choosing the right number of clusters and the right similarity matrix can be challenging and may require expert knowledge or trial and error.