Clustering:
Before learning Consensus Clustering, we must understand what clustering is. In Machine Learning, clustering is a technique for grouping objects into separate clusters according to their similarity, i.e. similar objects end up in the same cluster, separated from other clusters of similar objects. It is an unsupervised learning method. A few frequently used clustering algorithms are K-means, K-prototype, DBSCAN, etc. A minimal sketch with K-means is shown below.
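To make this concrete, here is a minimal clustering sketch using scikit-learn's KMeans on a synthetic dataset; the dataset and parameter values are illustrative assumptions, not taken from a specific example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy dataset with 3 well-separated groups (assumed for illustration).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit K-means with K=3; each point gets a cluster label 0, 1 or 2.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels[:10])               # cluster assignments of the first 10 points
print(kmeans.cluster_centers_)   # coordinates of the 3 centroids
```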
Consensus Clustering:
There are a few drawbacks to the normal clustering process. Algorithms like K-means or K-prototype use a random initialization procedure, which can produce different cluster assignments on each run of the algorithm. There is also a need to choose the value of K, which is generally done with the Elbow method. The result therefore depends heavily on these choices, and the clusters produced can be biased and unstable (the sketch below illustrates this instability). To eliminate these drawbacks, we follow a different clustering approach: Consensus Clustering.
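As a rough illustration of that instability, the following sketch (synthetic data and parameters are assumptions) runs K-means twice with a single random initialization each and compares the two partitions with the Adjusted Rand Index; a value below 1.0 means the two runs disagree.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Overlapping blobs make the effect of random initialization visible.
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=2.5, random_state=7)

# Two runs that differ only in the random seed used for centroid initialization.
labels_a = KMeans(n_clusters=5, n_init=1, random_state=1).fit_predict(X)
labels_b = KMeans(n_clusters=5, n_init=1, random_state=2).fit_predict(X)

# An ARI of 1.0 would mean identical partitions; lower values show the
# instability that Consensus Clustering tries to smooth out.
print("ARI between the two runs:", adjusted_rand_score(labels_a, labels_b))
```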
The word ‘Consensus’ comes from a Latin word meaning ‘general agreement’. Consensus Clustering is a technique for combining multiple clusterings into a single, more stable clustering that is better than the input clusterings. All the input clusterings are merged into one stable result, and this process can be carried out iteratively by generating a Consensus Matrix at each level, which records how often pairs of objects are grouped together (a sketch follows).
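A minimal sketch of how such a Consensus Matrix can be built is shown below; using K-means runs with different random seeds as the base clusterings is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
n = X.shape[0]
n_runs = 20

consensus = np.zeros((n, n))
for seed in range(n_runs):
    labels = KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X)
    # Entry (i, j) counts how often points i and j land in the same cluster.
    consensus += (labels[:, None] == labels[None, :]).astype(float)

consensus /= n_runs   # values near 1 mean the pair almost always co-clusters
print(consensus[:5, :5])
```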
Advantages of Consensus Clustering:
- Better quality and robustness of the clusters.
- Producing the correct number of clusters.
- Better handling of missing data.
- Individual partitions can be obtained independently.
Process of Consensus Clustering:
Consensus Clustering is based on two phases:
- Partition Generation: In this stage, different partitions of the data objects are created by using different subsets of the data attributes, applying different clustering algorithms with different biases, choosing different clustering parameters, and using different random subsamples of the whole dataset. Once the initial partitions are generated, we move on to generating a consensus among them, and new partitions can then be generated based on the previous consensus.
- Consensus Generation: The consensus among the data partitions is generated using a Consensus Function, which is generally obtained through one of these approaches:
- Median Partitioning based approach: Here, the data points of the different partitions are grouped together by a similarity index, and new partitions are formed based on the medians of the data points of the previous partitions. The similarity index depends on the agreement and disagreement of the data points, which is measured by the F-measure, the Rand index, etc.
- Co-occurrence based approach: In this approach, there are three methods we can use: 1. Relabeling/Voting based method, 2. Co-association matrix-based method, 3. Graph-based method. The Relabeling/Voting based method generates the new clusters by determining the correspondence with the current consensus: each instance receives votes from its cluster assignments, and the consensus and the cluster assignments are updated accordingly. The Co-association matrix-based method generates the new clusters from a co-association matrix built from the pairwise similarity of data points, and the Graph-based method represents the multiple clusterings as a weighted graph and finds the optimal partition by minimizing the graph cut. An end-to-end sketch covering partition generation and the co-association method appears after this list.
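Putting the two phases together, the sketch below generates partitions with K-means on random subsamples while varying K, accumulates a co-association matrix, and extracts the consensus with average-linkage hierarchical clustering via SciPy. The subsample ratio, the range of K, and the final number of clusters are assumptions chosen for illustration, not prescribed here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=300, centers=4, random_state=3)
n = X.shape[0]
n_partitions = 30

co_assoc = np.zeros((n, n))   # how often a pair ends up in the same cluster
counts = np.zeros((n, n))     # how often a pair appears together in a subsample

# ---- Phase 1: partition generation (random subsamples, varying K) ----
for p in range(n_partitions):
    idx = rng.choice(n, size=int(0.8 * n), replace=False)   # assumed 80% subsample
    k = int(rng.integers(3, 7))                              # assumed range of K
    labels = KMeans(n_clusters=k, n_init=1, random_state=p).fit_predict(X[idx])
    same = (labels[:, None] == labels[None, :]).astype(float)
    co_assoc[np.ix_(idx, idx)] += same
    counts[np.ix_(idx, idx)] += 1.0

# ---- Phase 2: consensus generation (co-association matrix method) ----
co_assoc = np.divide(co_assoc, counts, out=np.zeros_like(co_assoc), where=counts > 0)
distance = 1.0 - co_assoc        # frequent co-clustering -> small distance
np.fill_diagonal(distance, 0.0)

Z = linkage(squareform(distance, checks=False), method="average")
final_labels = fcluster(Z, t=4, criterion="maxclust")   # assumed 4 consensus clusters
print(np.bincount(final_labels)[1:])                    # sizes of the consensus clusters
```

Cutting the dendrogram at 4 clusters is only one possible choice; in practice the consensus matrix itself can be inspected (e.g. how cleanly it separates into blocks) to help decide the number of clusters.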
There are many different Consensus Clustering algorithms based on different ways of generating the consensus function, and research is still ongoing to improve the existing models.