Reference: Zhou Z H. Ensemble methods: foundations and algorithms[M]. CRC Press, 2012.
Clustering aims to find the inherent structure of the unlabeled data by grouping them into clusters of objects. A satisfactory clustering method should increase the intra-cluster similarity while decrease inter-cluster similarity meanwhile. Define the sample sets , then the clusters can be defined as with and .
1. Partitioning Methods
A partitioning method organizes into partitions by optimizing an objective partitioning criterion. is one the well-known partitioning methods. It optimizes the error function , where .
To circumvent the solution, k-means adopts an iterative relocation technique to find the desired solution heuristically.
-step 1: select k instances from D as initial centers;
-step 2: find the nearest center for each instance;
-step 3: update the center ;
-step 4: repeat until convergence.
This figure shows how k-means works:
Another well-known clustering algorithm is clustering method, which introduces "message passing" between data points. AP takes as input a collection of real-valued similarities between data points, where the similarity indicates how well the data point with index is suited to be the class center for data point . When the goal is o minimize the squared error, each similarity is set to a negative Euclidean distance: for points and, .
Rather than requiring that the number of clusters be prespecified, AP takes as input a real number for each data point so that data points with larger values of are more likely to be chosen as class centers. These values are referred to as 'preferences', also called 'exemplars'. There are two kinds of messages exchanged between data points, i.e. 'responsibility' and 'availability'.
- Responsibility: Matrix , denotes how well-suited is to serve as the exemplar for point , taking into account other potential exemplars for point .
- Availability: Matrix , denotes how appropriate it would be for point to choose point as its exemplar, taking into account the support from other points that point should be an exemplar. The messages need only be exchanged between pairs of points with known similarities. The following figures and algorithms are how AP works.
The advantages of AP include stable convergence and independence of clustering prior numbers and initial points. However, due to time complexity, tradeoff of the algorithm is required on large scale data.
2. Hierarchical Methods
It creates a hierarchy of clustering on at various granular levels, where a specific clustering can be obtained by threshholding the hierarchy at a specific level of granule. A typical method is SAHN method:
- step 1: Each data point is place into a cluster as its own, then the similarity matrix has the component
- step 2: Two closest clusters are identified based on and replaced by the agglomerated cluster , then .
All coefficients are given before calculations.
This figure shows how SAHN works:
3. Density-based Methods
A density-based method constructs clusters on based on the notion of the density, where regions of instances with high density are regarded as clusters which are separated by regions of low density. DBSCAN is a representative density-based clustering method, which characterizes the density of the data space with a pair of parameters . This pair parameters determines a unique -neighborhood for any given point .
- step 1: DBSCAN identifies core objects which satisfy the requirement imposed by the parameters.
- step 2: It forms clusters by iteratively connecting the directly density-reachable instances starting from those core objects.
- step 3: The connecting process terminates when no new data point can be added to any cluster.
4. Grid-based Methods
A grid based method quantizes into a finite number of cells forming a grid-structure, where the quantization process is usually performed in a multi-resolution style. STING is a representative grid-based method, which divides the data space into a number of rectangular cells. Each cell stores statistical information of the instances falling into this cell. There are several levels of rectangular cells, each corresponding to a different level of resolution. Here, each cell at a higher level is partitioned into a number of cells at the next lower level, and statistical information of high-level cells can be easily inferred from its lower-level cells with simple operations such as elementary algebraic calculations.
5. Model-based Methods
It assumes a mathematical model characterizing the properties of , where the clusters are formed to optimize the fit between the data and the underlying model. The most famous model-based method is GMM-based clustering, which works b utilizing the Gaussian Mixture Model (GMM)
where and .
The cluster label for each is specified according to the rule
is learned from by employing the popular EM procedure to maximize log-likelihood function in an iterative manner.