In unsupervised learning, there are no labels associated with the features. Generally speaking, the goal of unsupervised learning is to find patterns and structures that help us better understand the data. Sometimes we also use unsupervised learning to model a distribution, but we generally do not make predictions.
There are three types of clustering:
1. Partitional (centroid, graph-theoretic, spectral);
2. Hierarchical (agglomerative, divisive);
3. Bayesian (decision-based, non-parametric).
\(k\)-means is a partitional, centroid-based clustering algorithm. It proceeds as follows:
1. Randomly pick \(k\) cluster centers;
2. Assign each point to its closest center;
3. Update each cluster center to the centroid of its assigned points;
4. If the centers have not converged, return to step 2.
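Below is a minimal NumPy sketch of this loop (Lloyd's algorithm). The function name `kmeans`, the initialization from randomly chosen data points, and the convergence tolerance are choices made for illustration, not part of the notes above.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Minimal k-means (Lloyd's algorithm) on an (n, d) data matrix X."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct data points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the centroid of its assigned points.
        # (Empty clusters are not handled in this sketch.)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centers stop moving.
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers
```

For example, `labels, centers = kmeans(X, k=3)` partitions the rows of an `(n, d)` array `X` into three clusters.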
Let \(G = (V, E)\) be a graph with vertex set \(V\) and edge set \(E\). Each edge \(e \in E\) can be weighted or unweighted, and it encodes the similarity between the data points it connects.
If each vertex represents a data point, then finding a clustering of these points (for \(k = 2\)) is equivalent to partitioning \(V\) into \(V_1\) and \(V_2\). Partitioning \(V\) means splitting the graph, so we can define an objective function that measures how good a split is and then optimize it to find the best partition. Consider the objective \(\text{Cut}(V_1, V_2) = \sum_{i \in V_1, j \in V_2} w_{ij}\); we would like to split \(V\) so that the \(\text{Cut}\) is minimized. However, minimizing the cut alone can lead to a degenerate solution in which \(|V_1| \ll |V_2|\) (e.g. \(|V_1| = 1\) and \(|V_2| = |V| - 1\)). We want to balance the cardinalities of \(V_1\) and \(V_2\), which leads to "balanced" cuts such as \[ \text{Ratio Cut}(V_1, V_2) = \frac{\text{Cut}(V_1, V_2)}{|V_1|} + \frac{\text{Cut}(V_1, V_2)}{|V_2|}, \] or \[ \text{Normalized Cut}(V_1, V_2) = \frac{\text{Cut}(V_1, V_2)}{\sum_{i \in V_1} d_i} + \frac{\text{Cut}(V_1, V_2)}{\sum_{j \in V_2} d_j}, \] where \(d_i = \sum_j w_{ij}\) is the degree of vertex \(i\).
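The sketch below computes these objectives for a given partition of a weighted adjacency matrix `W`; the function names and the toy graph (two triangles joined by one weak edge) are my own illustration, not from the notes above.

```python
import numpy as np

def cut_value(W, V1, V2):
    """Cut(V1, V2): total weight of edges crossing the partition."""
    return W[np.ix_(V1, V2)].sum()

def ratio_cut(W, V1, V2):
    c = cut_value(W, V1, V2)
    return c / len(V1) + c / len(V2)

def normalized_cut(W, V1, V2):
    c = cut_value(W, V1, V2)
    d = W.sum(axis=1)                         # d_i = sum_j w_ij (vertex degrees)
    return c / d[V1].sum() + c / d[V2].sum()

# Toy graph: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1                        # weak bridge between the groups
print(ratio_cut(W, [0, 1, 2], [3, 4, 5]))      # ~0.067: balanced split across the weak edge
print(ratio_cut(W, [0], [1, 2, 3, 4, 5]))      # 2.4: unbalanced split is penalized
```

The example shows why the balance terms matter: cutting the single weak edge scores far better than isolating one vertex, even though the latter also has a small raw cut.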
We start with a similarity/adjacency matrix \(A\) of a graph \(G\). Let \(D\) be the diagonal degree matrix whose \(i\)-th diagonal entry is \(d_i = \sum^n_{k=1} w_{ik}\). Define the graph Laplacian \(L = D - A\). \(L\) has the following two properties:
1. \(L\) is symmetric;
2. \(L\) is positive semi-definite.
The second property implies that all eigenvalues of \(L\) are non-negative. We then compute the eigenvectors corresponding to the \(k\) smallest eigenvalues and stack them as the columns of a matrix \(V\). Finally, we run \(k\)-means on the rows of \(V\) to obtain the clustering result.
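A minimal sketch of this procedure, assuming a symmetric similarity matrix `A` and reusing the `kmeans` function sketched earlier (any \(k\)-means routine with the same interface would do):

```python
import numpy as np

def spectral_clustering(A, k):
    """Unnormalized spectral clustering on a symmetric similarity matrix A."""
    D = np.diag(A.sum(axis=1))        # degree matrix
    L = D - A                         # graph Laplacian: symmetric, positive semi-definite
    # eigh returns eigenvalues in ascending order, so the first k columns
    # are the eigenvectors of the k smallest eigenvalues.
    _, eigvecs = np.linalg.eigh(L)
    V = eigvecs[:, :k]                # stack the k smallest eigenvectors as columns
    # Run k-means on the rows of V; each row is the embedding of one vertex.
    labels, _ = kmeans(V, k)          # kmeans as sketched earlier (assumption)
    return labels
```

On the toy two-triangle graph `W` from the previous sketch, `spectral_clustering(W, 2)` should separate the two triangles.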
The basic idea of hierarchical clustering is to build a hierarchy among the data points, i.e. to arrange them from specific to general. An advantage of this approach is that there is no need to specify \(k\), the number of clusters, in advance. The output of the algorithm is a binary tree. There are two types of hierarchical clustering:
1. Agglomerative clustering: a bottom-up approach that initially treats each data point as its own singleton cluster and progressively merges clusters;
2. Divisive clustering: a top-down approach that initially places all points in a single cluster and progressively splits clusters.
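A minimal sketch of the agglomerative loop is given below. The function name, the stopping rule (merge until `n_clusters` remain), and the `cluster_dist` parameter are assumptions for illustration; a full implementation would also record the sequence of merges to build the binary tree (dendrogram).

```python
import numpy as np

def agglomerative(X, cluster_dist, n_clusters=2):
    """Bottom-up agglomerative clustering on an (n, d) data matrix X.

    cluster_dist maps two arrays of points to a distance (a linkage rule,
    see the next paragraph). Returns the clusters, as lists of row indices,
    once n_clusters remain.
    """
    clusters = [[i] for i in range(len(X))]    # start: each point is a singleton cluster
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under the chosen linkage rule.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_dist(X[clusters[a]], X[clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair and continue.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```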
At each merge step of agglomerative clustering, we need to calculate the distance between all pairs of clusters in order to select the closest pair to merge. There are three common ways to define the distance between two clusters \(A\) and \(B\):
1. single-linkage: \(d(A, B) = \min_{x_1 \in A, x_2 \in B} d(x_1, x_2)\);
2. complete-linkage: \(d(A, B) = \max_{x_1 \in A, x_2 \in B} d(x_1, x_2)\);
3. average-linkage: \(d(A, B) = \frac{1}{|A| |B|}\sum_{x_1 \in A, x_2 \in B} d(x_1 , x_2)\).
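These three linkage rules can be sketched directly, assuming Euclidean distance for \(d(x_1, x_2)\); the function names are my own.

```python
import numpy as np

def pairwise_dists(A, B):
    """Euclidean distances between every row of A and every row of B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_linkage(A, B):
    return pairwise_dists(A, B).min()      # distance of the closest pair

def complete_linkage(A, B):
    return pairwise_dists(A, B).max()      # distance of the farthest pair

def average_linkage(A, B):
    return pairwise_dists(A, B).mean()     # (1 / |A||B|) * sum of all pairwise distances
```

Any of these can be passed as `cluster_dist` to the agglomerative sketch above, e.g. `agglomerative(X, single_linkage, n_clusters=3)`.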