website/content/research/clusteranalysis/notes/lec10-2.md at 220f2b75715691c6d91a40c4b5005cc5448ef717

mirror of https://github.com/Brandon-Rozek/website.git synced 2024-11-25 01:26:30 -05:00

Brandon Rozek 330ace0de9 Fixed titles, math rendering, and links on some pages

2021-07-26 09:13:20 -04:00

2.7 KiB

Raw Blame History

title	showthedate	math
Centroid-based Clustering	false	true

In centroid-based clustering, clusters are represented by some central vector which may or may not be a member of the dataset. In practice, the number of clusters is fixed to k and the goal is to solve some sort of optimization problem.

The similarity of two clusters is defined as the similarity of their centroids.

This problem is computationally difficult so there are efficient heuristic algorithms that are commonly employed. These usually converge quickly to a local optimum.

K-means clustering

This aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean which serves as the centroid of the cluster.

This technique results in partitioning the data space into Voronoi cells.

Description

Given a set of observations x, k-means clustering aims to partition the n observations into k sets S so as to minimize the within-cluster sum of squares (i.e. variance). More formally, the objective is to find argmin_s{\sum_{i = 1}^k{\sum_{x \in S_i}{||x-\mu_i||^2}}}= argmin_{s}{\sum_{i = 1}^k{|S_i|Var(S_i)}} where \mu_i is the mean of points in S_i. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster argmin_s{\sum_{i = 1}^k{\frac{1}{2|S_i|}\sum_{x, y \in S_i}{||x-y||^2}}}

Algorithm

Given an initial set of k means, the algorithm proceeds by alternating between two steps.

Assignment step: Assign each observation to the cluster whose mean has the least squared euclidean distance.

Intuitively this is finding the nearest mean
Mathematically this means partitioning the observations according to the Voronoi diagram generated by the means

Update Step: Calculate the new means to be the centroids of the observations in the new clusters

The algorithm is known to have converged when assignments no longer change. There is no guarantee that the optimum is found using this algorithm.

The result depends on the initial clusters. It is common to run this multiple times with different starting conditions.

Using a different distance function other than the squared Euclidean distance may stop the algorithm from converging.

Initialization methods

Commonly used initialization methods are Forgy and Random Partition.

Forgy Method: This method randomly chooses k observations from the data set and uses these are the initial means

This method is known to spread the initial means out

Random Partition Method: This method first randomly assigns a cluster to each observation and then proceeds to the update step.

This method is known to place most of the means close to the center of the dataset.

2.7 KiB Raw Blame History

K-means clustering

Description

Algorithm

Initialization methods

2.7 KiB

Raw Blame History