website/content/research/clusteranalysis/notes/lec11-1.md at 5ec66d1ff701ca407b5e79db1fa3df30555549dd

mirror of https://github.com/Brandon-Rozek/website.git synced 2024-10-30 01:12:07 -04:00

Brandon Rozek 330ace0de9 Fixed titles, math rendering, and links on some pages

2021-07-26 09:13:20 -04:00


title	showthedate	math
K-means++	false	true

K-means++ is an algorithm for choosing the initial values (seeds) for the k-means clustering algorithm. It was proposed as a way to avoid the sometimes poor clusterings that the standard k-means algorithm finds when its initial centers are chosen badly.

Intuition

The intuition behind this approach is to spread out the k initial cluster centers. The first cluster center is chosen uniformly at random from the data points being clustered. Each subsequent cluster center is then chosen from the remaining data points, with probability proportional to the squared distance between that point and its closest existing cluster center.
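
As a sketch of what "probability proportional to squared distance" means, write D(x) for the distance from a point x to its nearest already-chosen center and X for the set of data points (the symbols X and P are notation assumed here, not taken from the notes). Normalizing the weights gives:

$$P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$

A point that is already a center has D(x) = 0, so it has zero probability of being chosen again, while points far from every existing center are the most likely picks.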

Algorithm

The exact algorithm is as follows:

1. Choose one center uniformly at random from among the data points.
2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2.
4. Repeat steps 2 and 3 until k centers have been chosen.
5. Now that the initial centers have been chosen, proceed using standard k-means clustering.
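
The seeding steps can be sketched in a few lines of Python. This is a minimal illustration and not a reference implementation: the function names `squared_distance` and `kmeans_pp_seeds`, the use of Euclidean distance, the representation of points as tuples of numbers, and the fallback for the case where every point coincides with a chosen center are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_pp_seeds(points, k, rng=None):
    """Pick k initial centers from `points` following the seeding steps above."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = rng or random.Random()

    # Choose the first center uniformly at random.
    centers = [rng.choice(points)]

    # d2[i] holds D(x)^2 for points[i]: the squared distance to the
    # nearest center chosen so far.
    d2 = [squared_distance(p, centers[0]) for p in points]

    while len(centers) < k:
        if sum(d2) == 0:
            # Assumed fallback: every point coincides with a center,
            # so the weights are all zero; pick uniformly instead.
            new_center = rng.choice(points)
        else:
            # Sample one point with probability proportional to D(x)^2.
            new_center = rng.choices(points, weights=d2, k=1)[0]
        centers.append(new_center)

        # Refresh D(x)^2 now that one more center exists.
        d2 = [min(old, squared_distance(p, new_center))
              for p, old in zip(points, d2)]

    return centers


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
    print(kmeans_pp_seeds(data, 2, random.Random(0)))
```

The returned list would then serve as the starting centers for a standard k-means loop. Only the squared distance to the nearest center is stored per point, so adding a center costs one pass over the data rather than a comparison against every earlier center.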
