website/content/research/clusteranalysis/notes/lec4-3.md at 8f6ee8d8a9774df5c9ddbec4d62d39cd4d33349a

mirror of https://github.com/Brandon-Rozek/website.git synced 2024-11-22 16:26:28 -05:00

Brandon Rozek 330ace0de9 Fixed titles, math rendering, and links on some pages

2021-07-26 09:13:20 -04:00

title	showthedate	math
Cluster Tendency	false	true

Cluster tendency is the assessment of whether a data set is suitable for clustering at all. It determines whether the data has any inherent grouping structure.

This is a hard task since there are so many different definitions of clusters (partitioning, hierarchical, density, graph, etc.). Even after fixing a cluster type, it is still hard to define an appropriate null model for a data set.

One way we can go about measuring cluster tendency is to compare the data against random data. On average, random data should not contain clusters.

There are several clusterability assessment methods, such as the spatial histogram, the distance distribution, and the Hopkins statistic.

Hopkins Statistic

Let $X$ be the set of $n$ data points in $d$-dimensional space. Consider a random sample (without replacement) of $m \ll n$ data points from $X$. Also generate a set $Y$ of $m$ data points distributed uniformly at random over the same region as $X$.

Now define two distance measures: $u_i$, the distance from $y_i \in Y$ to its nearest neighbor in $X$, and $w_i$, the distance from the sampled point $x_i \in X$ to its nearest neighbor in $X$ other than itself.

We can then define the Hopkins statistic as

$$H = \frac{\sum_{i=1}^m u_i^d}{\sum_{i=1}^m u_i^d + \sum_{i=1}^m w_i^d}$$
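The definition above can be sketched in a few lines of standard-library Python. This is only a minimal illustration, not a reference implementation: the function names `nearest_distance` and `hopkins` are made up for this example, and generating $Y$ uniformly inside the bounding box of $X$ is an assumption, since the notes do not say which region the uniform points are drawn from.

```python
import math
import random


def nearest_distance(point, others):
    """Euclidean distance from `point` to the closest element of `others`."""
    return min(math.dist(point, q) for q in others)


def hopkins(X, m, rng):
    """Hopkins statistic of X (a list of equal-length tuples) using m probes."""
    n, d = len(X), len(X[0])

    # Assumption: Y is drawn uniformly from the bounding box of X.
    lows = [min(p[k] for p in X) for k in range(d)]
    highs = [max(p[k] for p in X) for k in range(d)]

    # w_i: distance from a sampled data point to its nearest *other* data point.
    sample_idx = rng.sample(range(n), m)  # sampling without replacement
    w = [nearest_distance(X[i], X[:i] + X[i + 1:]) for i in sample_idx]

    # u_i: distance from a uniformly generated point to its nearest data point.
    Y = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
         for _ in range(m)]
    u = [nearest_distance(y, X) for y in Y]

    su = sum(ui ** d for ui in u)
    sw = sum(wi ** d for wi in w)
    return su / (su + sw)


rng = random.Random(0)

# Uniform data in the unit square.
uniform = [(rng.random(), rng.random()) for _ in range(500)]

# Three tight Gaussian blobs (centers chosen arbitrarily for illustration).
centers = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.8)]
clustered = [(rng.gauss(cx, 0.03), rng.gauss(cy, 0.03))
             for cx, cy in (rng.choice(centers) for _ in range(500))]

h_uniform = hopkins(uniform, 50, rng)
h_clustered = hopkins(clustered, 50, rng)
print(f"uniform: {h_uniform:.2f}, clustered: {h_clustered:.2f}")
```

The brute-force nearest-neighbor search keeps the sketch dependency-free. For large $n$ a spatial index such as a k-d tree would be the more practical choice.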

Properties

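With this definition, uniform random data should tend to have values near 0.5, since $u_i$ and $w_i$ are then of similar size. Clustered data should tend to have values nearer to 1, since sampled points sit close to their neighbors (small $w_i$) while uniformly generated points mostly land in empty space (large $u_i$).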
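One way to see where the 0.5 baseline comes from is the following sketch. It rests on idealized assumptions that these notes do not state: the data form a homogeneous Poisson process with intensity $\lambda$, edge effects are ignored, and the $u_i$ and $w_i$ are treated as independent (plausible when $m \ll n$). Writing $c_d$ for the volume of the unit ball in $d$ dimensions, the volume of the ball reaching the nearest neighbor is exponentially distributed in both cases:

$$c_d\, u_i^d \sim \text{Exp}(\lambda), \qquad c_d\, w_i^d \sim \text{Exp}(\lambda)$$

Each sum of $m$ such terms is then $\text{Gamma}(m, \lambda)$, and the constant $c_d$ cancels in the ratio, so

$$H \sim \text{Beta}(m, m), \qquad \mathbb{E}[H] = \frac{1}{2}, \qquad \text{Var}(H) = \frac{1}{4(2m + 1)}$$

Under these assumptions, larger $m$ concentrates $H$ more tightly around 0.5 for uniform data, which makes departures toward 1 easier to detect.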

Drawbacks

However, data drawn from a single Gaussian will also score close to 1, because this statistic measures deviation from a uniform distribution rather than the presence of several clusters. This makes the statistic less useful in practice, as real data is usually not remotely uniform.

Spatial Histogram Approach

For this method, I'm not too sure how this works, but here are some key points I found.

Divide each dimension into equal-width bins, count how many points fall in each bin, and normalize the counts to obtain the empirical joint probability mass function.

Do the same for the randomly sampled data.

Finally, compute how much the two distributions differ using the Kullback-Leibler (KL) divergence. If the divergence is large, then we can say that the data is clusterable.
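The three steps above might look like the following sketch in standard-library Python. Several details are assumptions rather than part of the notes: the helper names (`histogram_pmf`, `kl_divergence`, `histogram_score`) are hypothetical, the random data is drawn uniformly from the bounding box of the data, the divergence is taken in the direction $\text{KL}(\text{data} \,\|\, \text{random})$, and a small pseudo-count `alpha` is added to every bin so that empty bins do not produce $\log 0$.

```python
import math
import random
from collections import Counter
from itertools import product


def histogram_pmf(points, lows, highs, bins, alpha=0.5):
    """Empirical joint PMF over a grid of equal-width bins.

    `alpha` is a smoothing pseudo-count per cell (an assumption of this sketch).
    Assumes highs[k] > lows[k] in every dimension.
    """
    d = len(lows)
    counts = Counter()
    for p in points:
        cell = tuple(
            # Clamp so that a coordinate equal to the upper bound
            # falls into the last bin instead of one past it.
            min(int((x - lo) / (hi - lo) * bins), bins - 1)
            for x, lo, hi in zip(p, lows, highs)
        )
        counts[cell] += 1
    total = len(points) + alpha * bins ** d
    return {cell: (counts[cell] + alpha) / total
            for cell in product(range(bins), repeat=d)}


def kl_divergence(p, q):
    """KL(p || q) for two PMFs defined on the same cells."""
    return sum(pc * math.log(pc / q[cell]) for cell, pc in p.items())


def histogram_score(data, bins, rng):
    """KL divergence between the data histogram and a uniform reference."""
    d = len(data[0])
    lows = [min(p[k] for p in data) for k in range(d)]
    highs = [max(p[k] for p in data) for k in range(d)]
    # Assumption: the reference set is uniform over the data's bounding box
    # and has the same number of points as the data.
    reference = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                 for _ in data]
    p = histogram_pmf(data, lows, highs, bins)
    q = histogram_pmf(reference, lows, highs, bins)
    return kl_divergence(p, q)


rng = random.Random(0)
uniform = [(rng.random(), rng.random()) for _ in range(500)]
centers = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.8)]  # arbitrary illustrative centers
clustered = [(rng.gauss(cx, 0.03), rng.gauss(cy, 0.03))
             for cx, cy in (rng.choice(centers) for _ in range(500))]

kl_uniform = histogram_score(uniform, 10, rng)
kl_clustered = histogram_score(clustered, 10, rng)
print(f"uniform: {kl_uniform:.2f}, clustered: {kl_clustered:.2f}")
```

The score for uniform data is not exactly zero because both histograms are noisy finite samples, so in practice the value for the data set of interest is best judged against such a uniform baseline rather than against zero. The number of cells grows as `bins ** d`, so this sketch is only practical in low dimensions.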
