# CURE and TSNE
## Clustering Using Representatives (CURE)
Clustering Using Representatives (CURE) is a hierarchical clustering technique in which a cluster is represented by a **set** of well-scattered representative points.
The algorithm has a parameter $\alpha$ that defines the fraction by which each representative point is shrunk toward the centroid of its cluster.
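
As a minimal illustration of that shrink operation (the names `reps`, `centroid`, and `shrink` are hypothetical, not from any particular CURE implementation):

```python
import numpy as np

def shrink(reps: np.ndarray, centroid: np.ndarray, alpha: float) -> np.ndarray:
    """Move each representative point a fraction alpha toward the centroid."""
    return reps + alpha * (centroid - reps)
```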
CURE is known to be robust to outliers and able to identify clusters that have **non-spherical** shapes and vary in size.
At each step of CURE's algorithm, the two clusters whose representative points are closest to each other are merged.
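
A minimal sketch of that merge criterion, assuming each cluster is given as a numpy array of its representative points (the name `cluster_distance` is illustrative):

```python
import numpy as np

def cluster_distance(reps_a: np.ndarray, reps_b: np.ndarray) -> float:
    """Distance between two clusters: the smallest distance between any pair
    of representative points, one taken from each cluster."""
    diffs = reps_a[:, None, :] - reps_b[None, :, :]
    return float(np.sqrt((diffs ** 2).sum(axis=-1)).min())
```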
This algorithm cannot be directly applied to large datasets due to its high runtime complexity. Several enhancements address this limitation:
- Random sampling: This involves a trade-off between accuracy and efficiency. The hope is that the random sample obtained is representative of the population.
- Partitioning: The idea is to partition the sample space into $p$ partitions, cluster each partition separately, and then cluster the partial results together.
YouTube video: https://www.youtube.com/watch?v=JrOJspZ1CUw
Steps (see the sketch after this list):
1. Pick a random sample of points that fit in main memory
2. Cluster sample points hierarchically to create the initial clusters
3. Pick representative **points**
1. For each cluster, pick $k$ representative points, as dispersed as possible
2. Move each representative point a fixed fraction $\alpha$ toward the centroid of the cluster
4. Rescan the whole dataset, visiting each point $p$
5. Place it in the "closest cluster"
1. "Closest" means the shortest distance to any of the clusters' representative points.
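
A minimal sketch of steps 3–5, assuming the initial clusters are already available as numpy arrays of points (all names here, such as `pick_representatives` and `assign`, are illustrative rather than taken from a particular CURE implementation):

```python
import numpy as np

def pick_representatives(points: np.ndarray, k: int, alpha: float) -> np.ndarray:
    """Pick k well-dispersed points (farthest-point heuristic), then shrink
    each one a fraction alpha toward the cluster centroid."""
    centroid = points.mean(axis=0)
    # Start with the point farthest from the centroid, then greedily add
    # the point farthest from the representatives chosen so far.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(k, len(points)):
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dists)])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)

def assign(data: np.ndarray, reps_per_cluster: list) -> np.ndarray:
    """Steps 4-5: place every point in the cluster that owns the nearest
    representative point."""
    labels = np.empty(len(data), dtype=int)
    for i, x in enumerate(data):
        labels[i] = int(np.argmin(
            [np.linalg.norm(reps - x, axis=1).min() for reps in reps_per_cluster]
        ))
    return labels
```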
## TSNE
TSNE (t-distributed Stochastic Neighbor Embedding) reduces the dimensionality of a dataset to two dimensions, which allows us to visualize the data.
It is able to do this because many real-world datasets have a low intrinsic dimensionality even though they are embedded in a high-dimensional space.
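
In practice, the scikit-learn implementation is a convenient way to produce such a two-dimensional embedding; a minimal sketch with placeholder data `X`:

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)                       # placeholder high-dimensional data
embedding = TSNE(n_components=2).fit_transform(X)
print(embedding.shape)                            # (200, 2)
```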
Since the technique needs to preserve the structure of the data, points that are close in the original space must map to points that are close to each other as well. Let $|x_i - x_j|$ be the Euclidean distance between two data points and $|y_i - y_j|$ the distance between the corresponding map points. The conditional similarity between two data points is:
$$
p_{j|i} = \frac{\exp(-|x_i-x_j|^2 / (2\sigma_i^2))}{\sum_{k \ne i}{\exp(-|x_i-x_k|^2/(2\sigma_i^2))}}
$$
where we consider the **Gaussian distribution** centered at $x_i$ with variance $\sigma_i^2$, evaluated at the distance between $x_j$ and $x_i$. The variance is different for every point; it is chosen such that points in dense areas are given a smaller variance than points in sparse areas.
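
A minimal sketch of this conditional similarity, assuming the per-point standard deviations `sigma` are already known (in full TSNE they are found by a search over a target perplexity, which is omitted here; the name `conditional_p` is illustrative):

```python
import numpy as np

def conditional_p(X: np.ndarray, sigma: np.ndarray) -> np.ndarray:
    """Row i holds p_{j|i}: Gaussian similarity of x_j given x_i, normalized over k != i."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    P = np.exp(-sq_dists / (2 * sigma[:, None] ** 2))
    np.fill_diagonal(P, 0.0)                  # exclude k = i from the sum
    return P / P.sum(axis=1, keepdims=True)
```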
Now the similarity between the mapped points is
$$
q_{ij} = \frac{f(|y_i - y_j|)}{\sum_{k \ne i}{f(|y_i - y_k|)}}
$$
where $f(z) = \frac{1}{1 + z^2}$.
This follows the same idea as the conditional similarity between data points, except that it is based on the **Cauchy distribution**.
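
A matching sketch for the mapped-point similarities, following the row-wise normalization in the formula above (the name `similarity_q` is illustrative):

```python
import numpy as np

def similarity_q(Y: np.ndarray) -> np.ndarray:
    """q_{ij}: Cauchy-kernel similarity between map points y_i and y_j."""
    sq_dists = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    Q = 1.0 / (1.0 + sq_dists)                # f(z) = 1 / (1 + z^2)
    np.fill_diagonal(Q, 0.0)                  # exclude k = i from the sum
    return Q / Q.sum(axis=1, keepdims=True)
```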
TSNE works by minimizing the Kullback-Leibler divergence between the two distributions $p_{ij}$ and $q_{ij}$
$$
KL(P || Q) = \sum_{i,j}{p_{ij} \log{\frac{p_{ij}}{q_{ij}}}}
$$
To minimize this score, gradient descent is typically performed
$$
\frac{\partial KL(P||Q)}{\partial y_i} = 4\sum_j{(p_{ij} - q_{ij}) \, f(|y_i - y_j|) \, (y_i - y_j)}
$$
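
A rough sketch of that update loop, reusing the illustrative `conditional_p` and `similarity_q` helpers from above with a fixed learning rate (real implementations add momentum, early exaggeration, and a perplexity-based choice of `sigma`):

```python
import numpy as np

def tsne_gradient_descent(X, sigma, n_iter=500, lr=100.0, seed=0):
    """Crude gradient descent on KL(P || Q) for a 2-D embedding Y."""
    rng = np.random.default_rng(seed)
    P = conditional_p(X, sigma)               # used directly as p_ij, as in the notes
    Y = rng.normal(scale=1e-2, size=(len(X), 2))
    for _ in range(n_iter):
        Q = similarity_q(Y)
        sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        W = (P - Q) / (1.0 + sq)              # (p_ij - q_ij) * f(|y_i - y_j|)
        grad = 4 * (W[:, :, None] * (Y[:, None, :] - Y[None, :, :])).sum(axis=1)
        Y -= lr * grad
    return Y
```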