Mirror of https://github.com/Brandon-Rozek/website.git (synced 2025-10-08 22:21:12 +00:00)

Commit 50ec3688a5 (parent ee0ab66d73): Website snapshot

281 changed files with 21066 additions and 0 deletions
content/research/clusteranalysis/notes.md (new file, 43 lines)
@@ -0,0 +1,43 @@

# Lecture Notes for Cluster Analysis

[Lecture 1: Measures of Similarity](lec1)

[Lecture 2.1: Distance Measures Reasoning](lec2-1)

[Lecture 2.2: Principal Component Analysis Pt. 1](lec2-2)

Lecture 3: Discussion of Dataset

[Lecture 4: Principal Component Analysis Pt. 2](lec4)

[Lecture 4.2: Revisiting Measures](lec4-2)

[Lecture 4.3: Cluster Tendency](lec4-3)

[Lecture 5: Introduction to Connectivity Based Models](lec5)

[Lecture 6: Agglomerative Methods](lec6)

[Lecture 7: Divisive Methods Part 1: Monothetic](lec7)

[Lecture 8: Divisive Methods Part 2: Polythetic](lec8)

[Lecture 9.1: CURE and TSNE](lec9-1)

[Lecture 9.2: Cluster Validation Part I](lec9-2)

[Lecture 10.1: Silhouette Coefficient](lec10-1)

[Lecture 10.2: Centroid-Based Clustering](lec10-2)

[Lecture 10.3: Voronoi Diagrams](lec10-3)

[Lecture 11.1: K-means++](lec11-1)

[Lecture 11.2: K-medoids](lec11-2)

[Lecture 11.3: K-medians](lec11-3)

[Lecture 12: Introduction to Density Based Clustering](lec12)
content/research/clusteranalysis/notes/lec1.md (new file, 331 lines)
@@ -0,0 +1,331 @@

# Measures of Similarity

To identify clusters of observations we need to know how **close individuals are to each other** or **how far apart they are**.

Two individuals are 'close' when their dissimilarity or distance is small and their similarity is large.

Special attention will be paid to proximity measures suitable for data consisting of repeated measures of the same variable, for example taken at different time points.

## Similarity Measures for Categorical Data

Measures are generally scaled to be in the interval $[0, 1]$, although occasionally they are expressed as percentages in the range $0-100\%$.

A similarity value of one indicates that both observations have identical values for all variables.

A similarity value of zero indicates that the two individuals differ maximally for all variables.

### Similarity Measures for Binary Data

An extensive list of similarity measures for binary data exists. The reason so many measures have been proposed is the apparent uncertainty as to how to **deal with the count of zero-zero matches**.

In some cases, zero-zero matches are equivalent to one-one matches and therefore should be included in the calculated similarity measure.

<u>Example</u>: Gender, where there is no preference as to which of the two categories should be coded as zero or one.

In other cases the inclusion or otherwise of the matches is more problematic.

<u>Example</u>: When the zero category corresponds to the genuine absence of some property, such as wings in a study of insects.

The question that then needs to be asked is: do the co-absences contain useful information about the similarity of the two objects?

Attributing a high degree of similarity to a pair of individuals simply because they both lack a large number of attributes may not be sensible in many situations.

The following table will help when it comes to interpreting the measures.

![Similarity Measure Table](measure_table.png)

Measures that ignore the co-absences (both objects having a zero) are Jaccard's coefficient (S2) and Sneath and Sokal (S4).

When co-absences are considered informative, the simple matching coefficient (S1) is usually employed.

Measures S3 and S5 are further examples of symmetric coefficients that treat positive matches (a) and negative matches (d) in the same way.

![Similarity Measures 2](similarity_table_2.png)

### Similarity Measures for Categorical Data with More Than Two Levels

Categorical data where the variables have more than two levels (for example, eye color) could be dealt with in a similar way to binary data, with each level of a variable being regarded as a single binary variable.

This is not an attractive approach, however, simply because of the large number of 'negative' matches which will inevitably be involved.

A superior method is to allocate a score of zero or one to each variable depending on whether the two observations are the same on that variable. These scores are then averaged over all $p$ variables to give the required similarity coefficient as
$$
s_{ij} = \frac{1}{p}\sum_{k = 1}^p{s_{ijk}}
$$
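As a concrete illustration of this coefficient, here is a small sketch in base R (my own example, not from the lecture): score one whenever the two observations agree on a variable, then average.

```R
# Matching-based similarity for two observations described by p categorical variables
categorical_similarity <- function(x, y) {
  stopifnot(length(x) == length(y))
  mean(x == y)   # proportion of variables on which the observations agree
}

obs1 <- c(eyes = "brown", hair = "black", handedness = "right")
obs2 <- c(eyes = "blue",  hair = "black", handedness = "right")
categorical_similarity(obs1, obs2)  # 2 of 3 variables match, so about 0.667
```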
### Dissimilarity and Distance Measures for Continuous Data

A **metric** on a set $X$ is a distance function
$$
d : X \times X \to [0, \infty)
$$
where $[0, \infty)$ is the set of non-negative real numbers and for all $x, y, z \in X$, the following conditions are satisfied:

1. $d(x, y) \ge 0$ non-negativity or separation axiom
2. $d(x, y) = 0 \iff x = y$ identity of indiscernibles
3. $d(x, y) = d(y, x)$ symmetry
4. $d(x, z) \le d(x, y) + d(y, z)$ subadditivity or triangle inequality

Conditions 1 and 2 define a positive-definite function.

All distance measures are formulated so as to allow for differential weighting of the quantitative variables; $w_k$ denotes the nonnegative weight of the $k$th of the $p$ variables.

![Distance measure table](distance_table.png)

Proposed dissimilarity measures can be broadly divided into distance measures and correlation-type measures.

#### Distance Measures

##### $L^p$ Space

The Minkowski distance is a metric in a normed vector space which can be considered a generalization of both the Euclidean distance and the Manhattan distance
$$
D(X, Y) = \left(\sum_{i = 1}^n{w_i^p|x_i - y_i|^p}\right)^{\frac{1}{p}}
$$
This is a metric for $p \ge 1$.

###### Manhattan Distance

This is the case of the Minkowski distance when $p = 1$
$$
d(X, Y) = \sum_{i = 1}^n{w_i|x_i - y_i|}
$$
Manhattan distance depends on the rotation of the coordinate system, but does not depend on its reflection about a coordinate axis or its translation
$$
d(x, y) = d(-x, -y)
$$

$$
d(x, y) = d(x + a, y + a)
$$

Shortest paths are not unique in this metric.

###### Euclidean Distance

This is the case of the Minkowski distance when $p = 2$. The Euclidean distance between points X and Y is the length of the line segment connecting them.
$$
d(X, Y) = \sqrt{\sum_{i = 1}^n{w_i^2(x_i - y_i)^2}}
$$
The shortest path between two points is unique under this metric. This distance metric is also translation and rotation invariant.

###### Squared Euclidean Distance

The standard Euclidean distance can be squared in order to place progressively greater weight on objects that are farther apart. In this case, the equation becomes
$$
d(X, Y) = \sum_{i = 1}^n{w_i^2(x_i - y_i)^2}
$$
Squared Euclidean distance is not a metric as it does not satisfy the [triangle inequality](https://en.wikipedia.org/wiki/Triangle_inequality); however, it is frequently used in optimization problems in which distances only have to be compared.

###### Chebyshev Distance

The Chebyshev distance is where the distance between two vectors is the greatest of their differences along any coordinate dimension.

It is also known as **chessboard distance**, since in the game of [chess](https://en.wikipedia.org/wiki/Chess) the minimum number of moves needed by a [king](https://en.wikipedia.org/wiki/King_(chess)) to go from one square on a [chessboard](https://en.wikipedia.org/wiki/Chessboard) to another equals the Chebyshev distance
$$
d(X, Y) = \lim_{p \to \infty}{\left(\sum_{i = 1}^n{|x_i - y_i|^p}\right)}^\frac{1}{p}
$$

$$
= \max_i(|x_i - y_i|)
$$

Chebyshev distance is translation invariant.

##### Canberra Distance Measure

The Canberra distance (D4) is a weighted version of the $L_1$ Manhattan distance. This measure is very sensitive to small changes close to $x_{ik} = x_{jk} = 0$.

It is often regarded as a generalization of the dissimilarity measure for binary data. In this context the measure can be divided by the number of variables, $p$, to ensure a dissimilarity coefficient in the interval $[0, 1]$.

It can then be shown that this measure for binary variables is just one minus the matching coefficient.
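Base R's `dist()` covers most of the distances above (unweighted); the snippet below is a small sketch (the toy matrix is made up for illustration) showing how each choice is selected.

```R
x <- matrix(c(0, 0,
              3, 4,
              6, 1), ncol = 2, byrow = TRUE)

dist(x, method = "euclidean")          # L2 norm
dist(x, method = "manhattan")          # L1 norm
dist(x, method = "maximum")            # Chebyshev distance
dist(x, method = "canberra")           # Canberra distance
dist(x, method = "minkowski", p = 3)   # general L^p distance
dist(x, method = "euclidean")^2        # squared Euclidean (not a metric)
```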
### Correlation Measures

It has often been suggested that the correlation between two observations can be used to quantify the similarity between them.

Since for correlation coefficients we have that $-1 \le \phi_{ij} \le 1$, with the value '1' reflecting the strongest possible positive relationship and the value '-1' the strongest possible negative relationship, these coefficients can be transformed into dissimilarities, $d_{ij}$, within the interval $[0, 1]$.

The use of correlation coefficients in this context is far more contentious than their noncontroversial role in assessing the linear relationship between two variables based on $n$ observations.

When correlations between two individuals are used to quantify their similarity, the <u>rows of the data matrix are standardized</u>, not its columns.

**Disadvantages**

When variables are measured on different scales, the notion of a difference between variable values, and consequently that of a mean variable value or a variance, is meaningless.

In addition, the correlation coefficient is unable to measure the difference in size between two observations.

**Advantages**

However, the use of a correlation coefficient can be justified for situations where all of the variables have been measured on the same scale and the precise values taken are important only to the extent that they provide information about the subject's relative profile.

<u>Example:</u> In classifying animals or plants, the absolute sizes of the organisms or their parts are often less important than their shapes. In such studies the investigator requires a dissimilarity coefficient that takes the value zero if and only if two individuals' profiles are multiples of each other. The angular separation dissimilarity measure has this property.

**Further considerations**

The Pearson correlation is sensitive to outliers. This has prompted a number of suggestions for modifying correlation coefficients when used as similarity measures; for example, robust versions of correlation coefficients such as the *jackknife correlation*, or altogether more general association coefficients such as the *mutual information distance measure*.

#### Mahalanobis (Maximum) Distance [Not between 2 observations]

Mahalanobis distance is a measure of the distance between a point P and a distribution D. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.

Mahalanobis distance is unitless and scale-invariant and takes into account the correlations of the data set
$$
D(\vec{x}) = \sqrt{(\vec{x} - \vec{\mu})^T S^{-1}(\vec{x}-\vec{\mu})}
$$
where $\vec{\mu}$ is the vector of means of the observations and $S$ is the covariance matrix.

If the covariance matrix is diagonal, then the resulting distance measure is called a normalized Euclidean distance.
$$
d(\vec{x}, \vec{y}) = \sqrt{\sum_{i = 1}^N{\frac{(x_i - y_i)^2}{s^2_i}}}
$$
where $s_i$ is the standard deviation of the $x_i$ and $y_i$ over the sample set.
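Base R provides `mahalanobis()`, which returns squared Mahalanobis distances from a set of points to a center; a short sketch on a built-in dataset (my own illustration, not from the lecture):

```R
x <- iris[, 1:4]

center     <- colMeans(x)   # vector of means
covariance <- cov(x)        # covariance matrix S

# mahalanobis() returns squared distances, so take the square root
d2 <- mahalanobis(x, center = center, cov = covariance)
head(sqrt(d2))
```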
#### Discrete Metric

This metric describes whether or not two observations are equivalent
$$
\rho(x, y) = \begin{cases}
1 & x \not= y \\
0 & x = y
\end{cases}
$$

## Similarity Measures for Data Containing both Continuous and Categorical Variables

There are a number of approaches to constructing proximities for mixed-mode data, that is, data in which some variables are continuous and some categorical.

1. Dichotomize all variables and use a similarity measure for binary data
2. Rescale all the variables so that they are on the same scale by replacing variable values by their ranks among the objects and then using a measure for continuous data
3. Construct a dissimilarity measure for each type of variable and combine these, either with or without differential weighting, into a single coefficient.

Most general-purpose statistical software implements a number of measures for converting a two-mode data matrix into a one-mode dissimilarity matrix.

R has the `cluster`, `clusterSim`, and `proxy` packages.
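For example (my own sketch), `cluster::daisy()` with Gower's coefficient combines per-variable dissimilarities for mixed continuous and categorical columns into a single dissimilarity matrix; the toy data frame is made up.

```R
library(cluster)

people <- data.frame(
  age    = c(23, 35, 47, 29),
  income = c(31000, 52000, 78000, 40000),
  sex    = factor(c("F", "M", "M", "F")),
  smoker = factor(c("no", "yes", "no", "no"))
)

# Gower's coefficient handles numeric and factor columns together
daisy(people, metric = "gower")
```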
### Proximity Measures for Structured Data

We'll be looking at data that consist of repeated measures of the same outcome variable but under different conditions.

The simplest and perhaps most commonly used approach to exploiting the reference variable is the construction of a reduced set of relevant summaries per object, which are then used as the basis for defining object similarity.

Here we will look at some approaches for choosing summary measures and the resulting proximity measures for the most frequently encountered reference vectors (e.g. time, experimental condition, and underlying factor).

Structured data arise when the variables can be assumed to follow a known *factor model*. Under a *confirmatory factor analysis model*, each variable or item can be allocated to one of a set of underlying factors or concepts. The factors cannot be observed directly but are 'indicated' by a number of items that are all measured on the same scale.

Note that the summary approach, while typically used with continuous variables, is not limited to variables on an interval scale. The same principles can be applied to categorical data. The difference is that summary measures now need to capture relevant aspects of the distribution of the categorical variables over the repeated measures.

Rows of **$X$** which represent ordered lists of elements, that is, all the variables provide a categorical outcome and these variables can be aligned in one dimension, are more generally referred to as *sequences*. *Sequence analysis* is an area of research that centers on problems of events and actions in their temporal context and includes the measurement of similarities between sequences.

The most popular measure of dissimilarity between two sequences is the Levenshtein distance, which counts the minimum number of operations needed to transform one sequence of categories into another, where an operation is an insertion, a deletion, or a substitution of a single category. Each operation may be assigned a penalty weight (a typical choice would be to give double the penalty to a substitution as opposed to an insertion or deletion). The measure is sometimes called the 'edit distance' due to its application in spell checkers.

Optimal matching algorithms (OMAs) need to be employed to find the minimum number of operations required to match one sequence to another. One such algorithm for aligning sequences is the Needleman-Wunsch algorithm, which is commonly used in bioinformatics to align proteins.

The *Jaro similarity measure* is a related measure of similarity between sequences of categories, often used to detect duplicates in the area of record linkage.
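Base R exposes a generalized Levenshtein distance through `adist()`, including per-operation costs; a quick sketch of the edit distance discussed above (the example strings are my own):

```R
# Plain edit distance: insertions, deletions, and substitutions all cost 1
adist("kitten", "sitting")    # 3 operations

# Weighted version: make a substitution twice as costly as an insertion or deletion
adist("kitten", "sitting",
      costs = list(insertions = 1, deletions = 1, substitutions = 2))
```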
## Inter-group Proximity Measures

In clustering applications, it becomes necessary to consider how to measure the proximity between groups of individuals.

1. The proximity between two groups might be defined by a suitable summary of the proximities between individuals from either group
2. Each group might be described by a representative observation, obtained by choosing a suitable summary statistic for each variable, and the inter-group proximity defined as the proximity between the representative observations.

### Inter-group Proximity Derived from the Proximity Matrix

For deriving inter-group proximities from a matrix of inter-individual proximities, there are a variety of possibilities:

- Take the smallest dissimilarity between any two individuals, one from each group. This is referred to as the *nearest-neighbor distance* and is the basis of the clustering technique known as *single linkage*.
- Define the inter-group distance as the largest distance between any two individuals, one from each group. This is known as the *furthest-neighbor distance* and constitutes the basis of the *complete linkage* cluster method.
- Define the inter-group distance as the average dissimilarity between individuals from both groups. Such a measure is used in *group average clustering*.

### Inter-group Proximity Based on Group Summaries for Continuous Data

One method for constructing inter-group dissimilarity measures for continuous data is to simply substitute group means (also known as the centroid) for the variable values in the formulae for inter-individual measures.

More appropriate, however, might be measures which incorporate, in one way or another, knowledge of within-group variation. One possibility is to use Mahalanobis's generalized distance.

#### Mahalanobis (Maximum) Distance

Mahalanobis distance is a measure of the distance between a point P and a distribution D. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D.

Mahalanobis distance is unitless and scale-invariant and takes into account the correlations of the data set
$$
D(\vec{x}) = \sqrt{(\vec{x} - \vec{\mu})^T S^{-1}(\vec{x}-\vec{\mu})}
$$
where $\vec{\mu}$ is the vector of means of the observations and $S$ is the covariance matrix.

If the covariance matrix is diagonal, then the resulting distance measure is called a normalized Euclidean distance.
$$
d(\vec{x}, \vec{y}) = \sqrt{\sum_{i = 1}^N{\frac{(x_i - y_i)^2}{s^2_i}}}
$$
where $s_i$ is the standard deviation of the $x_i$ and $y_i$ over the sample set.

Thus, the Mahalanobis distance increases with increasing distance between the two group centers and with decreasing within-group variation.

By also employing within-group correlations, the Mahalanobis distance takes into account the possibly non-spherical shapes of the groups.

The use of the Mahalanobis distance implies that the investigator is willing to **assume** that the covariance matrices are at least approximately the same in the two groups. When this is not so, this measure is an inappropriate inter-group measure. Other alternatives exist, such as the one proposed by Anderson and Bahadur:

<img src="http://proquest.safaribooksonline.com.ezproxy.umw.edu/getfile?item=cjlhZWEzNDg0N2R0cGMvaS9zMG1nODk0czcvN3MwczM3L2UwLXMzL2VpL3RtYTBjMGdzY2QwLmkxLWdtaWY-" alt="equation">

Another alternative is the *normal information radius* suggested by Jardine and Sibson:

<img src="http://proquest.safaribooksonline.com.ezproxy.umw.edu/getfile?item=cjlhZWEzNDg0N2R0cGMvaS9zMG1nODk0czcvN3MwczM4L2UwLXMzL2VpL3RtYTBjMGdzY2QwLmkxLWdtaWY-" alt="equation">

### Inter-group Proximity Based on Group Summaries for Categorical Data

Approaches for measuring inter-group dissimilarities between groups of individuals for which categorical variables have been observed have been considered by a number of authors. Balakrishnan and Sanghvi (1968), for example, proposed a dissimilarity index of the form

![Balakrishnan and Sanghvi equation](BalakrishnanSanghvi.png)

where $p_{Akl}$ and $p_{Bkl}$ are the proportions of the $l$th category of the $k$th variable in groups A and B respectively, $c_k + 1$ is the number of categories for the $k$th variable, and $p$ is the number of variables.

Kurczynski (1969) suggested adapting the generalized Mahalanobis distance, with categorical variables replacing quantitative variables. In its most general form, this measure for inter-group distance is given by

![Mahalanobis Categorical Equation](MahalanobisCategorical.png)

where the vectors of sample proportions in groups A and B are compared using the $m \times m$ common sample covariance matrix, with $m$ the total number of categories over all variables.
## Weighting Variables

To weight a variable means to give it greater or lesser importance than other variables in determining the proximity between two objects.

The question is: how should the weights be chosen? Before we discuss this question, it is important to realize that the selection of variables for inclusion in the study already presents a form of weighting, since the variables not included are effectively being given the weight $0$.

The weights chosen for the variables reflect the importance that the investigator assigns the variables for the classification task.

There are several approaches to this:

- Obtain perceived dissimilarities between selected objects, then model those dissimilarities using the underlying variables and weights that indicate their relative importance. The weights that best fit the perceived dissimilarities are then chosen.
- Define the weights to be inversely proportional to some measure of variability in the variable. This choice of weights implies that the importance of a variable decreases when its variability increases.
  - For a continuous variable, the most commonly employed weight is either the reciprocal of its standard deviation or the reciprocal of its range.
  - Employing variability weights is equivalent to what is commonly referred to as *standardizing* the variables.
- Construct weights from the data matrix using *variable selection*. In essence, such procedures proceed in an iterative fashion to identify variables which, when contributing to a cluster algorithm, lead to internally cohesive and externally isolated clusters and, when clustered singly, produce reasonable agreement.

The second approach assumes the importance of a variable to be inversely proportional to the total variability of that variable. The total variability of a variable comprises variation both within and between any groups which may exist within the set of individuals. The aim of cluster analysis is typically to identify such groups. Hence it can be argued that the importance of a variable should not be reduced because of between-group variation (on the contrary, one might wish to assign more importance to a variable that shows larger between-group variation).

Gnanadesikan et al. (1995) assessed the ability of squared distance functions based on data-determined weights, both those described above and others, to recover groups in eight simulated and real continuous data sets in a subsequent cluster analysis. Their main findings were:

1. Equal weights, (total) standard deviation weights, and range weights were generally ineffective, but range weights were preferable to standard deviation weights.
2. Weighting based on estimates of within-cluster variability worked well overall.
3. Weighting aimed at emphasizing variables with the most potential for identifying clusters did enhance clustering when some variables had a strong cluster structure.
4. Weighting to optimize the fitting of a hierarchical tree was often even less effective than equal weighting or weighting based on (total) standard deviations.
5. Forward variable selection was often among the better performers. (Note that all-subsets variable selection was not assessed at the time.)

## Standardization

In many clustering applications, the variables describing the objects to be clustered will not be measured in the same units. A number of variability measures have been used to put them on a common scale:

- When standard deviations calculated from the complete set of objects to be clustered are used, the technique is often referred to as *auto-scaling*, *standard scoring*, or *z-scoring*.
- Division by the median absolute deviation or by the range.

The second is shown to outperform auto-scaling in many clustering applications. As pointed out in the previous section, standardization of variables to unit variance can be viewed as a special case of weighting.
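A small sketch of these variability weights in R (my own illustration): dividing each column by its standard deviation, median absolute deviation, or range before computing distances.

```R
x <- as.matrix(iris[, 1:4])

# z-scoring / auto-scaling: subtract the mean, divide by the standard deviation
z_scored <- scale(x)

# divide by the median absolute deviation instead
by_mad <- sweep(x, 2, apply(x, 2, mad), "/")

# divide by the range (max - min)
col_range <- apply(x, 2, function(col) diff(range(col)))
by_range  <- sweep(x, 2, col_range, "/")

d <- dist(by_range)   # distances on the rescaled data
```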
## Choice of Proximity Measure

Firstly, the nature of the data should strongly influence the choice of the proximity measure.

Next, the choice of measure should depend on the scale of the data. Similarity coefficients should be used when the data are binary. For continuous data, a distance or correlation-type dissimilarity measure should be used according to whether the 'size' or 'shape' of the objects is of interest.

Finally, the clustering method to be used might have some implications for the choice of the coefficient. For example, making a choice between several proximity coefficients with similar properties which are also known to be monotonically related can be avoided by employing a cluster method that depends only on the ranking of the proximities, not their absolute values.
content/research/clusteranalysis/notes/lec10-1.md (new file, 46 lines)
@@ -0,0 +1,46 @@

# Silhouette

This technique validates the consistency within clusters of data. It provides a succinct graphical representation of how well each object lies within its cluster.

The silhouette ranges from -1 to 1, where a high value indicates that the object is consistent within its own cluster and poorly matched to neighboring clusters.

A low or negative silhouette value can mean that the current clustering configuration has too many or too few clusters.

## Definition

For each datum $i$, let $a(i)$ be the average distance of $i$ to all other data within the same cluster.

$a(i)$ can be interpreted as how well $i$ is assigned to its cluster (lower values mean better agreement).

We can then define the average dissimilarity of point $i$ to a cluster $c$ as the average distance from $i$ to all points in $c$.

Let $b(i)$ be the lowest average distance of $i$ to all points in any other cluster of which $i$ is not a member.

The cluster with this lowest average dissimilarity is said to be the neighboring cluster of $i$. From here we can define the silhouette:
$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}
$$
The average $s(i)$ over all data of a cluster is a measure of how tightly grouped all the data in the cluster are. A silhouette plot may be used to visualize the agreement between each of the data and its cluster.

![Silhouette Plot](silhouette.png)

### Properties

Recall that $a(i)$ is a measure of how dissimilar $i$ is to its own cluster; a smaller value means better agreement with its cluster. For $s(i)$ to be close to 1, we require $a(i) \ll b(i)$.

If $s(i)$ is close to negative one, then by the same logic we can see that $i$ would be more appropriately placed in its neighboring cluster.

$s(i)$ near zero means that the datum is on the border of two natural clusters.

## Determining the Number of Clusters

The silhouette can also be used to help determine the number of clusters in a dataset. The ideal number of clusters is the one that produces the highest average silhouette value.

Also, a good indication that one has too many clusters is if there are clusters with the majority of observations being under the mean silhouette value.

https://kapilddatascience.wordpress.com/2015/11/10/using-silhouette-analysis-for-selecting-the-number-of-cluster-for-k-means-clustering/

![Silhouette analysis plot](silhouette_num_clusters.png)
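The `cluster` package computes and plots silhouettes directly; a minimal sketch (my own example on a built-in dataset):

```R
library(cluster)

x  <- iris[, 1:4]
km <- kmeans(x, centers = 3, nstart = 25)

# silhouette() needs the cluster labels and a dissimilarity matrix
sil <- silhouette(km$cluster, dist(x))
summary(sil)   # average silhouette width per cluster and overall
plot(sil)      # silhouette plot
```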
content/research/clusteranalysis/notes/lec10-2.md (new file, 54 lines)
@@ -0,0 +1,54 @@

# Centroid-based Clustering

In centroid-based clustering, clusters are represented by some central vector which may or may not be a member of the dataset. In practice, the number of clusters is fixed to $k$ and the goal is to solve some sort of optimization problem.

The similarity of two clusters is defined as the similarity of their centroids.

This problem is computationally difficult, so efficient heuristic algorithms are commonly employed. These usually converge quickly to a local optimum.

## K-means Clustering

This aims to partition $n$ observations into $k$ clusters in which each observation belongs to the cluster with the nearest mean, which serves as the centroid of the cluster.

This technique results in partitioning the data space into Voronoi cells.

### Description

Given a set of observations $x$, k-means clustering aims to partition the $n$ observations into $k$ sets $S$ so as to minimize the within-cluster sum of squares (i.e. variance). More formally, the objective is to find
$$
\arg\min_{S}{\sum_{i = 1}^k{\sum_{x \in S_i}{||x-\mu_i||^2}}}= \arg\min_{S}{\sum_{i = 1}^k{|S_i|\mathrm{Var}(S_i)}}
$$
where $\mu_i$ is the mean of the points in $S_i$. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster
$$
\arg\min_{S}{\sum_{i = 1}^k{\frac{1}{2|S_i|}\sum_{x, y \in S_i}{||x-y||^2}}}
$$

### Algorithm

Given an initial set of $k$ means, the algorithm proceeds by alternating between two steps.

**Assignment step**: Assign each observation to the cluster whose mean has the least squared Euclidean distance.

- Intuitively this is finding the nearest mean
- Mathematically this means partitioning the observations according to the Voronoi diagram generated by the means

**Update step**: Calculate the new means to be the centroids of the observations in the new clusters.

The algorithm is known to have converged when the assignments no longer change. There is no guarantee that the global optimum is found using this algorithm.

The result depends on the initial clusters. It is common to run this multiple times with different starting conditions.

Using a distance function other than the squared Euclidean distance may stop the algorithm from converging.

### Initialization Methods

Commonly used initialization methods are Forgy and Random Partition.

**Forgy Method**: This method randomly chooses $k$ observations from the data set and uses these as the initial means.

This method is known to spread the initial means out.

**Random Partition Method**: This method first randomly assigns a cluster to each observation and then proceeds to the update step.

This method is known to place most of the means close to the center of the dataset.
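A minimal sketch of base R's `kmeans()` reflecting the points above, using several random starts and inspecting the within-cluster sum of squares (the dataset choice is mine):

```R
x <- iris[, 1:4]

# nstart = 25 runs the algorithm from 25 random initializations and keeps the best
km <- kmeans(x, centers = 3, nstart = 25)

km$cluster        # cluster assignment for each observation
km$centers        # the k cluster means (centroids)
km$tot.withinss   # total within-cluster sum of squares being minimized
```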
content/research/clusteranalysis/notes/lec10-3.md (new file, 18 lines)
@@ -0,0 +1,18 @@

# Voronoi Diagram

A Voronoi diagram is a partitioning of a plane into regions based on distance to points in a specific subset of the plane.

The set of points (often called seeds, sites, or generators) is specified beforehand, and for each seed there is a corresponding region consisting of all points closer to that seed than to any other.

Different metrics may be used and often result in different Voronoi diagrams.

**Euclidean**

![Euclidean Voronoi Diagram](euclidean_voronoi.png)

**Manhattan**

![Manhattan Voronoi Diagram](manhattan_voronoi.png)
content/research/clusteranalysis/notes/lec11-1.md (new file, 18 lines)
@@ -0,0 +1,18 @@

# K-means++

K-means++ is an algorithm for choosing the initial values or seeds for the k-means clustering algorithm. It was proposed as a way of avoiding the sometimes poor clusterings found by the standard k-means algorithm.

## Intuition

The intuition behind this approach involves spreading out the $k$ initial cluster centers. The first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.

## Algorithm

The exact algorithm is as follows (a short R sketch follows the list):

1. Choose one center uniformly at random from among the data points
2. For each data point $x$, compute $D(x)$, the distance between $x$ and the nearest center that has already been chosen.
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point $x$ is chosen with probability proportional to $D(x)^2$
4. Repeat steps 2 and 3 until $k$ centers have been chosen
5. Now that the initial centers have been chosen, proceed using standard k-means clustering
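A compact sketch of this seeding procedure in base R (my own illustration; the function name and dataset are arbitrary), which then hands the chosen seeds to `kmeans()`:

```R
# k-means++ seeding (illustration only, not a tuned implementation)
kmeanspp_centers <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  centers <- numeric(k)
  centers[1] <- sample.int(n, 1)           # step 1: first center uniformly at random
  for (j in 2:k) {
    # step 2: squared distance from each point to its nearest chosen center
    d2 <- apply(x, 1, function(p) {
      min(colSums((t(x[centers[1:(j - 1)], , drop = FALSE]) - p)^2))
    })
    # step 3: sample the next center with probability proportional to D(x)^2
    centers[j] <- sample.int(n, 1, prob = d2)
  }
  x[centers, , drop = FALSE]
}

# steps 4-5: hand the seeds to standard k-means
x  <- as.matrix(iris[, 1:4])
km <- kmeans(x, centers = kmeanspp_centers(x, 3))
```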
content/research/clusteranalysis/notes/lec11-2.md (new file, 52 lines)
@@ -0,0 +1,52 @@

# K-Medoids

A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal.

The k-medoids algorithm is related to k-means and the medoidshift algorithm. Both the k-means and k-medoids algorithms are partitional, and both attempt to minimize the distance between points in a cluster and its center. In contrast to k-means, k-medoids chooses data points as centers and typically uses the Manhattan norm to define the distance between data points instead of the Euclidean norm.

This method is known to be more robust to noise and outliers compared to k-means since it minimizes the sum of pairwise dissimilarities instead of the sum of squared Euclidean distances.

## Algorithms

There are several algorithms that have been created as an optimization over an exhaustive search. In this section, we'll discuss PAM and the Voronoi iteration method.

### Partitioning Around Medoids (PAM)

1. Select $k$ of the $n$ data points as medoids
2. Associate each data point with the closest medoid
3. While the cost of the configuration decreases:
   1. For each medoid $m$, for each non-medoid data point $o$:
      1. Swap $m$ and $o$, recompute the cost (sum of distances of points to their medoid)
      2. If the total cost of the configuration increased in the previous step, undo the swap

### Voronoi Iteration Method

1. Select $k$ of the $n$ data points as medoids
2. While the cost of the configuration decreases:
   1. In each cluster, make the point that minimizes the sum of distances within the cluster the medoid
   2. Reassign each point to the cluster defined by the closest medoid determined in the previous step.

### Clustering Large Applications (CLARA)

This is a variant of the PAM algorithm that relies on a sampling approach to handle large datasets. The cost of a particular cluster configuration is the mean cost of all the dissimilarities.

## R Implementations

Both PAM and CLARA are available in the `cluster` package in R.

```R
clara(x, k, metric = "euclidean", stand = FALSE, samples = 5,
      sampsize = min(n, 40 + 2 * k), trace = 0, medoids.x = TRUE,
      keep.data = medoids.x, rngR = FALSE)
```

```R
pam(x, k, metric = "euclidean", stand = FALSE)
```
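For instance, a minimal call on a toy dataset might look like the sketch below (the data choice and `k` are mine); `pam()` also accepts a precomputed dissimilarity matrix, e.g. one produced by `daisy()`.

```R
library(cluster)

fit <- pam(iris[, 1:4], k = 3, metric = "manhattan")
fit$medoids             # the k observations chosen as cluster centers
table(fit$clustering)   # cluster sizes
plot(silhouette(fit))   # silhouette of the resulting partition
```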
content/research/clusteranalysis/notes/lec11-3.md (new file, 19 lines)
@@ -0,0 +1,19 @@

# K-Medians

This is a variation of k-means clustering where instead of calculating the mean for each cluster to determine its centroid we calculate the median instead.

This has the effect of minimizing error over all the clusters with respect to the Manhattan norm, as opposed to the squared Euclidean norm which is minimized in k-means.

### Algorithm

Given an initial set of $k$ medians, the algorithm proceeds by alternating between two steps.

**Assignment step**: Assign each observation to the cluster whose median has the least Manhattan distance.

- Intuitively this is finding the nearest median

**Update step**: Calculate the new medians to be the component-wise medians of the observations in the new clusters.

The algorithm is known to have converged when the assignments no longer change. There is no guarantee that the optimum is found using this algorithm.

The result depends on the initial clusters. It is common to run this multiple times with different starting conditions.
content/research/clusteranalysis/notes/lec12.md (new file, 56 lines)
@@ -0,0 +1,56 @@

# Introduction to Density Based Clustering

In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in more sparse areas are considered to be outliers or border points. This helps discover clusters of arbitrary shape.

## DBSCAN

Given a set of points in space, DBSCAN groups together points that are closely packed together while marking points that lie alone in low-density regions as outliers.

### Preliminary Information

- A point $p$ is a core point if at least $k$ points (often referred to as minPts) are within $\epsilon$ of it. Those points are said to be *directly reachable* from $p$.
- A point $q$ is directly reachable from $p$ if point $q$ is within distance $\epsilon$ from point $p$ and $p$ is a core point.
- A point $q$ is reachable from $p$ if there is a path $p_1, \dots, p_n$ with $p_1 = p$ and $p_n = q$ where each $p_{i + 1}$ is directly reachable from $p_i$. (All points on the path must be core points, with the possible exception of $q$.)
- All points not reachable from any other point are outliers.

Non-core points can be part of a cluster, but they form its "edge", since they cannot be used to reach more points.

Reachability is not a symmetric relation since, by definition, no point may be reachable from a non-core point, regardless of distance.

Two points $p$ and $q$ are density-connected if there is a point $o$ such that both $p$ and $q$ are reachable from $o$. Density-connectedness is symmetric.

A cluster then satisfies two properties:

1. All points within the cluster are mutually density-connected
2. If a point is density-reachable from any point of the cluster, it is part of the cluster as well.

### Algorithm

1. Find the $\epsilon$ neighbors of every point, and identify the core points with more than $k$ neighbors.
2. Find the connected components of *core* points on the neighborhood graph, ignoring all non-core points.
3. Assign each non-core point to a nearby cluster if the cluster is an $\epsilon$ (eps) neighbor, otherwise assign it to noise.

### Advantages

- Does not require one to specify the number of clusters in the data
- Can find arbitrarily shaped clusters
- Has a notion of noise and is robust to outliers

### Disadvantages

- Not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster.
- The quality of DBSCAN depends on the distance measure used.
- Cannot cluster data sets well when there are large differences in densities.

### Rules of Thumb for Parameters

$k$: $k$ must be larger than $D + 1$, where $D$ is the number of dimensions in the dataset. Normally $k$ is chosen to be twice the number of dimensions.

$\epsilon$: Ideally the $k^{th}$ nearest neighbors are at roughly the same distance. Plot the sorted distance of every point to its $k^{th}$ nearest neighbor and look for an elbow.
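Assuming the `dbscan` package (my choice; `fpc` also provides an implementation), a minimal sketch of both the kNN-distance plot for picking $\epsilon$ and the clustering itself:

```R
library(dbscan)

x <- as.matrix(iris[, 1:4])

# Heuristic for eps: plot sorted k-nearest-neighbor distances and look for the elbow
kNNdistplot(x, k = 4)

# Run DBSCAN with the chosen parameters (values here are illustrative)
fit <- dbscan(x, eps = 0.5, minPts = 5)
fit$cluster   # cluster labels; 0 denotes noise points
```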
Example of a run-through:

https://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf
content/research/clusteranalysis/notes/lec2-1.md (new file, 34 lines)
@@ -0,0 +1,34 @@

# Why use different distance measures?

I made an attempt to find out in which situations people use different distance measures. Looking around on the Internet usually produces answers like "It depends on the problem" or "I typically just always use Euclidean".

As you might imagine, that isn't a terribly useful answer, since it doesn't give me any examples of which types of problems different distances solve.

Therefore, let's think about it in a different way. What properties do different distance measures have that make them desirable?

## Manhattan Advantages

- The gradient of this function has a constant magnitude; there's no power in the formula.
- Unusual values affect Euclidean distances more, since the differences are squared.

https://datascience.stackexchange.com/questions/20075/when-would-one-use-manhattan-distance-as-opposite-to-euclidean-distance

## Mahalanobis Advantages

Variables can be on different scales. The Mahalanobis formula has a built-in variance-covariance matrix which allows you to rescale your variables to make distances between different variables more comparable.

https://stats.stackexchange.com/questions/50949/why-use-the-mahalanobis-distance#50956

## Euclidean Disadvantages

In higher dimensions, the points essentially become uniformly distant from one another. This is a problem observed with most distance metrics, but it's more obvious with the Euclidean one.

https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions/

Hopefully in this course we'll discover more properties as to why it makes sense to use different distance measures, since the choice can have an impact on how our clusters are formed.
content/research/clusteranalysis/notes/lec2-2.md (new file, 53 lines)
@@ -0,0 +1,53 @@

# Principal Component Analysis Pt. 1

## What is PCA?

Principal component analysis is a statistical procedure that performs an orthogonal transformation to convert a set of variables into a set of linearly uncorrelated variables called principal components.

The number of distinct principal components equals $\min(\# \text{Variables}, \# \text{Observations} - 1)$.

The transformation is defined in such a way that the first principal component explains the largest possible variance in the data.

Each succeeding component has the highest possible variance under the constraint of being orthogonal to the preceding components.

PCA is sensitive to the relative scaling of the original variables.

### Results of a PCA

Results are discussed in terms of *component scores*, which are the transformed variables, and *loadings*, the weights by which each original variable is multiplied to get the component scores.

## Assumptions of PCA

1. Linearity
2. Large variances are important and small variances denote noise
3. Principal components are orthogonal

## Why perform PCA?

- Distance measures perform poorly in high-dimensional space (https://stats.stackexchange.com/questions/256172/why-always-doing-dimensionality-reduction-before-clustering)
- Helps eliminate noise from the dataset (https://www.quora.com/Does-it-make-sense-to-perform-principal-components-analysis-before-clustering-if-the-original-data-has-too-many-dimensions-Is-it-theoretically-unsound-to-try-to-cluster-data-with-no-correlation)
- One initial cost to help reduce further computations

## Computing PCA

1. Subtract off the mean of each measurement type
2. Compute the covariance matrix
3. Take the eigenvalues/vectors of the covariance matrix

## R Code

```R
pcal = function(data) {
  # Center and scale each variable
  centered_data = scale(data)
  covariance = cov(centered_data)
  # Eigendecomposition: eigenvalues give the variance explained,
  # eigenvectors (stored as columns) give the component directions
  eigen_stuff = eigen(covariance)
  sorted_indices = sort(eigen_stuff$values,
                        index.return = T,
                        decreasing = T)$ix
  loadings = eigen_stuff$values[sorted_indices]
  # Eigenvectors are the columns of $vectors, so index the columns
  components = eigen_stuff$vectors[, sorted_indices]
  combined_list = list(loadings, components)
  names(combined_list) = c("Loadings", "Components")
  return(combined_list)
}
```
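A quick way to sanity-check the function (my own sketch) is to compare its output against base R's `prcomp()`, which performs the same decomposition via the SVD; since `scale()` standardizes the variables, the eigenvalues should match `prcomp`'s squared standard deviations and the eigenvectors should agree up to sign.

```R
result <- pcal(USArrests)
pr     <- prcomp(USArrests, center = TRUE, scale. = TRUE)

result$Loadings     # eigenvalues of the correlation matrix
pr$sdev^2           # should match the line above

result$Components   # eigenvectors (columns), possibly differing in sign
pr$rotation
```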
content/research/clusteranalysis/notes/lec4-2.md (new file, 24 lines)
@@ -0,0 +1,24 @@

# Revisiting Similarity Measures

## Manhattan Distance

An additional use case for the Manhattan distance is when dealing with binary vectors. This approach, otherwise known as the Hamming distance, counts the number of bits that are different between two binary vectors.

## Ordinal Variables

Ordinal variables can be treated as if they were on an interval scale.

First, replace the ordinal variable value by its rank ($r_{if}$). Then map the range of each variable onto the interval $[0, 1]$ by replacing the value of variable $f$ for object $i$ by
$$
z_{if} = \frac{r_{if} - 1}{M_f - 1}
$$
where $M_f$ is the maximum rank.

### Example

Freshman = $0$, Sophomore = $\frac{1}{3}$, Junior = $\frac{2}{3}$, Senior = $1$

$d(\text{freshman}, \text{senior}) = 1$

$d(\text{junior}, \text{senior}) = \frac{1}{3}$
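A tiny R sketch of this rank mapping (my own illustration), using the class-year example above:

```R
year <- factor(c("Freshman", "Junior", "Senior", "Sophomore", "Senior"),
               levels = c("Freshman", "Sophomore", "Junior", "Senior"),
               ordered = TRUE)

r  <- as.integer(year)          # ranks r_if: 1, 3, 4, 2, 4
Mf <- nlevels(year)             # maximum rank M_f = 4
z  <- (r - 1) / (Mf - 1)        # mapped onto [0, 1]: 0, 2/3, 1, 1/3, 1

dist(z, method = "manhattan")   # e.g. d(junior, senior) = 1/3
```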
content/research/clusteranalysis/notes/lec4-3.md (new file, 40 lines)
@@ -0,0 +1,40 @@

# Cluster Tendency

This is the assessment of the suitability of clustering. Cluster tendency determines whether the data has any inherent grouping structure.

This is a hard task since there are so many different definitions of clusters (partitioning, hierarchical, density, graph, etc.). Even after fixing a cluster type, it is still hard to define an appropriate null model for a data set.

One way we can go about measuring cluster tendency is to compare the data against random data. On average, random data should not contain clusters.

There are several clusterability assessment methods, such as the spatial histogram, the distance distribution, and the Hopkins statistic.

## Hopkins Statistic

Let $X$ be the set of $n$ data points in $d$-dimensional space. Consider a random sample (without replacement) of $m \ll n$ data points. Also generate a set $Y$ of $m$ uniformly randomly distributed data points.

Now define two distance measures: $u_i$, the distance of $y_i \in Y$ from its nearest neighbor in $X$, and $w_i$, the distance of $x_i \in X$ (in the sample) from its nearest neighbor in $X$.

We can then define the Hopkins statistic as
$$
H = \frac{\sum_{i = 1}^m{u_i^d}}{\sum_{i = 1}^m{u_i^d} + \sum_{i =1}^m{w_i^d}}
$$

### Properties

With this definition, uniform random data should tend to have values near 0.5, and clustered data should tend to have values nearer to 1.

### Drawbacks

However, data containing a single Gaussian will also score close to one, since this statistic measures deviation from a uniform distribution. This makes the statistic less useful in application, as real data is usually not remotely uniform.
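A direct implementation sketch of this statistic following the definition above (the function name and sampling fraction are my own choices; packages such as `hopkins` or `clustertend` offer ready-made versions):

```R
hopkins_stat <- function(X, m = ceiling(0.1 * nrow(X))) {
  X <- as.matrix(X)
  n <- nrow(X); d <- ncol(X)

  # Euclidean distance from each row of `points` to its nearest neighbor in `ref`
  nn_dist <- function(points, ref, exclude_self = FALSE) {
    apply(points, 1, function(p) {
      dists <- sqrt(rowSums(sweep(ref, 2, p)^2))
      if (exclude_self) dists <- dists[dists > 0]
      min(dists)
    })
  }

  # m uniformly random points over the bounding box of X
  Y <- matrix(sapply(seq_len(d), function(j) runif(m, min(X[, j]), max(X[, j]))),
              nrow = m)

  sample_idx <- sample.int(n, m)   # random sample of real data points
  u <- nn_dist(Y, X)                                            # u_i
  w <- nn_dist(X[sample_idx, , drop = FALSE], X, TRUE)          # w_i

  sum(u^d) / (sum(u^d) + sum(w^d))
}

hopkins_stat(iris[, 1:4])   # clustered data: value well above 0.5
```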
## Spatial Histogram Approach

For this method, I'm not too sure how this works, but here are some key points I found.

Divide each dimension into equal-width bins, count how many points lie in each of the bins, and obtain the empirical joint probability mass function.

Do the same for the randomly sampled data.

Finally, compute how much they differ using the Kullback-Leibler (KL) divergence value. If it differs greatly, then we can say that the data is clusterable.
content/research/clusteranalysis/notes/lec4.md (new file, 171 lines)
@@ -0,0 +1,171 @@

# Principal Component Analysis Part 2: Formal Theory

## Properties of PCA

There are a number of ways to maximize the variance of a principal component. To create a unique solution we should impose a constraint. Let us say that the sum of the squares of the coefficients must equal 1. In vector notation this is the same as
$$
a_i^Ta_i = 1
$$
Every future principal component is said to be orthogonal to all the principal components previous to it
$$
a_j^Ta_i = 0, \quad i < j
$$
The total variance of the $q$ principal components will equal the total variance of the original variables
$$
\sum_{i = 1}^q {\lambda_i} = trace(S)
$$
where $S$ is the sample covariance matrix.

The proportion of variation accounted for by each principal component is
$$
P_j = \frac{\lambda_j}{trace(S)}
$$
From this, we can generalize to the first $m$ principal components, where $m < q$, and find the proportion $P^{(m)}$ of variation accounted for
$$
P^{(m)} = \frac{\sum_{i = 1}^m{\lambda_i}}{trace(S)}
$$
You can think of the first principal component as the line of best fit that minimizes the residuals orthogonal to it.

### What to watch out for

As a reminder from the last lecture, *PCA is not scale-invariant*. Therefore, transformations done to the dataset before PCA and after PCA often lead to different results and possibly conclusions.

Additionally, if there are large differences between the variances of the original variables, then those whose variances are largest will tend to dominate the early components.

Therefore, principal components should only be extracted from the sample covariance matrix when all of the original variables have roughly the **same scale**.

### Alternatives to using the Covariance Matrix

But it is rare in practice to have a scenario where all of the variables are on the same scale. Therefore, principal components are typically extracted from the **correlation matrix** $R$.

Choosing to work with the correlation matrix rather than the covariance matrix treats the variables as all equally important when performing PCA.

## Example Derivation: Bivariate Data

Let $R$ be the correlation matrix
$$
R = \begin{pmatrix}
1 & r \\
r & 1
\end{pmatrix}
$$
Let us find the eigenvectors and eigenvalues of the correlation matrix
$$
det(R - \lambda I) = 0
$$

$$
(1-\lambda)^2 - r^2 = 0
$$

$$
\lambda_1 = 1 + r, \lambda_2 = 1 - r
$$

Let us remember to check the condition that the sum of the eigenvalues equals the trace of the correlation matrix:
$$
\lambda_1 + \lambda_2 = 1+r + (1 - r) = 2 = trace(R)
$$

### Finding the First Eigenvector

Looking back at the eigenvector equation
$$
Ra_1 = \lambda a_1
$$
we can get the following two formulas
$$
a_{11} + ra_{12} = (1+r)a_{11} \tag{1}
$$

$$
ra_{11} + a_{12} = (1 + r)a_{12} \tag{2}
$$

Now let us find out what $a_{11}$ and $a_{12}$ equal. First let us solve for $a_{11}$ using equation $(1)$
$$
ra_{12} = (1+r)a_{11} - a_{11}
$$

$$
ra_{12} = a_{11}(1 + r - 1)
$$

$$
ra_{12} = ra_{11}
$$

$$
a_{12} = a_{11}
$$

where $r$ does not equal $0$.

Now we must apply the sum of squares condition
$$
a_1^Ta_1 = 1
$$

$$
a_{11}^2 + a_{12}^2 = 1
$$

Recall that $a_{12} = a_{11}$
$$
2a_{11}^2 = 1
$$

$$
a_{11}^2 = \frac{1}{2}
$$

$$
a_{11} =\pm \frac{1}{\sqrt{2}}
$$

For the sake of choosing a value, let us take the principal root and say $a_{11} = \frac{1}{\sqrt{2}}$.

### Finding the Second Eigenvector

Recall the fact that each subsequent eigenvector is orthogonal to the first. This means
$$
a_{11}a_{21} + a_{12}a_{22} = 0
$$
Substituting the values for $a_{11}$ and $a_{12}$ calculated in the previous section
$$
\frac{1}{\sqrt{2}}a_{21} + \frac{1}{\sqrt{2}}a_{22} = 0
$$

$$
a_{21} + a_{22} = 0
$$

$$
a_{21} = -a_{22}
$$

Since this eigenvector also needs to satisfy the first condition, we get the following values
$$
a_{21} = \frac{1}{\sqrt{2}} , a_{22} = \frac{-1}{\sqrt{2}}
$$

### Conclusion of Example

From this, we can say that the two principal components are given by
$$
y_1 = \frac{1}{\sqrt{2}}(x_1 + x_2), \quad y_2 = \frac{1}{\sqrt{2}}(x_1-x_2)
$$
with the variance of the first principal component given by $(1+r)$ and that of the second by $(1-r)$.

Because of this, as $r$ increases, so does the variance explained by the first principal component. This, in turn, lowers the variance explained by the second principal component.
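A quick numeric check of this derivation in R (my own sketch): build the $2 \times 2$ correlation matrix for some $r$ and confirm that `eigen()` returns the eigenvalues $1 \pm r$ and eigenvectors proportional to $(1, 1)$ and $(1, -1)$.

```R
r <- 0.6
R <- matrix(c(1, r,
              r, 1), nrow = 2)

e <- eigen(R)
e$values    # 1.6 and 0.4, i.e. 1 + r and 1 - r
e$vectors   # columns are (1, 1)/sqrt(2) and (1, -1)/sqrt(2), up to sign
```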
## Choosing a Number of Principal Components

Principal component analysis is typically used in dimensionality reduction efforts. Therefore, there are several strategies for picking the right number of principal components to keep. Here are a few:

- Retain enough principal components to account for 70%-90% of the variation
- Exclude principal components whose eigenvalues are less than the average eigenvalue
- Exclude principal components whose eigenvalues are less than one.
- Generate a scree plot
  - Stop when the plot goes from "steep" to "shallow"
  - Stop when it essentially becomes a straight line.
35
content/research/clusteranalysis/notes/lec5.md
Normal file
35
content/research/clusteranalysis/notes/lec5.md
Normal file
|
@ -0,0 +1,35 @@
|
|||
# Introduction to Connectivity Based Models

Hierarchical algorithms combine observations to form clusters based on their distance.

## Connectivity Methods

Hierarchical clustering techniques can be subdivided according to how the hierarchy is built.

There are two different methods of forming the clusters: *agglomerative* and *divisive*.

<u>Agglomerative</u> methods start with the $n$ individuals as separate clusters and combine them into larger groups at each iteration.

<u>Divisive</u> methods start with one giant group and separate it into finer groupings at each iteration.

Hierarchical methods are irrevocable: once the algorithm joins or separates a grouping, it cannot be undone. As Kaufman and Rousseeuw (1990) colorfully comment: *"A hierarchical method suffers from the defect that it can never repair what was done in previous steps"*.

It is the job of the statistician to decide when to stop the agglomerative or divisive algorithm, since having one giant cluster containing all observations or having each observation be its own cluster isn't particularly useful.

At different distances, different clusters are formed and are more readily represented using a **dendrogram**. These algorithms do not provide a unique solution but rather provide an extensive hierarchy of clusters that merge or divide at different distances.
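
A minimal sketch of this workflow in base R, with the built-in `USArrests` data standing in for a real dataset: build the hierarchy, inspect the dendrogram, and cut it at a chosen number of groups.

```r
# Sketch: hierarchical clustering workflow on stand-in data
d  <- dist(scale(USArrests))          # pairwise Euclidean distances
hc <- hclust(d, method = "complete")  # farthest-neighbour linkage, as an example
plot(hc)                              # dendrogram: merges at increasing distances
groups <- cutree(hc, k = 4)           # "stop" the hierarchy at four clusters
table(groups)
```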

## Linkage Criterion

Apart from the method of forming clusters, the user also needs to decide on a linkage criterion, that is, how the distance between clusters is measured when deciding which clusters to merge or split.

Do you want to group based on the nearest points in each cluster? That is nearest-neighbor (single linkage) clustering.

Or do you want to group based on the farthest observations in each cluster? That is farthest-neighbor (complete linkage) clustering.



## Shortcomings

These methods are not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge, depending on the clustering method.

As we go through this section, we will go into detail about the different linkage criteria and other parameters of these models.
90
content/research/clusteranalysis/notes/lec6.md
Normal file
90
content/research/clusteranalysis/notes/lec6.md
Normal file
|
@ -0,0 +1,90 @@
|
|||
# Agglomerative Methods

## Single Linkage

First let us consider the single linkage (nearest neighbor) approach. The clusters can be found through the following algorithm

1. Find the smallest non-zero distance
2. Group the two objects together as a cluster
3. Recompute the distances in the matrix by taking the minimum distance
   - For the new cluster $\{a, b\}$ and any other object $c$: $d(\{a,b\}, c) = \min(d(a, c), d(b, c))$

Single linkage can operate directly on a proximity matrix; the actual data are not required.

A wonderful visual representation can be found in Everitt Section 4.2
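
As a small illustration of the three steps above, here is one manual single-linkage update on a toy distance matrix (four points on a line, chosen only for demonstration):

```r
# Sketch: one manual single-linkage step on a small distance matrix
D <- as.matrix(dist(matrix(c(1, 2, 6, 10), ncol = 1)))   # 4 points on a line
rownames(D) <- colnames(D) <- c("a", "b", "c", "d")
D[D == 0] <- NA                                           # ignore self-distances
which(D == min(D, na.rm = TRUE), arr.ind = TRUE)          # step 1: smallest non-zero distance, d(a, b)
# step 3: distance from the new cluster {a, b} to each remaining object is the minimum
pmin(D["a", c("c", "d")], D["b", c("c", "d")])
```

In practice, `hclust(d, method = "single")` performs these updates for you on any `dist` object.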

## Centroid Clustering

This criterion requires both the data and the proximity matrix, and it requires a Euclidean distance measure to preserve geometric correctness. These are the steps of the algorithm

1. Find the smallest non-zero distance
2. Group the two objects together as a cluster
3. Recompute the distances by taking the mean of the clustered observations and computing the distances between all of the observations
   - For the new cluster $\{a, b\}$ and any other object $c$: $d(\{a,b\}, c) = d(\operatorname{mean}(a, b), c)$

## Complete Linkage

This is like single linkage, except now we take the farthest distance. The algorithm can be adjusted to the following

1. Find the smallest non-zero distance
2. Group the two objects together as a cluster
3. Recompute the distances in the matrix by taking the maximum distance
   - For the new cluster $\{a, b\}$ and any other object $c$: $d(\{a,b\}, c) = \max(d(a, c), d(b, c))$

## Unweighted Pair-Group Method using the Average Approach (UPGMA)

In this criterion, we no longer summarize each cluster before taking distances, but instead compare each observation in the cluster to the outside point and take the average

1. Find the smallest non-zero distance
2. Group the two objects together as a cluster
3. Recompute the distances in the matrix by taking the mean
   - For a cluster $A = \{a, b\}$ and any other object $c$: $d(A, c) = \operatorname{mean}_{i \in A}(d(i, c))$

## Median Linkage

This approach is similar to the UPGMA approach, except now we take the median instead of the mean

1. Find the smallest non-zero distance
2. Group the two objects together as a cluster
3. Recompute the distances in the matrix by taking the median
   - For a cluster $A = \{a, b\}$ and any other object $c$: $d(A, c) = \operatorname{median}_{i \in A}(d(i, c))$

## Ward Linkage

This one I didn't look too far into, but here's the description: with Ward's linkage method, the distance between two clusters is the sum of squared deviations from points to centroids. The objective of Ward's linkage is to minimize the within-cluster sum of squares.
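
These criteria are all available through the `method` argument of `hclust()` in base R (`"average"` corresponds to UPGMA and `"ward.D2"` to Ward's criterion). A small comparison sketch on stand-in data:

```r
# Sketch: the same dissimilarities under different linkage criteria (stand-in data)
d <- dist(scale(USArrests))
fits <- lapply(c("single", "complete", "average", "ward.D2"),
               function(m) hclust(d, method = m))
# four-group solutions often differ between linkages; cross-tabulate two of them
table(single = cutree(fits[[1]], k = 4), complete = cutree(fits[[2]], k = 4))
```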

## When to use different Linkage Types?

According to the following two Stack Exchange posts: https://stats.stackexchange.com/questions/195446/choosing-the-right-linkage-method-for-hierarchical-clustering and https://stats.stackexchange.com/questions/195456/how-to-select-a-clustering-method-how-to-validate-a-cluster-solution-to-warran/195481#195481

these are ways you can justify a linkage type.

**Cluster metaphor**. *"I preferred this method because it constitutes clusters such (or such a way) which meets with my concept of a cluster in my particular project."*

**Data/method assumptions**. *"I preferred this method because my data nature or format predispose to it."*

**Internal validity**. *"I preferred this method because it gave me most clear-cut, tight-and-isolated clusters."*

**External validity**. *"I preferred this method because it gave me clusters which differ by their background or clusters which match with the true ones I know."*

**Cross-validity**. *"I preferred this method because it is giving me very similar clusters on equivalent samples of the data or extrapolates well onto such samples."*

**Interpretation**. *"I preferred this method because it gave me clusters which, explained, are most persuasive that there is meaning in the world."*

### Cluster Metaphors

Let us explore the idea of cluster metaphors now.

**Single Linkage** or **Nearest Neighbor** is a *spectrum* or *chain*.

Since single linkage joins clusters by the shortest link between them, the technique cannot discern poorly separated clusters. On the other hand, single linkage is one of the few clustering methods that can delineate non-ellipsoidal clusters.

**Complete Linkage** or **Farthest Neighbor** is a *circle*.

**Between-Group Average Linkage** (UPGMA) is a united *class*.

**Centroid method** (UPGMC) is *proximity of platforms* (commonly used in politics).

## Dendrograms

A **dendrogram** is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. It shows how different clusters are formed at different distance groupings.
74
content/research/clusteranalysis/notes/lec7.md
Normal file
74
content/research/clusteranalysis/notes/lec7.md
Normal file
|
@ -0,0 +1,74 @@
|
|||
# Divisive Methods Pt. 1

Divisive methods work in the opposite direction of agglomerative methods. They take one large cluster and successively split it.

This is computationally demanding if all $2^{k - 1} - 1$ possible divisions into two sub-clusters of a cluster of $k$ objects are considered at each stage.

While less common than agglomerative methods, divisive methods have the advantage that most users are interested in the main structure of their data, and this is revealed from the outset of a divisive method.

## Monothetic Divisive Methods

For data consisting of $p$ **binary variables**, there is a computationally efficient approach known as *monothetic divisive methods*.

Monothetic divisions divide clusters according to the presence or absence of each of the $p$ variables, so that at each stage, clusters contain members with certain attributes that are either all present or all absent.

The term 'monothetic' refers to the use of a single variable on which to base the split. *Polythetic* methods use all the variables at each stage.

### Choosing the Variable to Split On

The choice of the variable on which a split is made depends on optimizing a criterion reflecting either cluster homogeneity or association with other variables.

This tends to minimize the number of splits that have to be made.

An example of a homogeneity criterion is the information content $C$.

With $p$ variables and $n$ objects, where $f_k$ is the number of individuals possessing the $k$th attribute, it is defined as

$$
C = pn\log{n}-\sum_{k = 1}^p{[f_k\log{f_k} + (n-f_k)\log{(n-f_k)}]}
$$
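
A small sketch of this criterion as an R function (the helper name and the random stand-in data are my own, and the usual convention $0\log 0 = 0$ is assumed):

```r
# Sketch: information content C of a binary data matrix (rows = objects, columns = attributes)
info_content <- function(X) {
  n <- nrow(X)
  f <- colSums(X)                                      # f_k: count of 1s per attribute
  xlogx <- function(v) ifelse(v > 0, v * log(v), 0)    # convention: 0 * log(0) = 0
  ncol(X) * n * log(n) - sum(xlogx(f) + xlogx(n - f))
}

X <- matrix(rbinom(200, 1, 0.4), nrow = 20)            # stand-in binary data
info_content(X)                                        # a candidate split is assessed by how much it reduces C
```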

### Association with other variables

Recall that another way to split is based on the association with other variables. The attribute used at each step can be chosen according to its overall association with all attributes remaining at each step.

This is sometimes termed *association analysis*.

For a pair of binary variables $V_1$ and $V_2$, the counts can be arranged in a two-by-two table:

| $V_1$ \ $V_2$ | 1    | 0    |
| ------------- | ---- | ---- |
| **1**         | a    | b    |
| **0**         | c    | d    |

#### Common measures of association

$$
|ad-bc| \tag{4.6}
$$

$$
(ad-bc)^2 \tag{4.7}
$$

$$
\frac{(ad-bc)^2n}{(a+b)(a+c)(b+d)(c+d)} \tag{4.8}
$$

$$
\sqrt{\frac{(ad-bc)^2n}{(a+b)(a+c)(b+d)(c+d)}} \tag{4.9}
$$

$$
\frac{(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)} \tag{4.10}
$$

$(4.6)$ and $(4.7)$ have the advantage that there is no danger of computational problems if any marginal totals are near zero.

The last three, $(4.8)$, $(4.9)$, and $(4.10)$, are related to the $\chi^2$ statistic, its square root, and the Pearson correlation coefficient respectively.
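
A small helper computing these five measures from the cell counts of the two-by-two table (the function name and the example counts are illustrative only):

```r
# Sketch: association measures (4.6)-(4.10) from a 2x2 table of attribute counts
assoc_measures <- function(a, b, c, d) {
  n <- a + b + c + d
  denom <- (a + b) * (a + c) * (b + d) * (c + d)
  cross <- a * d - b * c
  c(abs_cross   = abs(cross),                 # (4.6)
    sq_cross    = cross^2,                    # (4.7)
    chi_sq      = cross^2 * n / denom,        # (4.8)
    sqrt_chi_sq = sqrt(cross^2 * n / denom),  # (4.9)
    phi_squared = cross^2 / denom)            # (4.10)
}

assoc_measures(a = 12, b = 3, c = 4, d = 11)
```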

### Advantages/Disadvantages of Monothetic Methods

Appealing features of monothetic divisive methods are the easy classification of new members and the inclusion of cases with missing values.

A further advantage of monothetic divisive methods is that it is obvious which variable produces the split at any stage of the process.

A disadvantage of these methods is that the possession of a particular attribute which is either rare or rarely found in combination with others may take an individual down a different path.
48
content/research/clusteranalysis/notes/lec8.md
Normal file
48
content/research/clusteranalysis/notes/lec8.md
Normal file
|
@ -0,0 +1,48 @@
|
|||
# Divisive Methods Pt. 2

Recall from the previous section that we spoke about monothetic and polythetic methods. Monothetic methods only look at a single variable at a time, while polythetic methods look at multiple variables simultaneously. In this section, we will speak more about polythetic divisive methods.

## Polythetic Divisive Methods

Polythetic methods operate via a distance matrix.

This procedure avoids considering all possible splits by

1. Finding the object that is furthest away from the others within a group and using that as a seed for a splinter group.
2. Each object is then considered for entry to that splinter group: any that are closer to the splinter group than to the main group are moved to the splinter one.
3. These steps are then repeated until no more objects move.

This process has been developed into a program named `DIANA` (DIvisive ANAlysis Clustering), which is implemented in `R`.
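
A minimal sketch of `DIANA` via the `cluster` package, on stand-in data:

```r
# Sketch: divisive analysis clustering with cluster::diana (stand-in data)
library(cluster)
dv <- diana(scale(USArrests))           # accepts a data matrix or a "dist" object
dv$dc                                   # divisive coefficient
pltree(dv)                              # dendrogram of the successive splits
groups <- cutree(as.hclust(dv), k = 4)  # stop the division at four groups
table(groups)
```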

### Similarities to Politics

This somewhat resembles the way a political party might split due to inner conflicts.

Firstly, the most discontented member leaves the party and starts a new one, and then some others follow him until a kind of equilibrium is attained.

## Methods for Large Data Sets

There are two common hierarchical methods used for large data sets: `BIRCH` and `CURE`. Both of these algorithms employ a pre-clustering phase in which dense regions are summarized, the summaries then being clustered using a hierarchical method based on centroids.

### CURE

1. `CURE` starts with a random sample of points and represents clusters by a smaller number of points that capture the shape of the cluster
2. These representative points are then shrunk towards the centroid so as to dampen the effect of outliers
3. Hierarchical clustering then operates on the representative points

`CURE` has been shown to be able to cope with arbitrarily shaped clusters and in that respect may be superior to `BIRCH`, although it does require judgment as to the number of clusters and also a parameter which favors either more or less compact clusters.

## Revisiting Topics: Cluster Dissimilarity

In order to decide where clusters should be combined (for agglomerative methods), or where a cluster should be split (for divisive methods), a measure of dissimilarity between sets of observations is required.

In most methods of hierarchical clustering this is achieved by the use of an appropriate

- Metric (a measure of distance between pairs of observations)
- Linkage criterion (which specifies the dissimilarity of sets as a function of the pairwise distances between observations in the sets)

## Advantages of Hierarchical Clustering

- Any valid measure of distance can be used
- In most cases, the observations themselves are not required, just the matrix of distances
  - This can have the advantage of only having to store a distance matrix in memory, as opposed to the full data matrix
58
content/research/clusteranalysis/notes/lec9-1.md
Normal file
58
content/research/clusteranalysis/notes/lec9-1.md
Normal file
|
@ -0,0 +1,58 @@
|
|||
# CURE and TSNE

## Clustering Using Representatives (CURE)

Clustering Using Representatives is a hierarchical clustering technique in which you represent a cluster using a **set** of well-scattered representative points.

This algorithm has a parameter $\alpha$ which defines the fraction by which the representative points are shrunk towards the centroid.

CURE is known to be robust to outliers and able to identify clusters that have a **non-spherical** shape and variance in size.

At each step of CURE's algorithm, the two clusters with the closest pair of representative points are merged.

This algorithm cannot be directly applied to large datasets due to its high runtime complexity. Several enhancements were added to address this requirement

- Random sampling: This involves a trade-off between accuracy and efficiency. One would hope that the random sample obtained is representative of the population
- Partitioning: The idea is to partition the sample space into $p$ partitions

Youtube Video: https://www.youtube.com/watch?v=JrOJspZ1CUw

Steps

1. Pick a random sample of points that fit in main memory
2. Cluster the sample points hierarchically to create the initial clusters
3. Pick representative point**s**
   1. For each cluster, pick $k$ representative points, as dispersed as possible
   2. Move each representative point a fixed fraction $\alpha$ toward the centroid of the cluster
4. Rescan the whole dataset and visit each point $p$ in the data set
5. Place it in the "closest cluster"
   1. Closest as in shortest distance among all the representative points.

## TSNE

TSNE (t-distributed Stochastic Neighbor Embedding) allows us to reduce the dimensionality of a dataset to two, which allows us to visualize the data.

It is able to do this since many real-world datasets have a low intrinsic dimensionality embedded within the high-dimensional space.

Since the technique needs to conserve the structure of the data, two corresponding mapped points must be close to each other distance-wise as well. Let $|x_i - x_j|$ be the Euclidean distance between two data points, and $|y_i - y_j|$ the distance between the map points. The conditional similarity between two data points is:

$$
p_{j|i} = \frac{exp(-|x_i-x_j|^2 / (2\sigma_i^2))}{\sum_{k \ne i}{exp(-|x_i-x_k|^2/(2\sigma_i^2))}}
$$

Here we are considering the **Gaussian distribution** of the distance of $x_j$ from $x_i$ with a given variance $\sigma_i^2$. The variance is different for every point; it is chosen such that points in dense areas are given a smaller variance than points in sparse areas.

Now the similarity matrix for the mapped points is

$$
q_{ij} = \frac{f(|y_i - y_j|)}{\sum_{k \ne i}{f(|y_i - y_k|)}}
$$

where $f(z) = \frac{1}{1 + z^2}$.

This has the same idea as the conditional similarity between two data points, except this is based on the **Cauchy distribution**.

TSNE works by minimizing the Kullback-Leibler divergence between the two distributions $p_{ij}$ and $q_{ij}$

$$
KL(P || Q) = \sum_{i,j}{p_{ij} \log{\frac{p_{ij}}{q_{ij}}}}
$$

To minimize this score, gradient descent is typically performed

$$
\frac{\partial KL(P||Q)}{\partial y_i} = 4\sum_j{(p_{ij} - q_{ij})(y_i - y_j)\left(1 + |y_i - y_j|^2\right)^{-1}}
$$
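
A minimal sketch of a t-SNE embedding in R, assuming the contributed `Rtsne` package is installed and using the built-in `iris` measurements purely as stand-in data:

```r
# Sketch: 2-D t-SNE embedding of stand-in data with the Rtsne package
library(Rtsne)
X <- unique(iris[, 1:4])        # Rtsne requires that there be no duplicate rows
set.seed(42)                    # the embedding is stochastic
tsne <- Rtsne(as.matrix(X), dims = 2, perplexity = 30)
plot(tsne$Y, xlab = "t-SNE 1", ylab = "t-SNE 2")
```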
72
content/research/clusteranalysis/notes/lec9-2.md
Normal file
72
content/research/clusteranalysis/notes/lec9-2.md
Normal file
|
@ -0,0 +1,72 @@
|
|||
# Cluster Validation

There are multiple approaches to validating your cluster models

- Internal Evaluation: Summarizes the clustering into a single score, for example by minimizing the deviations from the centroids.
- External Evaluation: Minimizes the deviations from some known "labels"
- Manual Evaluation: A human expert decides whether or not it's good
- Indirect Evaluation: Evaluates the utility of the clustering in its intended application.

## Some Problems With These Evaluations

Internal evaluation measures suffer from the problem that they represent functions that are objectives for many clustering algorithms. So of course the result of the clustering algorithm will be such that the objective is minimized.

External evaluation suffers from the fact that if we had labels to begin with then we wouldn't need to cluster. Practical applications of clustering usually occur when we don't have labels. On the other hand, a labeling reflects only one possible partitioning of the data set; there could exist a different, perhaps even better, clustering.

## Internal Evaluation

We like to see a few qualities in cluster models

- *Robustness*: Refers to the effects of errors in the data or missing observations, and of changes in the data and methods.
- *Cohesiveness*: Clusters should be compact, i.e. have high intra-cluster similarity.
- *Separation*: Clusters should be dissimilar to other clusters, i.e. have low inter-cluster similarity.
- *Influence*: We should pay attention to and try to control for the influence of certain observations on the overall cluster solution.

Let us focus on the second and third bullet points for now. Internal evaluation measures are best suited to get some insight into situations where one algorithm performs better than another; this does not imply that one algorithm produces more valid results than another.

### Davies-Bouldin Index

$$
DB = \frac{1}{n}\sum_{i=1}^n{\max_{j\ne i}{\left(\frac{\sigma_i + \sigma_j}{d(c_i,c_j)}\right)}}
$$

Where $n$ is the number of clusters, $c_i$ is the centroid of cluster $i$, and $\sigma_i$ is the average deviation of the points in cluster $i$ from its centroid.

Better clusterings are indicated by smaller DB values.

### Dunn Index

$$
D = \frac{\min_{1\le i < j \le n}{d(i,j)}}{\max_{1\le k \le n}{d^\prime(k)}}
$$

The Dunn index aims to identify dense and well-separated clusters. It is defined as the ratio of the minimal inter-cluster distance $d(i, j)$ to the maximal intra-cluster distance (cluster diameter) $d^\prime(k)$.

Higher Dunn index values are more desirable.

### Bootstrapping

In terms of robustness, we can measure the uncertainty in each of the individual clusters. This can be examined using the bootstrapping approach of Suzuki and Shimodaira (2006). The probability or "p-value" is the proportion of bootstrapped samples that contain the cluster. Larger p-values in this case indicate more support for the cluster.

This is available in R via the `pvclust` package.
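
A minimal sketch of its use (assuming the contributed `pvclust` package is installed; note that `pvclust` clusters the *columns* of the matrix, so the observations are placed in columns here, and the data are only a stand-in):

```r
# Sketch: bootstrap p-values for hierarchical clusters with pvclust
library(pvclust)
X <- t(scale(USArrests))                  # stand-in data: observations as columns
set.seed(1)
fit <- pvclust(X, method.hclust = "average",
               method.dist = "euclidean", nboot = 1000)
plot(fit)                                 # dendrogram annotated with AU/BP p-values
pvrect(fit, alpha = 0.95)                 # highlight clusters with strong support
```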

### Split-Sample Validation

One approach to assessing the effects of perturbations of the data is to randomly divide the data into two subsets and perform an analysis on each subset separately. This method was proposed by McIntyre and Blashfield in 1980; it involves the following steps (a small sketch in R follows the list)

- Divide the sample in two and perform a cluster analysis on one of the samples
- Have a fixed rule for the number of clusters
- Determine the centroids of the clusters, and compute proximities between the objects in the second sample and the centroids, classifying the objects into their nearest cluster.
- Cluster the second sample using the same methods as before and compare these two alternate clusterings using something like the *adjusted Rand index*.
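
A rough sketch of this procedure, using k-means with a fixed $k = 3$, stand-in data, and `adjustedRandIndex()` from the contributed `mclust` package (the choice of clustering method and of $k$ are illustrative assumptions):

```r
# Sketch of split-sample validation with k-means (fixed k = 3)
library(mclust)                                   # provides adjustedRandIndex()
X <- scale(USArrests)
set.seed(2)
idx <- sample(nrow(X)) <= nrow(X) / 2             # random half-split
A <- X[idx, ]; B <- X[!idx, ]

fitA <- kmeans(A, centers = 3)                    # cluster the first half
nearest <- apply(B, 1, function(row)              # assign second half to nearest A-centroid
  which.min(colSums((t(fitA$centers) - row)^2)))

fitB <- kmeans(B, centers = 3)                    # cluster the second half independently
adjustedRandIndex(nearest, fitB$cluster)          # agreement between the two labelings of B
```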



## Influence of Individual Points

Using internal evaluation metrics, you can see the impact of each point by doing a "leave one out" analysis. Here you evaluate the dataset minus one point for each of the points. If a positive difference is found, the point is regarded as a *facilitator*, whereas if it is negative then it is considered an *inhibitor*. Once an influential inhibitor is found, the usual suggestion is to omit it from the clustering.

## R Package

`clValid` contains a variety of internal validation measures.

Paper: https://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf
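
A minimal sketch of its use, assuming the contributed `clValid` package is installed and with `USArrests` standing in for real data:

```r
# Sketch: internal validation measures over a range of cluster counts with clValid
library(clValid)
X <- scale(USArrests)
iv <- clValid(X, nClust = 2:6,
              clMethods = c("hierarchical", "kmeans", "pam"),
              validation = "internal")     # connectivity, Dunn index, silhouette width
summary(iv)
optimalScores(iv)                          # best method and cluster count per measure
```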
35
content/research/clusteranalysis/readings.md
Normal file
35
content/research/clusteranalysis/readings.md
Normal file
|
@ -0,0 +1,35 @@
|
|||
# Readings for Lectures of Cluster Analysis

## Lecture 1

Garson Textbook Chapter 3

## Lecture 2

[A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)

## Lecture 3

No Readings

## Lecture 4

An Introduction to Applied Multivariate Analysis with R by Brian Everitt and Torsten Hothorn.

Sections 3.0-3.9 Everitt

## Lecture 5

Section 4.1 Everitt

## Lecture 6

Section 4.2 Everitt

Applied Multivariate Statistical Analysis, Johnson, Section 12.3

[Linkage Methods for Cluster Observations](https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/multivariate/how-to/cluster-observations/methods-and-formulas/linkage-methods/#mcquitty)

## Lecture 7

Section 4.3 Everitt

## Lecture 8

[Introduction to the TSNE Algorithm](https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm)

## Lecture 9

Section 9.5 Everitt
119
content/research/clusteranalysis/syllabus.md
Normal file
119
content/research/clusteranalysis/syllabus.md
Normal file
|
@ -0,0 +1,119 @@
|
|||
# Cluster Analysis Spring 2018

### Distance, Dimensionality Reduction, and Tendency

- Distance
  - Euclidean Distance
  - Squared Euclidean Distance
  - Manhattan Distance
  - Maximum Distance
  - Mahalanobis Distance
  - Which distance function should you use?
- PCA
- Cluster Tendency
  - Hopkins Statistic
- Scaling Data

### Validating Clustering Models

- Clustering Validation
- Cross Validation

### Connectivity Models

- Agglomerative Clustering
  - Single Linkage Clustering
  - Complete Linkage Clustering
  - Unweighted Pair Group Method with Arithmetic Mean (If time permits)
- Dendrograms
- Divisive Clustering
  - CURE (Clustering using REpresentatives) algorithm (If time permits)

### Cluster Evaluation

- Internal Evaluation
  - Dunn Index
  - Silhouette Coefficient
  - Davies-Bouldin Index (If time permits)
- External Evaluation
  - Rand Measure
  - Jaccard Index
  - Dice Index
  - Confusion Matrix
  - F Measure (If time permits)
  - Fowlkes-Mallows Index (If time permits)

### Centroid Models

- Jenks Natural Breaks Optimization
- Voronoi Diagram
- K means clustering
- K medoids clustering
- K Medians/Modes clustering
- When to use K means as opposed to K medoids or K Medians?
- How many clusters should you use?
- Lloyd's Algorithm for Approximating K-means (If time permits)

### Density Models

- DBSCAN Density Based Clustering Algorithm
- OPTICS Ordering Points To Identify the Clustering Structure
- DeLi-Clu Density Link Clustering (If time permits)
- What should be your density threshold?

### Analysis of Model Appropriateness

- When do we use each of the models above?

### Distribution Models (If time permits)

- Fuzzy Clusters
- EM (Expectation Maximization) Clustering
- Maximum Likelihood Gaussian
- Probabilistic Hierarchical Clustering
## Textbooks

Cluster Analysis, 5th Edition

- Authors: Brian S. Everitt, Sabine Landau, Morven Leese, Daniel Stahl
- ISBN-13: 978-0470749913
- Cost: Free on UMW Library Site
- Amazon Link: https://www.amazon.com/Cluster-Analysis-Brian-S-Everitt/dp/0470749911/ref=sr_1_1?ie=UTF8&qid=1509135983&sr=8-1
- Table of Contents: http://www.wiley.com/WileyCDA/WileyTitle/productCd-EHEP002266.html

Cluster Analysis: 2014 Edition (Statistical Associates Blue Book Series 24)

- Author: David Garson
- ISBN: 978-1-62638-030-1
- Cost: Free with Site Registration
- Website: http://www.statisticalassociates.com/clusteranalysis.htm

## Schedule

In an ideal world, I have estimated how long each of the topics below should take to learn. Of course, you have more experience with how long it actually takes to learn these topics, so I'll leave this mostly to your discretion.

**Distance, Dimensionality Reduction, and Tendency** -- 3 Weeks

**Validating Cluster Models** -- 1 Week

**Connectivity Models** -- 2 Weeks

**Cluster Evaluation** -- 1 Week

**Centroid Models** -- 3 Weeks

**Density Models** -- 3 Weeks

**Analysis of Model Appropriateness** -- 1 Week

The schedule above accounts for 14 weeks, so there is a week that is free as a buffer.

## Conclusion

Creating this document got me really excited for this independent study. Feel free to give me feedback :)