Fixed titles, math rendering, and links on some pages

Brandon Rozek 2021-07-26 09:13:20 -04:00
parent 9f096a8720
commit 330ace0de9
61 changed files with 303 additions and 115 deletions

View file

@@ -1,4 +1,8 @@
-# Measures of similarity
+---
+title: Measures of similarity
+showthedate: false
+math: true
+---
 To identify clusters of observations we need to know how **close individuals are to each other** or **how far apart they are**.
@@ -328,4 +332,4 @@ Firstly, the nature of the data should strongly influence the choice of the prox
 Next, the choice of measure should depend on the scale of the data. Similarity coefficients should be used when the data is binary. For continuous data, a distance or correlation-type dissimilarity measure should be used according to whether the 'size' or 'shape' of the objects is of interest.
-Finally, the clustering method to be used might have some implications for the choice of the coefficient. For example, making a choice between several proximity coefficients with similar properties which are also known to be monotonically related can be avoided by employing a cluster method that depends only on the ranking of the proximities, not their absolute values.
+Finally, the clustering method to be used might have some implications for the choice of the coefficient. For example, making a choice between several proximity coefficients with similar properties which are also known to be monotonically related can be avoided by employing a cluster method that depends only on the ranking of the proximities, not their absolute values.
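
As a concrete illustration of the 'size' versus 'shape' distinction, here is a minimal R sketch (the dataset is illustrative): Euclidean distance is sensitive to overall magnitude, while a correlation-based dissimilarity compares profiles.

```r
x <- scale(USArrests)              # standardize the continuous variables first
d_size  <- dist(x)                 # Euclidean distance: driven by 'size'
d_shape <- as.dist(1 - cor(t(x)))  # correlation-type dissimilarity: 'shape'
```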

View file

@@ -1,4 +1,8 @@
-# Silhouette
+---
+title: Silhouette
+showthedate: false
+math: true
+---
 This technique validates the consistency within clusters of data. It provides a succinct graphical representation of how well each object lies in its cluster.
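
A quick sketch of computing and plotting silhouettes in R with the `cluster` package (the dataset and number of clusters are illustrative):

```r
library(cluster)                         # provides silhouette()
km  <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
sil <- silhouette(km$cluster, dist(iris[, 1:4]))
plot(sil)                                # one bar per observation, by cluster
mean(sil[, "sil_width"])                 # average width; close to 1 is good
```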

View file

@@ -1,4 +1,8 @@
-# Centroid-based Clustering
+---
+title: Centroid-based Clustering
+showthedate: false
+math: true
+---
 In centroid-based clustering, clusters are represented by some central vector which may or may not be a member of the dataset. In practice, the number of clusters is fixed to $k$ and the goal is to solve an optimization problem: place the $k$ centers and assign each object to its nearest center so that the distances to the centers are minimized.
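
A minimal sketch of that optimization with base R's `kmeans` (dataset and $k$ are illustrative):

```r
km <- kmeans(scale(USArrests), centers = 4, nstart = 25)
km$centers       # central vectors; not necessarily members of the dataset
km$tot.withinss  # the within-cluster sum of squares being minimized
```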

View file

@@ -1,4 +1,7 @@
-# Voronoi Diagram
+---
+title: Voronoi Diagram
+showthedate: false
+---
 A Voronoi diagram is a partitioning of a plane into regions based on distance to points in a specific subset of the plane.
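
For a quick picture, the `deldir` package (one assumed choice; any computational-geometry library with Voronoi support would do) can tessellate a random set of points:

```r
library(deldir)
set.seed(42)
pts  <- data.frame(x = runif(15), y = runif(15))
tess <- deldir(pts$x, pts$y)  # Delaunay triangulation + Voronoi tessellation
plot(tess, wlines = "tess")   # draw only the Voronoi region boundaries
```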

View file

@@ -1,4 +1,8 @@
-# K-means++
+---
+title: K-means++
+showthedate: false
+math: true
+---
 K-means++ is an algorithm for choosing the initial values or seeds for the k-means clustering algorithm. This was proposed as a way of avoiding the sometimes poor clustering found by a standard k-means algorithm.
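
Base R's `kmeans` does not provide this seeding itself, so here is a minimal sketch of the idea (an illustration, not the canonical implementation): pick the first center uniformly at random, then sample each subsequent center with probability proportional to the squared distance to the nearest center chosen so far.

```r
kmeanspp_seeds <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  centers <- integer(k)
  centers[1] <- sample.int(n, 1)               # first seed: uniform at random
  d2 <- colSums((t(x) - x[centers[1], ])^2)    # squared dist to nearest seed
  for (i in seq_len(k)[-1]) {
    centers[i] <- sample.int(n, 1, prob = d2)  # D^2-weighted sampling
    d2 <- pmin(d2, colSums((t(x) - x[centers[i], ])^2))
  }
  x[centers, , drop = FALSE]
}

# Hand the seeds to the standard algorithm:
km <- kmeans(iris[, 1:4], centers = kmeanspp_seeds(iris[, 1:4], 3))
```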

View file

@@ -1,4 +1,8 @@
-# K-Medoids
+---
+title: K-Medoids
+showthedate: false
+math: true
+---
 A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal.
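
In R, k-medoids is available as `pam` (Partitioning Around Medoids) in the `cluster` package; the dataset and $k$ below are illustrative:

```r
library(cluster)
fit <- pam(iris[, 1:4], k = 3)
fit$medoids     # actual observations serving as cluster centers
fit$clustering  # cluster assignment for each observation
```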

View file

@@ -1,4 +1,8 @@
-# K-Medians
+---
+title: K-Medians
+showthedate: false
+math: true
+---
 This is a variation of k-means clustering where we calculate the median for each cluster, rather than the mean, to determine its centroid.
@@ -16,4 +20,4 @@ Given an initial set of $k$ medians, the algorithm proceeds by alternating betwe
 The algorithm is known to have converged when assignments no longer change. There is no guarantee that the optimum is found using this algorithm.
-The result depends on the initial clusters. It is common to run this multiple times with different starting conditions.
+The result depends on the initial clusters. It is common to run this multiple times with different starting conditions.
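
A bare-bones sketch of those alternating steps (illustrative only: assignment uses Manhattan distance, the update takes the coordinate-wise median, and empty clusters are not handled):

```r
kmedians <- function(x, k, iters = 20) {
  x <- as.matrix(x)
  centers <- x[sample.int(nrow(x), k), , drop = FALSE]  # random initial medians
  cl <- integer(nrow(x))
  for (it in seq_len(iters)) {
    d <- sapply(seq_len(k), function(j)
      rowSums(abs(sweep(x, 2, centers[j, ]))))  # L1 distance to each median
    cl <- max.col(-d)                           # assign to the nearest median
    for (j in seq_len(k))                       # recompute coordinate-wise medians
      centers[j, ] <- apply(x[cl == j, , drop = FALSE], 2, median)
  }
  list(cluster = cl, centers = centers)
}
```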

View file

@@ -1,4 +1,8 @@
-# Introduction to Density Based Clustering
+---
+title: Introduction to Density Based Clustering
+showthedate: false
+math: true
+---
 In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in sparser areas are considered to be outliers or border points. This helps discover clusters of arbitrary shape.
@@ -31,7 +35,7 @@ A cluster then satisfies two properties:
 2. Find the connected components of *core* points on the neighborhood graph, ignoring all non-core points.
 3. Assign each non-core point to a nearby cluster if the cluster is an $\epsilon$ (eps) neighbor, otherwise assign it to noise.
-###Advantages
+### Advantages
 - Does not require one to specify the number of clusters in the data
 - Can find arbitrarily shaped clusters
@@ -53,4 +57,4 @@ $\epsilon$: Ideally the $k^{th}$ nearest neighbors are at roughly the same dista
 Example of Run Through
-https://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf
+https://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf
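
A short sketch with the `dbscan` package (parameters are illustrative); `kNNdistplot` is one way to apply the $\epsilon$ heuristic above, by looking for the knee where the $k^{th}$-nearest-neighbor distances stop being roughly equal:

```r
library(dbscan)
x <- as.matrix(iris[, 1:4])
kNNdistplot(x, k = 4)                   # choose eps near the knee of this curve
db <- dbscan(x, eps = 0.5, minPts = 5)
table(db$cluster)                       # cluster 0 collects the noise points
```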

View file

@@ -1,4 +1,7 @@
-# Why use different distance measures?
+---
+title: Why use different distance measures?
+showthedate: false
+---
 I made an attempt to find out in what situations people use different distance measures. Looking around on the Internet usually produces results like "It depends on the problem" or "I typically just always use Euclidean".
@@ -31,4 +34,4 @@ https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-
-Hopefully in this course, we'll discover more properties as to why it makes sense to use different distance measures since it can have an impact on how our clusters are formed.
+Hopefully in this course, we'll discover more properties as to why it makes sense to use different distance measures since it can have an impact on how our clusters are formed.
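
One small, concrete way to see the impact: the same pair of points is a different "distance apart" under different measures, so the proximity matrix (and hence the clusters) can change.

```r
x <- rbind(c(0, 0), c(3, 4))
dist(x, method = "euclidean")         # 5
dist(x, method = "manhattan")         # 7
dist(x, method = "minkowski", p = 3)  # about 4.5
```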

View file

@@ -1,4 +1,8 @@
-# Principal Component Analysis Pt. 1
+---
+title: Principal Component Analysis Pt. 1
+showthedate: false
+math: true
+---
 ## What is PCA?
@@ -50,4 +54,4 @@ pcal = function(data) {
   names(combined_list) = c("Loadings", "Components")
   return(combined_list)
 }
-```
+```
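
Base R's `prcomp` returns the same two pieces that the `pcal` helper above bundles into `Loadings` and `Components`; a quick sketch with an illustrative dataset:

```r
fit <- prcomp(USArrests, center = TRUE, scale. = TRUE)
fit$rotation  # the loadings
head(fit$x)   # the principal components (scores)
summary(fit)  # proportion of variance explained by each component
```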

View file

@@ -1,4 +1,8 @@
-# Revisiting Similarity Measures
+---
+title: Revisiting Similarity Measures
+showthedate: false
+math: true
+---
 ## Manhattan Distance

View file

@@ -1,4 +1,8 @@
-# Cluster Tendency
+---
+title: Cluster Tendency
+showthedate: false
+math: true
+---
 This is the assessment of the suitability of clustering. Cluster tendency determines whether the data has any inherent grouping structure.
@@ -37,4 +41,4 @@ Divide each dimension into equal-width bins, and count how many points lie in each
 Do the same for the randomly sampled data.
-Finally, compute how much they differ using the Kullback-Leibler (KL) divergence value. If they differ greatly, then we can say that the data is clusterable.
+Finally, compute how much they differ using the Kullback-Leibler (KL) divergence value. If they differ greatly, then we can say that the data is clusterable.
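
A rough sketch of that binning procedure (my own illustration of the steps above, assuming equal-width bins, a uniform random reference sample, and add-one smoothing to avoid dividing by zero):

```r
kl_tendency <- function(x, bins = 10) {
  x   <- as.matrix(x)
  ref <- apply(x, 2, function(col) runif(length(col), min(col), max(col)))
  bin_counts <- function(m)                 # per-dimension equal-width bin counts
    unlist(lapply(seq_len(ncol(m)), function(j)
      tabulate(cut(m[, j], breaks = bins, labels = FALSE), nbins = bins)))
  p <- bin_counts(x) + 1                    # add-one smoothing avoids log(0)
  q <- bin_counts(ref) + 1
  p <- p / sum(p); q <- q / sum(q)
  sum(p * log(p / q))  # KL divergence: larger values suggest clusterable data
}
```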

View file

@@ -1,4 +1,8 @@
-# Principal Component Analysis Part 2: Formal Theory
+---
+title: Principal Component Analysis Part 2 - Formal Theory
+showthedate: false
+math: true
+---
 ## Properties of PCA
@@ -168,4 +172,4 @@ Principal Component Analysis is typically used in dimensionality reduction efforts
 - Exclude principal components where eigenvalues are less than one.
 - Generate a Scree Plot
   - Stop when the plot goes from "steep" to "shallow"
-  - Stop when it essentially becomes a straight line.
+  - Stop when it essentially becomes a straight line.
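
Both stopping rules are easy to check in R (dataset is illustrative):

```r
fit <- prcomp(USArrests, scale. = TRUE)
screeplot(fit, type = "lines")  # look for the steep-to-shallow bend
sum(fit$sdev^2 > 1)             # count of components with eigenvalue > 1
```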

View file

@@ -1,4 +1,7 @@
-# Introduction to Connectivity Based Models
+---
+title: Introduction to Connectivity Based Models
+showthedate: false
+---
 Hierarchical algorithms combine observations to form clusters based on their distance.
@@ -32,4 +35,4 @@ Or do you want to base it on the farthest observations in each cluster? Farthest n
 This method is not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge depending on the clustering method.
-As we go through this section, we will go into detail about the different linkage criteria and other parameters of this model.
+As we go through this section, we will go into detail about the different linkage criteria and other parameters of this model.
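
In R, that nearest-versus-farthest choice is the `method` argument of `hclust` (dataset is illustrative):

```r
d <- dist(scale(USArrests))
hc_nearest  <- hclust(d, method = "single")    # nearest-neighbor linkage
hc_farthest <- hclust(d, method = "complete")  # farthest-neighbor linkage
```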

View file

@@ -1,4 +1,8 @@
-# Agglomerative Methods
+---
+title: Agglomerative Methods
+showthedate: false
+math: true
+---
 ## Single Linkage
@@ -87,4 +91,4 @@ Since single linkage joins clusters by the shortest link between them, the techn
 ## Dendrograms
-A **dendrogram** is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. It shows how different clusters are formed at different distance groupings.
+A **dendrogram** is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. It shows how different clusters are formed at different distance groupings.
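
A minimal sketch of producing and cutting a dendrogram in base R (dataset, linkage, and number of clusters are illustrative):

```r
hc <- hclust(dist(scale(USArrests)), method = "single")
plot(hc)                     # merge heights show the distance groupings
groups <- cutree(hc, k = 4)  # cut the tree to recover flat clusters
```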

View file

@@ -1,4 +1,8 @@
-# Divisive Methods Pt.1
+---
+title: Divisive Methods Pt.1
+showthedate: false
+math: true
+---
 Divisive methods work in the opposite direction of agglomerative methods. They take one large cluster and successively split it.
@@ -39,7 +43,7 @@ This is sometimes termed *association analysis*.
 | 1 | a | b |
 | 0 | c | d |
-####Common measures of association
+#### Common measures of association
 $$
 |ad-bc| \tag{4.6}
@@ -71,4 +75,4 @@ Appealing features of monothetic divisive methods are the easy classification of
 A further advantage of monothetic divisive methods is that it is obvious which variables produce the split at any stage of the process.
-A disadvantage with these methods is that the possession of a particular attribute which is either rare or rarely found in combination with others may take an individual down a different path.
+A disadvantage with these methods is that the possession of a particular attribute which is either rare or rarely found in combination with others may take an individual down a different path.
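
As a small worked example of measure (4.6), here is a sketch that builds the $2 \times 2$ table above from two binary variables and computes $|ad - bc|$ (function name and data are illustrative):

```r
association <- function(x, y) {
  tab <- table(factor(x, levels = c(1, 0)), factor(y, levels = c(1, 0)))
  a <- tab[1, 1]; b <- tab[1, 2]; c <- tab[2, 1]; d <- tab[2, 2]
  abs(a * d - b * c)
}
association(c(1, 1, 0, 0, 1), c(1, 0, 0, 0, 1))  # a=2, b=1, c=0, d=2: |2*2 - 1*0| = 4
```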

View file

@@ -1,4 +1,7 @@
-# Divisive Methods Pt 2.
+---
+title: Divisive Methods Pt 2.
+showthedate: false
+---
 Recall in the previous section that we spoke about monothetic and polythetic methods. Monothetic methods only look at a single variable at a time while polythetic methods look at multiple variables simultaneously. In this section, we will speak more about polythetic divisive methods.
@@ -45,4 +48,4 @@ In most methods of hierarchical clustering this is achieved by the use of an appro
 - Any valid distance measure can be used
 - In most cases, the observations themselves are not required, just the matrix of distances
-- This can have the advantage of only having to store a distance matrix in memory as opposed to an n-dimensional matrix.
+- This can have the advantage of only having to store a distance matrix in memory as opposed to an n-dimensional matrix.
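
One polythetic divisive method available in R is `diana` from the `cluster` package; note that it runs directly off the distance matrix, matching the last point above (dataset is illustrative):

```r
library(cluster)
d  <- dist(iris[, 1:4])  # only the distances need to be stored
dv <- diana(d)           # DIvisive ANAlysis clustering
plot(dv)                 # banner plot and dendrogram
groups <- cutree(as.hclust(dv), k = 3)
```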

View file

@@ -1,4 +1,8 @@
-# CURE and TSNE
+---
+title: CURE and TSNE
+showthedate: false
+math: true
+---
 ## Clustering Using Representatives (CURE)

View file

@@ -1,4 +1,8 @@
-# Cluster Validation
+---
+title: Cluster Validation
+showthedate: false
+math: true
+---
 There are multiple approaches to validating your cluster models.
@@ -69,4 +73,4 @@ Using internal evaluation metrics, you can see the impact of each point by doing
 `clValid` contains a variety of internal validation measures.
-Paper: https://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf
+Paper: https://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf
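
A sketch of running those internal measures with `clValid` (dataset, methods, and range of $k$ are illustrative):

```r
library(clValid)
res <- clValid(as.matrix(iris[, 1:4]), nClust = 2:6,
               clMethods = c("hierarchical", "kmeans", "pam"),
               validation = "internal")
summary(res)  # connectivity, Dunn index, silhouette width per method and k
```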