Fixed titles, math rendering, and links on some pages
parent 9f096a8720
commit 330ace0de9
61 changed files with 303 additions and 115 deletions
@ -1,4 +1,8 @@
# Measures of similarity
---
title: Measures of similarity
showthedate: false
math: true
---

To identify clusters of observations we need to know how **close individuals are to each other** or **how far apart they are**.
@ -328,4 +332,4 @@ Firstly, the nature of the data should strongly influence the choice of the prox

Next, the choice of measure should depend on the scale of the data. Similarity coefficients should be used when the data is binary. For continuous data, a distance or correlation-type dissimilarity measure should be used according to whether the 'size' or 'shape' of the objects is of interest.
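A minimal sketch of this guidance in base R (the example matrices are made up, not from the notes): `dist()` gives distance-type dissimilarities for continuous data and an asymmetric binary (Jaccard-style) coefficient for binary data, while a correlation-type dissimilarity captures 'shape'.

```r
# Continuous data: distance- or correlation-type dissimilarities
continuous <- matrix(rnorm(20), nrow = 5)        # 5 observations, 4 numeric variables
dist(continuous, method = "euclidean")           # "size"-sensitive dissimilarity
1 - cor(t(continuous))                           # "shape"-sensitive (correlation-type)

# Binary data: a similarity coefficient instead
binary <- matrix(rbinom(20, 1, 0.5), nrow = 5)   # 5 observations, 4 binary attributes
dist(binary, method = "binary")                  # asymmetric binary (Jaccard) coefficient
```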

Finally, the clustering method to be used might have some implications for the choice of the coefficient. For example, making a choice between several proximity coefficients with similar properties which are also known to be monotonically related can be avoided by employing a cluster method that depends only on the ranking of the proximities, not their absolute values.
@ -1,4 +1,8 @@
# Silhouette
---
title: Silhouette
showthedate: false
math: true
---

This technique validates the consistency within clusters of data. It provides a succinct graphical representation of how well each object lies in its cluster.
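A minimal sketch using the `cluster` package (assumed to be installed); the dataset and the choice of three clusters are illustrative:

```r
library(cluster)

X   <- scale(iris[, 1:4])                # example numeric data
km  <- kmeans(X, centers = 3)            # any clustering that produces labels works
sil <- silhouette(km$cluster, dist(X))   # silhouette width for each observation
plot(sil)                                # the graphical summary described above
summary(sil)$avg.width                   # average silhouette width over all clusters
```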
@ -1,4 +1,8 @@
# Centroid-based Clustering
---
title: Centroid-based Clustering
showthedate: false
math: true
---

In centroid-based clustering, clusters are represented by some central vector which may or may not be a member of the dataset. In practice, the number of clusters is fixed to $k$ and the goal is to solve some sort of optimization problem.
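For instance, k-means (covered later in this section) fixes $k$ and searches for a partition $S = \{S_1, \dots, S_k\}$ that minimizes the within-cluster sum of squares, where $\mu_i$ is the centroid of $S_i$:

$$
\underset{S}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
$$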
@ -1,4 +1,7 @@
# Voronoi Diagram
---
title: Voronoi Diagram
showthedate: false
---

A Voronoi diagram is a partitioning of a plane into regions based on distance to points in a specific subset of the plane.
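A minimal base-R sketch of this idea (the seed points and grid are made up): colouring every point of a grid by its nearest seed traces out the Voronoi regions.

```r
seeds <- cbind(x = c(1, 4, 2), y = c(1, 1, 4))                  # the subset of the plane
grid  <- expand.grid(x = seq(0, 5, by = 0.1), y = seq(0, 5, by = 0.1))

# index of the nearest seed for every grid point (squared Euclidean distance)
nearest <- apply(grid, 1, function(p) which.min(colSums((t(seeds) - p)^2)))

plot(grid, col = nearest, pch = 15, cex = 0.4)                  # each colour is one Voronoi region
points(seeds, pch = 19, cex = 1.5)
```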
@ -1,4 +1,8 @@
# K-means++
---
title: K-means++
showthedate: false
math: true
---

K-means++ is an algorithm for choosing the initial values or seeds for the k-means clustering algorithm. This was proposed as a way of avoiding the sometimes poor clustering found by a standard k-means algorithm.
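A minimal sketch of the seeding step in base R (squared Euclidean distance and the example dataset are assumptions, not from the notes); the chosen seeds are then handed to ordinary k-means:

```r
kmeanspp_seeds <- function(X, k) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), 1), , drop = FALSE]   # first seed chosen uniformly at random
  while (nrow(centers) < k) {
    # squared distance from every point to its nearest already-chosen seed
    d2 <- apply(X, 1, function(x) min(colSums((t(centers) - x)^2)))
    # next seed drawn with probability proportional to that squared distance
    centers <- rbind(centers, X[sample(nrow(X), 1, prob = d2), , drop = FALSE])
  }
  centers
}

seeds <- kmeanspp_seeds(iris[, 1:4], k = 3)
kmeans(iris[, 1:4], centers = seeds)                 # standard k-means with a better start
```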
@ -1,4 +1,8 @@
# K-Medoids
---
title: K-Medoids
showthedate: false
math: true
---

A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal.
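A minimal sketch of that definition, plus the `pam` (Partitioning Around Medoids) routine from the `cluster` package that applies it per cluster; the dataset and $k$ are illustrative:

```r
X <- as.matrix(iris[, 1:4])
D <- as.matrix(dist(X))     # pairwise dissimilarities
which.min(rowMeans(D))      # the medoid: the object with minimal average dissimilarity

library(cluster)
pam(X, k = 3)$medoids       # one medoid per cluster
```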
@ -1,4 +1,8 @@
# K-Medians
---
title: K-Medians
showthedate: false
math: true
---

This is a variation of k-means clustering where, instead of calculating the mean for each cluster to determine its centroid, we calculate the median instead.
@ -16,4 +20,4 @@ Given an initial set of $k$ medians, the algorithm proceeds by alternating betwe

The algorithm is known to have converged when assignments no longer change. There is no guarantee that the optimum is found using this algorithm.

The result depends on the initial clusters. It is common to run this multiple times with different starting conditions.
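A minimal sketch of the alternation in base R, assuming Manhattan (L1) assignment and a coordinate-wise median update (both common choices, though the notes do not pin them down); empty clusters are not handled:

```r
kmedians <- function(X, k, iters = 25) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]   # random initial medians
  for (i in seq_len(iters)) {
    # assignment step: nearest median under L1 (Manhattan) distance
    labels <- apply(X, 1, function(x) which.min(colSums(abs(t(centers) - x))))
    # update step: coordinate-wise median of each cluster
    centers <- t(sapply(seq_len(k), function(j)
      apply(X[labels == j, , drop = FALSE], 2, median)))
  }
  list(centers = centers, cluster = labels)
}

# Run several times with different starting medians and keep the lowest total L1 cost.
fits <- replicate(10, kmedians(iris[, 1:4], k = 3), simplify = FALSE)
cost <- function(fit, X) sum(abs(as.matrix(X) - fit$centers[fit$cluster, ]))
best <- fits[[which.min(sapply(fits, cost, X = iris[, 1:4]))]]
```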
@ -1,4 +1,8 @@
# Introduction to Density Based Clustering
---
title: Introduction to Density Based Clustering
showthedate: false
math: true
---

In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set. Objects in sparser areas are considered to be outliers or border points. This helps discover clusters of arbitrary shape.
@ -31,7 +35,7 @@ A cluster then satisfies two properties:
2. Find the connected components of *core* points on the neighborhood graph, ignoring all non-core points.
3. Assign each non-core point to a nearby cluster if the cluster is an $\epsilon$ (eps) neighbor, otherwise assign it to noise.
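A minimal sketch of these steps using the `dbscan` package (assumed to be installed); the `eps` and `minPts` values are illustrative, not recommendations:

```r
library(dbscan)

X   <- as.matrix(iris[, 1:4])
fit <- dbscan(X, eps = 0.5, minPts = 5)   # core points, connected components, noise
fit$cluster                               # cluster labels; 0 marks points assigned to noise
```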

###Advantages
### Advantages

- Does not require one to specify the number of clusters in the data
- Can find arbitrarily shaped clusters
@ -53,4 +57,4 @@ $\epsilon$: Ideally the $k^{th}$ nearest neighbors are at roughly the same dista

Example of a run-through:

https://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf
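As a sketch of the $k$-nearest-neighbor heuristic for picking $\epsilon$ mentioned in the hunk above (`kNNdistplot` is from the `dbscan` package; the `k` and the reference line are illustrative):

```r
library(dbscan)

kNNdistplot(as.matrix(iris[, 1:4]), k = 5)   # sorted distance to each point's 5th nearest neighbor
abline(h = 0.5, lty = 2)                     # read eps off roughly where the curve "elbows"
```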
@ -1,4 +1,7 @@
# Why use different distance measures?
---
title: Why use different distance measures?
showthedate: false
---

I made an attempt to find out in what situations people use different distance measures. Looking around on the Internet usually produces answers like "It depends on the problem" or "I typically just always use Euclidean."
@ -31,4 +34,4 @@ https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-

Hopefully in this course, we'll discover more properties as to why it makes sense to use different distance measures since it can have an impact on how our clusters are formed.
@ -1,4 +1,8 @@
# Principal Component Analysis Pt. 1
---
title: Principal Component Analysis Pt. 1
showthedate: false
math: true
---

## What is PCA?

@ -50,4 +54,4 @@ pcal = function(data) {
  names(combined_list) = c("Loadings", "Components")
  return(combined_list)
}
```
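The `pcal` helper is only partially visible in this hunk; purely as a point of reference (not the author's function), base R's `prcomp` produces the analogous pieces:

```r
fit <- prcomp(iris[, 1:4], scale. = TRUE)
fit$rotation   # loadings
fit$x          # component scores
```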
@ -1,4 +1,8 @@
# Revisiting Similarity Measures
---
title: Revisiting Similarity Measures
showthedate: false
math: true
---

## Manhattan Distance
@ -1,4 +1,8 @@
# Cluster Tendency
---
title: Cluster Tendency
showthedate: false
math: true
---

This is the assessment of the suitability of clustering. Cluster Tendency determines whether the data has any inherent grouping structure.
@ -37,4 +41,4 @@ Divide each dimension in equal width bins, and count how many points lie in each

Do the same for the randomly sampled data.

Finally, compute how much they differ using the Kullback-Leibler (KL) divergence value. If it differs greatly, then we can say that the data is clusterable.
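A minimal sketch of this binned comparison in base R (the single dimension, the number of bins, and the uniform reference sample are all illustrative choices):

```r
x   <- iris$Sepal.Length                          # one dimension of the real data
ref <- runif(length(x), min(x), max(x))           # randomly sampled reference data

breaks <- seq(min(x), max(x), length.out = 11)    # 10 equal-width bins
p <- hist(x,   breaks = breaks, plot = FALSE)$counts
q <- hist(ref, breaks = breaks, plot = FALSE)$counts

p <- (p + 1) / sum(p + 1)                         # smooth to avoid empty bins, then normalize
q <- (q + 1) / sum(q + 1)
sum(p * log(p / q))                               # KL divergence; large values suggest clusterability
```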
@ -1,4 +1,8 @@
# Principal Component Analysis Part 2: Formal Theory
---
title: Principal Component Analysis Part 2 - Formal Theory
showthedate: false
math: true
---

## Properties of PCA
@ -168,4 +172,4 @@ Principal Component Analysis is typically used in dimensionality reduction effor

- Exclude principal components where eigenvalues are less than one.
- Generate a Scree Plot
- Stop when the plot goes from "steep" to "shallow"
- Stop when it essentially becomes a straight line.
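A minimal base-R sketch of the two stopping rules listed above (the dataset is illustrative):

```r
fit <- prcomp(iris[, 1:4], scale. = TRUE)

eigenvalues <- fit$sdev^2
which(eigenvalues >= 1)         # keep components whose eigenvalue is at least one
screeplot(fit, type = "lines")  # look for the bend from "steep" to "shallow"
```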
@ -1,4 +1,7 @@
# Introduction to Connectivity Based Models
---
title: Introduction to Connectivity Based Models
showthedate: false
---

Hierarchical algorithms combine observations to form clusters based on their distance.
@ -32,4 +35,4 @@ Or do you want to based on the farthest observations in each cluster? Farthest n

This method is not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge depending on the clustering method.

As we go through this section, we will go into detail about the different linkage criteria and other parameters of this model.
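As a minimal sketch of how the linkage choice enters in practice (base R's `hclust`; the dataset is illustrative), nearest- and farthest-neighbour criteria are just different `method` arguments:

```r
d <- dist(iris[, 1:4])                 # hierarchical clustering only needs the distances
plot(hclust(d, method = "single"))     # nearest-neighbour (single) linkage
plot(hclust(d, method = "complete"))   # farthest-neighbour (complete) linkage
```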
@ -1,4 +1,8 @@
# Agglomerative Methods
---
title: Agglomerative Methods
showthedate: false
math: true
---

## Single Linkage
@ -87,4 +91,4 @@ Since single linkage joins clusters by the shortest link between them, the techn

## Dendrograms

A **dendrogram** is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. It shows how different clusters are formed at different distance groupings.
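A minimal sketch of reading clusters off a dendrogram (base R; the cut height of 1.5 is illustrative):

```r
hc <- hclust(dist(iris[, 1:4]), method = "single")
plot(hc)                   # the dendrogram itself
rect.hclust(hc, h = 1.5)   # boxes around the clusters formed below that distance
cutree(hc, h = 1.5)        # the corresponding cluster labels
```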
@ -1,4 +1,8 @@
# Divisive Methods Pt.1
---
title: Divisive Methods Pt.1
showthedate: false
math: true
---

Divisive methods work in the opposite direction of agglomerative methods. They take one large cluster and successively split it.
@ -39,7 +43,7 @@ This is sometimes termed *association analysis*.

| 1 | a | b |
| 0 | c | d |

####Common measures of association
#### Common measures of association

$$
|ad-bc| \tag{4.6}
@ -71,4 +75,4 @@ Appealing features of monothetic divisive methods are the easy classification of

A further advantage of monothetic divisive methods is that it is obvious which variables produce the split at any stage of the process.

A disadvantage with these methods is that the possession of a particular attribute which is either rare or rarely found in combination with others may take an individual down a different path.
@ -1,4 +1,7 @@
# Divisive Methods Pt 2.
---
title: Divisive Methods Pt 2.
showthedate: false
---

Recall that in the previous section we spoke about Monothetic and Polythetic methods. Monothetic methods look at only a single variable at a time, while polythetic methods look at multiple variables simultaneously. In this section, we will speak more about polythetic divisive methods.
@ -45,4 +48,4 @@ In most methods of hierarchical clustering this is achieved by a use of an appro

- Any valid measure of distance can be used
- In most cases, the observations themselves are not required, just the matrix of distances
- This can have the advantage of only having to store a distance matrix in memory as opposed to an n-dimensional matrix.
@ -1,4 +1,8 @@
# CURE and TSNE
---
title: CURE and TSNE
showthedate: false
math: true
---

## Clustering Using Representatives (CURE)
@ -1,4 +1,8 @@
# Cluster Validation
---
title: Cluster Validation
showthedate: false
math: true
---

There are multiple approaches to validating your cluster models.
@ -69,4 +73,4 @@ Using internal evaluation metrics, you can see the impact of each point by doing

`clValid` contains a variety of internal validation measures.
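A minimal sketch of calling `clValid` (the dataset, the range of cluster counts, and the methods compared are illustrative choices):

```r
library(clValid)

result <- clValid(scale(iris[, 1:4]), nClust = 2:6,
                  clMethods = c("hierarchical", "kmeans", "pam"),
                  validation = "internal")
summary(result)   # reports connectivity, Dunn index, and average silhouette width
```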

Paper: https://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf