Website snapshot
This commit is contained in:
parent
ee0ab66d73
commit
50ec3688a5
281 changed files with 21066 additions and 0 deletions
0
content/research/_index.md
Normal file
22
content/research/clusteranalysis.md
Normal file
|
|
@ -0,0 +1,22 @@
|
|||
---
|
||||
Title: Cluster Analysis
|
||||
Description: A study of grouping observations
|
||||
---
|
||||
|
||||
# Cluster Analysis
|
||||
Cluster Analysis is the art of finding inherent structures in data to form groups of similar observations. This has a myriad of applications from recommendation engines to social network analysis.
|
||||
|
||||
This is an independent study, meaning that I will be studying this topic under the direction of a professor, in this case Dr. Denhere.
|
||||
|
||||
I have provided a list of topics that I wish to explore in a [syllabus](syllabus).
|
||||
|
||||
Dr. Denhere likes to approach independent studies from both a theoretical and an applied angle: I will learn the theory behind the different algorithms and then figure out a way to apply them to a dataset.
|
||||
|
||||
## Readings
|
||||
There is no definitive textbook for this course. Instead, Dr. Denhere and I search for materials that we think best demonstrate the topic at hand.
|
||||
|
||||
I have created a [Reading Page](readings) to keep track of the different reading materials.
|
||||
|
||||
|
||||
## Learning Notes
|
||||
I like to type up the content I learn from different sources. A [notes page](notes) keeps track of the content discussed at each meeting.
|
||||
43
content/research/clusteranalysis/notes.md
Normal file
|
|
@ -0,0 +1,43 @@
|
|||
# Lecture Notes for Cluster Analysis
|
||||
|
||||
[Lecture 1: Measures of Similarity](lec1)
|
||||
|
||||
[Lecture 2.1: Distance Measures Reasoning](lec2-1)
|
||||
|
||||
[Lecture 2.2: Principal Component Analysis Pt. 1](lec2-2)
|
||||
|
||||
Lecture 3: Discussion of Dataset
|
||||
|
||||
[Lecture 4: Principal Component Analysis Pt. 2](lec4)
|
||||
|
||||
[Lecture 4.2: Revisiting Measures](lec4-2)
|
||||
|
||||
[Lecture 4.3: Cluster Tendency](lec4-3)
|
||||
|
||||
[Lecture 5: Introduction to Connectivity Based Models](lec5)
|
||||
|
||||
[Lecture 6: Agglomerative Methods](lec6)
|
||||
|
||||
[Lecture 7: Divisive Methods Part 1: Monothetic](lec7)
|
||||
|
||||
[Lecture 8: Divisive Methods Part 2: Polythetic](lec8)
|
||||
|
||||
[Lecture 9.1: CURE and TSNE](lec9-1)
|
||||
|
||||
[Lecture 9.2: Cluster Validation Part I](lec9-2)
|
||||
|
||||
[Lecture 10.1: Silhouette Coefficient](lec10-1)
|
||||
|
||||
[Lecture 10.2: Centroid-Based Clustering](lec10-2)
|
||||
|
||||
[Lecture 10.3: Voronoi Diagrams](lec10-3)
|
||||
|
||||
[Lecture 11.1: K-means++](lec11-1)
|
||||
|
||||
[Lecture 11.2: K-medoids](lec11-2)
|
||||
|
||||
[Lecture 11.3: K-medians](lec11-3)
|
||||
|
||||
[Lecture 12: Introduction to Density Based Clustering](lec12)
|
||||
|
||||
|
||||
331
content/research/clusteranalysis/notes/lec1.md
Normal file
|
|
@ -0,0 +1,331 @@
|
|||
# Measures of similarity
|
||||
|
||||
To identify clusters of observations we need to know how **close individuals are to each other** or **how far apart they are**.
|
||||
|
||||
Two individuals are 'close' when their dissimilarity or distance is small and their similarity is large.
|
||||
|
||||
Special attention will be paid to proximity measures suitable for data consisting of repeated measures of the same variable, for example taken at different time points.
|
||||
|
||||
## Similarity Measures for Categorical Data
|
||||
|
||||
Measures are generally scaled to be in the interval $[0, 1]$, although occasionally they are expressed as percentages in the range $0-100\%$
|
||||
|
||||
A similarity value of one indicates that both observations have identical values for all variables.
|
||||
|
||||
A similarity value of zero indicates that the two individuals differ maximally on all variables.
|
||||
|
||||
### Similarity Measures for Binary Data
|
||||
|
||||
An extensive list of similarity measures for binary data exists; the large number of possible measures stems from the apparent uncertainty as to how to **deal with the count of zero-zero matches**.
|
||||
|
||||
In some cases, zero-zero matches are equivalent to one-one matches and therefore should be included in the calculated similarity measure
|
||||
|
||||
<u>Example</u>: Gender, where there is no preference as to which of the two categories should be coded as zero or one
|
||||
|
||||
In other cases the inclusion or otherwise of the matches is more problematic
|
||||
|
||||
<u>Example</u>: When the zero category corresponds to the genuine absence of some property, such as wings in a study of insects
|
||||
|
||||
The question that then needs to be asked is do the co-absences contain useful information about the similarity of the two objects?
|
||||
|
||||
Attributing a high degree of similarity to a pair of individuals simply because they both lack a large number of attributes may not be sensible in many situations
|
||||
|
||||
The following table will help when it comes to interpreting the measures.
|
||||
|
||||

|
||||
|
||||
Measures that ignore co-absences (zero-zero matches) are Jaccard's coefficient (S2) and Sneath and Sokal (S4).
|
||||
|
||||
When co-absences are considered informative, the simple matching coefficient (S1) is usually employed.
|
||||
|
||||
Measures S3 and S5 are further examples of symmetric coefficients that treat positive matches (a) and negative matches (d) in the same way.
|
||||
|
||||

|
||||
|
||||
### Similarity Measures for Categorical Data with More Than Two Levels
|
||||
|
||||
Categorical data where the variables have more than two levels (for example, eye color) could be dealt with in a similar way to binary data, with each level of a variable being regarded as a single binary variable.
|
||||
|
||||
This is not an attractive approach, however, simply because of the large number of ‘negative’ matches which will inevitably be involved.
|
||||
|
||||
A superior method is to allocate a score of zero or one to each variable depending on whether the two observations take the same value on that variable. These scores are then averaged over all $p$ variables to give the required similarity coefficient as
|
||||
$$
|
||||
s_{ij} = \frac{1}{p}\sum_{k = 1}^p{s_{ijk}}
|
||||
$$
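As a concrete illustration of this averaging, here is a minimal R sketch; the function name `categorical_similarity` and the example vectors are made up for illustration.

```R
# Proportion of variables on which two observations agree
# (s_ijk = 1 if observations i and j match on variable k, 0 otherwise)
categorical_similarity <- function(obs_i, obs_j) {
  mean(obs_i == obs_j)
}

# Two observations described by p = 3 categorical variables
categorical_similarity(c("blue", "tall", "A"), c("brown", "tall", "A"))  # 2/3
```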
|
||||
|
||||
### Dissimilarity and Distance Measures for Continuous Data
|
||||
|
||||
A **metric** on a set $X$ is a distance function
|
||||
$$
|
||||
d : X \times X \to [0, \infty)
|
||||
$$
|
||||
where $[0, \infty)$ is the set of non-negative real numbers and for all $x, y, z \in X$, the following conditions are satisfied
|
||||
|
||||
1. $d(x, y) \ge 0$ non-negativity or separation axiom
2. $d(x, y) = 0 \iff x = y$ identity of indiscernibles
3. $d(x, y) = d(y, x)$ symmetry
4. $d(x, z) \le d(x, y) + d(y, z)$ subadditivity or triangle inequality
|
||||
|
||||
Conditions 1 and 2 define a positive-definite function
|
||||
|
||||
All distance measures are formulated so as to allow for differential weighting of the quantitative variables $w_k$ denotes the nonnegative weights of $p$ variables
|
||||
|
||||

|
||||
|
||||
Proposed dissimilarity measures can be broadly divided into distance measures and correlation-type measures.
|
||||
|
||||
#### Distance Measures
|
||||
|
||||
##### $L^p$ Space
|
||||
|
||||
The Minkowski distance is a metric in normed vector space which can be considered as a generalization of both the Euclidean distance and the Manhattan distance
|
||||
$$
|
||||
D(X, Y) = (\sum_{i = 1}^n{w_i^p|x_i - y_i|^p})^{\frac{1}{p}}
|
||||
$$
|
||||
This is a metric for $p \ge 1$.
|
||||
|
||||
###### Manhattan Distance
|
||||
|
||||
This is the case in the Minkowski distance when $p = 1$
|
||||
$$
|
||||
d(X, Y) = \sum_{i = 1}^n{w_i|x_i - y_i|}
|
||||
$$
|
||||
Manhattan distance depends on the rotation of the coordinate system, but does not depend on its reflection about a coordinate axis or its translation
|
||||
$$
|
||||
d(x, y) = d(-x, -y)
|
||||
$$
|
||||
|
||||
$$
|
||||
d(x, y) = d(x + a, y + a)
|
||||
$$
|
||||
|
||||
Shortest paths are not unique in this metric
|
||||
|
||||
###### Euclidean Distance
|
||||
|
||||
This is the case in the Minkowski distance when $p = 2$. The Euclidean distance between points $X$ and $Y$ is the length of the line segment connecting them.
|
||||
$$
|
||||
d(X, Y) = \sqrt{\sum_{i = 1}^n{w_i^2(x_i - y_i)^2}}
|
||||
$$
|
||||
The shortest path between two points is unique under this metric. This distance metric is also translation and rotation invariant.
|
||||
|
||||
###### Squared Euclidean Distance
|
||||
|
||||
The standard Euclidean distance can be squared in order to place progressively greater weight on objects that are farther apart. In this case, the equation becomes
|
||||
$$
|
||||
d(X, Y) = \sum_{i = 1}^n{w_i^2(x_i - y_i)^2}
|
||||
$$
|
||||
Squared Euclidean Distance is not a metric as it does not satisfy the [triangle inequality](https://en.wikipedia.org/wiki/Triangle_inequality), however, it is frequently used in optimization problems in which distances only have to be compared.
|
||||
|
||||
###### Chebyshev Distance
|
||||
|
||||
The Chebyshev distance is where the distance between two vectors is the greatest of their differences along any coordinate dimension.
|
||||
|
||||
It is also known as **chessboard distance**, since in the game of [chess](https://en.wikipedia.org/wiki/Chess) the minimum number of moves needed by a [king](https://en.wikipedia.org/wiki/King_(chess)) to go from one square on a [chessboard](https://en.wikipedia.org/wiki/Chessboard) to another equals the Chebyshev distance
|
||||
$$
|
||||
d(X, Y) = \lim_{p \to \infty}{(\sum_{i = 1}^n{|x_i - y_i|^p})}^\frac{1}{p}
|
||||
$$
|
||||
|
||||
$$
|
||||
= \max_i(|x_i - y_i|)
|
||||
$$
|
||||
|
||||
Chebyshev distance is translation invariant
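The whole $L^p$ family above is available through base R's `dist()` function. A small sketch; the weights $w_i$ are not passed to `dist()` directly, so one workaround assumed here is to rescale each column by its weight beforehand.

```R
set.seed(1)
X <- matrix(rnorm(20), nrow = 5)       # 5 observations, 4 variables

dist(X, method = "manhattan")          # p = 1
dist(X, method = "euclidean")          # p = 2
dist(X, method = "minkowski", p = 3)   # general L^p
dist(X, method = "maximum")            # Chebyshev (p -> infinity)

# Differential weighting by rescaling columns before computing distances
w <- c(1, 2, 1, 0.5)
dist(sweep(X, 2, w, `*`), method = "manhattan")
```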
|
||||
|
||||
##### Canberra Distance Measure
|
||||
|
||||
The Canberra distance (D4) is a weighted version of the $L_1$ Manhattan distance. This measure is very sensitive to small changes close to $x_{ik} = x_{jk} = 0$.
|
||||
|
||||
It is often regarded as a generalization of the dissimilarity measure for binary data. In this context the measure can be divided by the number of variables, $p$, to ensure a dissimilarity coefficient in the interval $[0, 1]$
|
||||
|
||||
It can then be shown that this measure for binary variables is just one minus the matching coefficient.
|
||||
|
||||
### Correlation Measures
|
||||
|
||||
It has often been suggested that the correlation between two observations can be used to quantify the similarity between them.
|
||||
|
||||
Since for correlation coefficients we have that $-1 \le \phi_{ij} \le 1$ with the value ‘1’ reflecting the strongest possible positive relationship and the value ‘-1’ the strongest possible negative relationship, these coefficients can be transformed into dissimilarities, $d_{ij}$, within the interval $[0, 1]$
|
||||
|
||||
The use of correlation coefficients in this context is far more contentious than its noncontroversial role in assessing the linear relationship between two variables based on $n$ observations.
|
||||
|
||||
When correlations between two individuals are used to quantify their similarity the <u>rows of the data matrix are standardized</u>, not its columns.
|
||||
|
||||
**Disadvantages**
|
||||
|
||||
When variables are measured on different scales the notion of a difference between variable values and consequently that of a mean variable value or a variance is meaningless.
|
||||
|
||||
In addition, the correlation coefficient is unable to measure the difference in size between two observations.
|
||||
|
||||
**Advantages**
|
||||
|
||||
However, the use of a correlation coefficient can be justified for situations where all of the variables have been measured on the same scale and precise values taken are important only to the extent that they provide information about the subject's relative profile
|
||||
|
||||
<u>Example:</u> In classifying animals or plants, the absolute size of the organisms or their parts are often less important than their shapes. In such studies the investigator requires a dissimilarity coefficient that takes the value zero if and only if two individuals' profiles are multiples of each other. The angular separation dissimilarity measure has this property.
|
||||
|
||||
**Further considerations**
|
||||
|
||||
The Pearson correlation is sensitive to outliers. This has prompted a number of suggestions for modifying correlation coefficients when used as similarity measures; for example, robust versions of correlation coefficients such as *jackknife correlation* or altogether more general association coefficients such as *mutual information distance measure*
|
||||
|
||||
#### Mahalanobis (Maximum) Distance [Not between 2 observations]
|
||||
|
||||
Mahalanobis distance is a measure of distance between a point P and a distribution D. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D
|
||||
|
||||
Mahalanobis distance is unitless and scale-invariant and takes into account the correlations of the data set
|
||||
$$
|
||||
D(\vec{x}) = \sqrt{(\vec{x} - \vec{\mu})^T S^{-1}(\vec{x}-\vec{\mu})}
|
||||
$$
|
||||
Where $\vec{\mu}$ is the vector of variable means and $S$ is the covariance matrix
|
||||
|
||||
If the covariance matrix is diagonal then the resulting distance measure is called a normalized Euclidean distance.
|
||||
$$
|
||||
d(\vec{x}, \vec{y}) = \sqrt{\sum_{i = 1}^N{\frac{(x_i - y_i)^2}{s^2_i}}}
|
||||
$$
|
||||
Where $s_i$ is the standard deviation of the $x_i$ and $y_i$ over the sample set
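Base R provides `mahalanobis()` for this computation; note that it returns the squared distance $D^2$. A small sketch using the built-in `iris` measurements as example data:

```R
X <- as.matrix(iris[, 1:4])

center <- colMeans(X)   # vector of means (mu)
S <- cov(X)             # covariance matrix

d2 <- mahalanobis(X, center = center, cov = S)  # squared distances D^2
head(sqrt(d2))                                  # Mahalanobis distances D
```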
|
||||
|
||||
#### Discrete Metric
|
||||
|
||||
This metric describes whether or not two observations are equivalent
|
||||
$$
|
||||
\rho(x, y) = \begin{cases}
|
||||
1 & x \not= y \\
|
||||
0 & x = y
|
||||
\end{cases}
|
||||
$$
|
||||
|
||||
## Similarity Measures for Data Containing both Continuous and Categorical Variables
|
||||
|
||||
There are a number of approaches to constructing proximities for mixed-mode data, that is, data in which some variables are continuous and some categorical.
|
||||
|
||||
1. Dichotomize all variables and use a similarity measure for binary data
|
||||
2. Rescale all the variables so that they are on the same scale by replacing variable values by their ranks among the objects and then using a measure for continuous data
|
||||
3. Construct a dissimilarity measure for each type of variable and combine these, either with or without differential weighting into a single coefficient.
|
||||
|
||||
Most general-purpose statistical software implements a number of measures for converting a two-mode data matrix into a one-mode dissimilarity matrix.
|
||||
|
||||
In R, the `cluster`, `clusterSim`, and `proxy` packages provide such measures.
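For example, `cluster::daisy()` with Gower's coefficient is one commonly used way to turn mixed-mode data into a dissimilarity matrix; a minimal sketch with made-up data:

```R
library(cluster)

# Mixed-mode data: one continuous and one categorical variable
df <- data.frame(
  height = c(1.60, 1.82, 1.75, 1.58),
  eyes   = factor(c("blue", "brown", "blue", "green"))
)

# Gower's coefficient combines a per-variable dissimilarity for each type
daisy(df, metric = "gower")
```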
|
||||
|
||||
### Proximity Measures for Structured Data
|
||||
|
||||
We'll be looking at data that consists of repeated measures of the same outcome variable but under different conditions.
|
||||
|
||||
The simplest and perhaps most commonly used approach to exploiting the reference variable is in the construction of a reduced set of relevant summaries per object which are then used as the basis for defining object similarity.
|
||||
|
||||
Here we will look at some approaches for choosing summary measures and resulting proximity measures for the most frequently encountered reference vectors (e.g. time, experimental condition, and underlying factor)
|
||||
|
||||
Structured data arise when the variables can be assumed to follow a known *factor model*. Under *confirmatory factor analysis model* each variable or item can be allocated to one of a set of underlying factors or concepts. The factors cannot be observed directly but are 'indicated' by a number of items that are all measured on the same scale.
|
||||
|
||||
Note that the summary approach, while typically used with continuous variables, is not limited to variables on an interval scale. The same principles can be applied to dealing with categorical data. The difference is that summary measures now need to capture relevant aspects of the distribution of categorical variables over repeated measures.
|
||||
|
||||
Rows of **$X$** which represent ordered lists of elements, that is all the variables provide a categorical outcome and these variables can be aligned in one dimension, are more generally referred to as *sequences*. *Sequence analysis* is an area of research that centers on problems of events and actions in their temporal context and includes the measurements of similarities between sequences.
|
||||
|
||||
The most popular measure of dissimilarity between two sequences is the Levenshtein distance, which counts the minimum number of operations needed to transform one sequence of categories into another, where an operation is an insertion, a deletion, or a substitution of a single category. Each operation may be assigned a penalty weight (a typical choice would be to give double the penalty to a substitution as opposed to an insertion or deletion). The measure is sometimes called the 'edit distance' due to its application in spell checkers.
|
||||
|
||||
Optimal matching algorithms (OMAs) need to be employed to find the minimum number of operations required to match one sequence to another. One such algorithm for aligning sequences is the Needleman-Wunsch algorithm, which is commonly used in bioinformatics to align proteins.
|
||||
|
||||
The *Jaro similarity measure* is a related measure of similarity between sequences of categories, often used to delete duplicates in the area of record linkage.
|
||||
|
||||
## Inter-group Proximity Measures
|
||||
|
||||
In clustering applications, it becomes necessary to consider how to measure the proximity between groups of individuals.
|
||||
|
||||
1. The proximity between two groups might be defined by a suitable summary of the proximities between individuals from either group
|
||||
2. Each group might be described by a representative observation by choosing a suitable summary statistic for each variable, and the inter group proximity defined as the proximity between the representative observations.
|
||||
|
||||
### Inter-group Proximity Derived from the Proximity Matrix
|
||||
|
||||
For deriving inter-group proximities from a matrix of inter-individual proximities, there are a variety of possibilities
|
||||
|
||||
- Take the smallest dissimilarity between any two individuals, one from each group. This is referred to as *nearest-neighbor distance* and is the basis of the clustering technique known as *single linkage*
|
||||
- Define the inter-group distance as the largest distance between any two individuals, one from each group. This is known as the *furthest-neighbour distance* and constitutes the basis of the *complete linkage* clustering method.
|
||||
- Define the inter-group distance as the average dissimilarity between individuals from the two groups. Such a measure is used in *group average clustering*.
|
||||
|
||||
### Inter-group Proximity Based on Group Summaries for Continuous Data
|
||||
|
||||
One method for constructing inter-group dissimilarity measures for continuous data is to simply substitute group means (also known as the centroid) for the variable values in the formulae for inter-individual measures
|
||||
|
||||
More appropriate, however, might be measures which incorporate in one way or another, knowledge of within-group variation. One possibility is to use Mahallanobis's generalized distance.
|
||||
|
||||
#### Mahalanobis (Maximum) Distance
|
||||
|
||||
Mahalanobis distance is a measure of distance between a point P and a distribution D. It is a multi-dimensional generalization of the idea of measuring how many standard deviations away P is from the mean of D
|
||||
|
||||
Mahalanobis distance is unitless and scale-invariant and takes into account the correlations of the data set
|
||||
$$
|
||||
D(\vec{x}) = \sqrt{(\vec{x} - \vec{\mu})^T S^{-1}(\vec{x}-\vec{\mu})}
|
||||
$$
|
||||
Where $\vec{\mu}$ is the vector of variable means and $S$ is the covariance matrix
|
||||
|
||||
If the covariance matrix is diagonal then the resulting distance measure is called a normalized Euclidean distance.
|
||||
$$
|
||||
d(\vec{x}, \vec{y}) = \sqrt{\sum_{i = 1}^N{\frac{(x_i - y_i)^2}{s^2_i}}}
|
||||
$$
|
||||
Where $s_i$ is the standard deviation of the $x_i$ and $y_i$ over the sample set
|
||||
|
||||
Thus, the Mahalanobis distance increases with increasing distances between the two group centers and with decreasing within-group variation.
|
||||
|
||||
By also employing within-group correlations, the Mahalanobis distance takes into account the possibly non-spherical shapes of the groups.
|
||||
|
||||
The use of Mahalanobis implies that the investigator is willing to **assume** that the covariance matrices are at least approximately the same in the two groups. When this is not so, this measure is an inappropriate inter-group measure. Other alternatives exist such as the one proposed by Anderson and Bahadur
|
||||
|
||||
<img src="http://proquest.safaribooksonline.com.ezproxy.umw.edu/getfile?item=cjlhZWEzNDg0N2R0cGMvaS9zMG1nODk0czcvN3MwczM3L2UwLXMzL2VpL3RtYTBjMGdzY2QwLmkxLWdtaWY-" alt="equation">
|
||||
|
||||
Another alternative is the *normal information radius* suggested by Jardine and Sibson
|
||||
|
||||
<img src="http://proquest.safaribooksonline.com.ezproxy.umw.edu/getfile?item=cjlhZWEzNDg0N2R0cGMvaS9zMG1nODk0czcvN3MwczM4L2UwLXMzL2VpL3RtYTBjMGdzY2QwLmkxLWdtaWY-" alt="equation">
|
||||
|
||||
### Inter-group Proximity Based on Group Summaries for Categorical Data
|
||||
|
||||
Approaches for measuring inter-group dissimilarities between groups of individuals for which categorical variables have been observed have been considered by a number of authors. Balakrishnan and Sanghvi (1968), for example, proposed a dissimilarity index of the form
|
||||
|
||||

|
||||
|
||||
where $p_{Akl}$ and $p_{Bkl}$ are the proportions of the $l$th category of the $k$th variable in groups A and B respectively, $c_k + 1$ is the number of categories for the $k$th variable, and $p$ is the number of variables.
|
||||
|
||||
Kurczynski (1969) suggested adapting the generalized Mahalanobis distance, with categorical variables replacing quantitative variables. In its most general form, this measure for inter-group distance is given by
|
||||
|
||||

|
||||
|
||||
where  contains sample proportions in group A and  is defined in a similar manner, and  is the m × m common sample covariance matrix, where .
|
||||
|
||||
## Weighting Variables
|
||||
|
||||
To weight a variable means to give it greater or lesser importance than other variables in determining the proximity between two objects.
|
||||
|
||||
The question is 'How should the weights be chosen?' Before we discuss this question, it is important to realize that the selection of variables for inclusion into the study already presents a form of weighting, since the variables not included are effectively being given the weight $0$.
|
||||
|
||||
The weights chosen for the variables reflect the importance that the investigator assigns to the variables for the classification task.
|
||||
|
||||
There are several approaches to this
|
||||
|
||||
- Obtain perceived dissimilarities between selected objects, then model these dissimilarities using the underlying variables and weights that indicate their relative importance. The weights that best fit the perceived dissimilarities are then chosen.
|
||||
- Define the weights to be inversely proportional to some measure of variability in the variable. This choice of weights implies that the importance of a variable decreases when its variability increases.
|
||||
- For a continuous variable, the most commonly employed weight is either the reciprocal of its standard deviation or the reciprocal of its range
|
||||
- Employing variability weights is equivalent to what is commonly referred to as *standardizing* the variables.
|
||||
- Construct weights from the data matrix using *variable selection*. In essence, such procedures proceed in an iterative fashion to identify variables which, when contributing to a cluster algorithm, lead to internally cohesive and externally isolated clusters and, when clustered singly, produce reasonable agreement.
|
||||
|
||||
The second approach assumes the importance of a variable to be inversely proportional to its total variability. The total variability of a variable comprises variation both within and between any groups which may exist within the set of individuals. The aim of cluster analysis is typically to identify such groups. Hence it can be argued that the importance of a variable should not be reduced because of between-group variation (on the contrary, one might wish to assign more importance to a variable that shows larger between-group variation).
|
||||
|
||||
Gnanadesikan et al. (1995) assessed the ability of squared distance functions based on data-determined weights, both those described above and others, to recover groups in eight simulated and real continuous data sets in a subsequent cluster analysis. Their main findings were:
|
||||
|
||||
1. Equal weights, (total) standard deviation weights, and range weights were generally ineffective, but range weights were preferable to standard deviation weights.
|
||||
2. Weighting based on estimates of within-cluster variability worked well overall.
|
||||
3. Weighting aimed at emphasizing variables with the most potential for identifying clusters did enhance clustering when some variables had a strong cluster structure.
|
||||
4. Weighting to optimize the fitting of a hierarchical tree was often even less effective than equal weighting or weighting based on (total) standard deviations.
|
||||
5. Forward variable selection was often among the better performers. (Note that all-subsets variable selection was not assessed at the time.)
|
||||
|
||||
## Standardization
|
||||
|
||||
In many clustering applications, the variables describing the objects to be clustered will not be measured in the same units. A number of variability measures have been used for this purpose
|
||||
|
||||
- When standard deviations calculated from the complete set of objects to be clustered are used, the technique is often referred to as *auto-scaling, standard scoring, or z-scoring*.
|
||||
- Division by the median absolute deviations or by the ranges.
|
||||
|
||||
The second is shown to outperform auto-scaling in many clustering applications. As pointed out in the previous section, standardization of variables to unit variance can be viewed as a special case of weighting.
|
||||
|
||||
## Choice of Proximity Measure
|
||||
|
||||
Firstly, the nature of the data should strongly influence the choice of the proximity measure.
|
||||
|
||||
Next, the choice of measure should depend on the scale of the data. Similarity coefficients should be used when the data is binary. For continuous data, a distance or correlation-type dissimilarity measure should be used according to whether the 'size' or the 'shape' of the objects is of interest.
|
||||
|
||||
Finally, the clustering method to be used might have some implications for the choice of the coefficient. For example, making a choice between several proximity coefficients with similar properties which are also known to be monotonically related can be avoided by employing a cluster method that depends only on the ranking of the proximities, not their absolute values.
|
||||
46
content/research/clusteranalysis/notes/lec10-1.md
Normal file
|
|
@ -0,0 +1,46 @@
|
|||
# Silhouette
|
||||
|
||||
This technique validates the consistency within clusters of data. It provides a succinct graphical representation of how well each object lies in its cluster.
|
||||
|
||||
The silhouette ranges from -1 to 1, where a high value indicates that the object is consistent within its own cluster and poorly matched to neighboring clusters.
|
||||
|
||||
A low or negative silhouette value can mean that the current clustering configuration has too many or too few clusters.
|
||||
|
||||
## Definition
|
||||
|
||||
For each datum $i$, let $a(i)$ be the average distance of $i$ with all other data within the same cluster.
|
||||
|
||||
$a(i)$ can be interpreted as how well $i$ is assigned to its cluster. (lower values mean better agreement)
|
||||
|
||||
We can then define the average dissimilarity of point $i$ to a cluster $c$ as the average distance from $i$ to all points in $c$.
|
||||
|
||||
Let $b(i)$ be the lowest average distance of $i$ to all other points in any other cluster in which i is not already a member.
|
||||
|
||||
The cluster with this lowest average dissimilarity is said to be the neighboring cluster of $i$. From here we can define a silhouette:
|
||||
$$
|
||||
s(i) = \frac{b(i) - a(i)}{max\{a(i), b(i)\}}
|
||||
$$
|
||||
The average $s(i)$ over all data of a cluster is a measure of how tightly grouped all the data in the cluster are. A silhouette plot may be used to visualize the agreement between each of the data and its cluster.
|
||||
|
||||

|
||||
|
||||
### Properties
|
||||
|
||||
Recall that $a(i)$ is a measure of how dissimilar $i$ is to its own cluster; a smaller value means it is in better agreement with its cluster. For $s(i)$ to be close to 1, we require $a(i) \ll b(i)$.
|
||||
|
||||
If $s(i)$ is close to negative one, then by the same logic we can see that $i$ would be more appropriate if it was clustered in its neighboring cluster.
|
||||
|
||||
$s(i)$ near zero means that the datum is on the border of two natural clusters.
|
||||
|
||||
## Determining the number of Clusters
|
||||
|
||||
This can also be used to help determine the number of clusters in a dataset. The ideal number of clusters is one that produces the highest average silhouette value.
|
||||
|
||||
A good indication that one has too many clusters is when there are clusters in which the majority of observations fall below the mean silhouette value.
|
||||
|
||||
https://kapilddatascience.wordpress.com/2015/11/10/using-silhouette-analysis-for-selecting-the-number-of-cluster-for-k-means-clustering/
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
54
content/research/clusteranalysis/notes/lec10-2.md
Normal file
|
|
@ -0,0 +1,54 @@
|
|||
# Centroid-based Clustering
|
||||
|
||||
In centroid-based clustering, clusters are represented by some central vector which may or may not be a member of the dataset. In practice, the number of clusters is fixed to $k$ and the goal is to solve some sort of optimization problem.
|
||||
|
||||
The similarity of two clusters is defined as the similarity of their centroids.
|
||||
|
||||
This problem is computationally difficult so there are efficient heuristic algorithms that are commonly employed. These usually converge quickly to a local optimum.
|
||||
|
||||
## K-means clustering
|
||||
|
||||
This aims to partition $n$ observations into $k$ clusters in which each observation belongs to the cluster with the nearest mean which serves as the centroid of the cluster.
|
||||
|
||||
This technique results in partitioning the data space into Voronoi cells.
|
||||
|
||||
### Description
|
||||
|
||||
Given a set of observations $x$, k-means clustering aims to partition the $n$ observations into $k$ sets $S$ so as to minimize the within-cluster sum of squares (i.e. variance). More formally, the objective is to find
|
||||
$$
|
||||
argmin_s{\sum_{i = 1}^k{\sum_{x \in S_i}{||x-\mu_i||^2}}}= argmin_{s}{\sum_{i = 1}^k{|S_i|Var(S_i)}}
|
||||
$$
|
||||
where $\mu_i$ is the mean of points in $S_i$. This is equivalent to minimizing the pairwise squared deviations of points in the same cluster
|
||||
$$
|
||||
argmin_s{\sum_{i = 1}^k{\frac{1}{2|S_i|}\sum_{x, y \in S_i}{||x-y||^2}}}
|
||||
$$
|
||||
|
||||
### Algorithm
|
||||
|
||||
Given an initial set of $k$ means, the algorithm proceeds by alternating between two steps.
|
||||
|
||||
**Assignment step**: Assign each observation to the cluster whose mean has the least squared Euclidean distance.
|
||||
|
||||
- Intuitively this is finding the nearest mean
|
||||
- Mathematically this means partitioning the observations according to the Voronoi diagram generated by the means
|
||||
|
||||
**Update Step**: Calculate the new means to be the centroids of the observations in the new clusters
|
||||
|
||||
The algorithm is known to have converged when assignments no longer change. There is no guarantee that the optimum is found using this algorithm.
|
||||
|
||||
The result depends on the initial clusters. It is common to run this multiple times with different starting conditions.
|
||||
|
||||
Using a different distance function other than the squared Euclidean distance may stop the algorithm from converging.
|
||||
|
||||
### Initialization methods
|
||||
|
||||
Commonly used initialization methods are Forgy and Random Partition.
|
||||
|
||||
**Forgy Method**: This method randomly chooses $k$ observations from the data set and uses these as the initial means
|
||||
|
||||
This method is known to spread the initial means out
|
||||
|
||||
**Random Partition Method**: This method first randomly assigns a cluster to each observation and then proceeds to the update step.
|
||||
|
||||
This method is known to place most of the means close to the center of the dataset.
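A minimal sketch of running k-means in R with base `kmeans()`; the data set and the choice of `nstart = 25` restarts are only for illustration.

```R
X <- scale(iris[, 1:4])

# nstart re-runs the algorithm from several random initializations and
# keeps the solution with the smallest within-cluster sum of squares
km <- kmeans(X, centers = 3, nstart = 25)

km$centers        # the cluster centroids
km$tot.withinss   # objective value: total within-cluster sum of squares
table(km$cluster) # cluster sizes
```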
|
||||
|
||||
18
content/research/clusteranalysis/notes/lec10-3.md
Normal file
|
|
@ -0,0 +1,18 @@
|
|||
# Voronoi Diagram
|
||||
|
||||
A Voronoi diagram is a partitioning of a plane into regions based on distance to points in a specific subset of the plane.
|
||||
|
||||
The set of points (often called seeds, sites, or generators) is specified beforehand, and for each seed there is a corresponding region consisting of all points closer to that seed than any other.
|
||||
|
||||
Different metrics may be used and often result in different Voronoi diagrams
|
||||
|
||||
**Euclidean**
|
||||
|
||||

|
||||
|
||||
**Manhattan**
|
||||
|
||||

|
||||
|
||||
|
||||
|
||||
18
content/research/clusteranalysis/notes/lec11-1.md
Normal file
|
|
@ -0,0 +1,18 @@
|
|||
# K-means++
|
||||
|
||||
K-means++ is an algorithm for choosing the initial values or seeds for the k-means clustering algorithm. This was proposed as a way of avoiding the sometimes poor clustering found by a standard k-means algorithm.
|
||||
|
||||
## Intuition
|
||||
|
||||
The intuition behind this approach involves spreading out the $k$ initial cluster centers. The first cluster center is chosen uniformly at random from the data points that are being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance from the point's closest existing cluster center.
|
||||
|
||||
## Algorithm
|
||||
|
||||
The exact algorithm is as follows (a small R sketch is given after the list):
|
||||
|
||||
1. Choose one center uniformly at random from among data points
|
||||
2. For each data point $x$, compute $D(x)$, the distance between $x$ and the nearest center that has already been chosen.
|
||||
3. Choose one new data point at random as a new center, using a weighted probability distribution where a point $x$ is chosen with probability proportional to $D(x)^2$
|
||||
4. Repeat steps 2 and 3 until $k$ centers have been chosen
|
||||
5. Now that the initial centers have been chosen, proceed using standard k-means clustering
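A minimal R sketch of this seeding procedure; the function name `kmeanspp_init` is made up, and base `kmeans()` is then started from the chosen seeds.

```R
# X: numeric matrix of observations (rows); k: number of centers
kmeanspp_init <- function(X, k) {
  n <- nrow(X)
  centers <- integer(k)
  centers[1] <- sample(n, 1)                       # step 1: uniform choice

  for (j in 2:k) {
    # D(x)^2: squared distance to the nearest already-chosen center
    chosen <- X[centers[1:(j - 1)], , drop = FALSE]
    D2 <- apply(X, 1, function(x) min(colSums((t(chosen) - x)^2)))
    centers[j] <- sample(n, 1, prob = D2)          # steps 2 and 3
  }

  X[centers, , drop = FALSE]
}

# Step 5: run standard k-means from the k-means++ seeds
X <- as.matrix(iris[, 1:4])
km <- kmeans(X, centers = kmeanspp_init(X, k = 3))
```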
|
||||
|
||||
52
content/research/clusteranalysis/notes/lec11-2.md
Normal file
|
|
@ -0,0 +1,52 @@
|
|||
# K-Medoids
|
||||
|
||||
A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal.
|
||||
|
||||
The k-medoids algorithm is related to k-means and the medoidshift algorithm. Both the k-means and k-medoids algorithms are partitional, and both attempt to minimize the distance between the points in a cluster and its center. In contrast to k-means, k-medoids chooses data points as centers and typically uses the Manhattan norm to define the distance between data points instead of the Euclidean norm.
|
||||
|
||||
This method is known to be more robust to noise and outliers compared to k-means since it minimizes the sum of pairwise dissimilarities instead of the sum of squared Euclidean distances.
|
||||
|
||||
## Algorithms
|
||||
|
||||
Several algorithms have been created as optimizations over an exhaustive search. In this section, we'll discuss PAM, the Voronoi iteration method, and CLARA.
|
||||
|
||||
### Partitioning Around Medoids (PAM)
|
||||
|
||||
1. Select $k$ of the $n$ data points as medoids
|
||||
2. Associate each data point to the closest medoid
|
||||
3. While the cost of the configuration decreases:
|
||||
1. For each medoid $m$, for each non-medoid data point $o$:
|
||||
1. Swap $m$ and $o$, recompute the cost (sum of distances of points to their medoid)
|
||||
2. If the total cost of the configuration increased in the previous step, undo the swap
|
||||
|
||||
|
||||
|
||||
### Voronoi Iteration Method
|
||||
|
||||
1. Select $k$ of the $n$ data points as medoids
|
||||
2. While the cost of the configuration decreases
|
||||
1. In each cluster, make the point that minimizes the sum of distances within the cluster the medoid
|
||||
2. Reassign each point to the cluster defined by the closest medoid determined in the previous step.
|
||||
|
||||
|
||||
|
||||
### Clustering Large Applications (CLARA)
|
||||
|
||||
This is a variant of the PAM algorithm that relies on the sampling approach to handle large datasets. The cost of a particular cluster configuration is the mean cost of all the dissimilarities.
|
||||
|
||||
|
||||
|
||||
## R Implementations
|
||||
|
||||
Both PAM and CLARA are defined in the `cluster` package in R.
|
||||
|
||||
```R
|
||||
clara(x, k, metric = "euclidean", stand = FALSE, samples = 5,
|
||||
sampsize = min(n, 40 + 2 * k), trace = 0, medoids.x = TRUE,
|
||||
keep.data = medoids.x, rngR = FALSE)
|
||||
```
|
||||
|
||||
```R
|
||||
pam(x, k, metric = "euclidean", stand = FALSE)
|
||||
```
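A small usage sketch (the data set and the choice of $k$ are arbitrary):

```R
library(cluster)

X <- scale(iris[, 1:4])

pam_fit <- pam(X, k = 3)
pam_fit$medoids      # the medoid observations chosen by PAM

clara_fit <- clara(X, k = 3, samples = 50)
clara_fit$medoids    # medoids estimated from repeated samples
```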
|
||||
|
||||
19
content/research/clusteranalysis/notes/lec11-3.md
Normal file
|
|
@ -0,0 +1,19 @@
|
|||
# K-Medians
|
||||
|
||||
This is a variation of k-means clustering where instead of calculating the mean for each cluster to determine its centroid we are going to calculate the median instead.
|
||||
|
||||
This has the effect of minimizing error over all the clusters with respect to the Manhattan norm as opposed to the Euclidean squared norm which is minimized in K-means
|
||||
|
||||
### Algorithm
|
||||
|
||||
Given an initial set of $k$ medians, the algorithm proceeds by alternating between two steps.
|
||||
|
||||
**Assignment step**: Assign each observation to the cluster whose median has the least Manhattan distance.
|
||||
|
||||
- Intuitively this is finding the nearest median
|
||||
|
||||
**Update Step**: Calculate the new medians as the coordinate-wise medians of the observations in the new clusters
|
||||
|
||||
The algorithm is known to have converged when assignments no longer change. There is no guarantee that the optimum is found using this algorithm.
|
||||
|
||||
The result depends on the initial clusters. It is common to run this multiple times with different starting conditions.
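The two alternating steps can be sketched directly in R; the function below (`kmedians`, a made-up name) is a minimal illustration under these definitions, not a production implementation.

```R
kmedians <- function(X, k, iter_max = 25) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]   # random initial medians

  for (iter in 1:iter_max) {
    # Assignment step: nearest center under the Manhattan distance
    assign <- apply(X, 1, function(x) {
      which.min(rowSums(abs(sweep(centers, 2, x))))
    })

    # Update step: coordinate-wise median of each (non-empty) cluster
    new_centers <- centers
    for (j in 1:k) {
      members <- X[assign == j, , drop = FALSE]
      if (nrow(members) > 0) new_centers[j, ] <- apply(members, 2, median)
    }

    if (all(new_centers == centers)) break   # medians no longer change
    centers <- new_centers
  }

  list(centers = centers, cluster = assign)
}

fit <- kmedians(iris[, 1:4], k = 3)
table(fit$cluster)
```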
|
||||
56
content/research/clusteranalysis/notes/lec12.md
Normal file
|
|
@ -0,0 +1,56 @@
|
|||
# Introduction to Density Based Clustering
|
||||
|
||||
In density-based clustering, clusters are defined as areas of higher density than the remainder of the data sets. Objects in more sparse areas are considered to be outliers or border points. This helps discover clusters of arbitrary shape.
|
||||
|
||||
## DBSCAN
|
||||
|
||||
Given a set of points in space, it groups together points that are closely packed together while marking points that lie alone in low-density regions as outliers.
|
||||
|
||||
### Preliminary Information
|
||||
|
||||
- A point $p$ is a core point if at least $k$ points (often referred to as minPts) are within $\epsilon$ of it. Those points are said to be *directly reachable* from $p$.
|
||||
- A point $q$ is directly reachable from $p$ if point $q$ is within distance $\epsilon$ from point $p$ and $p$ must be a core point
|
||||
- A point $q$ is reachable from $p$ if there is a path $p_1, \dots, p_n$ with $p_1 = p$ and $p_n = q$ where each $p_{i + 1}$ is directly reachable from $p_i$. (All points on the path must be core points, with the possible exception of $q$)
|
||||
- All points not reachable from any other points are outliers
|
||||
|
||||
Non core points can be part of a cluster, but they form its "edge", since they cannot be used to reach more points.
|
||||
|
||||
Reachability is not a symmetric relation since, by definition, no point may be reachable from a non-core point, regardless of distance.
|
||||
|
||||
Two points $p$ and $q$ are density-connected if there is a point $o$ such that both $p$ and $q$ are reachable from $o$. Density-connectedness is symmetric.
|
||||
|
||||
A cluster then satisfies two properties:
|
||||
|
||||
1. All points within the cluster are mutually density-connected
|
||||
2. If a point is density-reachable from any point of the cluster, it is part of the cluster as well.
|
||||
|
||||
|
||||
### Algorithm
|
||||
|
||||
1. Find the $\epsilon$ neighbors of every point, and identify the core points with more than $k$ neighbors.
|
||||
2. Find the connected components of *core* points on the neighborhood graph, ignoring all non-core points.
|
||||
3. Assign each non-core point to a nearby cluster if the cluster is an $\epsilon$ (eps) neighbor, otherwise assign it to noise.
|
||||
|
||||
### Advantages
|
||||
|
||||
- Does not require one to specify the number of clusters in the data
|
||||
- Can find arbitrarily shaped clusters
|
||||
- Has a notion of noise and is robust to outliers
|
||||
|
||||
### Disadvantages
|
||||
|
||||
- Not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster.
|
||||
- The quality of DBSCAN depends on the distance measure used.
|
||||
- Cannot cluster data sets well with large differences in densities.
|
||||
|
||||
### Rules of Thumb for Parameters
|
||||
|
||||
$k$: $k$ must be larger than $(D + 1)$ where $D$ is the number of dimensions in the dataset. Normally $k$ is chosen to be twice the number of dimensions.
|
||||
|
||||
$\epsilon$: Ideally, points within a cluster have their $k^{th}$ nearest neighbor at roughly the same distance. Plot the sorted distance of every point to its $k^{th}$ nearest neighbor and look for an elbow.
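The third-party `dbscan` package (assumed here) implements both the algorithm and the $k$-distance plot used by this rule of thumb; the `iris` data and the candidate `eps = 0.5` are only illustrative.

```R
# install.packages("dbscan")
library(dbscan)

X <- as.matrix(iris[, 1:4])

# Sorted distance of every point to its k-th nearest neighbor;
# place eps near the "elbow" of this curve
kNNdistplot(X, k = 4)
abline(h = 0.5, lty = 2)   # candidate eps read off the plot

fit <- dbscan(X, eps = 0.5, minPts = 4)
fit$cluster                # 0 marks noise points
```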
|
||||
|
||||
|
||||
|
||||
Example of a run-through:
|
||||
|
||||
https://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf
|
||||
34
content/research/clusteranalysis/notes/lec2-1.md
Normal file
|
|
@ -0,0 +1,34 @@
|
|||
# Why use different distance measures?
|
||||
|
||||
I made an attempt to find out in what situations people use different distance measures. Looking around on the Internet usually produces answers like "It depends on the problem" or "I typically just always use Euclidean".
|
||||
|
||||
Which, as you might imagine, isn't a terribly useful answer, since it doesn't give me any examples of which types of problems different distances solve.
|
||||
|
||||
Therefore, let's think about it in a different way. What properties do different distance measures have that make them desirable?
|
||||
|
||||
## Manhattan Advantages
|
||||
|
||||
- The gradient of this function has a constant magnitude. There's no power in the formula
|
||||
- Unusual values affect distances on Euclidean more since the difference is squared
|
||||
|
||||
https://datascience.stackexchange.com/questions/20075/when-would-one-use-manhattan-distance-as-opposite-to-euclidean-distance
|
||||
|
||||
|
||||
|
||||
## Mahalanobis Advantages
|
||||
|
||||
Variables can be on different scales. The Mahalanobis formula has a built in variance-covariance matrix which allows you to rescale your variables to make distances of different variables more comparable.
|
||||
|
||||
https://stats.stackexchange.com/questions/50949/why-use-the-mahalanobis-distance#50956
|
||||
|
||||
|
||||
|
||||
## Euclidean Disadvantages
|
||||
|
||||
In higher dimensions, the points essentially become uniformly distant from one another. This is a problem observed in most distance metrics but it's more obvious with the Euclidean one.
|
||||
|
||||
https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions/
|
||||
|
||||
|
||||
|
||||
Hopefully in this course, we'll discover more properties as to why it makes sense to use different distance measures, since it can have an impact on how our clusters are formed.
|
||||
53
content/research/clusteranalysis/notes/lec2-2.md
Normal file
|
|
@ -0,0 +1,53 @@
|
|||
# Principal Component Analysis Pt. 1
|
||||
|
||||
## What is PCA?
|
||||
|
||||
Principal component analysis is a statistical procedure that performs an orthogonal transformation to convert a set of variables into a set of linearly uncorrelated variables called principal components.
|
||||
|
||||
The number of distinct principal components equals $\min(\# \text{Variables}, \# \text{Observations} - 1)$.
|
||||
|
||||
The transformation is defined in such a way that the first principal component explains the largest possible variance in the data.
|
||||
|
||||
Each succeeding component has the highest possible variance under the constraint of having to be orthogonal to the preceding components.
|
||||
|
||||
PCA is sensitive to the relative scaling of the original variables.
|
||||
|
||||
### Results of a PCA
|
||||
|
||||
Results are discussed in terms of *component scores*, which are the transformed variables, and *loadings*, the weights by which each original variable should be multiplied to get the component score.
|
||||
|
||||
## Assumptions of PCA
|
||||
|
||||
1. Linearity
|
||||
2. Large variances are important and small variances denote noise
|
||||
3. Principal components are orthogonal
|
||||
|
||||
## Why perform PCA?
|
||||
|
||||
- Distance measures perform poorly in high-dimensional space (https://stats.stackexchange.com/questions/256172/why-always-doing-dimensionality-reduction-before-clustering)
|
||||
- Helps eliminate noise from the dataset (https://www.quora.com/Does-it-make-sense-to-perform-principal-components-analysis-before-clustering-if-the-original-data-has-too-many-dimensions-Is-it-theoretically-unsound-to-try-to-cluster-data-with-no-correlation)
|
||||
- One initial cost to help reduce further computations
|
||||
|
||||
## Computing PCA
|
||||
|
||||
1. Subtract off the mean of each measurement type
|
||||
2. Compute the covariance matrix
|
||||
3. Take the eigenvalues/vectors of the covariance matrix
|
||||
|
||||
## R Code
|
||||
|
||||
```R
pcal = function(data) {
  # Center and scale each variable; with scaling, this amounts to
  # working with the correlation matrix
  centered_data = scale(data)
  covariance = cov(centered_data)

  # Eigendecomposition of the covariance/correlation matrix
  eigen_stuff = eigen(covariance)
  sorted_indices = sort(eigen_stuff$values,
                        index.return = T,
                        decreasing = T)$ix

  # Eigenvalues (component variances) and eigenvectors (one per column)
  loadings = eigen_stuff$values[sorted_indices]
  components = eigen_stuff$vectors[, sorted_indices]

  combined_list = list(loadings, components)
  names(combined_list) = c("Loadings", "Components")
  return(combined_list)
}
```
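For comparison, base R's `prcomp()` performs the same computation (via a singular value decomposition rather than an explicit eigendecomposition); a short sketch on a built-in data set:

```R
fit <- prcomp(USArrests, scale. = TRUE)  # scale. = TRUE: work from the correlation matrix

summary(fit)      # proportion of variance explained by each component
fit$rotation      # eigenvectors / loadings
head(fit$x)       # component scores
```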
|
||||
24
content/research/clusteranalysis/notes/lec4-2.md
Normal file
|
|
@ -0,0 +1,24 @@
|
|||
# Revisiting Similarity Measures
|
||||
|
||||
## Manhattan Distance
|
||||
|
||||
An additional use case for the Manhattan distance is when dealing with binary vectors. This approach, otherwise known as the Hamming distance, counts the number of bits that differ between two binary vectors.
|
||||
|
||||
## Ordinal Variables
|
||||
|
||||
Ordinal variables can be treated as if they were on an interval scale.
|
||||
|
||||
First, replace the ordinal variable value by its rank ($r_{if}$). Then map the range of each variable onto the interval $[0, 1]$ by replacing object $i$'s value on variable $f$ by
|
||||
$$
|
||||
z_{if} = \frac{r_{if} - 1}{M_f - 1}
|
||||
$$
|
||||
Where $M_f$ is the maximum rank.
|
||||
|
||||
### Example
|
||||
|
||||
Freshman = $0$ Sophomore = $\frac{1}{3}$ Junior = $\frac{2}{3}$ Senior = $1$
|
||||
|
||||
$d(freshman, senior) = 1$
|
||||
|
||||
$d(junior, senior) = \frac{1}{3}$
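A small R sketch of this rank transformation using the class-year example above:

```R
year <- c("Freshman", "Sophomore", "Junior", "Senior")
r <- seq_along(year)           # ranks r_if = 1, 2, 3, 4 (already in order)
M <- max(r)                    # maximum rank M_f

z <- (r - 1) / (M - 1)         # 0, 1/3, 2/3, 1

abs(z[1] - z[4])               # d(Freshman, Senior) = 1
abs(z[3] - z[4])               # d(Junior, Senior)  = 1/3
```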
|
||||
|
||||
40
content/research/clusteranalysis/notes/lec4-3.md
Normal file
|
|
@ -0,0 +1,40 @@
|
|||
# Cluster Tendency
|
||||
|
||||
This is the assessment of the suitability of clustering. Cluster Tendency determines whether the data has any inherent grouping structure.
|
||||
|
||||
This is a hard task since there are so many different definitions of clusters (partitioning, hierarchical, density, graph, etc.). Even after fixing a cluster type, it is still hard to define an appropriate null model for a data set.
|
||||
|
||||
One way we can go about measuring cluster tendency is to compare the data against random data. On average, random data should not contain clusters.
|
||||
|
||||
There are some clusterability assessment methods such as Spatial histogram, distance distribution and Hopkins statistic.
|
||||
|
||||
## Hopkins Statistic
|
||||
|
||||
Let $X$ be the set of $n$ data points in $d$ dimensional space. Consider a random sample (without replacement) of $m << n$ data points. Also generate a set $Y$ of $m$ uniformly randomly distributed data points.
|
||||
|
||||
Now define two distance measures: $u_i$ is the distance of $y_i \in Y$ from its nearest neighbor in $X$, and $w_i$ is the distance of the sampled $x_i \in X$ from its nearest neighbor in $X$ (excluding itself).
|
||||
|
||||
We can then define Hopkins statistic as
|
||||
$$
|
||||
H = \frac{\sum_{i = 1}^m{u_i^d}}{\sum_{i = 1}^m{u_i^d} + \sum_{i =1}^m{w_i^d}}
|
||||
$$
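A rough R sketch of this statistic as defined above; the function name `hopkins_stat` is made up, and the uniform points are drawn from the bounding box of $X$.

```R
# X: numeric matrix (n x d); m: number of sampled points, m << n
hopkins_stat <- function(X, m) {
  n <- nrow(X)
  d <- ncol(X)

  # m uniformly distributed points inside the bounding box of X
  Y <- sapply(1:d, function(j) runif(m, min(X[, j]), max(X[, j])))

  # u_i: distance of each y_i to its nearest neighbor in X
  u <- apply(Y, 1, function(y) min(sqrt(colSums((t(X) - y)^2))))

  # w_i: distance of each sampled x_i to its nearest other point in X
  idx <- sample(n, m)
  w <- sapply(idx, function(i) {
    min(sqrt(colSums((t(X[-i, , drop = FALSE]) - X[i, ])^2)))
  })

  sum(u^d) / (sum(u^d) + sum(w^d))
}

set.seed(1)
hopkins_stat(as.matrix(iris[, 1:4]), m = 15)  # clustered data: value near 1
```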
|
||||
|
||||
### Properties
|
||||
|
||||
With this definition, uniform random data should tend to have values near 0.5, and clustered data should tend to have values nearer to 1.
|
||||
|
||||
### Drawbacks
|
||||
|
||||
However, data containing a single Gaussian will also score close to one, since this statistic measures deviation from a uniform distribution. This makes it less useful in applications, as real data is usually not remotely uniform.
|
||||
|
||||
|
||||
|
||||
## Spatial Histogram Approach
|
||||
|
||||
For this method, I'm not too sure how this works, but here are some key points I found.
|
||||
|
||||
Divide each dimension in equal width bins, and count how many points lie in each of the bins and obtain the empirical joint probability mass function.
|
||||
|
||||
Do the same for the randomly sampled data
|
||||
|
||||
Finally, compute how much they differ using the Kullback-Leibler (KL) divergence. If they differ greatly, then we can say that the data is clusterable.
|
||||
171
content/research/clusteranalysis/notes/lec4.md
Normal file
|
|
@ -0,0 +1,171 @@
|
|||
# Principal Component Analysis Part 2: Formal Theory
|
||||
|
||||
## Properties of PCA
|
||||
|
||||
There are a number of ways to maximize the variance of a principal component. To create a unique solution, we should impose a constraint: the sum of the squares of the coefficients must equal 1. In vector notation this is the same as
|
||||
$$
|
||||
a_i^Ta_i = 1
|
||||
$$
|
||||
Every future principal component is said to be orthogonal to all the principal components previous to it.
|
||||
$$
|
||||
a_j^Ta_i = 0, i < j
|
||||
$$
|
||||
The total variance of the $q$ principal components will equal the total variance of the original variables
|
||||
$$
|
||||
\sum_{i = 1}^q {\lambda_i} = trace(S)
|
||||
$$
|
||||
Where $S$ is the sample covariance matrix.
|
||||
|
||||
The proportion of variation accounted for by each principal component is
|
||||
$$
|
||||
P_j = \frac{\lambda_j}{trace(S)}
|
||||
$$
|
||||
From this, we can generalize to the first $m$ principal components where $m < q$ and find the proportion $P^{(m)}$ of variation accounted for
|
||||
$$
|
||||
P^{(m)} = \frac{\sum_{i = 1}^m{\lambda_i}}{trace(S)}
|
||||
$$
|
||||
You can think of the first principal component as the line of best fit that minimizes the residuals orthogonal to it.
|
||||
|
||||
### What to watch out for
|
||||
|
||||
As a reminder to the last lecture, *PCA is not scale-invariant*. Therefore, transformations done to the dataset before PCA and after PCA often lead to different results and possibly conclusions.
|
||||
|
||||
Additionally, if there are large differences between the variances of the original variables, then those whose variances are largest will tend to dominate the early components.
|
||||
|
||||
Therefore, principal components should only be extracted from the sample covariance matrix when all of the original variables have roughly the **same scale**.
|
||||
|
||||
### Alternatives to using the Covariance Matrix
|
||||
|
||||
But it is rare in practice to have a scenario when all of the variables are of the same scale. Therefore, principal components are typically extracted from the **correlation matrix** $R$
|
||||
|
||||
Choosing to work with the correlation matrix rather than the covariance matrix treats the variables as all equally important when performing PCA.
|
||||
|
||||
## Example Derivation: Bivariate Data
|
||||
|
||||
Let $R$ be the correlation matrix
|
||||
$$
|
||||
R = \begin{pmatrix}
|
||||
1 & r \\
|
||||
r & 1
|
||||
\end{pmatrix}
|
||||
$$
|
||||
Let us find the eigenvectors and eigenvalues of the correlation matrix
|
||||
$$
|
||||
det(R - \lambda I) = 0
|
||||
$$
|
||||
|
||||
$$
|
||||
(1-\lambda)^2 - r^2 = 0
|
||||
$$
|
||||
|
||||
$$
|
||||
\lambda_1 = 1 + r, \lambda_2 = 1 - r
|
||||
$$
|
||||
|
||||
Let us remember to check the condition that the sum of the eigenvalues equals the trace of the correlation matrix:
|
||||
$$
|
||||
\lambda_1 + \lambda_2 = 1+r + (1 - r) = 2 = trace(R)
|
||||
$$
|
||||
|
||||
### Finding the First Eigenvector
|
||||
|
||||
Looking back at the characteristic equation
|
||||
$$
|
||||
Ra_1 = \lambda a_1
|
||||
$$
|
||||
We can get the following two formulas
|
||||
$$
|
||||
a_{11} + ra_{12} = (1+r)a_{11} \tag{1}
|
||||
$$
|
||||
|
||||
$$
|
||||
ra_{11} + a_{12} = (1 + r)a_{12} \tag{2}
|
||||
$$
|
||||
|
||||
Now let us find out what $a_{11}$ and $a_{12}$ equal. First let us solve for $a_{11}$ using equation $(1)$
|
||||
$$
|
||||
ra_{12} = (1+r)a_{11} - a_{11}
|
||||
$$
|
||||
|
||||
$$
|
||||
ra_{12} = a_{11}(1 + r - 1)
|
||||
$$
|
||||
|
||||
$$
|
||||
ra_{12} = ra_{11}
|
||||
$$
|
||||
|
||||
$$
|
||||
a_{12} = a_{11}
|
||||
$$
|
||||
|
||||
Where $r$ does not equal $0$.
|
||||
|
||||
Now we must apply the condition of sum squares
|
||||
$$
|
||||
a_1^Ta_1 = 1
|
||||
$$
|
||||
|
||||
$$
|
||||
a_{11}^2 + a_{12}^2 = 1
|
||||
$$
|
||||
|
||||
Recall that $a_{12} = a_{11}$
|
||||
$$
|
||||
2a_{11}^2 = 1
|
||||
$$
|
||||
|
||||
$$
|
||||
a_{11}^2 = \frac{1}{2}
|
||||
$$
|
||||
|
||||
$$
|
||||
a_{11} =\pm \frac{1}{\sqrt{2}}
|
||||
$$
|
||||
|
||||
For sake of choosing a value, let us take the principal root and say $a_{11} = \frac{1}{\sqrt{2}}$
|
||||
|
||||
### Finding the Second Eigenvector
|
||||
|
||||
Recall the fact that each subsequent eigenvector is orthogonal to the first. This means
|
||||
$$
|
||||
a_{11}a_{21} + a_{12}a_{22} = 0
|
||||
$$
|
||||
Substituting the values for $a_{11}$ and $a_{12}$ calculated in the previous section
|
||||
$$
|
||||
\frac{1}{\sqrt{2}}a_{21} + \frac{1}{\sqrt{2}}a_{22} = 0
|
||||
$$
|
||||
|
||||
$$
|
||||
a_{21} + a_{22} = 0
|
||||
$$
|
||||
|
||||
$$
|
||||
a_{21} = -a_{22}
|
||||
$$
|
||||
|
||||
Since this eigenvector also needs to satisfy the first condition, we get the following values
|
||||
$$
|
||||
a_{21} = \frac{1}{\sqrt{2}} , a_{22} = \frac{-1}{\sqrt{2}}
|
||||
$$
|
||||
|
||||
### Conclusion of Example
|
||||
|
||||
From this, we can say that the two principal components are given by
|
||||
$$
|
||||
y_1 = \frac{1}{\sqrt{2}}(x_1 + x_2), y_2 = \frac{1}{\sqrt{2}}(x_1-x_2)
|
||||
$$
|
||||
With the variance of the first principal component being given by $(1+r)$ and the second by $(1-r)$
|
||||
|
||||
Because of this, as $r$ increases, so does the variance explained by the first principal component. This, in turn, lowers the variance explained by the second principal component.
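
As a quick numerical sanity check of this derivation, here is a small sketch that computes the eigenvalues and eigenvectors of $R$ with NumPy (the value $r = 0.6$ is only an illustrative assumption):

```python
import numpy as np

r = 0.6  # illustrative correlation value, chosen arbitrarily for this example
R = np.array([[1.0, r],
              [r, 1.0]])

# Eigendecomposition of the symmetric correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(R)

print(eigenvalues)        # approximately [1 - r, 1 + r]
print(eigenvectors)       # columns proportional to (1, -1)/sqrt(2) and (1, 1)/sqrt(2)
print(eigenvalues.sum())  # equals trace(R) = 2
```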
|
||||
|
||||
## Choosing a Number of Principal Components
|
||||
|
||||
Principal Component Analysis is typically used in dimensionality reduction efforts. Therefore, there are several strategies for picking the right number of principal components to keep. Here are a few, with a short selection sketch after the list:
|
||||
|
||||
- Retain enough principal components to account for 70%-90% of the variation
|
||||
- Exclude principal components where eigenvalues are less than the average eigenvalue
|
||||
- Exclude principal components where eigenvalues are less than one.
|
||||
- Generate a Scree Plot
|
||||
- Stop when the plot goes from "steep" to "shallow"
|
||||
- Stop when it essentially becomes a straight line.
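
As a rough illustration of the variance-based rules above, the sketch below keeps the smallest number of components that explain at least 80% of the variation (the data are synthetic and purely for illustration; standardizing the columns corresponds to working with the correlation matrix):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data matrix, purely for illustration
X = np.random.default_rng(0).normal(size=(100, 6))

# Standardizing the columns ~ extracting components from the correlation matrix
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 80% of the variation
n_components = int(np.searchsorted(cumulative, 0.80) + 1)
print(n_components, cumulative)
```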
|
||||
35
content/research/clusteranalysis/notes/lec5.md
Normal file
35
content/research/clusteranalysis/notes/lec5.md
Normal file
|
|
@ -0,0 +1,35 @@
|
|||
# Introduction to Connectivity Based Models
|
||||
|
||||
Hierarchical algorithms combine observations to form clusters based on their distance.
|
||||
|
||||
## Connectivity Methods
|
||||
|
||||
Hierarchical clustering techniques can be subdivided according to how the clusters are formed.
|
||||
|
||||
First, there are two different methods of forming the clusters: *Agglomerative* and *Divisive*.
|
||||
|
||||
<u>Agglomerative</u> methods successively combine the $n$ individuals into larger groups with each iteration.
|
||||
|
||||
<u>Divisive</u> methods separate one giant group into finer groupings with each iteration.
|
||||
|
||||
Hierarchical methods are irrevocable: once the algorithm joins or separates a grouping, the decision cannot be undone. As Kaufman and Rousseeuw (1990) colorfully comment: *"A hierarchical method suffers from the defect that it can never repair what was done in previous steps"*.
|
||||
|
||||
It is the job of the statistician to decide when to stop the agglomerative or divisive algorithm, since having one giant cluster containing all observations or having each observation be a cluster isn't particularly useful.
|
||||
|
||||
At different distances, different clusters are formed and are more readily represented using a **dendrogram**. These algorithms do not provide a unique solution but rather provide an extensive hierarchy of clusters that merge or divide at different distances.
|
||||
|
||||
## Linkage Criterion
|
||||
|
||||
Apart from the method of forming clusters, the user also needs to decide on a linkage criterion to use. Meaning, how do you want to optimize your clusters.
|
||||
|
||||
Do you want to group based on the nearest points in each cluster? That is nearest-neighbor clustering.
|
||||
|
||||
Or do you want to group based on the farthest observations in each cluster? That is farthest-neighbor clustering.
|
||||
|
||||

|
||||
|
||||
## Shortcomings
|
||||
|
||||
This method is not very robust towards outliers, which will either show up as additional clusters or even cause other clusters to merge depending on the clustering method.
|
||||
|
||||
As we go through this section, we will go into detail about the different linkage criterion and other parameters of this model.
|
||||
90
content/research/clusteranalysis/notes/lec6.md
Normal file
90
content/research/clusteranalysis/notes/lec6.md
Normal file
|
|
@ -0,0 +1,90 @@
|
|||
# Agglomerative Methods
|
||||
|
||||
## Single Linkage
|
||||
|
||||
First let us consider the single linkage (nearest neighbor) approach. The clusters can be found through the following algorithm
|
||||
|
||||
1. Find the smallest non-zero distance
|
||||
2. Group the two objects together as a cluster
|
||||
3. Recompute the distances in the matrix by taking the minimum distances
|
||||
- Cluster a,b -> c = min(d(a, c), d(b, c))
|
||||
|
||||
Single linkage can operate directly on a proximity matrix, the actual data is not required.
|
||||
|
||||
A wonderful visual representation can be found in Everitt Section 4.2
|
||||
|
||||
## Centroid Clustering
|
||||
|
||||
This is another criterion that requires both the data and the proximity matrix, and it requires a Euclidean distance measure to preserve geometric correctness. These are the steps of the algorithm:
|
||||
|
||||
1. Find the smallest non-zero distance
|
||||
2. Group the two objects together as a cluster
|
||||
3. Recompute the distances by taking the mean of the clustered observations and computing the distances between all of the observations
|
||||
- Cluster a,b -> c = d(mean(a, b), c)
|
||||
|
||||
## Complete Linkage
|
||||
|
||||
This is like Single Linkage, except now we're taking the farthest distance. The algorithm can be adjusted to the following
|
||||
|
||||
1. Find the smallest non-zero distance
|
||||
2. Group the two objects together as a cluster
|
||||
3. Recompute the distances in the matrix by taking the maximum distances
|
||||
- Cluster a,b -> c = max(d(a, c), d(b, c))
|
||||
|
||||
## Unweighted Pair-Group Method using the Average Approach (UPGMA)
|
||||
|
||||
In this criterion, we are no longer summarizing each cluster before taking distances, but instead comparing each observation in the cluster to the outside point and taking the average
|
||||
|
||||
1. Find the smallest non-zero distance
|
||||
2. Group the two objects together as a cluster
|
||||
3. Recompute the distances in the matrix by taking the mean
|
||||
- Cluster A: a,b -> c = $mean_{i = 0}(d(A_i, c))$
|
||||
|
||||
## Median Linkage
|
||||
|
||||
This approach is similar to the UPGMA approach, except now we're taking the median instead of the mean
|
||||
|
||||
1. Find the smallest non-zero distance
|
||||
2. Group the two objects together as a cluster
|
||||
3. Recompute the distances in the matrix by taking the median
|
||||
- Cluster A: a,b -> c = $median_{i = 0}{(d(A_i, c))}$
|
||||
|
||||
## Ward Linkage
|
||||
|
||||
This one I didn't look too far into but here's the description: With Ward's linkage method, the distance between two clusters is the sum of squared deviations from points to centroids. The objective of Ward's linkage is to minimize the within-cluster sum of squares.
|
||||
|
||||
## When to use different Linkage Types?
|
||||
|
||||
According to the following two stack overflow posts: https://stats.stackexchange.com/questions/195446/choosing-the-right-linkage-method-for-hierarchical-clustering and https://stats.stackexchange.com/questions/195456/how-to-select-a-clustering-method-how-to-validate-a-cluster-solution-to-warran/195481#195481
|
||||
|
||||
These are the following ways you can justify a linkage type.
|
||||
|
||||
**Cluster metaphor**. *"I preferred this method because it constitutes clusters such (or such a way) which meets with my concept of a cluster in my particular project."*
|
||||
|
||||
**Data/method assumptions**. *"I preferred this method because my data nature or format predispose to it."*
|
||||
|
||||
**Internal validity**. *"I preferred this method because it gave me most clear-cut, tight-and-isolated clusters."*
|
||||
|
||||
**External validity**. *"I preferred this method because it gave me clusters which differ by their background or clusters which match with the true ones I know."*
|
||||
|
||||
**Cross-validity**. *"I preferred this method because it is giving me very similar clusters on equivalent samples of the data or extrapolates well onto such samples."*
|
||||
|
||||
**Interpretation**. *"I preferred this method because it gave me clusters which, explained, are most persuasive that there is meaning in the world."*
|
||||
|
||||
### Cluster Metaphors
|
||||
|
||||
Let us explore the idea of cluster metaphors now.
|
||||
|
||||
**Single Linkage** or **Nearest Neighbor** is a *spectrum* or *chain*.
|
||||
|
||||
Since single linkage joins clusters by the shortest link between them, the technique cannot discern poorly separated clusters. On the other hand, single linkage is one of the few clustering methods that can delineate non-ellipsoidal clusters.
|
||||
|
||||
**Complete Linkage** or **Farthest Neighbor** is a *circle*.
|
||||
|
||||
**Between-Group Average linkage** (UPGMA) is a united *class*.
|
||||
|
||||
**Centroid method** (UPGMC) is *proximity of platforms* (commonly used in politics)
|
||||
|
||||
## Dendrograms
|
||||
|
||||
A **dendrogram** is a tree diagram frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering. It shows how different clusters are formed at different distance groupings.
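
As a minimal sketch of these agglomerative criteria in practice, `scipy.cluster.hierarchy` accepts `'single'`, `'complete'`, `'average'`, `'centroid'`, `'median'`, and `'ward'` as linkage methods and can draw the dendrogram (the data below are synthetic and only for illustration; SciPy and matplotlib are assumed to be installed):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Two synthetic 2-D groups of observations, purely for illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(3, 0.5, size=(20, 2))])

# Swap "average" for "single", "complete", "centroid", "median", or "ward"
# to try the different linkage criteria discussed above
Z = linkage(X, method="average")

# Cut the tree into two clusters and draw the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
dendrogram(Z)
plt.show()
```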
|
||||
74
content/research/clusteranalysis/notes/lec7.md
Normal file
74
content/research/clusteranalysis/notes/lec7.md
Normal file
|
|
@ -0,0 +1,74 @@
|
|||
# Divisive Methods Pt.1
|
||||
|
||||
Divisive methods work in the opposite direction of agglomerative methods. They take one large cluster and successively split it.
|
||||
|
||||
This is computationally demanding if all $2^{k - 1} - 1$ possible divisions into two sub-clusters of a cluster of $k$ objects are considered at each stage.
|
||||
|
||||
While less common than Agglomerative methods, divisive methods have the advantage that most users are interested in the main structure in their data, and this is revealed from the outset of a divisive method.
|
||||
|
||||
## Monothetic Divisive Methods
|
||||
|
||||
For data consisting of $p$ **binary variables**, there is a computationally efficient family of methods known as *monothetic divisive methods*.
|
||||
|
||||
Monothetic divisions divide clusters according to the presence or absence of each of the $p$ variables, so that at each stage, clusters contain members with certain attributes that are either all present or all absent.
|
||||
|
||||
The term 'monothetic' refers to the use of a single variable on which to base the split. *Polythetic* methods use all the variables at each stage.
|
||||
|
||||
### Choosing the Variable to Split On
|
||||
|
||||
The choice of the variable on which a split is made depends on optimizing a criterion reflecting either cluster homogeneity or association with other variables.
|
||||
|
||||
This tends to minimize the number of splits that have to be made.
|
||||
|
||||
An example of a homogeneity criterion is the information content $C$.
|
||||
|
||||
This is defined for $p$ variables and $n$ objects, where $f_k$ is the number of individuals with the $k$th attribute:
|
||||
$$
|
||||
C = pn\log{n}-\sum_{k = 1}^p{(f_k\log{f_k} + (n-f_k)\log{(n-f_k)})}
|
||||
$$
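
A small sketch of this homogeneity criterion, transcribed directly from the formula above (assuming natural logarithms, the convention $0 \log 0 = 0$, and a binary data matrix; the example matrix is made up):

```python
import numpy as np

def information_content(X):
    """Information content C of a binary (n x p) data matrix X."""
    X = np.asarray(X)
    n, p = X.shape
    f = X.sum(axis=0).astype(float)  # f_k: number of individuals with attribute k

    def xlogx(v):
        # Convention: 0 * log(0) = 0
        v = np.asarray(v, dtype=float)
        out = np.zeros_like(v)
        nonzero = v > 0
        out[nonzero] = v[nonzero] * np.log(v[nonzero])
        return out

    return p * n * np.log(n) - np.sum(xlogx(f) + xlogx(n - f))

# Example: 6 objects, 3 binary attributes (synthetic)
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [1, 0, 1]])
print(information_content(X))
```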
|
||||
|
||||
### Association with other variables
|
||||
|
||||
Recall that another way to split is based on the association with other variables. The attribute used at each step can be chosen according to its overall association with all attributes remaining at each step.
|
||||
|
||||
This is sometimes termed *association analysis*.
|
||||
|
||||
|        | V1 = 1 | V1 = 0 |
| ------ | ------ | ------ |
| V2 = 1 | a      | b      |
| V2 = 0 | c      | d      |
|
||||
|
||||
#### Common measures of association
|
||||
|
||||
$$
|
||||
|ad-bc| \tag{4.6}
|
||||
$$
|
||||
|
||||
$$
|
||||
(ad-bc)^2 \tag{4.7}
|
||||
$$
|
||||
|
||||
$$
|
||||
\frac{(ad-bc)^2n}{(a+b)(a+c)(b+d)(c+d)} \tag{4.8}
|
||||
$$
|
||||
|
||||
$$
|
||||
\sqrt{\frac{(ad-bc)^2n}{(a+b)(a+c)(b+d)(c+d)}} \tag{4.9}
|
||||
$$
|
||||
|
||||
$$
|
||||
\frac{(ad-bc)^2}{(a+b)(a+c)(b+d)(c+d)} \tag{4.10}
|
||||
$$
|
||||
|
||||
$(4.6)$ and $(4.7)$ have the advantage that there is no danger of computational problems if any marginal totals are near zero.
|
||||
|
||||
The last three, $(4.8)$, $(4.9)$, $(4.10)$, are all related to the $\chi^2$ statistic, its square root, and the Pearson correlation coefficient respectively.
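
These measures are simple to compute from the four cell counts. A minimal sketch, naming the counts `a`, `b`, `c`, `d` as in the table above (the example numbers are made up):

```python
import numpy as np

def association_measures(a, b, c, d):
    """Association measures (4.6)-(4.10) for a 2x2 table of counts."""
    n = a + b + c + d
    cross = a * d - b * c
    margins = (a + b) * (a + c) * (b + d) * (c + d)

    return {
        "4.6  |ad - bc|": abs(cross),
        "4.7  (ad - bc)^2": cross ** 2,
        "4.8  (ad - bc)^2 n / margins": cross ** 2 * n / margins,
        "4.9  sqrt of 4.8": np.sqrt(cross ** 2 * n / margins),
        "4.10 (ad - bc)^2 / margins": cross ** 2 / margins,
    }

print(association_measures(a=30, b=10, c=5, d=25))
```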
|
||||
|
||||
### Advantages/Disadvantages of Monothetic Methods
|
||||
|
||||
Appealing features of monothetic divisive methods are the easy classification of new members and the inclusion of cases with missing values.
|
||||
|
||||
A further advantage of monothetic divisive methods is that it is obvious which variables produce the split at any stage of the process.
|
||||
|
||||
A disadvantage with these methods is that the possession of a particular attribute which is either rare or rarely found in combination with others may take an individual down a different path.
|
||||
48
content/research/clusteranalysis/notes/lec8.md
Normal file
48
content/research/clusteranalysis/notes/lec8.md
Normal file
|
|
@ -0,0 +1,48 @@
|
|||
# Divisive Methods Pt 2.
|
||||
|
||||
Recall in the previous section that we spoke about Monothetic and Polythetic methods. Monothetic methods only look at a single variable at a time, while polythetic methods look at multiple variables simultaneously. In this section, we will speak more about polythetic divisive methods.
|
||||
|
||||
## Polythetic Divisive Methods
|
||||
|
||||
Polythetic methods operate via a distance matrix.
|
||||
|
||||
This procedure avoids considering all possible splits by
|
||||
|
||||
1. Finding the object that is furthest away from the others within a group and using that as a seed for a splinter group.
|
||||
2. Each object is then considered for entry to that separate splinter group: any that are closer to the splinter group than to the main group are moved to the splinter one.
|
||||
3. The step is then repeated.
|
||||
|
||||
This process has been developed into a program named `DIANA` (DIvisive ANAlysis Clustering) which is implemented in `R`.
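
A rough sketch of a single splinter-group split on a distance matrix, in the spirit of the procedure above (an illustrative simplification, not the actual `DIANA` implementation):

```python
import numpy as np

def splinter_split(D):
    """One divisive step on a symmetric distance matrix D (illustrative only)."""
    n = D.shape[0]
    main = set(range(n))

    # 1. Seed the splinter group with the object farthest (on average) from the others
    avg_dist = D.sum(axis=1) / (n - 1)
    seed = int(np.argmax(avg_dist))
    splinter = {seed}
    main.remove(seed)

    # 2. Repeatedly move objects that are closer to the splinter group than to the main group
    moved = True
    while moved and len(main) > 1:
        moved = False
        for i in list(main):
            if len(main) == 1:
                break
            d_main = np.mean([D[i, j] for j in main if j != i])
            d_splinter = np.mean([D[i, j] for j in splinter])
            if d_splinter < d_main:
                splinter.add(i)
                main.remove(i)
                moved = True
    return sorted(main), sorted(splinter)

# Tiny synthetic example: two well-separated groups of points on a line
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1])
D = np.abs(points[:, None] - points[None, :])
print(splinter_split(D))  # ([0, 1, 2], [3, 4])
```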
|
||||
|
||||
### Similarities to Politics
|
||||
|
||||
This somewhat resembles the way a political party might split due to inner conflicts.
|
||||
|
||||
Firstly, the most discontented member leaves the party and starts a new one, and then some others follow him until a kind of equilibrium is attained.
|
||||
|
||||
## Methods for Large Data Sets
|
||||
|
||||
There are two common hierarchical methods used for large data sets: `BIRCH` and `CURE`. Both of these algorithms employ a pre-clustering phase in which dense regions are summarized; the summaries are then clustered using a hierarchical method based on centroids.
|
||||
|
||||
### CURE
|
||||
|
||||
1. `CURE` starts with a random sample of points and represents clusters by a smaller number of points that capture the shape of the cluster
|
||||
2. Which are then shrunk towards the centroid as to dampen the effect of the outliers
|
||||
3. Hierarchical clustering then operates on the representative points
|
||||
|
||||
`CURE` has been shown to be able to cope with arbitrary-shaped clusters and in that respect may be superior to `BIRCH`, although it does require judgment as to the number of clusters and also a parameter which favors either more or less compact clusters.
|
||||
|
||||
## Revisiting Topics: Cluster Dissimilarity
|
||||
|
||||
In order to decide where clusters should be combined (for agglomerative), or where a cluster should be split (for divisive), a measure of dissimilarity between sets of observations is required.
|
||||
|
||||
In most methods of hierarchical clustering this is achieved through the use of an appropriate:
|
||||
|
||||
- Metric (a measure of distance between pairs of observations)
|
||||
- Linkage Criterion (which specifies the dissimilarity between sets as a function of the pairwise distances between observations in the sets)
|
||||
|
||||
## Advantages of Hierarchical Clustering
|
||||
|
||||
- Any valid distance measure can be used
|
||||
- In most cases, the observations themselves are not required, just the matrix of distances
|
||||
- This can have the advantage of only having to store a distance matrix in memory as opposed to the full n-dimensional data matrix.
|
||||
58
content/research/clusteranalysis/notes/lec9-1.md
Normal file
58
content/research/clusteranalysis/notes/lec9-1.md
Normal file
|
|
@ -0,0 +1,58 @@
|
|||
# CURE and TSNE
|
||||
|
||||
## Clustering Using Representatives (CURE)
|
||||
|
||||
Clustering using Representatives is a Hierarchical clustering technique in which you can represent a cluster using a **set** of well-scattered representative points.
|
||||
|
||||
This algorithm has a parameter $\alpha$ which defines the fraction by which the representative points are shrunk towards the centroid.
|
||||
|
||||
CURE is known to be robust to outliers and able to identify clusters that have a **non-spherical** shape and size variance.
|
||||
|
||||
The clusters with the closest pair of representatives are the clusters that are merged at each step of CURE's algorithm.
|
||||
|
||||
This algorithm cannot be directly applied to large datasets due to high runtime complexity. Several enhancements were added to address this limitation:
|
||||
|
||||
- Random sampling: This involves a trade off between accuracy and efficiency. One would hope that the random sample they obtain is representative of the population
|
||||
- Partitioning: The idea is to partition the sample space into $p$ partitions
|
||||
|
||||
Youtube Video: https://www.youtube.com/watch?v=JrOJspZ1CUw
|
||||
|
||||
Steps
|
||||
|
||||
1. Pick a random sample of points that fit in main memory
|
||||
2. Cluster sample points hierarchically to create the initial clusters
|
||||
3. Pick representative point**s**
|
||||
1. For each cluster, pick $k$ representative points, as dispersed as possible
|
||||
2. Move each representative point a fixed fraction $\alpha$ toward the centroid of the cluster (see the sketch after these steps)
|
||||
4. Rescan the whole dataset and visit each point $p$ in the data set
|
||||
5. Place it in the "closest cluster"
|
||||
1. Closest as in shortest distance among all the representative points.
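
The shrinking step in particular is simple: each representative is pulled a fraction $\alpha$ of the way toward its cluster centroid. A minimal sketch with made-up numbers:

```python
import numpy as np

def shrink_representatives(reps, alpha):
    """Move each representative point a fraction alpha toward the cluster centroid."""
    centroid = reps.mean(axis=0)
    return reps + alpha * (centroid - reps)

# Five representative points of one cluster (synthetic)
reps = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0], [1.0, 3.0]])
print(shrink_representatives(reps, alpha=0.2))
```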
|
||||
|
||||
## TSNE
|
||||
|
||||
TSNE allows us to reduce the dimensionality of a dataset to two, which lets us visualize the data.
|
||||
|
||||
It is able to do this since many real-world datasets have a low intrinsic dimensionality embedded within the high-dimensional space.
|
||||
|
||||
Since the technique needs to conserve the structure of the data, two corresponding mapped points must be close to each other distance-wise as well. Let $|x_i - x_j|$ be the Euclidean distance between two data points, and $|y_i - y_j|$ the distance between the map points. The conditional similarity between two data points is:
|
||||
$$
|
||||
p_{j|i} = \frac{exp(-|x_i-x_j|^2 / (2\sigma_i^2))}{\sum_{k \ne i}{exp(-|x_i-x_k|^2/(2\sigma_i^2))}}
|
||||
$$
|
||||
Where we are considering the **Gaussian distribution** surrounding the distance between $x_j$ from $x_i$ with a given variance $\sigma_i^2$. The variance is different for every point; it is chosen such that points in dense areas are given a smaller variance than points in sparse areas.
|
||||
|
||||
Now the similarity between mapped points is
|
||||
$$
|
||||
q_{ij} = \frac{f(|y_i - y_j|)}{\sum_{k \ne i}{f(|y_i - y_k|)}}
|
||||
$$
|
||||
Where $f(z) = \frac{1}{1 + z^2}$
|
||||
|
||||
This has the same idea as the conditional similarity between two data points, except this is based on the **Cauchy distribution**.
|
||||
|
||||
TSNE works at minimizing the Kullback-Leibler divergence between the two distributions $p_{ij}$ and $q_{ij}$
|
||||
$$
|
||||
KL(P || Q) = \sum_{i,j}{p_{i,j} \log{\frac{p_{ij}}{q_{ij}}}}
|
||||
$$
|
||||
To minimize this score, gradient descent is typically performed
|
||||
$$
|
||||
\frac{\partial KL(P||Q)}{\partial y_i} = 4\sum_j{(p_{ij} - q_{ij})f(|y_i - y_j|)(y_i - y_j)}
|
||||
$$
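
In practice this optimization is rarely written by hand; scikit-learn's `TSNE` performs it for us. A minimal usage sketch (the data and the `perplexity` value are illustrative assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic high-dimensional data, purely for illustration
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 20)),
               rng.normal(5, 1, size=(50, 20))])

# Map to two dimensions; perplexity controls the effective neighborhood size
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(embedding.shape)  # (100, 2), ready to scatter-plot
```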
|
||||
72
content/research/clusteranalysis/notes/lec9-2.md
Normal file
72
content/research/clusteranalysis/notes/lec9-2.md
Normal file
|
|
@ -0,0 +1,72 @@
|
|||
# Cluster Validation
|
||||
|
||||
There are multiple approaches to validating your cluster models
|
||||
|
||||
- Internal Evaluation: This is when you summarize the clustering into a single score. For example, minimizing the deviations from the centroids.
|
||||
- External Evaluation: Minimizing the deviations from some known "labels"
|
||||
- Manual Evaluation: A human expert decides whether or not it's good
|
||||
- Indirect Evaluation: Evaluating the utility of clustering in its intended application.
|
||||
|
||||
## Some Problems With These Evaluations
|
||||
|
||||
Internal evaluation measures suffer from the problem that they represent functions that are objectives for many clustering algorithms. So of course the result of the clustering algorithm will be such that the objective is minimized.
|
||||
|
||||
External evaluation suffers from the fact that if we had labels to begin with then we wouldn't need to cluster. Practical applications of clustering occur usually when we don't have labels. On the other hand, possible labeling can reflect one possible partitioning of the data set. There could exist different, perhaps even better clustering.
|
||||
|
||||
## Internal Evaluation
|
||||
|
||||
We like to see a few qualities in cluster models
|
||||
|
||||
- *Robustness*: Refers to the effects of errors in data or missing observations, and changes in the data and methods.
|
||||
- *Cohesiveness*: Clusters should be compact, i.e., have high intra-cluster similarity.
|
||||
- Clusters should be dissimilar from other clusters, i.e., have low inter-cluster similarity.
|
||||
- *Influence*: We should pay attention to and try to control for the influence of certain observations on the overall cluster
|
||||
|
||||
Let us focus on the second and third bullet point for now. Internal evaluation measures are best suited to get some insight into situations where one algorithm performs better than another; however, this does not imply that one algorithm produces more valid results than another.
|
||||
|
||||
### Davies-Bouldin Index
|
||||
|
||||
$$
|
||||
DB = \frac{1}{n}\sum_{i=1}^n{max_{j\ne i}{(\frac{\sigma_i + \sigma_j}{d(c_i,c_j)})}}
|
||||
$$
|
||||
|
||||
Where $n$ is the number of clusters, $c$ indicates a centroid, and $\sigma$ represents the deviation from the centroid.
|
||||
|
||||
Better clustering algorithms are indicated by smaller DB values.
|
||||
|
||||
### Dunn Index
|
||||
|
||||
$$
|
||||
D= \frac{min_{1\le i < j \le n}{d(i,j)}}{max_{1\le k \le n}{d^\prime(k)}}
|
||||
$$
|
||||
|
||||
The Dunn index aims to identify dense and well-separated clusters. This is defined as the ratio between the minimal inter-cluster distance to maximal intra-cluster distance.
|
||||
|
||||
High Dunn Index values are more desirable.
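
Both indices are easy to compute once a clustering has been produced. The sketch below uses scikit-learn's `davies_bouldin_score` together with a small hand-rolled Dunn index (definitions of the intra- and inter-cluster distances vary between texts, so treat this as one possible version; the data are synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, pairwise_distances

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("Davies-Bouldin:", davies_bouldin_score(X, labels))  # smaller is better

def dunn_index(X, labels):
    """Minimal inter-cluster distance divided by maximal intra-cluster diameter."""
    D = pairwise_distances(X)
    clusters = np.unique(labels)
    intra = max(D[np.ix_(labels == c, labels == c)].max() for c in clusters)
    inter = min(D[np.ix_(labels == a, labels == b)].min()
                for a in clusters for b in clusters if a < b)
    return inter / intra

print("Dunn:", dunn_index(X, labels))  # larger is better
```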
|
||||
|
||||
### Bootstrapping
|
||||
|
||||
In terms of robustness we can measure uncertainty in each of the individual clusters. This can be examined using a bootstrapping approach by Suzuki and Shimodaira (2006). The probability or "p-value" is the proportion of bootstrapped samples that contain the cluster. Larger p-values in this case indicate more support for the cluster.
|
||||
|
||||
This is available in R via the `pvclust` package.
|
||||
|
||||
### Split-Sample Validation
|
||||
|
||||
One approach to assess the effects of perturbations of the data is to randomly divide the data into two subsets and perform an analysis on each subset separately. This method was proposed by McIntyre and Blashfield in 1980 and involves the following steps (a code sketch follows the list):
|
||||
|
||||
- Divide the sample in two and perform a cluster analysis on one of the samples
|
||||
- Have a fixed rule for the number of clusters
|
||||
- Determine the centroids of the clusters, and compute proximities between the objects in the second sample and the centroids, classifying the objects into their nearest cluster.
|
||||
- Cluster the second sample using the same methods as before and compare these two alternate clusterings using something like the *adjusted Rand index*.
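
A hedged sketch of this split-sample procedure using k-means and the adjusted Rand index (synthetic data; the choice of k-means and of $k = 3$ are assumptions for the example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, pairwise_distances_argmin

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
A, B = X[idx[:200]], X[idx[200:]]  # random split into two halves

k = 3  # fixed rule for the number of clusters
km_A = KMeans(n_clusters=k, n_init=10, random_state=0).fit(A)

# Classify the second sample by its nearest centroid from the first analysis
labels_from_A = pairwise_distances_argmin(B, km_A.cluster_centers_)

# Cluster the second sample independently and compare the two partitions
labels_B = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(B)
print("Adjusted Rand index:", adjusted_rand_score(labels_from_A, labels_B))
```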
|
||||
|
||||

|
||||
|
||||
## Influence of Individual Points
|
||||
|
||||
Using internal evaluation metrics, you can see the impact of each point by doing a "leave one out" analysis. Here you evaluate the dataset minus one point for each of the points. If a positive difference is found, the point is regarded as a *facilitator*, whereas if it is negative then it is considered an *inhibitor*. Once an influential inhibitor is found, the suggestion is normally to omit it from the clustering.
|
||||
|
||||
## R Package
|
||||
|
||||
`clValid` contains a variety of internal validation measures.
|
||||
|
||||
Paper: https://cran.r-project.org/web/packages/clValid/vignettes/clValid.pdf
|
||||
35
content/research/clusteranalysis/readings.md
Normal file
35
content/research/clusteranalysis/readings.md
Normal file
|
|
@ -0,0 +1,35 @@
|
|||
# Readings for Lectures of Cluster Analysis
|
||||
|
||||
## Lecture 1
|
||||
Garson Textbook Chapter 3
|
||||
|
||||
## Lecture 2
|
||||
[A Tutorial on Principal Component Analysis](https://arxiv.org/pdf/1404.1100.pdf)
|
||||
|
||||
## Lecture 3
|
||||
No Readings
|
||||
|
||||
## Lecture 4
|
||||
An Introduction to Applied Multivariate Analysis with R by Brian Everitt and Torsten Hothorn.
|
||||
|
||||
Sections 3.0-3.9 Everitt
|
||||
|
||||
## Lecture 5
|
||||
|
||||
Section 4.1 Everitt
|
||||
|
||||
## Lecture 6
|
||||
Section 4.2 Everitt
|
||||
|
||||
Applied Multivariate Statistical Analysis Johnson Section 12.3
|
||||
|
||||
[Linkage Methods for Cluster Observations](https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/multivariate/how-to/cluster-observations/methods-and-formulas/linkage-methods/#mcquitty)
|
||||
|
||||
## Lecture 7
|
||||
Section 4.3 Everitt
|
||||
|
||||
## Lecture 8
|
||||
[Introduction to the TSNE Algorithm](https://www.oreilly.com/learning/an-illustrated-introduction-to-the-t-sne-algorithm)
|
||||
|
||||
## Lecture 9
|
||||
Section 9.5 Everitt
|
||||
119
content/research/clusteranalysis/syllabus.md
Normal file
119
content/research/clusteranalysis/syllabus.md
Normal file
|
|
@ -0,0 +1,119 @@
|
|||
# Cluster Analysis Spring 2018
|
||||
|
||||
### Distance, Dimensionality Reduction, and Tendency
|
||||
|
||||
- Distance
|
||||
- Euclidean Distance
|
||||
- Squared Euclidean Distance
|
||||
- Manhattan Distance
|
||||
- Maximum Distance
|
||||
- Mahalanobis Distance
|
||||
- Which distance function should you use?
|
||||
- PCA
|
||||
- Cluster Tendency
|
||||
- Hopkins Statistic
|
||||
- Scaling Data
|
||||
|
||||
### Validating Clustering Models
|
||||
|
||||
- Clustering Validation
|
||||
- Cross Validation
|
||||
|
||||
### Connectivity Models
|
||||
|
||||
- Agglomerative Clustering
|
||||
- Single Linkage Clustering
|
||||
- Complete Linkage Clustering
|
||||
- Unweighted Pair Group Method with Arithmetic Mean (If time permits)
|
||||
- Dendrograms
|
||||
- Divisive Clustering
|
||||
- CURE (Clustering using REpresentatives) algorithm (If time permits)
|
||||
|
||||
### Cluster Evaluation
|
||||
|
||||
- Internal Evaluation
|
||||
- Dunn Index
|
||||
- Silhouette Coefficient
|
||||
- Davies-Bouldin Index (If time permits)
|
||||
- External Evaluation
|
||||
- Rand Measure
|
||||
- Jaccard Index
|
||||
- Dice Index
|
||||
- Confusion Matrix
|
||||
- F Measure (If time permits)
|
||||
- Fowlkes-Mallows Index (If time permits)
|
||||
|
||||
### Centroid Models
|
||||
|
||||
- Jenks Natural Breaks Optimization
|
||||
- Voronoi Diagram
|
||||
- K means clustering
|
||||
- K medoids clustering
|
||||
- K Medians/Modes clustering
|
||||
- When to use K means as opposed to K medoids or K Medians?
|
||||
- How many clusters should you use?
|
||||
- Lloyd's Algorithm for Approximating K-means (If time permits)
|
||||
|
||||
### Density Models
|
||||
|
||||
- DBSCAN Density Based Clustering Algorithm
|
||||
- OPTICS Ordering Points To Identify the Clustering Structure
|
||||
- DeLi-Clu Density Link Clustering (If time permits)
|
||||
- What should be your density threshold?
|
||||
|
||||
### Analysis of Model Appropriateness
|
||||
|
||||
- When do we use each of the models above?
|
||||
|
||||
### Distribution Models (If time permits)
|
||||
|
||||
- Fuzzy Clusters
|
||||
- EM (Expectation Maximization) Clustering
|
||||
- Maximum Likelihood Gaussian
|
||||
- Probabilistic Hierarchical Clustering
|
||||
|
||||
|
||||
|
||||
|
||||
## Textbooks
|
||||
|
||||
Cluster Analysis 5th Edition
|
||||
|
||||
- Authors: Brian S. Everitt, Sabine Landau, Morven Leese, Daniel Stahl
|
||||
- ISBN-13: 978-0470749913
|
||||
- Cost: Free on UMW Library Site
|
||||
- Amazon Link: https://www.amazon.com/Cluster-Analysis-Brian-S-Everitt/dp/0470749911/ref=sr_1_1?ie=UTF8&qid=1509135983&sr=8-1
|
||||
- Table of Contents: http://www.wiley.com/WileyCDA/WileyTitle/productCd-EHEP002266.html
|
||||
|
||||
Cluster Analysis: 2014 Edition (Statistical Associates Blue Book Series 24)
|
||||
|
||||
- Author: David Garson
|
||||
- ISBN: 978-1-62638-030-1
|
||||
- Cost: Free with Site Registration
|
||||
- Website: http://www.statisticalassociates.com/clusteranalysis.htm
|
||||
|
||||
|
||||
|
||||
## Schedule
|
||||
|
||||
In an ideal world, each of the topics below would take roughly the time period I estimated for learning them. Of course, you have more experience with how long it actually takes to learn these topics, so I'll leave this mostly to your discretion.
|
||||
|
||||
**Distance, Dimensionality Reduction, and Tendency** -- 3 Weeks
|
||||
|
||||
**Validating Cluster Models** -- 1 Week
|
||||
|
||||
**Connectivity Models** -- 2 Weeks
|
||||
|
||||
**Cluster Evaluation** -- 1 Week
|
||||
|
||||
**Centroid Models** -- 3 Weeks
|
||||
|
||||
**Density Models** -- 3 Weeks
|
||||
|
||||
**Analysis of Model Appropriateness** -- 1 Week
|
||||
|
||||
The schedule above accounts for 14 weeks, so there is a week that is free as a buffer.
|
||||
|
||||
## Conclusion
|
||||
|
||||
Creating this document got me really excited for this independent study. Feel free to give me feedback :)
|
||||
22
content/research/deepreinforcementlearning.md
Normal file
22
content/research/deepreinforcementlearning.md
Normal file
|
|
@ -0,0 +1,22 @@
|
|||
---
|
||||
Title: Deep Reinforcement Learning
|
||||
Description: Combining Reinforcement Learning with Deep Learning
|
||||
---
|
||||
|
||||
In the Fall of 2019, I looked at integrating demonstration data into a reinforcement learning algorithm in order to make it sample efficient.
|
||||
|
||||
The results are positive and are heavily documented through the following:
|
||||
|
||||
[Honors Thesis](/files/research/honorsthesis.pdf)
|
||||
|
||||
[Honors Defense](/files/research/ExpeditedLearningInteractiveDemo.pptx)
|
||||
|
||||
Thanks to my advisor Dr. Ron Zacharski and my committee members for all their feedback on my work!
|
||||
|
||||
In the spring of 2019, under the guidance of Dr. Ron Zacharski I practiced several of the modern techniques used in Reinforcement Learning today.
|
||||
|
||||
I facilitated my learning by creating a [reinforcement learning library](https://github.com/brandon-rozek/rltorch) with implementations of several popular papers. ([Semi-Weekly Progress](weeklyprogress))
|
||||
|
||||
I also presented my research (which involved creating an algorithm) at my school's research symposium. ([Slides](/files/research/QEP.pptx)) ([Abstract](abstractspring2019))
|
||||
|
||||
In the summer of 2019, I became interested in having the interactions with the environment be in a separate process. This inspired two different implementations, [ZeroMQ](https://github.com/brandon-rozek/zerogym) and [HTTP](https://github.com/brandon-rozek/gymhttp). Given the option, you should use the ZeroMQ implementation since it contains less communication overhead.
|
||||
13
content/research/deepreinforcementlearning/WeeklyProgress.md
Normal file
13
content/research/deepreinforcementlearning/WeeklyProgress.md
Normal file
|
|
@ -0,0 +1,13 @@
|
|||
## Weekly Progress
|
||||
|
||||
I didn't do the greatest job at writing a progress report every week but here on the page are the ones I did write.
|
||||
|
||||
[January 29 2019](Jan29)
|
||||
|
||||
[February 12 2019](Feb12)
|
||||
|
||||
[February 25 2019](Feb25)
|
||||
|
||||
[March 26 2019](Mar26)
|
||||
|
||||
[April 2 2019](Apr2)
|
||||
Binary file not shown.
|
After Width: | Height: | Size: 24 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 40 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 11 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 25 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 68 KiB |
|
|
@ -0,0 +1,25 @@
|
|||
# Progress Report for Week of April 2nd
|
||||
|
||||
## Added Video Recording Capability to MinAtar environment
|
||||
|
||||
You can now use the OpenAI Monitor Wrapper to watch the actions performed by agents in the MinAtar suite. (Currently the videos are in grayscale)
|
||||
|
||||
Problems I had to solve:
|
||||
|
||||
- How to represent the channels into a grayscale value
|
||||
- Getting the tensor into the right format (with shape and dtype)
|
||||
- Adding additional meta information that OpenAI expected
|
||||
|
||||
## Progress Towards \#Exploration
|
||||
|
||||
After getting nowhere trying to combine the paper on Random Network Distillation with Count-based Exploration and Intrinsic Motivation, I turned to the paper \#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning.
|
||||
|
||||
This paper uses the idea of an autoencoder to learn a smaller latent state representation of the input. We can then use this smaller representation as a hash and count states based on these hashes.
|
||||
|
||||
Playing around with the ideas of autoencoders, I wanted a way to discretize my hash more than just what floating point precision allows. Of course this turns it into a non-differentiable function, which I then tried to solve with evolutionary methods. Sadly the rate of optimization was drastically diminished using the evolutionary approach. Therefore, my experiments for this week failed.
|
||||
|
||||
I'll probably look towards implementing what the paper did for my library and move on to a different piece.
|
||||
|
||||
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,63 @@
|
|||
# Weekly Progress Feb 12
|
||||
|
||||
## Finished writing scripts for data collection
|
||||
|
||||
- Playing a game now records
|
||||
- Video
|
||||
- State / Next-State as pixel values
|
||||
- Action taken
|
||||
- Reward Received
|
||||
- Whether the environment is finished every turn
|
||||
- Wrote scripts to gather and preprocess the demonstration data
|
||||
- Now everything is standardized on the npy format. Hopefully that stays consistent for a while.
|
||||
|
||||
## Wrote code to create an actor that *imitates* the demonstrator
|
||||
|
||||
Tweaked the loss function to be a form of cross-entropy loss
|
||||
$$
|
||||
loss = \max_a(Q(s, a) + l(s,a)) - Q(s, a_E)
|
||||
$$
|
||||
Where $l(s, a)$ is zero for the action the demonstrator took and positive elsewhere.
|
||||
|
||||
Turns out, that as with a lot of deep learning applications, you need a lot of training data. So the agent currently does poorly on mimicking the performance of the demonstrator.
|
||||
|
||||
### Aside : Pretraining with the Bellman Equation
|
||||
|
||||
Based off the paper:
|
||||
|
||||
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys. **Learning from Demonstrations for Real World Reinforcement Learning**
|
||||
|
||||
|
||||
|
||||
This paper had the demonstration data not only include the $(state, action)$ pairs like I did, but also the $(next_state, reward, done)$ signals. This way, they can pretrain with both the supervised loss and the general Q-learning loss.
|
||||
|
||||
That way, they can use the result of the pretraining as a starting ground for the actual training. The way I implemented it, I would first train an imitator which would then be used as the actor during the simulations from which we would collect data and begin training another net.
|
||||
|
||||
## Prioritized Replay
|
||||
|
||||
Instead of uniform sampling of experiences, we can sample by how surprised we were about the outcome of the Q-value loss.
|
||||
|
||||
I had a previous implementation of this, but it was faulty, so I took the code from OpenAI baselines and integrated it with my library.
|
||||
|
||||
It helps with games like Pong, because there are many states where the result is not surprising and inconsequential. Like when the ball is around the center of the field.
|
||||
|
||||
## Schedulers
|
||||
|
||||
There are some people who use Linear Schedulers to change the value of various parameters throughout training.
|
||||
|
||||
I implemented it as an iterator in python and called *next* for each time the function uses the hyper-parameter.
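
A minimal sketch of such a scheduler as a Python iterator (this is the idea, not the exact code in rltorch; the start, end, and step values are illustrative):

```python
class LinearScheduler:
    """Linearly interpolate a hyper-parameter from `start` to `end` over `steps` calls."""
    def __init__(self, start, end, steps):
        self.start, self.end, self.steps = start, end, steps
        self.t = 0

    def __iter__(self):
        return self

    def __next__(self):
        # Fraction of the schedule remaining, clipped at zero once finished
        remaining = max(0.0, 1.0 - self.t / self.steps)
        self.t += 1
        return self.end + (self.start - self.end) * remaining

epsilon = LinearScheduler(start=1.0, end=0.05, steps=10_000)
print(next(epsilon))  # 1.0 on the first call, decaying toward 0.05
```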
|
||||
|
||||
The two parameters I use schedulers in normally are:
|
||||
|
||||
- Epsilon - Gradually decreases exploration rate
|
||||
- Beta - Decreases the importance of the weights of experiences that get frequently sampled
|
||||
|
||||
|
||||
|
||||
## Layer Norm
|
||||
|
||||
"Reduces training by normalizes the activities of the neurons."
|
||||
|
||||
Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton. **Layer Normalization.**
|
||||
|
||||
It's nicely implemented in PyTorch already so I threw that in for each layer of the network. Reduces the average loss.
|
||||
|
|
@ -0,0 +1,117 @@
|
|||
# Weekly Progress for February 25th
|
||||
|
||||
## Evolutionary Algorithms
|
||||
|
||||
### Genetic Algorithm
|
||||
|
||||
I worked towards implementing the Genetic Algorithm into PyTorch. I got a working implementation which operates like the following:
|
||||
|
||||
Generate $n$ perturbations of the model by taking the tensors in the model dictionary and adding some random noise to them.
|
||||
|
||||
- Calculate the fitness of each model
|
||||
- Keep the $k$ best survivors
|
||||
- Sample (with replacement) $2(n - k)$ parents based on their fitness. (Higher fitness -> More likely to be sampled)
|
||||
|
||||
- Easy way to do this is: $prob = fitness / sum(fitness)$
|
||||
- Split the parents in half to $parent1$ and $parent2$
|
||||
- Perform crossover in order to make children
|
||||
|
||||
- For every tensor in the model dictionary
|
||||
|
||||
- Find a point in where you want the split to happen
|
||||
- Create a new tensor: the left part of the split comes from $parent1$, the other side from $parent2$
|
||||
- Mutate the child with $\epsilon$ probability
|
||||
- Add random noise to the tensor
|
||||
|
||||
Normally if you perform this algorithm with many iterations, your results start to converge towards a particular solution.
|
||||
|
||||
The main issue with this algorithm is that you need to carry $n$ copies of the model with you throughout the entire training process. You also need a reasonably good definition of the bounds within which your weights and biases can lie; otherwise the algorithm might not converge to a good solution.
|
||||
|
||||
Due to these reasons, I didn't end up adding this functionality to RLTorch.
|
||||
|
||||
### Evolutionary Strategies
|
||||
|
||||
To combat these issues, I looked into a paper by OpenAI called "Evolution Strategies"
|
||||
|
||||
Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, Ilya Sutskever. **Evolution Strategies as a Scalable Alternative to Reinforcement Learning**
|
||||
|
||||
https://arxiv.org/abs/1703.03864
|
||||
|
||||
*This paper mostly describes the efforts made to make a certain evolutionary strategy scalable to many nodes. I ended up using only the algorithm from the paper and I didn't implement any of the scalable considerations.*
|
||||
|
||||
The following code below explains the process for maximizing a simple function
|
||||
|
||||
```python
import numpy as np

# Assumes population_size, sigma, learning_rate, current_solution (a 1-D NumPy array),
# and a calculate_fitness function are defined elsewhere.
white_noise = np.random.randn(population_size, *current_solution.shape)
noise = sigma * white_noise
candidate_solutions = current_solution + noise

# Calculate fitness, then mean shift and scale it
fitness_values = calculate_fitness(candidate_solutions)
fitness_values = (fitness_values - np.mean(fitness_values)) / (np.std(fitness_values) + np.finfo('float').eps)

# Move the current solution toward the higher-fitness perturbations
new_solution = current_solution + learning_rate * np.mean(white_noise.T * fitness_values, axis=1) / sigma
```
|
||||
|
||||
To explain further, suppose you have a guess as to what the solution is. To generate new guesses, let us add random noise around our guess like the image below.
|
||||
|
||||

|
||||
|
||||
Now calculate the fitness of all the points, let us represent that by the intensity of blue in the background,
|
||||
|
||||

|
||||
|
||||
What ends up happening is that your new solution, the black square, will move towards the areas with higher reward.
|
||||
|
||||
## Q-Evolutionary Policies
|
||||
|
||||
**Motivation**
|
||||
|
||||
So this brings up the point, why did I bother studying these algorithms? I ran into a problem when I was looking to implement the DDPG algorithm. Primarily that it requires your action space to be continuous, which is not something I'm currently working with.
|
||||
|
||||
Then I thought, why can I not make it work with discrete actions? First let us recall the loss of a policy function under DDPG:
|
||||
$$
|
||||
loss_\pi = -Q(s, \pi(s))
|
||||
$$
|
||||
For the discrete case, your Q-function is going to be a function of the state and output the values of each action taken under that state. In mathematical terms,
|
||||
$$
|
||||
loss_\pi = -Q(s)[\pi(s)]
|
||||
$$
|
||||
Indexing into another array, however, is not a differentiable function. Meaning I cannot then calculate $loss_\pi$ with respect to $\pi$.
|
||||
|
||||
Evolutionary Strategies are non-gradient-based methods, meaning that I can bypass this restriction of the traditional methods.
|
||||
|
||||
**How it works**
|
||||
|
||||
Train your Value function with the typical DQN loss.
|
||||
|
||||
Every 10 Value function updates, update the policy. This gives time for the Value function to stabilize so that the policy is not chasing suboptimal value functions. Update the policy according to the $loss_\pi$ written above.
|
||||
|
||||
**Results**
|
||||
|
||||

|
||||
|
||||
The orange line is the QEP performance
|
||||
|
||||
Blue is DQN
|
||||
|
||||
## Future Direction
|
||||
|
||||
I would like to look back towards demonstration data and figure out a way to pretrain a QEP model.
|
||||
|
||||
It's somewhat easy to think of a way to train the policy. Make it a cross-entropy loss with respect to the actions the demonstrator took.
|
||||
$$
|
||||
loss_\pi = -\sum_{c=1}^M{y_{o,c}\ln{(p_{o,c})}}
|
||||
$$
|
||||
Where $M$ is the number of classes, $y$ is the binary indicator for whether the correct classification was observed, and $p$ is the predicted probability for class $c$.
|
||||
|
||||
It's harder to think about how I would do it for a Value function. There was the approach we saw before where the loss function was like so,
|
||||
$$
|
||||
loss_Q = \max_a(Q(s, a) + l(s,a)) - Q(s, a_E)
|
||||
$$
|
||||
Where $l(s, a)$ is a vector that is positive for all values except for what the demonstrator took which is action $a_E$.
|
||||
|
||||
The main issue with this loss function for the Value is that it does not capture the actual output values of the functions, just how they are relative to each other. Perhaps adding another layer can help transform it to the values it needs to be. This will take some more thought.
|
||||
|
||||
|
||||
|
||||
|
|
@ -0,0 +1,65 @@
|
|||
# Weekly Progress Jan 29
|
||||
|
||||
## 1. Training From Demonstrations
|
||||
|
||||
Training from demonstrations is the act of using previous data to help speed up the learning process.
|
||||
|
||||
I read two papers on the topic:
|
||||
|
||||
[1] Gabriel V. de la Cruz Jr., Yunshu Du, Matthew E. Taylor. **Pre-training Neural Networks with Human Demonstrations for Deep Reinforcement Learning**.
|
||||
|
||||
https://arxiv.org/abs/1709.04083
|
||||
|
||||
The authors showed how you can speed up the training of a DQN network, especially under problems involving computer vision, if you first train the convolution layers by using a supervised loss between the actions the network would choose and the actions from the demonstration data given a state.
|
||||
|
||||
[2] Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Gabriel Dulac-Arnold, Ian Osband, John Agapiou, Joel Z. Leibo, Audrunas Gruslys. **Deep Q-learning from Demonstrations.**
|
||||
|
||||
https://arxiv.org/abs/1704.03732
|
||||
|
||||
The authors showed how from "expert" demonstrations we can speed up the training of a DQN by incorporating the supervised loss into the loss function.
|
||||
|
||||
### Supervised Loss
|
||||
|
||||
What is supervised loss in the context of DQNs?
|
||||
$$
|
||||
Loss = \max_a(Q(s, a) + l(s,a)) - Q(s, a_E)
|
||||
$$
|
||||
Where $a_E$ is the expert action, and $l(s, a)$ is a vector of positive values with an entry of zero for the expert action.
|
||||
|
||||
The intuition behind this is that for the loss to be zero, the network would've had to have chosen the same action as the expert. The $l(s, a)$ term exists to ensure that there are no ties.
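
A sketch of this large-margin supervised loss in PyTorch, written in batch form (the margin value of 0.8, the random Q-values, and the function name are illustrative assumptions, not the exact code I used):

```python
import torch

def supervised_margin_loss(q_values, expert_actions, margin=0.8):
    """max_a[Q(s, a) + l(s, a)] - Q(s, a_E), averaged over a batch.

    q_values:       (batch, num_actions) tensor of Q(s, .)
    expert_actions: (batch,) tensor of expert action indices
    """
    # l(s, a): margin everywhere except the expert action, where it is zero
    l = torch.full_like(q_values, margin)
    l.scatter_(1, expert_actions.unsqueeze(1), 0.0)

    expert_q = q_values.gather(1, expert_actions.unsqueeze(1)).squeeze(1)
    return ((q_values + l).max(dim=1).values - expert_q).mean()

# Tiny example with random Q-values
q = torch.randn(4, 3)
a_e = torch.tensor([0, 2, 1, 1])
print(supervised_margin_loss(q, a_e))
```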
|
||||
|
||||
### What I decided to do
|
||||
|
||||
The main environment I chose to test these algorithms is Acrobot. It is a control theory problem and it has several physics related numbers as input. (Not image based)
|
||||
|
||||
I noticed when implementing [1] at least for the non-convolution case, there's no point in trying to train earlier layers. Perhaps I'll try again when I move onto the atari gameplays...
|
||||
|
||||
I decided against following [2] exactly. It's not that I disagree with the approach, but I don't like the need for "expert" data. If you decide to proceed anyways with non-expert data, you need to remember that it is incorporated into the loss function. Which means that you run the risk of learning sub-optimal policies.
|
||||
|
||||
In the end, what I decided to do was the following
|
||||
|
||||
1. Train a neural network that maps states->actions from demonstration data
|
||||
2. Use that network to play through several simulated runs of the environment, store the (state, action, reward, next_state, done) signals, insert them into the experience replay buffer, and train from those (**Pretrain step**)
|
||||
3. Once the pretrain step is completed, replace the network that maps from demonstration data with the one you've been training in the pre-training step and continue with the regular algorithm
|
||||
|
||||
|
||||
|
||||
## 2. Noisy Networks
|
||||
|
||||
Based on this paper...
|
||||
|
||||
Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Ian Osband, Alex Graves, Vlad Mnih, Remi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, Shane Legg. **Noisy Networks for Exploration.**
|
||||
|
||||
This paper describes adding parametric noise to the weights and biases and how it aids in exploration. The parameters of the noise are learned with gradient descent along with the other network parameters.
|
||||
|
||||
|
||||
|
||||
For the noise distribution I used the Gaussian Normal. One property that's handy to know about it is the following
|
||||
$$
|
||||
N(\mu, \sigma) = \mu + \sigma*N(0, 1)
|
||||
$$
|
||||
In our case, the $\mu$ would be the typical weights and biases, and the $\sigma$ is a new parameter representing how much variation or uncertainty needs to be added.
|
||||
|
||||
The concept is that as the network grows more confident about its predictions, the variation in the weights starts to decrease. This way the exploration is systematic and not something randomly injected like the epsilon-greedy strategy.
|
||||
|
||||
The paper describes replacing all your linear densely connected layers with this noisy linear approach.
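
A condensed sketch of such a noisy linear layer in PyTorch (this uses independent Gaussian noise per weight; the paper also proposes a factorized variant, which is omitted here, and the initialization values are only one reasonable choice):

```python
import math
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer whose weights are mu + sigma * eps with eps ~ N(0, 1)."""
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.weight_mu = nn.Parameter(torch.empty(out_features, in_features))
        self.weight_sigma = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.bias_mu = nn.Parameter(torch.zeros(out_features))
        self.bias_sigma = nn.Parameter(torch.full((out_features,), sigma_init))
        bound = 1 / math.sqrt(in_features)
        nn.init.uniform_(self.weight_mu, -bound, bound)

    def forward(self, x):
        # Sample fresh noise on every forward pass; sigma is learned by gradient descent
        weight = self.weight_mu + self.weight_sigma * torch.randn_like(self.weight_sigma)
        bias = self.bias_mu + self.bias_sigma * torch.randn_like(self.bias_sigma)
        return nn.functional.linear(x, weight, bias)

layer = NoisyLinear(4, 2)
print(layer(torch.randn(3, 4)))
```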
|
||||
|
|
@ -0,0 +1,11 @@
|
|||
# Progress for Week of March 26
|
||||
|
||||
## Parallelized Evolutionary Strategies
|
||||
|
||||
When the parallel ES class is declared, I start a pool of workers that then gets sent with a loss function and its inputs to compute whenever calculating gradients.
|
||||
|
||||
## Started Exploring Count-Based Exploration
|
||||
|
||||
I started looking through papers on Exploration and am interested in using the theoretical niceness of count-based exploration in tabular settings and being able to see its effects in the non-tabular case.
|
||||
|
||||
""[Unifying Count-Based Exploration and Intrinsic Motivation](https://arxiv.org/abs/1606.01868)" creates a model of a arbitrary density model that follows a couple nice properties we would expect of probabilities. Namely, $P(S) = N(S) / n$ and $P'(S) = (N(S) + 1) / (n + 1)$. Where $N(S)$ represents the number of times you've seen that state, $n$ represents the total number of states you've seen, and $P'(S)$ represents the $P(S)$ after you have seen $S$ another time. With this model, we are able to solve for $N(S)$ and derive what the authors call a *Psuedo-Count*.
|
||||
Binary file not shown.
|
After Width: | Height: | Size: 1.1 MiB |
|
|
@ -0,0 +1,15 @@
|
|||
---
|
||||
showthedate: false
|
||||
---
|
||||
|
||||
**Name:** Brandon Rozek
|
||||
|
||||
Department of Computer Science
|
||||
|
||||
**Mentor:** Dr. Ron Zacharski
|
||||
|
||||
QEP: The Q-Value Policy Evaluation Algorithm
|
||||
|
||||
|
||||
|
||||
*Abstract.* In Reinforcement Learning, sample complexity is often one of many concerns when designing algorithms. This concern outlines the number of interactions with a given environment that an agent needs in order to effectively learn a task. The Reinforcement Learning framework consists of finding a function (the policy) that maps states/scenarios to actions while maximizing the amount of reward from the environment. For example in video games, the reward is often characterized by some score. In recent years a variety of algorithms came out falling under the categories of Value-based methods and Policy-based methods. Value-based methods create a policy by approximating how much reward an agent is expected to receive if it performs the best actions from a given state. It is then common to choose the actions that maximizes such values. Meanwhile, in Policy-based methods, the policy function produces probabilities that an agent performs each action given a state and this is then optimized for the maximum reward. As such, Value-based methods produce deterministic policies while policy-based methods produce stochastic/probabilistic policies. Empirically, Value-based methods have lower sample complexity than Policy-based methods. However, in decision making not every situation has a best action associated with it. This is mainly due to the fact that real world environments are dynamic in nature and have confounding variables affecting the result. The QEP Algorithm combines both the Policy-based methods and Value-based methods by changing the policy's optimization scheme to involve approximate value functions. We have shown that this combines the benefits of both methods so that the sample complexity is kept low while maintaining a stochastic policy.
|
||||
17
content/research/dimensionalityreduction.md
Normal file
17
content/research/dimensionalityreduction.md
Normal file
|
|
@ -0,0 +1,17 @@
|
|||
---
|
||||
Title: Dimensionality Reduction
|
||||
Description: Reducing High Dimensional Datasets to what we can digest.
|
||||
showthedate: false
|
||||
---
|
||||
|
||||
# Dimensionality Reduction
|
||||
|
||||
In the Summer of 2018, I was going to embark on another study, this time on topics pertaining to Dimensionality Reduction. Sadly, the other student became busy a few weeks into the study. I decided to upload what we got through anyway, for completeness.
|
||||
|
||||
[Syllabus](syllabus)
|
||||
|
||||
[Intro](intro)
|
||||
|
||||
[Feature Selection](featureselection)
|
||||
|
||||
[Optimality Criteria](optimalitycriteria)
|
||||
34
content/research/dimensionalityreduction/featureselection.md
Normal file
34
content/research/dimensionalityreduction/featureselection.md
Normal file
|
|
@ -0,0 +1,34 @@
|
|||
# Feature Selection
|
||||
|
||||
Feature selection is the process of selecting a subset of relevant features for use in model construction. The core idea is that data can contain many features that are redundant or irrelevant. Therefore, removing them will not result in much loss of information. We also wish to remove features that do not help with our goal.
|
||||
|
||||
Feature selection techniques are usually applied in domains where there are many features and comparatively few samples.
|
||||
|
||||
## Techniques
|
||||
|
||||
The brute force feature selection method exists to exhaustively evaluate all possible combinations of the input features, and find the best subset. The computational cost of this approach is prohibitively high and includes a considerable risk of overfitting to the data.
|
||||
|
||||
The techniques below describe greedy approaches to this problem. Greedy algorithms are ones that don't search the entire possible space but instead converge towards local maxima/minima.
|
||||
|
||||
### Wrapper Methods
|
||||
|
||||
This uses a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. The error rate of the model results in a score for that subset. This method is computationally intensive, but usually provides the best performing feature set for that particular type of model.
|
||||
|
||||
### Filter Methods
|
||||
|
||||
This method uses a proxy measure instead of the error rate. The proxy measure can be specifically chosen to be faster to compute while still capturing the essence of the feature set. Common implementations include:
|
||||
|
||||
- Mutual information
|
||||
- Pointwise mutual information
|
||||
- Pearson product-moment correlation coefficient
|
||||
- Relief-based algorithms
|
||||
- Inter/intra class distances
|
||||
- Scores of significance tests
|
||||
|
||||
Many filters provide a feature ranking rather than producing an explicit best feature subset.
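
As a small illustration of a filter method, scikit-learn can rank features by their mutual information with the target and keep the top $k$ (synthetic data; $k = 5$ is an arbitrary choice):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic classification problem with only a few informative features
X, y = make_classification(n_samples=300, n_features=20, n_informative=4, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))

X_reduced = selector.transform(X)
print(X_reduced.shape)  # (300, 5)
```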
|
||||
|
||||
### Embedded Methods
|
||||
|
||||
This is a catch-all group of techniques which perform feature selection as part of the model. For example, the LASSO linear model penalizes the regression coefficients shrinking unimportant ones to zero.
|
||||
|
||||
Stepwise regression is a commonly used feature selection technique that acts greedily by adding, at each step, the feature that most improves the model.
|
||||
30
content/research/dimensionalityreduction/intro.md
Normal file
30
content/research/dimensionalityreduction/intro.md
Normal file
|
|
@ -0,0 +1,30 @@
|
|||
# Introduction to Dimensionality Reduction
|
||||
|
||||
## Motivations
|
||||
|
||||
We all have problems to solve, but the data at our disposal may be too sparse or have so many features that it becomes computationally difficult, or maybe even impossible, to solve the problem.
|
||||
|
||||
### Types of Problems
|
||||
|
||||
**Prediction**: This is taking some input and trying to predict an output of it. An example includes having a bunch of labeled pictures of people and having the computer predict who is in the next picture taken. (Face or Object Recognition)
|
||||
|
||||
**Structure Discovery**: Find an alternative representation of the data. Usually used to find groups or alternate visualizations
|
||||
|
||||
**Density Estimation**: Finding the best model that describes the data. An example includes explaining the price of a home depending on several factors.
|
||||
|
||||
## Advantages
|
||||
|
||||
- Reduces the storage space of data (possibly removing noise in the process!)
|
||||
- Decreases complexity making algorithms run faster
|
||||
- Removes multi-collinearity which in turn likely improves the performance of a given machine learning model
|
||||
- Multi-collinearity usually indicates that multiple variables are correlated with each other. Most models make use of independent features to simplify computations. Therefore, ensuring independent features is important.
|
||||
- Data becomes easier to visualize as it can be projected into 2D/3D space
|
||||
- Lessens the chance of models *overfitting*
|
||||
- This typically happens when you have less observations compared to variables (also known as sparse data)
|
||||
- Overfitting leads to a model being able to have high accuracy on the test set, but generalize poorly to reality.
|
||||
- Curse of dimensionality does not apply in resulting dataset
|
||||
- Curse of dimensionality is that in high dimensional spaces, all points become equidistant
|
||||
|
||||
## Disadvantages
|
||||
|
||||
Data is lost through this method, potentially removing insightful information. Features from dimensionality reduction are typically harder to interpret, leading to more confusing models.
|
||||
|
|
@ -0,0 +1,99 @@
|
|||
# Optimality Criteria
|
||||
|
||||
Falling under wrapper methods, optimality criteria are often used to aid in model selection. These criteria provide a measure of fit for the data to a given hypothesis.
|
||||
|
||||
## Akaike Information Criterion (AIC)
|
||||
|
||||
AIC is an estimator of <u>relative</u> quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each other.
|
||||
|
||||
This way, AIC provides a means for model selection. AIC offers an estimate of the relative information lost when a given model is used.
|
||||
|
||||
This metric does not say anything about the absolute quality of a model but only serves for comparison between models. Therefore, if all the candidate models fit poorly to the data, AIC will not provide any warnings.
|
||||
|
||||
It is desired to pick the model with the lowest AIC.
|
||||
|
||||
AIC is formally defined as
|
||||
$$
|
||||
AIC = 2k - 2\ln{(\hat{L})}
|
||||
$$
|
||||
|
||||
|
||||
## Bayesian Information Criterion (BIC)
|
||||
|
||||
This metric is based on the likelihood function and is closely related to the Akaike information criterion. It is desired to pick the model with the lowest BIC.
|
||||
|
||||
BIC is formally defined as
|
||||
$$
|
||||
BIC = \ln{(n)}k - 2\ln{(\hat{L})}
|
||||
$$
|
||||
|
||||
Where $\hat{L}$ is the maximized value of the likelihood function for the model $M$.
|
||||
$$
|
||||
\hat{L} = p(x | \hat{\theta}, M)
|
||||
$$
|
||||
$x$ is the observed data, $n$ is the number of observations, and $k$ is the number of parameters estimated.
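The two formulas translate directly into code. The sketch below is a minimal comparison of two hypothetical models; the log-likelihoods, parameter counts, and sample size are made-up numbers used only to show the calculation.

```python
import numpy as np

def aic(log_lik, k):
    # AIC = 2k - 2 ln(L^)
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    # BIC = ln(n) k - 2 ln(L^)
    return np.log(n) * k - 2 * log_lik

# Hypothetical maximized log-likelihoods and parameter counts:
models = {"M1": (-420.3, 3), "M2": (-417.9, 6)}
n = 150   # hypothetical sample size
for name, (log_lik, k) in models.items():
    print(name, round(aic(log_lik, k), 1), round(bic(log_lik, k, n), 1))
```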
|
||||
|
||||
|
||||
|
||||
### Properties of BIC
|
||||
|
||||
- It is independent from the prior
|
||||
- It penalizes the complexity of the model in terms of the number of parameters
|
||||
|
||||
### Limitations of BIC
|
||||
|
||||
- Approximations are only valid for sample sizes much greater than the number of parameters (dense data)
|
||||
- Cannot handle collections of models in high dimension
|
||||
|
||||
### Differences from AIC
|
||||
|
||||
AIC is mostly used when comparing models. BIC asks the question of whether or not the model resembles reality. Even though the two criteria look similar, they serve different goals.
|
||||
|
||||
## Mallow's $C_p$
|
||||
|
||||
$C_p$ is used to assess the fit of a regression model that has been estimated using ordinary least squares. A small value of $C_p$ indicates that the model is relatively precise.
|
||||
|
||||
The $C_p$ of a model is defined as
|
||||
$$
|
||||
C_p = \frac{\sum_{i =1}^N{(Y_i - Y_{pi})^2}}{S^2}- N + 2P
|
||||
$$
|
||||
|
||||
- $Y_{pi}$ is the predicted value of the $i$th observation of $Y$ from the $P$ regressors
|
||||
|
||||
- $S^2$ is the residual mean square after regression on the complete set of regressors and can be estimated by mean square error $MSE$,
|
||||
|
||||
- $N$ is the sample size.
|
||||
|
||||
An alternative definition is
|
||||
|
||||
|
||||
$$
|
||||
C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)
|
||||
$$
|
||||
|
||||
- $RSS$ is the residual sum of squares
|
||||
- $d$ is the number of predictors
|
||||
- $\hat{\sigma}^2$ refers to an estimate of the variances associated with each response in the linear model
|
||||
|
||||
## Deviance Information Criterion
|
||||
|
||||
The DIC is a hierarchical modeling generalization of the AIC and BIC. It is useful in Bayesian model selection problems where the posterior distributions of the models have been <u>obtained by Markov Chain Monte Carlo simulation</u>.
|
||||
|
||||
This method is only valid if the posterior distribution is approximately multivariate normal.
|
||||
|
||||
Let us define the deviance as
|
||||
$$
|
||||
D(\theta) = -2\log{(p(y|\theta))} + C
|
||||
$$
|
||||
Where $y$ is the data and $\theta$ are the unknown parameters of the model.
|
||||
|
||||
Let us define a helper variable $p_D$ as the following
|
||||
$$
|
||||
p_D = \frac{1}{2}\hat{Var}(D(\theta))
|
||||
$$
|
||||
Finally the deviance information criterion can be calculated as
|
||||
$$
|
||||
DIC = D(\bar{\theta}) + 2p_D
|
||||
$$
|
||||
Where $\bar{\theta}$ is the expectation of $\theta$.
|
||||
|
||||
71
content/research/dimensionalityreduction/syllabus.md
Normal file
71
content/research/dimensionalityreduction/syllabus.md
Normal file
|
|
@ -0,0 +1,71 @@
|
|||
# Dimensionality Reduction Study
|
||||
|
||||
Dimensionality reduction is the process of reducing the number of random variables under consideration. This study will last for 10 weeks, meeting twice a week for about an hour.
|
||||
|
||||
## Introduction to Dimensionality Reduction (0.5 Week)
|
||||
|
||||
- Motivations for dimensionality reduction
|
||||
- Advantages of dimensionality reduction
|
||||
- Disadvantages of dimensionality reduction
|
||||
|
||||
## Feature Selection (3 Weeks)
|
||||
|
||||
This is the process of selecting a subset of relevant features. The central premise of this technique is that many features are either redundant or irrelevant and thus can be removed without incurring much loss of information.
|
||||
|
||||
### Metaheuristic Methods (1.5 Weeks)
|
||||
|
||||
- Filter Method
|
||||
- Wrapper Method
|
||||
- Embedded Method
|
||||
|
||||
### Optimality Criteria (0.5 Weeks)
|
||||
|
||||
- Bayesian Information Criterion
|
||||
- Mallow's C
|
||||
- Akaike Information Criterion
|
||||
|
||||
### Other Feature Selection Techniques (1 Week)
|
||||
|
||||
- Subset Selection
|
||||
- Minimum-Redundancy-Maximum-Relevance (mRMR) feature selection
|
||||
- Global Optimization Formulations
|
||||
- Correlation Feature Selection
|
||||
|
||||
### Applications of Metaheuristic Techniques (0.5 Weeks)
|
||||
|
||||
- Stepwise Regression
|
||||
- Branch and Bound
|
||||
|
||||
## Feature Extraction (6 Weeks)
|
||||
|
||||
Feature extraction transforms the data in high-dimensional space to a space of fewer dimensions. In other words, feature extraction involves reducing the amount of resources required to describe a large set of data.
|
||||
|
||||
### Linear Dimensionality Reduction (3 Weeks)
|
||||
|
||||
- Principal Component Analysis (PCA)
|
||||
- Singular Value Decomposition (SVD)
|
||||
- Non-Negative Matrix Factorization
|
||||
- Linear Discriminant Analysis (LDA)
|
||||
- Multidimensional Scaling (MDS)
|
||||
- Canonical Correlation Analysis (CCA) [If Time Permits]
|
||||
- Linear Independent Component Analysis [If Time Permits]
|
||||
- Factor Analysis [If Time Permits]
|
||||
|
||||
### Non-Linear Dimensionality Reduction (3 Weeks)
|
||||
|
||||
One approach to the simplification is to assume that the data of interest lie on an embedded non-linear manifold within higher-dimensional space.
|
||||
|
||||
- Kernel Principal Component Analysis
|
||||
- Nonlinear Principal Component Analysis
|
||||
- Generalized Discriminant Analysis (GDA)
|
||||
- T-Distributed Stochastic Neighbor Embedding (T-SNE)
|
||||
- Self-Organizing Map
|
||||
- Multifactor Dimensionality Reduction (MDR)
|
||||
- Isomap
|
||||
- Locally-Linear Embedding
|
||||
- Nonlinear Independent Component Analysis
|
||||
- Sammon's Mapping [If Time Permits]
|
||||
- Hessian Eigenmaps [If Time Permits]
|
||||
- Diffusion Maps [If Time Permits]
|
||||
- RankVisu [If Time Permits]
|
||||
|
||||
17
content/research/lunac.md
Normal file
17
content/research/lunac.md
Normal file
|
|
@ -0,0 +1,17 @@
|
|||
---
|
||||
Title: LUNA-C Cluster
|
||||
Description: Building a Beowulf Cluster for the University
|
||||
showthedate: false
|
||||
---
|
||||
|
||||
# LUNA-C
|
||||
|
||||
LUNA-C stands for Large Universal Network Array of Computers. It was a project started back in August 2017 to introduce a cluster computing system to the University. I started this project in response to a need for more computing power to help with the Physics research I was doing at the time.
|
||||
|
||||
After writing a grant proposal, I brought on many students to help make this dream come true. The resources below give a look into the thoughts and ideas that went along with the project.
|
||||
|
||||
[May 2018 Poster](poster.pdf)
|
||||
|
||||
[August 2018 Report](report.pdf)
|
||||
|
||||
[June 2019 Presentation](/files/slides/buildinglinuxcluster.pdf)
|
||||
BIN
content/research/lunac/LUNA-C Cluster Report.pdf
Normal file
BIN
content/research/lunac/LUNA-C Cluster Report.pdf
Normal file
Binary file not shown.
BIN
content/research/lunac/LUNACposter.pdf
Normal file
BIN
content/research/lunac/LUNACposter.pdf
Normal file
Binary file not shown.
13
content/research/physics.md
Normal file
13
content/research/physics.md
Normal file
|
|
@ -0,0 +1,13 @@
|
|||
---
|
||||
Title: Physics
|
||||
Description: Help with Physics Research Projects
|
||||
showthedate: false
|
||||
---
|
||||
|
||||
## Physics Research
|
||||
|
||||
For the two projects below, I worked on Quantum Research in a physics lab with a fellow student Hannah Killian and an advisor Dr. Hai Nguyen. I mostly provided software support for the project and helped with the mathematics wherever I could.
|
||||
|
||||
[Modeling Population Dynamics of Incoherent and Coherent Excitation](/files/research/modellingpopulationdynamics.pdf)
|
||||
|
||||
[Coherent Control of Atomic Population Using the Genetic Algorithm](/files/research/coherentcontrolofatomicpopulation.pdf)
|
||||
14
content/research/progcomp.md
Normal file
14
content/research/progcomp.md
Normal file
|
|
@ -0,0 +1,14 @@
|
|||
---
|
||||
Title: Programming Competition
|
||||
Description: Competition on Algorithms and Data Structures
|
||||
showthedate: false
|
||||
---
|
||||
|
||||
# Programming Competition
|
||||
Back in the Fall of 2018, Harrison Crosse, Clare Arrington, and I all formed a team to compete in a programming competition.
|
||||
|
||||
I didn't take many notes for this competition, but the ones I did make are below.
|
||||
|
||||
[Strings](strings)
|
||||
|
||||
[Number Theory](numbertheory)
|
||||
295
content/research/progcomp/numbertheory.md
Normal file
295
content/research/progcomp/numbertheory.md
Normal file
|
|
@ -0,0 +1,295 @@
|
|||
# Number Theory
|
||||
|
||||
## Prime Numbers
|
||||
|
||||
A *prime number* is an integer $p > 1$ which is only divisible by $1$ and itself.
|
||||
|
||||
If $p$ is a prime number, then $p = ab$ for integers $a \le b$ implies that $a = 1$ and $b = p$.
|
||||
|
||||
### Definitions
|
||||
|
||||
**Fundamental Theorem of Arithmetic**: Every integer greater than $1$ can be expressed in exactly one way, up to the order of the factors, as a product of primes.
|
||||
|
||||
**Prime factorization of $n$**: The unique set of numbers multiplying to $n$.
|
||||
|
||||
**Factor**: A prime number $p$ is a *factor* of $x$ if it appears in its prime factorization.
|
||||
|
||||
**Composite**: A number which is not prime
|
||||
|
||||
### Finding Primes
|
||||
|
||||
The easiest way to test if a given number $x$ is prime is repeated division
|
||||
|
||||
- After testing two, we only need to check odd numbers afterwards
|
||||
- We only need to check up to $\sqrt{x}$, since if $x = ab$ with $a \le b$, then $a \le \sqrt{x}$.
|
||||
|
||||
#### Considerations if you don't have nice things
|
||||
|
||||
The terminating condition of $i > \sqrt{x}$ is somewhat problematic, because `sqrt()` is a numerical function with imperfect precision.
|
||||
|
||||
To get around this, you can change the termination condition to $i^2 > x$. Though then we get into the problem of potential overflow when working on large integers.
|
||||
|
||||
The best solution if you have to deal with these issues is to compute $(i + 1)^2$ based on the result from $i^2$.
|
||||
$$
|
||||
\begin{align*}
|
||||
(i + 1)^2 &= i^2 + 2i + 1
|
||||
\end{align*}
|
||||
$$
|
||||
So just add $(2i + 1)$ to the previous result.
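As a sketch of the above, the trial-division test below tracks $i^2$ incrementally, so it needs neither `sqrt()` nor a possibly overflowing `i*i` (overflow is not an issue in Python, but the technique carries over to fixed-width integers).

```python
def is_prime(x):
    """Trial division, maintaining i^2 via (i + 1)^2 = i^2 + 2i + 1."""
    if x < 2:
        return False
    i, i_squared = 2, 4
    while i_squared <= x:          # stop once i exceeds sqrt(x)
        if x % i == 0:
            return False
        i_squared += 2 * i + 1     # advance i^2 to (i + 1)^2
        i += 1                     # (skipping even i would apply the increment twice)
    return True

print([n for n in range(2, 30) if is_prime(n)])   # [2, 3, 5, 7, 11, 13, ...]
```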
|
||||
|
||||
## Divisibility
|
||||
|
||||
A lot of number theory concerns itself with the study of integer divisibility.
|
||||
|
||||
### Definition
|
||||
|
||||
**Divides:** We say that $b$ *divides* $a$ (denoted $b|a$) if $a = bk$ for some integer $k$.
|
||||
|
||||
**Divisor:** If the above is true, then we say that $b$ is a *divisor* of $a$
|
||||
|
||||
**Multiple:** Given the above, we say that $a$ is a multiple of $b$.
|
||||
|
||||
As a consequence of this definition, the smallest natural divisor of every non-zero integer is $1$. This is also known as the *least common divisor*.
|
||||
|
||||
**Greatest Common Divisor $(gcd)$:** the *largest* divisor shared by a given pair of integers.
|
||||
|
||||
**Relatively Prime**: Two integers are *relatively prime* if their greatest common divisor is $1$.
|
||||
|
||||
**Reduced Form:** A fraction is said to be in *reduced form* if the greatest common divisor between the numerator and denominator is $1$.
|
||||
|
||||
### Properties
|
||||
|
||||
The greatest common divisor an integer has with itself is itself.
|
||||
$$
|
||||
gcd(b, b) = b \tag{1.1}
|
||||
$$
|
||||
The ordering of arguments to $gcd$ doesn't matter. Traditionally the larger value is placed in the first argument.
|
||||
$$
|
||||
gcd(a, b) = gcd(b, a) \tag{1.2}
|
||||
$$
|
||||
If $b$ divides $a$, then the greatest common divisor of $a$ and $b$ is $b$.
|
||||
$$
|
||||
b | a \implies gcd(a, b) = b \tag{1.3}
|
||||
$$
|
||||
|
||||
This in part is because of two observations
|
||||
|
||||
- $b$ is the greatest common divisor of $b$ with itself, from $(1.1)$
|
||||
- $b$ divides $a$, therefore it's a common divisor
|
||||
|
||||
The greatest common divisor between $a$ and $b$ where $a = bt + r$ is the same as the greatest common divisor between $b$ and $r$.
|
||||
$$
|
||||
a = bt + r \implies gcd(a, b) = gcd(b, r) \tag{1.4}
|
||||
$$
|
||||
Let's work this out: $gcd(a, b) = gcd(bt + r, b)$. Since $bt$ is a multiple of $b$, we can add and subtract as many $bt$'s as we want without influencing the answer. This leads to $gcd(bt + r, b) = gcd(r, b) = gcd(b, r)$.
|
||||
|
||||
### Euclid's Algorithm
|
||||
|
||||
Using $(1.4)$ we can rewrite greatest common divisor problems in terms of the property in order to simplify the expression. Take a look at the example below:
|
||||
$$
|
||||
\begin{align*}
|
||||
gcd(34398, 2132) &= gcd(34398 \text{ mod } 2132, 2132) = gcd(2132, 286) \\
gcd(2132, 286) &= gcd(2132 \text{ mod } 286, 286) = gcd(286, 130) \\
gcd(286, 130) &= gcd(286 \text{ mod } 130, 130) = gcd(130, 26) \\
gcd(130, 26) &= gcd(130 \text{ mod } 26, 26) = gcd(26, 0)
|
||||
\end{align*}
|
||||
$$
|
||||
Therefore, $gcd(34398, 2132) = 26$.
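Property $(1.4)$ gives the whole algorithm: repeatedly replace $(a, b)$ with $(b, a \text{ mod } b)$ until the remainder is zero. A minimal sketch:

```python
def gcd(a, b):
    """Euclid's algorithm: gcd(a, b) = gcd(b, a mod b) until b hits 0."""
    while b != 0:
        a, b = b, a % b
    return a

print(gcd(34398, 2132))   # 26, matching the worked example above
```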
|
||||
|
||||
### Least Common Multiple
|
||||
|
||||
#### Definition
|
||||
|
||||
The *least common multiple* $(lcm)$ is the smallest integer which is divided by both of a given pair of integers. Ex: $lcm(24, 36) = 72$.
|
||||
|
||||
#### Properties
|
||||
|
||||
The least common multiple of $x$ and $y$ is greater than or equal to both $x$ and $y$.
|
||||
$$
|
||||
lcm(x, y) \ge max(x,y) \tag{2.1}
|
||||
$$
|
||||
The least common multiple of $x$ and $y$ is less than or equal to $x$ and $y$ multiplied together.
|
||||
$$
|
||||
lcm(x, y) \le xy \tag{2.2}
|
||||
$$
|
||||
Coupled with Euclid's algorithm, we can derive the property that the least common multiple equals the product of the two integers divided by their greatest common divisor.
|
||||
$$
|
||||
lcm(x, y) = xy / gcd(x, y) \tag{2.3}
|
||||
$$
|
||||
|
||||
#### Potential Problems
|
||||
|
||||
Least common multiple arises when we want to compute the simultaneous periodicity of two distinct periodic events. When is the next year (after 2000) that the presidential election will coincide with the census?
|
||||
|
||||
These events coincide every twenty years, because $lcm(4, 10) = 20$.
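Property $(2.3)$ is also how the least common multiple is usually computed in practice. A minimal sketch, using Python's built-in `math.gcd`:

```python
from math import gcd

def lcm(x, y):
    return x * y // gcd(x, y)   # lcm(x, y) = xy / gcd(x, y)

print(lcm(24, 36))   # 72
print(lcm(4, 10))    # 20 -- the election/census example above
```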
|
||||
|
||||
## Modular Arithmetic
|
||||
|
||||
We are not always interested in full answers, however. Sometimes the remainder suffices for our purposes.
|
||||
|
||||
<u>Example:</u> Suppose your birthday this year falls on a Wednesday. What day of the week will it fall on next year?
|
||||
|
||||
Take the number of days between now and then (365 or 366) mod the number of days in a week: $365$ mod $7 = 1$, which means that your birthday will fall on a Thursday.
|
||||
|
||||
### Operations of Modular Arithmetic
|
||||
|
||||
**Addition**: $(x + y)$ mod $n$ $=$ $((x $ mod $n) + (y$ mod $n))$ mod $n$
|
||||
|
||||
<u>Example:</u> How much small change will I have if given \$123.45 by my mother and \$94.67 by my father?
|
||||
$$
|
||||
\begin{align*}
|
||||
(12345 + 9467) \text{ mod } 100 &= (45 + 67) \text{ mod } 100 \\
&= 112 \text{ mod } 100 = 12
|
||||
\end{align*}
|
||||
$$
|
||||
**Subtraction** (Essentially addition with negatives):
|
||||
|
||||
<u>Example:</u> Based on the previous example, how much small change will I have after spending \$52.53?
|
||||
$$
|
||||
(12 \text{ mod } 100) - (53 \text{ mod } 100) = -41 \text{ mod } 100 = 59 \text{ mod } 100
|
||||
$$
|
||||
Notice the flip in sign; this can be generalized into the following form:
|
||||
$$
|
||||
(-x) \text{ mod } y = (y - x) \text{ mod } y
|
||||
$$
|
||||
**Multiplication** (Otherwise known as repeated addition):
|
||||
$$
|
||||
xy \text{ mod } n = (x \text{ mod } n)(y \text{ mod } n) \text{ mod n}
|
||||
$$
|
||||
<u>Example:</u> How much change will you have if you earn \$17.28 per hour for 2,143 hours?
|
||||
$$
|
||||
\begin{align*}
|
||||
(1728 * 2143) \text{ mod } 100 &= ((28 \text{ mod } 100)(43 \text{ mod } 100)) \text{ mod } 100 \\
&= 1204 \text{ mod } 100 = 4
|
||||
\end{align*}
|
||||
$$
|
||||
**Exponentiation** (Otherwise known as repeated multiplication):
|
||||
$$
|
||||
x^y \text{ mod } n =(x \text{ mod n})^y \text{ mod } n
|
||||
$$
|
||||
<u>Example:</u> What is the last digit of $2^{100}$?
|
||||
$$
|
||||
\begin{align*}
|
||||
2^3 \text{ mod } 10 &= 8 \\
|
||||
2^6 \text{ mod } 10 &= 8(8) \text{ mod } 10 \rightarrow 4 \\
|
||||
2^{12} \text{ mod } 10 &= 4(4) \text{ mod } 10 \rightarrow 6 \\
|
||||
2^{24} \text{ mod } 10 &= 6(6) \text{ mod } 10 \rightarrow 6 \\
|
||||
2^{48} \text{ mod } 10 &= 6(6) \text{ mod } 10 \rightarrow 6 \\
|
||||
2^{96} \text{ mod } 10 &= 6(6) \text{ mod } 10 \rightarrow 6\\
|
||||
2^{100} \text{ mod } 10 &= 2^{96}(2^3)(2^1) \text{ mod } 10 \\
|
||||
&= 6(8)(2) \text{ mod } 10 \rightarrow 6
|
||||
\end{align*}
|
||||
$$
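The $2^{100}$ calculation above is repeated squaring done by hand. A minimal sketch of the same idea (Python's built-in `pow(x, y, n)` already does this):

```python
def mod_pow(x, y, n):
    """Compute x^y mod n by repeated squaring."""
    result = 1
    x %= n
    while y > 0:
        if y & 1:                   # if the current bit of the exponent is set
            result = (result * x) % n
        x = (x * x) % n             # square the base
        y >>= 1                     # move to the next bit
    return result

print(mod_pow(2, 100, 10))   # 6 -- the last digit of 2^100
print(pow(2, 100, 10))       # same answer from the built-in
```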
|
||||
|
||||
## Linear Congruences
|
||||
|
||||
**Definition:** $ a \equiv b \text{ (mod m)} \iff m |(a-b)$
|
||||
|
||||
### Properties
|
||||
|
||||
If $a$ mod $m$ is $b$, then $a$ is linearly congruent to $b$ in mod $m$.
|
||||
$$
|
||||
a \text{ mod } m = b \implies a \equiv b \text{ (mod m)} \tag{3.1}
|
||||
$$
|
||||
Let us show that from what we know so far...
|
||||
$$
|
||||
\begin{align*}
|
||||
a \text{ mod m} = b &\implies a = mk+b \hspace{.2in} \text{(for some $k \in \mathbb{Z}$)} \\
|
||||
&\implies a - b = mk + b - b \\
|
||||
&\iff a-b = m k \\
|
||||
&\iff m | (a - b) \\
|
||||
&\iff a \equiv b \hspace{.2in} \text{ (mod m)}
|
||||
\end{align*}
|
||||
$$
|
||||
**Example:** What set of integers satisfy the following congruence?
|
||||
$$
|
||||
x \equiv 3 \text{ mod } 9
|
||||
$$
|
||||
**Scratch work:**
|
||||
$$
|
||||
\begin{align*}
|
||||
x \equiv 3 \text{ mod } 9 &\iff 9 | (x - 3) \\
|
||||
&\iff x - 3 = 9k \hspace{.2in} \text{ (for some $k \in \mathbb{Z}$)} \\
|
||||
&\iff x = 9k + 3
|
||||
\end{align*}
|
||||
$$
|
||||
|
||||
### Operations on congruences
|
||||
|
||||
#### Addition/Subtraction
|
||||
|
||||
$$
|
||||
a \equiv b \text{ (mod n) and } c \equiv d \text{ (mod n) } \implies a + c \equiv b + d \text{ (mod n)} \tag{3.2}
|
||||
$$
|
||||
|
||||
*Proof.*
|
||||
$$
|
||||
\begin{align*}
|
||||
a \equiv b \text{ (mod n) } &\implies n | (a - b) \\
|
||||
&\implies a - b = nk_1 \hspace{.2in} \text{ (for some $k_1 \in \mathbb{Z}$)} \\
|
||||
&\implies a = nk_1 + b\\
|
||||
c \equiv d \text{ (mod n) } &\implies n | (c - d) \\
|
||||
&\implies c - d = nk_2 \hspace{.2in} \text{ (for some $k_2 \in \mathbb{Z}$)} \\
|
||||
&\implies c = nk_2 + d \\
|
||||
a + c &= nk_1 + b + nk_2 + d \\
|
||||
&= n(k_1 + k_2) + (b + d) \\
|
||||
&=nk + (b + d) \hspace{.2in} \text{(for some $k \in \mathbb{Z}$)} \\
|
||||
&\implies (a + c) \text{ mod } n = b + d \\
|
||||
&\implies a+c \equiv b + d \text{ (mod n)}
|
||||
\end{align*}
|
||||
$$
|
||||
|
||||
#### Multiplication
|
||||
|
||||
$$
|
||||
a \equiv b \text{ (mod n) } \implies ad \equiv bd \text{ (mod n)} \tag{3.3}
|
||||
$$
|
||||
|
||||
*Proof.*
|
||||
$$
|
||||
\begin{align*}
|
||||
a \equiv b \text{ (mod n) } &\implies n | (a - b) \\
|
||||
&\implies a - b = nk_1 \hspace{.2in} \text{ (for some $k_1 \in \mathbb{Z}$)} \\
|
||||
&\implies d(a - b) = d(nk_1) \\
|
||||
&\implies da-db = n(dk_1) \\
|
||||
&\implies n | (da-db) \\
|
||||
&\implies da \equiv db \text{ (mod n)}
|
||||
\end{align*}
|
||||
$$
|
||||
Next Theorem
|
||||
$$
|
||||
a \equiv b \text{ (mod n) and } c \equiv d \text{ (mod n) } \implies ac \equiv bd \text{ (mod n) }
|
||||
$$
|
||||
*Proof.*
|
||||
$$
|
||||
\begin{align*}
|
||||
a \equiv b \text{ (mod n) } &\implies a = nk_1 + b \hspace{.2in} \text{ (for some $k_1 \in \mathbb{Z}$)} \\
|
||||
c \equiv d \text{ (mod n)} &\implies c = nk_2 + d \hspace{.2in} \text{ (for some $k_2 \in \mathbb{Z}$)} \\
|
||||
a * c &= (nk_1 + b)(nk_2+d) \\
|
||||
&=n^2k_1k_2 + bnk_2+dnk_1 +bd \\
|
||||
&= n(nk_1k_2+bk_2+dk_1) + bd \\
|
||||
&\implies ac \text{ mod } n = bd \\
|
||||
ac &\equiv bd \text{ (mod n)}
|
||||
\end{align*}
|
||||
$$
|
||||
|
||||
#### Division
|
||||
|
||||
Let us define division as so
|
||||
$$
|
||||
bb^{-1} \equiv 1 \text{ (mod n)} \\
|
||||
$$
|
||||
Theorem.
|
||||
$$
|
||||
ad \equiv bd \text{ (mod dn) } \iff a \equiv b \text{ (mod n) }
|
||||
$$
|
||||
*Proof.*
|
||||
$$
|
||||
\begin{align*}
|
||||
ad \equiv bd \text{ (mod dn) } &\implies ad - bd = dnk \\
|
||||
&\implies d(a-b) = d(nk) \\
|
||||
&\implies a-b = nk \\
|
||||
&\implies n | (a-b) \\
|
||||
&\implies a \equiv b \text{ (mod n)}
|
||||
\end{align*}
|
||||
$$
|
||||
15
content/research/progcomp/strings.md
Normal file
15
content/research/progcomp/strings.md
Normal file
|
|
@ -0,0 +1,15 @@
|
|||
# Strings
|
||||
|
||||
## Character Codes
|
||||
|
||||
Character codes are mappings between numbers and symbols which make up a particular alphabet.
|
||||
|
||||
The *American Standard Code for Information Interchange* (ASCII) is a single-byte character code where $2^7 = 128$ characters are specified.
|
||||
|
||||
Symbol assignments were not done at random. Several interesting properties of the design make programming tasks easier:
|
||||
|
||||
- All non-printable characters have either the first three bits as zero or all seven lowest bits as one. This makes it easy to eliminate them before displaying junk.
|
||||
- Both the upper and lower case letters and the numerical digits appear sequentially
|
||||
- We can get the numeric order of a letter by subtracting the first letter
|
||||
- We can convert a character from uppercase to lowercase by $Letter - "A" + "a"$
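A minimal sketch of these properties using Python's `ord()`/`chr()`; the helper names are made up for illustration.

```python
def digit_value(ch):
    return ord(ch) - ord('0')       # digits appear sequentially

def letter_index(ch):
    return ord(ch) - ord('A')       # numeric order of an uppercase letter

def to_lowercase(ch):
    return chr(ord(ch) - ord('A') + ord('a'))   # Letter - "A" + "a"

print(digit_value('7'), letter_index('D'), to_lowercase('G'))   # 7 3 g
```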
|
||||
|
||||
12
content/research/proglang.md
Normal file
12
content/research/proglang.md
Normal file
|
|
@ -0,0 +1,12 @@
|
|||
---
|
||||
Title: Programming Languages
|
||||
Description: Designing and Implementing Programming Languages
|
||||
showthedate: false
|
||||
---
|
||||
|
||||
# Programming Languages
|
||||
Back in the Fall of 2018, under the guidance of Ian Finlayson, I worked towards creating a programming language similar to SLOTH (Simple Language of Tiny Heft). The language as of now is interpreted and supports array based programming.
|
||||
|
||||
[Github repository](https://github.com/brandon-rozek/sloth) outlining my work.
|
||||
|
||||
[Short Notes](types) on Types of Programming Languages
|
||||
243
content/research/proglang/types.md
Normal file
243
content/research/proglang/types.md
Normal file
|
|
@ -0,0 +1,243 @@
|
|||
# Types of Programming Languages
|
||||
|
||||
Mostly content from Wikipedia
|
||||
|
||||
https://en.wikipedia.org/wiki/List_of_programming_languages_by_type#Hardware_description_languages
|
||||
|
||||
## Array Languages
|
||||
|
||||
Array programming languages generalize operations on scalars to apply transparently to vectors, matrices, and higher-dimensional arrays.
|
||||
|
||||
Example Language: R
|
||||
|
||||
In R everything is by default a vector, typing in
|
||||
|
||||
```R
|
||||
x <- 5
|
||||
```
|
||||
|
||||
Will create a vector of length 1 with the value 5.
|
||||
|
||||
This is commonly used by the scientific and engineering community. Other languages include MATLAB, Octave, Julia, and the NumPy extension to Python.
|
||||
|
||||
## Constraint Programming Languages
|
||||
|
||||
A declarative programming language where relationships between variables are expressed as constraints. Execution proceeds by attempting to find values for the variables that satisfy all declared constraints.
|
||||
|
||||
Example: Microsoft Z3 Theorem Prover
|
||||
|
||||
```z3
|
||||
(declare-const a Int)
|
||||
(declare-fun f (Int Bool) Int)
|
||||
(assert (> a 10))
|
||||
(assert (< (f a true) 100))
|
||||
(check-sat)
|
||||
```
|
||||
|
||||
## Command Line Interface Languages
|
||||
|
||||
Commonly used to control jobs or processes
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
echo "hello, $USER. I wish to list some files of yours"
|
||||
echo "listing files in the current directory, $PWD"
|
||||
ls # list files
|
||||
```
|
||||
|
||||
## Concurrent Languages
|
||||
|
||||
Languages that support language constructs for concurrency. These may involve multi-threading, distributed computing, message passing, shared resources, and/or future and promises.
|
||||
|
||||
Example: Erlang
|
||||
|
||||
```erlang
|
||||
ping(0, Pong_PID) ->
|
||||
Pong_PID ! finished,
|
||||
io:format("ping finished~n", []);
|
||||
|
||||
ping(N, Pong_PID) ->
|
||||
Pong_PID ! {ping, self()},
|
||||
receive
|
||||
pong ->
|
||||
io:format("Ping received pong~n", [])
|
||||
end,
|
||||
ping(N - 1, Pong_PID).
|
||||
|
||||
pong() ->
|
||||
receive
|
||||
finished ->
|
||||
io:format("Pong finished~n", []);
|
||||
{ping, Ping_PID} ->
|
||||
io:format("Pong received ping~n", []),
|
||||
Ping_PID ! pong,
|
||||
pong()
|
||||
end.
|
||||
|
||||
start() ->
|
||||
Pong_PID = spawn(tut15, pong, []),
|
||||
spawn(tut15, ping, [3, Pong_PID]).
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Data-oriented Languages
|
||||
|
||||
These languages provide powerful ways of searching and manipulating relations.
|
||||
|
||||
Example: SQL
|
||||
|
||||
```sql
|
||||
SELECT * FROM STATION
|
||||
WHERE 50 < (SELECT AVG(TEMP_F) FROM STATION);
|
||||
```
|
||||
|
||||
Give all the information from stations whose average temperature is above 50 degrees F
|
||||
|
||||
## Declarative Languages
|
||||
|
||||
Declarative languages express the logic of a computation without describing its control flow in detail
|
||||
|
||||
Example: SQL again
|
||||
|
||||
The example code above doesn't tell the computer how to perform the query, just what you want.
|
||||
|
||||
## Functional Languages
|
||||
|
||||
Style of computer programs that treat computation as the evaluation of mathematical functions and avoid changing-state and having mutable data.
|
||||
|
||||
```haskell
|
||||
primes = filterPrime [2..]
|
||||
where filterPrime (p:xs) =
|
||||
p : filterPrime [x | x <- xs, x `mod` p /= 0]
|
||||
```
|
||||
|
||||
## Imperative Languages
|
||||
|
||||
The use of statements to change a program's state
|
||||
|
||||
Example: C
|
||||
|
||||
```c
|
||||
#define FOOSIZE 10
|
||||
struct foo fooarr[FOOSIZE];
|
||||
|
||||
for (int i = 0; i < FOOSIZE; i++)
|
||||
{
|
||||
do_something(fooarr[i].data);
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Iterative Languages
|
||||
|
||||
Languages built around or offering generators
|
||||
|
||||
Example: Python
|
||||
|
||||
```python
|
||||
def countfrom(n):
|
||||
while True:
|
||||
yield n
|
||||
n += 1
|
||||
|
||||
g = countfrom(0)
|
||||
next(g) # 0
|
||||
next(g) # 1
|
||||
```
|
||||
|
||||
## List-based Languages -LISPs
|
||||
|
||||
Family of programming languages with a history of fully parenthesized prefix notation.
|
||||
|
||||
Example: Common Lisp
|
||||
|
||||
```commonlisp
|
||||
;; Sorts the list using the > and < function as the relational operator.
|
||||
(sort (list 5 2 6 3 1 4) #'>) ; Returns (6 5 4 3 2 1)
|
||||
(sort (list 5 2 6 3 1 4) #'<) ; Returns (1 2 3 4 5 6)
|
||||
```
|
||||
|
||||
## Logic Languages
|
||||
|
||||
Programming paradigm based on formal logic. Expressions are written stating the facts and rules about some problem domain.
|
||||
|
||||
Example: Prolog
|
||||
|
||||
```prolog
|
||||
mother_child(trude, sally).
|
||||
|
||||
father_child(tom, sally).
|
||||
father_child(tom, erica).
|
||||
father_child(mike, tom).
|
||||
|
||||
sibling(X, Y) :- parent_child(Z, X), parent_child(Z, Y).
|
||||
|
||||
parent_child(X, Y) :- father_child(X, Y).
|
||||
parent_child(X, Y) :- mother_child(X, Y).
|
||||
```
|
||||
|
||||
```prolog
|
||||
?- sibling(sally, erica).
|
||||
Yes
|
||||
```
|
||||
|
||||
## Symbolic Programming
|
||||
|
||||
A programming paradigm in which the program can manipulate its own formulas and program components as if they were plain data.
|
||||
|
||||
Example: Clojure
|
||||
|
||||
Clojure is written in prefix notation, but with this macro, you can write in infix notation.
|
||||
|
||||
```clojure
|
||||
(defmacro infix
|
||||
[infixed]
|
||||
(list (second infixed)
|
||||
(first infixed)
|
||||
(last infixed)))
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Probabilistic Programming Language
|
||||
|
||||
Adds the ability to describe probabilistic models and then perform inference in those models.
|
||||
|
||||
Example: Stan
|
||||
|
||||
```stan
|
||||
model {
|
||||
real mu;
|
||||
|
||||
# priors:
|
||||
L_u ~ lkj_corr_cholesky(2.0);
|
||||
L_w ~ lkj_corr_cholesky(2.0);
|
||||
to_vector(z_u) ~ normal(0,1);
|
||||
to_vector(z_w) ~ normal(0,1);
|
||||
|
||||
for (i in 1:N){
|
||||
mu = beta[1] + u[subj[i],1] + w[item[i],1]
|
||||
+ (beta[2] + u[subj[i],2] + w[item[i],2])*so[i];
|
||||
rt[i] ~ lognormal(mu,sigma_e); // likelihood
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
||||
|
||||
## Quick Summary of Programming Paradigms
|
||||
|
||||
Imperative
|
||||
|
||||
- Object Oriented
|
||||
- Procedural
|
||||
|
||||
Declarative
|
||||
|
||||
- Functional
|
||||
- Logic
|
||||
|
||||
Symbolic Programming
|
||||
|
||||
22
content/research/publications.md
Normal file
22
content/research/publications.md
Normal file
|
|
@ -0,0 +1,22 @@
|
|||
---
|
||||
Title: Publications
|
||||
Descriptions: Papers, Presentations, and Grants obtained
|
||||
ShowDatesOnPosts: false
|
||||
---
|
||||
|
||||
# Publications
|
||||
|
||||
Brandon Rozek. "QEP: The Quality Policy Evaluation Algorithm", Research and Creativity Day at University of Mary Washington, 2019.
|
||||
|
||||
Brandon Rozek, Stefano Coronado. "Beowulf Cluster for Research and Education", Research and Creativity Day at University of Mary Washington, 2018.
|
||||
|
||||
Brandon Rozek. "Coherent Control of Atomic Population Using the Genetic Algorithm", Summer Science Institute Research Symposium, 2017.
|
||||
|
||||
Hannah Killian, Brandon Rozek. "Modelling Population Dynamics of Incoherent and Coherent Excitation", Virginia Academy of Science, 2017.
|
||||
|
||||
Hannah Killian, Brandon Rozek. “Modeling Population Dynamics of Incoherent and Coherent Excitation", Research and Creativity Day at University of Mary Washington, 2017.
|
||||
|
||||
|
||||
# Grants
|
||||
|
||||
Julia Arrington, Maia Magrakvilidze, Ethan Ramirez, Brandon Rozek "Beowulf Cluster for Research and Education", University of Mary Washington, 2017-2018 School Year
|
||||
45
content/research/reinforcementlearning.md
Normal file
45
content/research/reinforcementlearning.md
Normal file
|
|
@ -0,0 +1,45 @@
|
|||
---
|
||||
Title: Reinforcement Learning
|
||||
Description: The study of optimally mapping situations to actions
|
||||
---
|
||||
|
||||
# Reinforcement Learning
|
||||
Reinforcement learning is the art of analyzing situations and mapping them to actions in order to maximize a numerical reward signal.
|
||||
|
||||
In this independent study, Dr. Stephen Davies and I will explore the Reinforcement Learning problem and its subproblems. We will go over the bandit problem and Markov decision processes, and discover how best to translate a problem in order to **make decisions**.
|
||||
|
||||
I have provided a list of topics that I wish to explore in a [syllabus](syllabus)
|
||||
|
||||
## Readings
|
||||
|
||||
In order to spend more time learning, I decided to follow a textbook this time.
|
||||
|
||||
Reinforcement Learning: An Introduction
|
||||
|
||||
By Richard S. Sutton and Andrew G. Barto
|
||||
|
||||
|
||||
[Reading Schedule](readings)
|
||||
|
||||
|
||||
## Notes
|
||||
|
||||
The notes for this course are going to be an extremely summarized version of the textbook. There will also be notes on whatever side tangents Dr. Davies and I explore.
|
||||
|
||||
[Notes page](notes)
|
||||
|
||||
I wrote a small, quirky/funny report describing the bandit problem. It is great for learning about the common considerations in Reinforcement Learning problems.
|
||||
|
||||
[The Bandit Report](/files/research/TheBanditReport.pdf)
|
||||
|
||||
## Code
|
||||
|
||||
Code will occasionally be written to solidify the learning material and to act as aids for more exploration.
|
||||
|
||||
[Github Link](https://github.com/brandon-rozek/ReinforcementLearning)
|
||||
|
||||
Specifically, if you want to see agents I've created to solve some OpenAI environments, take a look at this specific folder in the Github Repository
|
||||
|
||||
[Github Link](https://github.com/Brandon-Rozek/ReinforcementLearning/tree/master/agents)
|
||||
|
||||
|
||||
12
content/research/reinforcementlearning/notes.md
Normal file
12
content/research/reinforcementlearning/notes.md
Normal file
|
|
@ -0,0 +1,12 @@
|
|||
# Lecture Notes for Reinforcement Learning
|
||||
|
||||
[Chapter 1: An Introduction](intro)
|
||||
|
||||
[Chapter 2: Multi-armed Bandits](bandits)
|
||||
|
||||
[Chapter 3: Markov Decision Processes](mdp)
|
||||
|
||||
[Chapter 4: Dynamic Programming](dynamic)
|
||||
|
||||
[Chapter 5: Monte Carlo Methods](mcmethods)
|
||||
|
||||
144
content/research/reinforcementlearning/notes/bandits.md
Normal file
144
content/research/reinforcementlearning/notes/bandits.md
Normal file
|
|
@ -0,0 +1,144 @@
|
|||
# Chapter 2: Multi-armed Bandits
|
||||
|
||||
Reinforcement learning *evaluates* the actions taken rather than accepting *instructions* of the correct actions. This creates the need for active exploration.
|
||||
|
||||
This chapter of the book goes over a simplified version of the reinforcement learning problem, that does not involve learning to act in more than one situation. This is called a *nonassociative* setting.
|
||||
|
||||
In summation, the type of problem we are about to study is a nonassociative, evaluative-feedback problem: the $k$-armed bandit problem, a simplified version of the full reinforcement learning problem.
|
||||
|
||||
## $K$-armed bandit problem
|
||||
|
||||
Consider the following learning problem. You are faced repeatedly with a choice among $k$ different options/actions. After each choice you receive a numerical reward chosen from a stationary probability distribution that depends on the action you selected.
|
||||
|
||||
Your objective (if you choose to accept it) is to maximize the expected total reward over some time period. Let's say $1000$ time steps.
|
||||
|
||||
### Analogy
|
||||
|
||||
This is called the $k$-armed bandit problem because it's an analogy of a slot machine. Slot machines are nick-named the "one-armed bandit", and the goal here is to play the slot machine that has the greatest value return.
|
||||
|
||||
### Sub-goal
|
||||
|
||||
We want to figure out which slot machine produces the greatest value. Therefore, we want to be able to estimate the value of a slot machine as close to the actual value as possible.
|
||||
|
||||
### Exploration vs Exploitation
|
||||
|
||||
If we maintain estimates of the action values, then at any time step there is at least one action whose estimated value is the greatest. We call these *greedy* actions. When you select one of these actions we say that you are *exploiting* your current knowledge of the values of the actions.
|
||||
|
||||
If you instead select a non-greedy action, then you are *exploring*, because this enables you to better improve your estimate of the non-greedy action's value.
|
||||
|
||||
Uncertainty is such that at least one of the other actions is probably better than the greedy action; you just don't know which one yet.
|
||||
|
||||
## Action-value Methods
|
||||
|
||||
In this section, we will look at simple balancing methods in how to gain the greatest reward through exploration and exploitation.
|
||||
|
||||
We begin by looking more closely at some simple methods for estimating the values of actions and for using said estimates to make action selection decisions.
|
||||
|
||||
### Sample-average method
|
||||
|
||||
One natural way to estimate this is by averaging the rewards actually received
|
||||
$$
|
||||
Q_t(a) = \frac{\sum_{i = 1}^{t - 1}R_i \cdot \mathbb{I}_{A_i = a}}{\sum_{i = 1}^{t - 1}\mathbb{I}_{A_i = a}}
|
||||
$$
|
||||
where $\mathbb{I}_{predicate}$ denotes the random variable that is 1 if the predicate is true and 0 if it is not. If the denominator is zero (we have not experienced the reward), then we assume some default value such as zero.
|
||||
|
||||
### Greedy action selection
|
||||
|
||||
This is where you choose greedily all the time.
|
||||
$$
|
||||
A_t = argmax_a(Q_t(a))
|
||||
$$
|
||||
|
||||
### $\epsilon$-greedy action selection
|
||||
|
||||
This is where we choose greedily most of the time, except for a small probability $\epsilon$. Where instead of selecting greedily, we select randomly from among all the actions with equal probability.
|
||||
|
||||
### Comparison of greedy and $\epsilon$-greedy methods
|
||||
|
||||
The advantage of $\epsilon$-greedy over greedy methods depends on the task. With noisier rewards it takes more exploration to find the optimal action, and $\epsilon$-greedy methods should fare better relative to the greedy method. However, if the reward variances were zero, then the greedy method would know the true value of each action after trying it once.
|
||||
|
||||
Suppose the bandit task were non-stationary, that is, the true values of actions changed over time. In this case exploration is needed to make sure one of the non-greedy actions has not changed to become better than the greedy one.
|
||||
|
||||
### Incremental Implementation
|
||||
|
||||
There is a way to update averages using small, constant computations rather than storing the numerators and denominators separately.
|
||||
|
||||
Note the derivation for the update formula
|
||||
$$
|
||||
\begin{align}
|
||||
Q_{n + 1} &= \frac{1}{n}\sum_{i = 1}^n{R_i} \\
|
||||
&= \frac{1}{n}(R_n + \sum_{i = 1}^{n - 1}{R_i}) \\
|
||||
&= \frac{1}{n}(R_n + (n - 1)\frac{1}{n-1}\sum_{i = 1}^{n - 1}{R_i}) \\
|
||||
&= \frac{1}{n}(R_n + (n - 1)Q_n) \\
|
||||
&= \frac{1}{n}(R_n + nQ_n - Q_n) \\
|
||||
&= Q_n + \frac{1}{n}(R_n - Q_n) \tag{2.3}
|
||||
\end{align}
|
||||
$$
|
||||
With formula 2.3, the implementation requires memory of only $Q_n$ and $n$.
|
||||
|
||||
This update rule is a form that occurs frequently throughout the book. The general form is
|
||||
$$
|
||||
NewEstimate = OldEstimate + StepSize(Target - OldEstimate)
|
||||
$$
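A minimal sketch of an $\epsilon$-greedy agent using the incremental sample-average update (2.3). The true action values and Gaussian reward noise are assumptions made only so there is something to run against.

```python
import random

def run_bandit(true_values, steps=1000, epsilon=0.1):
    k = len(true_values)
    Q = [0.0] * k                   # action-value estimates
    N = [0] * k                     # action counts
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                    # explore
        else:
            a = max(range(k), key=lambda i: Q[i])      # exploit (greedy)
        reward = random.gauss(true_values[a], 1.0)     # assumed reward model
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]   # NewEstimate = Old + (1/n)(Target - Old)
    return Q

print(run_bandit([0.2, -0.8, 1.5, 0.3], steps=2000))
```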
|
||||
|
||||
### Tracking a Nonstationary Problem
|
||||
|
||||
As noted earlier, we often encounter problems that are nonstationary, in such cases it makes sense to give more weight to recent rewards than to long-past rewards. One of the most popular ways to do this is to use a constant value for the $StepSize$ parameter. We modify formula 2.3 to be
|
||||
$$
|
||||
\begin{align}
|
||||
Q_{n + 1} &= Q_n + \alpha(R_n - Q_n) \\
|
||||
&= \alpha R_n + Q_n - \alpha Q_n \\
|
||||
&= \alpha R_n + (1 - \alpha)Q_n \\
|
||||
&= \alpha R_n + (1 - \alpha)(\alpha R_{n - 1} + (1-\alpha)Q_{n - 1}) \\
|
||||
&= \alpha R_n + (1 - \alpha)(\alpha R_{n - 1} + (1-\alpha)(\alpha R_{n - 2} + (1 - a)Q_{n - 2})) \\
|
||||
&= \alpha R_n + (1-\alpha)\alpha R_{n - 1} + (1-\alpha)^2\alpha R_{n - 2} + \dots + (1-\alpha)^nQ_1 \\
|
||||
&= (1-\alpha)^nQ_1 + \sum_{i = 1}^n{\alpha(1-\alpha)^{n - i}R_i}
|
||||
\end{align}
|
||||
$$
|
||||
This is a weighted average since the summation of all the weights equal one. Note here that the farther away a value is from the current time, the more times $(1-\alpha)$ gets multiplied by itself. Hence making it less influential. This is sometimes called an *exponential recency-weighted average*.
|
||||
|
||||
### Manipulating $\alpha_n(a)$
|
||||
|
||||
Sometimes it is convenient to vary the step-size parameter from step to step. We can denote $\alpha_n(a)$ to be a function that determines the step-size parameter after the $n$th selection of action $a$. As noted before, $\alpha_n(a) = \frac{1}{n}$ results in the sample-average method, which is guaranteed to converge to the true action values given a large number of trials.
|
||||
|
||||
A well known result in stochastic approximation theory gives us the following conditions to assure convergence with probability 1:
|
||||
$$
|
||||
\sum_{n = 1}^\infty{\alpha_n(a)} = \infty \quad \text{and} \quad \sum_{n = 1}^{\infty}{\alpha_n^2(a)} \lt \infty
|
||||
$$
|
||||
The first condition is required to guarantee that the steps are large enough to overcome any initial conditions or random fluctuations. The second condition guarantees that eventually the steps become small enough to assure convergence.
|
||||
|
||||
**Note:** Both convergence conditions are met for the sample-average case but not for the constant step-size parameter; the second condition is violated when the step size is constant. This is desirable, since if the rewards are changing we don't want the estimates to converge to any single value.
|
||||
|
||||
### Optimistic Initial Values
|
||||
|
||||
The methods discussed so far are biased by their initial estimates. Another downside is that these values are another set of parameters that must be chosen by the user. Though these initial values can be used as a simple way to encourage exploration.
|
||||
|
||||
Let's say you set an initial estimate that is wildly optimistic. Whichever actions are initially selected, the reward is less than the starting estimates. Therefore, the learner switches to other actions, being *disappointed* with the rewards it was receiving.
|
||||
|
||||
The result of this is that all actions are tried several times before their values converge. It even does this if the algorithm is set to choose greedily most of the time!
|
||||
|
||||

|
||||
|
||||
This simple trick is quite effective for stationary problems. Not so much for nonstationary problems since the drive for exploration only happens at the beginning. If the task changes, creating a renewed need for exploration, this method would not catch it.
|
||||
|
||||
### Upper-Confidence-Bound Action Selection
|
||||
|
||||
Exploration is needed because there is always uncertainty about the accuracy of the action-value estimates. The greedy actions are those that look best at the present but some other options may actually be better. Let's choose options that have potential for being optimal, taking into account how close their estimates are to being maximal and the uncertainties in those estimates.
|
||||
$$
|
||||
A_t = argmax_a{(Q_t(a) + c\sqrt{\frac{ln(t)}{N_t(a)}})}
|
||||
$$
|
||||
where $N_t(a)$ denotes the number of times that $a$ has been selected prior to time $t$ and $c > 0$ controls the degree of exploration.
|
||||
|
||||
### Associative Search (Contextual Bandits)
|
||||
|
||||
So far, we've only considered nonassociative tasks, where there is no need to associate different actions with different situations. However, in a general reinforcement learning task there is more than one situation and the goal is to learn a policy: a mapping from situations to the actions that are best in those situations.
|
||||
|
||||
For the sake of continuing our example, let us suppose that there are several different $k$-armed bandit tasks, and that on each step you confront one of these chosen at random. To you, this would appear as a single, nonstationary $k$-armed bandit task whose true action values change randomly from step to step. You could try using one of the previous methods, but unless the true action values change slowly, these methods will not work very well.
|
||||
|
||||
Now suppose that when a bandit task is selected for you, you are given some clue about its identity. You can then learn a policy associating each task, signaled by the clue, with the best action to take when facing that task.
|
||||
|
||||
This is an example of an *associative search* task, so called because it involves both trial-and-error learning to *search* for the best actions, and *association* of these actions with situations in which they are best. Nowadays they're called *contextual bandits* in literature.
|
||||
|
||||
If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem. This will be presented in the next chapter of the book with its ramifications appearing throughout the rest of the book.
|
||||
|
||||

|
||||
130
content/research/reinforcementlearning/notes/dynamic.md
Normal file
130
content/research/reinforcementlearning/notes/dynamic.md
Normal file
|
|
@ -0,0 +1,130 @@
|
|||
# Chapter 4: Dynamic Programming
|
||||
|
||||
Dynamic programming refers to a collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as a Markov decision process (MDP).
|
||||
|
||||
Classic DP algorithms are of limited utility due to their assumption of a perfect model and their great computational expense.
|
||||
|
||||
Let's assume that the environment is a finite MDP. We assume its state, action, and reward sets, $\mathcal{S}, \mathcal{A}, \mathcal{R}$ are finite, and that its dynamics are given by a set of probabilities $p(s^\prime, r | s , a)$.
|
||||
|
||||
The key idea of dynamic programming, and of reinforcement learning is the use of value functions to organize and structure the search for good policies. In this chapter, we show how dynamic programming can be used to compute the value functions defined in Chapter 3. We can easily obtain optimal policies once we have found the optimal value functions which satisfy the Bellman optimality equations.
|
||||
|
||||
## Policy Evaluation
|
||||
|
||||
First we consider how to compute the state-value function for an arbitrary policy. The existence and uniqueness of a state-value function for an arbitrary policy are guaranteed as long as either the discount rate is less than one or eventual termination is guaranteed from all states under the given policy.
|
||||
|
||||
Consider a sequence of approximate value functions. The initial approximation, $v_0$, is chosen arbitrarily (except that the terminal state must be given a value of zero) and each successive approximation is obtained by using the Bellman equation for $v_\pi$ as an update rule:
|
||||
$$
|
||||
v_{k + 1}(s) = \sum_{a}{\pi(a|s)\sum_{s^\prime, r}{p(s^\prime,r|s,a)[r + \gamma v_k(s^\prime)]}}
|
||||
$$
|
||||
This algorithm is called *iterative policy evaluation*.
|
||||
|
||||
To produce each successive approximation, $v_{k + 1}$ from $v_k$, iterative policy evaluation applies the same operation to each state $s$: it replaces the old value of $s$ with a new value obtained from the old values of the successor states of $s$, and the expected immediate rewards, along all the one-step transitions possible under the policy being evaluated.
|
||||
|
||||
<u>**Iterative Policy Evaluation**</u>
|
||||
|
||||
```
|
||||
Input π, the policy to be evaluated
|
||||
Initialize an array V(s) = 0, for all s ∈ S+
|
||||
Repeat
|
||||
∆ ← 0
|
||||
For each s ∈ S:
|
||||
v ← V(s)
|
||||
V(s) ← ∑_a π(a|s) ∑_s′,r p(s′,r|s,a)[r+γV(s′)]
|
||||
∆ ← max(∆,|v−V(s)|)
|
||||
until ∆ < θ (a small positive number)
|
||||
Output V ≈ v_π
|
||||
```
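A minimal sketch of this pseudocode for the grid-world example described below, assuming in-place sweeps, $\gamma = 1$, and a small stopping threshold $\theta$; off-grid actions are treated as unavailable, matching the description of the example.

```python
import numpy as np

N = 4                                          # 4x4 grid, an assumed size
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
TERMINAL = {(0, 0), (N - 1, N - 1)}            # top-left and bottom-right corners

def policy_evaluation(gamma=1.0, theta=1e-4):
    """Iterative policy evaluation for the equiprobable random policy."""
    V = np.zeros((N, N))
    while True:
        delta = 0.0
        for r in range(N):
            for c in range(N):
                if (r, c) in TERMINAL:
                    continue
                # only actions that keep the agent on the grid are available
                nexts = [(r + dr, c + dc) for dr, dc in ACTIONS
                         if 0 <= r + dr < N and 0 <= c + dc < N]
                v = sum((1 / len(nexts)) * (-1 + gamma * V[nr, nc])
                        for nr, nc in nexts)
                delta = max(delta, abs(v - V[r, c]))
                V[r, c] = v               # in-place update
        if delta < theta:
            return V

print(policy_evaluation().round(1))
```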
|
||||
|
||||
### Grid World Example
|
||||
|
||||
Consider a grid world where the top-left and bottom-right squares are terminal states. Every other square produces a reward of -1, and the available actions in each state are {up, down, left, right}, as long as the chosen action keeps the agent on the grid. Suppose our agent follows the equiprobable random policy.
|
||||
|
||||

|
||||
|
||||
## Policy Improvement
|
||||
|
||||
One reason for computing the value function for a policy is to help find better policies. Suppose we have determined the value function $v_\pi$ for an arbitrary deterministic policy $\pi$. For some state $s$ we would like to know whether or not we should change the policy to deterministically choose another action. We know how good it is to follow the current policy from $s$, that is $v_\pi(s)$, but would it be better or worse to change to the new policy?
|
||||
|
||||
One way to answer this question is to consider selecting $a$ in $s$ and thereafter following the existing policy, $\pi$. The key criterion is whether this produces a value greater than or less than $v_\pi(s)$. If it is greater, then one would expect it to be better still to select $a$ every time $s$ is encountered, and that the new policy would in fact be a better one overall.
|
||||
|
||||
That this is true is a special case of a general result called the *policy improvement theorem*. Let $\pi$ and $\pi^\prime$ be any pair of deterministic policies such that for all $s \in \mathcal{S}$.
|
||||
$$
|
||||
q_\pi(s, \pi^\prime(s)) \ge v_\pi(s)
|
||||
$$
|
||||
So far we have seen how, given a policy and its value function, we can easily evaluate a change in the policy at a single state to a particular action. It is a natural extension to consider changes at all states and to all possible actions, selecting at each state the action that appears best according to $q_\pi(s, a)$. In other words, to consider the new *greedy* policy, $\pi^\prime$, given by:
|
||||
$$
|
||||
\pi^\prime(s) = argmax_a (q_\pi(s, a))
|
||||
$$
|
||||
So far in this section we have considered the case of deterministic policies. We will not go through the details, but in fact all the ideas of this section extend easily to stochastic policies.
|
||||
|
||||
## Policy Iteration
|
||||
|
||||
Once a policy, $\pi$, has been improved using $v_\pi$ to yield a better policy, $\pi^\prime$, we can then compute $v_{\pi^\prime}$ and improve it again to yield an even better $\pi^{\prime\prime}$. We can thus obtain a sequence of monotonically improving policies and value functions.
|
||||
|
||||
Each policy is guaranteed to be a strict improvement over the previous one (unless it is already optimal). Since a finite MDP has only a finite number of policies, this process must converge to an optimal policy and optimal value function in a finite number of iterations.
|
||||
|
||||
This way of finding an optimal policy is called *policy iteration*.
|
||||
|
||||
<u>Algorithm</u>
|
||||
|
||||
```
|
||||
1. Initialization
|
||||
V(s) ∈ R and π(s) ∈ A(s) arbitrarily for all s ∈ S
|
||||
2. Policy Evaluation
|
||||
Repeat
|
||||
∆ ← 0
|
||||
For each s∈S:
|
||||
v ← V(s)
|
||||
V(s) ← ∑_{s′,r} p(s′,r|s,π(s))[r + γV(s′)]
|
||||
∆ ← max(∆,|v − V(s)|)
|
||||
until ∆ < θ (a small positive number)
|
||||
3. Policy Improvement
|
||||
policy-stable ← true
|
||||
For each s ∈ S:
|
||||
old-action ← π(s)
|
||||
π(s) ← arg max_a ∑_{s′,r} p(s′,r|s,a)[r + γV(s′)]
|
||||
If old-action != π(s), then policy-stable ← false
|
||||
If policy-stable, then stop and return V ≈ v_∗,
|
||||
and π ≈ π_∗; else go to 2
|
||||
```
|
||||
|
||||
## Value Iteration
|
||||
|
||||
One drawback to policy iteration is that each of its iterations involve policy evaluation, which may itself be a protracted iterative computation requiring multiple sweeps through the state set. If policy evaluation is done iteratively, then convergence exactly to $v_\pi$ occurs only in the limit. Must we wait for exact convergence, or can we stop short of that?
|
||||
|
||||
In fact, the policy evaluation step of policy iteration can be truncated in several ways without losing the convergence guarantee of policy iteration. One important special case is when policy evaluation is stopped after one sweep. This algorithm is called value iteration.
|
||||
|
||||
Another way of understanding value iteration is by reference to the Bellman optimality equation. Note that value iteration is obtained simply by turning the Bellman optimality equation into an update rule. Also note how value iteration is identical to the policy evaluation update except that it requires the maximum to be taken over all actions.
|
||||
|
||||
Finally, let us consider how value iteration terminates. Like policy evaluation, value iteration formally requires an infinite number of iterations to converge exactly. In practice, we stop once the value function changes by only a small amount.
|
||||
|
||||
```
|
||||
Initialize array V arbitrarily (e.g., V(s) = 0 for all
|
||||
s ∈ S+)
|
||||
|
||||
Repeat
|
||||
∆ ← 0
|
||||
For each s ∈ S:
|
||||
v ← V(s)
|
||||
V(s) ← max_a∑_{s′,r} p(s′,r|s,a)[r + γV(s′)]
|
||||
∆ ← max(∆, |v − V(s)|)
|
||||
until ∆ < θ (a small positive number)
|
||||
|
||||
Output a deterministic policy, π ≈ π_∗, such that
|
||||
π(s) = arg max_a ∑_{s′,r} p(s′,r|s,a)[r + γV(s′)]
|
||||
```
|
||||
|
||||
Value iteration effectively combines, in each of its sweeps, one sweep of policy evaluation and one sweep of policy improvement. Faster convergence is often achieved by interposing multiple policy evaluation sweeps between each policy improvement sweep.
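A minimal sketch of value iteration over a tabular model, assuming the dynamics are supplied as a lookup table `P[s][a] = [(prob, next_state, reward, done), ...]` (a common convention in toy implementations, not something fixed by the book).

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=1.0, theta=1e-8):
    V = np.zeros(n_states)

    def q(s, a):
        # one-step lookahead: sum over p(s', r | s, a) [r + gamma V(s')]
        return sum(prob * (r + gamma * V[s2] * (not done))
                   for prob, s2, r, done in P[s][a])

    while True:
        delta = 0.0
        for s in range(n_states):
            best = max(q(s, a) for a in range(n_actions))
            delta = max(delta, abs(best - V[s]))
            V[s] = best                      # Bellman optimality update
        if delta < theta:
            break

    # extract a deterministic greedy policy from the converged values
    policy = [int(np.argmax([q(s, a) for a in range(n_actions)]))
              for s in range(n_states)]
    return V, policy
```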
|
||||
|
||||
## Asynchronous Dynamic Programming
|
||||
|
||||
*Asynchronous* DP algorithms are in-place DP algorithms that are not organized in terms of systematic sweeps of the state set. These algorithms update the values of states in any order whatsoever, using whatever value of other states happen to be available.
|
||||
|
||||
To converge correctly, however, an asynchronous algorithm must continue to update the value of all the states: it can't ignore any state after some point in the computation.
|
||||
|
||||
## Generalized Policy Iteration
|
||||
|
||||
Policy iteration consists of two simultaneous, iterating processes, one making the value function consistent with the current policy (policy evaluation) and the other making the policy greedy with respect to the current value function (policy improvement).
|
||||
|
||||
We use the term *generalized policy iteration* (GPI) to refer to the general idea of letting policy evaluation and policy improvement interact; the two processes can be viewed as both competing and cooperating. They compete in the sense that they pull in opposing directions. Making the policy greedy with respect to the value function typically makes the value function incorrect for the changed policy, and making the value function consistent with the policy typically causes that policy to no longer be greedy. In the long run, however, the two processes interact to find a single joint solution.
|
||||
|
||||
66
content/research/reinforcementlearning/notes/intro.md
Normal file
66
content/research/reinforcementlearning/notes/intro.md
Normal file
|
|
@ -0,0 +1,66 @@
|
|||
# Introduction to Reinforcement Learning Day 1
|
||||
|
||||
Recall that this course is based on the book --
|
||||
|
||||
Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
|
||||
|
||||
|
||||
|
||||
These notes really serve as talking points for the overall concepts described in the chapter and are not meant to stand for themselves. Check out the book for more complete thoughts :)
|
||||
|
||||
|
||||
|
||||
**Reinforcement Learning** is learning what to do -- how to map situations to actions -- so as to maximize a numerical reward signal. There are two characteristics, trial-and-error search and delayed reward, that are the two most important distinguishing features of reinforcement learning.
|
||||
|
||||
|
||||
|
||||
Markov decision processes are intended to include just these three aspects: sensation, action, and goal(s).
|
||||
|
||||
|
||||
|
||||
Reinforcement learning is **different** than the following categories
|
||||
|
||||
- Supervised learning: This is learning from a training set of labeled examples provided by a knowledgeable external supervisor. In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all situations in which the agent has to act.
|
||||
- Unsupervised learning: Reinforcement learning is trying to maximize a reward signal as opposed to finding some sort of hidden structure within the data.
One of the challenges that arises in reinforcement learning is the **trade-off** between exploration and exploitation. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task.

Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. This is different from supervised learning, which is concerned with finding the best classifier/regression without explicitly specifying how such an ability would finally be useful.

A complete, interactive, goal-seeking agent can also be a component of a larger behaving system. A simple example is an agent that monitors the charge level of a robot's battery and sends commands to the robot's control architecture. This agent's environment is the rest of the robot together with the robot's environment.

## Definitions

A policy defines the learning agent's way of behaving at a given time.

A reward signal defines the goal in a reinforcement learning problem. The agent's sole objective is to maximize the total reward it receives over the long run.

A value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. Without rewards there could be no values, and the only purpose of estimating values is to achieve more reward. We seek actions that bring about states of highest value.

Unfortunately, it is much harder to determine values than it is to determine rewards. The most important component of almost all reinforcement learning algorithms we consider is a method for efficiently estimating values.

**Look at Tic-Tac-Toe example**

Most of the time in a reinforcement learning algorithm, we move greedily, selecting the move that leads to the state with greatest value. Occasionally, however, we select randomly from among the other moves instead. These are called exploratory moves because they cause us to experience states that we might otherwise never see.
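As a rough sketch of that selection rule (the `moves` list and the `value` function, which scores the state a move leads to, are placeholders for illustration):

```python
import random

def select_move(moves, value, epsilon=0.1):
    """Mostly greedy move selection with occasional exploratory moves."""
    if random.random() < epsilon:
        return random.choice(moves)  # exploratory move
    return max(moves, key=value)     # greedy move: the state with greatest estimated value
```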
Summary: Reinforcement learning is learning by an agent from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment.
110
content/research/reinforcementlearning/notes/mcmethods.md
Normal file
# Chapter 5: Monte Carlo Methods

Monte Carlo methods do not assume complete knowledge of the environment. They require only *experience*, which is a sample sequence of states, actions, and rewards from actual or simulated interaction with an environment.

Monte Carlo methods are ways of solving the reinforcement learning problem based on averaging sample returns. To ensure that well-defined returns are available, we define Monte Carlo methods only for episodic tasks. Only on the completion of an episode are value estimates and policies changed.

Monte Carlo methods sample and average returns for each state-action pair, much like the bandit methods explored earlier. The main difference is that there are now multiple states, each acting like a different bandit problem, and the problems are interrelated. Because all of the action selections are undergoing learning, the problem becomes nonstationary from the point of view of the earlier state.

## Monte Carlo Prediction

Recall that the value of a state is the expected return -- the expected cumulative future discounted reward -- starting from that state. One way to estimate it from experience is by averaging the returns observed after visits to that state.

Each occurrence of state $s$ in an episode is called a *visit* to $s$. The *first-visit MC method* estimates $v_\pi(s)$ as the average of the returns following first visits to $s$, whereas the *every-visit MC method* averages the returns following all visits to $s$. These two Monte Carlo methods are very similar but have slightly different theoretical properties.

<u>First-visit MC prediction</u>
```
Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state-value function
    Returns(s) ← an empty list, for all s ∈ S

Repeat forever:
    Generate an episode using π
    For each state s appearing in the episode:
        G ← the return that follows the first occurrence of s
        Append G to Returns(s)
        V(s) ← average(Returns(s))
```
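A minimal Python sketch of the same procedure, assuming a `generate_episode(policy)` helper (not defined in these notes) that returns a list of `(state, reward)` pairs, where the reward is the one received after leaving that state:

```python
from collections import defaultdict

def first_visit_mc_prediction(policy, generate_episode, num_episodes, gamma=1.0):
    """Estimate v_pi by averaging the returns that follow first visits to each state."""
    returns = defaultdict(list)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = generate_episode(policy)  # [(S_0, R_1), (S_1, R_2), ...]
        G = 0.0
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward                       # return following time step t
            if state not in [s for s, _ in episode[:t]]:  # first visit to this state?
                returns[state].append(G)
                V[state] = sum(returns[state]) / len(returns[state])
    return V
```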
## Monte Carlo Estimation of Action Values

If a model is not available then it is particularly useful to estimate *action* values rather than state values. With a model, state values alone are sufficient to define a policy. Without a model, however, state values alone are not sufficient. One must explicitly estimate the value of each action in order for the values to be useful in suggesting a policy.

The only complication is that many state-action pairs may never be visited. If $\pi$ is a deterministic policy, then in following $\pi$ one will observe returns only for one of the actions from each state. With no returns to average, the Monte Carlo estimates of the other actions will not improve with experience. This is a serious problem because the purpose of learning action values is to help in choosing among the actions available in each state.

This is the general problem of *maintaining exploration*. For policy evaluation to work for action values, we must assure continual exploration. One way to do this is by specifying that the episodes *start in a state-action pair*, and that each pair has a nonzero probability of being selected as the start. We call this the assumption of *exploring starts*.

## Monte Carlo Control

We made two unlikely assumptions above in order to easily obtain this guarantee of convergence for the Monte Carlo method. One was that the episodes have exploring starts, and the other was that policy evaluation could be done with an infinite number of episodes.

<u>Monte Carlo Exploring Starts</u>
```
Initialize, for all s ∈ S, a ∈ A(s):
    Q(s,a) ← arbitrary
    π(s) ← arbitrary
    Returns(s,a) ← empty list

Repeat forever:
    Choose S_0 ∈ S and A_0 ∈ A(S_0) s.t. all pairs have probability > 0
    Generate an episode starting from S_0, A_0, following π
    For each pair s,a appearing in the episode:
        G ← the return that follows the first occurrence of s,a
        Append G to Returns(s,a)
        Q(s,a) ← average(Returns(s,a))
    For each s in the episode:
        π(s) ← arg max_a Q(s,a)
```
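A rough Python sketch of Monte Carlo ES, assuming a `generate_episode_from(state, action, policy)` helper that starts from the given state-action pair, follows `policy` afterwards, and returns a list of `(state, action, reward)` triples:

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, generate_episode_from, num_episodes, gamma=1.0):
    """states: list of states; actions(s): list of actions available in s."""
    Q = defaultdict(float)
    returns = defaultdict(list)
    policy = {s: random.choice(actions(s)) for s in states}
    for _ in range(num_episodes):
        s0 = random.choice(states)       # exploring starts: every state-action
        a0 = random.choice(actions(s0))  # pair can begin an episode
        episode = generate_episode_from(s0, a0, policy)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if (s, a) not in [(x, y) for x, y, _ in episode[:t]]:  # first visit of (s, a)
                returns[(s, a)].append(G)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
                policy[s] = max(actions(s), key=lambda act: Q[(s, act)])
    return policy, Q
```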
## Monte Carlo Control without Exploring Starts

The only general way to ensure that all actions are selected infinitely often is for the agent to continue to select them. There are two approaches to ensuring this, resulting in what we call *on-policy* methods and *off-policy* methods.

On-policy methods attempt to evaluate or improve the policy that is used to make decisions, whereas off-policy methods evaluate or improve a policy different from that used to generate the data.

In on-policy control methods the policy is generally *soft*, meaning that $\pi(a|s) > 0$ for all $s \in \mathcal{S}$ and all $a \in \mathcal{A}(s)$. The on-policy method in this section uses $\epsilon$-greedy policies, meaning that most of the time they choose an action that has maximal estimated action value, but with probability $\epsilon$ they instead select an action at random.

<u>On-policy first-visit MC control (for $\epsilon$-soft policies)</u>
```
Initialize, for all s ∈ S, a ∈ A(s):
    Q(s,a) ← arbitrary
    Returns(s,a) ← empty list
    π(a|s) ← an arbitrary ε-soft policy

Repeat forever:
    (a) Generate an episode using π
    (b) For each pair s,a appearing in the episode:
            G ← the return that follows the first occurrence of s,a
            Append G to Returns(s,a)
            Q(s,a) ← average(Returns(s,a))
    (c) For each s in the episode:
            A* ← arg max_a Q(s,a)   (with ties broken arbitrarily)
            For all a ∈ A(s):
                π(a|s) ← 1 - ε + ε/|A(s)|   if a = A*
                         ε/|A(s)|            if a ≠ A*
```
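The policy update in step (c) can be written directly as probabilities; a small sketch (the dictionary layout is my own choice):

```python
def epsilon_soft_update(policy, Q, state, actions, epsilon):
    """Make policy(.|state) epsilon-greedy w.r.t. Q, stored as policy[(state, a)]."""
    best = max(actions, key=lambda a: Q[(state, a)])  # A*, ties broken by max's ordering
    for a in actions:
        if a == best:
            policy[(state, a)] = 1 - epsilon + epsilon / len(actions)
        else:
            policy[(state, a)] = epsilon / len(actions)
```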
193
content/research/reinforcementlearning/notes/mdp.md
Normal file
# Chapter 3: Finite Markov Decision Processes

Markov decision processes are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those future rewards. Thus MDPs involve delayed reward and the need to trade off immediate and delayed reward. Whereas in bandit problems we estimated the value $q_*(a)$ of each action $a$, in MDPs we estimate the value $q_*(s, a)$ of each action $a$ in each state $s$.

MDPs are a mathematically idealized form of the reinforcement learning problem. As we will see in artificial intelligence, there is often a tension between breadth of applicability and mathematical tractability. This chapter will introduce this tension and discuss some of the trade-offs and challenges that it implies.

## Agent-Environment Interface

The learner and decision maker is called the *agent*. The thing it interacts with is called the *environment*. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent.

The environment also gives rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.



To make the following paragraphs clearer, a Markov decision process is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker.

In a *finite* MDP, the sets of states, actions, and rewards all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have a well-defined discrete probability distribution dependent only on the preceding state and action.
$$
p(s^\prime | s,a) = \sum_{r \in \mathcal{R}}{p(s^\prime, r|s, a)}
$$

Breaking down the above formula, it's just an instantiation of the law of total probability: if you partition the probability space by the reward, summing over the partitions recovers the overall probability. This formula has a special name: the *state-transition probability*.

From this we can compute the expected reward for each state-action pair by multiplying each reward by the probability of receiving it and summing it all up.

$$
r(s, a) = \sum_{r \in \mathcal{R}}{r}\sum_{s^\prime \in \mathcal{S}}{p(s^\prime, r|s,a)}
$$

The expected reward for a state-action-next-state triple is

$$
r(s, a, s^\prime) = \sum_{r \in \mathcal{R}}{r\frac{p(s^\prime, r|s,a)}{p(s^\prime|s,a)}}
$$

I wasn't able to piece together this function in my head like the others. Currently I imagine it as if we took the formula before and turned the universe of discourse from the universal set to the set of outcomes where $s^\prime$ occurred. In other words, dividing by $p(s^\prime|s,a)$ conditions on the transition actually landing in $s^\prime$.
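The three quantities above can be sketched in a few lines of Python, assuming the dynamics are stored as a dictionary `p[(s_next, r, s, a)] = p(s_next, r | s, a)` (an encoding I chose for illustration):

```python
def state_transition_prob(p, s_next, s, a):
    """p(s'|s,a) = sum over r of p(s', r | s, a)."""
    return sum(prob for (sn, r, st, ac), prob in p.items()
               if sn == s_next and st == s and ac == a)

def expected_reward(p, s, a):
    """r(s,a) = sum over s', r of r * p(s', r | s, a)."""
    return sum(r * prob for (sn, r, st, ac), prob in p.items()
               if st == s and ac == a)

def expected_reward_triple(p, s, a, s_next):
    """r(s,a,s') = sum over r of r * p(s', r | s, a) / p(s'|s,a)."""
    return sum(r * prob for (sn, r, st, ac), prob in p.items()
               if sn == s_next and st == s and ac == a) / state_transition_prob(p, s_next, s, a)
```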
The MDP framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision making and acting.

### Agent-Environment Boundary

In particular, the boundary between agent and environment is typically not the same as the physical boundary of a robot's or animal's body. Usually, the boundary is drawn closer to the agent than that. For example, the motors and mechanical linkages of a robot and its sensing hardware should usually be considered parts of the environment rather than parts of the agent.

The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. We do not assume that everything in the environment is unknown to the agent. For example, the agent often knows quite a bit about how its rewards are computed as a function of its actions and the states in which they are taken. But we always consider the reward computation to be external to the agent because it defines the task facing the agent and thus must be beyond its ability to change arbitrarily. The agent-environment boundary represents the limit of the agent's absolute control, not of its knowledge.

This framework breaks whatever the agent is trying to achieve down into three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent (the actions), one signal to represent the basis on which the choices are made (the states), and one signal to define the agent's goal (the rewards).

### Example 3.4: Recycling Robot MDP

Recall that the agent makes a decision at times determined by external events. At each such time the robot decides whether it should

(1) Actively search for a can

(2) Remain stationary and wait for someone to bring it a can

(3) Go back to home base to recharge its battery

Suppose the environment works as follows: the best way to find cans is to actively search for them, but this runs down the robot's battery, whereas waiting does not. Whenever the robot is searching, the possibility exists that its battery will become depleted. In this case, the robot must shut down and wait to be rescued (producing a low reward).

The agent makes its decisions solely as a function of the energy level of the battery. It can distinguish two levels, high and low, so that the state set is $\mathcal{S} = \{high, low\}$. Let us call the possible decisions -- the agent's actions -- wait, search, and recharge. When the energy level is high, recharging would always be foolish, so we do not include it in the action set for this state. The agent's action sets are

$$
\begin{align*}
\mathcal{A}(high) &= \{search, wait\} \\
\mathcal{A}(low) &= \{search, wait, recharge\}
\end{align*}
$$

If the energy level is high, then a period of active search can always be completed without a risk of depleting the battery. A period of searching that begins with a high energy level leaves the energy level high with a probability of $\alpha$ and reduces it to low with a probability of $(1 - \alpha)$. On the other hand, a period of searching undertaken when the energy level is low leaves it low with a probability of $\beta$ and depletes the battery with a probability of $(1 - \beta)$. In the latter case, the robot must be rescued, and the battery is then recharged back to high.

Each can collected by the robot counts as a unit reward, whereas a reward of $-3$ occurs whenever the robot has to be rescued. Let $r_{search}$ and $r_{wait}$, with $r_{search} > r_{wait}$, respectively denote the expected number of cans the robot will collect while searching and waiting. Finally, to keep things simple, suppose that no cans can be collected during a run home for recharging and that no cans can be collected on a step in which the battery is depleted.
| $s$  | $a$      | $s^\prime$ | $p(s^\prime \mid s, a)$ | $r(s, a, s^\prime)$ |
| ---- | -------- | ---------- | ----------------------- | ------------------- |
| high | search   | high       | $\alpha$                | $r_{search}$        |
| high | search   | low        | $(1-\alpha)$            | $r_{search}$        |
| low  | search   | high       | $(1 - \beta)$           | $-3$                |
| low  | search   | low        | $\beta$                 | $r_{search}$        |
| high | wait     | high       | $1$                     | $r_{wait}$          |
| high | wait     | low        | $0$                     | $r_{wait}$          |
| low  | wait     | high       | $0$                     | $r_{wait}$          |
| low  | wait     | low        | $1$                     | $r_{wait}$          |
| low  | recharge | high       | $1$                     | $0$                 |
| low  | recharge | low        | $0$                     | $0$                 |
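As a concrete illustration, the table could be encoded in the dictionary format sketched earlier (zero-probability rows omitted; the numeric values of $\alpha$, $\beta$, $r_{search}$, and $r_{wait}$ are placeholders I picked, not values from the text):

```python
alpha, beta = 0.8, 0.6       # illustrative transition probabilities
r_search, r_wait = 2.0, 1.0  # expected cans collected while searching / waiting

# p[(s_next, r, s, a)] = p(s_next, r | s, a), matching the table above.
p = {
    ("high", r_search, "high", "search"):   alpha,
    ("low",  r_search, "high", "search"):   1 - alpha,
    ("high", -3.0,     "low",  "search"):   1 - beta,
    ("low",  r_search, "low",  "search"):   beta,
    ("high", r_wait,   "high", "wait"):     1.0,
    ("low",  r_wait,   "low",  "wait"):     1.0,
    ("high", 0.0,      "low",  "recharge"): 1.0,
}

# e.g. expected_reward(p, "low", "search") == beta * r_search + (1 - beta) * (-3)
```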
A *transition graph* is a useful way to summarize the dynamics of a finite MDP. There are two kinds of nodes: *state nodes* and *action nodes*. There is a state node for each possible state and an action node for each state-action pair. Starting in state $s$ and taking action $a$ moves you along the line from state node $s$ to action node $(s, a)$. Then the environment responds with a transition to the next state's node via one of the arrows leaving action node $(s, a)$.



## Goals and Rewards

The reward hypothesis is that all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal called the reward.

Although formulating goals in terms of reward signals might at first appear limiting, in practice it has proved to be flexible and widely applicable. The best way to see this is to consider examples of how it has been, or could be, used. For example:

- To make a robot learn to walk, researchers have provided a reward on each time step proportional to the robot's forward motion.
- In making a robot learn how to escape from a maze, the reward is often $-1$ for every time step that passes prior to escape; this encourages the agent to escape as quickly as possible.
- To make a robot learn to find and collect empty soda cans for recycling, one might give it a reward of zero most of the time, and then a reward of $+1$ for each can collected. One might also want to give the robot negative rewards when it bumps into things or when somebody yells at it.
- For an agent to play checkers or chess, the natural rewards are $+1$ for winning, $-1$ for losing, and $0$ for drawing and for all nonterminal positions.

It is critical that the rewards we set up truly indicate what we want accomplished. In particular, the reward signal is not the place to impart to the agent prior knowledge about *how* to achieve what we want it to do. For example, a chess-playing agent should only be rewarded for actually winning, not for achieving subgoals such as taking its opponent's pieces. If achieving these sorts of subgoals were rewarded, then the agent might find a way to achieve them without achieving the real goal. The reward signal is your way of communicating what you want achieved, not how you want it achieved.

## Returns and Episodes
In general, we seek to maximize the *expected return*, where the return is defined as some specific function of the reward sequence. In the simplest case, the return is the sum of the rewards:

$$
G_t = R_{t + 1} + R_{t + 2} + R_{t + 3} + \dots + R_{T}
$$

where $T$ is the final time step. This approach makes sense in applications in which there is a natural notion of a final time step, that is, when the agent-environment interaction breaks naturally into subsequences, or *episodes*, such as plays of a game, trips through a maze, or any sort of repeated interaction.

### Episodic Tasks

Each episode ends in a special state called the *terminal state*, followed by a reset to the standard starting state or to a sample from a standard distribution of starting states. Even if you think of episodes as ending in different ways, the next episode begins independently of how the previous one ended. Therefore, the episodes can all be considered to end in the same terminal state, with different rewards for the different outcomes.

Tasks with episodes of this kind are called *episodic tasks*. In episodic tasks we sometimes need to distinguish the set of all nonterminal states, denoted $\mathcal{S}$, from the set of all states plus the terminal state, denoted $\mathcal{S}^+$. The time of termination, $T$, is a random variable that normally varies from episode to episode.

### Continuing Tasks

On the other hand, in many cases the agent-environment interaction goes on continually without limit. We call these *continuing tasks*. The return formulation above is problematic for continuing tasks because the final time step would be $T = \infty$, and the return, which we are trying to maximize, could itself easily be infinite. The additional concept that we need is that of *discounting*. According to this approach, the agent tries to select actions so that the sum of the discounted rewards it receives over the future is maximized. In particular, it chooses $A_t$ to maximize the expected discounted return

$$
G_t= \sum_{k = 0}^\infty{\gamma^k R_{t+k+1}}
$$

where $\gamma$, a parameter between $0$ and $1$, is called the *discount rate*.
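A quick sketch of computing a discounted return from a finite reward sequence (for a continuing task this would only be a truncated approximation):

```python
def discounted_return(rewards, gamma):
    """G_t = sum over k of gamma**k * R_{t+k+1}, for a finite list [R_{t+1}, R_{t+2}, ...]."""
    G = 0.0
    for reward in reversed(rewards):  # work backwards: G <- R + gamma * G
        G = reward + gamma * G
    return G

print(discounted_return([1, 1, 1, 1, 1], 0.5))  # 1 + 0.5 + 0.25 + 0.125 + 0.0625 = 1.9375
```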
#### Discount Rate

The discount rate determines the present value of future rewards: a reward received $k$ time steps in the future is worth only $\gamma^{k - 1}$ times what it would be worth if it were received immediately. If $\gamma < 1$, the infinite sum has a finite value as long as the reward sequence is bounded.

If $\gamma = 0$, the agent is "myopic" in being concerned only with maximizing immediate rewards. But in general, acting to maximize immediate reward can reduce access to future rewards so that the return is reduced.

As $\gamma$ approaches 1, the return objective takes future rewards into account more strongly; the agent becomes more farsighted.

### Example 3.5 Pole-Balancing

The objective in this task is to apply forces to a cart moving along a track so as to keep a pole hinged to the cart from falling over.

A failure is said to occur if the pole falls past a given angle from the vertical or if the cart runs off the track.



#### Approach 1

The reward can be $+1$ for every time step on which failure did not occur. In this case, successful balancing forever would mean a return of infinity.

#### Approach 2

The reward can be $-1$ on each failure and zero at all other times. The return at each time would then be related to $-\gamma^k$, where $k$ is the number of steps before failure.

In either case, the return is maximized by keeping the pole balanced for as long as possible.
## Policies and Value Functions

Almost all reinforcement learning algorithms involve estimating *value functions*, which estimate what future rewards can be expected. Of course, the rewards the agent can expect to receive depend on the actions it will take. Accordingly, value functions are defined with respect to particular ways of acting, called policies.

Formally, a *policy* is a mapping from states to probabilities of selecting each possible action. The *value* of a state $s$ under a policy $\pi$, denoted $v_{\pi}(s)$, is the expected return when starting in $s$ and following $\pi$ thereafter. For MDPs we can define $v_{\pi}$ as

$$
v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s] = \mathbb{E}_{\pi}\left[\sum_{k = 0}^\infty{\gamma^k R_{t+k+1}} \,\middle|\, S_t = s\right]
$$

We call this function the *state-value function for policy $\pi$*. Similarly, we define the value of taking action $a$ in state $s$ under a policy $\pi$, denoted $q_\pi(s,a)$, as the expected return starting from $s$, taking the action $a$, and thereafter following the policy $\pi$. Succinctly, this is called the *action-value function for policy $\pi$*.

### Optimality and Approximation

For some kinds of tasks we are interested in, optimal policies can be generated only with extreme computational cost. A critical aspect of the problem facing the agent is always the computational power available to it, in particular, the amount of computation it can perform in a single time step.

The memory available is also an important constraint. A large amount of memory is often required to build up approximations of value functions, policies, and models. In the case of large state sets, functions must be approximated using some sort of more compact parameterized function representation.

This presents us with unique opportunities for achieving useful approximations. For example, in approximating optimal behavior, there may be many states that the agent faces with such a low probability that selecting suboptimal actions for them has little impact on the amount of reward the agent receives.

The online nature of reinforcement learning makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states, at the expense of infrequent ones. This is the key property that distinguishes reinforcement learning from other approaches to approximately solving MDPs.
### Summary

Let us summarize the elements of the reinforcement learning problem.

Reinforcement learning is about learning from interaction how to behave in order to achieve a goal. The reinforcement learning *agent* and its *environment* interact over a sequence of discrete time steps.

The *actions* are the choices made by the agent; the *states* are the basis for making the choices; and the *rewards* are the basis for evaluating the choices.

Everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known.

A *policy* is a stochastic rule by which the agent selects actions as a function of states.

When the reinforcement learning setup described above is formulated with well-defined transition probabilities, it constitutes a Markov Decision Process (MDP).

The *return* is the function of future rewards that the agent seeks to maximize. It has several different definitions depending on the nature of the task and whether one wishes to *discount* delayed reward.

- The undiscounted formulation is appropriate for *episodic tasks*, in which the agent-environment interaction breaks naturally into *episodes*.
- The discounted formulation is appropriate for *continuing tasks*, in which the interaction does not naturally break into episodes but continues without limit.

A policy's *value functions* assign to each state, or state-action pair, the expected return from that state, or state-action pair, given that the agent uses the policy. The *optimal value functions* assign to each state, or state-action pair, the largest expected return achievable by any policy. A policy whose value functions are optimal is an *optimal policy*.

Even if the agent has a complete and accurate environment model, the agent is typically unable to perform enough computation per time step to fully use it. The memory available is also an important constraint. Memory may be required to build up accurate approximations of value functions, policies, and models. In most cases of practical interest there are far more states than could possibly be entries in a table, and approximations must be made.
19
content/research/reinforcementlearning/readings.md
Normal file
# Readings for Lectures of Reinforcement Learning

## Lecture 1
Chapter 1: What is Reinforcement Learning?

## Lecture 2
Chapter 2: Multi-armed Bandits

## Lecture 3
Chapter 3: Finite Markov Decision Processes Part 1

## Lecture 4
Chapter 3: Finite Markov Decision Processes Part 2

## Lecture 5
[No Readings] Playing around with Multi-armed Bandits Code

**Lost track of readings around this time period :(**
78
content/research/reinforcementlearning/syllabus.md
Normal file
# Reinforcement Learning

The goal of this independent study is to gain an introduction to the topic of Reinforcement Learning.

As such, the majority of the semester will be spent following the textbook to gain an introduction to the topic, and the last part applying it to some problems.

## Textbook

The majority of the content of this independent study will come from the textbook. This is meant to lessen the burden on both of us, as I have already experimented with curating my own content.

The textbook also includes examples throughout the text to immediately apply what's learned.

Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction" http://incompleteideas.net/book/bookdraft2017nov5.pdf

## Discussions and Notes

Discussions and notes will be kept track of and published on my tilda space as time and energy permits. This is for easy reference, and since it's nice to write down what you learn.

## Topics to be Discussed

### The Reinforcement Learning Problem (3 Sessions)

In this section we will get ourselves familiar with the topics that are commonly discussed in reinforcement learning problems.

We will learn the different vocabulary terms, such as:

- Evaluative Feedback
- Non-Associative Learning
- Rewards/Returns
- Value Functions
- Optimality
- Exploration/Exploitation
- Model
- Policy
- Value Function
- Multi-armed Bandit Problem
### Markov Decision Processes (4 Sessions)

This is a type of reinforcement learning problem that is commonly studied and well documented. It forms an environment within which the agent can operate. Possible subtopics include:

- Finite Markov Decision Processes
- Goals and Rewards
- Returns and Episodes
- Optimality and Approximation

### Dynamic Programming (3 Sessions)

Dynamic Programming refers to a collection of algorithms that can be used to compute optimal policies given an environment. Subtopics we will go over include:

- Policy Evaluation
- Policy Improvement
- Policy Iteration
- Value Iteration
- Asynchronous DP
- Generalized Policy Iteration
- Bellman Expectation Equations

### Monte Carlo Methods (3 Sessions)

Now we move on to not having complete knowledge of the environment. This will go into estimating value functions and discovering optimal policies. Possible subtopics include:

- Monte Carlo Prediction
- Monte Carlo Control
- Importance Sampling
- Incremental Implementation
- Off-Policy Monte Carlo Control

### Temporal-Difference Learning (4-5 Sessions)

Temporal-difference learning is a combination of Monte Carlo ideas and Dynamic Programming. Its methods can learn directly from raw experience without a model of the environment. Subtopics will include:

- TD Prediction
- Sarsa: On-Policy TD Control
- Q-Learning: Off-Policy TD Control
- Function Approximation
- Eligibility Traces