Cleaned up some old pages and uploaded some proceedings

This commit is contained in:
Brandon Rozek 2021-12-13 16:56:35 -05:00
parent d222df272b
commit fd62fbaf72
26 changed files with 97 additions and 61 deletions


@ -8,6 +8,8 @@ Chris Richters, Ethan Martin, and I embarked on the book ["Algorithms" by Jeff E
Carlos Ramirez, Ethan Martin, and I started on the book "Elements of Radio" by Abraham and William Marcus. It provides a great introduction to radio not assuming an engineering background.
[Notes on Dimensionality Reduction](dimensionalityreduction)
## Courses
Below are the courses for which I have documents available online.


@ -0,0 +1,16 @@
---
Title: Dimensionality Reduction
Description: Reducing high-dimensional datasets to something we can digest.
showthedate: false
---
In the Summer of 2018, another student and I started a study on Dimensionality Reduction. Sadly, we became too busy a few weeks into the study. I decided to upload what we got through anyway.
[Syllabus](syllabus)
[Intro](intro)
[Feature Selection](featureselection)
[Optimality Criteria](optimalitycriteria)


@ -0,0 +1,36 @@
---
title: Feature Selection
---
Feature selection is the process of selecting a subset of relevant features for use in model construction. The core idea is that data can contain many features that are redundant or irrelevant, so removing them will not result in much loss of information. We also wish to remove features that do not help with our goal.
Feature selection techniques are usually applied in domains where there are many features and comparatively few samples.
## Techniques
The brute-force approach to feature selection exhaustively evaluates all possible combinations of the input features and picks the best subset. The computational cost of this approach is prohibitively high, and it carries a considerable risk of overfitting to the data.
The techniques below are greedy approaches to this problem. Greedy algorithms do not search the entire space of possibilities but instead converge towards local maxima/minima.
### Wrapper Methods
Wrapper methods use a predictive model to score feature subsets. Each candidate subset is used to train a model, which is then tested on a hold-out set; the model's error rate on that set becomes the score for the subset. This approach is computationally intensive, but it usually yields the best-performing feature set for that particular type of model.
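As a rough illustration (my own sketch rather than anything from the study, assuming scikit-learn and one of its built-in datasets), a greedy forward-selection wrapper might look like this:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
best_score = 0.0
while remaining:
    # Score every candidate subset (current selection plus one new feature)
    # by training on the training split and evaluating on the hold-out split.
    scores = []
    for f in remaining:
        cols = selected + [f]
        model = LogisticRegression(max_iter=5000).fit(X_train[:, cols], y_train)
        scores.append((model.score(X_hold[:, cols], y_hold), f))
    score, f = max(scores)
    if score <= best_score:   # stop once no candidate improves the hold-out score
        break
    best_score, selected = score, selected + [f]
    remaining.remove(f)

print("Selected features:", selected, "hold-out accuracy:", best_score)
```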
### Filter Methods
This method uses a proxy measure instead of the error rate. The proxy measure can be specifically chosen to be faster to compute while still capturing the essence of the feature set. Common implementations include:
- Mutual information
- Pointwise mutual information
- Pearson product-moment correlation coefficient
- Relief-based algorithms
- Inter/intra class distances
- Scores of significance tests
Many filters provide a feature ranking rather than an explicit best feature subset. A minimal sketch using one of these measures is shown below.
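This sketch (my own, assuming scikit-learn) ranks features by mutual information and keeps the ten highest-scoring ones:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)
ranking = selector.scores_.argsort()[::-1]   # feature indices ranked by mutual information
print("Top 10 features by mutual information:", ranking[:10])
X_reduced = selector.transform(X)            # keep only the 10 highest-scoring features
```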
### Embedded Methods
This is a catch-all group of techniques that perform feature selection as part of the model construction process. For example, the LASSO linear model penalizes the regression coefficients, shrinking the unimportant ones to zero.
Stepwise regression is a commonly used feature selection technique that acts greedily, adding at each step the feature that most improves the model.
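Here is a small sketch of the LASSO example above (my own illustration, assuming scikit-learn and one of its built-in regression datasets):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)   # LASSO is sensitive to feature scale
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)      # features whose coefficients were not shrunk to zero
print("Features kept by LASSO:", kept)
```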


@ -0,0 +1,32 @@
---
title: Introduction to Dimensionality Reduction
---
## Motivations
We all have problems to solve, but the data at our disposal may be too sparse or have so many features that solving the problem becomes computationally difficult or even impossible.
### Types of Problems
**Prediction**: Take some input and try to predict an output from it. For example, given a collection of labeled pictures of people, have the computer predict who is in the next picture taken (face or object recognition).
**Structure Discovery**: Find an alternative representation of the data, usually to discover groups or alternative visualizations.
**Density Estimation**: Find the model that best describes the data. An example is explaining the price of a home in terms of several factors.
## Advantages
- Reduces the storage space of data (possibly removing noise in the process!)
- Decreases complexity, making algorithms run faster
- Removes multicollinearity, which in turn tends to improve the performance of a given machine learning model
  - Multicollinearity means that several variables are correlated with each other. Most models make use of independent features to simplify computations, so ensuring the features are independent is important.
- Data becomes easier to visualize as it can be projected into 2D/3D space
- Lessens the chance of models *overfitting*
  - Overfitting typically happens when you have few observations relative to the number of variables (also known as sparse data)
  - An overfit model can achieve high accuracy on the data it was trained on, but it generalizes poorly to reality
- The curse of dimensionality does not apply to the reduced dataset
  - The curse of dimensionality refers to the fact that in high-dimensional spaces all points become nearly equidistant (see the sketch below)
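The last point is easy to demonstrate numerically. The sketch below (my own, assuming only NumPy) draws random points and shows the ratio of the farthest to the nearest pairwise distance approaching 1 as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
for d in (2, 10, 100, 1000):
    X = rng.random((n, d))                        # n random points in d dimensions
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists = dists[np.triu_indices(n, k=1)]        # unique pairwise distances
    print(d, dists.max() / dists.min())           # the ratio shrinks toward 1 as d grows
```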
## Disadvantages
Information is lost in the process, so potentially insightful information may be removed. The features produced by dimensionality reduction are also typically harder to interpret, which leads to more confusing models.


@ -0,0 +1,101 @@
---
title: Optimality Criteria
---
Falling under wrapper methods, optimality criteria are often used to aid in model selection. These criteria provide a measure of how well the data fit a given hypothesis.
## Akaike Information Criterion (AIC)
AIC is an estimator of <u>relative</u> quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each other.
This way, AIC provides a means for model selection. AIC offers an estimate of the relative information lost when a given model is used.
This metric does not say anything about the absolute quality of a model but only serves for comparison between models. Therefore, if all the candidate models fit poorly to the data, AIC will not provide any warnings.
It is desired to pick the model with the lowest AIC.
AIC is formally defined as
$$
AIC = 2k - 2\ln{(\hat{L})}
$$
Where $k$ is the number of parameters estimated by the model and $\hat{L}$ is the maximized value of its likelihood function.
## Bayesian Information Criterion (BIC)
This metric is based on the likelihood function and is closely related to the Akaike information criterion. It is desired to pick the model with the lowest BIC.
BIC is formally defined as
$$
BIC = \ln{(n)}k - 2\ln{(\hat{L})}
$$
Where $\hat{L}$ is the maximized value of the likelihood function for the model $M$.
$$
\hat{L} = p(x | \hat{\theta}, M)
$$
$x$ is the observed data, $n$ is the number of observations, and $k$ is the number of parameters estimated.
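To make the two criteria concrete, here is a small sketch (my own, assuming NumPy and a Gaussian error model) that fits two nested ordinary least squares models and computes AIC and BIC for each:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1, x2 = rng.normal(size=(2, n))
y = 3 * x1 + rng.normal(size=n)          # x2 is an irrelevant feature

def aic_bic(X):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / n           # maximum-likelihood estimate of the error variance
    k = X.shape[1] + 1                   # regression coefficients plus the variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return 2 * k - 2 * loglik, np.log(n) * k - 2 * loglik

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([np.ones(n), x1, x2])
print("small model (AIC, BIC):", aic_bic(X_small))
print("big model   (AIC, BIC):", aic_bic(X_big))   # both criteria penalize the extra parameter
```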
### Properties of BIC
- It is independent of the prior
- It penalizes the complexity of the model in terms of the number of parameters
### Limitations of BIC
- Approximations are only valid for sample sizes much greater than the number of parameters (dense data)
- Cannot handle collections of models in high dimension
### Differences from AIC
AIC is mostly used to compare models against one another, while BIC asks whether the model resembles reality. Even though the two criteria have similar forms, they serve separate goals.
## Mallows's $C_p$
$C_p$ is used to assess the fit of a regression model that has been estimated using ordinary least squares. A small value of $C_p$ indicates that the model is relatively precise.
The $C_p$ of a model is defined as
$$
C_p = \frac{\sum_{i =1}^N{(Y_i - Y_{pi})^2}}{S^2}- N + 2P
$$
- $Y_{pi}$ is the predicted value of the $i$th observation of $Y$ from the $P$ regressors
- $S^2$ is the residual mean square after regression on the complete set of regressors and can be estimated by mean square error $MSE$,
- $N$ is the sample size.
An alternative definition is
$$
C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)
$$
- $RSS$ is the residual sum of squares
- $d$ is the number of predictors
- $\hat{\sigma}^2$ refers to an estimate of the variance of the error associated with each response in the linear model
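A short sketch of the first definition (my own illustration, assuming NumPy) that scores nested subsets of regressors against the full model:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p_full = 100, 5
X = rng.normal(size=(n, p_full))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # only the first two regressors matter

def rss(X_subset):
    beta, *_ = np.linalg.lstsq(X_subset, y, rcond=None)
    resid = y - X_subset @ beta
    return resid @ resid

S2 = rss(X) / (n - p_full)                          # residual mean square of the full model
for P in range(1, p_full + 1):
    Cp = rss(X[:, :P]) / S2 - n + 2 * P
    print(f"first {P} regressors: C_p = {Cp:.1f}")  # C_p close to P suggests a good subset
```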
## Deviance Information Criterion
The DIC is a hierarchical modeling generalization of the AIC and BIC. It is useful in Bayesian model selection problems where the posterior distributions of the models were <u>obtained by a Markov Chain Monte Carlo simulation</u>.
This method is only valid if the posterior distribution is approximately multivariate normal.
Let us define the deviance as
$$
D(\theta) = -2\log{(p(y|\theta))} + C
$$
Where $y$ is the data and $\theta$ are the unknown parameters of the model.
Let us define a helper variable $p_D$, the effective number of parameters, as follows
$$
p_D = \frac{1}{2}\hat{Var}(D(\theta))
$$
Finally the deviance information criterion can be calculated as
$$
DIC = D(\bar{\theta}) + 2p_D
$$
Where $\bar{\theta}$ is the posterior expectation of $\theta$.
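As a hedged illustration (my own, assuming NumPy and a toy normal-mean model with known variance), DIC can be computed directly from posterior samples:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(loc=2.0, scale=1.0, size=50)           # observed data

# Stand-in for MCMC output: posterior samples of theta for a N(theta, 1) model with a flat prior.
theta_samples = rng.normal(loc=y.mean(), scale=1 / np.sqrt(len(y)), size=5000)

def deviance(theta):
    # D(theta) = -2 log p(y | theta) for the N(theta, 1) likelihood (additive constant C dropped)
    return float(np.sum((y - theta) ** 2) + len(y) * np.log(2 * np.pi))

D_samples = np.array([deviance(t) for t in theta_samples])
p_D = 0.5 * D_samples.var()                           # effective number of parameters
DIC = deviance(theta_samples.mean()) + 2 * p_D
print("p_D:", p_D, " DIC:", DIC)
```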


@ -0,0 +1,73 @@
---
title: Dimensionality Reduction Independent Study Syllabus
---
Dimensionality reduction is the process of reducing the number of random variables under consideration. This study will last for 10 weeks, meeting twice a week for about an hour.
## Introduction to Dimensionality Reduction (0.5 Week)
- Motivations for dimensionality reduction
- Advantages of dimensionality reduction
- Disadvantages of dimensionality reduction
## Feature Selection (3 Weeks)
This is the process of selecting a subset of relevant features. The central premise of this technique is that many features are either redundant or irrelevant and thus can be removed without incurring much loss of information.
### Metaheuristic Methods (1.5 Weeks)
- Filter Method
- Wrapper Method
- Embedded Method
### Optimality Criteria (0.5 Weeks)
- Bayesian Information Criterion
- Mallows's $C_p$
- Akaike Information Criterion
### Other Feature Selection Techniques (1 Week)
- Subset Selection
- Minimum-Redundancy-Maximum-Relevance (mRMR) feature selection
- Global Optimization Formulations
- Correlation Feature Selection
### Applications of Metaheuristic Techniques (0.5 Weeks)
- Stepwise Regression
- Branch and Bound
## Feature Extraction (6 Weeks)
Feature extraction transforms the data in high-dimensional space to a space of fewer dimensions. In other words, feature extraction involves reducing the amount of resources required to describe a large set of data.
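For instance (a minimal sketch of my own, assuming scikit-learn rather than anything prescribed by the syllabus), principal component analysis projects the data onto the few directions of greatest variance:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)      # 8x8 digit images flattened to 64 features
pca = PCA(n_components=2).fit(X)
X_2d = pca.transform(X)                  # each image becomes a point in 2D
print("explained variance ratio:", pca.explained_variance_ratio_)
```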
### Linear Dimensionality Reduction (3 Weeks)
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)
- Non-Negative Matrix Factorization
- Linear Discriminant Analysis (LDA)
- Multidimensional Scaling (MDS)
- Canonical Correlation Analysis (CCA) [If Time Permits]
- Linear Independent Component Analysis [If Time Permits]
- Factor Analysis [If Time Permits]
### Non-Linear Dimensionality Reduction (3 Weeks)
One approach to this simplification is to assume that the data of interest lie on a non-linear manifold embedded within the higher-dimensional space.
- Kernel Principal Component Analysis
- Nonlinear Principal Component Analysis
- Generalized Discriminant Analysis (GDA)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Self-Organizing Map
- Multifactor Dimensionality Reduction (MDR)
- Isomap
- Locally-Linear Embedding
- Nonlinear Independent Component Analysis
- Sammon's Mapping [If Time Permits]
- Hessian Eigenmaps [If Time Permits]
- Diffusion Maps [If Time Permits]
- RankVisu [If Time Permits]