Optimality Criteria
Falling under wrapper methods, optimality criteria are often used to aid in model selection. These criteria measure how well a given hypothesis (model) fits the data.
Akaike Information Criterion (AIC)
AIC is an estimator of the relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model relative to each of the other models.
This way, AIC provides a means for model selection. AIC offers an estimate of the relative information lost when a given model is used.
This metric says nothing about the absolute quality of a model; it only serves for comparison between models. Therefore, if all the candidate models fit the data poorly, AIC will give no warning of that.
The model with the lowest AIC is preferred.
AIC is formally defined as
AIC = 2k - 2\ln{(\hat{L})}
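As a rough illustration of the formula, the sketch below scores polynomial fits of increasing degree. It assumes Gaussian errors, so the maximized log-likelihood \ln{(\hat{L})} can be computed from the residuals using the maximum likelihood variance estimate RSS/n, and it counts k as the polynomial coefficients plus the noise variance. The data, the helper names `gaussian_max_log_likelihood` and `aic`, and the range of degrees are all made up for this example.

```python
import numpy as np

def gaussian_max_log_likelihood(residuals):
    """Maximized log-likelihood ln(L-hat) under a Gaussian error model.

    Assumption: the noise variance is set to its ML estimate, RSS / n.
    """
    n = residuals.size
    sigma2 = np.sum(residuals**2) / n
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic(k, log_lik):
    """AIC = 2k - 2 ln(L-hat)."""
    return 2 * k - 2 * log_lik

# Synthetic data from a quadratic trend plus noise (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 60)
y = 1.0 - 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.3, size=x.size)

scores = {}
for degree in range(1, 6):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    k = degree + 2  # degree + 1 coefficients, plus the noise variance
    scores[degree] = aic(k, gaussian_max_log_likelihood(residuals))

best_degree = min(scores, key=scores.get)
for degree, score in scores.items():
    print(f"degree {degree}: AIC = {score:.2f}")
print("lowest AIC at degree", best_degree)
```

Only the differences between the scores matter here: adding the same constant to every log-likelihood would leave the ranking unchanged.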
Bayesian Information Criterion (BIC)
This metric is based on the likelihood function and is closely related to the Akaike information criterion. As with AIC, the model with the lowest BIC is preferred.
BIC is formally defined as
BIC = \ln{(n)}k - 2\ln{(\hat{L})}
Where \hat{L} is the maximized value of the likelihood function for the model M.
\hat{L} = p(x | \hat{\theta}, M)
x is the observed data, n is the number of observations, and k is the number of parameters estimated.
Properties of BIC
- It is independent of the prior
- It penalizes the complexity of the model in terms of the number of parameters
Limitations of BIC
- Approximations are only valid for sample sizes much greater than the number of parameters (dense data)
- Cannot handle collections of models in high dimension
Differences from AIC
AIC is mostly used to compare candidate models against each other, while BIC asks whether or not a model resembles reality. The two criteria have similar forms, differing only in the penalty term (2k versus \ln{(n)}k), but they pursue separate goals.
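Because the BIC penalty per parameter is \ln{(n)} rather than 2, the two criteria can disagree once n is large enough. The sketch below evaluates both formulas for two candidates; the parameter counts and maximized log-likelihood values are invented for illustration and do not come from any real fit.

```python
import math

def aic(k, log_lik):
    """AIC = 2k - 2 ln(L-hat)."""
    return 2 * k - 2 * log_lik

def bic(k, n, log_lik):
    """BIC = ln(n) k - 2 ln(L-hat)."""
    return math.log(n) * k - 2 * log_lik

n = 200  # number of observations (hypothetical)

# name -> (k, maximized log-likelihood); hypothetical values
candidates = {"small": (3, -310.0), "large": (8, -304.0)}

results = {
    name: {"AIC": aic(k, ll), "BIC": bic(k, n, ll)}
    for name, (k, ll) in candidates.items()
}

for name, row in results.items():
    print(f"{name}: AIC = {row['AIC']:.2f}, BIC = {row['BIC']:.2f}")

print("AIC prefers:", min(results, key=lambda m: results[m]["AIC"]))
print("BIC prefers:", min(results, key=lambda m: results[m]["BIC"]))
```

With these numbers the larger model gains 6 units of log-likelihood for 5 extra parameters, which is enough to win under the AIC penalty but not under the heavier BIC penalty.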
Mallows's C_p
C_p is used to assess the fit of a regression model that has been estimated using ordinary least squares. A small value of C_p indicates that the model is relatively precise.
The C_p of a model is defined as
C_p = \frac{\sum_{i =1}^N{(Y_i - Y_{pi})^2}}{S^2}- N + 2P
where
- Y_{pi} is the predicted value of the i-th observation of Y from the P regressors,
- S^2 is the residual mean square after regression on the complete set of regressors and can be estimated by the mean square error MSE,
- N is the sample size.
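The definition above can be sketched with ordinary least squares in NumPy. This is a minimal example under stated assumptions: P counts the columns of the design matrix, including the intercept column, and S^2 is taken as the residual sum of squares of the full model divided by its residual degrees of freedom. The data and the helper names `rss` and `mallows_cp` are made up for the example.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of the OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def mallows_cp(X_full, y, cols):
    """C_p for the submodel using the given columns of X_full.

    Assumption: P counts every column used, including the intercept.
    """
    N, P_full = X_full.shape
    s2 = rss(X_full, y) / (N - P_full)  # residual mean square, full model
    P = len(cols)
    return rss(X_full[:, cols], y) / s2 - N + 2 * P

# Synthetic data: only regressors 1 and 2 actually influence y.
rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=(N, 4))])
y = 1.0 + 2.0 * X[:, 1] - 1.5 * X[:, 2] + rng.normal(size=N)

for cols in ([0, 1], [0, 1, 2], [0, 1, 2, 3], [0, 1, 2, 3, 4]):
    print(cols, round(mallows_cp(X, y, cols), 2))
```

Under this convention the full model always has C_p equal to its own P, so the criterion is informative only for proper submodels. A submodel that omits an important regressor shows a C_p far above its P.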
An alternative definition is
C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)
where
- RSS is the residual sum of squares,
- d is the number of predictors,
- \hat{\sigma}^2 is an estimate of the variance associated with each response in the linear model,
- n is the sample size (N in the first definition).
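The two definitions differ only by a rescaling, which a short derivation makes explicit. It assumes that both use the same variance estimate (S^2 = \hat{\sigma}^2), that P = d and N = n, and that RSS = \sum_{i=1}^N (Y_i - Y_{pi})^2. Here C_p on the left-hand side denotes the first definition.

```latex
\begin{aligned}
\frac{\hat{\sigma}^2}{n}\left(C_p + n\right)
  &= \frac{\hat{\sigma}^2}{n}\left(\frac{RSS}{\hat{\sigma}^2} - n + 2d + n\right) \\
  &= \frac{1}{n}\left(RSS + 2d\hat{\sigma}^2\right)
\end{aligned}
```

Since \hat{\sigma}^2 / n is positive and the same for every candidate model, both forms rank the candidates in the same order.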
Deviance Information Criterion
The DIC is a hierarchical modeling generalization of the AIC and BIC. It is useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation.
This method is only valid if the posterior distribution is approximately multivariate normal.
Let us define the deviance as
D(\theta) = -2\log{(p(y|\theta))} + C
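Where y is the data, \theta are the unknown parameters of the model, and C is a constant that cancels out when different models are compared.
Let us define a helper variable p_D, the effective number of parameters of the model, as the following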
p_D = \frac{1}{2}\hat{Var}(D(\theta))
Finally the deviance information criterion can be calculated as
DIC = D(\bar{\theta}) + 2p_D
Where \bar{\theta} is the posterior expectation of \theta.
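The three formulas above can be sketched on a toy model. The example assumes observations y_i ~ Normal(\theta, 1) with a flat prior on \theta, so the posterior is Normal(mean(y), 1/n) and can be sampled directly; those draws are only a stand-in for real MCMC output. The constant C is set to 0, and the names `deviance`, `theta_samples` and `p_d` are made up for the example.

```python
import numpy as np

def deviance(theta, y, sigma=1.0):
    """D(theta) = -2 log p(y | theta) for y_i ~ Normal(theta, sigma^2), C = 0."""
    n = y.size
    return n * np.log(2 * np.pi * sigma**2) + np.sum((y - theta) ** 2) / sigma**2

rng = np.random.default_rng(1)
y = rng.normal(loc=0.5, scale=1.0, size=50)  # synthetic observations

# Stand-in for MCMC draws: with a flat prior and known sigma = 1,
# the posterior of theta is Normal(mean(y), 1 / n).
theta_samples = rng.normal(
    loc=y.mean(), scale=1.0 / np.sqrt(y.size), size=50_000
)

# Deviance evaluated at every posterior draw.
d_samples = np.array([deviance(t, y) for t in theta_samples])

p_d = 0.5 * d_samples.var(ddof=1)        # p_D = (1/2) Var(D(theta))
theta_bar = theta_samples.mean()         # posterior mean of theta
dic = deviance(theta_bar, y) + 2 * p_d   # DIC = D(theta_bar) + 2 p_D

print(f"p_D = {p_d:.3f}")
print(f"DIC = {dic:.3f}")
```

For this model the deviance over the posterior is a constant plus a chi-squared variable with one degree of freedom, so p_D should come out close to 1, matching the single free parameter.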