Optimality Criteria
Falling under wrapper methods, optimality criteria are often used to aid in model selection. These criteria provide a measure of how well a given hypothesis fits the data.
Akaike Information Criterion (AIC)
AIC is an estimator of the relative quality of statistical models for a given set of data. Given a collection of candidate models for the data, AIC estimates the quality of each model relative to the others.
In this way, AIC provides a means for model selection: it estimates the relative amount of information lost when a given model is used to represent the data.
This metric does not say anything about the absolute quality of a model; it only serves for comparison between models. Therefore, if all the candidate models fit the data poorly, AIC will not provide any warning.
It is desired to pick the model with the lowest AIC.
AIC is formally defined as
AIC = 2k - 2\ln{(\hat{L})}
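For illustration, a minimal sketch in Python of evaluating this formula from a model's maximized log-likelihood \ln(\hat{L}) and parameter count k (the Gaussian model and the variable names are assumptions of the example, not part of the definition):

```python
import numpy as np

def aic(log_likelihood, k):
    """Akaike Information Criterion: AIC = 2k - 2*ln(L_hat)."""
    return 2 * k - 2 * log_likelihood

# Hypothetical example: a Gaussian model fitted to data by maximum likelihood,
# so k = 2 (mean and standard deviation).
rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)
mu_hat, sigma_hat = x.mean(), x.std()
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma_hat**2)
                 - (x - mu_hat)**2 / (2 * sigma_hat**2))
print(aic(log_lik, k=2))   # lower values indicate a better candidate model
```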
Bayesian Information Criterion (BIC)
This metric is based on the likelihood function and is closely related to the Akaike information criterion. It is desired to pick the model with the lowest BIC.
BIC is formally defined as
BIC = \ln{(n)}k - 2\ln{(\hat{L})}
Where \hat{L} is the maximized value of the likelihood function for the model M, that is,
\hat{L} = p(x | \hat{\theta}, M)
Here x is the observed data, n is the number of observations, and k is the number of parameters estimated.
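A comparable sketch for BIC, using hypothetical log-likelihoods and parameter counts for two candidate models (the numbers are made up purely for illustration):

```python
import numpy as np

def bic(log_likelihood, n, k):
    """Bayesian Information Criterion: BIC = ln(n)*k - 2*ln(L_hat)."""
    return np.log(n) * k - 2 * log_likelihood

# Candidate models are compared by their BIC; the lowest value is preferred.
# Hypothetical example: two fitted models with log-likelihoods -240.3 and -236.1
# on n = 100 observations, using 2 and 4 parameters respectively.
print(bic(-240.3, n=100, k=2))   # simpler model
print(bic(-236.1, n=100, k=4))   # more complex model; its extra parameters are
                                 # penalized more heavily than under AIC
```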
Properties of BIC
- It is independent of the prior
- It penalizes the complexity of the model in terms of the number of parameters
Limitations of BIC
- Its approximations are only valid for sample sizes much greater than the number of parameters (dense data)
- It cannot handle collections of models in high-dimensional settings
Differences from AIC
AIC is mostly used to compare models against one another, whereas BIC asks whether or not a model resembles reality. Even though the two criteria have similar functional forms, they serve separate goals.
Mallows's C_p
C_p is used to assess the fit of a regression model that has been estimated using ordinary least squares. A small value of C_p indicates that the model is relatively precise.
The C_p of a model is defined as
C_p = \frac{\sum_{i=1}^{N}{(Y_i - Y_{pi})^2}}{S^2} - N + 2P
Where
- Y_{pi} is the predicted value of the ith observation of Y from the P regressors,
- S^2 is the residual mean square after regression on the complete set of regressors and can be estimated by the mean squared error MSE,
- N is the sample size.
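A minimal sketch of this definition on synthetic data, using ordinary least squares via NumPy; the data, the choice of a subset with P = 2 regressors, and the degrees-of-freedom convention used for S^2 are assumptions of the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                        # three candidate regressors
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=50)

def ols_fitted(X, y):
    """Ordinary least squares with an intercept; returns fitted values."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ beta

y_full = ols_fitted(X, y)          # regression on the complete set of regressors
y_sub = ols_fitted(X[:, :2], y)    # candidate model with P = 2 regressors

N = len(y)
S2 = np.sum((y - y_full) ** 2) / (N - X.shape[1] - 1)   # full-model residual mean square
P = 2
Cp = np.sum((y - y_sub) ** 2) / S2 - N + 2 * P
print(Cp)                          # smaller values indicate a relatively precise model
```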
An alternative definition is
C_p = \frac{1}{n}(RSS + 2d\hat{\sigma}^2)
Where
- RSS is the residual sum of squares,
- d is the number of predictors,
- \hat{\sigma}^2 is an estimate of the variance associated with each response in the linear model.
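The alternative form can be evaluated in the same way; a small sketch with hypothetical inputs (the values of RSS, d, \hat{\sigma}^2, and n below are placeholders):

```python
def mallows_cp_alt(rss, d, sigma2_hat, n):
    """Alternative form: C_p = (RSS + 2 * d * sigma_hat^2) / n."""
    return (rss + 2 * d * sigma2_hat) / n

# Hypothetical numbers: RSS of the candidate model, d predictors,
# an error-variance estimate sigma_hat^2, and n observations.
print(mallows_cp_alt(rss=52.7, d=2, sigma2_hat=1.1, n=50))
```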
Deviance Information Criterion
The DIC is a hierarchical-modeling generalization of the AIC and BIC. It is useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation.
This method is only valid if the posterior distribution is approximately multivariate normal.
Let us define the deviance as
D(\theta) = -2\log{(p(y|\theta))} + C
Where y is the data, \theta are the unknown parameters of the model, and C is a constant that cancels out when comparing models and therefore does not need to be known.
Let us define a helper variable p_D, the effective number of parameters, as the following
p_D = \frac{1}{2}\hat{Var}(D(\theta))
Finally the deviance information criterion can be calculated as
DIC = D(\bar{\theta}) + 2p_D
Where \bar{\theta} is the expectation of \theta.
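A minimal sketch of the whole procedure, assuming a one-parameter Gaussian likelihood with unit variance and posterior draws standing in for an MCMC sample (both assumptions are made only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def deviance(theta, y):
    """D(theta) = -2 * log p(y | theta), here for a Gaussian likelihood with sigma = 1."""
    return -2 * np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - theta) ** 2)

def dic(posterior_samples, y):
    """DIC = D(theta_bar) + 2 * p_D, with p_D = 0.5 * Var(D(theta))."""
    deviances = np.array([deviance(t, y) for t in posterior_samples])
    p_d = 0.5 * np.var(deviances)              # effective number of parameters
    theta_bar = np.mean(posterior_samples)     # posterior expectation of theta
    return deviance(theta_bar, y) + 2 * p_d

# In practice posterior_samples would come from an MCMC run; here the exact
# Gaussian posterior (flat prior, known sigma = 1) is drawn directly instead.
y = rng.normal(loc=0.3, scale=1.0, size=40)
posterior_samples = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(len(y)), size=2000)
print(dic(posterior_samples, y))   # lower DIC indicates a better candidate model
```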