Find the simplest model to explain the data, ideally in an automatic way. Smaller models lead to less variable inference/prediction and can be more explainable than complex models.
Occam’s Razor principle: the smallest model that fits the data is best.
We want the simplest model that still fits well (a model that explains a large portion of the variability): use any of the criteria discussed before to choose (see the sketch below).
Parsimonious models tend to have less variability when used for prediction!
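As a small illustration of comparing candidate models with such criteria, here is a minimal R sketch (the names dataset, target, x1, x2, x3 are placeholders, not a specific example):

# Compare a smaller and a larger candidate model with AIC, BIC and adjusted R^2
m_small <- lm(target ~ x1, data = dataset)
m_large <- lm(target ~ x1 + x2 + x3, data = dataset)
AIC(m_small, m_large)            # lower is better
BIC(m_small, m_large)            # lower is better
summary(m_small)$adj.r.squared   # higher is better
summary(m_large)$adj.r.squared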
The most direct approach is called all subsets or best subsets regression: compute the least squares fit for all possible subsets of predictors and then choose between them based on some criterion. However, we often cannot examine all possible models, since there are $2^{p-1}$ of them and this number grows very quickly with the number of predictors.
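When the number of predictors is small, best subsets can still be run exhaustively; one option is the regsubsets() function from the leaps package. A minimal sketch, assuming a data frame dataset with response target:

# Best subsets with leaps::regsubsets (exhaustive search over all subsets)
library(leaps)
best <- regsubsets(target ~ ., data = dataset, nvmax = ncol(dataset) - 1)
best_summary <- summary(best)
best_summary$adjr2                         # adjusted R^2 of the best model of each size
best_summary$bic                           # BIC of the best model of each size
coef(best, which.min(best_summary$bic))    # coefficients of the model chosen by BIC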
For larger problems we instead need an automated approach that searches through a subset of them. We discuss some commonly used approaches next.
Idea: starting from the simplest model, we proceed by adding useful predictors. A predictor is added only if it improves the chosen criterion!
- Let $\mathcal{M}_{0}$ denote the null model, which contains no predictors.
- For $k = 0, 1, \dots, p-2$:
  - Consider all $p-1-k$ models that augment the predictors in $\mathcal{M}_{k}$ with one additional predictor.
  - Choose the best among these $p-1-k$ models and call it $\mathcal{M}_{k+1}$, where best is defined as the one that minimizes $SS_{RES}$.
- Select a single best model from among $\mathcal{M}_{0}, \dots, \mathcal{M}_{p-1}$ using cross-validated prediction error, AIC, BIC, or adjusted $R^{2}$.
The computational advantage over best subset selection is clear. However, forward selection is not guaranteed to find the best possible model out of all $2^{p-1}$ models containing subsets of the $p-1$ predictors.
null <- lm(target ~ 1, data = dataset)   # null model: intercept only
full <- lm(target ~ ., data = dataset)   # full model: all available predictors
n <- nrow(dataset)
# Forward search
# trace=0 gives only the final model
# FS - AIC
step(null, scope=list(lower=null, upper=full), direction="forward", k=2, trace=1)
# FS - BIC
step(null, scope=list(lower=null, upper=full), direction="forward", k=log(n), trace=1)
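To see what a single forward step does, add1() in base R evaluates every one-variable addition to the current model; a minimal sketch reusing the null and full objects defined above:

# One forward step "by hand": score every candidate addition to the null model
add1(null, scope = formula(full), test = "F")   # AIC and F-test for each candidate variable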
Idea: starting from the most complex model, we proceed by removing useless predictors. We remove predictors whose presence does not improve the chosen criterion!
- Let $\mathcal{M}_{p-1}$ denote the full model, which contains all $p-1$ predictors.
- For $k = p-1, p-2, \dots, 1$:
  - Consider all $k$ models that contain all but one of the predictors in $\mathcal{M}_{k}$, for a total of $k-1$ predictors each.
  - Choose the best among these $k$ models and call it $\mathcal{M}_{k-1}$, where best is defined as the one that minimizes $SS_{RES}$.
- Select a single best model from among $\mathcal{M}_{0}, \dots, \mathcal{M}_{p-1}$ using cross-validated prediction error, AIC, BIC, or adjusted $R^{2}$.
Like forward stepwise selection, the backward selection approach searches through only $1 + p(p-1)/2$ models rather than all $2^{p-1}$ subsets, so it is not guaranteed to find the best possible model.
Backward selection requires that the number of samples $n$ is larger than the number of variables $p-1$, so that the full model can be fit; forward selection can be used even when $n$ is smaller than the number of variables.
# Backward search
# BS - AIC
step(full, scope=list(lower=null, upper=full), direction="backward", k=2, trace=1)
# BS - BIC
step(full, scope=list(lower=null, upper=full), direction="backward", k=log(n), trace=1)
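Similarly, drop1() in base R shows what a single backward step considers by scoring the removal of each predictor from the current model; a minimal sketch using the full object defined above:

# One backward step "by hand": score every candidate deletion from the full model
drop1(full, test = "F")   # AIC and F-test for each dropped variable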
Stepwise search goes both backwards and forwards at every step: it considers the addition of any variable not currently in the model, as well as the removal of any variable currently in the model.
# We perform stepwise search using AIC as our criterion.
intermediate <- lm(target ~ x1 + x2, data = dataset)
step(intermediate,
scope = list(lower = null, upper=full),
direction="both", trace=1, k=2)
The step algorithm tends to perform badly when there is missing data: as predictors enter or leave the model, the number of usable observations n changes, so the likelihoods (and hence the AIC/BIC values) of different models are not comparable. Consequently, a variable with a lot of missing data may end up omitted as a predictor simply because of its missingness!
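A common workaround (a sketch, not a full treatment of missing data) is to run the search on complete cases only, so that every candidate model is fit on the same n observations:

# Run the search on complete cases so that all candidate models use the same n
dataset_cc <- na.omit(dataset)
null_cc <- lm(target ~ 1, data = dataset_cc)
full_cc <- lm(target ~ ., data = dataset_cc)
step(null_cc, scope = list(lower = null_cc, upper = full_cc), direction = "forward", k = 2, trace = 0)

Note that complete-case analysis discards information and can itself bias the results when the data are not missing completely at random.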
- These approaches are computationally expensive.
- Automatic methods are tempting but might not yield the "best" model.
- Models still need to be assessed - for example, checking linearity and the other assumptions (see also later).
- Not all methods get to the same final result!
- Stepwise variable selection tends to pick models that are smaller than desirable for prediction purposes.
- One variable being left out of the model does not mean it is unrelated to Y: it means other variables relate more strongly or explain the same pattern (for the dataset under study).
- We select a model that predicts well. This does not mean the model uncovers a true relationship between real-world variables.
- We use the data twice: to select the model and then to do inference - that's cheating! We would be overconfident in our results and end up with overfitting models.
- The theoretical results regarding the variability of the model do not take into account the additional variability introduced by the model selection step; we derive it from the error term only. This is called the inference-after-selection (post-selection inference) problem.
Some issues:
- All statistical theory for the regression models is derived assuming the model is fixed and known before seeing the data.
- The evaluation of the uncertainty (errors) is usually not precise
- We want to know how wrong we are!
- If we use the data to select a model we will be selecting the "best" variables: significance will, for example, be overstated.
- There is no theoretical way to ensure the selected model is the best.
- The data is random, and model selection depends on the data, so it becomes random: the statistical formulas we use do not account for this additional randomness.
- This is a really complicated issue: even plotting the data to "have a look" is using the data twice.
- Ideally we should not look at the data first, we should write the code and test on sample data. Then we should proceed with real data to avoid double dipping.
How can we solve this?
- If you are only interested in the "best model" you can disregard this issue, but typically we are also interested in inference.
- Use more advanced statistical theory (we don’t see that here)
- Data splitting (assuming you have independent observations): choose the model using one subset of the data and estimate its final form using the remaining data points (see the sketch below).
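A minimal sketch of data splitting with the same placeholder dataset, assuming independent observations:

# Split the data: one half to choose the model, the other half for the final fit and inference
set.seed(1)
n <- nrow(dataset)
idx <- sample(n, size = floor(n / 2))
train <- dataset[idx, ]
test  <- dataset[-idx, ]

null_tr <- lm(target ~ 1, data = train)
full_tr <- lm(target ~ ., data = train)
selected <- step(null_tr, scope = list(lower = null_tr, upper = full_tr), direction = "forward", k = 2, trace = 0)

# Refit the selected formula on the held-out half: the inference here is not contaminated by the selection
final <- lm(formula(selected), data = test)
summary(final)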