**The Most Important Lesson You'll Ever Learn About Multiple Linear Regression Analysis**

Gerard E. Dallal, Ph.D.

There are two main reasons for fitting a multiple linear regression:

**prediction** and **understanding the contribution of a particular predictor**.^{*}

When a model is being developed for **prediction**, then, with only
slight exaggeration, it doesn't matter how it was obtained or what variables
are in it. If it turned out that the prevalence of a certain disease could be
accurately predicted by newspaper sales of the previous week, we wouldn't
worry about it. Instead, we'd monitor newspaper sales *very closely*.

Any number of techniques have been developed for building a predictive model. These include the stepwise procedures of forward selection regression, backwards elimination regression, and stepwise regression, along with looking at all possible regressions. In addition, there are the techniques of exploratory data analysis developed by John Tukey and others.
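As an illustration of one such technique, forward selection can be sketched in a few lines. This is a toy numpy implementation, not the code of any particular package, and the simulated data and column indices are made up for the example:

```python
import numpy as np

def rss(X, y):
    """Least-squares fit; return the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def forward_select(X, y, k):
    """Greedily add, k times, the predictor that most reduces the RSS."""
    n, p = X.shape
    chosen = []
    current = np.ones((n, 1))            # start from the intercept-only model
    for _ in range(k):
        remaining = [j for j in range(p) if j not in chosen]
        scores = {j: rss(np.column_stack([current, X[:, j]]), y)
                  for j in remaining}
        best = min(scores, key=scores.get)
        chosen.append(best)
        current = np.column_stack([current, X[:, best]])
    return chosen

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 6))              # six candidate predictors
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + rng.normal(size=n)

print(forward_select(X, y, 2))           # expected to pick columns 1 and 4
```

Backward elimination works the same way in reverse, starting from the full model and greedily dropping the predictor whose removal increases the RSS least.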

Questions about **the role of individual predictors** are different from
questions of pure prediction. Unlike the prediction problem where the model
is generally not known in advance, identifying the role of a particular
predictor generally involves a very few models carefully crafted at the start
of the study. It usually involves comparing two models--one that
includes the predictor being investigated and another that does not.
The research question can usually be restated as whether the model including
the predictor under study better predicts the outcome than the model that
excludes it.
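This comparison of nested models can be sketched with plain numpy on simulated data. The variable names, the simulated coefficients, and the partial F computation below are illustrative assumptions, not anything taken from the note itself:

```python
import numpy as np

def fit_rss(X, y):
    """Least-squares fit; return the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)                  # predictor kept in both models
x2 = rng.normal(size=n)                  # the predictor under study
y = 2.0 + 1.5 * x1 + 0.8 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_full = np.column_stack([ones, x1, x2])     # model including x2
X_reduced = np.column_stack([ones, x1])      # model excluding x2

rss_full = fit_rss(X_full, y)
rss_reduced = fit_rss(X_reduced, y)

# Partial F statistic for the one added predictor:
# (drop in RSS per added parameter) / (full-model error variance)
F = (rss_reduced - rss_full) / (rss_full / (n - X_full.shape[1]))
print(round(F, 1))
```

A large F says the model including the predictor fits the data markedly better than the model that excludes it; comparing F to an F(1, n-3) reference distribution turns this into a formal test.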

- If the analysis involves **observational data**, the models can be used to determine whether the predictor is **associated** with the response.
- If the analysis involves data from a **randomized trial**, the models can be used to determine whether the predictor **affects** the outcome. (For example, the predictor might be 0/1 depending on whether a subject receives placebo or the active treatment.)

There is other terminology that can be used to describe the two types of modeling.

- The prediction problem might also be called *model building* to emphasize that the analyst, at the outset, is not sure of what will be included in the model, that is, **the form of the model is unknown**.
- The problem of identifying the role of a specific predictor might be called *inferential modeling* since its purpose is to conduct formal statistical inference about a specific predictor, that is, **the form of the model is assumed known but there is some uncertainty about the coefficients.**

The most important lesson you'll ever learn about multiple linear regression analysis is well-stated by Chris Chatfield in "Model Uncertainty, Data Mining and Statistical Inference", Journal of the Royal Statistical Society, Series A, 158 (1995), 419-486 (p 421): it is "well known" to be "logically unsound and practically misleading" to make inference as if a model is known to be true when it has, in fact, been selected from the *same* data to be used for estimation purposes.

One of the **most serious** but all-too-common **misuses** of
inferential statistics is to

- take a model that was developed through exploratory model building and
- subject it to the same sorts of statistical tests that are used to validate a model that was specified in advance.
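A small simulation makes the distortion concrete. Here the response and twenty candidate predictors are all pure noise; an analyst who "selects" the most impressive predictor and then tests it on the same data will declare significance far more often than the nominal 5%. The sample sizes and critical value below are assumptions chosen for the illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 30, 20, 1000
# two-sided 5% critical value for |r| at n=30 (t_{0.975,28} = 2.048)
r_crit = 2.048 / np.sqrt(2.048**2 + (n - 2))

hits = 0
for _ in range(trials):
    y = rng.normal(size=n)
    X = rng.normal(size=(n, p))          # 20 predictors, all pure noise
    y_c = y - y.mean()
    X_c = X - X.mean(axis=0)
    r = X_c.T @ y_c / (np.linalg.norm(X_c, axis=0) * np.linalg.norm(y_c))
    if np.max(np.abs(r)) > r_crit:       # "select", then test on the same data
        hits += 1

print(hits / trials)   # near 1 - 0.95**20 ≈ 0.64, not the nominal 0.05
```

Roughly two-thirds of these all-noise datasets yield a "significant" predictor, because the selection step has already searched the data for the most extreme correlation before the test is applied.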

This issue is not new. Chatfield gives references to it dating back nearly 30 years. Yet, many practicing data analysts do not fully appreciate the problem, as can be seen by looking at published scientific literature.

----------

^{*}In previous versions of this note, I'd called
this "questions about **mechanism**". However, that was a poor choice
because *mechanism* is too closely linked to *causality*. This
isn't about cause but, rather, about the *role* of a particular predictor
in making a prediction.