The Most Important Lesson You'll Ever Learn About
Multiple Linear Regression Analysis

Gerard E. Dallal, Ph.D.

There are two main reasons for fitting a multiple linear regression: to predict an outcome, and to investigate the role of individual predictors.*

When a model is being developed for prediction, then, with only slight exaggeration, it doesn't matter how it was obtained or what variables are in it. If it turned out that the prevalence of a certain disease could be accurately predicted by newspaper sales of the previous week, we wouldn't worry about it. Instead, we'd monitor newspaper sales very closely.

Any number of techniques have been developed for building a predictive model. These include the stepwise procedures of forward selection regression, backwards elimination regression, and stepwise regression, along with looking at all possible regressions. In addition, there are the techniques of exploratory data analysis developed by John Tukey and others.

Questions about the role of individual predictors are different from questions of pure prediction. Unlike the prediction problem, where the model is generally not known in advance, identifying the role of a particular predictor generally involves a very few models carefully crafted at the start of the study. It usually involves comparing two models--one model that includes the predictor being investigated and another that does not. The research question can usually be restated as whether the model including the predictor under study better predicts the outcome than the model that excludes it.
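One way to picture the two-model comparison is a partial F test on nested least-squares fits. The sketch below is only an illustration under stated assumptions: the data are simulated, the names `x1`, `x2`, `rss`, and `partial_f` are hypothetical, and `x2` plays the part of the predictor under study.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def partial_f(X_reduced, X_full, y):
    """F statistic for the extra columns in X_full relative to X_reduced."""
    n = len(y)
    q = X_full.shape[1] - X_reduced.shape[1]   # number of added predictors
    df_resid = n - X_full.shape[1]             # residual df of the full model
    rss_r, rss_f = rss(X_reduced, y), rss(X_full, y)
    return ((rss_r - rss_f) / q) / (rss_f / df_resid)

# Simulated data (an assumption for illustration): x2 is the predictor under study.
rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_reduced = np.column_stack([ones, x1])        # model that excludes x2
X_full = np.column_stack([ones, x1, x2])       # model that includes x2

f_stat = partial_f(X_reduced, X_full, y)
print(f"partial F for x2 (1 and {n - 3} df): {f_stat:.2f}")
```

The statistic would be referred to an F distribution with 1 and 97 degrees of freedom, whose 5% critical value is roughly 3.94. That reference distribution is appropriate here only because both models were written down before the data were examined.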

There is other terminology that can be used to describe the two types of modeling.

The most important lesson you'll ever learn about multiple linear regression analysis is well-stated by Chris Chatfield in "Model Uncertainty, Data Mining and Statistical Inference", Journal of the Royal Statistical Society, Series A, 158 (1995), 419-486 (p 421),

It is "well known" to be "logically unsound and practically misleading" to make inference as if a model is known to be true when it has, in fact, been selected from the same data to be used for estimation purposes.

or, to put it another way,

NEVER MIX THE TWO APPROACHES!

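One of the most serious but all-too-common MISUSES of inferential statistics is to select a model by examining the data and then draw inferences from it as though it had been specified in advance.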

If a model is built from the ground up, there are some things that might be said about its overall predictive capability, but there is little that can be said about the individual components. If you find a paper in which the authors use a model building technique such as stepwise regression and treat the resulting models and coefficients as though the model had been specified in advance, be afraid, be very afraid!
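A small simulation can make the danger concrete. The sketch below is a simplified stand-in for stepwise selection, not a full implementation: it generates an outcome and 20 candidate predictors that are all pure noise, keeps the single predictor most correlated with the outcome (in effect, the first step of forward selection), and then tests that predictor as though it had been chosen in advance. The sample size, the number of candidates, and the function names are assumptions made for illustration.

```python
import math
import random

def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def best_predictor_t(n, k, rng):
    """Pick the noise predictor most correlated with y; return its |t|."""
    y = [rng.gauss(0, 1) for _ in range(n)]
    best_r = 0.0
    for _ in range(k):
        x = [rng.gauss(0, 1) for _ in range(n)]
        r = corr(x, y)
        if abs(r) > abs(best_r):
            best_r = r
    # t statistic for the slope in a simple regression of y on the chosen x
    return abs(best_r) * math.sqrt((n - 2) / (1 - best_r ** 2))

rng = random.Random(1)
n, k, n_sims = 50, 20, 500
T_CRIT = 2.0106   # two-sided 5% point of t with n - 2 = 48 df

hits = sum(best_predictor_t(n, k, rng) > T_CRIT for _ in range(n_sims))
rate = hits / n_sims
print(f"selected predictor 'significant' in {rate:.0%} of runs (nominal 5%)")
```

With 20 independent noise predictors, the chance that at least one clears the 5% threshold is about 1 - 0.95^20, or 64%, so the simulated rate should land far above the nominal 5% even though no predictor is related to the outcome.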

This issue is not new. Chatfield gives references to it dating back nearly 30 years. Yet, many practicing data analysts do not fully appreciate the problem, as can be seen by looking at published scientific literature.

----------

*In previous versions of this note, I'd called this "questions about mechanism". However, that was a poor choice because mechanism is too closely linked to causality. This isn't about cause but, rather, about the role of a particular predictor in making a prediction.


Copyright © 2006 Gerard E. Dallal