The Most Important Lesson You'll Ever Learn About
Multiple Linear Regression Analysis

Gerard E. Dallal, Ph.D.

There are two main reasons for fitting a multiple linear regression: to predict an outcome, and to investigate the mechanism that produces it.

When a model is being developed for prediction, then, with only slight exaggeration, it hardly matters how it was obtained or what variables are in it. If the variables were selected by throwing darts at a dart board and it turned out that the prevalence of a certain disease could be accurately predicted by newspaper sales of the previous week, we wouldn't worry about it. Instead, we'd monitor newspaper sales very closely.

Any number of techniques have been developed for building a predictive model. These include the stepwise procedures of forward selection regression, backwards elimination regression, and stepwise regression, along with looking at all possible regressions. In addition, there are the techniques of exploratory data analysis developed by John Tukey and others.

Questions about mechanism are different from questions of pure prediction. The data might arise from a designed experiment or even from an observational study, but the question usually involves the effect of a particular predictor, or set of predictors, on a specified outcome. Sometimes the question is not about effects but associations. For example,

Such questions might be approached by using observational data or through a randomized, double-blind, placebo-controlled trial. As with prediction, there are specific techniques for answering these questions, too. Unlike the prediction problem, where the model is generally not known in advance, the search for mechanisms generally involves a very few models carefully crafted at the start of the study. It usually involves comparing two models--one model that includes the mechanism being investigated and another that does not. The research question can usually be restated as whether the model including the mechanism better predicts the outcome than the model that excludes it.
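The two-model comparison can be sketched with a partial F statistic, which measures how much the residual sum of squares drops when the mechanism term is added. The sketch below is only an illustration under stated assumptions: the data are simulated, the variable names (covariate, exposure, outcome) are hypothetical, and both models are taken to have been specified before looking at the data.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def partial_f(X_reduced, X_full, y):
    """F statistic for the extra columns of X_full; X_reduced must be nested in it."""
    n = len(y)
    p_reduced = X_reduced.shape[1]
    p_full = X_full.shape[1]
    rss_reduced = rss(X_reduced, y)
    rss_full = rss(X_full, y)
    numerator = (rss_reduced - rss_full) / (p_full - p_reduced)
    denominator = rss_full / (n - p_full)
    return numerator / denominator

# Simulated data (an assumption for illustration): the outcome depends on
# both a background covariate and the exposure of interest.
rng = np.random.default_rng(0)
n = 200
covariate = rng.normal(size=n)
exposure = rng.normal(size=n)
outcome = 1.0 + 0.5 * covariate + 0.8 * exposure + rng.normal(size=n)

intercept = np.ones(n)
X_reduced = np.column_stack([intercept, covariate])         # excludes the mechanism
X_full = np.column_stack([intercept, covariate, exposure])  # includes the mechanism

f_stat = partial_f(X_reduced, X_full, outcome)
print(f"F = {f_stat:.1f} on 1 and {n - X_full.shape[1]} degrees of freedom")
```

The statistic would be referred to an F distribution with 1 and n - 3 degrees of freedom. That reference distribution is justified only because neither model was chosen by looking at the outcome.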

There is other terminology that can be used to describe the two situations.

The most important lesson you'll ever learn about multiple linear regression analysis is well stated by Chris Chatfield in "Model Uncertainty, Data Mining and Statistical Inference", Journal of the Royal Statistical Society, Series A, 158 (1995), 419-486 (p. 421):

It is "well known" to be "logically unsound and practically misleading" to make inference as if a model is known to be true when it has, in fact, been selected from the same data to be used for estimation purposes.
or, to put it another way,

NEVER MIX THE TWO APPROACHES!

One of the most serious but all-too-common MISUSES of inferential statistics is to

If a model is built from the ground up, there are some things that might be said about its overall predictive capability, but there is little that can be said about the individual coefficients. If you find a paper in which the authors use a model building technique such as stepwise regression and treat the resulting models and coefficients as though the model had been specified in advance, be afraid, be very afraid!
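A small simulation can make the danger concrete. The sketch below is an illustration, not a method from Chatfield's paper. It makes three assumptions: the outcome and all 20 candidate predictors are independent noise, the "model building" step is simply keeping the candidate most correlated with the outcome, and the chosen slope is then tested at the 5% level as though it had been specified in advance. With 20 independent candidates, the chance that at least one passes such a test is 1 - 0.95^20, or roughly 64%, rather than 5%.

```python
import math
import random

def pearson_r(x, y):
    """Sample correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def best_candidate_is_significant(rng, n_obs, n_candidates, t_critical):
    """Pick the noise predictor most correlated with a noise outcome,
    then test its slope as if it had been chosen before seeing the data."""
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    best_abs_r = 0.0
    for _ in range(n_candidates):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        best_abs_r = max(best_abs_r, abs(pearson_r(x, y)))
    # t statistic for the slope in a one-predictor regression.
    t = best_abs_r * math.sqrt((n_obs - 2) / (1 - best_abs_r ** 2))
    return t > t_critical

rng = random.Random(2006)
N_OBS, N_CANDIDATES, N_SIMS = 50, 20, 500
T_CRITICAL = 2.0106  # two-sided 5% critical value of t with 48 degrees of freedom

hits = sum(
    best_candidate_is_significant(rng, N_OBS, N_CANDIDATES, T_CRITICAL)
    for _ in range(N_SIMS)
)
false_positive_rate = hits / N_SIMS
print(f"Selected noise predictor declared significant in {false_positive_rate:.0%} of runs")
```

Nothing here is related to anything else, yet the selected predictor looks "significant" in well over half the runs. Stepwise procedures search a larger space than this single-predictor sketch, so the sketch should not be read as a measure of their exact error rate.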

This issue is not new. Chatfield gives references to it dating back nearly 30 years. Yet, many practicing data analysts do not fully appreciate the problem, as can be seen by looking at published scientific literature.


Copyright © 2006 Gerard E. Dallal