**Introduction to Simple Linear Regression**

Gerard E. Dallal, Ph.D.

How would you
characterize this display of muscle strength^{1} against lean
body mass? Those who have more lean body mass tend to be stronger. The
relationship isn't perfect. It's easy to find two people where the one
with more lean body mass is the weaker, but in general strength and lean
body mass tend to go up and down together. *Comment*: When two
variables are displayed in a scatterplot and one can be thought of as a
response to the other (here, muscles produce strength), standard practice
is to place the response on the vertical (or Y) axis. The names of the
variables on the X and Y axes vary according to the field of application.
Some of the more common usages are

| X-axis | Y-axis |
| ------ | ------ |
| independent | dependent |
| predictor | predicted |
| carrier | response |
| input | output |

The association looks like it could be described by a straight line. There are many straight lines that could be drawn through the data. How should one choose among them? On the one hand, the choice is not that critical because all of the reasonable candidates would show strength increasing with mass. On the other hand, a standard procedure for fitting a straight line is essential. Otherwise, different analysts working on the same data set would produce different fits, which would make communication difficult. Here, the fitted equation is

**Strength = -13.971 + 3.016 LBM**

The analysis is always described as *the regression of the
response on the carrier*. Here, the example involves
"the regression of muscle strength on lean body mass", not the other way
around.

[Standard notation: The data are pairs of independent and dependent
variables {(x_{i},y_{i}): i=1,...,n}. The fitted
equation is written ŷ = b_{0} + b_{1}x, where ŷ_{i} is the predicted
value of the response obtained by using the equation. The
*residuals* are the differences between the observed and the
predicted values, e_{i} = y_{i} − ŷ_{i}. They are
*always* calculated as (observed − predicted), never the other way
'round.]

There are two primary reasons for fitting a regression equation to a set of data: first, to describe the data; second, to predict the response from the carrier. The rationale behind the way the regression line is calculated is best seen from the point of view of prediction. A line gives a good fit to a set of data if the points are close to it. When the points are not tightly grouped about any line, a line gives a good fit if the points are closer to it than to any other line. For predictive purposes, this means that the predicted values obtained by using the line should be close to the values that were actually observed, that is, the residuals should be small. Therefore, when assessing the fit of a line, the only distances that matter are the vertical distances from the points to the line. Perpendicular distances are not considered because prediction errors are errors in the response, and in the scatterplot an error in the response is a vertical distance, not a perpendicular one.

The simple linear regression equation is also called the *least
squares* regression equation. Its name tells us the criterion used to
select the best fitting line, namely that the sum of the *squares*
of the residuals should be *least*. That is, the least squares
regression equation is the line for which the sum of squared residuals
is
a minimum.

It is not necessary to fit a large number of lines by trial-and-error to find the best fit. Some algebra shows the sum of squared residuals will be minimized by the line for which

b_{1} = Σ(x_{i} − x̄)(y_{i} − ȳ) / Σ(x_{i} − x̄)^{2}   and   b_{0} = ȳ − b_{1}x̄
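As a check on these formulas, they can be computed directly. This is a minimal sketch in Python on made-up (x, y) pairs, not the muscle-strength data discussed here:

```python
# Least-squares slope and intercept from the closed-form formulas:
#   b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2),  b0 = ybar - b1 * xbar
# The (x, y) pairs below are made-up illustrative data.

def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = least_squares(x, y)
print(f"fitted line: yhat = {b0:.3f} + {b1:.3f} x")
```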

When the analysis is performed by a statistical program package, the
output will typically report, for each coefficient, its estimate,
standard error, and t statistic, along with the residual standard
deviation.
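As a rough, hypothetical sketch of how such a table is built (made-up data; the layout imitates generic package output rather than any particular program):

```python
import math

# Made-up (x, y) data; a package would print a coefficient table like the one below.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))   # residual SD, n-2 df
se_b1 = s / math.sqrt(sxx)
se_b0 = s * math.sqrt(1 / n + xbar ** 2 / sxx)

print(f"{'term':<10}{'coef':>10}{'se':>10}{'t':>8}")
print(f"{'intercept':<10}{b0:>10.3f}{se_b0:>10.3f}{b0/se_b0:>8.2f}")
print(f"{'slope':<10}{b1:>10.3f}{se_b1:>10.3f}{b1/se_b1:>8.2f}")
```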

A straight line can be
fitted to any set of data. The formulas for the coefficients of
the least squares fit are the same for a sample, a population, or any
arbitrary batch of numbers. However, regression is usually used to let
analysts generalize from the sample in hand to the population from which
the sample was drawn. There *is* a population regression
equation,

μ_{Y|X} = β_{0} + β_{1}X,

where μ_{Y|X} denotes the mean response at a given value of X; the sample coefficients b_{0} and b_{1} are estimates of the population coefficients β_{0} and β_{1}.

The sample regression equation is an estimate of the population regression equation. Like any other estimate, there is an uncertainty associated with it. The uncertainty is expressed in confidence bands about the regression line. They have the same interpretation as the standard error of the mean, except that the uncertainty varies according to the location along the line. The uncertainty is least at the sample mean of the Xs and gets larger as the distance from the mean increases. The regression line is like a stick nailed to a wall with some wiggle to it: the ends of the stick will wiggle more than the center. The distance of the confidence bands from the regression line is

t·s·√( 1/n + (x − x̄)^{2} / Σ(x_{i} − x̄)^{2} ),

where s is the standard deviation of the residuals and t is the critical value of Student's t distribution with n−2 degrees of freedom.

There are also bands for predicting a single response at a particular value of X. The best estimate is given, once again, by the regression line. The distance of the prediction bands from the regression line is

t·s·√( 1 + 1/n + (x − x̄)^{2} / Σ(x_{i} − x̄)^{2} ).

The extra 1 under the square root makes the prediction bands wider than the confidence bands: predicting a single response involves the variability of individual observations as well as the uncertainty in the line itself.
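The two half-widths can be compared numerically. A sketch on made-up data, assuming the 0.975 quantile of Student's t with n − 2 = 4 degrees of freedom (about 2.776) as the critical value:

```python
import math

# Made-up (x, y) data; t_crit below is the 0.975 quantile of t with n-2 = 4 df.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
s = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
t_crit = 2.776

def conf_halfwidth(x0):
    # half-width of the confidence band (mean response) at x0
    return t_crit * s * math.sqrt(1 / n + (x0 - xbar) ** 2 / sxx)

def pred_halfwidth(x0):
    # half-width of the prediction band (single response) at x0
    return t_crit * s * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / sxx)

# Both bands are narrowest at xbar and widen away from it;
# the prediction band is always the wider of the two.
for x0 in (xbar, xbar + 2.0):
    print(round(x0, 1), round(conf_halfwidth(x0), 3), round(pred_halfwidth(x0), 3))
```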

The regression of X on Y is different from the regression of Y on X. If one wanted to predict lean body mass from muscle strength, a new model would have to be fitted (dashed line). It could not be obtained by taking the original regression equation and solving for strength. The reason is that in terms of the original scatterplot, the best equation for predicting lean body mass minimizes the errors in the horizontal direction rather than the vertical. For example,

- The regression of Strength on LBM is **Strength = -13.971 + 3.016 LBM**.
- Solving that equation for LBM gives **LBM = 4.632 + 0.332 Strength**.
- However, the regression of LBM on Strength is **LBM = 14.525 + 0.252 Strength**.
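The asymmetry is easy to verify numerically: on any data set, the slope of y regressed on x times the slope of x regressed on y equals r², the squared correlation, so inverting one fit reproduces the other only when the points lie exactly on a line. A sketch with made-up data:

```python
def slope(u, v):
    # least-squares slope from regressing v on u
    n = len(u)
    ubar, vbar = sum(u) / n, sum(v) / n
    return sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v)) / \
           sum((ui - ubar) ** 2 for ui in u)

# Made-up data with a strong but imperfect linear relationship.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b_yx = slope(x, y)            # regression of y on x
b_xy = slope(y, x)            # regression of x on y
print(b_yx, 1 / b_xy)         # not equal: inverting one fit != the other fit
print(b_yx * b_xy)            # this product is r-squared, just below 1 here
```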

Simple linear regression is an example of borrowing strength from some observations to make sharper (that is, more precise) statements about others. If all we wanted to do was make statements about the strength of individuals with a specific amount of lean body mass, we could recruit many individuals with that amount of LBM, test them, and report the appropriate summaries (mean, SD, confidence interval,...). We could do this for every LBM of interest. Simple linear regression assumes we don't have to start from scratch for each new amount of LBM. It says that the expected amount of strength is linearly related to LBM. The regression line does two important things. First, it allows us to estimate muscle strength for a particular LBM more precisely than we could by using only the subjects with that particular LBM. Second, it allows us to estimate the muscle strength of individuals with amounts of lean body mass that aren't in our sample!

These benefits don't come for free. The method is valid only insofar as the data follow a straight line, which is why it is essential to examine scatterplots.

*Interpolation* is making a prediction within the range of values of
the predictor in the sample used to generate the model. Interpolation is
generally safe. One could imagine odd situations where an investigator
collected responses at only two values of the predictor. Then,
interpolation might be uncertain since there would be no way to
demonstrate the linearity of the relationship between the two variables,
but such situations are rarely encountered in practice.
*Extrapolation* is making a prediction outside the range of values
of the predictor in the sample used to generate the model. The more
removed the prediction is from the range of values used to fit the model,
the riskier the prediction becomes because there is no way to check that
the relationship continues to be linear. For example, an individual with
3 kg of lean body mass would be expected to have a strength of -4.9
units. This is absurd, but it does not invalidate the model because it
was based on lean body masses in the range 27 to 71 kg.
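Under the fitted equation quoted earlier (Strength = -13.971 + 3.016 LBM), the arithmetic behind a plausible interpolation and an absurd extrapolation can be sketched as:

```python
# Coefficients quoted in the text; LBM in kg.
b0, b1 = -13.971, 3.016

def predict_strength(lbm):
    return b0 + b1 * lbm

print(predict_strength(40.0))   # inside the fitted range (27-71 kg): plausible
print(predict_strength(3.0))    # far below the fitted range: negative strength
```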

-------------------------

^{1}The particular measure of strength is slow right extensor
peak torque in the knee.