Introduction to Simple Linear Regression

Gerard E. Dallal, Ph.D.

How would you characterize this display of muscle strength¹ against lean body mass? Those who have more lean body mass tend to be stronger. The relationship isn't perfect. It's easy to find two people where the one with more lean body mass is the weaker, but in general strength and lean body mass tend to go up and down together.

Comment: When two variables are displayed in a scatterplot and one can be thought of as a response to the other (here, muscles produce strength), standard practice is to place the response on the vertical (or Y) axis. The names of the variables on the X and Y axes vary according to the field of application. Some of the more common usages are:

X-axis	Y-axis
independent	dependent
predictor	predicted
carrier	response
input	output

The association looks like it could be described by a straight line. There are many straight lines that could be drawn through the data. How should one choose among them? On the one hand, the choice is not that critical because all of the reasonable candidates would show strength increasing with mass. On the other hand, a standard procedure for fitting a straight line is essential. Otherwise, different analysts working on the same data set would produce different fits, which would make communication difficult. Here, the fitted equation is Strength = -13.971 + 3.016 LBM. It says an individual's strength is predicted by multiplying lean body mass by 3.016 and subtracting 13.971. It also says the strength of two individuals is expected to differ by 3.016 times their difference in lean body mass.
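As a small illustration of the two readings of the fitted equation, the following Python sketch evaluates it for two individuals. Only the coefficients come from the text; the helper name predict_strength and the two lean body mass values are made up for illustration.

```python
# Coefficients of the fitted line quoted in the text.
INTERCEPT = -13.971
SLOPE = 3.016


def predict_strength(lbm_kg):
    """Predicted strength for a given lean body mass (hypothetical helper)."""
    return INTERCEPT + SLOPE * lbm_kg


# Two hypothetical individuals whose lean body mass differs by 10 kg.
strength_a = predict_strength(40.0)
strength_b = predict_strength(50.0)

print(round(strength_a, 3))               # prediction at 40 kg
print(round(strength_b - strength_a, 3))  # slope times the 10 kg difference
```

The difference between the two predictions depends only on the slope and the gap in lean body mass, not on the intercept.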

The analysis is always described as the regression of the response on the carrier. Here, the example involves "the regression of muscle strength on lean body mass", not the other way around.

The Regression Equation

[Standard notation: The data are pairs of independent and dependent variables {(x_i,y_i): i=1,...,n}. The fitted equation is written \hat{y} = b_0 + b_1 x, where \hat{y} is the predicted value of the response obtained by using the equation. The residuals are the differences between the observed and the predicted values, y_i - \hat{y}_i. They are always calculated as (observed-predicted), never the other way 'round.]

There are two primary reasons for fitting a regression equation to a set of data--first, to describe the data; second, to predict the response from the carrier. The rationale behind the way the regression line is calculated is best seen from the point-of-view of prediction. A line gives a good fit to a set of data if the points are close to it. Where the points are not tightly grouped about any line, a line gives a good fit if the points are closer to it than to any other line. For predictive purposes, this means that the predicted values obtained by using the line should be close to the values that were actually observed, that is, that the residuals should be small. Therefore, when assessing the fit of a line, the vertical distances of the points to the line are the only distances that matter. Perpendicular distances are not considered because errors are measured as vertical distances, not perpendicular distances.

The simple linear regression equation is also called the least squares regression equation. Its name tells us the criterion used to select the best fitting line, namely that the sum of the squares of the residuals should be least. That is, the least squares regression equation is the line for which the sum of squared residuals is a minimum.

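It is not necessary to fit a large number of lines by trial-and-error to find the best fit. Some algebra shows the sum of squared residuals will be minimized by the line for which the slope b_1 and intercept b_0 are

b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} and b_0 = \bar{y} - b_1 \bar{x} ,

where \bar{x} and \bar{y} are the sample means of the x_i and y_i. This can even be done by hand if need be.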
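The hand calculation can be sketched in a few lines of Python. The sketch uses the standard closed-form least squares solution, slope = \sum (x_i - \bar{x})(y_i - \bar{y}) / \sum (x_i - \bar{x})^2 and intercept = \bar{y} - slope \bar{x}. The function names and the five data points are invented for illustration and are not the strength data.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def sum_squared_residuals(xs, ys, intercept, slope):
    """Sum of squared (observed - predicted) values for a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only to keep the arithmetic easy to follow.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

b0, b1 = least_squares_fit(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)

# Any other line, such as one with a nudged slope or intercept, does worse.
worse_slope = sum_squared_residuals(xs, ys, b0, b1 + 0.1)
worse_intercept = sum_squared_residuals(xs, ys, b0 + 0.1, b1)
print(round(b0, 3), round(b1, 3))
```

Comparing the fitted line with nudged alternatives is a quick sanity check that the formulas really do minimize the sum of squared residuals.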

When the analysis is performed by a statistical program package, the output will look something like this.

A straight line can be fitted to any set of data. The formulas for the coefficients of the least squares fit are the same for a sample, a population, or any arbitrary batch of numbers. However, regression is usually used to let analysts generalize from the sample in hand to the population from which the sample was drawn. There is a population regression equation, β₀ + β₁ X, and Y_i = β₀ + β₁ X_i + ε_i, where β₀ and β₁ are the population regression coefficients and ε_i is a random error peculiar to the i-th observation. Thus, each response is expressed as the sum of a value predicted from the corresponding X, plus a random error.
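One way to make the population model concrete is to simulate from it. The sketch below draws samples as "population line plus random error" and fits a line to each. The population coefficients, the error standard deviation, the range of X, and the use of normally distributed errors are all assumptions made for the illustration; the text does not specify any of them.

```python
import random


def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def simulate_sample(beta0, beta1, sigma, n, rng):
    """Draw n pairs with Y_i = beta0 + beta1 * X_i + error_i (normal errors assumed)."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


# Hypothetical population values, not taken from the text.
BETA0, BETA1, SIGMA = 1.0, 2.0, 1.0
rng = random.Random(0)

# Small samples give estimates that scatter around the population values.
for _ in range(3):
    xs, ys = simulate_sample(BETA0, BETA1, SIGMA, 25, rng)
    b0, b1 = least_squares_fit(xs, ys)
    print(round(b0, 2), round(b1, 2))

# A large sample lands much closer to the population line.
xs_big, ys_big = simulate_sample(BETA0, BETA1, SIGMA, 2000, rng)
b0_big, b1_big = least_squares_fit(xs_big, ys_big)
print(round(b0_big, 2), round(b1_big, 2))
```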

The sample regression equation is an estimate of the population regression equation. Like any other estimate, there is an uncertainty associated with it. The uncertainty is expressed in confidence bands about the regression line. They have the same interpretation as the standard error of the mean, except that the uncertainty varies according to the location along the line. The uncertainty is least at the sample mean of the Xs and gets larger as the distance from the mean increases. The regression line is like a stick nailed to a wall with some wiggle to it. The ends of the stick will wiggle more than the center. The distance of the confidence bands from the regression line is

t s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} ,

where t is the appropriate percentile of the t distribution, s_e is the standard error of the estimate, and x^* is the location along the X-axis where the distance is being calculated. The distance is smallest when x^* = \bar{x}, the sample mean of the Xs. These bands also serve as confidence intervals for the population mean value of Y at X=x^*.

There are also bands for predicting a single response at a particular value of X. The best estimate is given, once again, by the regression line. The distance of the prediction bands from the regression line is

t s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_i (x_i - \bar{x})^2}} .

For large samples, this is essentially ts_e, so the standard error of the estimate functions like a standard deviation around the regression line.
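Both band widths can be computed directly from the data. The sketch below uses the standard textbook half-widths, t s_e \sqrt{1/n + (x^* - \bar{x})^2 / \sum (x_i - \bar{x})^2} for the confidence band and the same expression with an extra 1 under the square root for the prediction band. It assumes the usual definition of the standard error of the estimate, s_e = \sqrt{SSE/(n-2)}, takes the t percentile as an input looked up from a table, and runs on invented data rather than the strength data.

```python
import math


def band_half_widths(xs, ys, x_star, t):
    """Return (confidence, prediction) band half-widths at x_star.

    t is the appropriate percentile of the t distribution with n - 2
    degrees of freedom, supplied by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Standard error of the estimate: assumed here to be sqrt(SSE / (n - 2)).
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))

    spread = 1.0 / n + (x_star - x_bar) ** 2 / sxx
    confidence = t * s_e * math.sqrt(spread)
    prediction = t * s_e * math.sqrt(1.0 + spread)
    return confidence, prediction


# Hypothetical data with visible scatter about the line.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
T_975_DF3 = 3.182  # 97.5th percentile of t with 3 degrees of freedom (table value)

conf_mid, pred_mid = band_half_widths(xs, ys, 3.0, T_975_DF3)  # at the mean of X
conf_end, pred_end = band_half_widths(xs, ys, 5.0, T_975_DF3)  # at the edge of the data
print(round(conf_mid, 3), round(conf_end, 3))
print(round(pred_mid, 3), round(pred_end, 3))
```

The confidence band is narrowest at the mean of the Xs and widens toward the ends, while the prediction band is wider than the confidence band everywhere because it also has to cover the scatter of an individual response.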

The regression of X on Y is different from the regression of Y on X. If one wanted to predict lean body mass from muscle strength, a new model would have to be fitted (dashed line). It could not be obtained by taking the original regression equation and solving for lean body mass. The reason is that in terms of the original scatterplot, the best equation for predicting lean body mass minimizes the errors in the horizontal direction rather than the vertical. For example,

The regression of Strength on LBM is
Strength = -13.971 + 3.016 LBM .
Solving for LBM gives
LBM = 4.632 + 0.332 Strength .
However, the regression of LBM on Strength is
LBM = 14.525 + 0.252 Strength .
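The same effect shows up on any data set with scatter. The sketch below fits both regressions to a small invented data set and compares the regression of x on y with the line obtained by algebraically inverting the regression of y on x. It also checks a standard identity: the product of the two fitted slopes equals the squared correlation coefficient. Applied to the slopes quoted above, 3.016 × 0.252 ≈ 0.76, which would be the squared correlation for the strength data up to rounding.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least squares line of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data, not the strength data from the text.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

a_yx, b_yx = least_squares_fit(xs, ys)  # regression of y on x
a_xy, b_xy = least_squares_fit(ys, xs)  # regression of x on y

# Solving y = a_yx + b_yx * x for x gives a different line.
inverted_intercept = -a_yx / b_yx
inverted_slope = 1.0 / b_yx

# Product of the two regression slopes equals the squared correlation.
r_squared = b_yx * b_xy

print(round(b_xy, 3), round(inverted_slope, 3))
print(round(r_squared, 3))
```

The two lines coincide only when the points fall exactly on a straight line, that is, when the squared correlation is 1.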

Borrowing Strength

Simple linear regression is an example of borrowing strength from some observations to make sharper (that is, more precise) statements about others. If all we wanted to do was make statements about the strength of individuals with specific amounts of lean body mass, we could recruit many individuals with that amount of LBM, test them, and report the appropriate summaries (mean, SD, confidence interval,...). We could do this for all of the LBMs of interest. Simple linear regression assumes we don't have to start from scratch for each new amount of LBM. It says that the expected amount of strength is linearly related to LBM. The regression line does two important things. First, it allows us to estimate muscle strength for a particular LBM more accurately than we could with only those subjects with the particular LBM. Second, it allows us to estimate the muscle strength of individuals with amounts of lean body mass that aren't in our sample!

These benefits don't come for free. The method is valid only insofar as the data follow a straight line, which is why it is essential to examine scatterplots.

Interpolation and Extrapolation

Interpolation is making a prediction within the range of values of the predictor in the sample used to generate the model. Interpolation is generally safe. One could imagine odd situations where an investigator collected responses at only two values of the predictor. Then, interpolation might be uncertain since there would be no way to demonstrate the linearity of the relationship between the two variables, but such situations are rarely encountered in practice.

Extrapolation is making a prediction outside the range of values of the predictor in the sample used to generate the model. The more removed the prediction is from the range of values used to fit the model, the riskier the prediction becomes because there is no way to check that the relationship continues to be linear. For example, an individual with 3 kg of lean body mass would be expected to have a strength of -4.9 units. This is absurd, but it does not invalidate the model because it was based on lean body masses in the range 27 to 71 kg.
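A simple guard against accidental extrapolation is to record the range of the predictor used to fit the model and flag any prediction outside it. The sketch below does this for the fitted strength equation, using the coefficients and the 27 to 71 kg range quoted in the text. The function name and the two test values are hypothetical.

```python
# Fitted equation and predictor range quoted in the text.
INTERCEPT = -13.971
SLOPE = 3.016
LBM_MIN_KG = 27.0
LBM_MAX_KG = 71.0


def predict_with_flag(lbm_kg):
    """Return (predicted strength, True if lbm_kg lies outside the fitted range)."""
    prediction = INTERCEPT + SLOPE * lbm_kg
    is_extrapolation = not (LBM_MIN_KG <= lbm_kg <= LBM_MAX_KG)
    return prediction, is_extrapolation


inside_pred, inside_flag = predict_with_flag(50.0)   # interpolation
outside_pred, outside_flag = predict_with_flag(2.0)  # far below the fitted range

# The fitted line crosses zero strength at this lean body mass.
zero_crossing_kg = -INTERCEPT / SLOPE

print(round(inside_pred, 3), inside_flag)
print(round(outside_pred, 3), outside_flag)
print(round(zero_crossing_kg, 2))
```

Any lean body mass below the zero crossing (about 4.6 kg) produces a negative predicted strength, which is a reminder that the line is only a description of the data within the range that was actually observed.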

-------------------------

¹The particular measure of strength is slow right extensor peak torque in the knee.