**Regression Diagnostics**
Gerard E. Dallal, Ph.D.

In the 1970s and 80s, many statisticians developed techniques for
assessing multiple regression models. One of the most influential books
on the topic was *Regression Diagnostics: Identifying Influential Data
and Sources of Collinearity* by Belsley, Kuh, and Welsch. Roy Welsch
tells of getting interested in regression diagnostics when he was once
asked to fit models to some banking data. When he presented his results
to his clients, they remarked that the model could not be right because
the sign of one of the predictors was different from what they expected.
When Welsch looked closely at the data, he discovered the sign reversal
was due to an outlier in the data. This example motivated him to develop
methods to ensure it didn't happen again!

Perhaps the best reason for studying regression diagnostics was given by Frank Anscombe when he was discussing outliers.

> We are usually happier about asserting a regression relation if the relation is appropriate after a few observations (any ones) have been deleted--that is, we are happier if the regression relation seems to permeate all the observations and does not derive largely from one or two.

Regression diagnostics were developed to measure various ways in which
a regression relation might derive largely from one or two observations.
Observations whose inclusion or exclusion results in substantial changes
in the fitted model (coefficients, fitted values) are said to be
**influential**.
Many of these diagnostics are available from standard statistical program
packages.

It is common practice to distinguish between two types of outliers.
Outliers in the response variable represent model failure. Such
observations are called **outliers**. Outliers with respect to the
predictors are called **leverage points**. They can affect the
regression model, too. Their response values need not be outliers.
However, they may almost uniquely determine regression coefficients. They
may also cause the standard errors of regression coefficients to be much
smaller than they would be if the observation were excluded.

The ordinary or simple residuals (observed - predicted values) are the
most commonly used measures for detecting outliers. The ordinary
residuals sum to zero but do not have the same standard deviation. Many
other measures have been offered to improve on or complement simple
residuals.

**Standardized Residuals** are the residuals divided by the
estimates of their standard errors. They have mean 0 and standard
deviation 1. There are two common ways to calculate the standardized
residual for the i-th observation. One uses the residual mean square
error from the model fitted to the full dataset (internally studentized
residuals). The other uses the residual mean square error from the model
fitted to all of the data except the i-th observation (externally
studentized residuals). The externally studentized residuals follow a t
distribution with n-p-2 df. They can be thought of as testing the
hypothesis that the corresponding observation does not follow the
regression model that describes the other observations.
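As a sketch of these two calculations (a hedged illustration with NumPy on invented data, not code from the original article), both kinds of studentized residual can be computed from the hat matrix without refitting the model n times:

```python
import numpy as np

# Invented data for illustration: a straight line with noise and one
# planted outlier in the response.
rng = np.random.default_rng(0)
n, p = 20, 1                          # n observations, p predictors
x = np.linspace(0, 10, n)
y = 2 + 0.5 * x + rng.normal(0, 1, n)
y[10] += 8                            # plant an outlier in the response

X = np.column_stack([np.ones(n), x])      # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T      # hat matrix
e = y - H @ y                             # ordinary (simple) residuals
h = np.diag(H)                            # leverages

sse = e @ e
s2 = sse / (n - p - 1)                    # residual mean square, full model
r_int = e / np.sqrt(s2 * (1 - h))         # internally studentized

# Externally studentized: residual mean square from the fit with the
# i-th observation deleted (computable in closed form, no refitting).
s2_del = (sse - e**2 / (1 - h)) / (n - p - 2)
r_ext = e / np.sqrt(s2_del * (1 - h))     # follows t with n-p-2 df

print(int(np.argmax(np.abs(r_ext))))      # flags the planted outlier
```

In practice these quantities are also reported by standard packages (for example, statsmodels' `OLSInfluence`), so hand computation is rarely necessary.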

In practice, I find ordinary residuals the most useful. While the standard deviations of the residuals are different, they are usually not different enough to matter when looking for outliers. They have the advantage of being in the same scale as the response.

**Cook's Distance** is an aggregate measure that shows the effect of the i-th observation on the fitted values for all *n* observations.

- For the i-th observation, calculate the predicted responses for all *n* observations from the model constructed by setting the i-th observation aside.
- Sum the squared differences between those predicted values and the predicted values obtained from fitting a model to the entire dataset.
- Divide by *p+1* times the Residual Mean Square from the full model.
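These steps can be carried out literally and checked against the standard closed form, D_i = r_i²·h_i / ((p+1)(1−h_i)), where r_i is the internally studentized residual and h_i the leverage. A sketch on invented data (not code from the article):

```python
import numpy as np

# Invented data for illustration.
rng = np.random.default_rng(1)
n, p = 15, 1
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ beta
e = y - yhat
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)
s2 = e @ e / (n - p - 1)                 # residual mean square, full model

# Literal definition: refit without the i-th case, predict all n
# responses, sum squared differences, divide by (p+1) * s2.
D_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    D_def[i] = np.sum((X @ beta_i - yhat) ** 2) / ((p + 1) * s2)

# Closed form using leverages and internally studentized residuals.
r2 = e**2 / (s2 * (1 - h))
D_formula = r2 * h / ((p + 1) * (1 - h))

print(np.allclose(D_def, D_formula))     # the two routes agree: True
```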

**DFITS** is the scaled difference between the predicted response for the i-th observation from the model constructed from all of the data and the predicted response from the model constructed by setting the i-th observation aside. It is similar to Cook's distance. Unlike Cook's distance, it does not look at all of the predicted values with the i-th observation set aside. It looks only at the predicted value for the i-th observation. Also, the scaling factor uses the standard error of the estimate with the i-th observation set aside. To see the effect of this, consider a dataset with one predictor in which all of the observations except one lie exactly on a straight line. The Residual Mean Square using all of the data will be positive. The standard errors of the estimate obtained by setting each observation aside in turn will be positive, except for the observation that does not lie on the line. When it is set aside, the standard error of the estimate will be 0 and DFITS_i will be arbitrarily large. Some analysts suggest investigating observations for which |DFITS_i| is greater than 2√[(p+1)/(n-p-1)]. Others suggest looking at a dot plot to find extreme values.

**DFBETAS** are similar to DFITS. Instead of looking at the difference in fitted value when the i-th observation is included or excluded, DFBETAS looks at the change in each regression coefficient.
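Both measures can be sketched by leave-one-out refitting, and DFITS checked against the identity DFITS_i = t_i·√(h_i/(1−h_i)), where t_i is the externally studentized residual. A hedged illustration on invented data:

```python
import numpy as np

# Invented data for illustration.
rng = np.random.default_rng(2)
n, p = 15, 1
x = rng.uniform(0, 10, n)
y = 3 - x + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
h = np.diag(X @ XtX_inv @ X.T)                 # leverages
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
# Deleted residual mean square (i-th observation set aside).
s2_del = ((e @ e) - e**2 / (1 - h)) / (n - p - 2)

dfits = np.empty(n)
dfbetas = np.empty((n, p + 1))
for i in range(n):
    keep = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    s_i = np.sqrt(s2_del[i])
    # DFITS: change in the i-th fitted value, scaled by the deleted s.
    dfits[i] = (X[i] @ beta - X[i] @ beta_i) / (s_i * np.sqrt(h[i]))
    # DFBETAS: change in each coefficient, scaled the same way.
    dfbetas[i] = (beta - beta_i) / (s_i * np.sqrt(np.diag(XtX_inv)))

# Closed-form check against the externally studentized residuals.
t = e / np.sqrt(s2_del * (1 - h))
print(np.allclose(dfits, t * np.sqrt(h / (1 - h))))   # True
```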

In theory, these can be useful measures. However, I have not found that to be the case in my own practice. It may be the sort of data I analyze. Often, people using these measures find themselves in a vicious cycle. They calculate some measures, remove some observations, and find that additional observations have suspicious measures when they recalculate. They remove more observations and the cycle starts all over again. By the time they are done, many observations have been set aside, no one is quite sure why, and no one feels very good about the final model.

Leverage points do not necessarily correspond to outliers. There are a few reasons why this is so. First, an observation with sufficiently high leverage might exert enough influence to drag the regression equation close to its response and mask the fact that it would otherwise be an outlier. See the third and fourth Anscombe datasets, for example. When leverage points are not outliers, it's not clear what can be done about them except, perhaps, to note them. The fourth Anscombe example is as extreme as it gets. The regression coefficient is completely determined by a single observation. Yet, what is one to do? If one believes the model (linearity, homoscedasticity, normal errors), then a regression coefficient determined by one or two observations is the best we can do if that's the way our data come to us or we choose to collect them.

A least squares model can be distorted by a single observation. The fitted line or surface might be tipped so that it no longer passes through the bulk of the data, introducing many small or moderate errors in order to reduce the effect of a very large error. For example, if a large error is reduced from 200 to 50, its square is reduced from 40,000 to 2,500. Increasing an error from 5 to 15 increases its square only from 25 to 225. Thus, a least squares fit might introduce many small errors in order to reduce a large one.
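A minimal numeric illustration of this tipping (invented data): ten points lying exactly on the line y = x, with one response made wildly large. The single wild point drags the fitted slope far from 1.

```python
import numpy as np

# Ten points on the line y = x, plus one wild response at the far end.
x = np.arange(10.0)
y = x.copy()
y[9] = 200.0                                   # one very large error

clean_slope = np.polyfit(x[:9], y[:9], 1)[0]   # fit without the outlier
tipped_slope = np.polyfit(x, y, 1)[0]          # fit with it

# The clean fit recovers slope 1; the tipped fit is pulled far upward,
# introducing errors at every other point to shrink the one big error.
print(clean_slope, tipped_slope)
```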

Robust regression is a term used to describe model fitting procedures that are insensitive to the effects of maverick observations. My personal favorite is least median of squares (LMS) regression, developed by Peter Rousseeuw. LMS regression minimizes the median squared residual. Since it focuses on the median residual, up to half of the observations can disagree without masking a model that fits the rest of the data.

Fitting an LMS regression model poses some difficulties. The first is
computational. Unlike least squares regression, there is no formula that
can be used to calculate the coefficients of an LMS regression. The LMS
solution can be found by fitting regression surfaces to all possible
subsets of p+1 observations, where *p* is the number of predictors.
(This is merely a matter of solving a set of p+1 linear equations in p+1
unknown parameters.) The LMS regression is given by the parameters,
chosen over all possible sets of p+1 observations, that have the minimum
median squared residual when applied to the entire dataset. Evaluating
all possible subsets of p+1 observations can be computationally
infeasible for large datasets. When n is large, Rousseeuw recommends
drawing random samples of p+1 observations and using the best solution
obtained from these randomly selected subsets.
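The random-subset search can be sketched as follows (a simplified illustration on invented data, not a production implementation). With one predictor, each candidate fit is an exact line through p+1 = 2 points; the winner is the candidate with the smallest median squared residual over the whole dataset, and it resists even heavy contamination:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented data: a line with small noise, with a third of the
# responses shifted far off the line (maverick observations).
n = 30
x = np.linspace(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 0.2, n)
y[:10] += 15

X = np.column_stack([np.ones(n), x])

best_beta, best_med = None, np.inf
for _ in range(500):
    idx = rng.choice(n, size=2, replace=False)   # p+1 points, p = 1
    try:
        beta = np.linalg.solve(X[idx], y[idx])   # exact fit through them
    except np.linalg.LinAlgError:
        continue                                 # degenerate subset
    med = np.median((y - X @ beta) ** 2)         # criterion: median sq. resid.
    if med < best_med:
        best_med, best_beta = med, beta

print(best_beta)   # roughly (1, 2), despite the mavericks
```

A least squares fit to the same data would be dragged well away from (1, 2) by the ten shifted responses; the median criterion simply ignores them as long as they are a minority.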
The second problem is that there is no theory for constructing
confidence intervals for LMS regression coefficients or for testing
hypotheses about them. Rousseeuw has proposed calculating a distance
measure based on LMS regression and using it to identify outliers with
respect to the LMS regression. These observations are set aside and
least squares regression is fitted to the rest of the data. The result
is called reweighted least squares regression.

This approach has some obvious appeal. A method insensitive to maverick observations is used to identify outliers, which are set aside so that an ordinary multiple regression can be fitted. However, there are no constraints that force the reweighted least squares model to resemble the LMS model. It is even possible for the signs of some regression coefficients to differ between the two models. This places the analyst in the awkward position of explaining how a model different from the final model was used to determine which observations determine the final model.

Now that robust regression procedures are becoming available in widely used statistical program packages such as SAS (PROC ROBUSTREG), it will be interesting to see what impact they will have on statistical practice.