Gerard E. Dallal, Ph.D.
In the 1970s and 80s, many statisticians developed techniques for assessing multiple regression models. One of the most influential books on the topic was Regression Diagnostics: Identifying Influential Data and Sources of Collinearity by Belsley, Kuh, and Welsch. Roy Welsch tells of getting interested in regression diagnostics when he was once asked to fit models to some banking data. When he presented his results to his clients, they remarked that the model could not be right because the sign of one of the predictors was different from what they expected. When Welsch looked closely at the data, he discovered the sign reversal was due to an outlier in the data. This example motivated him to develop methods to ensure it didn't happen again!
Perhaps the best reason for studying regression diagnostics was given by Frank Anscombe when he was discussing outliers.
We are usually happier about asserting a regression relation if the relation is appropriate after a few observations (any ones) have been deleted--that is, we are happier if the regression relation seems to permeate all the observations and does not derive largely from one or two.
Regression diagnostics were developed to measure various ways in which a regression relation might derive largely from one or two observations. Observations whose inclusion or exclusion results in substantial changes in the fitted model (coefficients, fitted values) are said to be influential. Many of these diagnostics are available from standard statistical program packages.
It is common practice to distinguish between two types of outliers. Outliers in the response variable represent model failure; such observations are called outliers. Outliers with respect to the predictors are called leverage points. They can affect the regression model, too. Their responses need not be outliers. However, they may almost uniquely determine regression coefficients. They may also cause the standard errors of regression coefficients to be much smaller than they would be if the observations were excluded.
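Leverage is usually quantified by the diagonal of the hat matrix, H = X(X'X)^(-1)X', and a common rule of thumb flags observations whose hat value exceeds twice the average. A minimal sketch in Python with NumPy (the data here are invented for illustration):

```python
import numpy as np

# Invented data: one predictor whose last value lies far from the rest.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
X = np.column_stack([np.ones_like(x), x])   # add an intercept column
n, k = X.shape                              # k = p + 1 fitted parameters

# Hat (leverage) values: the diagonal of H = X (X'X)^-1 X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

flagged = h > 2 * k / n   # rule of thumb: twice the average hat value
print(h.round(3))
print(flagged)            # only the x = 20 observation is flagged
```

Note that the hat values depend only on the predictors, not on the response, which is why a leverage point need not be an outlier.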
The ordinary or simple residuals (observed minus predicted values) are the most commonly used measures for detecting outliers. The ordinary residuals sum to zero but do not have the same standard deviation. Many other measures have been offered to improve on or complement simple residuals. Standardized residuals are the residuals divided by estimates of their standard errors. They have mean 0 and standard deviation 1. There are two common ways to calculate the standardized residual for the i-th observation. One uses the residual mean square error from the model fitted to the full dataset (internally studentized residuals). The other uses the residual mean square error from the model fitted to all of the data except the i-th observation (externally studentized residuals). The externally studentized residuals follow a t distribution with n-p-2 df. They can be thought of as testing the hypothesis that the corresponding observation does not follow the regression model that describes the other observations.
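Both flavors can be computed without refitting the model n times, using the hat values and a standard identity linking the internally and externally studentized residuals. A sketch with invented data (NumPy assumed):

```python
import numpy as np

def studentized_residuals(X, y):
    """Internally and externally studentized residuals.
    X must include a column of ones for the intercept."""
    n, k = X.shape                        # k = p + 1 fitted parameters
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta                      # ordinary residuals
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)   # hat values
    s2 = e @ e / (n - k)                  # residual mean square, full data
    internal = e / np.sqrt(s2 * (1 - h))
    # Standard identity giving the leave-one-out version without n refits;
    # these follow a t distribution with n - k - 1 = n - p - 2 df.
    external = internal * np.sqrt((n - k - 1) / (n - k - internal**2))
    return internal, external

# Invented data: a straight line with mild noise and one point pushed off it.
x = np.arange(10.0)
y = 2 * x + 1 + np.array([0.2, -0.1, 0.3, -0.2, 0.1, 0.0, -0.3, 0.2, -0.1, 0.1])
y[5] += 8.0
X = np.column_stack([np.ones_like(x), x])

internal, external = studentized_residuals(X, y)
print(external.round(2))   # the perturbed observation stands out
```

The externally studentized residual for the perturbed point is larger in magnitude than the internal one, which is typical: removing a genuine outlier shrinks the residual mean square used in its denominator.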
In practice, I find ordinary residuals the most useful. While the standard deviations of the residuals differ, they are usually not different enough to matter when looking for outliers, and ordinary residuals have the advantage of being on the same scale as the response.
In theory, these can be useful measures. However, I have not found that to be the case in my own practice; it may be the sort of data I analyze. Often, people using these measures find themselves in a vicious cycle. They calculate some measures, remove some observations, and find that additional observations have suspicious measures when they recalculate. They remove more observations and the cycle starts all over again. By the time they are done, many observations have been set aside, no one is quite sure why, and no one feels very good about the final model.
Leverage points do not necessarily correspond to outliers. There are a few reasons why this is so. First, an observation with sufficiently high leverage might exert enough influence to drag the regression equation close to its response and mask the fact that it might otherwise be an outlier. See the third and fourth Anscombe datasets, for example. When a leverage point is not an outlier, it is not clear what can be done about it except, perhaps, to note it. The fourth Anscombe example is as extreme as it gets: the regression coefficient is completely determined by a single observation. Yet, what is one to do? If one believes the model (linearity, homoscedasticity, normal errors), then a regression coefficient determined by one or two observations is the best we can do if that's the way our data come to us or we choose to collect them.
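The fourth Anscombe configuration is easy to reproduce with invented data of the same shape (these are not Anscombe's actual values): when every x but one is identical, the slope is determined entirely by the lone leverage point.

```python
import numpy as np

# Invented data in the spirit of Anscombe's fourth dataset:
# every x is 8 except a single observation at x = 19.
x = np.array([8., 8., 8., 8., 8., 8., 8., 8., 8., 8., 19.])
y = np.array([6.6, 5.8, 7.7, 8.8, 8.5, 7.0, 5.2, 5.6, 7.9, 6.9, 12.5])

slope, intercept = np.polyfit(x, y, 1)

# The least squares slope is exactly the slope of the line through
# (8, mean of the y's at x = 8) and (19, the y at x = 19).
implied = (y[-1] - y[:-1].mean()) / (19 - 8)
print(slope, implied)   # identical
```

Drop the x = 19 observation and the slope is not merely changed; it is undefined, since the remaining x values have no spread at all.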
A least squares model can be distorted by a single observation. The fitted line or surface may be tipped so that it no longer passes through the bulk of the data, introducing many small or moderate errors in order to reduce the effect of a very large error. For example, if a large error is reduced from 200 to 50, its square is reduced from 40,000 to 2,500, while increasing an error from 5 to 15 increases its square only from 25 to 225. Thus, a least squares fit might introduce many small errors in order to reduce a large one.
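The arithmetic of that trade can be checked directly:

```python
# Shrinking one large error from 200 to 50 saves far more squared error
# than many moderate errors cost.
saving = 200**2 - 50**2        # 40,000 - 2,500 = 37,500
cost_per_point = 15**2 - 5**2  # 225 - 25 = 200 per moderate error
print(saving / cost_per_point) # the saving pays for 187.5 such increases
```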
Robust regression is a term used to describe model fitting procedures that are insensitive to the effects of maverick observations. My personal favorite is least median of squares (LMS) regression, developed by Peter Rousseeuw. LMS regression minimizes the median of the squared residuals. Since it focuses on the median residual, up to half of the observations can disagree without masking a model that fits the rest of the data.
Fitting an LMS regression model poses some difficulties. The first is computational. Unlike least squares regression, there is no formula that can be used to calculate the coefficients of an LMS regression. Instead, the solution can be found by fitting regression surfaces to all possible subsets of p+1 observations, where p is the number of predictors. (This is merely a matter of solving a set of p+1 linear equations in p+1 unknown parameters.) The LMS regression is given by the coefficients, chosen over all possible sets of p+1 observations, that have the minimum median squared residual when applied to the entire data set. Evaluating all possible subsets of p+1 observations can be computationally infeasible for large data sets. When n is large, Rousseeuw recommends drawing random samples of p+1 observations and using the best solution obtained from these randomly selected subsets.
The second problem is that there is no theory for constructing confidence intervals for LMS regression coefficients or for testing hypotheses about them. Rousseeuw has proposed calculating a distance measure based on the LMS regression and using it to identify outliers with respect to that regression. These observations are set aside and a least squares regression is fitted to the rest of the data. The result is called reweighted least squares regression.
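For the simple case of one predictor, where the subsets have size p+1 = 2, the random-subset search might be sketched as follows; the data and the number of subsets drawn are invented for illustration:

```python
import numpy as np

def lms_line(x, y, n_subsets=2000, seed=0):
    """Least median of squares fit for one predictor, approximated by
    drawing random subsets of size p + 1 = 2 (Rousseeuw's approach)."""
    rng = np.random.default_rng(seed)
    best, best_med = None, np.inf
    for _ in range(n_subsets):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                      # this pair does not fix a line
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        # Score the candidate line by its median squared residual
        # over the ENTIRE data set, not just the subset.
        med = np.median((y - intercept - slope * x) ** 2)
        if med < best_med:
            best_med, best = med, (intercept, slope)
    return best

# Invented data: the line y = 2x + 1 with three gross outliers.
x = np.arange(20.0)
y = 2 * x + 1
y[[3, 8, 15]] += 40

intercept, slope = lms_line(x, y)
print(intercept, slope)   # close to 1 and 2 despite the outliers
```

Because the median squared residual ignores the worst half of the fits, any pair of clean points recovers the underlying line; an ordinary least squares fit to the same data would be pulled noticeably toward the three outliers.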
This approach has some obvious appeal. A method insensitive to maverick observations is used to identify outliers, which are set aside so an ordinary multiple regression can be fitted. However, there are no constraints that force the reweighted least squares model to resemble the LMS model. It is even possible for the signs of some regression coefficients to differ between the two models. This places the analyst in the awkward position of explaining how a model different from the final model was used to determine which observations determine the final model.
Now that robust regression procedures are becoming available in widely used statistical program packages such as SAS (PROC ROBUSTREG), it will be interesting to see what impact they will have on statistical practice.