Collinearity

Gerard E. Dallal, Ph.D.

Prolog: Part 1

This message was posted to the Usenet group comp.soft-sys.stat.systat:

I have run a multiple linear regression model with about 20 independent variables regressed against a dependent variable. I am getting an output I have never seen. In the coefficients, it gives me values for 5 independent variables but all the t-stats are blank and the standard errors are all zeros. My F and SEE are also blank. Also, it excluded 15 of the independent variables. Some of the excluded variables would not surprise me to be insignificant, but many I know are significant.
The only note it gives me is tolerance = 0 limits reached. Can anyone give me some guidance on this output?

Prolog: Part 2

A predictor can't appear in a regression equation more than once. Suppose some response (Y) is regressed on height in inches (HIN) and the resulting equation is

Y = 17.38 + 5.08 * HIN .

Now suppose we attempt to fit an equation in which HIN appears twice as a predictor. To do this, let HINCOPY be an exact copy of HIN, that is, HINCOPY=HIN and fit the equation

Y = b₀ + b₁ HIN + b₂ HINCOPY .

What is a self-respecting computer program to do? It's supposed to come up with the best solution, but there are many equivalent solutions. All equations for which b₀ = 17.38 and b₁ + b₂ = 5.08 are equivalent. So, a self-respecting computer program might do you a favor by recognizing the problem, excluding either HIN or HINCOPY, and continuing to fit the model.

Collinearity

The problem described in the Prolog is collinearity, where variables are so highly correlated that it is impossible to come up with reliable estimates of their individual regression coefficients. Collinearity does not affect the ability of a regression equation to predict the response. It poses a real problem if the purpose of the study is to estimate the contributions of individual predictors.

The two variables don't have to be exact copies for problems to arise. If Y is regressed on height in centimeters (HCM), the resulting equation must be

Y = 17.38 + 2.00 * HCM .

Otherwise, the two equations would not give the same predictions. Since 1 inch = 2.54 centimeters,

2 (height in cm) is the same as 5.08 (height in inches).

[Those with a science background might wonder how this works out in terms of "units of measurement". This is discussed on its own web page in order to keep the discussion of collinearity flowing smoothly.]

Suppose Y is regressed on both HIN and HCM. What are the resulting coefficients in the regression equation

Y = b₀ + b₁ HIN + b₂ HCM ?

Again, there is no unique answer. There are many sets of coefficients that give the same predicted values. Any b₁ and b₂ for which b₁ + 2.54 b₂ = 5.08 is a possibility. Some examples are

Y = 17.38 + 5.08 HIN + 0.00 HCM
Y = 17.38 + 2.54 HIN + 1.00 HCM
Y = 17.38 + 0.00 HIN + 2.00 HCM
Y = 17.38 + 6.35 HIN - 0.50 HCM
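The equivalence of these equations can be checked numerically. The following is a minimal sketch, assuming NumPy and a handful of made-up heights (the data values are hypothetical, not from any study). It confirms that the design matrix containing both HIN and HCM has fewer independent columns than it has columns, and that the four coefficient sets listed above give the same predictions.

```python
import numpy as np

# Hypothetical heights in inches; the values are made up for illustration.
hin = np.array([60.0, 62.0, 65.0, 68.0, 70.0, 73.0])
hcm = 2.54 * hin                  # exact linear relation: HCM = 2.54 * HIN
y = 17.38 + 5.08 * hin            # noise-free response from the equation in the text

# Design matrix: intercept column plus both height variables.
X = np.column_stack([np.ones_like(hin), hin, hcm])

# Number of linearly independent columns, as judged by NumPy's default tolerance.
rank = np.linalg.matrix_rank(X)

# The four coefficient sets (b0, b1, b2) listed in the text.
candidates = np.array([
    [17.38, 5.08, 0.00],
    [17.38, 2.54, 1.00],
    [17.38, 0.00, 2.00],
    [17.38, 6.35, -0.50],
])

# One row of predictions per coefficient set.
predictions = candidates @ X.T

# True when every coefficient set reproduces y (up to roundoff).
all_same = np.allclose(predictions, y)

print("columns:", X.shape[1], "rank:", rank)
print("all coefficient sets give the same predictions:", all_same)
```

Because the rank is smaller than the number of columns, least squares has no unique solution here, which is why a program must either drop one of the variables or report a failure.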

Collinearity (or multicollinearity or ill-conditioning) occurs when independent variables are so highly correlated that it becomes difficult or impossible to distinguish their individual influences on the response variable. As focus shifted from detecting exact linear relations among variables to detecting situations where things are so close that they cannot be estimated reliably, the meaning of collinear in a regression context was altered (some would say "devalued") to the point where it is sometimes used as a synonym for correlated, that is, correlated predictors are sometimes called collinear even when there isn't an exact linear relation among them.

Strictly speaking, "collinear" means just that: an exact linear relationship between variables. For example, if HIN is height in inches and HCM is height in centimeters, they are collinear because HCM = 2.54 HIN. If TOTAL is total daily caloric intake, and CARB, PROTEIN, FAT, and ALCOHOL are the calories from carbohydrates, protein, fat, and alcohol, they are collinear because TOTAL = CARB + PROTEIN + FAT + ALCOHOL.

[I prefer to write these linear relations as

HCM - 2.54 HIN = 0 and
TOTAL - CARB - PROTEIN - FAT - ALCOHOL = 0

in keeping with the general form of a linear relation

c₁ X₁ + ... + c_m X_m = k ,

where c₁,...,c_m, and k are constants.

This makes it easier to see that things like percent of calories from carbohydrates, protein, fat, and alcohol are collinear, because

%CARB + %PROTEIN + %FAT + %ALCOHOL = 100 ,

with c₁=c₂=c₃=c₄=1 and k=100.]

Exact linear relationships might not appear exactly linear to a computer, while some relationships that are not collinear might appear to be. This happens because computers store data to between 7 and 15 digits of precision. Roundoff error might mask some exact linear relationships and conceivably make other relationships look collinear. This is reflected in the behavior of inexpensive calculators. When 1 is divided by 3 and the result is multiplied by 3, the result is 0.9999999 rather than 1, so that 1 is not equal to the result of dividing 1 by 3 and multiplying it by 3!
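A similar roundoff effect can be seen on any computer. The short sketch below, assuming Python with NumPy, uses a different sum than the calculator example (0.1 + 0.2 rather than 1/3 times 3) because it is a standard case that fails an exact comparison in double precision. It also prints the machine epsilon for single and double precision, which corresponds roughly to the 7 and 15 digits mentioned above.

```python
import math
import numpy as np

# A sum that is exact on paper but not in binary floating point.
total = 0.1 + 0.2

# Exact comparison fails because of roundoff.
exact_match = (total == 0.3)

# Comparison with a tolerance succeeds.
close_match = math.isclose(total, 0.3)

# Machine epsilon: the gap between 1.0 and the next representable number.
eps32 = np.finfo(np.float32).eps   # single precision, roughly 7 significant digits
eps64 = np.finfo(np.float64).eps   # double precision, roughly 15-16 significant digits

print("0.1 + 0.2 == 0.3:", exact_match)
print("isclose(0.1 + 0.2, 0.3):", close_match)
print("float32 eps:", eps32, "float64 eps:", eps64)
```

This is why software has to decide whether a relation is linear "to within a tolerance" rather than exactly, and why the choice of tolerance affects which variables are flagged.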

For numerical analysts, the problem of collinearity had to do with identifying sets of predictors that were collinear or appeared to be collinear. Once "appear to be collinear" was part of the mix, "collinear" began to be used more and more liberally.

There are three different situations where the term "collinearity" is used:

(1) where there is an exact linear relationship among the predictors by definition, as in percent of calories from fat, carbohydrate, protein, and alcohol,
(2) where an exact or nearly exact linear relationship is forced on the data by the study design (Before the recent focus on vitamin E, supplementary vitamin E and A were almost always obtained through multivitamins. While the strength of the multivitamins varied among brands, A & E almost always appeared in the same proportion. This forced a linear relationship on the two vitamins and made it impossible to distinguish between their effects in observational studies.), and
(3) where correlation among the predictors is serious enough to matter, in ways to be defined shortly.

In cases (1) and (2), any competent regression program will not allow all of the predictors to appear in the regression equation. Prolog 1 is the classic manifestation of the effects of collinearity in practice. In case (3), a model may be fitted, but there will be clear indications that something is wrong. If these indicators are present, it is appropriate to say there is a problem with collinearity. Otherwise, there is merely correlation among the predictors. While some authors equate collinearity with any correlation, I do not.

Serious correlations among predictors will have the following effects:

Regression coefficients will change dramatically according to whether other variables are included or excluded from the model.
The standard errors of the regression coefficients will be large.
In the worst cases, regression coefficients for collinear variables will be large in magnitude with signs that seem to be assigned at random.
Predictors with known, strong relationships to the response will not have their regression coefficients achieve statistical significance.
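The first two effects can be illustrated with simulated data. This is a sketch under stated assumptions: the data are randomly generated, the true coefficients (2 and 3) and the helper name ols_with_se are invented for the illustration, and standard errors use the usual ordinary least squares formula.

```python
import numpy as np

def ols_with_se(X, y):
    """OLS coefficients and standard errors; X must include the intercept column."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)              # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)         # covariance of the coefficients
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                     # unrelated to x1
x2_near = x1 + 0.01 * rng.normal(size=n)          # nearly a copy of x1
noise = rng.normal(size=n)

# Same (hypothetical) true coefficients in both scenarios.
y_indep = 1.0 + 2.0 * x1 + 3.0 * x2_indep + noise
y_near = 1.0 + 2.0 * x1 + 3.0 * x2_near + noise

ones = np.ones(n)
beta_indep, se_indep = ols_with_se(np.column_stack([ones, x1, x2_indep]), y_indep)
beta_near, se_near = ols_with_se(np.column_stack([ones, x1, x2_near]), y_near)

# Refit the nearly collinear case with x2 left out of the model.
beta_drop, se_drop = ols_with_se(np.column_stack([ones, x1]), y_near)

print("SE of x1 coefficient, uncorrelated predictors:", se_indep[1])
print("SE of x1 coefficient, nearly collinear predictors:", se_near[1])
print("x1 coefficient with x2 in the model:", beta_near[1])
print("x1 coefficient with x2 dropped:", beta_drop[1])
```

In the nearly collinear case the individual coefficients are poorly determined, but their sum is still estimated well, which is the same phenomenon as the b₁ + b₂ = 5.08 constraint in the Prolog.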

If variables are perfectly collinear, the coefficient of determination R² will be 1 when any one of them is regressed upon the others. This is the motivation behind calculating a variable's tolerance, a measure of collinearity reported by most linear regression programs. Each predictor is regressed on the other predictors. Its tolerance is 1-R². A small value of the tolerance indicates that the variable under consideration is almost a perfect linear combination of the independent variables already in the equation and that it should not be added to the regression equation. All variables involved in the linear relationship will have a small tolerance. Some statisticians suggest that a tolerance less than 0.1 deserves attention.

If the goal of a study is to determine whether a particular independent variable has predictive capability in the presence of the others, the tolerance can be disregarded if the predictor reaches statistical significance despite being correlated with the other predictors. The confidence interval for the regression coefficient will be wider than if the predictors were uncorrelated, but the predictive capability will have been demonstrated nonetheless.

If the low value of tolerance is accompanied by large standard errors and nonsignificance, another study may be necessary to sort things out if subject matter knowledge cannot be used to eliminate from the regression equation some of the variables involved in the linear relation.
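The tolerance calculation described above can be written out directly. The following is a minimal sketch, assuming NumPy; the function name tolerances and the simulated predictors are hypothetical. Each predictor is regressed on the others (with an intercept), and 1 - R² is computed as the residual sum of squares divided by the total sum of squares.

```python
import numpy as np

def tolerances(X):
    """Tolerance (1 - R^2) of each column of X regressed on the remaining columns."""
    n, p = X.shape
    tol = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Intercept plus every predictor except the j-th.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = resid @ resid
        ss_tot = np.sum((target - target.mean()) ** 2)
        tol[j] = ss_res / ss_tot                  # equals 1 - R^2
    return tol

# Simulated predictors: x2 is close to x1, x3 is unrelated to both.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

tol = tolerances(X)
flagged = tol < 0.1                               # the 0.1 guideline from the text

print("tolerances:", tol)
print("flagged:", flagged)
```

Both x1 and x2 are flagged, which matches the remark that all variables involved in a linear relationship will have a small tolerance.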

The tolerance is sometimes reexpressed as the Variance Inflation Factor (VIF), the inverse of the tolerance (= 1/tolerance). Tolerances of 0.10 or less become VIFs of 10 or more.
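One convenient way to get all the VIFs at once uses a standard identity that is not part of the discussion above: the VIF of each predictor is the corresponding diagonal element of the inverse of the predictors' correlation matrix. The sketch below assumes NumPy and simulated data.

```python
import numpy as np

# Simulated predictors: x2 is close to x1, x3 is unrelated to both.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Correlation matrix of the predictors (columns are variables).
corr = np.corrcoef(X, rowvar=False)

# Diagonal of the inverse correlation matrix gives the VIFs.
vif = np.diag(np.linalg.inv(corr))

# Tolerance is the reciprocal of the VIF.
tolerance = 1.0 / vif

print("VIF:", vif)
print("tolerance:", tolerance)
```

This shortcut breaks down when the predictors are exactly collinear, because the correlation matrix is then singular and cannot be inverted.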

Other measures of collinearity, such as condition numbers, have appeared in the statistical literature and are available in full-featured statistical packages. They have their advantages. When many variables have low tolerances, there is no way to tell how many nearly linear relations there are among the predictors. The condition numbers tell the analyst the number of relations and the associated matrices identify the variables in each one. For routine use, however, the tolerance or VIF is sufficient to determine whether any problems exist.
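To illustrate how condition numbers can count the near-linear relations, here is a rough sketch assuming NumPy. It scales each predictor column to unit length, takes the singular values, and forms condition indices as the largest singular value divided by each singular value. The cutoff of 30 is a commonly quoted rule of thumb, not something taken from the discussion above, and the intercept column is left out to keep the example short.

```python
import numpy as np

# Simulated predictors with two separate near-linear relations:
# x2 is nearly a copy of x1, and x4 is nearly a copy of x3.
rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
x4 = x3 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3, x4])

# Scale each column to unit length so the result does not depend on units.
X_scaled = X / np.linalg.norm(X, axis=0)

# Singular values, returned in descending order.
sing = np.linalg.svd(X_scaled, compute_uv=False)

# Condition indices: largest singular value over each singular value.
cond_indices = sing[0] / sing

# Count of large condition indices (30 is an assumed rule-of-thumb cutoff).
n_relations = int(np.sum(cond_indices > 30))

print("condition indices:", cond_indices)
print("near-linear relations suggested:", n_relations)
```

All four variables here would have low tolerances, yet the condition indices separate the situation into two distinct relations, which is the advantage described above.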

Some statisticians have proposed techniques--including ridge regression, robust regression, and principal components regression--to fit a multiple linear regression equation despite serious collinearity. I'm uncomfortable with all of them because they are purely mathematical approaches to solving things.

Principal components regression replaces the original predictor variables with uncorrelated linear combinations of them. (It might help to think of these linear combinations as scales. One might be the sum of the first three predictors, another might be the difference between the second and fourth, and so on.) The scales are constructed to be uncorrelated with each other. If collinearity among the predictors was not an issue, there would be as many scales as predictors. When collinearity is an issue, there are only as many scales as there are nearly noncollinear variables. To illustrate, suppose X₁ and X₂ are correlated but not collinear. The two principal components might be their sum and difference (X₁ + X₂ and X₁ - X₂). If X₁ and X₂ are nearly collinear, only one principal component (X₁ + X₂) would be used in a principal component regression. While the mathematics is elegant and the principal components will not be collinear, there is no guarantee that the best predictor of the response won't be the last principal component (X₁ - X₂) that never gets used.
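The two-variable illustration can be reproduced with simulated data. This sketch assumes NumPy and standardized predictors, for which the principal components of two positively correlated variables are proportional to their sum and their difference. The response is deliberately constructed (an assumption made for the demonstration) to depend on the difference, so that the component a principal components regression would discard is the one that predicts the response.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)               # nearly collinear with x1
X = np.column_stack([x1, x2])

# Standardize each predictor.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Principal components from the correlation matrix (eigh returns ascending order).
corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]                 # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

share = eigvals / eigvals.sum()                   # share of variance per component
scores = Z @ eigvecs                              # column 0: sum-like, column 1: difference-like

# Hypothetical response driven by the difference of the standardized predictors.
y = 5.0 * (Z[:, 0] - Z[:, 1]) + 0.05 * rng.normal(size=n)

# Squared correlation of the response with each component.
r2_pc1 = np.corrcoef(scores[:, 0], y)[0, 1] ** 2
r2_pc2 = np.corrcoef(scores[:, 1], y)[0, 1] ** 2

print("loadings (columns are components):")
print(eigvecs)
print("share of variance:", share)
print("R^2 using first component only:", r2_pc1)
print("R^2 using last component only:", r2_pc2)
```

The first component carries nearly all of the predictors' variance and would be the only one retained, yet in this constructed example it explains almost none of the response.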

When all is said and done, collinearity has been masked rather than removed. Our ability to estimate the effects of individual predictors is still compromised.