**Collinearity**

Gerard E. Dallal, Ph.D.

This message was posted to the Usenet group comp.soft-sys.stat.systat:

I have run a multiple linear regression model with about 20 independent variables regressed against a dependent variable. I am getting an output I have never seen. In the coefficients, it gives me values for 5 independent variables but all the t-stats are blank and the standard errors are all zeros. My F and SEE are also blank. Also, it excluded 15 of the independent variables. Some of the excluded variables would not surprise me to be insignificant, but many I know are significant. The only note it gives me is tolerance = 0 limits reached. Can anyone give me some guidance on this output?

**A predictor can't appear in a regression equation more than once.**
Suppose some response (Y) is regressed on height in inches (HIN) and the
resulting equation is

Y = 17.38 + 5.08 HIN

Now suppose we attempt to fit an equation in which HIN appears twice as a predictor. To do this, let HINCOPY be an exact copy of HIN, that is, HINCOPY = HIN, and fit the equation

Y = b_{0} + b_{1} HIN + b_{2} HINCOPY

What is a self-respecting computer program to do? It's supposed to come up with the best solution, but there are many equivalent solutions. All equations for which b_{0} = 17.38 and b_{1} + b_{2} = 5.08 are equivalent. So, a self-respecting computer program might do you a favor by recognizing the problem, excluding either HIN or HINCOPY, and continuing to fit the model.
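The non-uniqueness is easy to see numerically. Here is a minimal sketch with made-up heights (the data, seed, and sample size are arbitrary): NumPy's `lstsq` reports the rank of the design matrix, which is 2 rather than 3 because the duplicated column adds no information, yet the sum b_{1} + b_{2} is still well determined.

```python
import numpy as np

# Made-up data generated from the equation Y = 17.38 + 5.08 HIN plus noise.
rng = np.random.default_rng(0)
hin = rng.uniform(60, 75, size=30)        # heights in inches
y = 17.38 + 5.08 * hin + rng.normal(0, 1, size=30)

hincopy = hin.copy()                      # exact duplicate of HIN
X = np.column_stack([np.ones_like(hin), hin, hincopy])

# lstsq reports the rank of the design matrix: 2 rather than 3, because
# the HINCOPY column adds no new information.  No unique solution exists,
# but the sum b1 + b2 is still well determined.
coef, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(rank)
print(coef[1] + coef[2])                  # close to 5.08
```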

The problem described in the Prolog is **collinearity**, where
variables are so highly correlated that it is impossible to come up with
reliable estimates of their individual regression coefficients.
Collinearity does not affect the ability of a regression equation
to predict the response. It poses a real problem if the purpose of the
study is to estimate the contributions of individual predictors.

The two variables don't have to be exact copies for problems to
arise. If Y is regressed on height in centimeters (HCM), the resulting
equation **must** be

Y = 17.38 + 2.00 HCM

Otherwise, the two equations would not give the same predictions. Since 1 inch = 2.54 centimeters,

HCM = 2.54 HIN, so 5.08 HIN = 5.08 (HCM / 2.54) = 2.00 HCM

[Those with a science background might wonder how this works out in terms of "units of measurement". This is discussed on its own web page in order to keep the discussion of collinearity flowing smoothly.]

Suppose Y is regressed on both HIN and HCM. What are the resulting coefficients in the regression equation

Y = b_{0} + b_{1} HIN + b_{2} HCM ?

Again, there is no unique answer. There are many sets of coefficients
that give the same predicted values. Any b_{1} and b_{2}
for which b_{1} + 2.54 b_{2} = 5.08 is a possibility.
Some examples are

- Y = 17.38 + 5.08 HIN + 0.00 HCM
- Y = 17.38 + 2.54 HIN + 1.00 HCM
- Y = 17.38 + 0.00 HIN + 2.00 HCM
- Y = 17.38 + 6.35 HIN - 0.50 HCM
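A quick numerical check (with a few made-up heights) confirms that all four sets of coefficients give identical predictions, since each pair satisfies b_{1} + 2.54 b_{2} = 5.08:

```python
import numpy as np

hin = np.array([60.0, 65.0, 70.0])   # made-up heights in inches
hcm = 2.54 * hin                     # the same heights in centimeters

# Every (b1, b2) pair with b1 + 2.54*b2 = 5.08 gives identical predictions:
for b1, b2 in [(5.08, 0.00), (2.54, 1.00), (0.00, 2.00), (6.35, -0.50)]:
    print(17.38 + b1 * hin + b2 * hcm)
```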

*Collinearity* (or *multicollinearity* or
*ill-conditioning*) occurs when independent variables are so highly
correlated that it becomes difficult or impossible to distinguish their
individual influences on the response variable. As focus shifted from
detecting exact linear relations among variables to detecting situations
where things are so close that they cannot be estimated reliably, the
meaning of *collinear* in a regression context was altered (some
would say "devalued") to the point where it is sometimes used as a synonym
for *correlated*, that is, correlated predictors are sometimes
called *collinear* even when there isn't an exact linear relation
among them.

Strictly speaking, "collinear" means just that--an exact linear relationship between variables. For example, if HIN is height in inches and HCM is height in centimeters, they are collinear because HCM = 2.54 HIN. If TOTAL is total daily caloric intake, and CARB, PROTEIN, FAT, and ALCOHOL are the calories from carbohydrates, protein, fat, and alcohol, then the five variables are collinear because TOTAL = CARB + PROTEIN + FAT + ALCOHOL.

[I prefer to write these linear relations as

TOTAL - CARB - PROTEIN - FAT - ALCOHOL = 0

in keeping with the general form of a linear relation

c_{1} X_{1} + c_{2} X_{2} + ... + c_{m} X_{m} = k

where c_{1},...,c_{m}, and k are constants.

This makes it easier to see that things like percent of calories from carbohydrates, protein, fat, and alcohol are collinear, because

%CARB + %PROTEIN + %FAT + %ALCOHOL = 100

with
c_{1}=c_{2}=c_{3}=c_{4}=1 and k=100.]

Exact linear relationships might not appear exactly linear to a computer, while some relationships that are not collinear may appear to be collinear. This happens because computers store data to between 7 and 15 significant digits of precision. Roundoff error might mask some exact linear relationships and conceivably make other relationships look collinear. The same behavior shows up in inexpensive calculators: when 1 is divided by 3 and the result is multiplied by 3, the result is 0.9999999 rather than 1, so 1 does not equal the result of dividing 1 by 3 and multiplying by 3!
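The same roundoff can be demonstrated in any language that uses binary floating point, where decimal fractions like 0.1 are only approximated:

```python
# Double-precision floating point stores about 15-16 significant decimal
# digits, so decimal fractions such as 0.1 are only approximated.
print(0.1 + 0.2 == 0.3)              # False
print(0.1 + 0.2)                     # 0.30000000000000004

# Ten copies of 0.1 should sum to exactly 1, but roundoff intervenes:
total = sum([0.1] * 10)
print(total == 1.0)                  # False
```

An exact linear relation among variables computed from such values can therefore fail an exact equality test, which is why programs must test for *approximate* collinearity.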

For numerical analysts,
the problem of collinearity had to do with identifying sets of predictors
that were collinear or *appeared to be collinear*. Once "appear to
be collinear" was part of the mix, "collinear" began to be used more and
more liberally.

There are three different situations where the term "collinearity" is used:

1. where there is an exact linear relationship among the predictors by definition, as in percent of calories from fat, carbohydrate, protein, and alcohol;
2. where an exact or nearly exact linear relationship is forced on the data by the study design (before the recent focus on vitamin E, supplementary vitamins E and A were almost always obtained through multivitamins; while the strength of the multivitamins varied among brands, A and E almost always appeared in the same proportion, which forced a linear relationship on the two vitamins and made it impossible to distinguish between their effects in observational studies); and
3. where correlation among the predictors is *serious enough to matter*, in ways to be defined shortly.

In cases (1) and (2), any competent regression program will not allow all of the predictors to appear in the regression equation. The Prolog is the classic manifestation of the effects of collinearity in practice. In case (3), a model may be fitted, but there will be clear indications that something is wrong. If these indicators are present, it is appropriate to say there is a problem with collinearity. Otherwise, there is merely correlation among the predictors. While some authors equate collinearity with any correlation, I do not.

Serious correlations among predictors will have the following effects:

- Regression coefficients will change dramatically according to whether other variables are included or excluded from the model.
- The standard errors of the regression coefficients will be large.
- In the worst cases, regression coefficients for collinear variables will be large in magnitude with signs that seem to be assigned at random.
- Predictors with known, strong relationships to the response will not have their regression coefficients achieve statistical significance.
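These symptoms are easy to reproduce by simulation. In this sketch (made-up data; the seed, sample size, and noise scales are arbitrary), the response depends only on x1, but adding a near-copy of x1 to the model inflates the standard errors enormously:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)   # nearly an exact copy of x1
y = 1.0 + 2.0 * x1 + rng.normal(size=n)    # x2 contributes nothing on its own

def ols(X, y):
    """Ordinary least squares: coefficients and their standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, se

coef_a, se_a = ols(np.column_stack([np.ones(n), x1]), y)
coef_b, se_b = ols(np.column_stack([np.ones(n), x1, x2]), y)

print(coef_a[1], se_a[1])    # slope near 2, small standard error
print(coef_b[1:], se_b[1:])  # individual slopes erratic, standard errors huge
```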

If variables are perfectly collinear, the coefficient of determination
R^{2} will be 1 when any one of them is regressed upon the
others. This is the motivation behind calculating a variable's
**tolerance**, a measure of collinearity reported by most linear
regression programs. Each predictor is regressed on the other predictors.
Its tolerance is 1-R^{2}. A small value of the tolerance
indicates that the variable under consideration is almost a perfect
linear combination of the independent variables already in the equation
and that it should not be added to the regression equation. All variables
involved in the linear relationship will have a small tolerance. Some
statisticians suggest that a tolerance less than 0.1 deserves attention.
If the goal of a study is to determine whether a particular independent
variable has predictive capability in the presence of the others, the
tolerance can be disregarded if the predictor reaches statistical
significance despite being correlated with the other predictors. The
confidence interval for the regression coefficient will be wider than if
the predictors were uncorrelated, but the predictive capability will have
been demonstrated nonetheless. If the low value of tolerance is
accompanied by large standard errors and nonsignificance, another study
may be necessary to sort things out if subject matter knowledge cannot be
used to eliminate from the regression equation some of the variables
involved in the linear relation.

The tolerance is sometimes reexpressed as the Variance Inflation Factor (VIF), the inverse of the tolerance (= 1/tolerance). Tolerances of 0.10 or less become VIFs of 10 or more.
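Tolerance and VIF are simple enough to compute directly. A minimal sketch in plain NumPy, using made-up data in which a third predictor is nearly the sum of the first two (so all three variables are involved in the near-linear relation and all three tolerances are small):

```python
import numpy as np

def tolerance_and_vif(X):
    """Regress each column of X on the remaining columns (plus an
    intercept) and report tolerance = 1 - R^2 and VIF = 1 / tolerance."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
        tol = 1 - r2
        out.append((tol, 1 / tol))
    return out

# Made-up example: x3 is almost exactly the sum of x1 and x2.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.05, size=200)
X = np.column_stack([x1, x2, x3])

for tol, vif in tolerance_and_vif(X):
    print(f"tolerance = {tol:.4f}, VIF = {vif:.1f}")
```

Note that every variable involved in the near-linear relation shows a small tolerance, which is why the tolerance alone cannot say how many relations there are.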

Other measures of collinearity, such as condition numbers, have appeared in the statistical literature and are available in full-featured statistical packages. They have their advantages. When many variables have low tolerances, there is no way to tell how many nearly linear relations there are among the predictors. Condition numbers tell the analyst the number of such relations, and the associated matrices identify the variables involved in each one. For routine use, however, the tolerance or VIF is sufficient to determine whether any problems exist.

Some statisticians have proposed techniques--including ridge regression, robust regression, and principal components regression--to fit a multiple linear regression equation despite serious collinearity. I'm uncomfortable with all of them because they are purely mathematical approaches to what is really a problem with the data.

Principal components regression replaces the original predictor
variables with uncorrelated linear combinations of them. (It might help
to think of these linear combinations as scales. One might be the sum of
the first three predictors, another might be the difference between the
second and fourth, and so on.) The scales are constructed to be
uncorrelated with each other. If collinearity among the predictors was
not an issue, there would be as many scales as predictors. When
collinearity is an issue, the scales with near-zero variance are dropped,
leaving fewer scales than predictors. To illustrate, suppose X_{1} and
X_{2} are correlated but not collinear. The two principal
components might be their sum and difference (X_{1} +
X_{2} and X_{1} - X_{2}). If X_{1} and
X_{2} are nearly collinear, only one principal component
(X_{1} + X_{2}) would be used in a principal component
regression. While the mathematics is elegant and the principal
components will not be collinear, there is no guarantee that the best
predictor of the response won't be the last principal component (X_{1} -
X_{2}), the one that never gets used.
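This cautionary tale can be sketched numerically. In the made-up example below (arbitrary seed and scales), the first principal component carries nearly all the predictor variance, yet it is the discarded second component, the difference, that actually predicts the response:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)    # nearly collinear predictors
# Suppose the response happens to depend on the small difference between
# them -- exactly the component a principal components regression discards:
y = (x1 - x2) + rng.normal(scale=0.05, size=n)

Xc = np.column_stack([x1, x2])
Xc = Xc - Xc.mean(axis=0)
# Principal components from the SVD of the centered predictors.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc = Xc @ Vt.T       # pc[:, 0] ~ sum direction, pc[:, 1] ~ difference

print(s**2 / np.sum(s**2))     # first component carries nearly all the variance
for j in range(2):
    r = np.corrcoef(pc[:, j], y)[0, 1]
    print(f"PC{j + 1}: correlation with y = {r:+.2f}")
```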

When all is said and done, collinearity has been masked rather than removed. Our ability to estimate the effects of individual predictors is still compromised.