Partial Correlation Coefficients
Gerard E. Dallal, Ph.D.
Scatterplots, correlation coefficients, and simple linear regression coefficients are inter-related. The scatterplot displays the data. The correlation coefficient measures linear association between the variables. The regression coefficient describes the linear association through a number that gives the expected change in the response per unit change in the predictor.
The coefficients of a multiple regression equation give the change in response per unit change in a predictor when all other predictors are held fixed. This raises the question of whether there are analogues to the correlation coefficient and the scatterplot to summarize the relation and display the data after adjusting for the effects of other variables.
This note answers these questions, illustrating them with the crop yield example of Hooker reported by Kendall and Stuart in their Advanced Theory of Statistics, Vol. 2, 3rd ed. (example 27.1). Neither Hooker nor Kendall & Stuart provide the raw data, so I have generated a set of random data with means, standard deviations, and correlations identical to those given in K&S. These statistics are sufficient for all of the methods discussed here, so the random data will be adequate. (Sufficient is a technical term meaning that nothing else about the data has any effect on the analysis: any data set with the same values of the sufficient statistics will produce the same results.)
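Such a data set can be constructed by whitening a random draw (removing whatever sample correlation it happens to have) and then recoloring it with the target correlation matrix. A minimal sketch in Python; the means and standard deviations below are placeholders of my own, since only the correlations are reproduced in this note:

```python
import numpy as np

# Target correlations from the K&S example; the means and SDs are
# hypothetical placeholders (the construction works for any values).
target_corr = np.array([[ 1.00000,  0.80031, -0.39988],
                        [ 0.80031,  1.00000, -0.55966],
                        [-0.39988, -0.55966,  1.00000]])
means = np.array([28.0, 4.9, 590.0])   # YIELD, RAIN, TEMP (placeholders)
sds   = np.array([ 4.4, 1.1,  85.0])   # placeholders
n = 20

rng = np.random.default_rng(0)
z = rng.standard_normal((n, 3))
z -= z.mean(axis=0)                               # exact zero sample means
# Whiten: strip the sample correlation the random draw happens to have.
L_emp = np.linalg.cholesky(np.cov(z, rowvar=False, bias=True))
z = z @ np.linalg.inv(L_emp).T
# Recolor with the target correlation, then rescale and shift.
L_tgt = np.linalg.cholesky(target_corr)
x = z @ L_tgt.T * sds + means                     # 20 rows of YIELD, RAIN, TEMP
```

The resulting sample means, standard deviations, and correlations match the targets exactly, not just in expectation, which is what "sufficient" requires here.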
The variables are yields of "seeds' hay" in cwt per acre, spring rainfall in inches, and the accumulated temperature above 42 F in the spring for an English area over 20 years. The plots suggest that yield and rainfall are positively correlated, while yield and temperature are negatively correlated. This is borne out by the correlation matrix itself.
Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0

            YIELD      RAIN      TEMP
 YIELD    1.00000   0.80031  -0.39988
                     <.0001    0.0807
 RAIN     0.80031   1.00000  -0.55966
           <.0001              0.0103
 TEMP    -0.39988  -0.55966   1.00000
           0.0807    0.0103
A partial correlation coefficient can be written in terms of simple correlation coefficients. For two variables X and Y with a single variable Z held fixed,

    r(XY.Z) = [r(XY) - r(XZ) r(YZ)] / sqrt{[1 - r(XZ)^2] [1 - r(YZ)^2]}
A partial correlation between two variables can differ substantially from their simple correlation. Sign reversals are possible, too. For example, the partial correlation between YIELD and TEMPERATURE holding RAINFALL fixed is 0.09664. While it does not reach statistical significance (P = 0.694), the sample value is positive nonetheless.
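This value can be computed directly from the simple correlations in the matrix above; a minimal sketch (the variable names are mine):

```python
import math

# Partial correlation of YIELD and TEMP holding RAIN fixed, from the
# simple correlations:
#   r(YT.R) = [r(YT) - r(YR) r(RT)] / sqrt{[1 - r(YR)^2] [1 - r(RT)^2]}
r_yt, r_yr, r_rt = -0.39988, 0.80031, -0.55966

partial = (r_yt - r_yr * r_rt) / math.sqrt((1 - r_yr**2) * (1 - r_rt**2))
print(round(partial, 5))   # 0.09664, matching the value reported in the text
```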
The partial correlation between X and Y holding a set of variables fixed will have the same sign as the multiple regression coefficient of X when Y is regressed on X and the set of variables being held fixed. Also, just as the simple correlation coefficient describes the data in an ordinary scatterplot, the partial correlation coefficient describes the data in the partial regression residual plot.
Let Y and X1 be the variables of primary interest and let X2,...,Xp be the variables held fixed. The partial correlation between Y and X1 adjusted for X2,...,Xp is the ordinary correlation between the residuals from regressing Y on X2,...,Xp and the residuals from regressing X1 on X2,...,Xp.
For example, the partial correlation of YIELD and TEMP adjusted for RAIN is the correlation between the residuals from regressing YIELD on RAIN and the residuals from regressing TEMP on RAIN. In this partial regression residual plot, the correlation is 0.09664. The regression coefficient of TEMP when the YIELD residuals are regressed on the TEMP residuals is 0.003636, the same as the coefficient of TEMP in the multiple regression of YIELD on RAIN and TEMP fitted to the original data set.
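This residual-on-residual construction is easy to check numerically. The sketch below uses arbitrary simulated data rather than Hooker's (the slope/coefficient identity holds for any data set); the variable names and generating coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
rain = rng.normal(5.0, 1.0, n)
temp = rng.normal(100.0, 10.0, n) - 3.0 * rain      # correlated with rain
yield_ = 10.0 + 3.0 * rain + 0.2 * temp + rng.normal(0.0, 1.0, n)

def residuals(y, x):
    """Residuals from the least-squares line of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

ry = residuals(yield_, rain)    # YIELD adjusted for RAIN
rt = residuals(temp, rain)      # TEMP adjusted for RAIN

partial_r = np.corrcoef(ry, rt)[0, 1]        # the partial correlation
slope = np.polyfit(rt, ry, 1)[0]             # slope in the residual plot

# The same coefficient appears in the multiple regression of YIELD on both.
X = np.column_stack([np.ones(n), rain, temp])
b_full, *_ = np.linalg.lstsq(X, yield_, rcond=None)
```

Here `slope` agrees with `b_full[2]`, the TEMP coefficient from the full regression, and `partial_r` carries the same sign as that coefficient.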
Because the data are residuals, they are centered around zero. The values, then, are not similar to the original values. However, perhaps this is an advantage: it stops them from being misinterpreted as Y or X1 values "adjusted for X2,...,Xp." While the regression of Y on X2,...,Xp seems reasonable, it is not uncommon to hear questions about adjusting only one of the variables; that is, some propose comparing the residuals of Y on X2,...,Xp with X1 directly.
This approach has been suggested many times over the years. Lately, it has been used in the field of nutrition by Willett and Stampfer (AJE, 124(1986):17-22) to produce "calorie-adjusted nutrient intakes", which are the residuals obtained by regressing nutrient intakes on total energy intake. These adjusted intakes are used as predictors in other regression equations; however, total energy intake does not appear in those equations, and the response is not adjusted for total energy intake. Willett and Stampfer recognize this, but propose using calorie-adjusted intakes nonetheless. They suggest that "calorie-adjusted values in multivariate models will overcome the problem of high collinearity frequently observed between nutritional factors", but this is just an artifact of adjusting only some of the factors. The correlation between an adjusted factor and an unadjusted factor is always smaller in magnitude than the correlation between two adjusted factors.
This method was first proposed before the ready availability of computers as a way to approximate multiple regression with two independent variables (regress Y on X1, then regress the residuals on X2); it was given the name two-stage regression. Today, however, it is a mistake to use the approximation when the correct answer is easily obtained. If the goal is to report on two variables after adjusting for the effects of another set of variables, then both variables must be adjusted.
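The gap between the shortcut and the full fit is easy to exhibit. In this sketch (simulated data, invented coefficients), the two-stage estimate of the X2 coefficient is attenuated relative to the multiple regression coefficient because X1 and X2 are correlated:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)        # predictors are correlated
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)

def fit(y, *cols):
    """Least-squares coefficients (intercept first)."""
    X = np.column_stack([np.ones(len(y))] + list(cols))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Two-stage shortcut: regress Y on X1, then the residuals on X2.
b1 = fit(y, x1)
resid = y - np.column_stack([np.ones(n), x1]) @ b1
b2_stage = fit(resid, x2)[1]

# Full multiple regression of Y on X1 and X2.
b2_full = fit(y, x1, x2)[2]
# b2_stage and b2_full generally disagree when x1 and x2 are correlated;
# residualizing *both* y and x2 on x1 would recover b2_full exactly.
```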