Frank Anscombe's Regression Examples
The intimate relationship between correlation and regression raises
the question of whether a regression analysis can be misleading in the
same way as the earlier set of scatterplots, all of which had a
correlation coefficient of 0.70. In 1973, Frank Anscombe published a
set of examples showing the answer is a definite yes (Anscombe, F. J. (1973),
"Graphs in Statistical Analysis," The American Statistician, 27, 17-21).
Anscombe's examples share not only the same correlation coefficient, but also
the same value for every other summary statistic that is usually reported:
| Summary statistic             | Value        |
|-------------------------------|--------------|
| Regression equation of y on x | y = 3 + 0.5x |
| Residual SS                   | 13.75 (9 df) |
| Estimated SE of b1            | 0.118        |
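The claim that all four datasets share these summary statistics can be checked directly. The sketch below fits an ordinary least-squares line to each of Anscombe's published datasets using plain Python; the helper `fit_line` is ours, not Anscombe's notation.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1*x.

    Returns the intercept b0, slope b1, and residual sum of squares.
    """
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss

# Anscombe's (1973) data: datasets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]
for x, y in datasets:
    b0, b1, rss = fit_line(x, y)
    print(f"y = {b0:.2f} + {b1:.3f} x, residual SS = {rss:.2f}")
```

Each dataset yields, to the precision reported, the same fitted equation y = 3 + 0.5x and the same residual sum of squares, despite the radically different pictures in the four figures.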
Figure 1 is the picture drawn by the mind's eye when a simple linear regression equation is reported. Yet, the same summary statistics apply to figure 2, which shows a perfect curvilinear relation, and to figure 3, which shows a perfect linear relation except for a single outlier.
The summary statistics also apply to figure 4, which is the most troublesome. Figures 2 and 3 clearly call the straight line relation into question. Figure 4 does not. A straight line may be appropriate in the fourth case. However, the regression equation is determined entirely by the single observation at x=19. Paraphrasing Anscombe, we need to know the relation between y and x and the special contribution of the observation at x=19 to that relation.