It's rare when a paper says everything that needs to be said in a way that can be readily understood by a nontechnical audience, but this is one of those cases. The paper is "Statistical Methods for Assessing Agreement Between Two Methods of Clinical Measurement," by JM Bland and DG Altman (The Lancet, February 8, 1986, 307-310). Perhaps it is so approachable because it was written for medical researchers three years after an equally readable version appeared in the applied statistics literature (Altman and Bland, 1983) and about the same time as a heated exchange over another approach to the problem (Kelly, 1985; Altman and Bland, 1987; Kelly, 1987).
This could very well have been a two-sentence note: "Here's the Bland and Altman reference. Please, read it." Still, its message is so elegant by virtue of its simplicity that it's worth the time and space to review the approach and see why it works while other approaches do little more than confuse the issues.
Suppose there are two measurement techniques*, both of which have a certain amount of measurement error**, and we wish to know whether they are comparable. (Altman and Bland use the phrasing, "Do the two methods of measurement agree sufficiently closely?") Data are obtained by collecting samples and splitting them in half. One piece is analyzed by each method.
The meaning of "comparable" will vary according to the particular application. For the clinician, it might mean that diagnoses and prescriptions would not change according to the particular technique that generated a particular value. For the researcher, "comparable" might mean being indifferent to (and not even caring to know) the technique used to make a particular measurement--in the extreme case, even if the choice was made purposefully, such as having all of the pre-intervention measurements made using one technique and the post-intervention measurements made with the other. (This would always make me nervous, regardless of what had been learned about the comparability of the methods!)
The Bland-Altman approach is so simple because, unlike other methods, it never loses sight of the basic question of whether the two methods of measurement agree sufficiently closely. The quantities that best answer this question are the differences in each split-sample, so Bland and Altman focus on the differences exclusively. Other approaches, involving correlation and regression, can never be completely successful because they summarize the data through things other than the differences.
The Bland-Altman papers begin by discussing inappropriate methods and then show how the comparison can be made properly. This note takes the opposite approach. It first shows the proper analysis and then discusses how other methods fall short. In fact, this note has already presented the Bland-Altman approach in the previous paragraph--do whatever you can to understand the observed differences between the paired measurements:
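As a minimal sketch of what "understanding the differences" can look like numerically, the Python below computes each pair's difference and mean, the mean difference (bias), the standard deviation of the differences, and approximate limits of agreement. The function name `bland_altman_summary`, the sample readings, and the 1.96 multiplier (Bland and Altman describe limits of roughly two standard deviations either side of the mean difference) are illustrative assumptions, not taken from this note.

```python
from statistics import mean, stdev

def bland_altman_summary(method_a, method_b):
    """Summarize paired measurements through their differences (illustrative helper)."""
    if len(method_a) != len(method_b):
        raise ValueError("measurements must be paired")
    # Difference and mean for each split sample; these are the two plot axes.
    diffs = [a - b for a, b in zip(method_a, method_b)]
    means = [(a + b) / 2 for a, b in zip(method_a, method_b)]
    bias = mean(diffs)   # average disagreement between the methods
    sd = stdev(diffs)    # spread of the disagreement (sample SD)
    # Assumption: roughly normal differences, so about 95% fall in bias +/- 1.96 SD.
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return {"differences": diffs, "means": means, "bias": bias, "sd": sd, "limits": limits}

# Made-up split-sample readings, for illustration only.
method_a = [10.2, 11.5, 9.8, 12.1, 10.9]
method_b = [10.0, 11.9, 9.5, 12.4, 10.6]
summary = bland_altman_summary(method_a, method_b)
print(summary["bias"], summary["sd"], summary["limits"])
```

Plotting `summary["differences"]` against `summary["means"]` gives the difference-against-mean display used in the examples.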
Examples
These data represent an
attempt to determine whether glucose levels of mice determined by a
simple device such as a Glucometer could be used in place of standard lab
techniques. The plots of Glucometer value against lab values and their
difference against their mean show that there is essentially no
agreement between the two measurements. Any formal statistical analyses
would be icing for a nonexistent cake!
These data represent an
attempt to determine whether vitamin C levels obtained from micro-samples
of blood from tail snips could be used in place of the standard technique
(heart puncture, which sacrifices the animal). The plots clearly
demonstrate that the tail snips tend to give values that are 0.60 units
higher than the standard technique. With a standard deviation of the
differences of 0.69 units, perhaps the tail snip could be of practical
use provided a small downward adjustment was applied to the measurements.
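To make the arithmetic concrete, here is a small sketch using only the two summary values quoted above (mean difference 0.60, standard deviation of the differences 0.69). The 1.96 multiplier assumes roughly normally distributed differences, and the helper `adjust_tail_snip` and the reading of 2.10 are hypothetical.

```python
bias = 0.60      # mean of (tail snip - standard technique), from the text
sd_diff = 0.69   # standard deviation of the differences, from the text

# Approximate range containing about 95% of the raw differences
# (assumption: differences are roughly normal).
lower = bias - 1.96 * sd_diff
upper = bias + 1.96 * sd_diff

def adjust_tail_snip(value, offset=bias):
    """Apply the downward adjustment to a tail-snip reading (hypothetical helper)."""
    return value - offset

# After the adjustment the differences are centered on zero,
# but their spread is unchanged.
adjusted_lower = lower - bias
adjusted_upper = upper - bias

print(round(lower, 2), round(upper, 2))
print(round(adjusted_lower, 2), round(adjusted_upper, 2))
print(round(adjust_tail_snip(2.10), 2))  # 2.10 is a made-up reading
```

The adjustment removes the bias but not the scatter, so whether the tail snip is of practical use still depends on whether differences of about 1.35 units in either direction are tolerable.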
These data come from a study
of the comparability of three devices for measuring bone density. The
observations labelled 'H' are human subjects; those labelled 'P' are
measurements made on phantoms. Since there are three devices, there are
three pairs of plots: 1/2, 1/3, 2/3. Here we see why the plot of one
measurement against another may be inadequate. All three plots look
satisfactory. However, when we plot the differences against the mean
values, we see that the measurements from site 2 are consistently less
than the measurements from the other two sites, which are comparable.
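The point that a plot of one measurement against another can look satisfactory while the differences tell another story can be illustrated with a short sketch. The readings below are invented for illustration (they are not the bone density data), and `pearson_r` is a hand-written helper rather than a library call.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient, written out explicitly."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical readings: device_2 tracks device_1 closely
# but reads about 0.10 units lower throughout.
device_1 = [0.80, 0.95, 1.10, 1.25, 1.40, 1.55]
device_2 = [0.71, 0.84, 1.01, 1.14, 1.31, 1.44]

r = pearson_r(device_1, device_2)
diffs = [a - b for a, b in zip(device_1, device_2)]
mean_diff = mean(diffs)

print(r)          # very close to 1, so the scatterplot looks fine
print(mean_diff)  # yet every difference has the same sign
```

A correlation near 1 says only that the points fall near some straight line, not that they fall near the line of equality; the differences expose the offset directly.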
It may take large samples to determine that there is no statistically significant difference of practical importance, but it often takes only a small sample to show that the two techniques are dramatically different. When it comes to comparability, the standard deviation of the differences is as important as their mean. Even a small sample can demonstrate a large standard deviation.
Calibration and comparability differ in one important respect. In the comparability problem, both methods have about the same amount of error (reproducibility). Neither method is inherently more accurate than the other. In the calibration problem, an inexpensive, convenient, less precise measurement technique (labelled C, for "crude") is compared to an expensive, inconvenient, highly precise technique (labelled P, for "precise"). Considerations of cost and convenience make the crude technique attractive despite the decrease in precision.
The goal of the calibration problem is to use the value from the crude method to estimate the value that would have been obtained from the precise method. This sounds like a problem in regression, which it is, but with a twist!
With ordinary regression, an outcome variable (labelled Y) is regressed on an input (labelled X) to get an equation of the form Y = a + b X. However, the regression model says the response for fixed X varies about the regression line with a small amount of random error. In the calibration problem, the error is attached to the predictor C, while there is no error attached to P. For this reason, many authors recommend the use of inverse regression, in which the crude technique is regressed on the precise technique (in keeping with the standard regression model: response is a linear function of the predictor, plus error) and the equation is inverted in order to make predictions. That is, the equation C = b0 + b1 P is obtained by least squares regression and inverted to obtain
P = (C - b0) / b1
for prediction purposes. For further discussion, see Neter, Wasserman, and Kutner (1989, sec 5.6).
The calibration literature can become quite confusing (see Chow and Shao, 1990, for example) because the approach using inverse regression is called the "classical method" while the method of regressing P on C directly is called the "inverse method"!
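The two approaches can be put side by side in a short sketch. The calibration data and the function names (`least_squares`, `predict_classical`, `predict_inverse`) are hypothetical; only the two recipes themselves come from the discussion above.

```python
from statistics import mean

def least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

# Made-up calibration data: each sample measured by both techniques.
precise = [1.0, 2.0, 3.0, 4.0, 5.0]   # P, treated as error-free
crude = [1.3, 1.9, 3.2, 3.8, 5.3]     # C, carries the measurement error

# "Classical" method: regress C on P, then invert the fitted line.
b0, b1 = least_squares(precise, crude)

def predict_classical(c):
    return (c - b0) / b1

# "Inverse" method: regress P on C directly.
a0, a1 = least_squares(crude, precise)

def predict_inverse(c):
    return a0 + a1 * c

for c in (3.1, 5.0):
    print(c, predict_classical(c), predict_inverse(c))
```

Both fitted lines pass through the point of means, so the two predictions coincide there and drift apart as the crude reading moves away from the center of the calibration data.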
--------------------------
*Device would be a better word than technique. I've seen the Bland-Altman method used in situations where one or both of the "techniques" were prediction equations. This might be appropriate according to the error structure of the data, but it is unlikely that such an error structure can be justified.
**Even gold standards have measurement error. The Bland-Altman technique assumes the measurement errors of the two devices are comparable. This will be discussed further in Part II.
References