Gerard E. Dallal, PhD

Scientist I, JM USDA HNRC

[Much of this discussion involves tests of significance. Since most tests are performed at the 0.05 level, I will use 0.05 throughout rather than an abstract symbol such as α that might make some readers uncomfortable. Whenever you see "0.05 level", feel free to substitute your own favorite value, such as 0.01, or even a symbol such as α, if you'd like.]

At some point in a career that requires the use of statistical analysis, an investigator will be asked by a statistician or a referee to use a multiple comparison procedure to adjust for having performed many tests or for having constructed many confidence intervals. What, exactly, is the issue being raised, why is it important, and how is it best addressed? We'll start with significance tests and later draw some comparisons with confidence intervals.

Let's start with some "dumb" questions. The answers will be obvious. Yet, they're all one needs to know to understand the issue surrounding multiple comparisons.

*When playing the lottery, would you rather have one
ticket or many tickets?* Many. Lottery numbers are a random
phenomenon and having more tickets increases your chances of
winning.

*There's a severe electrical storm and you have to
travel across a large, open field. Your main concern is about
being hit by lightning, a somewhat random phenomenon. Would you
rather make the trip once or many times?* Once. The more
trips you make, the more likely it is that you get hit by
lightning.

Similar considerations apply to observing statistically significant test results. When there is no underlying effect or difference, we want to keep the chance of obtaining statistically significant results small. Otherwise, it would be difficult to claim that our observed differences were anything more than the vagaries of sampling and measurement.

For better or worse, much of statistical analysis is driven by
significance tests. The scientific community as a whole has decided that
the vast majority of those tests will be carried out at the 0.05 level of
significance. This level of significance is a value that separates
results typically seen when a null hypothesis is true from those that are
rare when the null hypothesis is true. The classic frequentist methods
do not give a probability that a hypothesis is true or false. Instead,
they provide indirect evidence. The rules of the game say that if
results are typical of what happens when there is no effect,
investigators can't claim evidence of an effect. However, if the
observed results occur rarely when there is no effect, investigators may
say there *is* evidence of an effect. The level of significance is
the probability of those rare events that permit investigators to claim
an effect. When we test at the 0.05 level of significance, the
probability of observing one of these rare results when there is no
effect is 5%.

In summary, a significance test is a way of deciding whether
something rare has occurred if there is no effect. It may well be that
there is no effect and something rare has occurred, but we cannot know
that. By the rules of the game, we conclude that there is an effect and
*not* that we've observed a "rare event".

When there's no underlying effect or difference, getting statistically significant results is supposed to be like winning the lottery or getting hit by lightning. The probability is supposed to be small (well, 5%, anyway). But just as with the lottery or with lightning--where the probability of winning or getting hit can increase dramatically if you take lots of chances--many tests increase the chance that something will be statistically significant at the nominal 5% level. In the case of 4 independent tests each at the 0.05 level, the probability that one or more will achieve significance is about 19%. This violates the spirit of the significance test. The chance of a statistically significant result is supposed to be small when there's no underlying effect, but performing lots of tests makes it large.
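The lottery arithmetic is easy to check. Under independence, the chance that at least one of m tests at the 0.05 level comes out significant when there is no effect is 1-(1-0.05)^{m}; a small Python sketch:

```python
# Probability of at least one "significant" result among m independent
# tests, each performed at level alpha, when no effect exists anywhere.
def familywise_error(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

print(round(familywise_error(4), 3))   # 4 independent tests: about 0.19
print(round(familywise_error(20), 3))  # 20 tests: better than even odds
```

With 20 tests, a "rare" result is more likely than not.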

If the chance of seeing a statistically significant result is
large, why should we pay it any attention and why should a journal
publish it as though it were small? Well, we shouldn't and they
shouldn't. In order to insure that the statistically significant results
we observe really are rare when there is no underlying effect, some
adjustment is needed to keep the probability of getting *any*
statistically significant results small when many tests are performed.
This is the issue of multiple comparisons. The way we adjust for multiple
tests will depend on the number and type of comparisons that are made.
There are common situations that occur so often they merit special
attention.

Consider an experiment to determine differences among three or more treatment groups (e.g., cholesterol levels resulting from diets rich in different types of oil: olive, canola, rice bran, peanut). This is a generalization of Student's t test, which compares 2 groups.

How might we proceed? One way is to perform all possible t tests.
But this raises the problem we discussed earlier. When there are 4
treatments, there are 6 comparisons and the chance that *some*
comparison will be significant (that some pair of treatments will look
different from each other) is much greater than 5% if they all have the
same effect. (I'd guess it's around 15%.) If we notice a t statistic
greater than 1.96 in magnitude, we'd like to say, "Hey, those two diets
are different because, if they weren't, there's only a 5% chance of an
observed difference this large." However, with that many tests (lottery
tickets, trips in the storm) the chance of a significant result (a win,
getting hit) is much larger, the t statistic is no longer what it appears
to be, and the argument is no longer sound.

Statisticians have developed many "multiple comparison procedures"
to let us proceed when there are many tests to be performed or
comparisons to be made. Two of the most commonly used procedures are
**Fisher's Least Significant Difference (LSD)** and **Tukey's
Honestly Significant Difference (HSD)**.

**Fisher's LSD:** We begin with a one-way analysis of
variance. If the overall F-ratio (which tests that hypothesis that all
group means are equal) is statistically significant, we can safely
conclude that not all of the treatment means are identical. Then, and
only then...we carry out all possible t tests! Yes, the same "all
possible t tests" that were just soundly criticized. The difference is
that the t tests can't be performed unless the overall F-ratio is
statistically significant. There is only a 5% chance that the overall
F-ratio will reach statistical significance when there are no
differences. Therefore, the chance of reporting a significant difference
when there are none is held to 5%. Some authors refer to this procedure
as Fisher's **Protected** LSD to emphasize the protection that the
preliminary F-test provides. It is not uncommon to see the term
*Fisher's LSD* used to describe all possible t tests without a
preliminary F test, so stay alert and be a careful consumer of
statistics.
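A minimal sketch of the protected-LSD logic, assuming equal group sizes; the critical values (F for 3 and 20 df, t for 20 df, both at the 0.05 level) are tabled values for the four-groups-of-six design:

```python
import math

def protected_lsd(groups, f_crit=3.10, t_crit=2.086):
    """Sketch of Fisher's protected LSD for g equal-size groups.
    f_crit and t_crit are the tabled 0.05 critical values for the
    design at hand (here 4 groups of 6: F(3, 20) and t(20))."""
    g, n = len(groups), len(groups[0])
    means = [sum(x) / n for x in groups]
    grand = sum(means) / g
    ms_between = n * sum((m - grand) ** 2 for m in means) / (g - 1)
    ms_within = sum(sum((v - m) ** 2 for v in x)
                    for x, m in zip(groups, means)) / (g * (n - 1))
    if ms_between / ms_within <= f_crit:
        return []  # overall F not significant: no t tests allowed
    se = math.sqrt(ms_within * 2 / n)
    return [(i, j) for i in range(g) for j in range(i + 1, g)
            if abs(means[i] - means[j]) / se > t_crit]

# hypothetical example: three similar diets and one that differs
print(protected_lsd([[0, 1, 0, 1, 0, 1]] * 3 + [[10, 11, 10, 11, 10, 11]]))
```

Only when the preliminary F-ratio clears its hurdle does the function go on to report significant pairs.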

**Tukey's HSD:** Tukey attacked the problem a different way by
following in Student's (WS Gosset) footsteps. Student discovered the
distribution of the t statistic when there were *two* groups to be
compared and there was no underlying mean difference between them. When
there are *g* groups, there are *g(g-1)/2* pairwise comparisons
that can be made. Tukey found the distribution of the *largest* of
these t statistics when there were no underlying differences. For
example, when there are 4 treatments and 6 subjects per treatment, there
are 20 degrees of freedom for the various test statistics. For Student's
t test, the critical value is 2.09. To be statistically significant
according to Tukey's HSD, a t statistic must exceed 2.80. Because the
number of groups is accounted for, there is only a 5% chance that Tukey's
HSD will declare something to be statistically significant when all
groups have the same population mean. While HSD and LSD are the most
commonly used procedures, there are many more in the statistical
literature (a dozen are listed in the PROC GLM section of the SAS/STAT
manual) and some see frequent use.

Multiple comparison procedures can be compared to buying
insurance. Here, the insurance is against making a claim of a
statistically significant result when it is just the result of chance
variation. Tukey's HSD is the right amount of insurance when all possible
pairwise comparisons are being made in a set of *g* groups. However,
sometimes not all comparisons will be made and Tukey's HSD buys too much
insurance.

In the preliminary stages of development, drug companies are interested in identifying compounds that have some activity relative to placebo, but they are not yet trying to rank the active compounds. When there are g treatments including placebo, only g-1 of the g(g-1)/2 possible pairwise comparisons will be performed. Charles Dunnett determined the behavior of the largest t statistic when comparing all treatments to a control. In the case of 4 groups with 6 subjects per group, the critical value for the three comparisons of Dunnett's test is 2.54.

Similar considerations apply to Scheffe's test, which was once one of the most popular procedures but has now fallen into disuse. Scheffe's test is the most flexible of the multiple comparison procedures. It allows analysts to perform any comparison they might think of--not just all pairs, but the mean of the 1st and 2nd groups with the mean of the 3rd and 4th, and so on. However, this flexibility comes with a price. The critical value for the four-group, six-subjects-per-group situation we've been considering is 3.05. This makes it harder to detect any differences that might be present. If pairwise comparisons are the only things an investigator wants to do, then it is unnecessary (foolish?) to pay the price of protection that the Scheffe test demands.

The moral of the story is to never take out more insurance than necessary. If you use Scheffe's test so that you're allowed to perform any comparison you can think of when all you really want to do is compare all treatments to a control, you'll be using a critical value of 3.05 instead of 2.54 and may miss some effective treatments.

The most flexible multiple comparisons procedure is the
**Bonferroni adjustment**. In order to insure that the probability is
no greater than 5% that something will appear to be statistically
significant when there are no underlying differences, each of 'm'
individual comparisons is performed at the (0.05/m) level of
significance. For example, with 4 treatments, there are m=4(4-1)/2=6
comparisons. In order to insure that the probability is no greater than
5% that something will appear to be statistically significant when there
are no underlying differences, each of the 6 individual comparisons is
performed at the 0.0083 (=0.05/6) level of significance. An equivalent
procedure is to multiply the unadjusted P values by the number of tests
and compare the results to the nominal significance level--that is,
comparing P to 0.05/m is equivalent to comparing mP to 0.05.
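In code, the mP form of the adjustment looks like this (a sketch; the six P values are arbitrary examples):

```python
def bonferroni_adjust(p_values, alpha=0.05):
    """Multiply each P value by the number of tests (capped at 1.0)
    and flag the ones that survive at the familywise alpha level."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    return [(p, adj, adj < alpha) for p, adj in zip(p_values, adjusted)]

# Six pairwise comparisons: only P values below 0.05/6 = 0.0083 survive.
for p, adj, sig in bonferroni_adjust([0.002, 0.010, 0.020, 0.049, 0.300, 0.700]):
    print(f"P={p:.3f}  adjusted={adj:.3f}  significant={sig}")
```

Note that a nominally significant P of 0.049 does not survive the adjustment when six tests are performed.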

The Bonferroni adjustment has the advantage that it can be used in
*any* multiple testing situation. For example, when an investigator
and I analyzed cataract data at five time points, we were able to assure
the paper's reviewers that our results were not merely an artifact of
having examined the data at five different points in time because we had
used the Bonferroni adjustment and performed each test at the 0.01
(=0.05/5) level of significance.

The major disadvantage to the Bonferroni adjustment is that it is not an exact procedure. The Bonferroni-adjusted P value is larger than the true P value. Therefore, in order for the Bonferroni-adjusted P value to be 0.05, the true P value must be smaller. No one likes using a smaller P value than necessary because it makes effects harder to detect. An exact procedure will be preferred when one is available. Tukey's HSD affords the same protection as the Bonferroni adjustment when comparing many treatment groups, and the HSD makes it easier to reject the hypothesis of no difference when there are real differences. In our example of four groups with six subjects per group, the critical value for Tukey's HSD is 2.80, while for the Bonferroni adjustment it is 2.93 (the percentile of Student's t distribution with 20 df corresponding to a two-tail probability of 0.05/6=0.008333).

This might make it seem as though there is no place for the Bonferroni adjustment. However, as already noted, the Bonferroni adjustment can be used in any multiple testing situation. If only 3 comparisons are to be carried out, the Bonferroni adjustment would have them performed at the 0.05/3=0.01667 level with a critical value of 2.63, which is less than the critical value for Tukey's HSD.

The critical values a t statistic must achieve to reach statistical significance at the 0.05 level (4 groups, 6 subjects per group, and 20 degrees of freedom for the error variance):

| Test | Critical value |
| --- | --- |
| t test (LSD) | 2.09 |
| Duncan^{*} | 2.22 |
| Dunnett | 2.54 |
| Bonferroni (3) | 2.63 |
| Tukey's HSD | 2.80 |
| Bonferroni (6) | 2.93 |
| Scheffe | 3.05 |

^{*}Duncan's New Multiple Range Test is a stepwise procedure. This is the critical value for assessing the homogeneity of all 4 groups.

If you look these values up in a table, Duncan, Dunnett, and Tukey's HSD will be larger by a factor of √2. I have divided them by √2 to make them comparable. The reason for the difference is that the tables assume equal sample sizes of n, say. In that case, the denominator of the t statistic would contain the factor √[(1/n)+(1/n)] = √(2/n). Instead of referring to the usual t statistic (xbar_{i}-xbar_{j})/[s_{p}√(2/n)], the tables refer to the statistic (xbar_{i}-xbar_{j})/[s_{p}√(1/n)]. Since this statistic is the ordinary t statistic multiplied by √2, the critical values must be adjusted accordingly. If you should have occasion to use such a table, check the critical value for 2 groups and infinite degrees of freedom. If the critical value is 1.96, the test statistic is the usual t statistic. If the critical value is 2.77, the table expects the √2 to be removed from the denominator of the t statistic.
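Several of the tabled values can be reproduced directly (assuming scipy is available, version 1.7 or later for the studentized range distribution; the division by √2 is the scaling adjustment just described):

```python
import math
from scipy.stats import studentized_range, t

df = 20  # error degrees of freedom (4 groups of 6 subjects)
g = 4    # number of groups

# Student's t (LSD): two-tailed 0.05 critical value
lsd = t.ppf(1 - 0.05 / 2, df)
# Tukey's HSD: studentized range critical value, divided by sqrt(2)
# to put it on the same scale as an ordinary t statistic
hsd = studentized_range.ppf(0.95, g, df) / math.sqrt(2)
# Bonferroni for 6 comparisons: two-tailed 0.05/6 critical value
bon6 = t.ppf(1 - 0.05 / 12, df)

print(f"LSD {lsd:.2f}  HSD {hsd:.2f}  Bonferroni(6) {bon6:.2f}")
```

The printed values match the 2.09, 2.80, and 2.93 in the table above.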

Most analysts agree that Fisher's LSD is too liberal. Some feel that
Tukey's HSD is too conservative. While it is clear that the largest
difference between two means should be compared by using Tukey's HSD, it
is less obvious why the same criterion should be used to judge the
*smallest* difference. The **[Student]-Newman-Keuls Procedure**
is a compromise between LSD and HSD. It acknowledges the multiple
comparison problem but invokes the following argument: Once we determine
that the two extreme treatments are different according to the Tukey HSD
criterion, we no longer have a homogeneous set of 'g' groups. At most,
'g-1' of them are the same. Newman and Keuls proposed that these means be
compared by using the Tukey criteria to assess homogeneity in 'g-1'
groups. The procedure continues in like fashion, considering homogeneous
subsets of 'g-2' groups, 'g-3' groups, and so on, as long as heterogeneity
continues to be uncovered. That is, the critical value of the t statistic
gets smaller (approaching the critical value for Student's t test) as the
number of groups that might have the same mean decreased. At one time,
the SNK procedure was widely used not only because it provided genuine
protection against falsely declaring differences to be real but also
because it let researchers have more significant differences than Tukey's
HSD would allow. It is now used less often, for two reasons. The first
is that, unlike the HSD or even the LSD approach, it cannot be used to
construct confidence intervals for differences between means. The second
reason is the growing realization that differences that depend strongly
on the choice of particular multiple comparison procedure are probably
not readily replicated.

[You have two choices. You can promise never to use this test or you can read this section!]

Duncan's New Multiple Range Test is a wolf in sheep's clothing. It looks like the SNK procedure. It has a fancy name suggesting that it adjusts for multiple comparisons. And, to the delight of its advocates, it gives many more statistically significant differences. It does this, despite its official-sounding name, by failing to give real protection to the significance level. Whenever I am asked to review a paper that uses this procedure, I always ask the investigators to reanalyze their data. This New Multiple Range Test, despite its suggestive name, does not really adjust for multiple comparisons. It is a stepwise procedure that uses the Studentized range statistic, the same statistic used by Tukey's HSD, but it undoes the adjustment for multiple comparisons!

The logic goes something like this: When there are g groups, there are g(g-1)/2 comparisons that can be made. There is some redundancy here because there are only g-1 independent pieces of information. Use the Studentized range statistic for g groups and the appropriate number of error degrees of freedom, but, to remove the penalty on the g-1 independent pieces of information, perform the Studentized range test at the 1-(1-α)^{g-1} level of significance. In the case of 4 groups (3 independent pieces of information), this corresponds to performing the Studentized range test at the 0.143 level of significance. When 'm' independent tests of true null hypotheses are carried out at some level α, the probability that none are statistically significant is (1-α)^{m} and the Type I error is 1-(1-α)^{m}. Therefore, to insure that the Studentized range statistic does not penalize me, I use it at the level that corresponds to having used α for my individual tests. In the case of 4 groups, there are three independent pieces of information. Testing the three pieces at the 0.05 level is like using the Studentized range statistic at the 1-(1-0.05)^{3} (=0.143) level. That is, if I use the Studentized range statistic with α=0.143, it is just as though I performed my 3 independent tests at the 0.05 level.
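The arithmetic behind that 0.143 can be checked in a couple of lines (assuming the usual 0.05 for the individual tests):

```python
# Duncan's effective significance level for assessing g groups:
# 1 - (1 - alpha)**(g - 1), which grows well past the nominal alpha.
def duncan_level(g, alpha=0.05):
    return 1 - (1 - alpha) ** (g - 1)

for g in (2, 3, 4, 6):
    print(g, round(duncan_level(g), 3))
```

For 4 groups the test is effectively run at the 0.143 level, nearly three times the nominal 0.05.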

The problem of multiple tests also occurs when two groups are compared with
respect to many variables. For example, suppose we have two groups and
wish to compare them with respect to three measures of folate status.
Once again, the fact that three tests are performed makes it much more
likely than 5% that something will be statistically significant at a
nominal 0.05 level when there is no real underlying difference between
the two groups. Hotelling's T^{2} statistic could be used to test
the hypothesis that the means of all variables are equal. A Bonferroni
adjustment could be used, as well.
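As a sketch of the multivariate alternative (the data and group sizes below are hypothetical, and numpy is assumed to be available), Hotelling's T^{2} for two groups can be computed directly:

```python
import numpy as np

def hotelling_t2(x, y):
    """Two-sample Hotelling's T^2 and its F transform.
    x, y: (n_i, p) arrays holding p measurements on each subject."""
    nx, p = x.shape
    ny = y.shape[0]
    d = x.mean(axis=0) - y.mean(axis=0)
    # pooled covariance matrix of the p measurements
    s = ((nx - 1) * np.cov(x, rowvar=False) +
         (ny - 1) * np.cov(y, rowvar=False)) / (nx + ny - 2)
    t2 = nx * ny / (nx + ny) * d @ np.linalg.solve(s, d)
    # refer f to an F distribution with p and nx+ny-p-1 degrees of freedom
    f = t2 * (nx + ny - p - 1) / ((nx + ny - 2) * p)
    return t2, f

# hypothetical data: two groups of 12 subjects, 3 folate measures each
rng = np.random.default_rng(0)
group1 = rng.normal(size=(12, 3))
group2 = rng.normal(size=(12, 3))
t2, f = hotelling_t2(group1, group2)
print(f"T^2 = {t2:.2f}, F = {f:.2f}")
```

A single T^{2} test of all three means at once sidesteps the multiplicity problem entirely.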

An investigator compares three treatments A, B, and C. The only significant difference is between B and C with a nominal P value of 0.04. However, when any multiple comparison procedure is used, the result no longer achieves statistical significance. Across town, three different investigators are conducting three different experiments. One is comparing A with B, the second is comparing A with C, and the third is comparing B with C. Lo and behold, they get the same P values as the investigator running the combined experiment. The investigator comparing B with C gets a P value of 0.04 and has no adjustment to make; thus, the 0.04 stands and the investigator will have an easier time of impressing others with the result.

Why should the investigator who analyzed all three treatments at once be penalized when the investigator who ran a single experiment is not? This is part of Kenneth Rothman's argument that there should be no adjustment for multiple comparisons; that all significant results should be reported and each result will stand or fall depending on whether it is replicated by other scientists.

I find this view shortsighted. The two P-values are quite different, even though they are both 0.04. In the first case (big experiment) the investigator felt it necessary to work with three groups. This suggests a different sort of intuition than that of the scientist who investigated the single comparison. The investigator working with many treatments should recognize that there is a larger chance of achieving nominal significance and ought to be prepared to pay the price to insure that many false leads do not enter the scientific literature. The scientist working with the single comparison, on the other hand, has narrowed down the possibilities from the very start and can correctly have more confidence in the result. For the first scientist, it's, "I made 3 comparisons and just one was barely significant." For the second scientist, it's, "A difference, right where I expected it!"

The discussion of the previous section may be unrealistically tidy. Suppose, for example, the investigator working with three treatments really felt that the only important comparison was between treatments B and C and that treatment A was added only at the request of the funding agency or a fellow investigator. In that case, I would argue that the investigator be allowed to compare B and C without any adjustment for multiple comparisons because the comparison was planned in advance and had special status.

It is difficult to give a firm rule for when multiple comparison procedures are required. The most widely respected statistician in the field was Rupert G. Miller, Jr., who made no pretense of being able to resolve the question but offered some guidelines in his book *Simultaneous Statistical Inference*, 2nd edition (Chapter 1, section 5, emphasis is his):

Time has now run out. There is nowhere left for the author to go but to discuss just what constitutes a family [of comparisons to which multiple comparison procedures are applied]. This is the hardest part of the book because this is where statistics takes leave of mathematics and must be guided by subjective judgment. . . .

Provided the nonsimultaneous statistician [one who never adjusts for multiple comparisons] and his client are well aware of their error rates for groups of statements, and feel the group rates are either satisfactory or unimportant, the author has no quarrel with them. Every man should get to pick his own error rates. Simultaneous techniques certainly do not apply, or should not be applied, to every problem.

[I]t is important to distinguish between two types of experiments. The first is the preliminary, search-type experiment concerned with uncovering leads that can be pursued further to determine their relevance to the problem. The second is the final, more definitive experiment from which conclusions will be drawn and reported. Most experiments will involve a little of both, but it is conceptually convenient to treat them as being basically distinct. The statistician does not have to be as conservative for the first type as for the second, but simultaneous techniques are still quite useful for keeping the number of leads that must be traced within reasonable bounds. In the latter type multiple comparison techniques are very helpful in avoiding public pronouncements of red herrings simply because the investigation was very large.

The *natural family* for the author *in the majority of instances* is the *individual experiment* of a *single researcher*. . . . The loophole is of course the clause *in the majority of instances*. Whether or not this rule of thumb applies will depend upon the size of the experiment. Large single experiments cannot be treated as a whole without an unjustifiable loss in sensitivity. . . . There are no hard-and-fast rules for where the family lines should be drawn, and the statistician must rely on his own judgment for the problem at hand.

If sample sizes are unequal, exact multiple comparison procedures may
not be available. In 1984, Hayter showed that the unequal sample size
modification of Tukey's HSD is conservative; that is, the true
significance level is no greater than the nominal significance level.
Some computer programs perform multiple comparison procedures for unequal
sample sizes by pretending that the sample sizes are equal to their
harmonic mean. This is called an *unweighted means analysis*. It
was developed before the time of computers when the more precise
calculations could not be done by hand. When the first computer programs
were written, the procedure was implemented because analysts were used to
it and it was easy to program. Thus, we found ourselves using computers
to perform an analysis that was developed to be done by hand because
there were no computers! The unweighted means analysis is not
necessarily a bad thing to do if the sample sizes are all greater than
10, say, and differ by only 1 or 2, but this approximate test is becoming
unnecessary as software packages are updated.
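For illustration, the harmonic mean substitution is a one-liner (the group sizes here are hypothetical):

```python
from statistics import harmonic_mean

# Unweighted means analysis replaces unequal group sizes with their
# harmonic mean before applying an equal-n multiple comparison procedure.
sizes = [12, 11, 13, 10]  # hypothetical unequal group sizes
n_tilde = harmonic_mean(sizes)
print(round(n_tilde, 2))
```

Because sizes differ by only a few subjects, the harmonic mean lands close to the ordinary average, which is why the approximation is tolerable in that setting.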

My philosophy for handling multiple comparisons is identical to that of Cook RJ and Farewell VT (1996), "Multiplicity Considerations in the Design and Analysis of Clinical Trials," Journal of the Royal Statistical Society, Series A, 159, 93-110. (The link will get you to the paper if you subscribe to JSTOR.) An extreme view that denies the need for multiple comparison procedures is Rothman K (1990), "No Adjustments Are Needed for Multiple Comparisons," Epidemiology, 1, 43-46.

I use Tukey's HSD for the most part, but I'm always willing to use unadjusted t tests for planned comparisons. One general approach is to use both Fisher's LSD and Tukey's HSD. Differences that are significant according to HSD are judged significant; differences that are not significant according to LSD are judged nonsignificant; differences that are judged significant by LSD but not by HSD are judged open to further investigation.

For sample size calculations, I apply the standard formula for the two sample t test to the most important comparisons, with a Bonferroni adjustment of the level of the test. This guarantees me the necessary power for critical pairwise comparisons.
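A rough version of that calculation, using a normal approximation rather than the exact t-based formula (so the n it returns is slightly optimistic) and hypothetical inputs, might look like:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, alpha=0.05, power=0.80, m=1):
    """Approximate per-group n for a two-sample comparison detecting a
    standardized difference delta, with a Bonferroni adjustment for m
    planned comparisons (normal approximation to the t-based formula)."""
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / (2 * m))  # two-tailed, Bonferroni-adjusted
    z_beta = z(power)
    return ceil(2 * ((z_alpha + z_beta) / delta) ** 2)

print(n_per_group(0.5))        # one planned comparison
print(n_per_group(0.5, m=3))   # three planned comparisons need more subjects
```

Tightening the per-test level from 0.05 to 0.05/3 raises the required sample size by roughly a third, which is the price of the guaranteed power for each critical comparison.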