**One Sided Tests**

Gerard E. Dallal, PhD

One common criticism of significance tests is that no null hypothesis
is ever true. Two population means or proportions are *always*
unequal as long as measurements have been carried out to enough decimal
places. Why, then, should we bother testing whether the means are equal?
The answer is contained in a comment by John Tukey regarding multiple
comparisons: The alternative hypothesis says we are unsure of the
*direction* of the difference. In keeping with Tukey's comment,
tests of the null hypothesis that two population means or proportions are
equal,

$$H_0: \mu_1 = \mu_2,$$

are almost always two-sided (or two-tailed^{*}). That is, the
alternative hypothesis is

$$H_1: \mu_1 \neq \mu_2,$$

which says that the difference between means or proportions can be positive or negative.

Every so often, someone claims that a difference, if there is one, can
be in only one direction. For example, an investigator might claim that
newly proposed treatment *N* must be at least as good as the
standard treatment, *S*. *It cannot be worse*, especially when
the "Standard" is a placebo. One-sided tests have been proposed for such
circumstances. Suppose small values are
good, that is, the goal of the treatment is to produce small values of
something like cholesterol, blood pressure, or weight. The null
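hypothesis of equal effectiveness is

$$H_0: \mu_N = \mu_S,$$

where $\mu_N$ and $\mu_S$ denote the population means under the new and standard treatments. The alternative hypothesis states that the difference can be in only one direction:

$$H_1: \mu_N < \mu_S.$$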

For example, an investigator might propose using a one-tailed test to
test the efficacy of a cholesterol lowering drug because the drug cannot
raise cholesterol. With a one-tailed test, the hypothesis of no
difference is rejected if and only if the subjects taking the drug have
cholesterol levels significantly lower than those of controls. Outcomes
in which subjects taking the drug have cholesterol levels *higher*
than those of controls are treated as failing to show a
difference **no matter how much higher they may be**.
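To make the asymmetry concrete, here is a minimal Python sketch. It assumes a large-sample test, so the standard normal distribution stands in for the t distribution, and the function names and the z values are hypothetical choices made for illustration, not taken from any study.

```python
from statistics import NormalDist

# Standard normal reference distribution (large-sample approximation; an assumption).
STD_NORMAL = NormalDist()


def one_sided_p_lower(z: float) -> float:
    """P value when only small (negative) values of z count as evidence."""
    return STD_NORMAL.cdf(z)


def two_sided_p(z: float) -> float:
    """P value when large departures in either direction count as evidence."""
    return 2 * STD_NORMAL.cdf(-abs(z))


# Hypothetical test statistics: one in the expected direction, two in the other.
for z in (-2.0, 2.0, 5.0):
    print(
        f"z = {z:+.1f}: one-sided p = {one_sided_p_lower(z):.4f}, "
        f"two-sided p = {two_sided_p(z):.4g}"
    )
```

For statistics in the unexpected direction, the one-sided P value moves toward 1 as the difference grows, while the two-sided P value moves toward 0.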

One-tailed tests make it easier to reject the null hypothesis when the
alternative is true. A large sample, two-sided, 0.05 level t test puts a
probability of 0.025 in each tail. To reject the null hypothesis of no
difference in means in the expected direction, it needs a t statistic of
less than -1.96. A one-sided test puts all of the probability into
a single tail. It rejects the hypothesis for values
of t less than -1.645. Therefore, a one-sided test is more likely
to reject the null hypothesis *when the difference is in the expected
direction*. This makes one-sided tests very attractive to those whose
definition of success is having a statistically significant result.
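The cutoffs quoted above, and the gain in power they imply, can be checked with a short Python sketch. It assumes the large-sample normal approximation, and `TRUE_SHIFT` is a hypothetical value for the mean of the test statistic under the alternative, chosen only for illustration.

```python
from statistics import NormalDist

# Standard normal reference distribution (large-sample approximation; an assumption).
STD_NORMAL = NormalDist()
ALPHA = 0.05

# Lower-tail cutoffs when small values of the statistic favor the new treatment.
two_sided_cutoff = STD_NORMAL.inv_cdf(ALPHA / 2)  # about -1.96
one_sided_cutoff = STD_NORMAL.inv_cdf(ALPHA)      # about -1.645


def power_lower_tail(cutoff: float, shift: float) -> float:
    """Chance the statistic falls below `cutoff` when it is normal with mean `shift`."""
    return STD_NORMAL.cdf(cutoff - shift)


# Hypothetical mean of the test statistic when the drug really lowers cholesterol.
TRUE_SHIFT = -2.0

print(f"two-sided cutoff: {two_sided_cutoff:.3f}")
print(f"one-sided cutoff: {one_sided_cutoff:.3f}")
print(f"chance of rejecting in the expected direction, two-sided: "
      f"{power_lower_tail(two_sided_cutoff, TRUE_SHIFT):.3f}")
print(f"chance of rejecting in the expected direction, one-sided: "
      f"{power_lower_tail(one_sided_cutoff, TRUE_SHIFT):.3f}")
```

The one-sided cutoff is closer to zero, so the same true difference clears it more often.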

What damns one-tailed tests in the eyes of most statisticians is the
demand that *all* differences in the unexpected direction--large and
small--be treated as simply nonsignificant. I have never seen a
situation where researchers were willing to do this in practice. In
practice, things can *always* get worse! Suppose subjects taking the
new cholesterol lowering drug ended up with levels 50 mg/dl *higher* than those of
the control group. The use of a one-tailed test implies that the
researchers would chalk it up to random variation and pursue it no
further. However, we know they would immediately begin looking for an
underlying cause and question why the drug was considered for human
intervention trials.

A case in point is the Finnish Alpha-Tocopherol, Beta-Carotene Cancer Prevention Trial (ATBC) ("The Effect Of Vitamin E and Beta-Carotene on the Incidence of Lung Cancer and other Cancers in Male Smokers" N Engl J Med 1994;330:1029-35). There were 18% more lung cancers diagnosed and 8% more overall deaths in study participants taking beta carotene. If a one-sided analysis had been proposed for the trial, these results would have been ignored on the grounds that they were the result of unlikely random variability under a hypothesis of no difference between beta-carotene and placebo. When the results of the trial were first reported, random variability was suggested as one of the many possible reasons for the anomalous outcome. However, after these results were reported, investigators conducting the Beta Carotene and Retinol Efficacy Trial (CARET), a large study of the combination of beta carotene and vitamin A as preventive agents for lung cancer in high-risk men and women, terminated the intervention after an average of four years of treatment and told the 18,314 participants to stop taking their vitamins. Interim study results indicated that the supplements provide no benefit and may be causing harm. There were 28% more lung cancers diagnosed and 17% more deaths in participants taking beta carotene and vitamin A than in those taking placebos. Thus, the CARET study replicated the ATBC findings. More details can be found in the NIH fact sheets on the two trials.

It is surprising to see one-sided tests still being used in the 21st century, even in a journal as renowned as the Journal of the American Medical Association. The study by Graat et al., "Effect of Daily Vitamin E and Multivitamin-Mineral Supplementation on Acute Respiratory Tract Infections in Elderly Persons: A Randomized Controlled Trial" (JAMA, August 14, 2002;288(6):715-721), provides a perfect illustration of how one-sided tests can leave an investigator... chagrined. The Statistical Analyses section (p 717) contains the comment, "Although the initial sample size was based on a 1-sided test on the assumption that effects would only be seen in 1 direction, after the study was completed the need for 2-sided tests became evident. P values are therefore based on 2-sided tests." One does have to admire the investigators for their honesty.

The usual 0.05 level two-tailed test puts half of the probability (2.5%) in each tail of the reference distribution, that is, the cutoff points for the t statistic are ±1.96. Some analysts have proposed two-sided tests with unequal tail areas. Instead of having 2.5% in each tail, there might be 4% in the expected direction and 1% in the other tail (for example, cutoffs of -1.75 and 2.33) as insurance against extreme results in the unexpected direction. However, there is no consensus or obvious choice for the way to divide the probability (e.g., 0.005/0.045, 0.01/0.04, 0.02/0.03) and some outcomes might give the false impression that the split was chosen after the fact to ensure statistical significance. This leads us back to the usual two-tailed test (0.025, 0.025).
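As a rough check on the cutoffs for unequal tail areas, the following Python sketch computes them under the large-sample normal approximation. The helper `split_cutoffs` is a hypothetical name introduced here, and the lower tail is assumed to be the expected direction.

```python
from statistics import NormalDist

# Standard normal reference distribution (large-sample approximation; an assumption).
STD_NORMAL = NormalDist()


def split_cutoffs(lower_area: float, upper_area: float) -> tuple[float, float]:
    """Return (lower, upper) rejection cutoffs for the given tail areas."""
    lower = STD_NORMAL.inv_cdf(lower_area)
    upper = STD_NORMAL.inv_cdf(1 - upper_area)
    return lower, upper


# Each pair sums to 0.05; the first is the usual equal split.
for lower_area, upper_area in [(0.025, 0.025), (0.03, 0.02), (0.04, 0.01), (0.045, 0.005)]:
    lower, upper = split_cutoffs(lower_area, upper_area)
    print(f"{lower_area:.3f}/{upper_area:.3f}: reject if z < {lower:.2f} or z > {upper:.2f}")
```

Every split keeps the overall level at 0.05, which is why nothing in the arithmetic singles out one division over another.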

Marvin Zelen dismisses one-sided tests in another way--he finds them
unethical! His argument is as simple as it is elegant. Put in terms of
comparing a new treatment to standard, anyone who *insists* on a
one-tailed test is saying the new treatment *cannot* do worse than the
standard. If the new treatment has any effect, it can only do better. However,
if that's the case right at the start of the study, then it is unethical not
to give the new treatment to everyone!

-------------

^{*}Some statisticians find the word
*tails* to be ambiguous and use *sided* instead. *Tails*
refers to the distribution of the test statistic and there can be many
test statistics. While the most familiar test statistic might lead to a
two-tailed test, other statistics might not. When the hypothesis
$H_0: \mu_1 = \mu_2$ is tested against the alternative
of inequality, it is rejected for large positive values of t (which lie
in the upper tail) and large negative values of t (which lie in the lower
tail). However, this test can also be performed by using the square of
the t or z statistics ($t^2 = F_{1,n}$; $z^2 = \chi^2_1$).
Then only large values of the test statistic will lead to rejecting the
null hypothesis. Since only one tail of the reference distribution leads
to rejection, it is a one-*tailed* test, even though the alternative
hypothesis is two-*sided*.
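For the large-sample z statistic, the equivalence can be written out in one line. This is a sketch in notation introduced here for illustration: $Z$ is a standard normal variable and $z$ is the observed value of the statistic.

$$P(|Z| \ge |z|) = P(Z^2 \ge z^2) = P(\chi^2_1 \ge z^2)$$

The left-hand side adds the areas in two tails of the normal distribution. The right-hand side is the area in the single upper tail of the chi-square distribution with 1 degree of freedom. For example, normal cutoffs of ±1.96 correspond to a chi-square cutoff of 1.96² ≈ 3.84.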

*Side* refers to the hypothesis, namely, on which side of 0
the difference $\mu_1 - \mu_2$ lies (positive or negative).
Since this is a statement about the hypothesis, it is independent of the
choice of test statistic. Nevertheless, the terms *two-tailed* and
*two-sided* are often used interchangeably.