P Values

To understand P values, you have to understand fixed level testing. With fixed level testing, a null hypothesis is proposed (usually, specifying no treatment effect) along with a level for the test, usually 0.05. All possible outcomes of the experiment are listed in order to identify extreme outcomes that would occur less than 5% of the time in aggregate if the null hypothesis were true. This set of values is known as the critical region. They are critical because if any of them is observed, something extreme has occurred. Data are now collected, and if any one of those extreme outcomes occurs, the results are said to be significant at the 0.05 level. The null hypothesis is rejected at the 0.05 level of significance and one star (*) is printed somewhere in a table. Some investigators note extreme outcomes that would occur less than 1% of the time and print two stars (**) if any of those are observed.
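The enumeration of extreme outcomes can be made concrete with a small sketch. Assuming a simple coin-flipping experiment with 20 flips (a setting invented here for illustration, not one from the text), the following Python finds the head counts so extreme that, together, they occur less than 5% of the time when the coin is fair:

```python
from math import comb

def critical_region(n, alpha=0.05):
    """Two-sided critical region for testing a fair coin with n flips:
    the most extreme head counts whose total probability under the
    null hypothesis (P(heads) = 0.5) stays at or below alpha."""
    cum = 0.0   # probability of the lower tail accumulated so far
    k = -1      # largest head count currently inside the lower tail
    # Grow both tails inward until adding the next pair of outcomes
    # would push the aggregate probability past alpha.
    while 2 * (cum + comb(n, k + 1) / 2 ** n) <= alpha:
        k += 1
        cum += comb(n, k) / 2 ** n
    region = list(range(k + 1)) + list(range(n - k, n + 1))
    return region, 2 * cum

region, size = critical_region(20)
print(region)           # [0, 1, 2, 3, 4, 5, 15, 16, 17, 18, 19, 20]
print(round(size, 4))   # 0.0414
```

For 20 flips the critical region turns out to be 5 or fewer heads or 15 or more, with aggregate probability about 0.041: observing any of those counts would be declared significant at the 0.05 level.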

The procedure is known as fixed level testing because the level of the test is fixed prior to data collection. In theory if not in practice, the procedure begins by specifying the hypothesis to be tested and the test statistic to be used, along with the set of outcomes that will cause the hypothesis to be rejected. Only then are data collected to see whether they lead to rejection of the null hypothesis.

Many researchers quickly realized the limitations of reporting only whether a result achieved the 0.05 level of significance. Was a result just barely significant or wildly so? Would data that were significant at the 0.05 level be significant at the 0.01 level? At the 0.001 level? Even if the results are wildly statistically significant, is the effect large enough to be of any practical importance?

As computers became readily available, it became common practice to report the observed significance level (or P value)--the smallest fixed level at which the null hypothesis can be rejected. If your personal fixed level is greater than or equal to the P value, you would reject the null hypothesis. If your personal fixed level is less than the P value, you would fail to reject it. For example, if a P value is 0.027, the results are significant for all fixed levels greater than 0.027 (such as 0.05) and not significant for all fixed levels less than 0.027 (such as 0.01). A person who uses the 0.05 level would reject the null hypothesis, while a person who uses the 0.01 level would fail to reject it.
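The decision rule can be sketched in a few lines of Python, using the 0.027 example above:

```python
p_value = 0.027   # the worked example from the text

decisions = {}
for alpha in (0.05, 0.01):
    # Reject whenever the personal fixed level is at least the P value.
    decisions[alpha] = "reject H0" if p_value <= alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {decisions[alpha]}")
# alpha = 0.05: reject H0
# alpha = 0.01: fail to reject H0
```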

A P value is often described as the probability of seeing results as extreme as or more extreme than those actually observed if the null hypothesis were true. While this description is correct, it invites the question of why we should be concerned with the probability of events that have not occurred! (As Harold Jeffreys quipped, "What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.") In fact, we care because the P value is just another way to describe the results of fixed level tests.

Every so often, a call is made for a ban on significance tests. Papers and books are written, conferences are held, and proceedings are published. The main reason behind these movements is that P values tell us nothing about the magnitudes of the effects that lead us to reject or fail to reject the null hypothesis. Significance tests blur the distinction between statistical significance and practical importance. It is possible for a difference of little practical importance to achieve a high degree of statistical significance. It is also possible for clinically important differences to be missed because an experiment lacks the power to detect them. However, significance tests provide a useful summary of the data, and these concerns are easily remedied by supplementing significance tests with the appropriate confidence intervals for the effects of interest.

When hypotheses of equal population means are tested, determining whether P is less than 0.05 is just another way of examining a confidence interval for the mean difference to see whether it excludes 0. The hypothesis of equality will be rejected at level α if and only if a 100(1-α)% confidence interval for the mean difference fails to contain 0. For example, the hypothesis of equality of population means will be rejected at the 0.05 level if and only if a 95% CI for the mean difference does not contain 0. The hypothesis will be rejected at the 0.01 level if and only if a 99% CI does not contain 0, and so on.
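This duality can be checked numerically. In the sketch below (the data are simulated paired differences; the mean, SD, and sample size are assumptions chosen for illustration), the one-sample t test rejects at level α exactly when the 100(1-α)% confidence interval excludes 0:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 25 simulated paired cholesterol differences (mg/dl).
rng = np.random.default_rng(0)
diffs = rng.normal(loc=5, scale=10, size=25)

t_stat, p = stats.ttest_1samp(diffs, popmean=0)

for alpha in (0.05, 0.01):
    # 100(1-alpha)% confidence interval for the mean difference
    half = stats.t.ppf(1 - alpha / 2, df=len(diffs) - 1) * stats.sem(diffs)
    lo, hi = diffs.mean() - half, diffs.mean() + half
    rejects = p < alpha
    excludes_zero = lo > 0 or hi < 0
    print(f"alpha={alpha}: reject={rejects}, CI=({lo:.1f}, {hi:.1f})")
    assert rejects == excludes_zero   # the duality described in the text
```

The assertion holds for any data set, because both statements reduce to the same comparison of |t| with the critical value.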


Case   Observed Difference (mg/dl)   SE     t     P value   Statistically Significant?   95% CI      Practical Importance
1      2                             0.5    4     <0.0001   Y                            (1,3)       N
2      30                            5      6     <0.0001   Y                            (20,40)     Y
3      30                            14     2.1   0.032     Y                            (2,58)      ?
4      1                             1      1     0.317     N                            (-1,3)      N
5      2                             30     0.1   0.947     N                            (-58,62)    ?
6      30                            16     1.9   0.061     N                            (-2,62)     ?

This is a good time to revisit the cholesterol studies presented during the discussion of confidence intervals. We assumed a treatment mean difference of a couple of units (mg/dl) was of no consequence, but differences of 10 mg/dl and up had important public health and policy implications. The discussion and interpretation of the 6 cases remains the same, except that we can add the phrase "statistically significant" to describe the cases where the P values are less than 0.05.

Significance tests can tell us whether a difference between sample means is statistically significant, that is, whether the observed difference is larger than would be due to random variation if the underlying population difference were 0. But significance tests do not tell us whether the difference is of practical importance. Statistical significance and practical importance are distinct concepts.

In cases 1-3, the data are judged inconsistent with a population mean difference of 0. The P values are less than 0.05 and the 95% confidence intervals do not contain 0. The sample mean difference is much larger than can be explained by random variability about a population mean difference of 0. In cases 4-6, the data are consistent with a population mean difference of 0. The P values are greater than 0.05 and the 95% confidence intervals contain 0. The observed difference is consistent with random variability about 0.

Case 1: There is a statistically significant difference between the diets, but the difference is of no practical importance, being no greater than 3 mg/dl.

Case 2: The difference is statistically significant and is of practical importance even though the confidence interval is 20 mg/dl wide. This case illustrates that a wide confidence interval is not necessarily a bad thing if all of the values point to the same conclusion. Diet 2 is clearly superior to diet 1, even though the likely benefit can't be specified to within a range of 20 mg/dl.

Case 3: The difference is statistically significant but it may or may not be of practical importance. The confidence interval is too wide to say for sure. The difference may be as little as 2 mg/dl, but could be as great as 58 mg/dl. More study may be needed. However, knowledge of a difference between the diets, regardless of its magnitude, may lead to research that exploits and enhances the beneficial effects of the more healthful diet.

Case 4: The difference is not statistically significant and we are confident that if there is a real difference it is of no practical importance.

Cases 5 and 6: The difference is not statistically significant, so we cannot claim to have demonstrated a difference. However, the population mean difference is not well enough determined to rule out all differences of practical importance.

Cases 5 and 6 require careful handling. Case 6, unlike Case 5, seems to rule out any advantage of practical importance for Diet 1, so it might be argued that Case 6 is like Case 3 in that both of them are consistent with important and unimportant advantages for Diet 2 while neither suggests any advantage to Diet 1.

Many analysts accept illustrations such as these as a blanket indictment of significance tests. I prefer to see them as a warning to continue beyond significance tests to see what other information is contained in the data. In some situations, it's important to know if there is an effect no matter how slight, but in most cases it's hard to justify publishing the results of a significance test without saying something about the magnitude of the effect*. If a result is statistically significant, is it of practical importance? If the result is not statistically significant, have effects of practical importance been ruled out? If a result is not statistically significant but has not ruled out effects of practical importance, YOU HAVEN'T LEARNED ANYTHING!

Case 5 deserves another visit in order to underscore an important lesson that is usually not appreciated the first time 'round: "Absence of evidence is not evidence of absence!" In case 5, the observed difference is 2 mg/dl, the value 0 is nearly at the center of the confidence interval, and the P value for testing the equality of means is 0.947. It is correct to say that the difference between the two diets did not reach statistical significance or that no statistically significant difference was shown. Some researchers refer to such findings as "negative", yet it would be incorrect to say that the diets are the same. The absence of evidence for a difference is not the same thing as evidence of absence of an effect. In BMJ, 290 (1985), 1002, Chalmers proposed outlawing the term "negative trial" for just this reason.

When the investigator would like to conjecture about the absence of an effect, the most effective procedure is to report confidence intervals so that readers have a feel for the sensitivity of the experiment. In cases 4 and 5, the researchers are entitled to say that there was no significant finding. Both have P values much larger than 0.05. However, only in case 4 is the researcher entitled to say that the two diets are equivalent: the best available evidence is that they produce mean cholesterol values within 3 mg/dl of each other, which is probably too small to worry about. One can only hope that a claim of no difference based on data such as in case 5 would never see publication.

Should P values be eliminated from the research literature in favor of confidence intervals? This discussion provides some support for this proposal, but there are many situations where the magnitude of an effect is not as important as whether or not an effect is present. I have no objection to using P values to focus on the presence or absence of an effect, provided the confidence intervals are available for those who want them, statistical significance is not mistaken for practical importance, and absence of evidence is not mistaken for evidence of absence.

As useful as confidence intervals are, they are not a cure-all. They offer estimates of the effects they measure, but only in the context in which the data were collected. It would not be surprising to see confidence intervals vary between studies much more than any one interval would suggest. This can be the result of the technician, measurement technique, or the particular group of subjects being measured, among other causes. This is one of the things that plagues meta-analysis, even in medicine where the outcomes are supposedly well-defined. This is yet another reason why significance tests are useful. There are many situations where the most useful piece of information that a confidence interval provides is simply that there is an effect or treatment difference.

What P values are not!

A P value is the probability of observing data as extreme as or more extreme than the actual outcome when the null hypothesis is true. A small P value means that data as extreme as these are unlikely under the null hypothesis. The P value is NOT the probability that the null hypothesis is true. A small P value makes us reject the null hypothesis because an event has occurred that would be unlikely if H0 were true.

Classical (or frequentist) statistics does not allow us to talk about the probability that a hypothesis is true. Statements such as, "There's a 5 percent chance that these two diets are equally effective at lowering cholesterol" have no meaning in classical statistics. Either they are equally effective or they aren't. All we can talk about is the probability of seeing certain outcomes if the hypothesis were true**.

The reason these methods work regardless is that, although we haven't said so explicitly, there is a tacit presumption that the alternative hypothesis provides a more reasonable explanation for the data. However, it's not built into the methods, and need not be true. It is possible to reject a hypothesis even though it is the best explanation for the data, as the following two examples illustrate.

Example 1: A single value is observed from a normal distribution with a standard deviation of 1. Suppose there are only two possibilities: either the population mean is 0 or it is 100. Let H0 be μ=0 and H1 be μ=100. Suppose a value of 3.8 is observed. The P value is 0.0001 because, if the population mean is 0, the probability of observing a value as or more extreme than 3.8 is 0.0001. We have every right to reject H0 at the 0.05, 0.01, or even the 0.001 level of significance. However, the probability of observing 3.8 is even smaller under the alternative hypothesis! Even though we can reject H0 at the usual levels of significance, common sense says that the null hypothesis is more likely to be true than the alternative hypothesis.
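The arithmetic of this example can be verified with SciPy. (The density of 3.8 under H1 is compared on the log scale because it underflows ordinary floating point.)

```python
from scipy.stats import norm

x = 3.8
# P value under H0 (mu = 0): probability of a value as or more extreme
# than 3.8 in the direction of the alternative
p = norm.sf(x, loc=0, scale=1)
print(round(p, 4))   # 0.0001

# Log-density of the observation under each hypothesis
log_like_h0 = norm.logpdf(x, loc=0, scale=1)
log_like_h1 = norm.logpdf(x, loc=100, scale=1)
print(log_like_h0 > log_like_h1)   # True: H0 explains 3.8 far better than H1
```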

Example 2: Suppose only 35 heads occur in 100 flips of a coin. The P value for testing the null hypothesis that the coin is fair (equally likely to come up heads or tails) against the alternative that it is unfair is 0.0035. We can reject the hypothesis that the coin is fair at the 0.01 level of significance, but does this mean that there is less than a 1% chance that the coin is fair? It depends on things other than the number of heads and tails. If the coin were a gambling device belonging to someone else and it was causing you to lose money, you might think it highly unlikely that the coin was fair. However, if the coin was taken from a roll of newly minted coins just delivered to your bank and you did the flipping yourself by letting the coin bounce off a soft surface (to foil any possible regularity in your flipping motion), you might still find it quite likely that the coin is fair. Standard statistical theory cannot answer this question.
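The 0.0035 figure is the exact two-sided binomial P value, which can be computed with nothing beyond the Python standard library:

```python
from math import comb

n, heads = 100, 35
# Exact two-sided P value for H0: the coin is fair. With P(heads) = 0.5
# the distribution is symmetric, so the two-sided P value is twice the
# probability of 35 or fewer heads.
p = 2 * sum(comb(n, k) for k in range(heads + 1)) / 2 ** n
print(round(p, 4))   # 0.0035
```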

*I was asked recently why confidence intervals were common in the medical literature but not in other fields. My immediate, tongue-partially-in-cheek response was that for a confidence interval to be useful, you had to have some idea of what it meant! Many areas of investigation summarize their experiments in scales and indices that often lack an operational interpretation. Some scales are the sum of positive responses to items on a questionnaire. Others are composites of related but different components. Those scoring higher are different from those scoring lower, but it's often not clear what a 1 or 10 unit difference means in any sense, let alone in terms of practical importance.

**While standard frequentist methods cannot answer the question, another approach to statistics--Bayesian methods--attempts to provide an answer. If prior to flipping the coin, you could quantify the probability that the coin is fair, Bayesian methods provide a way to update this probability after the coin is flipped. The trick is in coming up with the initial probability. For example, before flipping the coin, what is the probability that the coin is fair?
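As a purely hypothetical sketch of such an update for the coin of Example 2: assume the only alternative to a fair coin is one with a 35% chance of heads, and assign a 50-50 prior to "fair". (Both assumptions are invented here for illustration; nothing in the text specifies them.)

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k heads in n flips when P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

prior_fair = 0.5                           # assumed prior
like_fair = binom_pmf(35, 100, 0.5)        # likelihood under "fair"
like_biased = binom_pmf(35, 100, 0.35)     # likelihood under assumed alternative

# Bayes' rule: posterior probability that the coin is fair
posterior_fair = prior_fair * like_fair / (
    prior_fair * like_fair + (1 - prior_fair) * like_biased
)
print(round(posterior_fair, 3))
```

With these particular assumptions the posterior probability of fairness is small, but a different prior or a different alternative would give a different answer, which is exactly the point of the footnote.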

[back to The Little Handbook of Statistical Practice]

Copyright © 2000 Gerard E. Dallal