Significance Tests / Hypothesis Testing
Suppose someone suggests a hypothesis that a certain population mean is 0. Recalling the convoluted way in which statistics works, one way to do this would be to construct a 95% confidence interval for the population mean and see whether it contains 0.
We fail to reject the hypothesis if

$$\bar{x} - 1.96\,SE(\bar{x}) \;\le\; 0 \;\le\; \bar{x} + 1.96\,SE(\bar{x}),$$

which can be rewritten

$$\left|\frac{\bar{x}}{SE(\bar{x})}\right| \;\le\; 1.96.$$

On the other hand, we reject the hypothesis if

$$\left|\frac{\bar{x}}{SE(\bar{x})}\right| \;>\; 1.96.$$

The statistic x̄/SE(x̄) is denoted by the symbol t. The test can be summarized as: reject the hypothesis that the population mean is 0 if and only if the absolute value of t is greater than 1.96.
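To make the arithmetic concrete, here is a minimal Python sketch of the calculation. The sample values are made-up numbers, and SE(x̄) is estimated as s/√n; note that with a sample this small, strict practice would take the critical value from a t distribution rather than the large-sample value 1.96 used in the text.

```python
import statistics

# Hypothetical sample (made-up numbers) from the population of interest.
x = [1.8, -0.4, 2.6, 0.9, 1.1, 3.0, -1.2, 2.2, 0.5, 1.7]

n = len(x)
se = statistics.stdev(x) / n ** 0.5   # SE(xbar) = s / sqrt(n)
t = statistics.mean(x) / se           # hypothesized population mean is 0

print(f"t = {t:.2f}")
print("reject H0" if abs(t) > 1.96 else "fail to reject H0")
```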
There is a 5% chance of obtaining a 95% CI that excludes 0 when 0 is in fact the population mean. For this reason, we say that this test has been performed at the 0.05 level of significance. Had a 99% CI been used, we would say that the test had been performed at the 0.01 level of significance, that is, the significance level (or simply the level) of the test is the probability of rejecting a hypothesis when it is true.
Statistical theory says that in many situations where a population value is estimated by drawing random samples, the sample and population values will be within two standard errors of each other 95% of the time. That is, 95% of the time,

$$\left|\,\text{sample value} - \text{population value}\,\right| \;\le\; 1.96\,SE.$$
This is the case for means, differences between means, proportions, differences between proportions, and regression coefficients. After an appropriate transformation, this is the case for odds ratios and even correlation coefficients.
We have used this fact to construct 95% confidence intervals by restating the result as

$$\text{sample value} - 1.96\,SE \;\le\; \text{population value} \;\le\; \text{sample value} + 1.96\,SE.$$
For example, a 95% CI for the difference between two population means, μx − μy, is given by

$$(\bar{x} - \bar{y}) \;\pm\; 1.96\,SE(\bar{x} - \bar{y}).$$
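A short Python sketch of this interval follows. The summary statistics are made-up numbers, and the standard error of the difference is computed in its large-sample form, √(s²ₓ/nₓ + s²ᵧ/nᵧ).

```python
import math

# Hypothetical summary statistics (made-up numbers) for two groups.
xbar, sx, nx = 103.2, 14.1, 40
ybar, sy, ny = 98.7, 15.3, 45

diff = xbar - ybar
se_diff = math.sqrt(sx**2 / nx + sy**2 / ny)  # SE(xbar - ybar)

lo, hi = diff - 1.96 * se_diff, diff + 1.96 * se_diff
print(f"95% CI for mu_x - mu_y: ({lo:.2f}, {hi:.2f})")
```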
When we perform significance tests, we reexpress this result by noting that 95% of the time

$$-1.96 \;\le\; \frac{\text{sample value} - \text{population value}}{SE} \;\le\; 1.96.$$
Suppose you wanted to test whether a population quantity were equal to 0. You could calculate the value of

$$t = \frac{\text{sample quantity} - 0}{SE},$$

which we get by inserting the hypothesized value of the population mean difference (0) for the population quantity. If t < −1.96 or t > 1.96 (that is, |t| > 1.96), we say the data are not consistent with a population mean difference of 0 (because t does not have the sort of value we expect to see when the population value is 0), or "we reject the hypothesis that the population mean difference is 0". If t were −3.7 or 2.6, we would reject the hypothesis that the population mean difference is 0 because we've observed a value of t that is unusual if the hypothesis were true.

If −1.96 ≤ t ≤ 1.96 (that is, |t| ≤ 1.96), we say the data are consistent with a population mean difference of 0 (because t has the sort of value we expect to see when the population value is 0), or "we fail to reject the hypothesis that the population mean difference is 0". For example, if t were 0.76, we would fail to reject the hypothesis that the population mean difference is 0 because we've observed a value of t that is unremarkable if the hypothesis were true.
This is called "fixed level testing" (at the 0.05 level).
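The decision rule itself is mechanical, as this small Python sketch of fixed-level testing at the 0.05 level shows (applied to the t values quoted above):

```python
def fixed_level_test(t, critical=1.96):
    """Two-sided fixed-level test: reject H0 iff |t| exceeds the critical value."""
    return "reject H0" if abs(t) > critical else "fail to reject H0"

for t in (-3.7, 2.6, 0.76):
    print(f"t = {t:5.2f} -> {fixed_level_test(t)}")
```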
For example, if H0: μx = μy (which can be rewritten H0: μx − μy = 0), the test statistic is

$$t = \frac{\bar{x} - \bar{y}}{SE(\bar{x} - \bar{y})}.$$

If |t| > 1.96, reject H0: μx = μy at the 0.05 level of significance.
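Here is a hedged Python sketch of this two-sample statistic. The data are made-up numbers, and the large-sample standard error √(s²ₓ/nₓ + s²ᵧ/nᵧ) is used; for samples this small the critical value would come from a t distribution, which is what scipy's unequal-variance (Welch) test, shown for comparison, does.

```python
import statistics
from scipy import stats

# Hypothetical data (made-up numbers) for the two groups.
x = [12.1, 9.8, 11.4, 10.6, 13.2, 9.9, 12.7, 11.1]
y = [10.2, 8.7, 9.5, 11.0, 9.1, 10.4, 8.9, 9.8]

se = (statistics.variance(x) / len(x) + statistics.variance(y) / len(y)) ** 0.5
t = (statistics.mean(x) - statistics.mean(y)) / se
print(f"t = {t:.2f}")                          # compare |t| to the critical value

print(stats.ttest_ind(x, y, equal_var=False))  # Welch's unequal-variance t test
```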
When we were constructing confidence intervals, it mattered whether the data were drawn from normally distributed populations, whether the population standard deviations were equal, and whether the sample sizes were large or small. The answers to these questions helped us determine the proper multiplier for the standard error. The same considerations apply to significance tests. The answers determine the critical value of t for a result to be declared statistically significant.
When populations are normally distributed with unequal standard deviations and the sample size is small, the multiplier used to construct CIs is based on the t distribution with noninteger degrees of freedom. The same noninteger degrees of freedom appear when performing significance tests. Many ways to calculate the degrees of freedom have been proposed. The statistical program package SPSS, for example, uses the Satterthwaite formula
$$\nu \;=\; \frac{(v_x + v_y)^2}{\dfrac{v_x^2}{n_x - 1} + \dfrac{v_y^2}{n_y - 1}}, \quad \text{where } v_x = \frac{s_x^2}{n_x} \text{ and } v_y = \frac{s_y^2}{n_y}.$$
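The formula translates directly into a few lines of Python; the summary statistics in the example call are made-up numbers.

```python
def satterthwaite_df(sx, nx, sy, ny):
    """Satterthwaite (Welch) degrees of freedom for a two-sample t test."""
    vx, vy = sx**2 / nx, sy**2 / ny
    return (vx + vy) ** 2 / (vx**2 / (nx - 1) + vy**2 / (ny - 1))

print(satterthwaite_df(14.1, 40, 15.3, 45))  # noninteger df, about 82.9 here
```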
The difference between type I & type II errors is illustrated by the following legal analogy. Under United States law, defendants are presumed innocent until proven guilty. The purpose of a trial is to see whether a null hypothesis of innocence is rejected by the weight of the data (evidence). A type I error (rejecting the null hypothesis when it is true) is "convicting the innocent." A type II error (failing to reject the null hypothesis when it is false) is "letting the guilty go free."
A common mistake is to confuse a type I or II error with its probability. α is not a type I error. It is the probability of a type I error. Similarly, β is not a type II error. It is the probability of a type II error.
There's a trade-off between α and β. Both are probabilities of making an error. With a fixed sample size, the only way to reduce the probability of making one type of error is to increase the probability of making the other. For the problem of comparing population means, consider the rejection region whose critical values are ±∞. This excludes every possible difference in sample means, so H0 will never be rejected. Since the null hypothesis will never be rejected, the probability of rejecting the null hypothesis when it is true is 0, that is, α = 0. However, since the null hypothesis will never be rejected, the probability of failing to reject the null hypothesis when it is false is 1, that is, β = 1.
Now consider the opposite extreme: a rejection region whose critical values are ±0. The rejection region includes every possible difference in sample means, so this test always rejects H0. Since the null hypothesis is always rejected, the probability of rejecting H0 when it is true is 1, that is, α = 1. On the other hand, since the null hypothesis is always rejected, the probability of failing to reject it when it is false is 0, that is, β = 0.
To recap, the test with a critical region bounded by ±∞ has α = 0 and β = 1, while the test with a critical region bounded by ±0 has α = 1 and β = 0. Now consider tests with intermediate critical regions bounded by ±k. As k increases from 0 to ∞, α decreases from 1 to 0 while β increases from 0 to 1.
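A simulation makes the trade-off visible. This Python sketch (sample size, SD, and mean difference are made-up settings) simulates two-sample experiments and estimates α as the rejection rate when the population means are equal and β as the non-rejection rate when they differ:

```python
import random
import statistics

random.seed(0)

def rejection_rate(mu_diff, k, n=24, sigma=12, trials=2000):
    """Proportion of simulated two-sample experiments with |t| > k."""
    rejections = 0
    for _ in range(trials):
        x = [random.gauss(mu_diff, sigma) for _ in range(n)]
        y = [random.gauss(0, sigma) for _ in range(n)]
        se = (statistics.variance(x) / n + statistics.variance(y) / n) ** 0.5
        t = (statistics.mean(x) - statistics.mean(y)) / se
        rejections += abs(t) > k
    return rejections / trials

for k in (0.0, 1.0, 1.96, 3.0):
    alpha = rejection_rate(0, k)       # rejecting H0 when it is true
    beta = 1 - rejection_rate(10, k)   # failing to reject H0 when it is false
    print(f"k = {k:4.2f}: alpha ~ {alpha:.3f}, beta ~ {beta:.3f}")
```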
Every statistics textbook contains discussions of α, β, type I error, type II error, and power. Analysts should be familiar with all of them. However, α is the only one that is encountered regularly in reports and published papers. That's because standard statistical practice is to carry out significance tests at the 0.05 level. As we've just seen, choosing a particular value for α determines the value of β.
The one place where β figures prominently in statistical practice is in determining sample size. When a study is being planned, it is possible to choose the sample size to set the power to any desired value for some particular alternative to the null hypothesis. To illustrate this, suppose we are testing the hypothesis that two population means are equal at the 0.05 level of significance by selecting equal sample sizes from the two populations. Suppose the common population standard deviation is 12. Then, if the population mean difference is 10, a sample of 24 subjects per group gives an 81% chance of rejecting the null hypothesis of no difference (power = 0.81, β = 0.19). A sample of 32 subjects per group gives a 91% chance of rejecting the null hypothesis of no difference (power = 0.91, β = 0.09). This is discussed in detail in the section on sample size determination.
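A hedged sketch of the calculation behind these figures, using a normal approximation with scipy; the 0.81 and 0.91 quoted above come from the exact t-based computation, and this approximation lands within about a point of them.

```python
from scipy.stats import norm

def approx_power(delta, sigma, n, alpha=0.05):
    """Normal-approximation power of a two-sided, two-sample test with
    equal group sizes n, common SD sigma, and true mean difference delta."""
    se = sigma * (2 / n) ** 0.5            # SE of the difference in means
    z = norm.ppf(1 - alpha / 2)            # 1.96 for alpha = 0.05
    return norm.cdf(delta / se - z) + norm.cdf(-delta / se - z)

for n in (24, 32):
    print(f"n = {n} per group: power ~ {approx_power(10, 12, n):.2f}")
```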