Coming Attractions: Where Are We Going?
Our goal is to get to the point where we can read, understand, and write statements like
Does the mean vitamin C blood level of smokers differ from that of nonsmokers? Let's suppose for a moment they do, with smokers tending to have lower levels. Nevertheless, we wouldn't expect every smoker to have levels lower than those of every nonsmoker. There would be some overlap in the two distributions. This is one reason why questions like this are usually answered in terms of population means, namely, how the mean level of all smokers compares to that of all nonsmokers.
The statistical tool used to answer such questions is the confidence interval (CI) for the difference between the two population means. But let's forget the formal study of statistics for the moment. What might you do to answer the question if you were on your own? You might get a random sample of smokers and nonsmokers, measure their vitamin C levels, and see how they compare. Suppose we've done it. In a sample of 40 Boston male smokers, vitamin C levels had a mean of 0.60 mg/dl and an SD of 0.32 mg/dl while in a sample of 40 Boston male nonsmokers (Strictly speaking, we can only talk about Boston area males rather than all smokers and nonsmokers. No one ever said research was easy.), the levels had a mean of 0.90 mg/dl and an SD of 0.35 mg/dl. The difference in means between nonsmokers and smokers is 0.30 mg/dl!
The difference of 0.30 looks impressive compared to means of 0.60 and 0.90, but we know that if we were to take another random sample, the difference wouldn't be exactly the same. It might be greater, it might be less. What kind of population difference is consistent with this observed value of 0.30 mg/dl? How much larger or smaller might the difference in population means be if we could measure all smokers and nonsmokers? In particular, is 0.30 mg/dl the sort of sample difference that might be observed if there were no difference in the population mean vitamin C levels? We estimate the difference in mean vitamin C levels at 0.30 mg/dl, but 0.30 mg/dl "give or take what"? This is where statistical theory comes in.
One way to answer these questions is by reporting a 95% confidence interval. A 95% confidence interval is an interval generated by a process that's right 95% of the time. Similarly, a 90% confidence interval is an interval generated by a process that's right 90% of the time and a 99% confidence interval is an interval generated by a process that's right 99% of the time. If we were to replicate our study many times, each time reporting a 95% confidence interval, then 95% of the intervals would contain the population mean difference. In practice, we perform our study only once. We have no way of knowing whether our particular interval is correct, but we behave as though it is. Here, the 95% confidence interval for the difference in mean vitamin C levels between nonsmokers and smokers is 0.15 to 0.45 mg/dl. Thus, not only do we estimate the difference to be 0.30 mg/dl, but we are 95% confident it is no less than 0.15 mg/dl or greater than 0.45 mg/dl.
In theory, we can construct intervals of any level of confidence from 0 to 100%. There is a tradeoff between the amount of confidence we have in an interval and its length. A 95% confidence interval for a population mean difference is constructed by taking the sample mean difference and adding and subtracting 1.96 standard errors of the mean difference. A 90% CI adds and subtracts 1.645 standard errors of the mean difference, while a 99% CI adds and subtracts 2.576 standard errors of the mean difference. The shorter the confidence interval, the less likely it is to contain the quantity being estimated. The longer the interval, the more likely it is to contain the quantity being estimated. Ninety-five percent has been found to be a convenient level for conducting scientific research, so it is used almost universally. Intervals of lesser confidence would lead to too many misstatements. Greater confidence would require more data to generate intervals of usable lengths.
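The arithmetic is easy to check directly. Here is a minimal Python sketch using the vitamin C summary statistics above and large-sample (z-based) intervals; the variable names are illustrative only:

```python
import math

# Summary statistics from the vitamin C example (mg/dl)
mean_nonsmokers, sd_nonsmokers, n_nonsmokers = 0.90, 0.35, 40
mean_smokers, sd_smokers, n_smokers = 0.60, 0.32, 40

diff = mean_nonsmokers - mean_smokers          # 0.30 mg/dl
# Standard error of the difference between two independent sample means
se = math.sqrt(sd_nonsmokers**2 / n_nonsmokers +
               sd_smokers**2 / n_smokers)

# Each interval is (sample difference) +/- (multiplier) x (standard error)
for conf, z in [(90, 1.645), (95, 1.960), (99, 2.576)]:
    lo, hi = diff - z * se, diff + z * se
    print(f"{conf}% CI: ({lo:.2f}, {hi:.2f}) mg/dl")
```

The 95% interval rounds to (0.15, 0.45) mg/dl, matching the interval quoted above, and the 90% and 99% intervals show the tradeoff between confidence and length.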
[Zero is a special value. If a difference between two means is 0, then the two means are equal!]
Confidence intervals contain population values found to be consistent with the data. If a confidence interval for a mean difference includes 0, the data are consistent with a population mean difference of 0. If the difference is 0, the population means are equal. If the confidence interval for a difference excludes 0, the data are not consistent with equal population means. Therefore, one of the first things to look at is whether a confidence interval for a difference contains 0. If 0 is not in the interval, a difference has been established. If a CI contains 0, then a difference has not been established. When we start talking about significance tests, we'll refer to differences that exclude 0 as a possibility as statistically significant. For the moment, we'll use the term sparingly.
A statistically significant difference may or may not be of practical importance. Statistical significance and practical importance are separate concepts. Some authors confuse the issues by talking about statistical significance and practical significance or by talking about, simply, significance. In these notes, there will be no mixing and matching. It's either statistically significant or practically important; any other combination should be consciously avoided.
Serum cholesterol values (mg/dl) in a free-living population tend to be between the mid 100s and the high 200s. It is recommended that individuals have serum cholesterols of 200 or less. A change of 1 or 2 mg/dl is of no importance. Changes of 10-20 mg/dl and more can be expected to have a clinical impact on the individual subject. Consider an investigation to compare mean serum cholesterol levels produced by two diets by looking at confidence intervals for μ₁ − μ₂ based on x̄₁ − x̄₂. High cholesterol levels are bad. If x̄₁ − x̄₂ is positive, the mean from diet 1 is greater than the mean from diet 2, and diet 2 is favored. If x̄₁ − x̄₂ is negative, the mean from diet 1 is less than the mean from diet 2, and diet 1 is favored. Here are six possible outcomes of the experiment.
              x̄₁ − x̄₂                    95% CI
         (what was observed)    (what the truth might be)

Case 1            2                     (1, 3)
Case 2           30                     (20, 40)
Case 3           30                     (2, 58)
Case 4            1                     (-1, 3)
Case 5            2                     (-58, 62)
Case 6           30                     (-2, 62)
For each case, let's consider, first, whether a difference between population means has been demonstrated and then what the clinical implications might be.
In cases 1-3, the data are judged inconsistent with a population mean difference of 0. In cases 4-6, the data are consistent with a population mean difference of 0.
Cases 5 and 6 require careful handling. While neither interval formally demonstrates a difference between diets, case 6 is certainly more suggestive of something than Case 5. Both cases are consistent with differences of practical importance and differences of no importance at all. However, Case 6, unlike Case 5, seems to rule out any advantage of practical importance for Diet 1, so it might be argued that Case 6 is like Case 3 in that both are consistent with important and unimportant advantages to Diet 2 while neither suggests any advantage to Diet 1.
It is common to find reports stating that there was no difference between two treatments. As Douglas Altman and Martin Bland emphasize, absence of evidence is not evidence of absence, that is, failure to show a difference is not the same thing as showing two treatments are the same. Only Case 4 allows the investigators to say there is no difference between the diets. The observed difference is not statistically significant and, if it should turn out there really is a difference (no two population means are exactly equal to an infinite number of decimal places), it would not be of any practical importance.
Many writers make the mistake of interpreting cases 5 and 6 to say there is no difference between the treatments or that the treatments are the same. This is an error. It is not supported by the data. All we can say in cases 5 and 6 is that we have been unable to demonstrate a difference between the diets. We cannot say they are the same. The data say they may be the same, but they may be quite different. Studies like this, which cannot distinguish between situations that have very different implications, are said to be underpowered, that is, they lack the power to answer the question definitively one way or the other.
In some situations, it's important to know if there is an effect no matter how small, but in most cases it's hard to rationalize saying whether or not a confidence interval contains 0 without reporting the CI, and saying something about the magnitude of the values it contains and their practical importance. If a CI does not include 0, are all of the values in the interval of practical importance? If the CI includes 0, have effects of practical importance been ruled out? If the CI includes 0 AND values of practical importance, YOU HAVEN'T LEARNED ANYTHING!
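To make the zero-check and the practical-importance check concrete, here is a small Python sketch that classifies the six cases in the table. It assumes, per the cholesterol discussion above, a practical-importance threshold of about 10 mg/dl, and it reads the intervals for cases 4-6 as (-1, 3), (-58, 62), and (-2, 62), since those cases must be consistent with a mean difference of 0:

```python
# Assumed threshold: the text treats changes of about 10 mg/dl or more as
# clinically meaningful, and changes of 1-2 mg/dl as unimportant.
IMPORTANT = 10  # mg/dl

cases = {1: (1, 3), 2: (20, 40), 3: (2, 58),
         4: (-1, 3), 5: (-58, 62), 6: (-2, 62)}

verdicts = {}
for case, (lo, hi) in cases.items():
    significant = lo > 0 or hi < 0                       # CI excludes 0
    could_matter = hi >= IMPORTANT or lo <= -IMPORTANT   # CI reaches important values
    if significant and not could_matter:
        verdicts[case] = "difference shown, but too small to matter"
    elif significant and (lo >= IMPORTANT or hi <= -IMPORTANT):
        verdicts[case] = "important difference shown"
    elif significant:
        verdicts[case] = "difference shown; may or may not be important"
    elif not could_matter:
        verdicts[case] = "no difference of practical importance"
    else:
        verdicts[case] = "inconclusive: consistent with 0 and with important differences"

for case in sorted(verdicts):
    print(f"Case {case}: {verdicts[case]}")
```

Case 4 is the only one that earns "no difference of practical importance", while cases 5 and 6 land in the "you haven't learned anything" bin, just as the discussion above concludes.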
...but it's important!
Statistics is backwards, and confidence intervals are a perfect illustration! Analysts are fond of means...for good reason. It is usually reasonable to assume that the effect of something, if it does anything, is to take the distribution of the population (the histogram) and shift it to the right or to the left. In some cases, the effect may be to shift things by a fixed percentage rather than a fixed amount, but this can be viewed as shifting things by a fixed amount in the log scale. Either way, the effect can be summarized by what it does to the population mean.
We would like to be able to say something like, "There's a 95% probability that mean total cholesterol on this new statin will be 10 to 20 mg/dl lower than on the old formulation." Classical frequentist methods do not permit such statements. Instead, we get to say things like, "We are 95% confident that mean total cholesterol on this new statin will be 10 to 20 mg/dl lower than on the old formulation."
It's easy to fall into the trap of thinking that "95% confident" is the same as "95% probability", but they are two different things. To illustrate the difference and keep life simple, we will deal with 50% confidence intervals. The same thing could be done with 95% CIs, but it would not be as transparent.
If you believe "50% confident" means "50% probability", then you MUST be willing to accept the following terms: each time you construct a 50% confidence interval, we bet one dollar at even odds on whether the interval contains the parameter, and I get to choose which side of the bet to take.
If "50% confident" meant "50% probability", then I'd be just as likely to be right as wrong. Since the same amount of money changes hands, neither of us would have an advantage. However, as we shall now see, confidence intervals are not bet-worthy. If you are willing to accept the terms I outlined, you will quickly be bankrupt, as the following example demonstrates:
Consider a measurement whose values are uniformly distributed between the values -1 and +1. Suppose we wish to estimate the mean, μ, the value in the middle of the distribution (histogram). We draw two observations independently at random. It is straightforward to show that the ordered pair of observations is a 50% confidence interval for μ. (The interval misses μ only when both observations fall on the same side of it; each of those events has probability 1/4, so the interval contains μ with probability 1/2.)
There is no trickery there. If the mathematics seems minimal, straightforward, and easily understood, that's because it is. Two observations from a Uniform(-1, +1) distribution give a 50% CI for μ because 50% of the time the interval will contain μ. If you draw successive pairs of observations and look to see whether μ is between them, half of the time (50%), the answer will be yes.
It is easily seen that the two observations can be anywhere from 0 to 2 units apart. If the two observations are more than 1 unit apart, it is impossible for them NOT to contain μ! That is, it is impossible to draw an interval of length greater than 1 in the interval -1 to +1 without containing μ, even though it is a 50% CI! (It is straightforward to show that this will happen 25% of the time, that is, with probability 0.25 the difference between the two observations will exceed 1 in magnitude.)
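These claims are easy to check by simulation. The following Python sketch draws pairs of Uniform(-1, +1) observations and tallies how often the resulting interval contains μ = 0, how often its length exceeds 1, and whether any long interval fails to contain μ:

```python
import random

random.seed(1)
mu = 0.0          # true mean of the Uniform(-1, +1) distribution
trials = 100_000
covered = long_intervals = long_and_covered = 0

for _ in range(trials):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    lo, hi = min(a, b), max(a, b)
    if lo <= mu <= hi:
        covered += 1
    if hi - lo > 1:               # interval longer than 1 unit
        long_intervals += 1
        if lo <= mu <= hi:
            long_and_covered += 1

print(f"coverage: {covered / trials:.3f}")              # close to 0.50
print(f"P(length > 1): {long_intervals / trials:.3f}")  # close to 0.25
print(f"long intervals containing mu: {long_and_covered} of {long_intervals}")
```

Coverage comes out near 50%, about a quarter of the intervals are longer than 1, and every one of those long intervals contains μ, exactly as the argument predicts.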
Confidence refers not to the particular interval but to the process that generated it! Every pair of observations drawn at random is properly called a 50% confidence interval, even though there are some intervals that must contain the parameter of interest, because they are a realization of a process that covers 50% of the time.
It would be nice to think that this is an anomaly caused by the uniform distribution and that it can't happen with a confidence interval for the difference between two means based on the normal distribution. Unfortunately, that's not true. You can go broke betting on 95% CIs based on the normal distribution. The difference is that one cannot point to one of these intervals with certainty the way one could earlier. Without going into the details, the trick, as Hartigan demonstrated in the 1970s, is to pick some value K>0 and bet on the interval if its length exceeds K and against the interval if its length is shorter than K.
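In the uniform example the trick is transparent with K = 1: every interval longer than 1 is a sure winner, and it can be shown that only one in three of the shorter intervals contains μ. A sketch of that betting strategy in Python (bet "contains" when the length exceeds 1, "doesn't contain" otherwise):

```python
import random

random.seed(2)
mu, trials, wins = 0.0, 100_000, 0

for _ in range(trials):
    a, b = random.uniform(-1, 1), random.uniform(-1, 1)
    lo, hi = min(a, b), max(a, b)
    contains = lo <= mu <= hi
    bet_contains = (hi - lo) > 1   # bet "contains" iff the length exceeds K = 1
    if bet_contains == contains:
        wins += 1

print(f"win rate: {wins / trials:.3f}")
```

The win rate settles near 75%, far better than the 50% a fair even-odds bet would allow, which is why these intervals are not bet-worthy.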