What Underlies Sample Size Calculations
Gerard E. Dallal, Ph.D.
Just as the analysis of a set of data is determined by the research question and the study design, the way the sample size is estimated is determined by the way the data will be analyzed. This note (at least until the next draft!) is concerned with comparing population means. There are similar methods for comparing proportions and different methods for assessing correlation coefficients. Unfortunately, it is not uncommon to see sample size calculations that are totally divorced from the study for which they are being constructed because the sample sizes are calculated for analyses that will never be used to answer the question prompting the research. The way to begin, then, is by thinking of the analysis that will ultimately be performed to ensure that the corresponding sample size calculations have been used. This applies even to comparing two population means. If experience suggests a logarithmic transformation will be applied to the data prior to formal analysis, then the sample size calculations should be performed in the log scale.
Studies are generally conducted because an investigator expects to see a specific treatment effect.* Critical regions and tests of significance are determined by the way data should behave if there is no treatment effect. Sample sizes are determined by the way data should behave if the investigator has estimated the treatment effect correctly.**
Consider a study using two independent samples to compare their population means. Let the common population standard deviation be 60. The behavior of the difference in sample means under the null hypothesis of equal population means is illustrated by the normal distributions on the left-hand side of displays (a) through (d) for sample sizes of 12, 24, 48, and 96 per group, respectively.
Suppose the investigator expects the difference in population means to be 50 units. Then, the behavior of the difference in sample means is described by the curves on the right-hand side of the displays.
Things to notice about (a)--(d):
Other things to notice about (a)--(d):
The region shaded blue gives the power of the test. It is 0.497, 0.807, 0.981, and 1.000 for panels (a) through (d), respectively.
Choosing a sample size is just a matter of getting the picture "just right", that is, seeing to it that there's just the right amount of blue.
It seems clear that a sample size of 12 is too small because there's a large chance that the expected effect will not be detected even if it is true. At the other extreme, a sample size of 96 is unnecessarily large. Standard practice is to choose a sample size such that the power of the test is no less than 80% when the effect is as expected. In this case, the sample size would be 24 per group. Whether a sample size larger than 24 should be used is a matter of balancing cost, convenience, and concern that the effect not be missed.
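The power figures for the four panels can be approximated with a few lines of code. The sketch below is not the calculation behind the displays: it assumes a two-sided test at the 0.05 level and uses the normal distribution throughout, so its values come out a little higher than those quoted above for the smaller groups, where a t-based calculation would be more appropriate. The function name `approx_power` is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_group, sd, diff, alpha=0.05):
    """Large-sample (normal) power of a two-sided, two-sample test of means."""
    z = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)   # SE of the difference in sample means
    z_crit = z.inv_cdf(1 - alpha / 2)   # two-sided critical value (assumed)
    shift = diff / se                   # expected difference in SE units
    # Chance of landing in either rejection region when the alternative holds
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


# Values from the text: common SD of 60, expected difference of 50
powers = {n: approx_power(n, sd=60, diff=50) for n in (12, 24, 48, 96)}
for n, p in powers.items():
    print(f"n = {n:2d} per group: approximate power = {p:.3f}")
```

The pattern is the same as in the displays: roughly even odds at 12 per group, a little over 80% at 24, and essentially certain detection by 96.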
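The pictures show how the sample size is a function of four quantities: the within group standard deviation, the expected difference between the population means, the size of the test, and the power of the test.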
The sample size is determined by the values of these four quantities. Denoting the expected mean difference locates the centers of the distributions on the number line. Picking the size of the test determines the amount of area that will be shaded red. For a fixed sample size, it also determines the critical values and the amount of area that will be shaded blue. Increasing the sample size makes the distributions narrower which moves the critical values closer to the mean of the distribution of the test statistic under the null hypothesis. This increases the amount of area shaded blue.
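In practice, we don't draw lots of diagrams. Instead, there is a formula that yields the per group sample size when the four quantities are specified. For large samples, the per group sample size is given by

$$ n = \frac{2\,(z_{1-\alpha/2} + z_{\pi})^2\,\sigma^2}{\delta^2} $$

where σ is the within group standard deviation, δ is the expected mean difference, α is the size of the (two-sided) test, π is the power, and z_p denotes the 100p-th percentile of the standard normal distribution.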
Technical detail: For small sample sizes, percentiles of the t distribution replace the percentiles of the normal distribution. Since the particular t distribution depends on the sample size, the equation must be solved iteratively (trial-and-error). There are computer programs that do this with little effort.
The sample size increases with the square of the within group standard deviation and decreases with the square of the expected mean difference. If, for example, when testing a new treatment a population can be found where the standard deviation is half that of other populations, the sample size will be cut by a factor of 4.
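As a rough illustration of this scaling, the sketch below evaluates the usual large-sample expression n = 2(z_{1-α/2} + z_π)²(σ/δ)² for the running example (standard deviation 60, expected difference 50) and again with the standard deviation halved. The two-sided test, the defaults of α = 0.05 and 80% power, and the function name `n_per_group` are assumptions made for the example.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sd, diff, alpha=0.05, power=0.80):
    """Unrounded large-sample per group n for a two-sided, two-sample test."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return 2 * (z_sum * sd / diff) ** 2


n_full = n_per_group(sd=60, diff=50)     # running example from the text
n_half_sd = n_per_group(sd=30, diff=50)  # same study, half the SD

print("per group n, SD = 60:", ceil(n_full))
print("per group n, SD = 30:", ceil(n_half_sd))
print("ratio before rounding:", n_full / n_half_sd)
```

Before rounding, the ratio is exactly 4. The large-sample value for the running example lands slightly below the 24 per group read from the displays, which is the kind of gap the t-based refinement is meant to close.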
The alternative to equality must be realistic. The larger the expected difference, the smaller the required sample size. It can be QUITE TEMPTING to overstate the expected difference to lower the sample size and convince one's self or a funding agency of the feasibility of the study. All this strategy will do, however, is cause a research team to spend months or years engaged in a hopeless investigation--an underpowered study that cannot meet its goals. A good rule is to ask whether the estimated difference would still seem reasonable if the study were being proposed by someone else.
The power, π (that is, the probability of rejecting H0 when the alternative holds), can, in theory, be made as large or small as desired. Larger values of π require larger sample sizes, so the experiment might prove too costly. Smaller values of π require smaller sample sizes, but only by reducing the chances of observing a significant difference if the alternative holds. Most funding agencies look for studies with at least 80% power. In general, they do not question the study design if the power is 80% or greater. Experiments with less power are considered too chancy to fund.
The estimate of the within group standard deviation often comes from similar studies, sometimes even 50 years old. If previous human studies are not available to estimate the variability in a proposed human study, animal studies might be used, but animals in captivity usually show much less variability than do humans. Sometimes it is necessary to guess or run a pilot study solely to get some idea of the inherent variability.
Many investigators have difficulty estimating standard deviations simply because it is not something they do on a regular basis. However, standard deviations can often be obtained in terms of other measures that are more familiar to researchers. For example, a researcher might specify a range of values that contains most of the observations. If the data are roughly normally distributed, this range could be treated as an interval that contains 95% of the observations, that is, as an interval of length 4σ. The standard deviation, then, is taken to be one-fourth of this range. If the range were such that it contains virtually all of the population, it might be treated as an interval of length 6σ. The standard deviation, then, is taken to be one-sixth of this range.
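The range rule amounts to a single division, as the short sketch below shows. The limits of 120 and 280 are invented purely for illustration, and the helper name `sd_from_range` is hypothetical.

```python
def sd_from_range(low, high, covers="most"):
    """Rough SD guess from a stated range, assuming roughly normal data.

    covers="most": range holds about 95% of observations (length 4 SDs).
    covers="all":  range holds virtually everyone (length 6 SDs).
    """
    if covers not in ("most", "all"):
        raise ValueError("covers must be 'most' or 'all'")
    divisor = 4 if covers == "most" else 6
    return (high - low) / divisor


# Hypothetical limits, not taken from any real study
sd_most = sd_from_range(120, 280, covers="most")
sd_all = sd_from_range(120, 280, covers="all")
print(sd_most, round(sd_all, 1))
```

Treating the same range as covering "most" rather than "virtually all" of the population gives the larger, more conservative standard deviation.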
Underestimating the standard deviation to make a study seem more feasible is as foolhardy as overestimating an expected difference. Such estimates result in the investment of resources in studies that should never have been performed. Conservative estimates (estimates that lead to a slightly larger sample size) are preferable. If a study is feasible when conservative estimates are used, then it is well worth doing.
When the response being studied is change or a difference, the sample size formulas require the standard deviation of the difference between measurements, not the standard deviation of the individual measurements. It is one thing to estimate the standard deviation of total cholesterol when many individuals are measured once; it is quite another to estimate the standard deviation of the change in cholesterol levels when changes are measured.
One trick that might help: Often a good estimate of the standard deviation of the differences is unavailable, but we have reasonable estimates of the standard deviation of a single measurement. The standard deviations of the individual measurements will often be roughly equal. Call that standard deviation σ. Then, the standard deviation of the paired differences is equal to

$$ \sigma_d = \sigma\sqrt{2\,(1-\rho)} $$

where ρ is the correlation between the two measurements.
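When two measurements share a common standard deviation σ and have correlation ρ, the variance of their difference is 2σ²(1 − ρ), so the standard deviation of the differences is σ√(2(1 − ρ)). The sketch below tabulates that relationship for a few correlations. The value σ = 60 and the function name `sd_of_differences` are chosen only for illustration.

```python
from math import sqrt


def sd_of_differences(sd_single, rho):
    """SD of (second - first) when both measurements share SD sd_single."""
    if not -1.0 <= rho <= 1.0:
        raise ValueError("rho must lie between -1 and 1")
    return sd_single * sqrt(2.0 * (1.0 - rho))


# Illustrative single-measurement SD of 60
for rho in (0.0, 0.5, 0.8, 0.95):
    print(f"rho = {rho:.2f}: SD of differences = {sd_of_differences(60, rho):.1f}")
```

A correlation of 0.5 is the break-even point, where the differences are exactly as variable as a single measurement. Weaker correlations make the differences more variable, and stronger ones make them less so.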
Sometimes a study involves the comparison of many treatments. The statistical methods are discussed in detail under Analysis of Variance (ANOVA). Historically, the analysis of many groups begins by asking whether all means are the same. There are formulas for calculating the sample size necessary to reject this hypothesis according to the particular configuration of population means the researchers expect to encounter. These formulas are usually a bad way to choose a sample size because the purpose of the experiment is rarely (never?) to see whether all means are the same. Rather, it is to catalogue the differences. The sample size that may be adequate to demonstrate that the population means are not all the same may be inadequate to demonstrate exactly where the differences occur.
When many means are compared, statisticians worry about the problem of multiple comparisons, that is, the possibility that some comparison may be called statistically significant simply because so many comparisons were performed. Common sense says that if there are no differences among the treatments but six comparisons are performed, then the chance that something reaches the level of statistical significance is a lot greater than 0.05. There are special statistical techniques such as Tukey's Honestly Significant Differences (HSD) that adjust for multiple comparisons, but there are no easily accessible formulas or computer programs for basing sample size calculations on them. Instead, sample sizes are calculated by using a Bonferroni adjustment to the size of the test, that is, the nominal size of the test is divided by the number of comparisons that will be performed. When there are three means, there are three possible comparisons (AB,AC,BC). When there are four means, there are six possible comparisons (AB,AC,AD,BC,BD,CD), and so on. Thus, when three means are to be compared at the 0.05 level, the two-group sample size formula is used, but the size of each individual comparison is taken to be 0.05/3 (=0.0167). When four means are compared, the size of the test is 0.05/6 (=0.0083).
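The bookkeeping for the Bonferroni adjustment can be sketched as follows. The function name `bonferroni_alpha` is hypothetical. The last line gives only a rough sense of how fast the chance of a false positive grows, under the simplifying assumption that the six comparisons are independent, which pairwise comparisons among the same groups are not.

```python
from math import comb


def bonferroni_alpha(n_groups, overall_alpha=0.05):
    """Return (number of pairwise comparisons, per comparison test size)."""
    n_comparisons = comb(n_groups, 2)   # k groups give k(k-1)/2 pairs
    return n_comparisons, overall_alpha / n_comparisons


for k in (3, 4, 5):
    m, alpha_each = bonferroni_alpha(k)
    print(f"{k} means: {m} comparisons, each tested at {alpha_each:.4f}")

# Rough illustration only: treats six unadjusted 0.05-level tests as independent
chance_any_false_positive = 1 - (1 - 0.05) ** 6
print(f"approximate chance of at least one false positive: "
      f"{chance_any_false_positive:.3f}")
```

The adjusted test size is then used in place of 0.05 in the two-group sample size calculation.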
The Log Scale
Sometimes experience suggests a logarithmic transformation will be applied to the data prior to formal analysis. This corresponds to looking at ratios of population parameters rather than differences. When the analysis will be performed in the log scale, the sample size calculations should be performed in the log scale, too. If only summary data are available for sample size calculations and they are in the original scale, the behavior in the log scale can be readily approximated. The expected difference in means in the log scale is approximately equal to the log of the ratio of means in the original scale. The common within group standard deviation in the natural log scale (base e) is approximately equal to the coefficient of variation in the original scale (the roughly constant ratio of the within group standard deviation to the mean). If the calculations are being performed in the common log scale (base 10), divide the cv by 2.3026 to estimate the common within group standard deviation.
Example: (α=0.05, π=0.80) Suppose a response will be analyzed in the log scale and that in the original scale, the population means are expected to be 40 and 50 mg/dl and the common coefficient of variation (σ/μ) is estimated to be 0.30. Then, in the (natural) log scale the estimated effect is ln(50/40) = ln(1.25) = 0.2231 and the common within group standard deviation is estimated to be 0.30 (the cv). The per group sample size is approximately 1+16(0.30/0.2231)^2 or 30. In the common log scale, the estimated effect is log(50/40) = 0.0969 and the common within group standard deviation is estimated to be 0.30/2.3026 = 0.1303. The per group sample size is approximately 1+16(0.1303/0.0969)^2 or 30. It is not an accident that the sample sizes are the same. The choice of a particular base for logarithms is like choosing to measure height in cm or in. It doesn't matter which you use as long as you are consistent! No mixing allowed! A few things worth noting:
A potential gotcha!: When calculating the treatment effect in the log scale, you can never go wrong calculating the log of the ratio of the means in the original scale. However, you have to be careful if the effect is stated in terms of a percent increase or decrease. Increases and decreases are not equivalent. Suppose the standard treatment yields a mean of 100. A 50% increase gives a mean of 150. The ratio of the means is 150/100(=3/2) or 100/150(=2/3). Now consider a 50% decrease from standard. This leads to a mean of 50. The ratio is now 100/50(=2) or 50/100(=1/2). There's no trick here. The mathematics is correct. The message is that you have to be careful when you translate statements about expected effects into numbers needed for the formal calculations.
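Both points about the log scale can be checked numerically. The sketch below reruns the worked example (means of 40 and 50, cv of 0.30) in base e and base 10 using the approximation 1 + 16(sd/effect)² quoted in the example, then compares the log-scale effects of a 50% increase and a 50% decrease from a mean of 100. The helper name `n_rule_of_thumb` is made up, and the approximation applies only to α = 0.05 with 80% power.

```python
from math import ceil, log, log10


def n_rule_of_thumb(sd, effect):
    """Per group n from the 1 + 16*(sd/effect)^2 approximation."""
    return 1 + 16 * (sd / effect) ** 2


cv = 0.30
ratio = 50 / 40

# Natural log scale: SD is roughly the cv, effect is ln(ratio of means)
n_ln = n_rule_of_thumb(cv, log(ratio))
# Common log scale: divide the cv by ln(10) = 2.3026, effect is log10(ratio)
n_log10 = n_rule_of_thumb(cv / log(10), log10(ratio))
print(ceil(n_ln), ceil(n_log10))

# Percent changes from a standard mean of 100 are not symmetric in the log scale
increase = log(150 / 100)   # 50% increase
decrease = log(50 / 100)    # 50% decrease
print(f"{increase:.3f} {decrease:.3f}")
```

The two bases agree because changing the base rescales the effect and the standard deviation by the same factor. The 50% decrease is the larger effect in the log scale, so it would call for a smaller sample than the 50% increase.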
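Sometimes responses are truly paired. Two treatments are applied to the same individual or the study involves matched or paired subjects. In the case of paired samples, the formula for the total number of pairs is the same as for the number of independent samples except that the factor of 2 is dropped, that is,

$$ n = \frac{(z_{1-\alpha/2} + z_{\pi})^2\,\sigma_d^2}{\delta^2} $$

where σ_d is the standard deviation of the within pair differences, δ is the expected mean difference, α is the size of the (two-sided) test, π is the power, and z_p denotes the 100p-th percentile of the standard normal distribution.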
It is clear from the formulas why paired studies are so attractive. First is the factor of 2. All other things being equal, a study of independent samples that requires, say, 100 subjects per group or a total of 200 subjects, requires only 50 pairs for a total of 100 subjects. Also, if the pairing is highly effective, the standard deviation of the differences within pair can be quite small, thereby reducing the sample size even further. However, these savings occur because elements within the same pair are expected to behave somewhat the same. If the pairing is ineffective, that is, if the elements within each pair are independent of each other, the standard deviation of the difference will be such that the number of pairs for the paired study turns out to be equal to the number of subjects per group for the independent samples study so that the total sample size is the same.
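The claim about ineffective pairing can be verified with the large-sample expressions. The sketch below assumes the per group formula 2(z_{1-α/2} + z_π)²(σ/δ)² for independent samples and the same expression without the factor of 2, using the standard deviation of the differences, for pairs. The function names, the two-sided test, and the values σ = 60, δ = 50, and ρ = 0.8 are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def z_sum_squared(alpha=0.05, power=0.80):
    """(z_{1-alpha/2} + z_power)^2 for a two-sided test."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2


def n_per_group_independent(sd, diff):
    return 2 * z_sum_squared() * (sd / diff) ** 2


def n_pairs(sd_diff, diff):
    return z_sum_squared() * (sd_diff / diff) ** 2


sd, diff = 60, 50
independent = n_per_group_independent(sd, diff)

# Ineffective pairing: pair members independent, so SD of differences = sd*sqrt(2)
pairs_ineffective = n_pairs(sd * sqrt(2), diff)

# Effective pairing: assumed within-pair correlation of 0.8
pairs_effective = n_pairs(sd * sqrt(2 * (1 - 0.8)), diff)

print(f"{independent:.1f} {pairs_ineffective:.1f} {pairs_effective:.1f}")
```

With independent pair members, the number of pairs matches the per group size for independent samples, so nothing is gained. With strongly correlated pair members, the number of pairs drops sharply.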
There is a more important concern than ineffective pairing. When some investigators see how the sample sizes required for paired studies compare to those involving independent samples, their first thought is to drop any control group in favor of "using subjects as their own control". Who wouldn't prefer to recruit 50 subjects and look at whether their cholesterol levels change over time rather than 200 subjects (100 on treatment; 100 on placebo) to see if the mean change in the treatment group is different from that in the control group? However, this is not an issue of sample size. It is an issue of study design. An investigator who measured only the 50 subjects at two time points would be able to determine whether there was a change over time, but s/he would not be able to say how it compared to what would have happened over the same time period in the absence of any intervention.
*There are exceptions such as equivalence trials where the goal is to show that two population means are the same, but they will not concern us here.
**It may sound counter-intuitive for the investigator to have to estimate the difference when the purpose of the study is to determine the difference. However, it can't be any other way. Common sense suggests it takes only a small number of observations to detect a large difference while it takes a much larger sample size to detect a small difference. Without some estimate of the likely effect, the sample size cannot be determined. Sometimes there will be no basis for estimating the likely effect. The best that can be done in such circumstances is a pilot study to generate some preliminary data and estimates.