Some Sample Size Theory

What Underlies Sample Size Calculations
Gerard E. Dallal, Ph.D.

Prologue

Just as the analysis of a set of data is determined by the research question and the study design, the way the sample size is estimated is determined by the way the data will be analyzed. This note (at least until the next draft!) is concerned with comparing population means. There are similar methods for comparing proportions and different methods for assessing correlation coefficients. Unfortunately, it is not uncommon to see sample size calculations that are totally divorced from the study for which they are being constructed because the sample sizes are calculated for analyses that will never be used to answer the question prompting the research. The way to begin, then, is by thinking of the analysis that will ultimately be performed to ensure that the corresponding sample size calculations have been used. This applies even to comparing two population means. If experience suggests a logarithmic transformation will be applied to the data prior to formal analysis, then the sample size calculations should be performed in the log scale.

Comparing Population Means

Studies are generally conducted because an investigator expects to see a specific treatment effect.^* Critical regions and tests of significance are determined by the way data should behave if there is no treatment effect. Sample sizes are determined by the way data should behave if the investigator has estimated the treatment effect correctly.^**

Comparing Two Population Means:
Independent Samples

Consider a study using two independent samples to compare their population means. Let the common population standard deviation be 60. The behavior of the difference in sample means under the null hypothesis of equal population means is illustrated by the normal distributions on the left-hand side of displays (a) through (d) for sample sizes of 12, 24, 48, and 96 per group, respectively.

Suppose the investigator expects the difference in population means to be 50 units. Then, the behavior of the difference in sample means is described by the curves on the right-hand side of the displays.

Things to notice about (a)--(d):

The horizontal scales are the same.
The normal curves on the left-hand side of the display are centered at 0.
As the sample size increases, the distribution of the difference in sample means as given by the normal curves on the left-hand side of the display is more tightly concentrated about 0.
The critical values for an 0.05 level test--sample mean differences that will lead to rejecting the hypothesis of equal population means--are given by the vertical dashed lines. The critical region is shaded red. If the mean difference falls outside the vertical lines (in the critical region), the hypothesis of equal population means is rejected.
As the sample size increases, the critical values move closer to 0. This reflects the common sense notion that the larger the sample size, the harder it is (less likely) for the sample mean difference to be at any distance from 0.
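
The shrinking critical values can be checked with a short calculation. The sketch below assumes the large-sample normal approximation, equal group sizes, and the common standard deviation of 60 used in the displays; the function name `critical_value` is only illustrative, and values based on the t distribution would be slightly farther from 0.

```python
from math import sqrt
from statistics import NormalDist

def critical_value(sd, n_per_group, alpha=0.05):
    """Two-sided critical value for the difference in sample means
    (normal approximation, equal group sizes, common sd)."""
    se = sd * sqrt(2.0 / n_per_group)        # standard error of the difference
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when alpha = 0.05
    return z * se

# Critical values for the four sample sizes used in displays (a)--(d)
for n in (12, 24, 48, 96):
    print(n, round(critical_value(60, n), 1))
```

Each doubling of the sample size shrinks the critical value by a factor of the square root of 2.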

Other things to notice about (a)--(d):

The normal curves on the right-hand side of the display are centered at 50.
As the sample size increases, the distribution of the difference in sample means as given by the normal curves on the right-hand side of the display is more tightly concentrated about 50.
As the sample size increases, more of the curve on the right-hand side of the displays falls into the critical region. The portion of the distribution on the right-hand side of the displays that falls into the critical region is shaded blue.
The region shaded blue gives the power of the test. It is 0.497, 0.807, 0.981, and 1.000 for panels (a) through (d), respectively.
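
The blue area can be approximated numerically. The following is a minimal sketch that assumes the normal approximation throughout; the function name `power_two_sample` is hypothetical. Because it ignores the small-sample t correction, it gives somewhat higher values than those quoted for the smaller samples (roughly 0.53 rather than 0.497 at 12 per group), which suggests the quoted values were computed from the t distribution.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample(delta, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test of equal means
    (normal approximation, equal group sizes, common sd)."""
    z = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)   # standard error of the difference
    z_crit = z.inv_cdf(1 - alpha / 2)   # critical value in standard units
    shift = delta / se                  # expected difference in standard units
    # Area of the alternative distribution in the upper and lower critical regions
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (12, 24, 48, 96):
    print(n, round(power_two_sample(delta=50, sd=60, n_per_group=n), 3))
```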

Choosing a sample size is just a matter of getting the picture "just right", that is, seeing to it that there's just the right amount of blue.

It seems clear that a sample size of 12 is too small because there's a large chance that the expected effect will not be detected even if it is real. At the other extreme, a sample size of 96 is unnecessarily large. Standard practice is to choose a sample size such that the power of the test is no less than 80% when the effect is as expected. In this case, the sample size would be 24 per group. Whether a sample size larger than 24 should be used is a matter of balancing cost, convenience, and concern that the effect not be missed.

The pictures show how the sample size is a function of four quantities.

the presumed underlying difference (δ), that is, the expected difference between the two population means should they be unequal. In each of the displays, changing the expected difference moves the two distributions farther apart or closer together. This will affect the amount of area that is shaded blue. Move them farther apart and the area increases. Move them closer together and the area decreases.
the within group standard deviation (σ), which is a measure of the variability of the response. The width of the curves in the displays is determined by the within group standard deviation and the sample size. If the sample size is fixed, then the greater/smaller the standard deviation, the wider/narrower the curves. If the standard deviation is fixed, then the larger/smaller the sample size, the narrower/wider the curves. Changing the width of the curves will move the critical values, too. Displays (a)--(d) were constructed for different sample sizes with the population standard deviation fixed. However, the same pictures could have been obtained by holding the sample size fixed but changing the population standard deviation.
the size or level of the statistical test (α). Decreasing the level of the test--from 0.05 to 0.01, say--moves the critical values farther away from 0, reducing the amount of area that is shaded red. It also reduces the amount of area shaded blue. This represents a trade-off. Reducing the amount of area shaded red reduces the probability of making an error when there is no difference. This is good. Reducing the amount of area shaded blue reduces the probability of making the correct decision when the difference is as expected. This is bad.
the probability of rejecting the hypothesis of equal means if the difference is as specified, that is, the power of the test (π) when the difference in means is as expected. This is the area that is shaded blue.
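
To see how each quantity moves the blue area, one can vary them one at a time. This is a rough sketch under the normal approximation; the baseline (difference 50, standard deviation 60, 24 per group, 0.05 level) comes from the displays, while the alternative values tried here are arbitrary choices made only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power(delta, sd, n_per_group, alpha):
    """Approximate two-sided power (normal approximation)."""
    z = NormalDist()
    se = sd * sqrt(2.0 / n_per_group)
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(delta / se - z_crit) + z.cdf(-delta / se - z_crit)

base = power(delta=50, sd=60, n_per_group=24, alpha=0.05)
print("baseline               ", round(base, 3))
print("larger difference (60) ", round(power(60, 60, 24, 0.05), 3))  # more blue
print("larger sd (80)         ", round(power(50, 80, 24, 0.05), 3))  # less blue
print("smaller test size (.01)", round(power(50, 60, 24, 0.01), 3))  # less blue
print("larger sample (48)     ", round(power(50, 60, 48, 0.05), 3))  # more blue
```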

The sample size is determined by the values of these four quantities. Specifying the expected mean difference locates the centers of the distributions on the number line. Picking the size of the test determines the amount of area that will be shaded red. For a fixed sample size, it also determines the critical values and the amount of area that will be shaded blue. Increasing the sample size makes the distributions narrower, which moves the critical values closer to the mean of the distribution of the test statistic under the null hypothesis. This increases the amount of area shaded blue.

In practice, we don't draw lots of diagrams. Instead, there is a formula that yields the per group sample size when the four quantities are specified. For large samples, the per group sample size is given by

$$ n = \frac{2\sigma^2 \left(z_{1-\alpha/2} + z_{\pi}\right)^2}{\delta^2} $$

where z_(1-α/2) (>0) is the percentile of the normal distribution used as the critical value in a two-tailed test of size α (1.96 for an 0.05 level test) and z_π is the 100π-th percentile of the normal distribution (0.84 for the 80-th percentile).

Technical detail: For small sample sizes, percentiles of the t distribution replace the percentiles of the normal distribution. Since the particular t distribution depends on the sample size, the equation must be solved iteratively (trial-and-error). There are computer programs that do this with little effort.

The sample size increases with the square of the within group standard deviation and decreases with the square of the expected mean difference. If, for example, when testing a new treatment a population can be found where the standard deviation is half that of other populations, the sample size will be cut by a factor of 4.
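
The large-sample formula and the scaling just described can be sketched in a few lines. This assumes the normal-percentile version of the formula (no iterative t correction), and the function name `n_per_group` is hypothetical. For the running example (difference 50, standard deviation 60) it gives a value a little under 24, which is consistent with the small-sample adjustment pushing the answer up slightly.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Large-sample per group sample size for comparing two independent means:
    n = 2 * sd^2 * (z_(1-alpha/2) + z_power)^2 / delta^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_power = z.inv_cdf(power)           # about 0.84 for 80% power
    return 2 * sd**2 * (z_alpha + z_power) ** 2 / delta**2

n = n_per_group(delta=50, sd=60)
print(ceil(n))                            # round up to a whole subject

# Halving the standard deviation cuts the sample size by a factor of 4
print(n_per_group(50, 60) / n_per_group(50, 30))
```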

Points To Keep In Mind

The alternative to equality must be realistic. The larger the expected difference, the smaller the required sample size. It can be QUITE TEMPTING to overstate the expected difference to lower the sample size and convince one's self or a funding agency of the feasibility of the study. All this strategy will do, however, is cause a research team to spend months or years engaged in a hopeless investigation--an underpowered study that cannot meet its goals. A good rule is to ask whether the estimated difference would still seem reasonable if the study were being proposed by someone else.

The power, π--that is, the probability of rejecting H0 when the alternative holds--can, in theory, be made as large or small as desired. Larger values of π require larger sample sizes, so the experiment might prove too costly. Smaller values of π require smaller sample sizes, but only by reducing the chances of observing a significant difference if the alternative holds. Most funding agencies look for studies with at least 80% power. In general, they do not question the study design if the power is 80% or greater. Experiments with less power are considered too chancy to fund.

Estimating the within group standard deviation, σ,
When The Response Is a Single Measurement

The estimate of the within group standard deviation often comes from similar studies, sometimes even 50 years old. If previous human studies are not available to estimate the variability in a proposed human study, animal studies might be used, but animals in captivity usually show much less variability than do humans. Sometimes it is necessary to guess or run a pilot study solely to get some idea of the inherent variability.

Many investigators have difficulty estimating standard deviations simply because it is not something they do on a regular basis. However, standard deviations can often be obtained in terms of other measures that are more familiar to researchers. For example, a researcher might specify a range of values that contains most of the observations. If the data are roughly normally distributed, this range could be treated as an interval that contains 95% of the observations, that is, as an interval of length 4σ. The standard deviation, then, is taken to be one-fourth of this range. If the range were such that it contains virtually all of the population, it might be treated as an interval of length 6σ. The standard deviation, then, is taken to be one-sixth of this range.
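
As a quick illustration of the range rule, the sketch below converts a stated range into a standard deviation and checks how much of a normal distribution the corresponding intervals actually cover. The range of 140 to 260 is a made-up example, and `sd_from_range` is a hypothetical helper name.

```python
from statistics import NormalDist

def sd_from_range(low, high, covers="most"):
    """Rough standard deviation from a range of typical values.
    covers="most": range treated as about 95% of observations (width / 4).
    covers="all":  range treated as virtually everyone (width / 6)."""
    width = high - low
    divisor = 4 if covers == "most" else 6
    return width / divisor

# Hypothetical range supplied by a researcher
print(sd_from_range(140, 260, covers="most"))
print(sd_from_range(140, 260, covers="all"))

# Fraction of a normal population within 2 and 3 standard deviations of the mean
z = NormalDist()
coverage_2sd = z.cdf(2) - z.cdf(-2)
coverage_3sd = z.cdf(3) - z.cdf(-3)
print(round(coverage_2sd, 4), round(coverage_3sd, 4))
```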

Underestimating the standard deviation to make a study seem more feasible is as foolhardy as overestimating an expected difference. Such estimates result in the investment of resources in studies that should never have been performed. Conservative estimates (estimates that lead to a slightly larger sample size) are preferable. If a study is feasible when conservative estimates are used, then it is well worth doing.

Estimating the within group standard deviation, σ,
When the Response Is a Difference

When the response being studied is a change or a difference, the sample size formulas require the standard deviation of the difference between measurements, not the standard deviation of the individual measurements. It is one thing to estimate the standard deviation of total cholesterol when many individuals are measured once; it is quite another to estimate the standard deviation of the change in cholesterol levels when changes are measured.

One trick that might help: Often a good estimate of the standard deviation of the differences is unavailable, but we have reasonable estimates of the standard deviation of a single measurement. The standard deviations of the individual measurements will often be roughly equal. Call that standard deviation σ. Then, the standard deviation of the paired differences is equal to

$$ \sigma\sqrt{2(1-\rho)} $$

where ρ is the correlation coefficient when the two measurements are plotted against each other. If the correlation coefficient is a not terribly strong 0.50, the standard deviation of the differences will be equal to σ and gets smaller as the correlation increases.
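
The relationship between the correlation and the standard deviation of the differences can be tabulated directly. This is a small sketch assuming both measurements share the same standard deviation; the value 10 and the function name `sd_of_differences` are chosen only for illustration.

```python
from math import sqrt

def sd_of_differences(sd, rho):
    """Standard deviation of paired differences when both measurements
    have standard deviation sd and correlation rho."""
    return sd * sqrt(2 * (1 - rho))

# Illustrative single-measurement standard deviation of 10
for rho in (0.0, 0.25, 0.5, 0.75, 0.9):
    print(rho, round(sd_of_differences(10, rho), 2))
```

At a correlation of 0 the differences are more variable than a single measurement (by a factor of the square root of 2); the break-even point is a correlation of 0.50.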

Many Means

Sometimes a study involves the comparison of many treatments. The statistical methods are discussed in detail under Analysis of Variance (ANOVA). Historically, the analysis of many groups begins by asking whether all means are the same. There are formulas for calculating the sample size necessary to reject this hypothesis according to the particular configuration of population means the researchers expect to encounter. These formulas are usually a bad way to choose a sample size because the purpose of the experiment is rarely (never?) to see whether all means are the same. Rather, it is to catalogue the differences. The sample size that may be adequate to demonstrate that the population means are not all the same may be inadequate to demonstrate exactly where the differences occur.

When many means are compared, statisticians worry about the problem of multiple comparisons, that is, the possibility that some comparison may be called statistically significant simply because so many comparisons were performed. Common sense says that if there are no differences among the treatments but six comparisons are performed, then the chance that something reaches the level of statistical significance is a lot greater than 0.05. There are special statistical techniques such as Tukey's Honestly Significant Differences (HSD) that adjust for multiple comparisons, but there are no easily accessible formulas or computer programs for basing sample size calculations on them. Instead, sample sizes are calculated by using a Bonferroni adjustment to the size of the test, that is, the nominal size of the test is divided by the number of comparisons that will be performed. When there are three means, there are three possible comparisons (AB,AC,BC). When there are four means, there are six possible comparisons (AB,AC,AD,BC,BD,CD), and so on. Thus, when three means are to be compared at the 0.05 level, the two-group sample size formula is used, but the size of each individual comparison is taken to be 0.05/3 (=0.0167). When four means are compared, the size of the test is 0.05/6 (=0.0083).
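
The Bonferroni bookkeeping can be sketched as follows. The code assumes all pairwise comparisons are of interest and plugs the adjusted test size into the large-sample two-group formula; the difference of 50 and standard deviation of 60 are borrowed from the earlier two-group example purely for illustration, and the function names are hypothetical.

```python
from math import comb
from statistics import NormalDist

def bonferroni_alpha(n_means, alpha=0.05):
    """Per-comparison test size when all pairs of means are compared."""
    n_comparisons = comb(n_means, 2)   # 3 means -> 3 pairs, 4 means -> 6 pairs
    return alpha / n_comparisons

def n_per_group(delta, sd, alpha, power=0.80):
    """Large-sample two-group formula (normal percentiles)."""
    z = NormalDist()
    return 2 * sd**2 * (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2 / delta**2

for k in (2, 3, 4):
    a = bonferroni_alpha(k)
    print(k, round(a, 4), round(n_per_group(50, 60, a), 1))
```

The per group sample size grows as the number of means grows, because each comparison is held to a stricter test size.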

The Log Scale

Sometimes experience suggests a logarithmic transformation will be applied to the data prior to formal analysis. This corresponds to looking at ratios of population parameters rather than differences. When the analysis will be performed in the log scale, the sample size calculations should be performed in the log scale, too. If only summary data are available for sample size calculations and they are in the original scale, the behavior in the log scale can be readily approximated. The expected difference in means in the log scale is approximately equal to the log of the ratio of means in the original scale. The common within group standard deviation in the natural log scale (base e) is approximately equal to the coefficient of variation in the original scale (the roughly constant ratio of the within group standard deviation to the mean). If the calculations are being performed in the common log scale (base 10), divide the cv by 2.3026 to estimate the common within group standard deviation.

Example: (α=0.05, π=0.80) Suppose a response will be analyzed in the log scale and that in the original scale, the population means are expected to be 40 and 50 mg/dl and the common coefficient of variation (σ/μ) is estimated to be 0.30. Then, in the (natural) log scale the estimated effect is ln(50/40) = ln(1.25) = 0.2231 and the common within group standard deviation is estimated to be 0.30 (the cv). The per group sample size is approximately 1+16(0.30/0.2231)^2 or 30. In the common log scale, the estimated effect is log(50/40) = 0.0969 and the common within group standard deviation is estimated to be 0.30/2.3026 = 0.1303. The per group sample size is approximately 1+16(0.1303/0.0969)^2 or 30. It is not an accident that the sample sizes are the same. The choice of a particular base for logarithms is like choosing to measure height in cm or in. It doesn't matter which you use as long as you are consistent! No mixing allowed! A few things worth noting:

log(40/50) = -0.0969, that is, -log(50/40). Since this quantity is squared when sample sizes are being estimated, it doesn't matter which way the ratio is calculated.
The rule that the cv estimates the common within group SD for log transformed data works only for natural logs. When you take the log of the ratio to estimate the treatment effect in the log scale, you pick the particular type of log you prefer. Since the cv estimates the common within group SD for natural-log transformed data, you have to adjust it accordingly if you calculate the treatment effect in logs of a different base.
2.3026, the factor which, when divided into natural logs, converts lns to common logs, is ln(10).
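
The example can be reproduced in a few lines. The sketch uses the same 1+16(sd/effect)^2 approximation as the text (α=0.05, 80% power); the function name `n_rule_of_16` is hypothetical.

```python
from math import log, log10

def n_rule_of_16(effect, sd):
    """Approximate per group sample size, 1 + 16 * (sd / effect)^2."""
    return 1 + 16 * (sd / effect) ** 2

cv = 0.30                      # coefficient of variation in the original scale
mean_a, mean_b = 40, 50        # expected means in mg/dl

# Natural log scale: the cv serves directly as the within group SD
n_ln = n_rule_of_16(log(mean_b / mean_a), cv)

# Common log scale: divide the cv by ln(10) = 2.3026
n_log10 = n_rule_of_16(log10(mean_b / mean_a), cv / log(10))

print(round(n_ln), round(n_log10))   # the base of the logarithm does not matter
```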

A potential gotcha!: When calculating the treatment effect in the log scale, you can never go wrong calculating the log of the ratio of the means in the original scale. However, you have to be careful if the effect is stated in terms of a percent increase or decrease. Increases and decreases are not equivalent. Suppose the standard treatment yields a mean of 100. A 50% increase gives a mean of 150. The ratio of the means is 150/100 (=3/2) or 100/150 (=2/3). Now consider a 50% decrease from standard. This leads to a mean of 50. The ratio is now 100/50 (=2) or 50/100 (=1/2). There's no trick here. The mathematics is correct. The message is that you have to be careful when you translate statements about expected effects into numbers needed for the formal calculations.
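
The asymmetry is easy to see numerically. In the sketch below, `log_effect` is a hypothetical helper that turns a percent change into the size of the effect in the natural log scale.

```python
from math import log

def log_effect(percent_change):
    """Absolute treatment effect in the natural log scale for a given
    percent change from the standard mean (negative for a decrease)."""
    ratio = 1 + percent_change / 100
    return abs(log(ratio))

print(round(log_effect(50), 4))    # 50% increase: |ln(3/2)|
print(round(log_effect(-50), 4))   # 50% decrease: |ln(1/2)|
print(round(log_effect(100), 4))   # 100% increase: |ln(2)|
```

In the log scale a 50% decrease is the same size of effect as a 100% increase, not a 50% increase, so it calls for a smaller sample size than a 50% increase would.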

Comparing Two Population Means:
Dealing With Paired Responses

Sometimes responses are truly paired. Two treatments are applied to the same individual or the study involves matched or paired subjects. In the case of paired samples, the formula for the total number of pairs is the same as for the number of independent samples except that the factor of 2 is dropped, that is,

$$ n = \frac{\sigma^2 \left(z_{1-\alpha/2} + z_{\pi}\right)^2}{\delta^2} $$

where σ is now the standard deviation of the differences between the paired measurements. In many (most?) cases, especially where a study involves paired changes, σ is not easy to estimate. You're on your own!

It is clear from the formulas why paired studies are so attractive. First is the factor of 2. All other things being equal, a study of independent samples that requires, say, 100 subjects per group or a total of 200 subjects, requires only 50 pairs for a total of 100 subjects. Also, if the pairing is highly effective, the standard deviation of the differences within pairs can be quite small, thereby reducing the sample size even further. However, these savings occur because elements within the same pair are expected to behave somewhat the same. If the pairing is ineffective, that is, if the elements within each pair are independent of each other, the standard deviation of the difference will be such that the number of pairs for the paired study turns out to be equal to the number of subjects per group for the independent samples study so that the total sample size is the same.
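
The comparison between the two designs can be made concrete. The sketch below assumes the large-sample formulas with normal percentiles and reuses the illustrative values from earlier (difference 50, single-measurement standard deviation 60); the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def _z_term(alpha, power):
    """(z_(1-alpha/2) + z_power)^2, shared by both formulas."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) ** 2

def n_per_group_independent(delta, sd, alpha=0.05, power=0.80):
    """Per group sample size for two independent samples."""
    return 2 * sd**2 * _z_term(alpha, power) / delta**2

def n_pairs(delta, sd_diff, alpha=0.05, power=0.80):
    """Number of pairs; sd_diff is the SD of the within-pair differences."""
    return sd_diff**2 * _z_term(alpha, power) / delta**2

sd, delta = 60, 50
print(round(n_per_group_independent(delta, sd), 1))
for rho in (0.0, 0.5, 0.9):
    sd_diff = sd * sqrt(2 * (1 - rho))   # SD of differences at correlation rho
    print(rho, round(n_pairs(delta, sd_diff), 1))
```

With a within-pair correlation of 0 the number of pairs equals the per group sample size of the independent samples design; the savings appear only as the correlation rises.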

There is a more important concern than ineffective pairing. When some investigators see how the sample sizes required for paired studies compare to those involving independent samples, their first thought is to drop any control group in favor of "using subjects as their own control". Who wouldn't prefer to recruit 50 subjects and look at whether their cholesterol levels change over time rather than 200 subjects (100 on treatment; 100 on placebo) to see if the mean change in the treatment group is different from that in the control group? However, this is not an issue of sample size. It is an issue of study design. An investigator who measured only the 50 subjects at two time points would be able to determine whether there was a change over time, but s/he would not be able to say how it compared to what would have happened over the same time period in the absence of any intervention.

----------------

^*There are exceptions such as equivalence trials where the goal is to show that two population means are the same, but they will not concern us here.

^**It may sound counter-intuitive for the investigator to have to estimate the difference when the purpose of the study is to determine the difference. However, it can't be any other way. Common sense suggests it takes only a small number of observations to detect a large difference while it takes a much larger sample size to detect a small difference. Without some estimate of the likely effect, the sample size cannot be determined. Sometimes there will be no basis for estimating the likely effect. The best that can be done in such circumstances is a pilot study to generate some preliminary data and estimates.
