The Behavior of the Sample Mean
(or Why Confidence Intervals Always Seem to be Based On the Normal Distribution)

[Many of the figures in this note are screen shots from a simulation at the Rice Virtual Lab in Statistics. You might enjoy trying the simulation yourself after (or even while) reading this note. Java must be enabled in your browser for this simulation to run.]

There is arguably no more important lesson to be learned in statistics than how sample means behave. It explains why statistical methods work. The vast majority of the things people do with statistics is compare populations, and most of the time populations are compared by comparing their means.

The way individual observations behave depends on the population from which they are drawn. If we draw a sample of individuals from a normally distributed population, the sample will follow a normal distribution. If we draw a sample of individuals from a population with a skewed distribution, the sample values will display the same skewness. Whatever the population looks like--normal, skewed, bimodal, whatever--a sample of individual values will display the same characteristics. This should be no surprise. Something would be very wrong if the sample of individual observations didn't share the characteristics of the parent population.

We are now going to see a truly wondrous result. Statisticians refer to it as The Central Limit Theorem. It says that if you draw a large enough sample, the way the sample mean varies around the population mean can be described by a normal distribution, NO MATTER WHAT THE POPULATION HISTOGRAM LOOKS LIKE!

I'll repeat and summarize because this result is so important. If you draw a large sample, the histogram of the individual observations will look like the population histogram from which the observations were drawn. However, the way the sample mean varies around the population mean can be described by the normal distribution. This makes it very easy to describe the way sample means behave. The way they vary about the population mean, for large samples, is unrelated to the shape of the population histogram.

Let's look at an example. In the picture to the left,

• the top panel shows a population skewed to the right
• the middle panel shows a sample of 25 observations drawn from that population
• the bottom panel shows the sample mean.

The 25 observations show the kind of skewness to be expected from a sample of 25 from this population.

Let's do it again and keep collecting sample means.

And one more time. In each case, the individual observations are spread out in a manner reminiscent of the population histogram. The sample means, however, are tightly grouped. This is not unexpected. In each sample, we get observations from throughout the distribution. The larger values keep the mean from being very small while the smaller values keep the mean from being very large. There are so many observations, some large, some small, that the mean ends up being "average". If the sample contained only a few observations, the sample mean might jump around considerably from sample to sample, but with lots of observations the sample mean doesn't get a chance to change very much.

Since the computer is doing all the work, let's go hog wild and do it 10,000 times!

Here's how the means from those 10,000 samples of 25 observations each behave. They behave like things drawn from a normal distribution centered about the mean of the original population!
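If you can't run the applet, the experiment is easy to imitate. Here is a minimal sketch, assuming the right-skewed population is an exponential distribution with mean 1 (my stand-in, not the population used in the simulation pictured). It draws 10,000 samples of 25 observations and summarizes the 10,000 sample means.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable


def draw_sample_means(n_samples=10_000, sample_size=25):
    """Return the means of n_samples samples of sample_size observations each.

    Assumption: the skewed population is exponential with mean 1 and SD 1.
    """
    means = []
    for _ in range(n_samples):
        sample = [random.expovariate(1.0) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return means


means = draw_sample_means()

# The sample means should center on the population mean (1.0 here) and be
# far less spread out than the individual observations (SD 1.0 here).
print("mean of the sample means:", round(statistics.fmean(means), 3))
print("SD of the sample means:  ", round(statistics.stdev(means), 3))
```

A histogram of `means` (with any plotting tool you like) should look roughly bell shaped even though the individual observations are strongly skewed.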

At this point, the most common question is, "What's with the 10,000 means?" and it's a good question. Once this is sorted out, everything will fall into place.

• We do the experiment only once, that is, we get to see only one sample of 25 observations and one sample mean.
• The reason we draw the sample is to say something about the population mean.
• In order to use the sample mean to say something about the population mean, we have to know something about how different the two means can be.
• This simulation tells us. The sample mean varies around the population mean as though
  • it came from a normal distribution
  • whose standard deviation is estimated by the Standard Error of the Mean, SEM = s/√n. (More about the SEM below.)
• All of the properties of the Normal Distribution apply:
  • 68% of the time, the sample mean and population mean will be within 1 SEM of each other.
  • 95% of the time, the sample mean and population mean will be within 2 SEMs of each other.
  • 99% of the time, the sample mean and population mean will be within 2.58 SEMs of each other, and so on.
We will make formal use of this result in the note on Confidence Intervals.
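The 68%, 95%, 99% statements can be checked by brute force. The sketch below is an illustration under assumptions of my own choosing: an exponential population with mean 1 and samples of 100 observations. For each simulated sample it computes SEM = s/√n and records how many SEMs separate the sample mean from the population mean.

```python
import math
import random

random.seed(2)  # fixed seed so the sketch is repeatable

POP_MEAN = 1.0  # mean of the assumed exponential population


def standardized_errors(n_samples=5_000, sample_size=100):
    """Return |sample mean - population mean| / SEM for each simulated sample."""
    errors = []
    for _ in range(n_samples):
        sample = [random.expovariate(1.0) for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (sample_size - 1))
        sem = s / math.sqrt(sample_size)  # SEM = s / sqrt(n)
        errors.append(abs(mean - POP_MEAN) / sem)
    return errors


errors = standardized_errors()


def fraction_within(k):
    """Proportion of samples whose mean is within k SEMs of the population mean."""
    return sum(e <= k for e in errors) / len(errors)


for k in (1, 2, 2.58):
    print(f"within {k} SEMs: {fraction_within(k):.3f}")
```

With a population this skewed, the proportions may come out a little below the nominal figures at moderate sample sizes; they should move closer as the sample size grows.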

This result is so important that statisticians have given it a special name. It is called The Central Limit Theorem. It is a limit theorem because it describes the behavior of the sample mean in the limit as the sample size grows large. It is called the Central limit theorem not because there's any central limit, but because it's a limit theorem that is central to the practice of statistics!

The key to the Central Limit Theorem is large sample size. The closer the histogram of the individual data values is to normal, the smaller "large" can be.

• If individual observations follow a normal distribution exactly, the behavior of sample means can be described by the normal distribution for any sample size, even 1.
• If the departure from normality is mild, "large" could be as few as 10. For biological units measured on a continuous scale (food intake, weight) it's hard to come up with a measurement for which a sample of 100 observations is not sufficient.
• One can always be perverse. If a variable is equal to 1 if "struck by lightning" and 0 otherwise, it might take many millions of observations before the normal distribution can be used to describe the behavior of the sample mean.
For variables like birth weight, caloric intake, cholesterol level, and crop yield measured on a continuous underlying scale, "large" is somewhere between 30 and 100. Having said this, it's only fair that I try to convince you that it's true.

The vast majority of the measurements we deal with are made on biological units on a continuous scale (cholesterol, birth weight, crop yield, vitamin intakes or levels, income). Most of the rest are indicators of some characteristic (0/1 for absence/presence of premature birth, disease). Very few individual measurements have population histograms that look less normal than one with three bars of equal height at 1, 2, and 9, that is, a population that is one-third 1s, one-third 2s, and one-third 9s. It's not symmetric. One-third of the population is markedly different from the other two-thirds. If the claim is true for this population, then perhaps it's true for population histograms closer to the normal distribution.

The distribution of the sample mean for various sample sizes is shown at the left. When the sample size is 1, the sample mean is just the individual observation. As the number of samples of a single observation increases, the histogram of sample means gets closer and closer to three bars of equal height at 1, 2, and 9, the population histogram for individual values. The histogram of individual sample values always looks like the population histogram of individual values as you take more samples of individual values. It does NOT look more and more normal unless the population from which the data are drawn is normal.

When samples of size two are taken, the first observation is equally likely to be 1, 2 or 9, as is the second observation.

Obs 1   Obs 2   Mean
1       1       1.0
1       2       1.5
1       9       5.0
2       1       1.5
2       2       2.0
2       9       5.5
9       1       5.0
9       2       5.5
9       9       9.0

The sample mean can take on the values 1, 1.5, 2, 5, 5.5, and 9.
• There is only one way for the mean to be 1 (both observations are 1), but
• there are two ways to get a mean of 1.5 (the first can be 1 and the second 2, or the first can be 2 and the second 1).
• There is one way to get a mean of 2,
• two ways to get a mean of 5,
• two ways to get a mean of 5.5, and
• one way to get a mean of 9.
Therefore, when many samples of size 2 are taken and their means calculated, 1, 2, and 9 will each occur 1/9 of the time, while 1.5, 5, and 5.5 will each occur 2/9 of the time, as shown in the picture.
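The counting argument above can be handed to the computer. This is a small sketch (the function name is mine) that lists every equally likely ordered sample of size n from the population of 1s, 2s, and 9s and tallies the resulting means as exact fractions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

population = [1, 2, 9]  # equally likely values, as in the text


def sampling_distribution(values, n):
    """Exact distribution of the mean of n independent draws from values."""
    counts = Counter(Fraction(sum(draw), n) for draw in product(values, repeat=n))
    total = len(values) ** n  # number of equally likely ordered samples
    return {mean: Fraction(ways, total) for mean, ways in sorted(counts.items())}


dist = sampling_distribution(population, 2)
for mean, prob in dist.items():
    print(float(mean), prob)
```

For n=2 this reproduces the 1/9 and 2/9 proportions worked out by hand. Full enumeration is practical only for small n, since the number of ordered samples is 3 to the power n.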

And so it goes for all sample sizes. Leave that to the mathematicians. The pictures are correct. Trust me. However, you are welcome to try to construct them for yourself, if you wish.

When n=10, the histogram of the sample means is very bumpy, but is becoming symmetric. When n=25, the histogram looks like a stegosaurus, but the bumpiness is starting to smooth out. When n=50, the bumpiness is reduced and the normal distribution is a good description of the behavior of the sample mean. The behavior (distribution) of the mean of samples of 100 individual values is nearly indistinguishable from the normal distribution to the resolution of the display. If the mean of 100 observations from this population of 1s, 2s, and 9s can be described by a normal distribution, then perhaps the mean of our data can be described by a normal distribution, too.
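If you do want to construct the pictures yourself, listing every possible sample stops being practical well before n=25. One workable alternative, sketched here with function names of my own, is to build the distribution of the sample total one observation at a time and then check how much of the distribution of the mean lies within 2 standard deviations (2σ/√n) of the population mean. For a normal distribution that figure is about 95%.

```python
import math


def sum_counts(values, n):
    """Count the ways each total can arise from n independent draws.

    Built by repeated convolution: add one observation at a time.
    """
    counts = {0: 1}
    for _ in range(n):
        nxt = {}
        for total, ways in counts.items():
            for v in values:
                nxt[total + v] = nxt.get(total + v, 0) + ways
        counts = nxt
    return counts


def prob_within_2sd(values, n):
    """Exact probability that the mean of n draws is within 2*sigma/sqrt(n) of mu."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    limit = 2 * sigma / math.sqrt(n)
    counts = sum_counts(values, n)
    inside = sum(ways for total, ways in counts.items()
                 if abs(total / n - mu) <= limit)
    return inside / len(values) ** n


for n in (1, 2, 10, 25, 50, 100):
    print(n, round(prob_within_2sd([1, 2, 9], n), 4))
```

The dictionary returned by `sum_counts` is also all you need to draw the histograms: divide each total by n to get the mean and each count by 3 to the power n to get its probability.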

When the distribution of the individual observations is symmetric, the convergence to normal is even faster. In the diagrams to the left, one-third of the individual observations are 1s, one-third are 2s, and one-third are 3s. The normal approximation is quite good, even for samples as small as 10. In fact, even n=2 isn't too bad!

To summarize once again, the behavior of sample means of large samples can be described by a normal distribution even when individual observations are not normally distributed.

This is about as far as we can go without introducing some notation to maintain rigor. Otherwise, we'll sink into a sea of confusion over samples and populations or between the standard deviation and the (about-to-be-defined) standard error of the mean.

Sample   Population
x̄        μ            mean
s        σ            standard deviation
n                     sample size

The sample has mean x̄ and standard deviation s. The sample comes from a population of individual values with mean μ and standard deviation σ.

The behavior of sample means of large samples can be described by a normal distribution, but which normal distribution? If you took a course in distribution theory, you could prove the following results: The mean of the normal distribution that describes the behavior of a sample mean is equal to μ, the mean of the distribution of the individual observations. For example, if individual daily caloric intakes have a population mean μ = 1800 kcal, then the mean of 50 of them, say, is described by a normal distribution with a mean also equal to 1800 kcal.

The standard deviation of the normal distribution that describes the behavior of the sample mean is equal to the standard deviation of the individual observations divided by the square root of the sample size, that is, σ/√n. Our estimate of this quantity, s/√n, is called the Standard Error of the Mean (SEM), that is,

SEM = s/√n.

I don't have a nonmathematical answer for the presence of the square root. Intuition says the mean should vary less from sample-to-sample as the sample sizes grow larger. This is reflected in the SEM, which decreases as the sample size increases, but it drops like the square root of the sample size, rather than the sample size itself.
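The square root is easy to see in a simulation even if it is hard to explain in words. In this sketch the population is assumed to be normal with mean 1800 kcal and SD 400 kcal (the SD is a made-up value for illustration). Quadrupling the sample size from 25 to 100 should cut the SD of the sample means roughly in half, not to a quarter.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable


def sem(sample):
    """Standard Error of the Mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def sd_of_sample_means(sample_size, n_samples=4_000, mu=1800.0, sigma=400.0):
    """SD of many simulated sample means (mu and sigma are assumed values)."""
    means = []
    for _ in range(n_samples):
        sample = [random.gauss(mu, sigma) for _ in range(sample_size)]
        means.append(sum(sample) / sample_size)
    return statistics.stdev(means)


sd_25 = sd_of_sample_means(25)    # theory: 400 / sqrt(25)  = 80
sd_100 = sd_of_sample_means(100)  # theory: 400 / sqrt(100) = 40
print(round(sd_25, 1), round(sd_100, 1), "ratio:", round(sd_25 / sd_100, 2))

# In practice we see only one sample, and the SEM computed from it
# estimates the SD that the simulation measures directly.
one_sample = [random.gauss(1800.0, 400.0) for _ in range(100)]
print("SEM from one sample of 100:", round(sem(one_sample), 1))
```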

To recap...
1. There are probability distributions. They do two things.
• They describe the population, that is, they say what proportion of the population can be found between any specified limits.
• They describe the behavior of individual members of the population, that is, they give the probability that an individual selected at random from the population will lie between any specified limits.
2. When single observations are being described, the "population" is obvious. It is the population of individuals from which the sample is drawn. When probability distributions are used to describe statistics such as sample means, there is a population, too. It is the (hypothetical) collection of values of the statistic should the experiment or sampling procedure be repeated over and over.
3. (Most important and often ignored!) The common statistical procedures we will be discussing are based on the probabilistic behavior of statistical measures. They are guaranteed to work as advertised, but only if the data arise from a probability based sampling scheme or from randomizing subjects to treatments. If the data do not result from random sampling or randomization, there is no way to judge the reliability of statistical procedures based on random sampling or randomization.

The Sample Mean As an Estimate of The Population Mean

These results say that for large sample sizes the behavior of sample means can be described by a normal distribution whose mean is equal to the population mean of the individual values, μ, and whose standard deviation is equal to σ/√n, which is estimated by the SEM. In a course in probability theory, we use this result to make statements about a yet-to-be-obtained sample mean when the population mean is known. In statistics, we use this result to make statements about an unknown population mean when the sample mean is known.

Preview: Let's suppose we are talking about 100 dietary intakes and the SEM is 40 kcal. The results of this note say the behavior of the sample mean can be described by a normal distribution whose SD is 40 kcal. We know that when things follow a normal distribution, they will be within 2 SDs of the population mean 95% of the time. In this case, 2 SDs is 80 kcal. Thus, the sample mean and population mean will be within 80 kcal of each other 95% of the time.

• If we were told the population mean were 2000 kcal and were asked to predict the sample mean, we would say there's a 95% chance that our sample mean would be in the range (1920[=2000-80], 2080[=2000+80]) kcal.
• It works the other way, too. If the population mean is unknown, but the sample mean is 1980 kcal, we would say we were 95% confident that the population mean was in the range (1900[=1980-80], 2060[=1980+80]) kcal.
Note: The use of the word confident in the previous sentence was not accidental. Confident and confidence are the technical words used to describe this type of estimation activity. Further discussion occurs in the notes on Confidence Intervals.
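The arithmetic in the preview is the same in both directions: take a center and go 2 SEMs either way. A minimal sketch (the function name is mine):

```python
def two_sem_interval(center, sem, multiplier=2):
    """Return (center - multiplier*sem, center + multiplier*sem)."""
    half_width = multiplier * sem
    return (center - half_width, center + half_width)


# Known population mean of 2000 kcal: where will the sample mean fall?
print(two_sem_interval(2000, 40))  # (1920, 2080)

# Observed sample mean of 1980 kcal: where is the population mean?
print(two_sem_interval(1980, 40))  # (1900, 2060)
```

Only the interpretation changes: the first interval is a prediction about a sample mean, the second a statement of confidence about the population mean.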

The decrease of SEM with sample size reflects the common sense idea that the more data you have, the better you can estimate something. Since the SEM goes down like the square root of the sample size, the bad news is that to cut the uncertainty in half, the sample size would have to be quadrupled. The good news is that if you can gather only half of the planned data, the uncertainty is only about 40% larger than what it would have been with all of the data, not twice as large.
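Both pieces of news come from the same ratio. Assuming the standard deviation of the individual observations stays the same, the SEM with the actual sample size relative to the SEM with the planned sample size is the square root of planned/actual. A quick sketch (the function name is mine):

```python
import math


def relative_uncertainty(planned_n, actual_n):
    """SEM at actual_n divided by SEM at planned_n, for the same SD."""
    return math.sqrt(planned_n / actual_n)


print(relative_uncertainty(100, 400))           # 0.5: four times the data halves the SEM
print(round(relative_uncertainty(100, 50), 3))  # 1.414: half the data, about 41% larger
```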

Potential source of confusion: How can the SEM be an SD? Probability distributions have means and standard deviations. This is true of the probability distribution that describes individual observations and the probability distribution that describes the behavior of sample means drawn from that population. Both of these distributions have the same mean, denoted μ here. If the standard deviation of the distribution that describes the individual observations is σ, then the standard deviation of the distribution that describes the sample mean is σ/√n, which is estimated by the SEM.

When you write your manuscripts, you'll talk about the SD of individual observations and the SEM as a measure of uncertainty of the sample mean as an estimate of the population mean. You'll never see anyone describing the SEM as estimating the SD of the sample mean. However, we have to be aware of this role for the SEM if we are to be able to understand and discuss statistical methods clearly.
