The Ubiquitous Sample Mean!

The sample mean plays many distinct roles.

• It is the best estimate of an individual value in the sample. ("If I were to select one observation at random from the sample, what would you guess that value to be?" "<The sample mean>.")
• It is the best estimate of an individual value drawn from the population. ("If I were to select one observation from the population, what would you guess that value to be?" "<The sample mean>." Or, "If we were to collect one more observation, what would you guess its value to be?" "<The sample mean>.") Notice that collecting one more observation is the same thing as drawing an observation at random from the population.
• It is the best estimate of the mean of the population from which the sample was drawn. ("What would you guess the mean of all values in the population to be?" "<The sample mean>.")
• Whatever else it is, it is the mean of the sample.
The differences between these roles must be appreciated and understood. Failing to distinguish between them is a common cause of confusion about many basic statistical techniques.

The sample mean and standard deviation ($\bar{x}$, $s$) together summarize individual data values when the data follow a normal distribution or something not too far from it. The sample mean describes a typical value. The sample standard deviation (SD) measures the spread of individual values about the sample mean. The SD also estimates the spread of individual values about the population mean, that is, the extent to which a single value chosen at random might differ from the population mean.

Just as the sample standard deviation measures the uncertainty with which the sample mean estimates individual measurements, a quantity called the Standard Error of the Mean (SEM = $s/\sqrt{n}$) measures the uncertainty with which the sample mean estimates a population mean. Read the last sentence again...and again.

• The sample mean estimates individual values.
• The uncertainty with which $\bar{x}$ estimates individual values is given by the SD.
• The sample mean estimates the population mean.
• The uncertainty with which $\bar{x}$ estimates the population mean is given by the SEM.
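
As a minimal sketch of the quantities in the list above, the following Python snippet computes the sample mean, SD, and SEM with the standard library. The five data values and the helper name `summarize` are made up for illustration; the SEM is taken to be the sample SD divided by the square root of the sample size.

```python
import math
import statistics


def summarize(values):
    """Return (sample mean, sample SD, SEM) for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample SD (n - 1 in the denominator)
    sem = sd / math.sqrt(n)         # standard error of the mean
    return mean, sd, sem


# Hypothetical data, not taken from the text.
sample = [4, 8, 6, 5, 7]
mean, sd, sem = summarize(sample)
print(f"mean = {mean}, SD = {sd:.3f}, SEM = {sem:.3f}")
```

The SD describes how far individual values typically sit from the mean. The SEM, smaller by a factor of the square root of the sample size, describes how far the sample mean is likely to be from the population mean.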

Intuition says the more data there are, the more accurately we can estimate a population mean. With more data, the sample and population means are likely to be closer. The SEM expresses this numerically. The SEM says the likely difference between the sample and population means, $\bar{x} - \mu$, decreases as the sample size increases, but only in inverse proportion to the square root of the sample size. To decrease the uncertainty by a factor of 2, the sample size must be increased by a factor of 4; to cut the uncertainty by a factor of 10, a sample 100 times larger is required.
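
The square-root relationship can be checked with a few lines of Python. This is only an illustration: the SD of 200 and the sample sizes are arbitrary choices, and `sem` is a hypothetical helper that assumes the SEM is the SD divided by the square root of the sample size.

```python
import math


def sem(sd, n):
    """Standard error of the mean for a sample SD and sample size n."""
    return sd / math.sqrt(n)


sd = 200.0           # arbitrary SD, held fixed for the comparison
base_n = 100         # arbitrary starting sample size

# Multiply the sample size by 1, 4, and 100 and watch the SEM shrink.
for factor in (1, 4, 100):
    n = base_n * factor
    print(f"n = {n:6d}  SEM = {sem(sd, n):.1f}")
```

Quadrupling the sample size halves the SEM, and a sample 100 times larger cuts it by a factor of 10.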

We have already noted that when individual data items follow something not very far from a normal distribution, 68% of the data will be within one standard deviation of the mean, 95% will be within two standard deviations of the mean, and so on. But this is true only when the individual data values are roughly normally distributed.

There is an elegant statistical limit theorem that describes the likely difference between sample and population means, $\bar{x} - \mu$, when sample sizes are large. It is so central to statistical practice that it is called the Central Limit Theorem. It says that, for large samples, the normal distribution can be used to describe the likely difference between the sample and population means regardless of the distribution of the individual data items! In particular, 68% of the time the difference between the sample and population means will be less than 1 SEM, 95% of the time the difference will be less than 2 SEMs, and so on. You can see why the result is central to statistical practice. It lets us ignore the distribution of individual data values when talking about the behavior of sample means in large samples. The distribution of individual data values becomes irrelevant when making statements about the difference between sample and population means. From a statistical standpoint, sample means obtained by replicating a study can be thought of as individual observations whose standard deviation is equal to the SEM.
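
A small simulation can make this result concrete. The sketch below is an illustration under stated assumptions, not part of the original argument: it draws samples from an exponential distribution (chosen only because it is strongly right-skewed, with population mean and SD both equal to 1), uses a sample size of 100 and 2000 replications, and fixes the random seed so the run is repeatable. It compares how often an individual value falls within 1 SD of the population mean with how often a sample mean falls within 1 SEM of it.

```python
import math
import random
import statistics

rng = random.Random(0)       # fixed seed, so results are reproducible
POP_MEAN = 1.0               # mean of an exponential distribution with rate 1
POP_SD = 1.0                 # its SD is also 1
N = 100                      # observations per sample
REPS = 2000                  # number of simulated studies

individuals_within_1sd = 0
means_within_1sem = 0

for _ in range(REPS):
    sample = [rng.expovariate(1.0) for _ in range(N)]
    # Individual values: count those within 1 SD of the population mean.
    individuals_within_1sd += sum(abs(x - POP_MEAN) < POP_SD for x in sample)
    # Sample mean: is it within 1 SEM (estimated from this sample)
    # of the population mean?
    sem = statistics.stdev(sample) / math.sqrt(N)
    if abs(statistics.mean(sample) - POP_MEAN) < sem:
        means_within_1sem += 1

frac_individuals = individuals_within_1sd / (REPS * N)
frac_means = means_within_1sem / REPS
print(f"individual values within 1 SD: {frac_individuals:.3f}")
print(f"sample means within 1 SEM:     {frac_means:.3f}")
```

For this skewed population the first fraction should come out well above 68% (about 86% in theory), so the normal rule fails for individual values, while the second should land close to 68%, as the Central Limit Theorem suggests.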

Let's stop and summarize: When describing the behavior of individual values, the normal distribution can be used only when the data themselves follow something close to a normal histogram. When describing the difference between sample and population means based on large enough samples, the normal distribution can be used whatever the histogram of the individual observations. Let's continue…

Anyone familiar with mathematics and limit theorems knows that limit theorems begin, "As the sample size approaches infinity . . ." No one has infinite amounts of data. The question naturally arises about the sample size at which the result can be used in practice. Mathematical analysis, simulation, and empirical study have demonstrated that for the types of data encountered in the natural and social sciences (and certainly almost any response measured on a continuous scale) sample sizes as small as 30 to 100 (!) will be adequate.

To reinforce these ideas, consider dietary intake, which tends to follow a normal distribution. Suppose we find that daily caloric intakes in a random sample of 100 undergraduate women have a mean of 1800 kcal and a standard deviation of 200 kcal. Because the individual values follow a normal distribution, approximately 95% of them will be within two SDs of the mean, that is, in the range (1400, 2200) kcal. The Central Limit Theorem lets us do the same thing to estimate the (population) mean daily caloric intake of all undergraduate women. The SEM is 20 kcal ($= 200/\sqrt{100}$). A 95% confidence interval for the mean daily caloric intake of all undergraduate women is the sample mean plus or minus two SEMs, or (1760, 1840) kcal. That is, we are 95% confident the mean daily caloric intake of all undergraduate women falls in the range (1760, 1840) kcal.

Consider household income, which invariably is skewed to the right. Most households have low incomes while a few have very large incomes. Suppose household incomes measured in a random sample of 400 households have a mean of \$10,000 and an SD of \$3000. The SEM is \$150 ($= 3000/\sqrt{400}$). Because the data do not follow a normal distribution, there is no simple rule involving the sample mean and SD that can be used to describe the location of the bulk of the individual values. However, we can still construct a 95% confidence interval for the population mean income as $\bar{x} \pm 2\,\mathrm{SEM}$, or \$(9700, 10300). Because the sample size is large, the distribution of individual incomes is irrelevant to constructing confidence intervals for the population mean.
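
The arithmetic in the two examples above can be reproduced as follows. This is a sketch using the rule of thumb from the text (the sample mean plus or minus 2 SEMs for an approximate 95% confidence interval); the function names are hypothetical.

```python
import math


def two_sd_range(mean, sd):
    """Range expected to hold about 95% of individual values.

    Valid only when the data are roughly normally distributed.
    """
    return (mean - 2 * sd, mean + 2 * sd)


def approx_95_ci(mean, sd, n):
    """Approximate 95% confidence interval for the population mean
    (mean +/- 2 SEM), appropriate for large samples."""
    sem = sd / math.sqrt(n)
    return (mean - 2 * sem, mean + 2 * sem)


# Caloric intake: n = 100, mean 1800 kcal, SD 200 kcal (roughly normal data).
print(two_sd_range(1800, 200))        # range for individual values
print(approx_95_ci(1800, 200, 100))   # interval for the population mean

# Household income: n = 400, mean 10,000, SD 3000 (right-skewed data).
# Only the confidence interval applies; the 2-SD range is not meaningful here.
print(approx_95_ci(10_000, 3000, 400))
```

The same `approx_95_ci` call serves both data sets because, with samples this large, the shape of the individual values does not enter the calculation.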