
Prologue: Terminology

A sample is a set of observations drawn from a larger population. The sample is the numbers in hand. The population is the larger set from which the sample was taken. A sample is usually drawn to make a statement about the larger population from which it was taken. Sample and population are two different things.

• The people who are asked how they intend to vote are the sample. The population is all voters.
• The people who have Nielsen meters attached to their television sets are the sample. The population is all viewers.
• The cans of product taken off the assembly line to measure their nutrient content are the sample. The population is the cannery's entire output.
• The set of heart disease patients who are randomized to one of two cholesterol lowering diets is the sample. The population is the set of all heart disease patients.
It is essential to maintain the distinction between sample and population so that we can express ourselves clearly and be understood.

Descriptive Statistics

After constructing graphical displays of a batch of numbers, the next thing to do is summarize the data numerically. Statistics are summaries derived from the data. The two important statistics that describe a single response are measures of location (typical value; the word location is a reference to the data's location on the number line) and spread (variability). The number of observations (the sample size, n) is important, too, but it is generally considered a "given". It is not counted as one of the summary statistics even though it fits the definition ("a function solely of the data") perfectly.

Mean and Standard Deviation

There are many reasonable single number summaries that describe where a set of values is located. Any statistic that describes a typical value will do. Statisticians refer to these measures, as a group, as averages.

The most commonly reported average is the mean--the sum of the observations divided by the sample size. The mean of the values 5, 6, 9, 13, 17 is (5+6+9+13+17)/5 or 50/5 = 10.

The mean is invariably what people intend when they say average. Mean is a more precise term than average because the mean can only be the sum divided by the sample size. There are other quantities that are sometimes called averages. These include the median (the middle value), the mode (the most commonly occurring value), and even the midrange (the mean of the minimum and maximum values). Statisticians prefer means because they understand them better, that is, they understand the relation between sample and population means better than the relation between the sample and population values of other averages.
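As a sketch using only Python's standard library, here are three of these averages for the five example values (the mode is omitted because no value repeats in this sample):

```python
import statistics

data = [5, 6, 9, 13, 17]

mean = sum(data) / len(data)            # (5+6+9+13+17)/5 = 50/5 = 10.0
median = statistics.median(data)        # the middle value: 9
midrange = (min(data) + max(data)) / 2  # (5 + 17)/2 = 11.0

print(mean, median, midrange)
```

Note that the three averages disagree even for this small sample; each is a defensible "typical value."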

The most commonly reported measure of variability or spread is the standard deviation (SD). The SD is also called the "root-mean-square deviation", which describes the way it is calculated. The operations root, mean, and square are applied in reverse order to the deviations--the individual differences between the observations and the mean.

• First, the deviations are calculated.
• Then, the deviations are squared.
• Next, the mean of the squared deviations is calculated.
• Finally, the square root of the mean is taken to obtain the SD.
To be precise, when this mean is calculated, the sum of the squared deviations is divided by one less than the sample size rather than the sample size itself. There's no reason why it must be done this way, but this is the modern convention. It's not important that this seem the most natural measure of spread. It's the way it's done. You can just accept it (which I recommend) or you'll have to study the mathematics behind it. But, that's another course.
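The divide-by-(n-1) convention shows up directly in Python's standard statistics module: statistics.stdev divides by n-1, while statistics.pstdev (the "population" SD) divides by n.

```python
import statistics

data = [5, 6, 9, 13, 17]

print(statistics.stdev(data))   # divides by n-1: sqrt(100/4) = 5.0
print(statistics.pstdev(data))  # divides by n:   sqrt(100/5), about 4.47
```

With large samples the two versions are nearly identical; the distinction matters most for small n.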

To see how the SD works, consider the values 5, 6, 9, 13, 17, whose mean, as we've already seen, is 10. The deviations are {(5-10), (6-10), (9-10), (13-10), (17-10)} or -5, -4, -1, 3, 7. (It is not an accident that the deviations sum to 0, but I digress.) The squared deviations are 25, 16, 1, 9, 49, and the standard deviation is the square root of (25+16+1+9+49)/(5-1), that is, the square root of 100/4 = 25, which is 5.
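The root-mean-square recipe can be followed step by step in Python, using the same five values:

```python
data = [5, 6, 9, 13, 17]
n = len(data)
mean = sum(data) / n                    # 10.0

deviations = [x - mean for x in data]   # [-5.0, -4.0, -1.0, 3.0, 7.0]
squared = [d ** 2 for d in deviations]  # [25.0, 16.0, 1.0, 9.0, 49.0]
mean_square = sum(squared) / (n - 1)    # 100/4 = 25.0 (note the n-1 divisor)
sd = mean_square ** 0.5                 # the root: 5.0
```

Reading the last three lines bottom-up gives "root-mean-square deviation."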

Why do we use something that might seem so complicated? Why not the range (difference between the highest and lowest observations) or the mean of the absolute values of the deviations? Without going into details, the SD has some attractive mathematical properties that make it the measure of choice. It's easy for statisticians to develop statistical techniques around it. So we use it. In any case, the SD satisfies the most important requirement of a measure of variability--the more spread out the data, the larger the SD. And the best part is, we have computers to calculate the SD for us. We don't have to compute it. We just have to know how to use it...properly!

Some Mathematical Notation

Back in olden times, mathematics papers contained straightforward notation like

a+b+c+d+...

It was awkward having all of those symbols, especially if you wanted to be adding up heights, weights, incomes, and so on. So, someone suggested using subscripts and writing sums in the form

x1 + x2 + x3 + x4 + ... + xn,
where 'n' is the sample size or number of observations, and using different letters for each quantity ('h' for heights, 'w' for weights, and so on).

The plus signs could be eliminated by writing the expression as

Sum(x1, x2, x3, x4, ..., xn)
and once people were used to it, Sum could be abbreviated to just S, as in
S(x1, x2, x3, x4, ..., xn).
The notion of limits of summation was then introduced so the expression could be reduced to S(xi: i=1,...,n), and the limits of summation were moved to decorate the "S". Now all that was left was to replace the S by its Greek equivalent, sigma, and here we are in modern times! Because we almost always sum from 1 to 'n', the limits of summation are often left off unless the sum is not from 1 to 'n'.

Now that we have this nice notation, let's use it to come up with expressions for the sample mean, which we'll write as the letter 'x' with a bar over it (x̄), and the standard deviation, s. The mean is easy. It's the sum of the observations (which we've already done) divided by the sample size: x̄ = (x1 + x2 + ... + xn)/n = (Σ xi)/n.

The standard deviation isn't much more difficult. Recall "root-mean-square". Begin with the deviations (xi - x̄), then square them, (xi - x̄)², then take their "mean", Σ(xi - x̄)²/(n-1), then take a square root: s = √[Σ(xi - x̄)²/(n-1)].

All done.

Some Facts About the Mean and Standard Deviation

If you're the mathematical type, you can prove these statements for yourself by using the formulas just developed for the mean and standard deviation. If you're the visual type, you should be able to see why these results are so by looking at the pictures to the left.

• When a constant is added to every observation, the new sample mean is equal to the original mean plus the constant.
• When a constant is added to every observation, the standard deviation is unaffected.
• When every observation is multiplied by the same constant, the new sample mean is equal to the original mean multiplied by the constant.
• When every observation is multiplied by the same constant, the new sample standard deviation is equal to the original standard deviation multiplied by the magnitude of the constant. (The reason for including the phrase "the magnitude of" is that if the constant is negative, the sign is dropped when the new SD is calculated.)
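These four facts are easy to check numerically. A sketch in Python, using the five example values (the constants 100 and -3 are arbitrary choices):

```python
import statistics

data = [5, 6, 9, 13, 17]
mean, sd = statistics.mean(data), statistics.stdev(data)  # 10.0 and 5.0

shifted = [x + 100 for x in data]
assert statistics.mean(shifted) == mean + 100   # mean shifts by the constant
assert statistics.stdev(shifted) == sd          # SD is unaffected

scaled = [-3 * x for x in data]
assert statistics.mean(scaled) == -3 * mean     # mean is multiplied by the constant
assert statistics.stdev(scaled) == 3 * sd       # SD is multiplied by its magnitude
```

Shifting slides the histogram along the number line without changing its shape, which is why the SD is untouched; rescaling stretches it, which is why the SD changes.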

Mental Pictures

The mean and SD are a particularly appropriate summary for data whose histogram approximates a normal distribution (the bell-shaped curve). If you say that a set of data has a mean of 220, the typical listener will picture a bell-shaped curve centered with its peak at 220.

What information does the SD convey? When data are approximately normally distributed,

• approximately 68% of the data lie within one SD of the mean.
• approximately 95% of the data lie within two SDs of the mean.
• approximately 99.7% of the data lie within three SDs of the mean.
For example, if a set of total cholesterol levels has a mean of 220 mg/dl and a SD of 20 mg/dl and its histogram looks like a normal distribution, then
• about 68% of the cholesterol values will be in the range 200 to 240 (200 = 220 - 20 and 240 = 220 + 20).
• Similarly, about 95% of the values will be in the range 180 to 260 (180 = 220 - 2*20 and 260 = 220 + 2*20) and
• 99.7% of the values will be in the range 160 to 280 (160 = 220 - 3*20 and 280 = 220 + 3*20).
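A quick simulation makes the rule of thumb concrete. A sketch in Python that draws normally distributed values with mean 220 and SD 20 (the sample size and seed are arbitrary choices):

```python
import random

random.seed(0)
values = [random.gauss(220, 20) for _ in range(100_000)]

for k in (1, 2, 3):
    lo, hi = 220 - k * 20, 220 + k * 20
    frac = sum(lo <= v <= hi for v in values) / len(values)
    print(f"within {k} SD(s), {lo} to {hi}: {frac:.3f}")
```

The three printed fractions should come out close to 0.68, 0.95, and 0.997.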

Percentiles

When the histogram of the data does not look approximately normal, the mean and SD can be misleading because of the mental picture they paint. Give people a mean and standard deviation and they think of a bell-shaped curve with observations equally likely to be a certain distance above the mean as below. But, there's no guarantee that the data aren't really skewed or that outliers aren't distorting the mean and SD, which would invalidate the rule of thumb, from the last section, describing the proportion of observations within so many SDs of the mean.

One way to describe such data without giving a misleading impression of where they lie is to report some percentiles. The p-th percentile is the value that p-% of the data are less than or equal to. If p-% of the data lie below the p-th percentile, it follows that (100-p)-% of the data lie above it. For example, if the 85-th percentile of household income is $60,000, then 85% of households have incomes of $60,000 or less and the top 15% of households have incomes of $60,000 or more.

The most famous of all percentiles--the 50-th percentile--has a special name: the median. Think of the median as the value that splits the data in half--half of the data are above the median; half of the data are below the median*. Two other percentiles with special names are the quartiles: the lower quartile (the 25-th percentile) and the upper quartile (the 75-th percentile).

The median and the quartiles divide the data into quarters:

• One-quarter of the data is less than the lower quartile;
• one-quarter of the data falls between the lower quartile and the median;
• one-quarter of the data falls between the median and the upper quartile;
• one-quarter of the data is greater than the upper quartile.

Sometimes the minimum and maximum are presented along with the median and the quartiles to provide a five number summary of the data. Unlike a mean and SD, this five number summary can be used to identify skewed data. When there are many observations (hundreds or thousands), some investigators report the 5-th and 95-th percentiles (or the 10-th and 90-th or the 2.5-th and the 97.5-th percentiles) instead of the minimum and maximum to establish so-called normal ranges.
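In Python's standard library, statistics.quantiles gives the quartiles, from which the five number summary follows (the dataset below is a made-up, right-skewed example):

```python
import statistics

data = [2, 3, 5, 7, 8, 10, 12, 15, 21, 30, 55]  # made-up, skewed right

q1, median, q3 = statistics.quantiles(data, n=4)  # lower quartile, median, upper quartile
five_number = (min(data), q1, median, q3, max(data))
print(five_number)   # (2, 5.0, 10.0, 21.0, 55)
```

One caution: different software uses slightly different conventions for computing percentiles, so other packages may report slightly different quartiles for the same data. Here the long gap between the upper quartile (21.0) and the maximum (55), compared with the short gap between the minimum (2) and the lower quartile (5.0), is the signature of right skewness.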

You'll sometimes see the recommendation that the Inter-Quartile Range (the difference between the upper and lower quartiles) be reported as a measure of spread. It's certainly a measure of spread--it measures the spread of the middle half of the data. But as a pair, the median and IQR have the same deficiency as the mean and the SD. There's no way a two number summary can describe the skewness of the data. When one sees a median and an IQR, one suspects they are being reported because the data are skewed, but one has no sense of how skewed! It would be much better to report the median and the quartiles.

In practice, you'll almost always see means and SDs. If your goal is to give a simple numerical summary of the distribution of your data, look at graphical summaries of your data to get a sense of whether the mean and SD might produce the wrong mental picture. If they might, consider reporting percentiles instead.

Mean versus Median

The mean is the sum of the data divided by the sample size. If a histogram could be placed on a weightless bar and the bar on a fulcrum, the histogram would balance perfectly when the fulcrum is directly under the mean. The median is the value in the middle of the histogram. If the histogram is symmetric, the mean and the median are the same. If the histogram is not symmetric, the mean and median can be quite different. Take a data set whose histogram is symmetric. Balance it on the fulcrum. Now take the largest observation and start moving it to the right. The fulcrum must move to the right with the mean, too, if the histogram is to stay balanced. You can make the mean as large as you want by moving this one observation farther and farther to the right, but all this time the median stays the same!
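The balancing argument can be demonstrated numerically. A sketch in Python that drags the largest of the five example values far to the right:

```python
import statistics

data = [5, 6, 9, 13, 17]
print(statistics.mean(data), statistics.median(data))   # 10.0 and 9

data[-1] = 1000   # move the largest observation far to the right
print(statistics.mean(data), statistics.median(data))   # 206.6 and 9
```

One wild observation multiplied the mean twentyfold while the median never budged; this robustness is why the median is often preferred for skewed data such as incomes.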

A point of statistical trivia: If a histogram with a single peak is skewed to the right, the order of the three averages lie along the measurement scale in reverse alphabetical order--mode, median, mean.

Geometric Mean

When data do not follow a normal distribution, reports sometimes contain a statement such as, "Because the data were not normally distributed, {some transformation} was applied to the data before formal analyses were performed. Tables and graphs are presented in the original scale."

When data are skewed to the right, it often happens that the histogram looks normal, or at least symmetric, after the data are logged. The transformation would be applied prior to formal analysis and this would be reported in the Statistical Methods section of the manuscript. In summary tables, it is common for researchers to report geometric means. The geometric mean is the antilog of the mean of the logged data--that is, the data are logged, the mean of the logs is calculated, and the anti-log of the mean is obtained. The presence of geometric means indicates the analysis was done in the log scale, but the results were transformed back to the original scale for the convenience of the reader.

If the histogram of the log-transformed data is approximately symmetric, the geometric mean of the original data is approximately equal to the median of the original data.

1. The logarithmic transformation is monotone, that is, if a<b, then log(a)<log(b) and vice-versa. This means that when a set of observations is ordered, the order is the same whether you use the original values or their logs.
2. It follows from (1) that the observation that is in the middle in the original scale will be in the middle in the log scale**. The reverse is true, too. The observation that is in the middle in the log scale will be in the middle in the original scale.
3. Since it is assumed that the log-transformed data are approximately symmetric, the mean and median of the log-transformed data are roughly equal. (If the data were perfectly symmetric, the mean and the median would be identical.)
4. It follows from (3) that the anti-log of the mean of the logs is roughly equal to the anti-log of the median of the logs.
5. Since the Geometric Mean is the anti-log of the mean of the logs, it follows from (4) that it is roughly equal to the anti-log of the median of the logs, but that's the value in the middle in the original scale--the median!

The accuracy of the approximation will depend on the extent to which the distribution in the log scale is symmetric.
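The five-step argument can be checked with simulated data. A sketch in Python drawing from a lognormal distribution, so the data are skewed right but their logs are normally distributed (the parameters and seed are arbitrary choices):

```python
import math
import random
import statistics

random.seed(1)
data = [random.lognormvariate(3.0, 0.5) for _ in range(100_000)]  # skewed right

# geometric mean: log the data, average the logs, anti-log the average
geometric_mean = math.exp(statistics.mean(math.log(x) for x in data))
median = statistics.median(data)

print(geometric_mean, median)   # both should be close to e**3, about 20.09
```

The two summaries should agree to within a fraction of a percent here, because the log-scale distribution is exactly symmetric by construction.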

Is there a geometric SD? Yes. It's the antilog of the SD of the log-transformed values. The interpretation is similar to that of the SD. If GBAR is the geometric mean and GSD is the geometric standard deviation, 95% of the data lie in the range from GBAR/GSD² to GBAR*GSD², that is, instead of adding and subtracting 2 SDs, we multiply and divide by the square of the GSD.

These differences follow from properties of the logarithm, namely,

log(ab) = log(a) + log(b) and

log(a/b) = log(a) - log(b)

that is, the log of a product is the sum of the logs, while the log of a ratio is the difference of the logs.

Since the data are approximately normally distributed in the log scale, it follows that 95% of the data lie in the range mean - 2 SD to mean + 2 SD in the log scale. But these limits are

log(GBAR) + log(GSD) + log(GSD) = log(GBAR*GSD²) and

log(GBAR) - log(GSD) - log(GSD) = log(GBAR/GSD²)
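This, too, can be checked by simulation. A sketch in Python, again using lognormal data (arbitrary parameters and seed):

```python
import math
import random
import statistics

random.seed(2)
data = [random.lognormvariate(3.0, 0.5) for _ in range(100_000)]

logs = [math.log(x) for x in data]
gbar = math.exp(statistics.mean(logs))    # geometric mean: antilog of mean of logs
gsd = math.exp(statistics.stdev(logs))    # geometric SD: antilog of SD of logs

# multiply and divide by GSD squared instead of adding and subtracting 2 SDs
lo, hi = gbar / gsd**2, gbar * gsd**2
coverage = sum(lo <= x <= hi for x in data) / len(data)
print(coverage)    # roughly 0.95
```

Note that the interval from GBAR/GSD² to GBAR*GSD² is not symmetric about GBAR in the original scale; it is symmetric in the log scale, which is where the data are approximately normal.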

------------

*This definition isn't rigorous for two reasons. First, the median may not be unique. If there is an even number of observations, then any number between the two middle values qualifies as a median. Standard practice is to report the mean of the two middle values. Second, if there is an odd number of observations or if the two middle values are tied, no value has half of the data greater than it and half less. A rigorous definition of the median is that it is a value such that at least half of the data are less than or equal to it and half of the data are greater than or equal to it. Consider the data set 0,0,0,0,1,7. The median is 0 since 4/6 of the data are less than or equal to 0, while all of the data are greater than or equal to 0. Similar remarks apply to all other percentiles. However, so we don't get bogged down in details, let's think of the p-th percentile as the value that has "p-% of the data below it; (100-p)-% of the data above it".

**If the number of observations is even, it is more correct to say that the log of a median in the original scale is a median in the log scale. That's because when the number of observations is even, any value between the two middle values satisfies the definition of a median. Standard practice is to report the mean of the two middle values, but that's just a convention.

Consider a data set with two observations--10 and 100, with a median of 55. Their common logarithms are 1 and 2, with a median of 1.5. Now, log(55)=1.74 which is not 1.5. Nevertheless, 1.74 is a median in the log scale since it lies between the two middle values.