Confidence Intervals Involving Data
to Which a Logarithmic Transformation Has Been Applied

These data were originally presented in Simpson J, Olsen A, and Eden J (1975), "A Bayesian Analysis of a Multiplicative Treatment Effect in Weather Modification," Technometrics, 17, 161-166, and subsequently reported and analyzed by Ramsey FL and Schafer DW (1997), The Statistical Sleuth: A Course in Methods of Data Analysis. Belmont, CA: Duxbury Press. They involve an experiment performed in southern Florida between 1968 and 1972. An aircraft was flown through a series of clouds and, at random, seeded some of them with massive amounts of silver iodide. Precipitation after the aircraft passed through was measured in acre-feet.

The distribution of precipitation within group (seeded or not) is positively skewed (long-tailed to the right). The group with the higher mean has a proportionally larger standard deviation as well. Both characteristics suggest that a logarithmic transformation be used to make the data more symmetric and homoscedastic (more equal spread). The second pair of box plots bears this out. This transformation will tend to make CIs more reliable, that is, the level of confidence is more likely to be what is claimed.

                                    N    Mean    Std. Deviation   Median
Rainfall (acre-feet)  Not Seeded    26   164.6   278.4            44.2
                      Seeded        26   442.0   650.8            221.6

                        N    Mean     Std. Deviation   Geometric Mean
LOG_RAIN  Not Seeded    26   1.7330   .7130            54.08
          Seeded        26   2.2297   .6947            169.71

95% Confidence Interval for the Mean Difference
Seeded - Not Seeded (logged data)

                               Lower    Upper
Equal variances assumed        0.1046   0.8889
Equal variances not assumed    0.1046   0.8889
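
The "equal variances assumed" interval can be approximately reproduced from the summary statistics alone. The sketch below assumes the usual pooled-variance two-sample t interval; the variable names are illustrative, and the critical value 2.0086 (the 0.975 quantile of Student's t with 50 degrees of freedom) is taken from standard tables rather than from the original analysis. Because the inputs are rounded, the result agrees with the table only to about three decimal places.

```python
import math

# Summary statistics of the log10-transformed rainfall, from the table above.
n_seeded, mean_seeded, sd_seeded = 26, 2.2297, 0.6947
n_unseeded, mean_unseeded, sd_unseeded = 26, 1.7330, 0.7130

# Point estimate: difference of the log-scale means (Seeded - Not Seeded).
diff = mean_seeded - mean_unseeded

# Pooled variance and the standard error of the difference.
df = n_seeded + n_unseeded - 2
pooled_var = ((n_seeded - 1) * sd_seeded**2
              + (n_unseeded - 1) * sd_unseeded**2) / df
se = math.sqrt(pooled_var * (1 / n_seeded + 1 / n_unseeded))

# Assumption: 0.975 quantile of t with 50 df, looked up in a t table.
t_crit = 2.0086

lower = diff - t_crit * se
upper = diff + t_crit * se
print(f"difference = {diff:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```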

Researchers often transform data back to the original scale when a logarithmic transformation is applied to a set of data. Tables might include Geometric Means, which are the anti-logs of the mean of the logged data. When data are positively skewed, the geometric mean is invariably less than the arithmetic mean. This leads to questions of whether the geometric mean has any interpretation other than as the anti-log of the mean of the log transformed data.
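
As a quick check of the anti-log relationship, the following Python sketch back-transforms the log-scale means reported in the tables above and sets them beside the arithmetic means. The dictionary names are illustrative, and the small sample at the end is made up (it is not from the experiment); it is there only to show that the anti-log of the mean of the logs agrees with the standard library's geometric mean.

```python
import math
import statistics

# Log10-scale means and arithmetic means as reported in the tables above.
log_means = {"Not Seeded": 1.7330, "Seeded": 2.2297}
arith_means = {"Not Seeded": 164.6, "Seeded": 442.0}

# The geometric mean is the anti-log (base 10 here) of the mean of the logs.
geo_means = {group: 10 ** m for group, m in log_means.items()}

for group in log_means:
    print(f"{group}: geometric mean {geo_means[group]:.2f}, "
          f"arithmetic mean {arith_means[group]:.1f}")

# Hypothetical right-skewed sample, used only for illustration.
sample = [1.0, 2.0, 4.0, 8.0, 100.0]
antilog_of_mean_log = 10 ** statistics.mean([math.log10(x) for x in sample])
library_geo_mean = statistics.geometric_mean(sample)
```

In both groups the geometric mean falls well below the arithmetic mean, as expected for right-skewed data.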

The geometric mean is often a good estimate of the original median. The logarithmic transformation is monotonic, that is, data are ordered the same way in the log scale as in the original scale. If a is greater than b, then log(a) is greater than log(b). Since the observations are ordered the same way in both the original and log scales, the observation in the middle in the original scale is also the observation in the middle in the log scale, that is,

the log of the median = the median of the logs
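
A small numerical illustration of this identity, using a made-up sample (not the rainfall data) with an odd number of observations so that the median is a single observed value:

```python
import math
import statistics

# Hypothetical right-skewed sample with an odd number of observations.
sample = [3.0, 12.0, 40.0, 95.0, 700.0, 1500.0, 2700.0]
logs = [math.log10(x) for x in sample]

# Because log10 is monotonic, the middle observation is the same in both scales.
log_of_median = math.log10(statistics.median(sample))
median_of_logs = statistics.median(logs)

print(log_of_median, median_of_logs)
```

With an even number of observations the sample median is conventionally the average of the two middle values, so the identity holds only approximately for the sample, although it remains exact for a population median.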

If the log transformation makes the population symmetric, then the population mean and median are the same in the log scale. Whatever estimates the mean also estimates the median, and vice-versa. The mean of the logs estimates both the population mean and median in the log transformed scale. If the mean of the logs estimates the median of the logs, its anti-log--the geometric mean--estimates the median in the original scale!

The median rainfall for the seeded clouds is 221.6 acre-feet. In the picture, the solid line between the two histograms connects the median in the original scale to the mean in the log-transformed scale.

One property of the logarithm is that "the difference between logs is the log of the ratio", that is, log(x)-log(y)=log(x/y). The confidence interval from the logged data estimates the difference between the population means of log transformed data, that is, it estimates the difference between the logs of the geometric means. However, the difference between the logs of the geometric means is the log of the ratio of the geometric means. The anti-logarithms of the end points of this confidence interval give a confidence interval for the ratio of geometric means itself. Since the geometric mean is sometimes an estimate of the median in the original scale, it follows that a confidence interval for the ratio of the geometric means is approximately a confidence interval for the ratio of the medians in the original scale.
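
The chain of reasoning in this paragraph can be written out symbolically. The notation is introduced here for illustration and is not part of the original presentation: \mu_S and \mu_N are the population means of the (base 10) logged data for seeded and unseeded clouds, GM_S and GM_N are the corresponding population geometric means, and (L, U) is the confidence interval in the log scale.

```latex
\begin{align*}
\mu_S - \mu_N
  &= \log_{10} \mathrm{GM}_S - \log_{10} \mathrm{GM}_N
   = \log_{10}\!\left(\frac{\mathrm{GM}_S}{\mathrm{GM}_N}\right), \\[4pt]
L \le \mu_S - \mu_N \le U
  &\iff 10^{L} \le \frac{\mathrm{GM}_S}{\mathrm{GM}_N} \le 10^{U}.
\end{align*}
```

The second line holds because the anti-log is an increasing function, so it preserves the inequalities.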

In the (common) log scale, the mean difference between seeded and unseeded clouds is 0.4967. Our best estimate of the ratio of the median rainfall of seeded clouds to that of unseeded clouds is 10^0.4967 [= 3.14]. Our best estimate of the effect of cloud seeding is that it produces 3.14 times as much rain on average as not seeding.

Even when the calculations are done properly, the conclusion is often misstated.

The 95% CI for the population mean difference in the log scale (Seeded - Not Seeded) is (0.1046, 0.8889). For reporting purposes, this CI should be transformed back to the original scale. A CI for a difference in the log scale becomes a CI for a ratio in the original scale.

The antilogarithms of the endpoints of the confidence interval are 10^0.1046 = 1.27 and 10^0.8889 = 7.74. Thus, the report could read: "The geometric mean of the amount of rain produced by a seeded cloud is 3.14 times as much as that produced by an unseeded cloud (95% CI: 1.27 to 7.74 times as much)." Perhaps the cleanest way to report the results--the least prone to misstatement or misinterpretation--is to express everything as ratios:

The ratio of the geometric mean amount of rainfall from seeded clouds to that from unseeded clouds is 3.14 (95% CI: 1.27 to 7.74).
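
The back-transformation behind this statement takes only a few lines. The sketch below assumes common (base 10) logs, as in this example, and uses illustrative variable names; it starts from the log-scale estimate and interval reported above.

```python
# Log10-scale estimate and 95% CI for Seeded - Not Seeded, as reported above.
log_diff = 0.4967
log_ci = (0.1046, 0.8889)

# Anti-logs turn a difference of log means into a ratio of geometric means.
ratio = 10 ** log_diff
ratio_ci = tuple(10 ** endpoint for endpoint in log_ci)

report = (
    "The ratio of the geometric mean amount of rainfall from seeded clouds "
    f"to that from unseeded clouds is {ratio:.2f} "
    f"(95% CI: {ratio_ci[0]:.2f} to {ratio_ci[1]:.2f})."
)
print(report)
```

Had natural logs been used for the transformation, the anti-log would be the exponential function (math.exp in Python) rather than a power of 10; the resulting ratio and interval would be the same.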

If the logged data have a roughly symmetric distribution, you might go so far as to say, "The median amount of approximately..."

Comment: The logarithm is the only non-linear transformation that produces results that can be cleanly expressed in terms of the original data. Other transformations, such as the square root, are sometimes used, but it is difficult to restate their results in terms of the original data.

Copyright © 2000 Gerard E. Dallal