to Which a Logarithmic Transformation Has Been Applied

These data were originally
presented in Simpson J, Olsen A, and Eden J (1975), "A Bayesian Analysis
of a Multiplicative Treatment Effect in Weather Modification,"
Technometrics, 17, 161-166, and subsequently reported and analyzed by
Ramsey FL and Schafer DW (1997), *The Statistical Sleuth: A Course in
Methods of Data Analysis*. Belmont, CA: Duxbury Press. They involve an
experiment performed in southern Florida between 1968 and 1972. An
aircraft was flown through a series of clouds, some of which were chosen
at random and seeded with massive amounts of silver iodide. Precipitation after
the aircraft passed through was measured in acre-feet.

The distribution of precipitation within each group (seeded or not) is positively skewed (long-tailed to the right). The group with the higher mean also has a proportionally larger standard deviation. Both characteristics suggest that a logarithmic transformation be used to make the data more symmetric and homoscedastic (more equal spread). The second pair of box plots bears this out. This transformation will tend to make CIs more reliable, that is, the level of confidence is more likely to be what is claimed.

| | | N | Mean | Std. Deviation | Median |
|---|---|---|---|---|---|
| Rainfall | Not Seeded | 26 | 164.6 | 278.4 | 44.2 |
| | Seeded | 26 | 442.0 | 650.8 | 221.6 |

| | | N | Mean | Std. Deviation | Geometric Mean |
|---|---|---|---|---|---|
| LOG_RAIN | Not Seeded | 26 | 1.7330 | .7130 | 54.08 |
| | Seeded | 26 | 2.2297 | .6947 | 169.71 |

95% Confidence Interval for the Mean Difference Seeded - Not Seeded (logged data)

| | Lower | Upper |
|---|---|---|
| Equal variances assumed | 0.1046 | 0.8889 |
| Equal variances not assumed | 0.1046 | 0.8889 |
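As a check, the equal-variances interval can be approximately reproduced from the summary statistics of the logged data alone. The sketch below assumes the usual pooled two-sample t procedure and hard-codes the critical value for 50 degrees of freedom (about 2.0086), which is a looked-up value rather than something stated in the text. The variable names are illustrative.

```python
import math

# Summary statistics of the logged (base 10) rainfall, as reported above.
n_seeded, mean_seeded, sd_seeded = 26, 2.2297, 0.6947
n_unseeded, mean_unseeded, sd_unseeded = 26, 1.7330, 0.7130

# Assumption: 97.5th percentile of Student's t with 26 + 26 - 2 = 50 df.
T_CRIT_50_DF = 2.0086

df = n_seeded + n_unseeded - 2
pooled_var = ((n_seeded - 1) * sd_seeded**2
              + (n_unseeded - 1) * sd_unseeded**2) / df
std_err = math.sqrt(pooled_var * (1 / n_seeded + 1 / n_unseeded))

diff = mean_seeded - mean_unseeded
lower = diff - T_CRIT_50_DF * std_err
upper = diff + T_CRIT_50_DF * std_err
print(f"difference = {diff:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

Because the inputs are rounded to four decimal places, the result should agree with the tabled limits only to about three decimal places.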

Researchers often transform
data back to the original scale when a logarithmic transformation is
applied to a set of data. Tables might include *Geometric Means*,
which are the anti-logs of the mean of the logged data. When data are
positively skewed, the geometric mean is invariably less than the
arithmetic mean. This leads to questions of whether the geometric mean
has any interpretation other than as the anti-log of the mean of the log
transformed data.
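The relationship between the tabled log means and geometric means can be checked directly. This is a minimal sketch that uses only the summary values reported for the rainfall data and assumes, as the later calculations do, that the logs are common (base 10) logs. The dictionary names are illustrative.

```python
# Means of the base-10 logs and arithmetic means, as reported for the data.
log_means = {"Not Seeded": 1.7330, "Seeded": 2.2297}
arithmetic_means = {"Not Seeded": 164.6, "Seeded": 442.0}

# Geometric mean = anti-log of the mean of the logged data.
geometric_means = {group: 10 ** m for group, m in log_means.items()}

for group, gm in geometric_means.items():
    print(f"{group}: geometric mean = {gm:.2f}, "
          f"arithmetic mean = {arithmetic_means[group]}")
```

In both groups the geometric mean falls well below the arithmetic mean, as expected for positively skewed data.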

The geometric mean is often a good estimate of the original median. The logarithmic transformation is monotonic, that is, data are ordered the same way in the log scale as in the original scale. If a is greater than b, then log(a) is greater than log(b). Since the observations are ordered the same way in both the original and log scales, the observation in the middle in the original scale is also the observation in the middle in the log scale, that is, the log of the median is the median of the logs.

If the log transformation makes the population symmetric, then the population mean and median are the same in the log scale. Whatever estimates the mean also estimates the median, and vice-versa. The mean of the logs estimates both the population mean and median in the log transformed scale. If the mean of the logs estimates the median of the logs, its anti-log--the geometric mean--estimates the median in the original scale!
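The argument can be illustrated with a small made-up data set. The five values below are hypothetical, not the rainfall data. They are evenly spaced in the log scale, so they are symmetric there and strongly right-skewed in the original scale.

```python
import math
import statistics

# Hypothetical values, symmetric in the log scale, skewed in the original.
data = [2, 20, 200, 2000, 20000]
logs = [math.log10(x) for x in data]

median_original = statistics.median(data)
median_of_logs = statistics.median(logs)
geometric_mean = 10 ** statistics.mean(logs)
arithmetic_mean = statistics.mean(data)

# Monotonicity: the middle observation is the same in both scales.
print(median_of_logs == math.log10(median_original))
# The geometric mean lands on the median; the arithmetic mean does not.
print(median_original, round(geometric_mean, 6), arithmetic_mean)
```

With an even number of observations the sample median is the average of the two middle values, so the log of the median and the median of the logs agree only approximately.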

The median rainfall for the seeded clouds is 221.6 acre-feet. In the picture, the solid line between the two histograms connects the median in the original scale to the mean in the log-transformed scale.

One property of the logarithm is that "the difference between logs is the log of the ratio", that is, log(x)-log(y)=log(x/y). The confidence interval from the logged data estimates the difference between the population means of the log transformed data, that is, it estimates the difference between the logs of the geometric means. However, the difference between the logs of the geometric means is the log of the ratio of the geometric means. The anti-logarithms of the end points of this confidence interval therefore give a confidence interval for the ratio of geometric means itself. Since the geometric mean is sometimes an estimate of the median in the original scale, it follows that a confidence interval for the ratio of the geometric means is approximately a confidence interval for the ratio of the medians in the original scale.
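Written out for common logs, the step from a difference to a ratio looks as follows. The symbols are introduced here only for illustration: $\bar{\ell}_S$ and $\bar{\ell}_N$ are the means of the logged data for seeded and unseeded clouds, $G_S$ and $G_N$ are the corresponding geometric means, and $(L, U)$ is a confidence interval in the log scale.

$$
\bar{\ell}_S - \bar{\ell}_N
  = \log_{10} G_S - \log_{10} G_N
  = \log_{10}\frac{G_S}{G_N}
\quad\Longrightarrow\quad
\frac{G_S}{G_N} = 10^{\,\bar{\ell}_S - \bar{\ell}_N}
$$

Because $10^{x}$ is an increasing function of $x$, an interval $(L, U)$ for the difference in the log scale corresponds to the interval $(10^{L}, 10^{U})$ for the ratio.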

In the (common) log scale, the mean difference between seeded and
unseeded clouds is 0.4967. Our best estimate of the ratio of the median
rainfall of seeded clouds to that of unseeded clouds is
10^{0.4967} = 3.14. Our best estimate of the effect of cloud
seeding is that it produces 3.14 times as much rain on average as not
seeding.

Even when the calculations are done properly, the conclusion is often misstated.

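- The difference 0.4967 does *not* mean that seeded clouds produce 0.4967 acre-feet more rain than unseeded clouds. It is a difference between means in the log scale.

- The 3.14 means 3.14 times that of unseeded clouds. It does *not* mean 3.14 acre-feet more than unseeded clouds.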

The 95% CI for the population mean difference in the log scale
(Seeded - Not Seeded) is (0.1046, 0.8889). For reporting purposes,
this CI should be transformed back to the original scale. A CI for a
**difference** in the log scale becomes a CI for a **ratio** in
the original scale.

The antilogarithms of the endpoints of the confidence interval are
10^{0.1046} = 1.27, and 10^{0.8889} = 7.74. Thus, the
report could read: "The geometric mean of the amount of rain
produced by a seeded cloud is 3.14 times as much as that produced by an
unseeded cloud (95% CI: 1.27 to 7.74 times as much)." Perhaps the
cleanest way to report the results--the least prone to misstatement or
misinterpretation--is to **express everything as ratios**:

The ratio of the geometric mean amount of rainfall from seeded clouds to that from unseeded clouds is 3.14 (95% CI: 1.27 to 7.74).
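The back-transformation behind this statement is a one-line anti-log applied to the point estimate and to each end of the interval. A minimal sketch, assuming common (base 10) logs as in the calculations above; `back_transform` is a hypothetical helper name.

```python
def back_transform(log10_value):
    """Anti-log of a base-10 log-scale quantity (hypothetical helper)."""
    return 10 ** log10_value

diff_log = 2.2297 - 1.7330   # mean difference in the log scale
ci_log = (0.1046, 0.8889)    # 95% CI for that difference

ratio = back_transform(diff_log)
ci_ratio = tuple(back_transform(end) for end in ci_log)
print(f"ratio of geometric means = {ratio:.2f} "
      f"(95% CI: {ci_ratio[0]:.2f} to {ci_ratio[1]:.2f})")
```

Had natural logs been used for the transformation, the same steps would apply with the exponential function in place of powers of 10.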

If the logged data have a roughly symmetric distribution, you might go so far as to say, "The median amount of rain...is approximately..."

Comment: The logarithm is the only non-linear transformation that produces results that can be cleanly expressed in terms of the original data. Other transformations, such as the square root, are sometimes used, but it is difficult to restate their results in terms of the original data.