Poisson Regression
Gerard E. Dallal, Ph.D.

Poisson regression is yet another form of, well,...regression. A model is fitted. Coefficients are obtained and interpreted as in any other regression model. As with logistic regression, the underlying mathematics and underlying probability distribution theory are different from ordinary least-squares regression, which is why Poisson regression is treated as a separate topic even though, from the consumer's viewpoint, it's all regression.

Poisson regression seeks to model counts--the number of children, number of colds, number of falls, number of telephone calls...you get the idea. Just as logistic regression models the log odds of an event, Poisson regression models the (natural) log of the expected count.

Because it is the log of the expected count that is being modeled, negative predicted values pose no problem: a negative predicted log simply corresponds to an expected count between 0 and 1.
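A quick numerical check of this point, as a minimal Python sketch. The log values are arbitrary illustrations, not output from any fitted model.

```python
import math

# Arbitrary values of the predicted log expected count (illustrative only).
predicted_logs = [-3.0, -1.0, -0.1, 0.0, 1.5]

# Exponentiating always gives a positive expected count;
# negative logs land strictly between 0 and 1.
expected_counts = [math.exp(v) for v in predicted_logs]

for log_value, count in zip(predicted_logs, expected_counts):
    print(f"log count = {log_value:5.1f}  ->  expected count = {count:.3f}")
```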

An example

A Poisson regression is used to describe whether a subject's zinc levels are predictive of the number of episodes of pneumonia experienced during a one-year period by nursing home residents. There is no reason to expect the response to be linear in zinc levels. Rather, it is thought that the critical factor is whether or not a subject is deficient (levels below 70 ug/dL). For the purposes of this exercise, suppose it is decided to regress the number of cases of pneumonia a resident has in a given year on age, sex, BMI, and whether zinc levels were below 70 ug/dL at the start of the study. The resulting equation is

log count = -5.2314 + 0.6531 znLow - 0.0085 Age - 0.0327 BMI + 0.4507 male

where znLow = 1 if zinc levels were below 70 ug/dL and 0 otherwise, while male = 1 if the subject is male and 0 if female.
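To see how the equation turns predictor values into an expected count, here is a small Python sketch. The coefficients are copied from the equation above; the resident's characteristics (age 80, BMI 25, male, low zinc) and the function names are hypothetical, chosen only for illustration.

```python
import math

# Coefficients copied from the fitted equation above.
COEF = {"intercept": -5.2314, "znLow": 0.6531, "age": -0.0085,
        "bmi": -0.0327, "male": 0.4507}

def log_expected_count(zn_low, age, bmi, male):
    """Linear predictor: the (natural) log of the expected count."""
    return (COEF["intercept"] + COEF["znLow"] * zn_low + COEF["age"] * age
            + COEF["bmi"] * bmi + COEF["male"] * male)

def expected_count(zn_low, age, bmi, male):
    """Exponentiate the linear predictor to get back to the count scale."""
    return math.exp(log_expected_count(zn_low, age, bmi, male))

# Hypothetical resident (not from the study data): 80-year-old man, BMI 25, low zinc.
eta = log_expected_count(zn_low=1, age=80, bmi=25, male=1)
print(f"log expected count = {eta:.4f}")
print(f"expected count     = {math.exp(eta):.4f}")
```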

Since the purpose of the study was to assess the relationship between zinc and the number of pneumonias, we focus on the coefficient for znLow. Because its P value is 0.0003 and the sign of the coefficient is positive, we conclude that the number of pneumonias a subject experiences is related to zinc status and that those with low levels are expected to have more pneumonias.

Interpreting the Coefficients of a Poisson Regression Equation

Like any other regression coefficient, a Poisson regression coefficient represents the change in response corresponding to a one unit difference in the corresponding predictor. Here, the response is the (natural) log of the expected count.

In general, when Xi=x

log count|Xi=x = b0 + b1 X1 + ... + bi x + ... + bp Xp
while when Xi=x+1
log count|Xi=x+1 = b0 + b1 X1 + ... + bi (x+1) + ... + bp Xp

Subtracting the first equation from the second gives

(log count|Xi=x+1)-(log count|Xi=x) = bi
and exponentiating both sides gives

(count|Xi=x+1)/(count|Xi=x) = exp(bi).

Thus, we see that the exponentiated Poisson regression coefficient is a rate ratio corresponding to a one unit difference in the predictor.

Returning to the pneumonia example, the coefficient for low zinc status is 0.6531. Its anti-log is 1.92. We therefore conclude that those with low zinc status have 1.92 times as many pneumonias per year as those with normal zinc status.
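The algebra can be checked numerically. In this Python sketch, other_terms is a made-up stand-in for the combined contribution of the intercept and the remaining predictors; whatever value it takes, the ratio of expected counts for znLow = 1 versus znLow = 0 comes out as exp(0.6531).

```python
import math

B_ZNLOW = 0.6531  # coefficient for znLow from the fitted equation

def log_count(zn_low, other_terms):
    # other_terms is a hypothetical placeholder for b0 plus the
    # contributions of all predictors other than znLow.
    return other_terms + B_ZNLOW * zn_low

rate_ratio = math.exp(B_ZNLOW)

for other_terms in (-6.0, -2.5, 0.3):
    ratio = math.exp(log_count(1, other_terms)) / math.exp(log_count(0, other_terms))
    print(f"other terms = {other_terms:5.1f}: ratio of expected counts = {ratio:.4f}")

print(f"exp(0.6531) = {rate_ratio:.2f}")
```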

Exposure Time (Time on Study)

In many such studies, not everyone is observed for the same length of time. So, rather than model counts, we would rather model counts per unit time. Poisson regression programs allow for this through something called an offset. An offset is a variable that is forced to have a regression coefficient of 1. Here, it contains the (natural) logarithm of the time on study. Then, the fitted equation will be

log count = b0 + b1 X1 + ... + bp Xp + log time
log(count) - log(time)= b0 + b1 X1 + ... + bp Xp
log(count/time)= b0 + b1 X1 + ... + bp Xp
log rate = b0 + b1 X1 + ... + bp Xp
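To make the role of the offset concrete, here is a minimal Python sketch of the simplest possible case: a model with an intercept only, fitted by Newton-Raphson, with log(time) entering as an offset. The data and the function name are invented for illustration. In this special case the fitted rate exp(b0) should equal total events divided by total time on study.

```python
import math

# Hypothetical data: events per subject and years each subject was observed.
counts = [0, 1, 0, 2, 1, 0]
years = [1.0, 0.5, 1.0, 1.0, 0.75, 0.25]
offsets = [math.log(t) for t in years]  # offset = log(time on study)

def fit_intercept_with_offset(y, offset, iterations=50, tol=1e-12):
    """Newton-Raphson for log E[y_i] = b0 + offset_i (offset coefficient fixed at 1)."""
    b0 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + o) for o in offset]         # expected counts
        score = sum(yi - mi for yi, mi in zip(y, mu))   # first derivative of log-likelihood
        information = sum(mu)                           # minus the second derivative
        step = score / information
        b0 += step
        if abs(step) < tol:
            break
    return b0

b0 = fit_intercept_with_offset(counts, offsets)
print(f"fitted rate exp(b0)       = {math.exp(b0):.4f} events per year")
print(f"total events / total time = {sum(counts) / sum(years):.4f}")
```

With predictors in the model the same idea holds: the offset moves log(time) to the left side of the equation, so the coefficients describe the log rate rather than the log count.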

Overdispersion

The Poisson distribution is a perfectly fine distribution. It can be thought of as a binomial distribution where the probability of a "success" grows very small while the number of "trials" grows very large in such a way that the expected number of successes stays fixed. For example, the probability that a randomly selected telephone is involved in a call is very small and the number of phones is huge in such a way that there is a finite random number of active calls at any moment. The probability that any automobile is involved in an accident is very small, but there are so many cars that there are accidents every day.
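The limiting relationship can be seen numerically. This Python sketch holds the expected number of successes fixed at 2 (an arbitrary choice) while the number of trials grows, and reports the largest gap between the binomial and Poisson probabilities for counts 0 through 10. The function names are my own.

```python
import math

def binomial_pmf(k, n, p):
    """P(k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mean):
    """P(count = k) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def largest_difference(n, mean, k_max=10):
    """Largest absolute pmf gap over k = 0..k_max, with p = mean / n."""
    p = mean / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mean))
               for k in range(k_max + 1))

MEAN = 2.0  # expected number of successes, held fixed as n grows
for n in (10, 100, 10_000):
    print(f"n = {n:6d}, p = {MEAN / n:.4f}: "
          f"largest pmf difference = {largest_difference(n, MEAN):.6f}")
```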

As nice as the Poisson distribution is, it is often the case (more often than not, for me) that data are more variable (overdispersed) than the Poisson distribution predicts. The typical Poisson regression program reports some statistics that indicate when overdispersion might be a problem. These are the deviance and Pearson statistics and their degrees of freedom.

The dispersion is the deviance (or Pearson) statistic divided by its degrees of freedom. If there is no overdispersion, the ratio will be close to 1. Joseph Hilbe, in his book Negative Binomial Regression (2007, Cambridge University Press, p 73) reports, "Some statisticians have used the deviance dispersion as the basis for scaling standard errors. However...simulation studies indicate that the Pearson dispersion better captures the excess variability."
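For readers who want to see what the program is computing, here is a Python sketch of the two statistics using the standard Poisson formulas. The counts are made up, and the "model" is the simplest one possible (intercept only), for which the fitted value is just the sample mean and the residual degrees of freedom are n - 1.

```python
import math

def pearson_statistic(observed, fitted):
    """Sum of squared (observed - fitted), each scaled by the Poisson variance (= fitted)."""
    return sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))

def deviance_statistic(observed, fitted):
    """Poisson deviance; the y*log(y/mu) term is taken as 0 when y = 0."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2.0 * total

# Hypothetical counts, deliberately more spread out than a Poisson would be.
observed = [0, 0, 1, 4, 0, 2, 7, 2]
mean = sum(observed) / len(observed)
fitted = [mean] * len(observed)      # intercept-only fit
residual_df = len(observed) - 1      # n minus one estimated parameter

pearson = pearson_statistic(observed, fitted)
deviance = deviance_statistic(observed, fitted)
print(f"Pearson  = {pearson:.2f}, dispersion = {pearson / residual_df:.2f}")
print(f"deviance = {deviance:.2f}, dispersion = {deviance / residual_df:.2f}")
```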

Negative binomial regression is typically used when there are signs of overdispersion in Poisson regression. Negative binomial regression uses a different probability model which allows for more variability in the data.
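One way to picture the extra variability is sketched below in Python. It assumes the common "NB2" form of the negative binomial, in which the variance is mean + alpha * mean^2; the text above does not say which parameterization a given program uses, and the alpha values here are arbitrary. With alpha = 0 the variance reduces to the Poisson variance, which equals the mean.

```python
def poisson_variance(mean):
    """For a Poisson distribution the variance equals the mean."""
    return mean

def negative_binomial_variance(mean, alpha):
    """NB2 parameterization (an assumption here): variance = mean + alpha * mean**2."""
    return mean + alpha * mean**2

for mean in (0.5, 2.0, 5.0):
    for alpha in (0.0, 0.5, 1.0):   # arbitrary illustrative dispersion parameters
        ratio = negative_binomial_variance(mean, alpha) / poisson_variance(mean)
        print(f"mean = {mean:3.1f}, alpha = {alpha:3.1f}: "
              f"NB variance / Poisson variance = {ratio:.2f}")
```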

Applying Poisson regression to the pneumonia data yields a deviance statistic of 411.7 and a Pearson statistic of 666.0, both with 410 degrees of freedom. The dispersions (ratios) are 1.004 and 1.624, respectively. While the deviance statistic does not suggest a problem, the Pearson statistic shows strong evidence of overdispersion. The coefficient for znLow is 0.6531 with an SE of 0.1823 and a P value of 0.0003. It leads to a rate ratio of exp(0.6531)=1.92 and a 95% CI of (1.34, 2.75).

Applying negative binomial regression to the pneumonia data yields a deviance statistic of 256.0 and a Pearson statistic of 438.7, both with 410 degrees of freedom. The dispersions (ratios) are 0.624 and 1.070, respectively. Now the deviance statistic suggests underdispersion. However, the Pearson statistic suggests that the fit is adequate. The coefficient for znLow barely changes--it is now 0.6306--but its SE and P value have increased to 0.2385 and 0.008. It leads to a rate ratio of exp(0.6306)=1.88 (not much different from the earlier 1.92). However, the 95% CI, (1.18, 3.00), is wider, reflecting the 30% increase in the SE of the regression coefficient.
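The dispersions, rate ratios, and confidence intervals quoted for the two fits can be reproduced from the reported statistics, coefficients, and standard errors. The Python sketch below assumes the intervals are Wald intervals (coefficient plus or minus 1.96 standard errors, then exponentiated); the text does not say how the program computed them, but this assumption matches the reported figures.

```python
import math

Z_95 = 1.959964   # 97.5th percentile of the standard normal distribution
RESIDUAL_DF = 410

# Figures as reported in the text for the two fitted models.
fits = {
    "Poisson": {"coef": 0.6531, "se": 0.1823, "deviance": 411.7, "pearson": 666.0},
    "negative binomial": {"coef": 0.6306, "se": 0.2385, "deviance": 256.0, "pearson": 438.7},
}

def rate_ratio_with_ci(coef, se, z=Z_95):
    """Rate ratio and Wald interval: exponentiate coef and coef +/- z*se."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)

for name, fit in fits.items():
    rr, lower, upper = rate_ratio_with_ci(fit["coef"], fit["se"])
    print(f"{name}: deviance/df = {fit['deviance'] / RESIDUAL_DF:.3f}, "
          f"Pearson/df = {fit['pearson'] / RESIDUAL_DF:.3f}, "
          f"rate ratio = {rr:.2f} ({lower:.2f}, {upper:.2f})")

se_ratio = fits["negative binomial"]["se"] / fits["Poisson"]["se"]
print(f"SE ratio (negative binomial / Poisson) = {se_ratio:.2f}")
```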

In other words, the effect of overdispersion is to say, "Your point estimates are accurate but they are not as precise as you think they are."
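The same message comes through with the scaling approach mentioned in the Hilbe quotation. The Python sketch below assumes the usual version of that adjustment, in which the Poisson standard error is multiplied by the square root of the Pearson dispersion while the coefficient is left alone. This is not the analysis reported above, only a rough cross-check against the negative binomial standard error of 0.2385.

```python
import math

poisson_se = 0.1823          # SE of the znLow coefficient from the Poisson fit
pearson_statistic = 666.0    # Pearson statistic from the Poisson fit
residual_df = 410
negative_binomial_se = 0.2385  # reported SE from the negative binomial fit

# Assumed adjustment: inflate the SE by sqrt(Pearson dispersion).
pearson_dispersion = pearson_statistic / residual_df
scaled_se = poisson_se * math.sqrt(pearson_dispersion)

print(f"Pearson dispersion     = {pearson_dispersion:.3f}")
print(f"scaled Poisson SE      = {scaled_se:.4f}")
print(f"negative binomial SE   = {negative_binomial_se:.4f}")
```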

It is theoretically possible for data to exhibit underdispersion relative to the Poisson distribution. Just as failure to adjust for overdispersion would lead to underestimating the variability in our estimates, failing to adjust for underdispersion would cause us to think that our estimates are less precise than they truly are. However, I have yet to see underdispersion relative to the Poisson model in my own practice.

It can't be said often enough!

Once again, virtually any sin that can be committed with least squares regression can be committed with Poisson and negative binomial regression. These include stepwise procedures and arriving at a final model by looking at the data. All of the warnings and recommendations made for least squares regression apply to Poisson and negative binomial regression, as well.


Copyright © 2008 Gerard E. Dallal