
Logistic Regression
Gerard E. Dallal, Ph.D.

Prologue
(feel free to skip it,
but I can't suppress the urge to write it!)

From the statistician's technical standpoint, logistic regression is very different from linear least-squares regression. The underlying mathematics is different and the computational details are different. Unlike a linear least-squares regression equation, which can be solved explicitly--that is, there is a formula for it--logistic regression equations are solved iteratively. A trial equation is fitted and tweaked over and over in order to improve the fit. Iterations stop when the improvement from one step to the next is suitably small.
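For the curious, here is a rough Python sketch of one standard iterative scheme for fitting a logistic regression (Newton-Raphson, often called iteratively reweighted least squares). The function name, the made-up data, and the stopping tolerance are all invented for illustration; it is not meant to reproduce any particular package's output.

import numpy as np

def fit_logistic(X, y, tol=1e-8, max_iter=25):
    # Newton-Raphson / IRLS: refit and tweak the trial coefficients
    # until the improvement from one step to the next is suitably small.
    X = np.column_stack([np.ones(len(y)), X])   # add an intercept column
    beta = np.zeros(X.shape[1])                 # trial equation: all coefficients zero
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # predicted probabilities from the trial fit
        w = p * (1.0 - p)                       # weights based on the current fit
        step = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (y - p))
        beta += step                            # tweak the trial equation
        if np.max(np.abs(step)) < tol:          # improvement suitably small? then stop
            break
    return beta

# Made-up data: 200 subjects whose ages drive a 0/1 outcome
rng = np.random.default_rng(0)
age = rng.uniform(45, 85, 200)
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-(-4.0 + 0.05 * age)))).astype(float)
print(fit_logistic(age, y))                     # estimated intercept and age coefficient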

Also, there are statistical arguments that lead to linear least squares regression. Among other situations, linear least squares regression is the thing to do when one asks for the best way to estimate the response from the predictor variables when they all have a joint multivariate normal distribution. There is no similar argument for logistic regression. In practice it often works, but there's nothing that says it has to.

Logistic Regression

From a practical standpoint, logistic regression and least squares regression are almost identical. Both methods produce prediction equations. In both cases the regression coefficients measure the predictive capability of the independent variables.

The response variable that characterizes logistic regression is what makes it special. With linear least squares regression, the response variable is a quantitative variable. With logistic regression, the response variable is an indicator of some characteristic, that is, a 0/1 variable. Logistic regression is used to determine whether other measurements are related to the presence of some characteristic--for example, whether certain blood measures are predictive of having a disease. If analysis of covariance can be said to be a t test adjusted for other variables, then logistic regression can be thought of as a chi-square test for homogeneity of proportions adjusted for other variables.

While the response variable in a logistic regression is a 0/1 variable, the logistic regression equation, which is a linear equation, does not predict the 0/1 variable itself. In fact, before the development of logistic regression in the 1970s, this is what was done under the name of discriminant analysis. A multiple linear least squares regression was fitted with a 0/1 variable as a response. The method fell out of favor because the discriminant function was not easy to interpret. The significance of the regression coefficients could be used to claim specific independent variables had predictive capability, but the coefficients themselves did not have a simple interpretation. In practice, a cutoff prediction value was determined. A case was classified as a 1 or a 0 depending on whether its predicted value exceeded the cutoff. The predicted value could not be interpreted as a probability because it could be less than 0 or greater than 1.

Instead of classifying an observation into one group or the other, logistic regression predicts the probability that an indicator variable is equal to 1. To be precise, the logistic regression equation does not directly predict the probability that the indicator is equal to 1. It predicts the log odds that an observation will have an indicator equal to 1. The odds of an event are defined as the ratio of the probability that the event occurs to the probability that it fails to occur. Thus,

Odds(indicator=1) = Pr(indicator=1) / [1 - Pr(indicator=1)]
or
Odds(indicator=1) = Pr(indicator=1) / Pr(indicator=0)

The log odds is just the (natural) logarithm of the odds.

Probabilities are constrained to lie between 0 and 1, with 1/2 as a neutral value for which both outcomes are equally likely. The constraints at 0 and 1 make it impossible to construct a linear equation for predicting probabilities.

Odds lie between 0 and +∞, with 1 as a neutral value for which both outcomes are equally likely. Odds are asymmetric. When the roles of the two outcomes are switched, each value in the range 0 to 1 is transformed by taking its inverse (1/value) to a value in the range 1 to +∞. For example, if the odds of having a low birthweight baby are 1/4, the odds of not having a low birthweight baby are 4/1.

Log odds are symmetric. They lie in the range -∞ to +∞. The value for which both outcomes are equally likely is 0. When the roles of the two outcomes are switched, the log odds are multiplied by -1, since log(a/b) = -log(b/a). For example, if the log odds of having a low birthweight baby are -1.39, the log odds of not having a low birthweight baby are 1.39.

Those new to log odds can take comfort in knowing that as the probability of something increases, the odds and log odds increase, too. Talking about the behavior of the log odds of an event is qualitatively the same thing as talking about the behavior of the probability of the event.
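For readers who like to see numbers, here is a brief Python illustration (the probabilities are chosen arbitrarily). The probability, odds, and log odds move together, with 1/2, 1, and 0 as their respective neutral values:

import math

for p in (0.2, 0.5, 0.8):
    odds = p / (1 - p)            # Pr(indicator=1) / Pr(indicator=0)
    print(p, round(odds, 2), round(math.log(odds), 2))
# 0.2 -> odds 0.25, log odds -1.39
# 0.5 -> odds 1.00, log odds  0.00
# 0.8 -> odds 4.00, log odds  1.39

Notice the symmetry described above: switching a probability of 0.2 for 0.8 inverts the odds (0.25 becomes 4) and merely changes the sign of the log odds.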

Because log odds take on any value between -∞ and +∞, the coefficients from a logistic regression equation can be interpreted in the usual way, namely, they represent the change in log odds of the response per unit change in the predictor.

Some detail...

Suppose we've fitted the logistic regression equation to a group of postmenopausal women, where Y=1 if a subject is osteoporotic and 0 otherwise, with the result

log odds (Y=1) = -4.353 + 0.038 age
or
log [Pr(osteo)/Pr(no osteo)] = -4.353 + 0.038 age

Since the coefficient for AGE is positive, the log odds (and, therefore, the probability) of osteoporosis increases with age. Taking anti-logarithms of both sides gives

Pr(osteo)/Pr(no osteo) = exp(-4.353 + 0.038 age)

With a little manipulation, it becomes

Pr(osteo) = exp(-4.353 + 0.038 age) / [1 + exp(-4.353 + 0.038 age)]
or
Pr(osteo) = 1 / {1 + exp[-(-4.353 + 0.038 age)]}

This is an example of the general result that if

log odds (Y=1) = b0 + b1 X1 + ... + bk Xk

then

Pr(Y=1) = exp(b0 + b1 X1 + ... + bk Xk) / [1 + exp(b0 + b1 X1 + ... + bk Xk)]

or

Pr(Y=1) = 1 / {1 + exp[-(b0 + b1 X1 + ... + bk Xk)]}
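A small Python sketch of this back-and-forth, using the intercept and age coefficient from the fitted equation above (the ages themselves are arbitrary):

import math

def prob_from_log_odds(log_odds):
    # invert the log odds: Pr = 1 / (1 + exp(-log odds))
    return 1.0 / (1.0 + math.exp(-log_odds))

b0, b_age = -4.353, 0.038                    # coefficients from the fitted equation
for age in (50, 65, 80):
    print(age, round(prob_from_log_odds(b0 + b_age * age), 3))
# roughly 0.079 at age 50, 0.132 at age 65, and 0.212 at age 80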

Interpreting the Coefficients of a Logistic Regression Equation

If b is the logistic regression coefficient for AGE, then exp(b) is the odds ratio corresponding to a one-unit change in age. For example, for AGE=a,

odds(osteo|AGE=a) = exp(-4.353 + 0.038 a)

while for AGE=a+1

odds(osteo|AGE=a+1) = exp(-4.353 + 0.038 (a+1))

Dividing one equation by the other gives

odds(osteo|AGE=a+1) / odds(osteo|AGE=a) = exp(-4.353 + 0.038 (a+1)) / exp(-4.353 + 0.038 a)

or

odds(osteo|AGE=a+1) / odds(osteo|AGE=a) = exp(0.038)

which equals 1.0387. Thus, the odds that an older individual has osteoporosis increase 3.87% over those of a younger individual with each year of age. For a 10 year age difference, say, the increase is exp(b)^10 [= 1.0387^10] = 1.46, or a 46% increase.
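To check the arithmetic in Python (the 0.038 coefficient is the one reported above):

import math

b_age = 0.038
print(math.exp(b_age))          # odds ratio per year of age: about 1.0387
print(math.exp(b_age) ** 10)    # per 10 years of age: about 1.46, a 46% increase
print(math.exp(10 * b_age))     # the same thing, written as exp(10*b)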

Virtually any sin that can be committed with least squares regression can be committed with logistic regression. These include stepwise procedures and arriving at a final model by looking at the data. All of the warnings and recommendations made for least squares regression apply to logistic regression as well.


Copyright © 2001 Gerard E. Dallal