
Single Factor Analysis of Variance
Gerard E. Dallal, Ph.D.

Terminology

A factor is a categorical predictor variable. Factors are composed of levels. For example, treatment is a factor with the various types of treatments comprising the levels. In the type of analyses discussed here, a subject should appear under only one level. In this case, it would mean that a subject is given only one of the many possible treatments.

While I've yet to see it stated explicitly in any textbook, it is important to be aware of two different types of factors--those where subjects are randomized to the levels and those where no randomization is involved. The same statistical methods are used for analyzing both types of factors, but the justification for the use of statistical methods differs, just as for intervention trials and observational studies. When subjects are randomized to levels, as in the case of treatments, the validity of the analysis comes from the act of randomization. When subjects are not randomized to levels, as in the case of sex or smoking status, the validity of the analysis follows either from having random samples from each level or, more likely, from having used an enrollment procedure that is believed to treat all levels the same. For example, a door-to-door study of adults with and without children in primary school conducted in the early afternoon is likely to produce very different results from what would be obtained in the early evening.

The terms Single Factor Analysis of Variance, Single Factor ANOVA, One Way Analysis of Variance, and One Way ANOVA are used interchangeably to describe the situation where a continuous response is being described in terms of a single factor composed of two or more levels (categories). It is a generalization of Student's t test for independent samples to situations with more than two groups.

I have sometimes been guilty of putting a hyphen in single-factor analysis of variance. This was prompted by a reviewer who confused the analysis of variance with another statistical technique, factor analysis, and asked why we had failed to report the results of the single "factor analysis" of variance!

Notation

[For years I tried to teach the principles of analysis of variance by avoiding as much reference to mathematical models as possible. However, I've learned that it can't be done. Analysis of variance, which is a special case of multiple linear regression, is all about model fitting! Without a basic understanding of the underlying models, many of the basic principles, such as

• the meaning of main effects in the presence of interactions and
• why main effects must be approached carefully, if at all, in the presence of interactions
will always seem mysterious. However, they are trivial when viewed in the context of the models being fitted.]
• Let there be g groups.
• Let $y_{ij}$ be the value for the j-th subject in the i-th group, where $i = 1, \ldots, g$ and $j = 1, \ldots, n_i$. That is, the number of subjects in group i is $n_i$.
• Let $N = \sum_{i=1}^{g} n_i$.

Means are denoted by putting a dot in place of the subscripts over which the means are calculated. The mean for the i-th group is denoted $\bar{y}_{i\cdot}$ and the overall mean is denoted $\bar{y}_{\cdot\cdot}$.
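The notation can be made concrete with a small sketch. The data and group names below are hypothetical, chosen only to show which quantity each symbol refers to.

```python
# Hypothetical data: g = 3 groups with unequal sizes n_i.
groups = {
    "A": [4.0, 6.0, 5.0],
    "B": [7.0, 9.0],
    "C": [1.0, 2.0, 3.0, 2.0],
}

g = len(groups)                                    # number of groups
n = {name: len(ys) for name, ys in groups.items()} # n_i, subjects per group
N = sum(n.values())                                # N, total number of subjects

# Group means: the dot replaces the subscript j that was averaged over.
group_means = {name: sum(ys) / len(ys) for name, ys in groups.items()}

# Overall mean: both subscripts are averaged over, so both become dots.
overall_mean = sum(sum(ys) for ys in groups.values()) / N

print("g =", g, "n_i =", n, "N =", N)
print("group means:", group_means)
print("overall mean:", overall_mean)
```

Note that the overall mean averages over all N subjects, so it is not in general the same as the average of the group means.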

The Model

The model for one way ANOVA can be written simply as

$Y_{ij} = \mu_i + \varepsilon_{ij}$

where $Y_{ij}$ is the response for the j-th subject in the i-th group, $\mu_i$ is the mean of the i-th group, and $\varepsilon_{ij}$ is a random error associated with the j-th subject in the i-th group. The model usually specifies the errors to be independent, normally distributed, and with constant variance.
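One way to get a feel for the model is to generate data from it. The sketch below assumes hypothetical group means, group sizes, and error standard deviation; only the structure (group mean plus independent normal error with constant variance) comes from the text.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

mu = [10.0, 12.0, 15.0]  # hypothetical group means, one per group
n = [5, 8, 4]            # hypothetical group sizes
sigma = 2.0              # hypothetical common error standard deviation

# Response = group mean + independent normal error with constant variance.
y = [
    [mu_i + random.gauss(0.0, sigma) for _ in range(n_i)]
    for mu_i, n_i in zip(mu, n)
]

for i, ys in enumerate(y, start=1):
    print(f"group {i}: n = {len(ys)}, sample mean = {sum(ys) / len(ys):.2f}")
```

The sample means scatter around the group means used to generate the data, and the scatter shrinks as the group sizes grow.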

While this model is fine for one way ANOVA, it is usually written in a different way that generalizes more easily when there is more than one factor in the model.

$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$

where
• $Y_{ij}$ is the response for the j-th subject in the i-th group,
• $\mu$ is an overall effect and
• $\alpha_i$ is the effect of the i-th group.
One problem with this model is that there are more parameters than groups. Some constraint must be placed on the parameters so they can be estimated. This is easily seen with just two groups. The predicted value for group 1 is $\mu + \alpha_1$ while for group 2 it is $\mu + \alpha_2$. Three parameters, $\mu$, $\alpha_1$, and $\alpha_2$, are being used to model two values, so there are many ways the parameters can be chosen.
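The two-group case can be checked directly. In this sketch the function name `predictions` and all numbers are hypothetical; it shows three different parameter choices that produce exactly the same predicted values, which is why a constraint is needed.

```python
def predictions(mu, alpha):
    """Predicted value for each group: overall effect plus group effect."""
    return [mu + a for a in alpha]

# Three different parameter choices (hypothetical numbers).
fit_a = predictions(mu=0.0, alpha=[10.0, 14.0])
fit_b = predictions(mu=12.0, alpha=[-2.0, 2.0])
fit_c = predictions(mu=14.0, alpha=[-4.0, 0.0])

# All three describe the same two group values, so the data alone
# cannot distinguish between them.
print(fit_a, fit_b, fit_c)
```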

The interpretation of the model's parameters depends on the constraint that is placed upon them. Let there be g groups. If $\alpha_g$ is set to 0, as many software packages do, then $\mu$ estimates the mean of group g and $\alpha_i$ estimates the mean difference between groups i and g.

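The so-called usual constraint has the parameters sum to 0, that is, $\sum_i \alpha_i = 0$. In the case of two groups, $\alpha_1 = -\alpha_2$. In this case, $\mu$ is the simple mean of the group means, that is, $\mu = \frac{1}{g}\sum_i \mu_i$, which is estimated by $\frac{1}{g}\sum_i \bar{y}_{i\cdot}$.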

The constraint $\sum_i n_i \alpha_i = 0$ is also worth noting because then $\mu$ estimates the overall mean, $\bar{y}_{\cdot\cdot}$, while $\alpha_i$ estimates the difference between the mean of the i-th group and the overall mean.
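The three constraints can be compared side by side. This is a minimal sketch working from hypothetical group means and sizes; the function name `effects` and the constraint labels are illustrative, not taken from any software package.

```python
def effects(group_means, sizes, constraint):
    """Return (mu, alphas) so that mu + alpha_i equals each group mean."""
    g = len(group_means)
    if constraint == "reference":   # last group's effect is set to 0
        mu = group_means[-1]
    elif constraint == "sum":       # effects sum to 0
        mu = sum(group_means) / g
    elif constraint == "weighted":  # size-weighted effects sum to 0
        mu = sum(n * m for n, m in zip(sizes, group_means)) / sum(sizes)
    else:
        raise ValueError(f"unknown constraint: {constraint}")
    return mu, [m - mu for m in group_means]

group_means = [5.0, 8.0, 2.0]  # hypothetical group means
sizes = [3, 2, 4]              # hypothetical group sizes

for c in ("reference", "sum", "weighted"):
    mu, alphas = effects(group_means, sizes, c)
    fitted = [mu + a for a in alphas]
    print(c, round(mu, 3), [round(a, 3) for a in alphas], fitted)
```

The fitted group values are the same under every constraint (up to floating-point rounding); only the meaning of the overall effect and the group effects changes.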

The simple mean of the group means, $\frac{1}{g}\sum_i \bar{y}_{i\cdot}$, looks odd when first encountered but is often more useful than the overall mean. Suppose in order to do background work on a proposed exercise study we take a random cross-section of people who exercise. We classify them according to their form of exercise and measure their blood pressure. The overall mean estimates the mean blood pressure in the population of exercisers. However, there may be many more joggers than anything else and relatively few weightlifters. The overall mean would then be weighted toward the effect of jogging. On the other hand, the mean of the joggers--no matter how few or many--is our best estimate of the mean blood pressure in the population of joggers. The mean of the weightlifters--no matter how few or many--is our best estimate of the mean blood pressure in the population of weightlifters, and similarly for all of the other forms of exercise. The simple mean of the group means represents the mean of the different types of exercise, and each $\alpha_i$ estimates the difference between the i-th form of exercise and this mean. This seems like a more compelling measure of the effect of a particular form of exercise.
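The contrast between the two means is easy to see with numbers. The group sizes and blood pressures below are invented purely for illustration and are not real data.

```python
# Hypothetical group sizes and mean systolic blood pressures (mmHg).
sizes = {"jogging": 80, "swimming": 15, "weightlifting": 5}
means = {"jogging": 118.0, "swimming": 121.0, "weightlifting": 130.0}

def overall_mean(sizes, means):
    """Mean over all subjects: each group mean weighted by its size."""
    return sum(sizes[k] * means[k] for k in means) / sum(sizes.values())

def simple_mean(means):
    """Unweighted mean of the group means: every group counts equally."""
    return sum(means.values()) / len(means)

print("overall mean:", overall_mean(sizes, means))
print("simple mean of group means:", simple_mean(means))

# Effect of each form of exercise relative to the simple mean.
effects = {k: m - simple_mean(means) for k, m in means.items()}
print("effects:", effects)

# With fewer joggers the overall mean shifts; the simple mean does not.
fewer_joggers = dict(sizes, jogging=20)
print("overall mean, fewer joggers:", overall_mean(fewer_joggers, means))
```

With these numbers the overall mean sits close to the joggers' mean because joggers dominate the sample, while the simple mean of the group means is unaffected by how many people were sampled from each form of exercise.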

Still, after all the notation has been introduced and formulas have been written, single factor analysis of variance is nothing more than a generalization of Student's t test for independent samples to allow for more than two groups. The new wrinkles involve having multiple comparisons to make and multiple testing to perform since there are now more than two groups to compare. The two immediate questions we face are

1. how do we decide whether there are any differences among the groups, that is, how do we test the hypothesis (stated in three equivalent forms)
• H0: all population means are equal
• $H_0: \mu_1 = \cdots = \mu_g$
• $H_0: \alpha_1 = \cdots = \alpha_g = 0$
and
2. if there are differences, how do we decide which groups are different from which other groups?
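The claim that single factor ANOVA generalizes Student's t test can be checked numerically. The sketch below uses the standard one way ANOVA F statistic (between-group mean square over within-group mean square), which is not defined in this section, and hypothetical data. With two groups, F equals the square of the pooled-variance t statistic.

```python
def one_way_f(groups):
    """Between-group mean square divided by within-group mean square."""
    g = len(groups)
    N = sum(len(ys) for ys in groups)
    grand = sum(sum(ys) for ys in groups) / N
    means = [sum(ys) / len(ys) for ys in groups]
    ss_between = sum(len(ys) * (m - grand) ** 2 for ys, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for ys, m in zip(groups, means) for y in ys)
    return (ss_between / (g - 1)) / (ss_within / (N - g))

def pooled_t(a, b):
    """Student's t statistic for two independent samples, pooled variance."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    ss = sum((y - mean_a) ** 2 for y in a) + sum((y - mean_b) ** 2 for y in b)
    pooled_var = ss / (len(a) + len(b) - 2)
    return (mean_a - mean_b) / (pooled_var * (1 / len(a) + 1 / len(b))) ** 0.5

# Hypothetical data for two groups.
a = [4.0, 6.0, 5.0, 7.0]
b = [7.0, 9.0, 8.0]

f_stat = one_way_f([a, b])
t_stat = pooled_t(a, b)
print("F =", f_stat, " t^2 =", t_stat ** 2)
```

The same `one_way_f` function accepts any number of groups, which is the sense in which the analysis extends the two-sample comparison.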
