**Single Factor Analysis of Variance**
**Gerard E. Dallal, Ph.D.**

A **factor** is a categorical predictor variable. Factors are
composed of **levels**. For example, **treatment** is a factor with
the various types of treatments comprising the levels. In the type of
analyses discussed here, a subject should appear under only one level. In
this case, it would mean that a subject is given only one of the many
possible treatments.

While I've yet to see it stated explicitly in any textbook, it is important to be aware of two different types of factors--those where subjects are randomized to the levels and those where no randomization is involved. The same statistical methods are used for analyzing both types of factors, but the justification for the use of statistical methods differs, just as for intervention trials and observational studies. When subjects are randomized to levels, as in the case of treatments, the validity of the analysis comes from the act of randomization. When subjects are not randomized to levels, as in the case of sex or smoking status, the validity of the analysis follows either from having random samples from each level or, more likely, from having used an enrollment procedure that is believed to treat all levels the same. For example, a door-to-door study of adults with and without children in primary school conducted in the early afternoon is likely to produce very different results from what would be obtained in the early evening.

The terms **Single Factor Analysis of Variance**, **Single Factor
ANOVA**, **One Way Analysis of Variance**, and **One Way ANOVA**
are used interchangeably to describe the situation where a continuous
response is being described in terms of a single factor composed of two
or more levels (categories). It is a generalization of Student's t test
for independent samples to situations with more than two groups.
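That claim can be checked directly: with only two groups, the one-way ANOVA F statistic is exactly the square of the pooled two-sample t statistic. A minimal numpy sketch, using made-up numbers:

```python
import numpy as np

# Two made-up samples; with g = 2 the one-way ANOVA F statistic
# equals the square of the pooled two-sample t statistic.
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0, 7.0])
n1, n2 = a.size, b.size

# Pooled-variance t statistic
sp2 = (((a - a.mean())**2).sum() + ((b - b.mean())**2).sum()) / (n1 + n2 - 2)
t = (a.mean() - b.mean()) / np.sqrt(sp2 * (1/n1 + 1/n2))

# One-way ANOVA F statistic for the same two groups
grand = np.concatenate([a, b]).mean()
ss_between = n1 * (a.mean() - grand)**2 + n2 * (b.mean() - grand)**2
ss_within = ((a - a.mean())**2).sum() + ((b - b.mean())**2).sum()
F = (ss_between / 1) / (ss_within / (n1 + n2 - 2))  # F = t**2
```

For these particular numbers both calculations give 15.0; any two samples would show the same agreement.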

I have sometimes been guilty of putting a hyphen in *single-factor
analysis of variance*. This was prompted by a reviewer who confused
the analysis of variance with another statistical technique, factor
analysis, and asked why we had failed to report the results of the single
"factor analysis" of variance!

[For years I tried to teach the principles of analysis of variance by avoiding as much reference to mathematical models as possible. However, I've learned that it can't be done. Analysis of variance, which is a special case of multiple linear regression, is all about model fitting! Without a basic understanding of the underlying models, many of the basic principles cannot be appreciated, such as

- the meaning of main effects in the presence of interactions and
- why main effects must be approached carefully, if at all, in the presence of interactions.]

- Let there be *g* groups.
- Let y_{ij} be the value for the j-th subject in the i-th group, where i = 1,..,g and j = 1,..,n_{i}. That is, the number of subjects in group *i* is n_{i}.
- Let N = Σn_{i} be the total number of subjects.

Means are denoted by putting a dot in place of the subscripts over which the means are calculated. The mean for the i-th group is denoted ȳ_{i.}, and the overall mean is denoted ȳ_{..}.
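The notation is easy to mirror in code. A small numpy sketch, with made-up measurements for g = 3 groups of unequal size:

```python
import numpy as np

# Hypothetical measurements for g = 3 groups (made-up numbers, unequal n_i).
groups = [
    np.array([120.0, 125.0, 118.0, 131.0]),          # group 1, n_1 = 4
    np.array([110.0, 115.0, 113.0]),                 # group 2, n_2 = 3
    np.array([135.0, 128.0, 130.0, 133.0, 129.0]),   # group 3, n_3 = 5
]

n = [y.size for y in groups]                      # group sizes n_i
N = sum(n)                                        # total sample size N = sum of n_i
group_means = [y.mean() for y in groups]          # ybar_i. (dot replaces j)
overall_mean = sum(y.sum() for y in groups) / N   # ybar_.. (dots replace i and j)
```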

The model for one way ANOVA can be written simply as

Y_{ij} = μ_{i} + ε_{ij}

where Y_{ij} is the response for the j-th subject in the i-th group, μ_{i} is the population mean of the i-th group, and the ε_{ij} are independent random errors with mean 0.

While this model is fine for one way ANOVA, it is usually written in a different way that generalizes more easily when there is more than one factor in the model:

Y_{ij} = μ + α_{i} + ε_{ij}

where

- Y_{ij} is the response for the j-th subject in the i-th group,
- μ is an overall effect, and
- α_{i} is the effect of the i-th group.

The interpretation of the model's parameters depends on the constraint
that is placed upon them. Let there be *g* groups. If α_{g} is set to 0, as many software packages
do, then μ estimates the mean of group
*g* and α_{i} estimates the
mean difference between groups *i* and *g*.
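A minimal numpy sketch of that reference-group parameterization, using made-up group means:

```python
import numpy as np

# Illustrative group means ybar_i. for g = 3 groups (made-up numbers).
group_means = np.array([123.5, 112.7, 131.0])

# Under the constraint alpha_g = 0 (reference-group coding), the
# least-squares estimates are mu-hat = ybar_g. and
# alpha_i-hat = ybar_i. - ybar_g.
mu_hat = group_means[-1]           # estimates the mean of group g
alpha_hat = group_means - mu_hat   # alpha_g is forced to 0

# mu + alpha_i reproduces each group mean exactly
fitted = mu_hat + alpha_hat
```

Whatever constraint is chosen, the fitted group means μ + α_{i} are the same; only the bookkeeping of the parameters changes.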

The so-called *usual constraint* has the parameters sum to 0,
that is, Σα_{i} = 0. In the case of
two groups, α_{1} = -α_{2}. In this case, μ is the simple mean of the group means, that is,
μ = (μ_{1} + .. + μ_{g})/g.

The constraint Σn_{i}α_{i} = 0 is also worth noting because then μ estimates the overall mean,
ȳ_{..}, while α_{i} estimates the difference between the mean of the
i-th group and the overall mean.

The simple mean of the group means, μ, looks odd when first encountered but is often more useful than the overall mean. Suppose in order to do background work on a proposed exercise study we take a random cross-section of people who exercise. We classify them according to their form of exercise and measure their blood pressure. The overall mean estimates the mean blood pressure in the population of exercisers. However, there may be many more joggers than anything else and relatively few weightlifters. The overall mean would then be weighted toward the effect of jogging. On the other hand, the mean of the joggers--no matter how few or many--is our best estimate of the mean blood pressure in the population of joggers. The mean of the weightlifters--no matter how few or many--is our best estimate of the mean blood pressure in the population of weightlifters, and similarly for all of the other forms of exercise. The simple mean of the group means represents the mean of the different types of exercise, and the α_{i}'s estimate the difference between the i-th form of exercise and this mean. This seems like a more compelling measure of the effect of a particular form of exercise.
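The contrast between the two constraints is easy to see numerically. A small sketch with made-up blood pressures, many joggers, and few weightlifters:

```python
import numpy as np

# Made-up exerciser blood pressures: many joggers, few weightlifters.
joggers       = np.full(50, 118.0)
weightlifters = np.full(5, 130.0)

n = np.array([joggers.size, weightlifters.size])
group_means = np.array([joggers.mean(), weightlifters.mean()])

overall_mean = (n * group_means).sum() / n.sum()  # weighted toward joggers
simple_mean  = group_means.mean()                 # treats each activity equally

# Sum-to-zero constraint: effects are measured from the simple mean of means
alpha_sum_zero = group_means - simple_mean        # these sum to 0

# Weighted constraint (sum of n_i * alpha_i = 0): effects are measured
# from the overall mean
alpha_weighted = group_means - overall_mean       # n-weighted sum is 0
```

Here the overall mean (about 119.1) sits close to the joggers simply because there are so many of them, while the simple mean of the group means (124.0) gives each form of exercise equal standing.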

Still, after all the notation has been introduced and formulas have been written, single factor analysis of variance is nothing more than a generalization of Student's t test for independent samples to allow for more than two groups. The new wrinkles involve having multiple comparisons to make and multiple testing to perform since there are now more than two groups to compare. The two immediate questions we face are

- how do we decide whether there are any differences among the groups, that is, how do we test the hypothesis (stated in three equivalent forms)
  - H_{0}: all population means are equal
  - H_{0}: μ_{1} = .. = μ_{g}
  - H_{0}: α_{1} = .. = α_{g} = 0
- if there are differences, how do we decide which groups are different from which other groups?
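The first question is answered by the standard one-way ANOVA F statistic, which compares between-group to within-group variability. A minimal numpy sketch with made-up data for g = 3 groups:

```python
import numpy as np

# Made-up samples for g = 3 groups.
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([6.0, 7.0, 8.0]),
]
g = len(groups)
N = sum(y.size for y in groups)
grand = np.concatenate(groups).mean()   # ybar_..

# Between-group and within-group sums of squares
ss_between = sum(y.size * (y.mean() - grand)**2 for y in groups)
ss_within = sum(((y - y.mean())**2).sum() for y in groups)

# H0 (all population means equal) is rejected for large F, judged
# against the F distribution with g-1 and N-g degrees of freedom.
F = (ss_between / (g - 1)) / (ss_within / (N - g))
```

For these numbers F = 21.0 on 2 and 6 degrees of freedom, which would lead to rejecting H_{0} at any conventional significance level.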