
### How to Read the Output From One Way Analysis of Variance

Here's a typical piece of output from a single-factor analysis of variance. The response is the two-year change in bone density of the spine (final - initial) for postmenopausal women with low daily calcium intakes (< 400 mg) assigned at random to one of three treatments: placebo, calcium carbonate, or calcium citrate malate.
```
Class         Levels    Values
GROUP              3    CC CCM P

Dependent Variable: DBMD05
                                Sum of
Source               DF        Squares    Mean Square   F Value   Pr > F
Model                 2     44.0070120     22.0035060      5.00   0.0090
Error                78    343.1110102      4.3988591
Corrected Total      80    387.1180222

R-Square     Coeff Var      Root MSE    DBMD05 Mean
0.113679     -217.3832      2.097346      -0.964815

Source               DF      Type I SS    Mean Square   F Value   Pr > F
GROUP                 2    44.00701202    22.00350601      5.00   0.0090

Source               DF    Type III SS    Mean Square   F Value   Pr > F
GROUP                 2    44.00701202    22.00350601      5.00   0.0090

                                        Standard
Parameter             Estimate             Error    t Value    Pr > |t|

Intercept         -1.520689655 B      0.38946732      -3.90      0.0002
GROUP     CC       0.075889655 B      0.57239773       0.13      0.8949
GROUP     CCM      1.597356322 B      0.56089705       2.85      0.0056
GROUP     P        0.000000000 B       .                .         .

NOTE: The X'X matrix has been found to be singular, and a generalized inverse
was used to solve the normal equations.  Terms whose estimates are
followed by the letter 'B' are not uniquely estimable.

The GLM Procedure
Least Squares Means

               DBMD05      LSMEAN
GROUP          LSMEAN      Number
CC        -1.44480000           1
CCM        0.07666667           2
P         -1.52068966           3

Least Squares Means for effect GROUP
Pr > |t| for H0: LSMean(i)=LSMean(j)

i/j              1             2             3
1                      0.0107        0.8949
2        0.0107                      0.0056
3        0.8949        0.0056

NOTE: To ensure overall protection level, only probabilities
associated with pre-planned comparisons should be used.

Least Squares Means for effect GROUP
Pr > |t| for H0: LSMean(i)=LSMean(j)

i/j              1             2             3
1                      0.0286        0.9904
2        0.0286                      0.0154
3        0.9904        0.0154
```

#### The Analysis of Variance Table

The Analysis of Variance table is just like any other ANOVA table. The Total Sum of Squares is the uncertainty that would be present if one had to predict individual responses without any other information. The best one could do is predict each observation to be equal to the overall sample mean. The ANOVA table partitions this variability into two parts. One portion is accounted for by (some say "explained by") the model. It's the reduction in uncertainty that occurs when the ANOVA model,

$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$$

is fitted to the data. The remaining portion is the uncertainty that remains even after the model is used. The model is considered to be statistically significant if it can account for a large amount of variability in the response.
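
The partition can be checked against the numbers printed in the table. The following Python sketch only re-adds the sums of squares copied from the output above; the variable names are illustrative and are not part of the SAS listing.

```python
# Sums of squares copied from the ANOVA table in the output above.
ss_model = 44.0070120
ss_error = 343.1110102
ss_total = 387.1180222  # the "Corrected Total" row

# The model and error portions should add up to the total.
partition_gap = abs((ss_model + ss_error) - ss_total)

# R-Square is the share of the total variability accounted for by the model.
r_square = ss_model / ss_total

print(f"Model + Error = {ss_model + ss_error:.7f}")
print(f"R-Square      = {r_square:.6f}")
```

The ratio agrees with the R-Square value reported in the output (0.113679), so the model accounts for roughly 11% of the variability in the response.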

Model, Error, Corrected Total, Sum of Squares, Degrees of Freedom, F Value, and Pr > F have the same meanings as for multiple regression. This is to be expected since analysis of variance is nothing more than the regression of the response on a set of indicators defined by the categorical predictor variable.

The degrees of freedom for the model is equal to one less than the number of categories. The F ratio is nothing more than the extra sum of squares principle applied to the full set of indicator variables defined by the categorical predictor variable. The F ratio and its P value are the same regardless of the particular set of indicators (the constraint placed on the $\alpha_i$) that is used.

Sums of Squares: The total amount of variability in the response can be written $\sum_i \sum_j (Y_{ij} - \bar{Y}_{..})^2$, the sum of the squared differences between each observation and the overall mean. If we were asked to make a prediction without any other information, the best we could do, in a certain sense, is the overall mean. The amount of variation in the data that can't be accounted for by this simple method of prediction is the Total Sum of Squares.

When the Analysis of Variance model is used for prediction, the best that can be done is to predict each observation to be equal to its group's mean. The amount of uncertainty that remains is the sum of the squared differences between each observation and its group's mean, $\sum_i \sum_j (Y_{ij} - \bar{Y}_{i.})^2$. This is the Error sum of squares. The difference between the Total sum of squares and the Error sum of squares is the Model Sum of Squares, which happens to be equal to $\sum_i n_i (\bar{Y}_{i.} - \bar{Y}_{..})^2$. In this output the Model Sum of Squares also appears as the GROUP sum of squares, because GROUP is the only term in the model.
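
The three sums of squares can be computed straight from their definitions. The sketch below uses a small made-up data set (the study's raw data are not given here, so the numbers and group labels are purely hypothetical) and confirms that the Model sum of squares, computed from the group means, equals Total minus Error.

```python
# Hypothetical responses for three groups (made-up numbers, not the study data).
groups = {
    "A": [1.0, 2.0, 3.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [5.0, 7.0],
}

all_values = [y for ys in groups.values() for y in ys]
grand_mean = sum(all_values) / len(all_values)
group_means = {g: sum(ys) / len(ys) for g, ys in groups.items()}

# Total SS: squared distance of every observation from the overall mean.
ss_total = sum((y - grand_mean) ** 2 for y in all_values)

# Error SS: squared distance of every observation from its own group mean.
ss_error = sum((y - group_means[g]) ** 2 for g, ys in groups.items() for y in ys)

# Model SS from its own formula: n_i * (group mean - overall mean)^2, summed over groups.
ss_model = sum(len(ys) * (group_means[g] - grand_mean) ** 2 for g, ys in groups.items())

print(f"Total {ss_total:.4f} = Error {ss_error:.4f} + Model {ss_model:.4f}")
```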

Each sum of squares has corresponding degrees of freedom (DF) associated with it. Total df is one less than the number of observations, N-1. The Model df is one less than the number of levels, g-1. The Error df is the difference between the Total df (N-1) and the Model df (g-1), that is, N-g. Another way to calculate the error degrees of freedom is by summing up the error degrees of freedom from each group, $n_i - 1$, over all g groups.

The Mean Squares are the Sums of Squares divided by the corresponding degrees of freedom.
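
As a quick check of the bookkeeping, the degrees of freedom and mean squares in the output can be rebuilt from N and g. In this sketch N = 81 is inferred from the Corrected Total DF of 80, and the variable names are illustrative.

```python
n_obs = 81      # N: Corrected Total DF (80) plus 1
n_groups = 3    # g: number of levels of GROUP

df_total = n_obs - 1
df_model = n_groups - 1
df_error = df_total - df_model   # the same as N - g

# Mean Square = Sum of Squares / degrees of freedom (sums of squares from the output).
ms_model = 44.0070120 / df_model
ms_error = 343.1110102 / df_error

print(df_total, df_model, df_error)
print(f"{ms_model:.7f} {ms_error:.7f}")
```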

The F Value or F ratio is the test statistic used to decide whether the sample means are within sampling variability of each other. That is, it tests the hypothesis $H_0: \mu_1 = \cdots = \mu_g$. This is the same thing as asking whether the model as a whole has statistically significant predictive capability in the regression framework. F is the ratio of the Model Mean Square to the Error Mean Square. Under the null hypothesis that the model has no predictive capability--that is, that all of the population means are equal--the F statistic follows an F distribution with p numerator degrees of freedom and n-p-1 denominator degrees of freedom, where p = g-1 is the number of indicator variables. In ANOVA notation these are g-1 and N-g (here, 2 and 78). The null hypothesis is rejected if the F ratio is large. This statistic and P value might be ignored depending on the primary research question and whether a multiple comparisons procedure is used. (See the discussion of multiple comparison procedures.)
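
The F ratio and its P value in the output can be reproduced by hand. In general the P value requires the F distribution from a statistics library. The sketch below instead relies on a closed form that holds only when the numerator degrees of freedom equal 2, as they do here: P(F > f) = (1 + 2f/d)^(-d/2), with d the denominator degrees of freedom. That shortcut is a convenience for this example, not how SAS computes the value.

```python
# Mean squares and degrees of freedom copied from the ANOVA table above.
ms_model, df_model = 22.0035060, 2
ms_error, df_error = 4.3988591, 78

f_ratio = ms_model / ms_error

# Closed-form upper tail of the F distribution, valid ONLY for 2 numerator df.
assert df_model == 2
p_value = (1 + 2.0 * f_ratio / df_error) ** (-df_error / 2)

print(f"F = {f_ratio:.2f}, Pr > F = {p_value:.4f}")
```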

The Root Mean Square Error (also known as the standard error of the estimate), labeled Root MSE in the output, is the square root of the Error (residual) Mean Square. It estimates the common within-group standard deviation.
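
For this output the calculation is a one-liner; the sketch below takes the Error Mean Square from the ANOVA table and recovers the Root MSE value.

```python
import math

ms_error = 4.3988591              # Error Mean Square from the ANOVA table
root_mse = math.sqrt(ms_error)    # estimate of the common within-group SD

print(f"Root MSE = {root_mse:.6f}")
```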

#### Parameter Estimates

The parameter estimates from a single-factor analysis of variance might best be ignored. Different statistical program packages fit different parametrizations of the one-way ANOVA model to the data. SYSTAT, for example, uses the usual constraint where $\sum \alpha_i = 0$. SAS, on the other hand, sets $\alpha_g$ to 0. Any version of the model can be used for prediction, but care must be taken with significance tests involving individual terms in the model to make sure they correspond to hypotheses of interest. In the SAS output above, the Intercept tests whether the mean change in bone density in the Placebo group is 0 (which is, after all, to be expected) while the coefficients for CC and CCM test whether those means are different from placebo. It is usually safer to test hypotheses directly by using whatever facilities the software provides than by taking a chance on the proper interpretation of the model parametrization the software might have implemented. The possibility of many different parametrizations is the subject of the warning that "Terms whose estimates are followed by the letter 'B' are not uniquely estimable."
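
To see why the individual estimates depend on the parametrization while the predictions do not, the sketch below rebuilds two sets of coefficients from the group means in the Least Squares Means table. The sum-to-zero version assumes the constraint is applied to the unweighted effects, which may differ in detail from what any particular package does.

```python
# Group means of DBMD05 copied from the Least Squares Means table above.
means = {"CC": -1.44480000, "CCM": 0.07666667, "P": -1.52068966}

# Reference-cell coding (as in the SAS output): the last level, P, is the baseline.
intercept_ref = means["P"]
effects_ref = {g: m - means["P"] for g, m in means.items()}

# Sum-to-zero coding (assumed unweighted): effects are deviations from the
# simple average of the three group means.
intercept_sum = sum(means.values()) / len(means)
effects_sum = {g: m - intercept_sum for g, m in means.items()}

# Either version reproduces the same fitted value (the group mean) for every group.
fitted_ref = {g: intercept_ref + effects_ref[g] for g in means}
fitted_sum = {g: intercept_sum + effects_sum[g] for g in means}

for g in means:
    print(g, round(effects_ref[g], 6), round(effects_sum[g], 6), round(fitted_ref[g], 6))
```

The reference-cell effects match the CC and CCM estimates in the output (0.0759 and 1.5974), while the sum-to-zero effects are different numbers that answer a different question: how far each group is from the average of the group means.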

After the parameter estimates come two examples of multiple comparisons procedures, which are used to determine which groups are different given that they are not all the same. These methods are discussed in detail in the note on multiple comparison procedures. The two methods presented here are Fisher's Least Significant Differences and Tukey's Honestly Significant Differences. Fisher's Least Significant Differences is essentially all possible t tests. It differs only in that the estimate of the common within-group standard deviation is obtained by pooling information from all of the levels of the factor and not just the two being compared at the moment. The values in the first matrix of P values (Fisher's LSD) comparing groups 1&3 and 2&3 are identical to the values for the CC and CCM parameters in the model.
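
The pooled t statistics behind Fisher's LSD can be sketched as follows. The group sizes are not printed in the output; the values 25, 27 and 29 used here are an assumption, back-calculated from the reported standard errors and N = 81, and the helper `lsd_t` is a hypothetical name.

```python
import math

root_mse = 2.097346  # pooled within-group SD (Root MSE) from the output
means = {"CC": -1.44480000, "CCM": 0.07666667, "P": -1.52068966}
# ASSUMPTION: group sizes inferred from the reported standard errors, not given in the source.
sizes = {"CC": 25, "CCM": 27, "P": 29}

def lsd_t(a, b):
    """t statistic for (mean of a) - (mean of b), using the pooled Root MSE."""
    se = root_mse * math.sqrt(1 / sizes[a] + 1 / sizes[b])
    return (means[a] - means[b]) / se

t_cc_p = lsd_t("CC", "P")
t_ccm_p = lsd_t("CCM", "P")
t_cc_ccm = lsd_t("CC", "CCM")

print(round(t_cc_p, 2), round(t_ccm_p, 2), round(t_cc_ccm, 2))
```

Under that assumption the first two statistics agree with the t Values reported for the CC and CCM parameters (0.13 and 2.85), which is why the corresponding P values in the LSD matrix match the parameter table.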
