Categorical Variables With More Than Two Categories

The Mechanics of Categorical Variables
With More Than Two Categories
Gerard E. Dallal, Ph.D.

Categorical variables with only two categories can be included in a multiple regression equation without introducing complications. As already noted, such a predictor specifies a regression surface composed of two parallel hyperplanes. The sign of the regression coefficient determines which plane lies above the other, while its magnitude determines the distance between them.

When a categorical variable containing more than two categories is placed in a regression model, the coding places specific constraints on the estimated effects. This can be seen by generalizing the regression model for the t test to three groups. Consider the simple linear regression model

Y = b₀ + b₁ X

where X is a categorical predictor taking on the values 1, 2, or 3, but the numbers merely label categories, such as country, diet, drug, or type of fertilizer. The model gives the fitted values

Y = b₀ + b₁ for the first category
Y = b₀ + 2 b₁ for the second category
Y = b₀ + 3 b₁ for the third category

The model forces a specific ordering on the predicted values. The predicted value for the second category must be exactly halfway between the predicted values for the first and third categories. However, category labels are usually chosen arbitrarily. There is no reason why the group with the middle code can't be the one with the largest or smallest mean value. If the goal is to decide whether the categories are different, a model that treats a categorical variable as though its numerical codes were really numbers is the wrong model.
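
The constraint can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the data are made up for illustration, with the category coded 2 deliberately given the largest mean. Whatever the data, the fitted value for code 2 lands at the midpoint of the fitted values for codes 1 and 3.

```python
import numpy as np

# Made-up responses for three categories coded 1, 2, 3 (illustrative only).
# The category coded 2 deliberately has the largest observed mean.
x = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3], dtype=float)
y = np.array([4.0, 5.0, 6.0, 9.0, 10.0, 11.0, 5.0, 6.0, 7.0])

# Least-squares fit of Y = b0 + b1 * X, treating the codes as real numbers.
# np.polyfit returns the coefficients from highest degree to lowest.
b1, b0 = np.polyfit(x, y, 1)

fitted = {code: b0 + b1 * code for code in (1, 2, 3)}
group_means = {code: y[x == code].mean() for code in (1, 2, 3)}

for code in (1, 2, 3):
    print(f"category {code}: fitted = {fitted[code]:.3f}, "
          f"observed mean = {group_means[code]:.3f}")

# The fitted value for code 2 is the midpoint of those for codes 1 and 3.
midpoint = (fitted[1] + fitted[3]) / 2
print(f"midpoint of fitted values for codes 1 and 3 = {midpoint:.3f}")
```

In this made-up example the straight line cannot reproduce the three observed means, because the middle category has the largest mean while its fitted value is pinned between the other two.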

One way to decide whether g categories are not all the same is to create a set of g-1 indicator variables. Arbitrarily choose g-1 categories and, for each category, define one of the indicator variables to be 1 if the observation is from that category and 0 otherwise. For example, suppose X takes on the values A, B, or C. Create the variables X₁ and X₂, where X₁ = 1 if the categorical variable is A (and 0 otherwise) and X₂ = 1 if the categorical variable is B (and 0 otherwise), as in

  X   X1   X2
  A    1    0
  B    0    1
  A    1    0
  C    0    0

  and so on...
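
The coding can be sketched in a few lines of plain Python. The variable names (`indicator_categories`, `rows`) are illustrative choices, not part of any library, and the four observations are the ones shown in the table.

```python
# Category labels for the four observations shown in the table.
x = ["A", "B", "A", "C"]

# Choose g - 1 = 2 of the g = 3 categories to receive indicators.
# C gets no indicator of its own, so it is coded 0 on both variables.
indicator_categories = ["A", "B"]

rows = []
for label in x:
    # Each indicator is 1 if the observation is from that category, else 0.
    indicators = [1 if label == cat else 0 for cat in indicator_categories]
    rows.append((label, *indicators))

print("X  X1  X2")
for label, x1, x2 in rows:
    print(f"{label}   {x1}   {x2}")
```

Choosing a different pair of categories changes the individual columns but not the set of group distinctions the indicators can represent.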

The regression model is now

Y = b₀ + b₁ X₁ + b₂ X₂

and the predicted values are

Group A: Y = b₀ + b₁·1 + b₂·0 = b₀ + b₁
Group B: Y = b₀ + b₁·0 + b₂·1 = b₀ + b₂
Group C: Y = b₀ + b₁·0 + b₂·0 = b₀
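
With this coding the model has one free parameter per group, so the least-squares predicted values are the group means: b₀ is the mean of group C, and b₁ and b₂ are the differences between the means of groups A and B and the mean of group C. The sketch below illustrates this, assuming NumPy is available and using made-up data.

```python
import numpy as np

# Made-up data for three groups (illustrative only).
labels = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])
y = np.array([4.0, 5.0, 6.0, 9.0, 10.0, 11.0, 5.0, 6.0, 7.0])

# Indicator variables: X1 flags group A, X2 flags group B.
x1 = (labels == "A").astype(float)
x2 = (labels == "B").astype(float)

# Design matrix with a column of ones for the intercept b0.
design = np.column_stack([np.ones_like(y), x1, x2])
(b0, b1, b2), *_ = np.linalg.lstsq(design, y, rcond=None)

means = {g: y[labels == g].mean() for g in ("A", "B", "C")}

print(f"Group A: b0 + b1 = {b0 + b1:.3f}, observed mean = {means['A']:.3f}")
print(f"Group B: b0 + b2 = {b0 + b2:.3f}, observed mean = {means['B']:.3f}")
print(f"Group C: b0      = {b0:.3f}, observed mean = {means['C']:.3f}")
```

Unlike the model that treats the codes 1, 2, 3 as numbers, this fit places no ordering or spacing restriction on the three predicted values.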

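The hypothesis of no differences between groups can be tested by applying the extra sum of squares principle to the set (X₁,X₂), that is, by comparing the fit of the model containing both indicators with the fit of the model containing neither. This is what ANalysis Of VAriance (ANOVA) routines do automatically.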
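
The equivalence can be illustrated with a short sketch, assuming NumPy is available and using made-up data. The helper `residual_ss` is a hypothetical name introduced here. The F ratio is computed two ways: from the reduction in residual sum of squares when X₁ and X₂ are added to an intercept-only model, and from the usual one-way ANOVA between-group and within-group sums of squares.

```python
import numpy as np

# Made-up data for three groups (illustrative only).
labels = np.array(["A", "A", "A", "B", "B", "B", "C", "C", "C"])
y = np.array([4.0, 5.0, 6.0, 9.0, 10.0, 11.0, 5.0, 6.0, 7.0])


def residual_ss(design, response):
    """Residual sum of squares from a least-squares fit (hypothetical helper)."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    return float(resid @ resid)


n = len(y)
ones = np.ones(n)
x1 = (labels == "A").astype(float)
x2 = (labels == "B").astype(float)

# Reduced model: intercept only. Full model: intercept plus X1 and X2.
rss_reduced = residual_ss(ones.reshape(-1, 1), y)
rss_full = residual_ss(np.column_stack([ones, x1, x2]), y)

extra_df = 2       # number of indicators added, g - 1
resid_df = n - 3   # observations minus coefficients in the full model
f_regression = ((rss_reduced - rss_full) / extra_df) / (rss_full / resid_df)

# One-way ANOVA F ratio computed directly from the group means.
groups = [y[labels == g] for g in ("A", "B", "C")]
grand_mean = y.mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
f_anova = (ss_between / (len(groups) - 1)) / (ss_within / (n - len(groups)))

print(f"F from extra sum of squares: {f_regression:.4f}")
print(f"F from one-way ANOVA:        {f_anova:.4f}")
```

The two F ratios agree because the extra sum of squares for (X₁,X₂) is the between-group sum of squares, and the residual sum of squares of the full model is the within-group sum of squares.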