**Contingency Tables**

Gerard E. Dallal, Ph.D.

A **contingency table** is a table of counts. A two-dimensional
contingency table is formed by classifying subjects by two variables.
One variable determines the row categories; the other variable defines
the column categories. The combinations of row and column categories are
called *cells*. Examples include classifying subjects by sex
(male/female) and smoking status (current/former/never) or by "type of
prenatal care" and "whether the birth required a neonatal ICU" (yes/no).
For the mathematician, a two-dimensional contingency table with *r*
rows and *c* columns is the set {x_{ij}: i=1,...,r;
j=1,...,c}.

In order to use the statistical methods usually applied to such
tables, each subject must fall into one and only one row category and
one and only one column category. Such categories are said to be **exclusive** and
**exhaustive**. **Exclusive** means the categories don't overlap,
so a subject falls into only one category. **Exhaustive** means that
the categories include all possibilities, so there's a category for
everyone. Often, categories can be made exhaustive by creating a
catch-all such as "Other" or by changing the definition of those being
studied to include only the available categories.

Also, the observations must be independent. This can be a problem when, for example, families are studied, because members of the same family are more similar than individuals from different families. The analysis of such data is beyond the current scope of these notes.

Textbooks often devote a chapter or two to the comparison of two proportions (the percentage of high school males and females with eating disorders, for example) by using techniques similar to those for comparing two means. However, two proportions can be represented by a 2-by-2 contingency table in which one of the classification variables defines the groups (male/female) and the other is the presence or absence of the characteristic (eating disorder), so standard contingency table analyses can be used instead.

When plots are made from two continuous variables where one is an obvious response to the other (for example, cholesterol level as a response to saturated fat intake), standard practice is to put the response (cholesterol) on the vertical (Y) axis and the carrier (fat intake) on the horizontal (X) axis. For tables of counts, it is becoming common practice for the row categories to specify the populations or groups and the column categories to specify the responses. For example, in studying the association between smoking and disease, the row categories would be the categories of smoking status while the columns would denote the presence or absence of disease. This is in keeping with A.S.C. Ehrenberg's observation that it is easier to make a visual comparison of values in the same column than in the same row. Consider

                    Disease                          Disease
                   Yes    No                        Yes    No
    Smoke  Yes      13    37        Smoke  Yes      26%   74%   100%
           No        6   144               No        4%   96%   100%

                    (A)                              (B)

(In table A the entries are counts; in table B the entries are percentages within each row.)

The 26% and 4% are easy to compare because they are lined up in the same column.
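As a minimal sketch (pure Python; the layout of tables A and B is taken from the text above), the row percentages in table B are just each count divided by its row total:

```python
# Counts from table A: rows are smoking status, columns are disease yes/no.
table = {"Smoke Yes": (13, 37), "Smoke No": (6, 144)}

# Table B: percentages within each row (count / row total).
for row_label, (yes, no) in table.items():
    row_total = yes + no
    print(f"{row_label}: {100 * yes / row_total:.0f}% / {100 * no / row_total:.0f}%")
```

This reproduces the 26%/74% and 4%/96% rows of table B.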

There are many ways to generate tables of counts. Three of the most common sampling schemes are

**Unrestricted (Poisson) sampling:** Collect data until the sun
sets, the money runs out, fatigue sets in,...

**Sampling with the grand total fixed (multinomial
sampling):** Collect data on a predetermined number of
individuals and classify them according to the two
classification variables.

**Sampling with one set of marginal totals fixed (compound
multinomial sampling):** Collect data on a predetermined number of
individuals from each category of one of the variables and classify them
according to the other variable. This approach is useful when some of the
categories are rare and might not be adequately represented if the
sampling were unrestricted or only the grand total were fixed. For
example, suppose you wished to assess the association between tobacco use
and a rare disease. It would be better to take fixed numbers of subjects
with and without the disease and examine them for tobacco use. If you
sampled a large number of individuals and classified them with respect to
smoking and disease, there might be too few subjects with the disease to
draw any meaningful conclusions^{*}.

Each sampling scheme results in a table of counts. It is impossible to determine which sampling scheme was used merely by looking at the data. Yet, the sampling scheme is important because some things easily estimated from one scheme are impossible to estimate from the others. The more that is specified by the sampling scheme, the fewer things that can be estimated from the data. For example, consider the 2 by 2 table

                       Eating Disorder
                        Yes      No
    College: Public
             Private

If sampling occurs with only the grand total fixed, then any population proportion of interest can be estimated. For example, we can estimate the population proportion of individuals with eating disorders, the proportion attending public colleges, the proportion who attend public colleges and do not have an eating disorder, and so on.

Suppose, due to the rarity of eating disorders, 50 individuals with eating disorders and 50 individuals without eating disorders are studied. Many population proportions can no longer be estimated from the data. It's hardly surprising that we can't estimate the proportion of the population with eating disorders: the proportion with eating disorders in our sample will be 50%, not because 50% of the population have eating disorders but because we specifically chose a sample in which 50% have eating disorders.

Is it as obvious that we cannot estimate the proportion of the population that attends private colleges? We cannot if there is an association between eating disorder and type of college. Suppose students with eating disorders are more likely to attend private colleges than those without eating disorders. Then, the proportion of students attending a private college in the combined sample will change according to the way the sampling scheme fixes the proportions of students with and without an eating disorder.

Even though the sampling scheme affects what we can estimate, all three sampling schemes use the same test statistic and reference distribution to decide whether there is an association between the row and column variables. However, the name of the problem changes according to the sampling scheme.

When the sampling is unrestricted or when only the grand total is
fixed, the hypothesis of no association is called **independence** (of
the row and column variables)--the probability of falling into a
particular column is independent of the row. It does not change with the
row a subject is in. Also, the probability of falling into a particular
row does not depend on the column the subject is in.

If the row and column variables are independent, the probability of
falling into a particular *cell* is the product of *the
probability of being in a particular row* and *the probability of
being in a particular column*. For example, if 2/5 of the population
attends private colleges and, independently, 1/10 of the population has
an eating disorder, then 1/10 of the 2/5 of the population that attends
private colleges should suffer from eating disorders, that is, 2/50 (=
1/10 × 2/5) attend private college and
suffer from eating disorders.
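The arithmetic can be checked exactly with Python's `fractions` module (a sketch; the 2/5 and 1/10 figures are the illustrative values from the text):

```python
from fractions import Fraction

# Illustrative figures from the text: 2/5 attend private colleges and,
# independently, 1/10 have an eating disorder.
p_private = Fraction(2, 5)
p_disorder = Fraction(1, 10)

# Under independence, the cell probability is the product.
p_both = p_private * p_disorder
print(p_both)  # 1/25, the same as 2/50
```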

When one set of marginal totals--the rows, say--is fixed
by the sampling scheme, the hypothesis of no association is
called **homogeneity of proportions**. It says the proportion
of individuals in a particular column is the same for all rows.

The *chi-square statistic*, X^{2}, is used to test both null
hypotheses ("independence" and "homogeneity of proportions"). It is also
known as the *goodness-of-fit statistic* or *Pearson's
goodness-of-fit statistic*. The test is known as the *chi-square
test* or the *goodness-of-fit test*.

Let the observed cell counts be denoted by {x_{ij}:
i=1,...,r; j=1,...,c} and the expected cell counts under a model of
independence or homogeneity of proportions be denoted by {e_{ij}:
i=1,...,r; j=1,...,c}.
The test statistic is

    X^{2} = Σ (x_{ij} - e_{ij})^{2} / e_{ij}    (summed over all cells),

where the expected cell counts are given by

    e_{ij} = (row i total) × (column j total) / (grand total).

The derivation of the expression for expected values is straightforward. Consider the cell at row 1, column 1. Under a null hypothesis of homogeneity of proportions, say, both rows have the same probability that an observation falls in column 1. The best estimate of this common probability is

    (column 1 total) / (grand total).

Then, the expected number of observations in the cell at row 1, column 1 is the number of observations in the first row (row 1 total) multiplied by this probability, that is,

    e_{11} = (row 1 total) × (column 1 total) / (grand total).
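A sketch of the expected-count computation in Python (the smoking/disease counts from the earlier table are reused for illustration):

```python
def expected_counts(table):
    # e_ij = (row i total) * (column j total) / (grand total)
    r_tot = [sum(row) for row in table]
    c_tot = [sum(col) for col in zip(*table)]
    n = sum(r_tot)
    return [[ri * cj / n for cj in c_tot] for ri in r_tot]

# Smoking/disease counts from the earlier table.
print(expected_counts([[13, 37], [6, 144]]))
# [[4.75, 45.25], [14.25, 135.75]]
```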

In the chi-square statistic, the square of the difference between observed and expected cell counts is divided by the expected cell count. This is because probability theory shows that cells with large expected counts vary more than cells with small expected cell counts. Hence, a difference in a cell with a larger expected cell count should be downweighted to account for this.

The chi-square statistic is compared to the percentiles of a
chi-square distribution. The chi-square distributions are like the t
distributions in that there are many of them, indexed by their degrees of
freedom. For the goodness-of-fit statistics, the degrees of freedom
equal the product of (the number of rows - 1) and (the number of columns
- 1), or **(r-1)(c-1)**. When there are two rows and two columns, the
degrees of freedom is 1. Any disagreement between the observed and
expected values increases the value of the chi-square statistic,
because the test statistic is a sum of squared differences. The
null hypothesis of independence or homogeneity of proportions is rejected
for large values of the test statistic.
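A minimal sketch of the whole computation, again using the earlier smoking/disease counts; the tail area is computed as erfc(sqrt(x/2)), which is the chi-square upper-tail probability for one degree of freedom only:

```python
import math

def chi_square_stat(obs):
    # X^2 = sum over all cells of (observed - expected)^2 / expected,
    # with expected = row total * column total / grand total.
    r_tot = [sum(row) for row in obs]
    c_tot = [sum(col) for col in zip(*obs)]
    n = sum(r_tot)
    return sum((obs[i][j] - r_tot[i] * c_tot[j] / n) ** 2
               / (r_tot[i] * c_tot[j] / n)
               for i in range(len(obs)) for j in range(len(obs[0])))

# Smoking/disease counts from the earlier table; a 2-by-2 table has
# (2-1)(2-1) = 1 degree of freedom.
x2 = chi_square_stat([[13, 37], [6, 144]])
p = math.erfc(math.sqrt(x2 / 2))
print(round(x2, 2), p)
```

For these counts X^{2} is about 21.1, far beyond the 3.84 cutoff that corresponds to P = 0.05 with 1 degree of freedom.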

Three tests have been suggested for testing the null hypotheses of independence or homogeneity of proportions: Pearson's goodness-of-fit test, the goodness-of-fit test with Yates's continuity correction, and Fisher's exact test.

Pearson's Goodness-of-Fit Test

We have just discussed Pearson's goodness-of-fit statistic.

The way it is typically used--compared to percentiles of the
chi-square distribution with (r-1)(c-1) degrees of freedom--is based on
large sample theory. Many recommendations for what constitutes a large
sample can be found in the statistical literature. The most conservative
recommendation says all expected cell counts should be 5 or more. Cochran
recommends that **at least 80% of the expected cell counts be 5 or more
and that no expected cell count be less than 1**. For a two-by-two
table, which has only four cells, Cochran's recommendation is the same as
the "all expected cell counts should be 5 or more" rule.
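Cochran's rule is easy to check mechanically; a sketch (the two tables are the smoking/disease counts from earlier and the 2-by-2 example that appears later in these notes):

```python
def cochran_ok(obs):
    # Cochran's rule: no expected cell count below 1, and at least 80%
    # of the expected cell counts 5 or more.
    r_tot = [sum(row) for row in obs]
    c_tot = [sum(col) for col in zip(*obs)]
    n = sum(r_tot)
    e = [ri * cj / n for ri in r_tot for cj in c_tot]
    return min(e) >= 1 and sum(x >= 5 for x in e) >= 0.8 * len(e)

print(cochran_ok([[13, 37], [6, 144]]))  # False: one expected count is 4.75
print(cochran_ok([[8, 5], [3, 10]]))     # True: all expected counts are 5 or more
```

Note that for a 2-by-2 table, 3 of 4 cells is only 75%, so the 80% requirement forces all four expected counts to be 5 or more, as stated above.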

Fisher's Exact Test

Throughout the 20th century, statisticians argued over the best way to analyze contingency tables. As with other test procedures, mathematics is used to decide whether the observed contingency table is in some sense extreme. The debate, which is still not fully resolved, has to do with what set of tables to use. For example, when multinomial sampling is used, it might seem obvious that the set should include all possible tables with the same total sample size. However, today most statisticians agree that the set should include only those tables with the same row and column totals as the observed table, regardless of the sampling scheme that was used. (Mathematical statisticians refer to this as performing the test conditional on the margins, that is, the table's marginal totals.)

This procedure is known as Fisher's Exact Test. All tables with the
same row and column totals have their probability of occurrence
calculated according to a probability distribution known as *the
hypergeometric distribution*. For example, if the table

    1   3  |  4
    4   3  |  7
    ----------
    5   6

were observed, Fisher's exact test would look at the set of all tables that have row totals of (4,7) and column totals of (5,6). They are

                   0 4      1 3      2 2      3 1      4 0
                   5 2      4 3      3 4      2 5      1 6
    probability   21/462  140/462  210/462   84/462   7/462

While it would not be profitable to go into a detailed explanation of the hypergeometric distribution, it's useful to remove some of the mystery surrounding it. That's more easily done when the table has labels, so let's recast the table in the context of discrimination in the workplace. Suppose there are 11 candidates for 5 partnerships in a law firm. The results of the selection are

                Partner
               Yes   No
    Female      1     3
    Male        4     3

Five partners were selected out of 11 candidates--4 of 7 men, but only 1 of 4 women.

The hypergeometric distribution models the partnership process this
way. Imagine a box with 11 slips of paper, one for each candidate.
*Male* is written on 7 of them while *female* is written on the
other 4. If the partnership process is sex-blind, the number of men and
women among the new partners should be similar to what would result from
drawing 5 slips at random from the box. The hypergeometric distribution
gives the probability of drawing specific numbers of males and females
when 5 slips are drawn at random from a box containing 7 slips marked
*male* and 4 slips marked *female*. Those are the values in
the line above labeled "probability".
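The slips-in-a-box model translates directly into binomial coefficients; a sketch using `math.comb`:

```python
from math import comb

total = comb(11, 5)  # 462 equally likely ways to draw 5 slips from 11
# Ways to draw exactly k of the 4 "female" slips (and 5-k of the 7 "male" slips).
ways = {k: comb(4, k) * comb(7, 5 - k) for k in range(5)}

print(total, ways)  # 462 {0: 21, 1: 140, 2: 210, 3: 84, 4: 7}
```

Dividing each count by 462 gives the probabilities in the line labeled "probability".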

The calculation of a one-tailed P value begins by ordering the set of all tables with the same margins (according to the value of the cell in the upper right-hand corner, say). The probability of observing each table is calculated by using the hypergeometric distribution. Then the probabilities are summed from each end of the list to the observed table. The smaller sum is the one-tailed P value. In this example, the two sums are 21/462+140/462 (=161/462) and 7/462+84/462+210/462+140/462 (=441/462), so the one-tailed P value is 161/462. Yates (1984) argues that a two-tailed P value should be obtained by doubling the one-tailed P value, but most statisticians would compute the two-tailed P value as the sum of the probabilities, under the null hypothesis, of all tables having a probability of occurrence no greater than that of the observed table. In this case it is 21/462+140/462+84/462+7/462 (=252/462). And, yes, if the observed table had been (4,0,1,6) the one-sided and two-sided P values would be the same (=7/462).
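A sketch of the whole procedure for a 2-by-2 table (pure Python; the two-tailed rule used is the "sum of all tables no more probable than the observed one" convention described above):

```python
from math import comb

def fisher_p_values(table):
    # One- and two-tailed Fisher exact P values for a 2-by-2 table,
    # conditioning on the margins; each table's probability comes from
    # the hypergeometric distribution.
    (a, b), (c, d) = table
    n = a + b + c + d
    r1, c1 = a + b, a + c          # first row total, first column total
    denom = comb(n, r1)
    lo, hi = max(0, r1 - (n - c1)), min(r1, c1)
    probs = {k: comb(c1, k) * comb(n - c1, r1 - k) / denom
             for k in range(lo, hi + 1)}
    lower = sum(p for k, p in probs.items() if k <= a)
    upper = sum(p for k, p in probs.items() if k >= a)
    two = sum(p for p in probs.values() if p <= probs[a] * (1 + 1e-9))
    return min(lower, upper), two

one, two = fisher_p_values([[1, 3], [4, 3]])
print(one, two)  # 161/462 and 252/462, i.e. about 0.349 and 0.545
```

The small tolerance in the two-tailed sum guards against floating-point ties between tables with identical hypergeometric probabilities.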

The Yates Continuity Correction

The Yates continuity correction was designed to make the Pearson chi-square statistic agree better with Fisher's exact test when the sample size is small. The corrected goodness-of-fit statistic is

    X_{c}^{2} = Σ (|x_{ij} - e_{ij}| - 1/2)^{2} / e_{ij}    (summed over all cells).

While Pearson's goodness-of-fit test can be applied to tables with any number of rows or columns, the Yates correction applies only to 2 by 2 tables.
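A sketch of the corrected statistic (pure Python, 2-by-2 tables only; the table used for illustration is the 2-by-2 example analyzed later in these notes):

```python
def yates_chi_square(obs):
    # X_c^2 = sum over the four cells of (|observed - expected| - 1/2)^2
    # divided by expected, with expected = row total * column total / n;
    # meaningful for 2-by-2 tables only.
    r_tot = [sum(row) for row in obs]
    c_tot = [sum(col) for col in zip(*obs)]
    n = sum(r_tot)
    return sum((abs(obs[i][j] - r_tot[i] * c_tot[j] / n) - 0.5) ** 2
               / (r_tot[i] * c_tot[j] / n)
               for i in range(2) for j in range(2))

print(round(yates_chi_square([[8, 5], [3, 10]]), 2))  # 2.52
```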

There were compelling arguments for using the Yates correction when Fisher's exact test was tedious to do by hand and computer software was unavailable. Today, it is a trivial matter to write a computer program to perform Fisher's exact test for any 2 by 2 table, and there is no longer a reason to use the Yates correction.

Advances in statistical theory and computer software (Cytel
Software's StatXact, in particular, and versions of their algorithms
incorporated into major statistical packages) make it possible to use
Fisher's exact test to analyze tables larger than 2 by 2. This was
unthinkable 15 years ago. In theory, an exact test could be constructed
for any contingency table. In practice, the number of tables that have a
given set of margins is so large that the problem would be insoluble for
all but smaller sample sizes and the fastest computers. Cyrus Mehta and
Nitin Patel, then at the Harvard School of Public Health, devised what they
called a *network algorithm*, which performs Fisher's exact test on
tables larger than 2 by 2. Their technique identifies large sets of
tables whose contribution to the final tally will be negligible and
skips over them during the evaluation process, which makes it feasible,
in effect, to examine all tables when computing the P value.

At one time, I almost always used the Yates correction. Many statisticians did not, but the arguments for its use were compelling (Yates, 1984). Today, most computer programs report Fisher's exact test for every 2x2 table, so I use that. For larger tables, I follow Cochran's rule. I use the uncorrected test statistic (Pearson's) for large samples and Fisher's exact test whenever the size of the sample is called into question and available software will allow it.

Example

     8    5
     3   10

X^{2}=3.94 (P=0.047). X_{c}^{2}=2.52
(P=0.112). Fisher's exact test gives P=0.111.
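These three results can be reproduced with a short, self-contained computation (pure Python; erfc(sqrt(x/2)) is the chi-square tail area for 1 degree of freedom):

```python
from math import comb, erfc, sqrt

# The 2-by-2 example table from the text.
obs = [[8, 5], [3, 10]]
(a, b), (c, d) = obs
n = a + b + c + d
row = [a + b, c + d]
col = [a + c, b + d]
e = [[row[i] * col[j] / n for j in range(2)] for i in range(2)]

# Pearson's goodness-of-fit statistic and Yates's corrected statistic.
x2 = sum((obs[i][j] - e[i][j]) ** 2 / e[i][j]
         for i in range(2) for j in range(2))
x2_c = sum((abs(obs[i][j] - e[i][j]) - 0.5) ** 2 / e[i][j]
           for i in range(2) for j in range(2))

# Fisher's exact test: the two-tailed P value is the total probability of
# all tables with the same margins that are no more probable than the
# observed table (hypergeometric probabilities).
denom = comb(n, row[0])
probs = [comb(col[0], k) * comb(col[1], row[0] - k) / denom
         for k in range(max(0, row[0] - col[1]), min(row[0], col[0]) + 1)]
p_obs = comb(col[0], a) * comb(col[1], row[0] - a) / denom
fisher_two = sum(p for p in probs if p <= p_obs * (1 + 1e-9))

print(round(x2, 2), round(erfc(sqrt(x2 / 2)), 3))      # 3.94 0.047
print(round(x2_c, 2), round(erfc(sqrt(x2_c / 2)), 3))  # 2.52 0.112
print(round(fisher_two, 3))                            # 0.111
```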

Use Fisher's exact test whenever the software provides it. Otherwise, follow Cochran's rule: if no expected cell count is less than 1 and no more than 20% are less than 5, use the uncorrected Pearson goodness-of-fit statistic; if the sample size is called into question, use Fisher's exact test if your software can provide it.

It is straightforward mathematically to show for large samples that P values based on Pearson's goodness-of-fit test and Fisher's exact test are virtually identical. I do not recall a single case where a table satisfied Cochran's rule and the two P values differed in any manner of consequence.

------------

^{*}This is true from a statistical standpoint, but it is
overly simplistic from a practical standpoint. Case-control studies
involve sampling fixed numbers of those with and without a disease. The
cases (those with the disease) are compared to those without the disease
(controls) for the presence of some potential causal factor (exposure).
However, it is often the case that there are no sampling frames (lists
of individuals) for drawing random samples of those with and without the
disease. It has been argued that case-control studies are inherently
flawed because of biases between the case and control groups. In order
to meet this criticism, it has become common to conduct nested
case-control studies in which the cases and controls are extracted truly
at random from an identifiable group being studied over time for some
other purpose, such as Framingham or the Nurses' Health Study. While the
generalizability of nested case-control studies might be questioned, they
are internally valid because cases and controls were recruited in the
same way.