Contingency Tables
Gerard E. Dallal, Ph.D.

A contingency table is a table of counts. A two-dimensional contingency table is formed by classifying subjects by two variables. One variable determines the row categories; the other variable defines the column categories. The combinations of row and column categories are called cells. Examples include classifying subjects by sex (male/female) and smoking status (current/former/never) or by "type of prenatal care" and "whether the birth required a neonatal ICU" (yes/no). For the mathematician, a two-dimensional contingency table with r rows and c columns is the set $\{x_{ij}: i=1,\ldots,r;\ j=1,\ldots,c\}$.

In order to use the statistical methods usually applied to such tables, each subject must fall into one and only one row category and one and only one column category. Such categories are said to be exclusive and exhaustive. Exclusive means the categories don't overlap, so a subject falls into only one category. Exhaustive means that the categories include all possibilities, so there's a category for everyone. Often, categories can be made exhaustive by creating a catch-all such as "Other" or by changing the definition of those being studied to include only the available categories.

Also, the observations must be independent. This can be a problem when, for example, families are studied, because members of the same family are more similar than individuals from different families. The analysis of such data is beyond the current scope of these notes.

Textbooks often devote a chapter or two to the comparison of two proportions (the percentage of high school males and females with eating disorders, for example) by using techniques that are similar to those for comparing two means. However, two proportions can be represented by a 2-by-2 contingency table in which one of the classification variables defines the groups (male/female) and the other is the presence or absence of the characteristic (eating disorder), so standard contingency table analyses can be used instead.

When plots are made from two continuous variables where one is an obvious response to the other (for example, cholesterol level as a response to saturated fat intake), standard practice is to put the response (cholesterol) on the vertical (Y) axis and the carrier (fat intake) on the horizontal (X) axis. For tables of counts, it is becoming common practice for the row categories to specify the populations or groups and the column categories to specify the responses. For example, in studying the association between smoking and disease, the row categories would be the categories of smoking status while the columns would denote the presence or absence of disease. This is in keeping with A.S.C. Ehrenberg's observation that it is easier to make a visual comparison of values in the same column than in the same row. Consider

                             Disease     |     Disease
                            Yes    No    |    Yes    No
            Smoke  Yes       13    37    |    26%   74% | 100%
                    No        6   144    |     4%   96% | 100%
                                (A)      |       (B)

                     (In table A the entries are counts;
           in table B the entries are percentages within each row.)

The 26% and 4% are easy to compare because they are lined up in the same column.
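The conversion from table (A) to table (B) can be sketched in a few lines of Python. This is only an illustration; the dictionary layout and the function name `row_percentages` are choices made here, not part of the original notes.

```python
# Counts from table (A): rows are smoking status, columns are disease status.
counts = {
    "Smoke: Yes": [13, 37],   # [Disease Yes, Disease No]
    "Smoke: No": [6, 144],
}

def row_percentages(row):
    """Convert one row of counts to percentages of that row's total."""
    total = sum(row)
    return [100 * x / total for x in row]

percentages = {label: row_percentages(row) for label, row in counts.items()}

for label, pcts in percentages.items():
    print(label, [f"{p:.0f}%" for p in pcts])
```

Each row is divided by its own total, so each row of percentages sums to 100%, as in table (B).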

Sampling Schemes

There are many ways to generate tables of counts. Three of the most common sampling schemes are

Unrestricted (Poisson) sampling: Collect data until the sun sets, the money runs out, fatigue sets in,...

Sampling with the grand total fixed (multinomial sampling): Collect data on a predetermined number of individuals and classify them according to the two classification variables.

Sampling with one set of marginal totals fixed (compound multinomial sampling): Collect data on a predetermined number of individuals from each category of one of the variables and classify them according to the other variable. This approach is useful when some of the categories are rare and might not be adequately represented if the sampling were unrestricted or only the grand total were fixed. For example, suppose you wished to assess the association between tobacco use and a rare disease. It would be better to take fixed numbers of subjects with and without the disease and examine them for tobacco use. If you sampled a large number of individuals and classified them with respect to smoking and disease, there might be too few subjects with the disease to draw any meaningful conclusions*.

Each sampling scheme results in a table of counts. It is impossible to determine which sampling scheme was used merely by looking at the data. Yet, the sampling scheme is important because some things easily estimated from one scheme are impossible to estimate from the others. The more that is specified by the sampling scheme, the fewer things that can be estimated from the data. For example, consider the 2 by 2 table

                               Eating disorder
                                     Yes     No
               College   Public
                         Private

If sampling occurs with only the grand total fixed, then any population proportion of interest can be estimated. For example, we can estimate the population proportion of individuals with eating disorders, the proportion attending public colleges, the proportion who attend public colleges and do not have an eating disorder, and so on.

Suppose, due to the rarity of eating disorders, 50 individuals with eating disorders and 50 individuals without eating disorders are studied. Many population proportions can no longer be estimated from the data. It's hardly surprising that we can't estimate the proportion of the population with eating disorders. The proportion with eating disorders in our sample will be 50%, not because 50% of the population have eating disorders but because we specifically chose a sample in which 50% have eating disorders.

Is it as obvious that we cannot estimate the proportion of the population that attends private colleges? We cannot if there is an association between eating disorder and type of college. Suppose students with eating disorders are more likely to attend private colleges than those without eating disorders. Then, the proportion of students attending a private college in the combined sample will change according to the way the sampling scheme fixes the proportions of students with and without an eating disorder.
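A small calculation may make this concrete. All three proportions below are invented purely for illustration (they are not data from any study): assume 10% of students have an eating disorder, 60% of those with a disorder attend private colleges, and 30% of those without a disorder do.

```python
# All three numbers below are invented for illustration; none come from real data.
p_disorder = 0.10                # assumed population proportion with an eating disorder
p_private_given_disorder = 0.60  # assumed proportion in private college, with disorder
p_private_given_none = 0.30      # assumed proportion in private college, without disorder

def proportion_private(share_with_disorder):
    """Expected proportion attending private college when the given share
    of the sample has an eating disorder."""
    return (share_with_disorder * p_private_given_disorder
            + (1 - share_with_disorder) * p_private_given_none)

population_value = proportion_private(p_disorder)  # what we would like to estimate
fixed_margin_value = proportion_private(0.50)      # 50 with and 50 without the disorder

print(f"population: {population_value:.2f}, 50/50 sample: {fixed_margin_value:.2f}")
```

Under these assumed numbers the combined 50/50 sample would overstate private college attendance, and a different split (say 25/75) would give yet another answer. If the two conditional proportions were equal, that is, if there were no association, the split would not matter.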

Even though the sampling scheme affects what we can estimate, all three sampling schemes use the same test statistic and reference distribution to decide whether there is an association between the row and column variables. However, the name of the problem changes according to the sampling scheme.

When the sampling is unrestricted or when only the grand total is fixed, the hypothesis of no association is called independence (of the row and column variables)--the probability of falling into a particular column is independent of the row. It does not change with the row a subject is in. Also, the probability of falling into a particular row does not depend on the column the subject is in.

If the row and column variables are independent, the probability of falling into a particular cell is the product of the probability of being in a particular row and the probability of being in a particular column. For example, if 2/5 of the population attends private colleges and, independently, 1/10 of the population has an eating disorder, then 1/10 of the 2/5 of the population that attends private colleges should suffer from eating disorders, that is, 2/50 (= 1/10 × 2/5) of the population attend private college and suffer from eating disorders.
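The same product rule gives every cell of the table, not just one. The sketch below uses the marginal proportions from the example (2/5 private, 1/10 with an eating disorder); the complementary proportions 3/5 and 9/10 and the dictionary labels are filled in here for illustration.

```python
from fractions import Fraction

# Marginal probabilities from the example in the text, plus their complements.
p_row = {"private": Fraction(2, 5), "public": Fraction(3, 5)}
p_col = {"disorder": Fraction(1, 10), "no disorder": Fraction(9, 10)}

# Under independence, each cell probability is the product of its margins.
cell = {(r, c): p_row[r] * p_col[c] for r in p_row for c in p_col}

for key, p in cell.items():
    print(key, p)
```

Exact fractions are used so that the four cell probabilities can be seen to sum to exactly 1.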

When one set of marginal totals--the rows, say--is fixed by the sampling scheme, the hypothesis of no association is called homogeneity of proportions. It says the proportion of individuals in a particular column is the same for all rows.

The chi-square statistic, $X^2$, is used to test both null hypotheses ("independence" and "homogeneity of proportions"). It is also known as the goodness-of-fit statistic or Pearson's goodness-of-fit statistic. The test is known as the chi-square test or the goodness-of-fit test.

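Let the observed cell counts be denoted by $\{x_{ij}: i=1,\ldots,r;\ j=1,\ldots,c\}$ and the expected cell counts under a model of independence or homogeneity of proportions be denoted by $\{e_{ij}: i=1,\ldots,r;\ j=1,\ldots,c\}$. The test statistic is

$$X^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(x_{ij} - e_{ij})^2}{e_{ij}}$$

where the expected cell counts are given by

$$e_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{\text{grand total}}$$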

The derivation of the expression for expected values is straightforward. Consider the cell at row 1, column 1. Under a null hypothesis of homogeneity of proportions, say, both rows have the same probability that an observation falls in column 1. The best estimate of this common probability is

$$\frac{\text{column 1 total}}{\text{grand total}}$$

Then, the expected number of observations in the cell at row 1, column 1 is the number of observations in the first row (row 1 total) multiplied by this probability, that is,

$$e_{11} = (\text{row 1 total}) \times \frac{\text{column 1 total}}{\text{grand total}}$$

In the chi-square statistic, the square of the difference between observed and expected cell counts is divided by the expected cell count. This is because probability theory shows that cells with large expected counts vary more than cells with small expected cell counts. Hence, a difference in a cell with a larger expected cell count should be downweighted to account for this.

The chi-square statistic is compared to the percentiles of a chi-square distribution. The chi-square distributions are like the t distributions in that there are many of them, indexed by their degrees of freedom. For the goodness-of-fit statistic, the degrees of freedom equal the product of (the number of rows - 1) and (the number of columns - 1), or (r-1)(c-1). When there are two rows and two columns, there is 1 degree of freedom. Disagreement between the observed and expected values in either direction increases the chi-square statistic, because the test statistic is a sum of squared differences (each divided by its expected count). The null hypothesis of independence or homogeneity of proportions is rejected for large values of the test statistic.
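The calculation just described can be sketched with nothing but the Python standard library. The function names are made up for this illustration, and the P value helper handles only the 1 degree of freedom case (a 2 by 2 table), where the chi-square tail probability can be written with the complementary error function. Larger tables would need a general chi-square distribution function, which is not attempted here. The data are the smoking and disease counts from the table near the top of this note.

```python
import math

def expected_counts(table):
    """Expected counts e_ij = (row i total)(column j total)/(grand total)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def pearson_chi_square(table):
    """Return (X^2, degrees of freedom) for an r-by-c table of counts."""
    expected = expected_counts(table)
    statistic = sum(
        (x - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for x, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

def p_value_1df(statistic):
    """Upper-tail probability of a chi-square distribution with 1 degree of
    freedom, using the identity P(chi2_1 > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(statistic / 2))

smoking = [[13, 37], [6, 144]]   # rows: smoke yes/no; columns: disease yes/no
x2, df = pearson_chi_square(smoking)
print(expected_counts(smoking))
print(round(x2, 2), df, p_value_1df(x2))
```

Note that one expected count in this table (smokers with disease) falls a little below 5, which bears on the sample size recommendations discussed under Tests of Significance.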

Tests of Significance

Three tests have been suggested for testing the null hypotheses of independence or homogeneity of proportions: Pearson's goodness-of-fit test, the goodness-of-fit test with Yates's continuity correction, and Fisher's exact test.

Pearson's Goodness-of-Fit Test

We just discussed Pearson's goodness-of-fit statistic.

The way it is typically used--compared to percentiles of the chi-square distribution with (r-1)(c-1) degrees of freedom--is based on large sample theory. Many recommendations for what constitutes a large sample can be found in the statistical literature. The most conservative recommendation says all expected cell counts should be 5 or more. Cochran recommends that at least 80% of the expected cell counts be 5 or more and that no expected cell count be less than 1. For a two-by-two table, which has only four cells, Cochran's recommendation is the same as the "all expected cell counts should be 5 or more" rule.
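Cochran's recommendation is easy to check mechanically once the expected counts are in hand. The sketch below is one way to do it; the function name is made up here. The first set of expected counts is what the smoking and disease table from earlier in this note yields (row total times column total over grand total), and the second is an invented 2 by 5 example included only to show a table that passes despite one small cell.

```python
def satisfies_cochran(expected):
    """Cochran's rule as stated above: at least 80% of expected cell counts
    are 5 or more, and no expected cell count is less than 1."""
    cells = [e for row in expected for e in row]
    share_at_least_5 = sum(e >= 5 for e in cells) / len(cells)
    return share_at_least_5 >= 0.8 and min(cells) >= 1

# Expected counts for the smoking/disease table (margins 50, 150 and 19, 181).
smoking_expected = [[4.75, 45.25], [14.25, 135.75]]

# Invented 2-by-5 expected counts (margins 40, 120 and 12, 28, 40, 40, 40).
larger_expected = [[3.0, 7.0, 10.0, 10.0, 10.0],
                   [9.0, 21.0, 30.0, 30.0, 30.0]]

print(satisfies_cochran(smoking_expected))
print(satisfies_cochran(larger_expected))
```

With only four cells, a single expected count below 5 already leaves just 75% of the cells at 5 or more, which is why the rule collapses to "all expected counts 5 or more" for a two-by-two table.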

Fisher's Exact Test

Throughout the 20th century, statisticians argued over the best way to analyze contingency tables. As with other test procedures, mathematics is used to decide whether the observed contingency table is in some sense extreme. The debate, which is still not fully resolved, has to do with what set of tables to use. For example, when multinomial sampling is used, it might seem obvious that the set should include all possible tables with the same total sample size. However, today most statisticians agree that the set should include only those tables with the same row and column totals as the observed table, regardless of the sampling scheme that was used. (Mathematical statisticians refer to this as performing the test conditional on the margins, that is, the table's marginal totals.)

This procedure is known as Fisher's Exact Test. All tables with the same row and column totals have their probability of occurrence calculated according to a probability distribution known as the hypergeometric distribution. For example, if the table

                                     1  3 | 4
                                     4  3 | 7
                                     5  6

were observed, Fisher's exact test would look at the set of all tables that have row totals of (4,7) and column totals of (5,6). They are

              0 4    |    1 3    |    2 2    |    3 1    |    4 0
              5 2    |    4 3    |    3 4    |    2 5    |    1 6

probability  21/462     140/462     210/462      84/462      7/462

While it would not be profitable to go into a detailed explanation of the hypergeometric distribution, it's useful to remove some of the mystery surrounding it. That's more easily done when the table has labels, so let's recast the table in the context of discrimination in the workplace. Suppose there are 11 candidates for 5 partnerships in a law firm. The results of the selection are

                                   Partner
                                  Yes   No
                       Female      1     3  |  4
                       Male        4     3  |  7
                                   5     6


Five partners were selected out of 11 candidates--4 of 7 men, but only 1 of 4 women.

The hypergeometric distribution models the partnership process this way. Imagine a box with 11 slips of paper, one for each candidate. Male is written on 7 of them while female is written on the other 4. If the partnership process is sex-blind, the number of men and women among the new partners should be similar to what would result from drawing 5 slips at random from the box. The hypergeometric distribution gives the probability of drawing specific numbers of males and females when 5 slips are drawn at random from a box containing 7 slips marked male and 4 slips marked female. Those are the values in the line above labeled "probability".
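The slips-in-a-box calculation can be written directly in terms of binomial coefficients. The sketch below is a minimal illustration (the function and argument names are made up here) and requires Python 3.8 or later for `math.comb`. It counts the ways to draw k of the 4 female slips together with 5 - k of the 7 male slips, out of all ways to draw 5 of the 11 slips.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(k, n_marked, n_unmarked, n_drawn):
    """Probability that exactly k marked slips appear among n_drawn slips
    drawn at random, without replacement, from the box."""
    return Fraction(
        comb(n_marked, k) * comb(n_unmarked, n_drawn - k),
        comb(n_marked + n_unmarked, n_drawn),
    )

# 4 slips marked female, 7 marked male, 5 slips drawn (the new partners).
probabilities = [hypergeometric_pmf(k, 4, 7, 5) for k in range(5)]

for k, p in enumerate(probabilities):
    # Fractions reduce automatically, so rescale to the common denominator 462.
    print(f"{k} women among the partners: {int(p * 462)}/462")
```

The five values correspond, left to right, to the five tables listed above.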

The calculation of a one-tailed P value begins by ordering the set of all tables with the same margins (according to the value of the cell in the upper right-hand corner, say). The probability of observing each table is calculated by using the hypergeometric distribution. Then the probabilities are summed from each end of the list to the observed table. The smaller sum is the one-tailed P value. In this example, the two sums are 21/462+140/462 (=161/462) and 7/462+84/462+210/462+140/462 (=441/462), so the one-tailed P value is 161/462. Yates (1984) argues that a two-tailed P value should be obtained by doubling the one-tailed P value, but most statisticians would compute the two-tailed P value as the sum of the probabilities, under the null hypothesis, of all tables having a probability of occurrence no greater than that of the observed table. In this case it is 21/462+140/462+84/462+7/462 (=252/462). And, yes, if the observed table had been (4,0,1,6) the one-tailed and two-tailed P values would be the same (=7/462).
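The two P value recipes can be sketched as follows for a 2 by 2 table. This is an illustrative implementation, not the code of any particular statistical package; the function name and argument order are choices made here. Exact fractions are used so that the "no greater than the observed table" comparison is not upset by floating-point rounding when two tables have equal probability.

```python
from fractions import Fraction
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """One-tailed and two-tailed P values for the 2-by-2 table [[a, b], [c, d]],
    conditioning on the row and column totals."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2
    lo, hi = max(0, col1 - row2), min(row1, col1)

    def prob(k):
        # Hypergeometric probability of the table whose upper-left cell is k.
        return Fraction(comb(row1, k) * comb(row2, col1 - k), comb(total, col1))

    probs = {k: prob(k) for k in range(lo, hi + 1)}
    lower = sum(p for k, p in probs.items() if k <= a)  # from one end to observed
    upper = sum(p for k, p in probs.items() if k >= a)  # from the other end
    one_tailed = min(lower, upper)
    # Two-tailed: all tables no more probable than the observed table.
    two_tailed = sum(p for p in probs.values() if p <= probs[a])
    return one_tailed, two_tailed

print(fisher_exact_2x2(1, 3, 4, 3))
print(fisher_exact_2x2(4, 0, 1, 6))
```

For the observed table the function returns 161/462 and 252/462, which `Fraction` prints in lowest terms as 23/66 and 6/11. For the table (4,0,1,6) both values are 7/462, printed as 1/66.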

The Yates Continuity Correction

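The Yates continuity correction was designed to make the Pearson chi-square statistic have better agreement with Fisher's exact test when the sample size is small. The corrected goodness-of-fit statistic is

$$X_c^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\left(|x_{ij} - e_{ij}| - \tfrac{1}{2}\right)^2}{e_{ij}}$$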

While Pearson's goodness-of-fit test can be applied to tables with any number of rows or columns, the Yates correction applies only to 2 by 2 tables.
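As a sketch of the corrected statistic, the function below subtracts 1/2 from each absolute difference between observed and expected counts before squaring. The function name is made up for this illustration, and the example data are the smoking and disease counts from the table near the top of this note.

```python
def yates_chi_square(table):
    """Yates-corrected goodness-of-fit statistic for a 2-by-2 table of counts."""
    if len(table) != 2 or any(len(row) != 2 for row in table):
        raise ValueError("the Yates correction applies only to 2-by-2 tables")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            e = row_totals[i] * col_totals[j] / grand_total  # expected count
            statistic += (abs(table[i][j] - e) - 0.5) ** 2 / e
    return statistic

smoking = [[13, 37], [6, 144]]   # rows: smoke yes/no; columns: disease yes/no
print(round(yates_chi_square(smoking), 2))
```

In a 2 by 2 table all four cells have the same absolute difference between observed and expected counts, so the correction shrinks every term by the same amount and the corrected statistic is always smaller than the uncorrected one.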

There were compelling arguments for using the Yates correction when Fisher's exact test was tedious to do by hand and computer software was unavailable. Today, it is a trivial matter to write a computer program to perform Fisher's exact test for any 2 by 2 table, and there is no longer a reason to use the Yates correction.

Advances in statistical theory and computer software (Cytel Software's StatXact, in particular, and versions of their algorithms incorporated into major statistical packages) make it possible to use Fisher's exact test to analyze tables larger than 2 by 2. This was unthinkable 15 years ago. In theory, an exact test could be constructed for any contingency table. In practice, the number of tables that have a given set of margins is so large that the problem would be insoluble for all but smaller sample sizes and the fastest computers. Cyrus Mehta and Nitin Patel, then at the Harvard School of Public Health, devised what they called a network algorithm, which performs Fisher's exact test on tables larger than 2 by 2. Their technique identifies large sets of tables which will be negligible in the final tally and skips over them during the evaluation process. Thus, they are able to effectively examine all tables when computing their P values by identifying large sets of tables that don't have to be evaluated.
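To get a feel for how quickly the number of tables grows, the sketch below counts, by brute force, the tables of non-negative counts that share a given set of margins. This is emphatically not the network algorithm; it is the naive enumeration that the network algorithm is designed to avoid. The function name is made up here, and the code assumes at least one row and one column.

```python
def count_tables(row_totals, col_totals):
    """Count the tables of non-negative integer counts with the given margins
    by brute-force enumeration, one row at a time."""
    if sum(row_totals) != sum(col_totals):
        return 0

    def fill_row(total, caps):
        """Yield every way to split `total` across cells bounded by `caps`."""
        if len(caps) == 1:
            if total <= caps[0]:
                yield (total,)
            return
        for first in range(min(total, caps[0]) + 1):
            for rest in fill_row(total - first, caps[1:]):
                yield (first,) + rest

    def count(rows_left, cols_left):
        if len(rows_left) == 1:
            # The last row is forced: it must equal the remaining column totals.
            return 1
        return sum(
            count(rows_left[1:], tuple(c - x for c, x in zip(cols_left, row)))
            for row in fill_row(rows_left[0], cols_left)
        )

    return count(tuple(row_totals), tuple(col_totals))

print(count_tables((4, 7), (5, 6)))   # margins of the 2-by-2 example above
for n in (3, 6, 9, 12):
    # 3-by-3 tables with every row and column total equal to n
    print(n, count_tables((n, n, n), (n, n, n)))
```

For the margins (4,7) and (5,6) it finds the five tables listed earlier. Even for 3 by 3 tables the count climbs quickly as the totals increase, and it climbs far faster as rows and columns are added.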

At one time, I almost always used the Yates correction. Many statisticians did not, but the arguments for its use were compelling (Yates, 1984). Today, most computer programs report Fisher's exact test for every 2x2 table, so I use that. For larger tables, I follow Cochran's rule. I use the uncorrected test statistic (Pearson's) for large samples and Fisher's exact test whenever the size of the sample is called into question and available software will allow it.


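For example, for the table

         8  5
         3 10

Pearson's statistic is $X^2$ = 3.94 (P = 0.047) and the Yates-corrected statistic is $X_c^2$ = 2.52 (P = 0.112). Fisher's exact test gives P = 0.111.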
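These three results can be reproduced with a short script using only the standard library. It is a sketch for this one table, not a general routine: the chi-square P values rely on the 1 degree of freedom identity P(chi-square > x) = erfc(sqrt(x/2)), and the Fisher P value is the two-tailed version that sums the probabilities of all tables no more probable than the observed one.

```python
from fractions import Fraction
from math import comb, erfc, sqrt

table = [[8, 5], [3, 10]]
(a, b), (c, d) = table
row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

def chi_square(correction):
    """Goodness-of-fit statistic for the table; correction=0.5 gives the
    Yates-corrected version, correction=0.0 the uncorrected Pearson one."""
    total = 0.0
    for row in table:
        for j, x in enumerate(row):
            e = sum(row) * (table[0][j] + table[1][j]) / n  # expected count
            total += (abs(x - e) - correction) ** 2 / e
    return total

def p_1df(x):
    """Upper tail of the chi-square distribution with 1 degree of freedom."""
    return erfc(sqrt(x / 2))

# Fisher's exact test, two-tailed, conditioning on the margins.
probs = {k: Fraction(comb(row1, k) * comb(row2, col1 - k), comb(n, col1))
         for k in range(max(0, col1 - row2), min(row1, col1) + 1)}
fisher_p = float(sum(p for p in probs.values() if p <= probs[a]))

x2, xc2 = chi_square(0.0), chi_square(0.5)
print(f"Pearson: {x2:.2f} (P = {p_1df(x2):.3f})")
print(f"Yates:   {xc2:.2f} (P = {p_1df(xc2):.3f})")
print(f"Fisher:  P = {fisher_p:.3f}")
```

The example shows the pattern discussed above: the uncorrected statistic slips just under 0.05, while the Yates-corrected P value sits close to the one from Fisher's exact test.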


Use Fisher's exact test whenever the software provides it. Otherwise, follow Cochran's rule. If Cochran's rule is satisfied (no expected cell count is less than 1 and no more than 20% are less than 5), use the uncorrected Pearson goodness-of-fit statistic. If the sample size is called into question, use Fisher's exact test if your software can provide it.

It is straightforward mathematically to show for large samples that P values based on Pearson's goodness-of-fit test and Fisher's exact test are virtually identical. I do not recall a single case where a table satisfied Cochran's rule and the two P values differed in any manner of consequence.


*This is true from a statistical standpoint, but it is overly simplistic from a practical standpoint. Case-control studies involve sampling fixed numbers of those with and without a disease. The cases (those with the disease) are compared to those without the disease (controls) for the presence of some potential causal factor (exposure). However, it is often the case that there are no sampling frames (lists of individuals) for drawing random samples of those with and without the disease. It has been argued that case-control studies are inherently flawed because of biases between the case and control groups. In order to meet this criticism, it has become common to conduct nested case-control studies in which the cases and controls are extracted truly at random from an identifiable group being studied over time for some other purpose, such as Framingham or the Nurses' Health Study. While the generalizability of nested case-control studies might be questioned, they are internally valid because cases and controls were recruited in the same way.

Copyright © 2000 Gerard E. Dallal