[Notation: The obvious notational choice for proportion or probability
is *p*. The standard convention is to use Roman letters for sample
quantities and the corresponding Greek letters for population quantities.
Some books do just that. However, the Greek letter π has its own special place in mathematics.
Therefore, instead of using *p* for the sample proportion and π for the population proportion, many authors use
*p* for the population proportion and *p* with a hat (caret) on
it, p̂ (called "p-hat"), for the
sample proportion. The use of "hat" notation to distinguish between
sample and population quantities is quite common.]

There's really nothing new to learn to compare two proportions because we know how to compare means. Proportions are just means! The proportion having a particular characteristic is the number of individuals with the characteristic divided by total number of individuals. Suppose we create a variable that equals 1 if the subject has the characteristic and 0 if not. The proportion of individuals with the characteristic is the mean of this variable because the sum of these 0s and 1s is the number of individuals with the characteristic.
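That equivalence can be sketched in a few lines of plain Python (the 0/1 data here are made up purely for illustration):

```python
# Code the characteristic as a 0/1 indicator: 1 = has it, 0 = does not.
indicator = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # illustrative data
n = len(indicator)

# Proportion: number with the characteristic divided by total number.
proportion = indicator.count(1) / n

# Mean of the 0/1 variable: the sum of the 0s and 1s is that same count.
mean = sum(indicator) / n

assert proportion == mean   # identical by construction (0.5 here)
```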

While it's never done this way (I don't know why not^{*}), two
proportions could be compared by using Student's t test for independent
samples with the new 0/1 variable as the response.
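As a sketch of what that would look like, here is Welch's t statistic (the "equal variances not assumed" version) computed by hand on 0/1 responses; the counts (8 of 13 vs. 3 of 13) are borrowed from the first example later in the text, and the helper function is my own:

```python
import math

def welch_t(x, y):
    """Welch's t statistic for two independent samples (variances use n-1)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((xi - mx) ** 2 for xi in x) / (nx - 1)
    vy = sum((yi - my) ** 2 for yi in y) / (ny - 1)
    return (mx - my) / math.sqrt(vx / nx + vy / ny)

group1 = [1] * 8 + [0] * 5    # 8/13 have the characteristic
group2 = [1] * 3 + [0] * 10   # 3/13 have it

t = welch_t(group1, group2)   # ≈ 2.07
```

The result is close to, but slightly smaller than, z_{1}=2.155 for the same data, because the sample variances here are computed with n-1 in the denominator rather than n.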

An approximate 95% confidence interval for the difference between
two population proportions (p_{1}-p_{2}), based on two
independent samples of sizes n_{1} and n_{2} with sample
proportions p̂_{1} and p̂_{2}, is given by

(p̂_{1} - p̂_{2}) ± 1.96 √[ p̂_{1}(1-p̂_{1})/n_{1} + p̂_{2}(1-p̂_{2})/n_{2} ]

Even though this looks different from other formulas we've seen, it's
nearly identical to the formula for the "equal variances not assumed"
version of Student's t test for independent samples. The only difference
is that the SDs are calculated with *n* in the
denominator instead of *n-1*.
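A minimal sketch of this interval in Python (the function name is my own; the counts 16/34 and 6/26 are borrowed from an example later in the text):

```python
import math

def two_prop_ci(x1, n1, x2, n2, z=1.96):
    """Approximate 95% CI for p1 - p2, using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

lo, hi = two_prop_ci(16, 34, 6, 26)   # ≈ (0.0066, 0.4730)
```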

An approximate 95% confidence interval for a single population proportion, based on a sample of size n with sample proportion p̂, is

p̂ ± 1.96 √[ p̂(1-p̂)/n ]
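The single-proportion interval can be sketched the same way (hypothetical function name, illustrative counts):

```python
import math

def one_prop_ci(x, n, z=1.96):
    """Approximate 95% CI for a single population proportion."""
    p = x / n
    half = z * math.sqrt(p * (1 - p) / n)
    return p - half, p + half

lo, hi = one_prop_ci(8, 13)   # illustrative: 8 of 13 have the characteristic
```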

*Comparing Two Proportions*

There is a choice of test statistics for testing the null hypothesis
H_{0}: p_{1}=p_{2} (the population proportions
are equal) against H_{1}: p_{1}≠p_{2} (the population proportions are
not equal). The test is performed by calculating one of these statistics
and comparing its value to the percentiles of the standard normal
distribution to obtain the observed significance level. If this P value
is sufficiently small, the null hypothesis is rejected.

Which statistic should be used? Many statisticians have offered arguments for preferring one statistic over the others but, in practice, most researchers use the one that is provided by their statistical software or that is easiest to calculate by hand.

All of these statistics can be justified by large-sample statistical
theory. They all reject H_{0} 100α% of the time when H_{0} is true.
(However, they don't always agree on the same set of data.) Since they
all reject H_{0} with the same frequency when it is true, you
might think of using the test that is most likely to reject H_{0}
when it is false, but none has been shown to be more likely than the
others to reject H_{0} when it is false for all alternatives to
H_{0}.

The first statistic is

z_{1} = (p̂_{1} - p̂_{2}) / √[ p̂_{1}(1-p̂_{1})/n_{1} + p̂_{2}(1-p̂_{2})/n_{2} ]

The second is

z_{2} = (p̂_{1} - p̂_{2}) / √[ p̄(1-p̄)(1/n_{1} + 1/n_{2}) ]

where p̄ is the proportion of individuals having the characteristic when the two samples are lumped together.

A third statistic is

z_{3} = ( |p̂_{1} - p̂_{2}| - (1/n_{1} + 1/n_{2})/2 ) / √[ p̄(1-p̄)(1/n_{1} + 1/n_{2}) ]
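A sketch of all three statistics in Python (function name mine); plugging in p̂_{1}=8/13 vs. p̂_{2}=3/13, the first example in the text, reproduces the z values quoted there:

```python
import math

def two_prop_tests(x1, n1, x2, n2):
    """z1 (unpooled SE), z2 (pooled SE), z3 (Yates-corrected) for H0: p1 = p2."""
    p1, p2 = x1 / n1, x2 / n2
    pbar = (x1 + x2) / (n1 + n2)   # samples lumped together
    se_unpooled = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    se_pooled = math.sqrt(pbar * (1 - pbar) * (1 / n1 + 1 / n2))
    z1 = (p1 - p2) / se_unpooled
    z2 = (p1 - p2) / se_pooled
    z3 = (abs(p1 - p2) - (1 / n1 + 1 / n2) / 2) / se_pooled
    return z1, z2, z3

z1, z2, z3 = two_prop_tests(8, 13, 3, 13)   # ≈ 2.155, 1.985, 1.588
# Two-sided P value from the standard normal: P = erfc(|z| / sqrt(2))
p_val = math.erfc(abs(z1) / math.sqrt(2))   # ≈ 0.031
```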

The test statistic **z_{1}** is consistent with the
corresponding confidence interval; that is, z_{1} rejects
H_{0}: p_{1}=p_{2} at the 0.05 level exactly when the 95%
confidence interval for p_{1}-p_{2} does not contain 0.

The test statistic **z_{2}** is equivalent to the chi-square
goodness-of-fit test, also called (correctly) a test of
homogeneity of proportions and (incorrectly, for this application) a test
of independence.

The test statistic **z_{3}** is equivalent to the chi-square
test with Yates's continuity correction. It was developed to
approximate another test statistic (Fisher's exact test) that was
difficult to compute by hand. Computers easily perform this calculation,
so this statistic is now obsolete. Nevertheless, most statistical program
packages continue to report it as part of their analysis of proportions.

*Examples*

- p̂_{1}=8/13 and p̂_{2}=3/13. Then z_{1}=2.155 (P=0.031), z_{2}=1.985 (P=0.047), and z_{3}=1.588 (P=0.112). Fisher's exact test gives P=0.111.
- p̂_{1}=16/34 and p̂_{2}=6/26. Then z_{1}=2.016 (P=0.044) and z_{2}=1.910 (P=0.056). A 95% CI for p_{1}-p_{2} is 0.2398 ± 0.2332 = (0.0066, 0.4730). The confidence interval agrees with z_{1}: the CI does not contain 0, and z_{1} rejects H_{0}: p_{1}=p_{2}. However, z_{1} and the CI disagree with z_{2}, which fails to reject H_{0}.

Common sense suggests using z_{1} because it avoids
conflicts with the corresponding confidence interval. In practice,
however, the chi-square test for homogeneity of proportions (equivalent
to z_{2}) is used because that's what statistical software
packages report; I don't know of any that report z_{1}. Still,
z_{2} (in the form of the chi-square test) has the advantage of
generalizing to tests of the equality of more than two proportions.

When testing the null hypothesis H_{0}: the population
proportion equals some specified value p_{0} against
H_{1}: the population proportion does not equal p_{0},
there is, once again, a choice of test statistics:

z_{1} = (p̂ - p_{0}) / √[ p̂(1-p̂)/n ]

z_{2} = (p̂ - p_{0}) / √[ p_{0}(1-p_{0})/n ]

z_{3} = ( |p̂ - p_{0}| - 1/(2n) ) / √[ p_{0}(1-p_{0})/n ]

all of which are compared to the percentiles of the standard normal distribution.

Again, z_{1} gives tests that are consistent with the
corresponding confidence intervals, z_{2} is equivalent to the
chi-square goodness-of-fit test, and z_{3} gives one-sided
P-values that usually have better agreement with exact P-values obtained,
in this case, by using the binomial distribution.
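These three can be sketched the same way (hypothetical function name; the call below is illustrative, testing H_{0}: p = 0.5 with 8 successes in 13 trials):

```python
import math

def one_prop_tests(x, n, p0):
    """z1 (SE from p-hat), z2 (SE from p0), z3 (continuity-corrected) for H0: p = p0."""
    p = x / n
    se_hat = math.sqrt(p * (1 - p) / n)      # consistent with the CI
    se_null = math.sqrt(p0 * (1 - p0) / n)   # chi-square goodness-of-fit version
    z1 = (p - p0) / se_hat
    z2 = (p - p0) / se_null
    z3 = (abs(p - p0) - 1 / (2 * n)) / se_null
    return z1, z2, z3

z1, z2, z3 = one_prop_tests(8, 13, 0.5)   # ≈ 0.855, 0.832, 0.555
```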

*Comment*

These techniques are based on large sample theory. Rough rules of thumb say they may be applied when there are at least five occurrences of each outcome in each sample and, in the case of a single sample, provided confidence intervals lie entirely in the range (0,1).

1. We can construct confidence intervals for population proportions and for the difference between population proportions just as we did for population means.
2. We can test the hypothesis that two population proportions are equal just as we did for population means.
3. The formulas for constructing confidence intervals and for testing the hypothesis of equal proportions are slightly different, unlike the case of means, where the two formulas are the same.
4. As a consequence of (3), it is possible (although uncommon) for the test to reject the hypothesis of equal proportions while the CI for their difference contains 0, or for the test to fail to reject while the CI does not contain 0!
5. The formula for CIs can be adapted for significance testing. However, the formula for significance tests cannot be adapted for constructing CIs.
6. Which test statistic should be used? All are equally valid. Almost every statistical program provides a test procedure that is equivalent to z_{2} for comparing proportions, so that's what people use.
7. Why is the test statistic based on the CI for population differences not widely available in statistical software? Because the chi-square test is easily generalized to classifications with more than two categories. The other test statistic is not.
8. This is just the tip of the iceberg. When the response is counts, there can be dozens of valid test statistics and methods for constructing confidence intervals, all giving slightly different results. The good news is that they tend to give the same inference (lead to the same conclusion).