**Nonparametric Statistics**

**Gerard E. Dallal, Ph.D.**

Before discussing *non*parametric techniques, we should consider
why the methods we usually use are called *parametric*. Parameters
are indices. They index (or label) individual distributions within a
particular family. For example, there are an infinite number of normal
distributions, but each normal distribution is uniquely determined by its
mean (μ) and standard deviation (σ). If you specify all of
the parameters (here, μ and σ), you've specified a unique
normal distribution.

Most commonly used statistical techniques are properly called parametric because they involve estimating or testing the value(s) of parameter(s)--usually, population means or proportions. It should come as no surprise, then, that nonparametric methods are procedures that work their magic without reference to specific parameters.

The precise definition of nonparametric varies slightly among
authors^{1}. You'll see the terms *nonparametric* and
*distribution-free*. They have slightly different meanings, but are
often used interchangeably--like *arteriosclerosis* and
*atherosclerosis*.

**Ranks**

Many nonparametric procedures are based on ranked data. Data are ranked by ordering them from lowest to highest and assigning them, in order, the integer values from 1 to the sample size. Ties are resolved by assigning tied values the mean of the ranks they would have received if there were no ties, e.g., 117, 119, 119, 125, 128 becomes 1, 2.5, 2.5, 4, 5. (If the two 119s were not tied, they would have been assigned the ranks 2 and 3. The mean of 2 and 3 is 2.5.)
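This mid-rank rule is easy to implement. The sketch below (the function name `midranks` is my own; `scipy.stats.rankdata` with `method="average"` does the same job) reproduces the 117, 119, 119, 125, 128 example:

```python
def midranks(data):
    """Rank data from lowest to highest; tied values share the mean of
    the ranks they would have received if there were no ties."""
    indexed = sorted(range(len(data)), key=lambda i: data[i])
    ranks = [0.0] * len(data)
    i = 0
    while i < len(indexed):
        j = i
        # extend j to the end of the run of tied values
        while j + 1 < len(indexed) and data[indexed[j + 1]] == data[indexed[i]]:
            j += 1
        mean_rank = (i + 1 + j + 1) / 2  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[indexed[k]] = mean_rank
        i = j + 1
    return ranks

print(midranks([117, 119, 119, 125, 128]))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```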

For large samples, many nonparametric techniques can be viewed as the
usual normal-theory-based procedures applied to ranks. The following
table contains the names of some normal-theory-based procedures and their
nonparametric counterparts. For smaller sample sizes, the same statistic
(or one mathematically equivalent to it) is used, but decisions regarding
its significance are made by comparing the observed value to special
tables of critical values^{2}.

**Some Commonly Used Statistical Tests**

| Normal theory based test | Corresponding nonparametric test | Purpose of test |
|---|---|---|
| t test for independent samples | Mann-Whitney U test; Wilcoxon rank-sum test | Compares two independent samples |
| Paired t test | Wilcoxon matched pairs signed-rank test | Examines a set of differences |
| Pearson correlation coefficient | Spearman rank correlation coefficient | Assesses the linear association between two variables |
| One way analysis of variance (F test) | Kruskal-Wallis analysis of variance by ranks | Compares three or more groups |
| Two way analysis of variance | Friedman two way analysis of variance | Compares groups classified by two different factors |

**Some nonparametric procedures**

The *Wilcoxon signed rank test* is used to test whether the
median of a symmetric population is 0. First, the data are ranked without
regard to sign. Second, the signs of the original observations are
attached to their corresponding ranks. Finally, the one sample z
statistic (mean / standard error of the mean) is calculated from the
signed ranks. For large samples, the z statistic is compared to
percentiles of the standard normal distribution. For small samples, the
statistic is compared to likely results if each rank was equally likely
to have a + or - sign affixed.
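The large-sample recipe just described can be sketched in a few lines. This is illustrative code (the function name is my own, and it assumes no zero differences and no tied absolute values, which would require mid-ranks):

```python
import math

def signed_rank_z(diffs):
    """Large-sample Wilcoxon signed rank test, as described above:
    rank |d| without regard to sign, reattach the signs, then form the
    one-sample z statistic (mean / standard error of the mean) from the
    signed ranks. Assumes no zeros and no tied absolute values."""
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    signed = [0.0] * n
    for rank, i in enumerate(order, start=1):
        signed[i] = rank if diffs[i] > 0 else -rank
    mean = sum(signed) / n
    var = sum((s - mean) ** 2 for s in signed) / (n - 1)
    return mean / math.sqrt(var / n)
```

For large samples the returned value would be compared to percentiles of the standard normal distribution.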

The *Wilcoxon rank sum test* (also known as *the Mann-Whitney U
test* or the *Wilcoxon-Mann-Whitney test*) is used to test
whether two samples are drawn from the same population. It is most
appropriate when the likely alternative is that the two populations are
shifted with respect to each other. The test is performed by ranking the
combined data set, dividing the ranks into two sets according to the group
membership of the original observations, and calculating a two sample z
statistic, using the pooled variance estimate. For large samples, the
statistic is compared to percentiles of the standard normal distribution.
For small samples, the statistic is compared to what would result if the
data were combined into a single data set and assigned at random to two
groups having the same number of observations as the original samples.
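The same recipe can be sketched for the rank sum test. The code below (names are my own; distinct data values are assumed so the ranking needs no tie handling) ranks the pooled data and forms the two-sample z statistic on the ranks:

```python
import math

def rank_sum_z(sample_a, sample_b):
    """Large-sample Wilcoxon rank sum test, as described above: rank the
    combined data, split the ranks by group, and compute the two-sample
    z statistic using the pooled variance estimate. Assumes no ties."""
    combined = sorted(sample_a + sample_b)
    rank = {v: r for r, v in enumerate(combined, start=1)}
    ra = [rank[v] for v in sample_a]
    rb = [rank[v] for v in sample_b]
    na, nb = len(ra), len(rb)
    ma, mb = sum(ra) / na, sum(rb) / nb
    pooled_var = (sum((r - ma) ** 2 for r in ra)
                  + sum((r - mb) ** 2 for r in rb)) / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled_var * (1 / na + 1 / nb))
```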

Spearman's rho (*Spearman rank correlation coefficient*) is the
nonparametric analog of the usual Pearson product-moment correlation
coefficient. It is calculated by converting each variable to ranks and
calculating the Pearson correlation coefficient between the two sets of
ranks. For small sample sizes, the observed correlation coefficient is
compared to what would result if the ranks of the X- and Y-values were
random permutations of the integers 1 to *n* (sample size).
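Because Spearman's rho is just Pearson's r computed on ranks, it can be sketched directly (illustrative code with hypothetical helper names, assuming no ties):

```python
def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(v):
    """Ranks 1..n of the values in v (no tie handling)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two sets of ranks."""
    return pearson(ranks(x), ranks(y))
```

Note that a monotonic but nonlinear relationship gives rho = 1 even though Pearson's r on the raw data would be less than 1.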

Since these nonparametric procedures can be viewed as the usual parametric procedures applied to ranks, it is reasonable to ask what is gained by using ranks in place of the raw data.

**Advantages of nonparametric procedures**

(1) Nonparametric tests make less stringent demands of the data. For
standard parametric procedures to be valid, certain underlying conditions
or assumptions must be met, particularly for smaller sample sizes. The
one-sample t test, for example, requires that the observations be drawn
from a normally distributed population. For two independent samples, the
t test has the additional requirement that the population standard
deviations be equal. If these assumptions/conditions are violated, the
resulting P-values and confidence intervals may not be
trustworthy^{3}. However, normality is not required for the
Wilcoxon signed rank or rank sum tests to produce valid inferences about
whether the median of a symmetric population is 0 or whether two samples
are drawn from the same population.

(2) Nonparametric procedures can sometimes be used to get a quick answer with little calculation.

Two of the simplest nonparametric procedures are the sign test and
median test. The *sign test* can be used with paired data to test the
hypothesis that differences are equally likely to be positive or
negative, (or, equivalently, that the median difference is 0). For small
samples, an exact test of whether the proportion of positives is 0.5 can
be obtained by using a binomial distribution. For large samples, the test
statistic is

(plus - minus)² / (plus + minus),

where *plus* is the number of positive values and *minus*
is the number of negative values. Under the null hypothesis that the
positive and negative values are equally likely, the test statistic
follows the chi-square distribution with 1 degree of freedom. Whether the
sample size is small or large, the sign test provides a quick test of
whether two paired treatments are equally effective simply by counting
the number of times each treatment is better than the other.

Example: 15 patients are given both treatments A and B to test the hypothesis that the treatments perform equally well. If 13 patients prefer A to B and 2 patients prefer B to A, the test statistic is (13 - 2)² / (13 + 2) = 8.07, with a corresponding P-value of 0.0045. The null hypothesis is therefore rejected.
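The worked example is easy to verify. The helper names below are my own; the exact small-sample version simply doubles a binomial(n, 1/2) tail probability:

```python
from math import comb

def sign_test_chi2(plus, minus):
    """Large-sample sign test statistic, (plus - minus)^2 / (plus + minus),
    compared to the chi-square distribution with 1 degree of freedom."""
    return (plus - minus) ** 2 / (plus + minus)

def sign_test_exact(plus, minus):
    """Exact two-sided sign-test P-value from the binomial(n, 1/2)
    distribution: twice the probability of the smaller count or fewer."""
    n = plus + minus
    k = min(plus, minus)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(sign_test_chi2(13, 2), 2))   # 8.07
print(round(sign_test_exact(13, 2), 4))  # 0.0074
```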

The *median test* is used to test whether two
samples are drawn from populations with the same median. The median of
the combined data set is calculated and each original observation is
classified according to its original sample (A or B) and whether it is
less than or greater than the overall median. The chi-square test for
homogeneity of proportions in the resulting 2-by-2 table tests whether
the population medians are equal.
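The classification step can be sketched as follows (the function name is hypothetical; observations equal to the combined median are dropped, which is one common convention). The chi-square test for homogeneity would then be applied to the resulting counts:

```python
def median_test_table(a, b):
    """Build the 2-by-2 median-test table: classify each observation by
    its sample (A or B) and by whether it falls below or above the
    median of the combined data. Values equal to the median are dropped."""
    combined = sorted(a + b)
    n = len(combined)
    med = (combined[(n - 1) // 2] + combined[n // 2]) / 2
    table = {}
    for name, sample in (("A", a), ("B", b)):
        below = sum(1 for v in sample if v < med)
        above = sum(1 for v in sample if v > med)
        table[name] = (below, above)
    return table

print(median_test_table([1, 2, 3], [10, 20, 30]))
# {'A': (3, 0), 'B': (0, 3)}
```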

(3) Nonparametric methods provide an air of objectivity when there is
no reliable (universally recognized) underlying scale for the original
data and there is some concern that the results of standard parametric
techniques would be criticized for their dependence on an artificial
metric. For example, patients might be asked whether they feel
*extremely uncomfortable* / *uncomfortable* / *neutral* /
*comfortable* / *very comfortable*. What scores should be
assigned to the comfort categories and how do we know whether the outcome
would change dramatically with a slight change in scoring? Some of these
concerns are blunted when the data are converted to ranks^{4}.

(4) A historical appeal of rank tests is that it was easy to
construct tables of exact critical values, provided there were no ties in
the data. The same critical value could be used for all data sets with
the same number of observations because every data set is reduced to the
ranks 1,...,*n*. However, this advantage has been eliminated by the
ready availability of personal computers^{5}.

(5) Sometimes the data do not constitute a random sample from a
larger population. The data in hand are all there are. Standard
parametric techniques based on sampling from larger populations are no
longer appropriate. Because there are no larger populations, there are no
population parameters to estimate. Nevertheless, certain kinds of
nonparametric procedures can be applied to such data by using
*randomization models*.

From Dallal (1988):

Consider, for example, a situation in which a company's workers are assigned in haphazard fashion to work in one of two buildings. After yearly physicals are administered, it appears that workers in one building have higher lead levels in their blood. Standard sampling theory techniques are inappropriate because the workers do not represent samples from a large population--there is no large population. The randomization model, however, provides a means for carrying out statistical tests in such circumstances. The model states that if there were no influence exerted by the buildings, the lead levels of the workers in each building should be no different from what one would observe after combining all of the lead values into a single data set and dividing it in two, at random, according to the number of workers in each building. The stochastic component of the model, then, exists only in the analyst's head; it is not the result of some physical process, except insofar as the haphazard assignment of workers to buildings is truly random.

Of course, randomization tests cannot be applied blindly any more than normality can automatically be assumed when performing a t test. (Perhaps, in the lead levels example, one building's workers tend to live in urban settings while the other building's workers live in rural settings. Then the randomization model would be inappropriate.) Nevertheless, there will be many situations where the less stringent requirements of the randomization test will make it the test of choice. In the context of randomization models, randomization tests are the ONLY legitimate tests; standard parametric tests are valid only as approximations to randomization tests.

^{[6]}
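The randomization model described above can be approximated by Monte Carlo sampling of random splits rather than full enumeration (Dallal's PITMAN program does the enumeration exactly; this sketch, with hypothetical names, merely samples):

```python
import random

def randomization_test(group1, group2, reps=10000, seed=0):
    """Two-sided randomization test for a difference in means.

    Repeatedly split the combined data at random into groups of the
    original sizes and count how often the absolute mean difference is
    at least as large as the one actually observed."""
    rng = random.Random(seed)
    combined = group1 + group2
    n1, n2 = len(group1), len(group2)
    observed = abs(sum(group1) / n1 - sum(group2) / n2)
    hits = 0
    for _ in range(reps):
        rng.shuffle(combined)
        diff = abs(sum(combined[:n1]) / n1 - sum(combined[n1:]) / n2)
        if diff >= observed:
            hits += 1
    return hits / reps
```

For `[1, 2, 3]` versus `[10, 11, 12]`, only 2 of the 20 possible splits are as extreme as the observed one, so the estimated P-value hovers near the exact value of 0.10.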

**Disadvantages of nonparametric procedures**

Such a strong case has been made for the benefits of nonparametric procedures that some might ask why parametric procedures aren't abandoned entirely in favor of nonparametric methods!

The major disadvantage of nonparametric techniques is contained in
their name. Because the procedures are *nonparametric*, there are no
parameters to describe and it becomes more difficult to make quantitative
statements about the actual difference between populations. (For example,
when the sign test says two treatments are different, there's no
confidence interval and the test doesn't say by how much the treatments
differ.) With the right software, it is sometimes possible to compute
estimates (and even confidence intervals!) for medians and for
differences between medians, but the calculations are often too tedious
for pencil and paper; a computer is required. As statistical software
goes through its various iterations, such confidence intervals may become
readily available, but I'm still waiting!^{7}

The second disadvantage is that nonparametric procedures throw away information! The sign test, for example, uses only the signs of the observations. Ranks preserve information about the order of the data but discard the actual values. Because information is discarded, nonparametric procedures can never be as powerful (able to detect existing differences) as their parametric counterparts when parametric tests can be used.

How much information is lost? One answer is given by the asymptotic relative efficiency (ARE), which, loosely speaking, is the ratio of sample sizes required (parametric to nonparametric) for a parametric procedure to have the same ability to reject a false null hypothesis as the corresponding nonparametric procedure. When the underlying distributions are normal (with equal population standard deviations for the two-sample case), the AREs are:

| Procedure | ARE |
|---|---|
| sign test | 2/π = 0.637 |
| Wilcoxon signed-rank test | 3/π = 0.955 |
| median test | 2/π = 0.637 |
| Wilcoxon-Mann-Whitney U test | 3/π = 0.955 |
| Spearman correlation coefficient | 0.91 |

Thus, if the data come from a normally distributed population, the usual z statistic requires only 637 observations to demonstrate a difference when the sign test requires 1000. Similarly, the t test requires only 955 to the Wilcoxon signed-rank test's 1000. It has been shown that the ARE of the Wilcoxon-Mann-Whitney test is always at least 0.864, regardless of the underlying population. Many say the AREs are so close to 1 for procedures based on ranks that they are the best reason yet for using nonparametric techniques!

**Other procedures**

Nonparametric statistics is a field of specialization in its own right. Many procedures have not been touched upon here. These include the Kolmogorov-Smirnov test for the equality of two distribution functions, Kruskal-Wallis one-way analysis of variance, Friedman two-way analysis of variance, and the logrank test and Gehan's generalized Wilcoxon test for comparing two survival distributions. It would not be too much of an exaggeration to say that for every parametric test there is a nonparametric analogue that allows some of the assumptions of the parametric test to be relaxed. Many of these procedures are discussed in Siegel (1956), Hollander and Wolfe (1973) and Lee (1992).

Ellis et al. (1986) report in summary form the retinyl ester concentrations (mg/dl) of 9 normal individuals and 9 type V hyperlipoproteinemic individuals. Although the two groups do not overlap (every type V concentration exceeds every normal concentration), these data are not quite significant at the 0.05 level according to the t test using Satterthwaite's approximation for unequal variances. Yet even the lowly median test points to substantial differences between the two groups.

| Normal | Type V hyperlipoproteinemic |
|---|---|
| 1.4 | 30.9 |
| 2.5 | 134.6 |
| 4.6 | 13.6 |
| 0.0 | 28.9 |
| 0.0 | 434.1 |
| 2.9 | 101.7 |
| 1.9 | 85.1 |
| 4.0 | 26.5 |
| 2.0 | 44.8 |

| | Normal | Type V |
|---|---|---|
| mean | 2.1444 | 100.0222 |
| SD | 1.5812 | 131.7142 |
| SEM | 0.5271 | 43.9048 |
| sample size | 9 | 9 |

| statistic | value | P-value | df |
|---|---|---|---|
| t (separate) | -2.23 | .0564 | 8.0 |
| t (pooled) | -2.23 | .0405 | 16 |
| F (variances) | 6938.69 | .0000 | 8, 8 |

Median test: all 9 normal values fall below the combined median and all 9 type V values fall above it; P-value (exact) = .0000.
Wilcoxon-Mann-Whitney test: P-value = .0000.
Pitman randomization test: P-value = .0000.
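As a check on the display, a short computation (assuming the two columns are the normal and type V samples as labeled) reproduces the group means and the exact median-test P-value, which comes from Fisher's exact calculation on the 2-by-2 table:

```python
from math import comb

# Data as given in the display.
normal = [1.4, 2.5, 4.6, 0.0, 0.0, 2.9, 1.9, 4.0, 2.0]
type_v = [30.9, 134.6, 13.6, 28.9, 434.1, 101.7, 85.1, 26.5, 44.8]

print(round(sum(normal) / 9, 4))  # 2.1444
print(round(sum(type_v) / 9, 4))  # 100.0222

# Every normal value lies below the combined median and every type V value
# lies above it, so the median-test table is [[9, 0], [0, 9]]. The exact
# (Fisher) two-sided P-value for this complete separation:
p = 2 * comb(9, 9) * comb(9, 0) / comb(18, 9)
print(p < 0.0001)  # True (about 0.00004, reported as .0000)
```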

- Bradley JV (1968), Distribution Free Statistical Tests. Prentice Hall: Englewood Cliffs, NJ.
- Dallal GE (1988), "PITMAN: A FORTRAN Program for Exact Randomization Tests," Computers and Biomedical Research, 21, 9-15.
- Ellis JK, Russell RM, Makraurer FL, and Schaefer EJ (1986), "Increased Risk for Vitamin A Toxicity in Severe Hypertriglyceridemia," Annals of Internal Medicine, 105, 877-879.
- Fisher LD and van Belle G (1993), Biostatistics: A Methodology for the Health Sciences. New York: John Wiley & Sons, Inc.
- Hollander M and Wolfe DA (1973), Nonparametric Statistical Methods. New York: John Wiley & Sons, Inc.
- Lee ET (1992), Statistical Methods for Survival Data Analysis. New York: John Wiley & Sons, Inc.
- Lehmann EL (1975), Nonparametrics: Statistical Methods Based on Ranks. San Francisco: Holden-Day, Inc.
- Mehta C and Patel N (1992), StatXact-Turbo: Statistical Software for Exact Nonparametric Inference. Cambridge, MA: CYTEL Software Corporation.
- Siegel S (1956), Nonparametric Statistics. New York: McGraw-Hill Book Company, Inc.
- Velleman PF and Wilkinson L (1993), "Nominal, Ordinal, Interval, and Ratio Typologies Are Misleading," The American Statistician, 47, 65-72.

- For example:
Fisher and van Belle (1993, p. 306): A family of probability distributions is nonparametric if the distributions of the family cannot be conveniently characterized by a few parameters. [For example, all possible continuous distributions.] Statistical procedures that hold or are valid for a nonparametric family of distributions, are called nonparametric statistical procedures.

Bradley (1968, p. 15): The terms nonparametric and distribution-free are not synonymous . . . Popular usage, however, has equated the terms . . . Roughly speaking, a nonparametric test is one which makes no hypothesis about the value of a parameter in a statistical density function, whereas a distribution-free test is one which makes no assumptions about the precise form of the sampled population.

Lehmann (1975, p. 58): . . . distribution-free or nonparametric, that is, free of the assumption that [the underlying distribution of the data] belongs to some parametric family of distributions.

- For small samples, the tables are constructed by straightforward
enumeration. For Spearman's correlation coefficient, the possible values
of the correlation coefficient are enumerated by holding one set of
values fixed at 1,...,*n* and pairing it with every possible permutation
of 1,...,*n*. For the Wilcoxon signed rank test, the values of the test
statistic (whether it be the t statistic or, equivalently, the sum of the
positive ranks) are enumerated for all 2^{n} ways of labelling the ranks
with + or - signs. Similar calculations underlie the construction of
tables of critical values for other procedures. Because the critical
values are based on all possible permutations of the ranks, these
procedures are sometimes called *permutation tests*.
- On the other hand, a violation of the standard assumptions can often be
handled by analyzing some transformation of the raw data (logarithmic,
square root, and so on). For example, when the within-group standard
deviation is seen to be roughly proportional to the mean, a logarithmic
transformation will produce samples with approximately equal standard
deviations. Some researchers are unnecessarily anxious about transforming
data because they view it as tampering. However, it is important to keep
in mind that the point of the transformation is to insure the validity of
the analysis (normal distribution, equal standard deviations) and *not*
to insure a certain type of outcome. Given a choice between two
transformations, one that produced a statistically significant result and
another that produced an insignificant result, I would always believe the
result for which the data more closely met the requirements of the
procedure being applied. This is no different from trusting the results
of a fasting blood sample, if that is what is required, when both fasting
and non-fasting samples are available.
- Many authors discuss "scales of measurement," using terms such as
nominal, ordinal, interval, or ratio data as guides to what statistical
procedure can be applied to a data set. The terminology often fails in
practice because, as Velleman and Wilkinson (1993) observe, "scale
type...is not an attribute of the data, but rather depends upon the
questions we intend to ask of the data and upon any additional
information we might have." Thus, a patient identification number might
ordinarily be viewed as a nominal variable (that is, a mere label).
However, IDs are often assigned sequentially and in some cases it may
prove fruitful to look for relationships between ID and other important
variables. While the ideas behind scales of measurement are important,
the terminology itself is best ignored. Just be aware that when you score
*neutral* as 0, *comfortable* as 1, and *very comfortable* as 2, you
should be wary of any procedure that relies heavily on treating "*very
comfortable*" as being twice as comfortable as *comfortable*.
- The ready availability of computers has made much theoretical work
concerning approximations and corrections for ties in the data obsolete,
too. Ties were a problem because, with ties, a set of *n* observations
does not reduce to the set of ranks 1,...,*n*. The particular set of
ranks depends on the number and pattern of ties. In the past, corrections
to the usual z statistic were developed to adjust for tied ranks. Today,
critical values for exact nonparametric tests involving data with ties
can be calculated on demand by specialized computer programs such as
StatXact (Mehta and Patel, 1992).
- The data need not be converted to ranks in order to perform a
permutation test. However, if the raw data are used, a critical value
must be calculated for the specific data set if the sample size is small
or moderate. (The usual t test has been shown to be a large sample
approximation to the permutation test!) At one time, the computational
complexity of this task for moderate and even small samples was
considered a major disadvantage. It has become largely irrelevant due to
specialized computer programs that perform the calculations in an
efficient manner.
- This illustrates an often unspoken aspect of statistical computing:
**We are prisoners of our software!** Most analysts can do only what
their software allows them to do. When techniques become available in
standard software packages, they'll be used. Until then, the procedures
stay on the curio shelf. The widespread availability of personal
computers and statistical program packages has caused a revolution in
the way data are analyzed. These changes continue with the release of
each new package and update.
**We are prisoners of our software!**Most analysts can do only what their software allows them to do. When techniques become available in standard software packages, they'll be used. Until then, the procedures stay on the curio shelf. The widespread availability of personal computers and statistical program packages have caused a revolution in the way data are analyzed. These changes continue with the release of each new package and update.