
Nonparametric Statistics
Gerard E. Dallal, Ph.D.

Before discussing nonparametric techniques, we should consider why the methods we usually use are called parametric. Parameters are indices. They index (or label) individual distributions within a particular family. For example, there are an infinite number of normal distributions, but each normal distribution is uniquely determined by its mean (μ) and standard deviation (σ). If you specify all of the parameters (here, μ and σ), you've specified a unique normal distribution.

Most commonly used statistical techniques are properly called parametric because they involve estimating or testing the value(s) of parameter(s)--usually, population means or proportions. It should come as no surprise, then, that nonparametric methods are procedures that work their magic without reference to specific parameters.

The precise definition of nonparametric varies slightly among authors [1]. You'll see the terms nonparametric and distribution-free. They have slightly different meanings, but are often used interchangeably--like arteriosclerosis and atherosclerosis.

Ranks

Many nonparametric procedures are based on ranked data. Data are ranked by ordering them from lowest to highest and assigning them, in order, the integer values from 1 to the sample size. Ties are resolved by assigning tied values the mean of the ranks they would have received if there were no ties, e.g., 117, 119, 119, 125, 128 becomes 1, 2.5, 2.5, 4, 5. (If the two 119s were not tied, they would have been assigned the ranks 2 and 3. The mean of 2 and 3 is 2.5.)
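The ranking rule just described can be sketched in a few lines of Python. This is only an illustration; the function name `midranks` is an assumption made here and does not come from any particular package.

```python
def midranks(values):
    """Rank values from lowest to highest; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Find the end of the run of tied values that begins at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end would have held ranks start+1..end+1.
        mean_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


print(midranks([117, 119, 119, 125, 128]))  # [1.0, 2.5, 2.5, 4.0, 5.0]
```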

For large samples, many nonparametric techniques can be viewed as the usual normal-theory-based procedures applied to ranks. The following table contains the names of some normal-theory-based procedures and their nonparametric counterparts. For smaller sample sizes, the same statistic (or one mathematically equivalent to it) is used, but decisions regarding its significance are made by comparing the observed value to special tables of critical values [2].

Some Commonly Used Statistical Tests

| Normal theory based test | Corresponding nonparametric test | Purpose of test |
|---|---|---|
| t test for independent samples | Mann-Whitney U test; Wilcoxon rank-sum test | Compares two independent samples |
| Paired t test | Wilcoxon matched pairs signed-rank test | Examines a set of differences |
| Pearson correlation coefficient | Spearman rank correlation coefficient | Assesses the linear association between two variables |
| One way analysis of variance (F test) | Kruskal-Wallis analysis of variance by ranks | Compares three or more groups |
| Two way analysis of variance | Friedman two way analysis of variance | Compares groups classified by two different factors |

Some nonparametric procedures

The Wilcoxon signed rank test is used to test whether the median of a symmetric population is 0. First, the data are ranked without regard to sign. Second, the signs of the original observations are attached to their corresponding ranks. Finally, the one sample z statistic (mean / standard error of the mean) is calculated from the signed ranks. For large samples, the z statistic is compared to percentiles of the standard normal distribution. For small samples, the statistic is compared to likely results if each rank were equally likely to have a + or - sign affixed.
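The three steps, together with the small-sample comparison, might be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the function names are invented here, zero differences are dropped (one common convention), the standard error uses the sample standard deviation of the signed ranks, and the data are made up.

```python
import math
from itertools import product


def midranks(values):
    """Ranks 1..n; tied values share the mean of the ranks they span."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def signed_rank_test(data):
    """Return (z, exact two-sided P-value) for the Wilcoxon signed rank test."""
    nonzero = [x for x in data if x != 0]        # assumption: zeros are dropped
    ranks = midranks([abs(x) for x in nonzero])  # rank without regard to sign
    signed = [r if x > 0 else -r for r, x in zip(ranks, nonzero)]
    n = len(signed)

    # One-sample z statistic: mean / standard error of the mean.
    mean = sum(signed) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in signed) / (n - 1))
    z = mean / (sd / math.sqrt(n))

    # Small-sample reference distribution: each rank is equally likely to
    # carry a + or - sign, so enumerate all 2**n sign patterns.
    observed = abs(sum(signed))
    extreme = sum(
        1
        for signs in product((1, -1), repeat=n)
        if abs(sum(s * r for s, r in zip(signs, ranks))) >= observed - 1e-9
    )
    return z, extreme / 2 ** n


# Hypothetical paired differences, invented for illustration.
differences = [1.2, -0.4, 2.5, 3.1, 0.8, 1.9, -0.3, 2.2]
z, p_exact = signed_rank_test(differences)
print(f"z = {z:.2f}, exact two-sided P = {p_exact:.4f}")
```

The enumeration visits 2^n sign patterns, so it is practical only for small samples, which is exactly where the normal approximation is least trustworthy.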

The Wilcoxon rank sum test (also known as the Mann-Whitney U test or the Wilcoxon-Mann-Whitney test) is used to test whether two samples are drawn from the same population. It is most appropriate when the likely alternative is that the two populations are shifted with respect to each other. The test is performed by ranking the combined data set, dividing the ranks into two sets according to the group membership of the original observations, and calculating a two sample z statistic, using the pooled variance estimate. For large samples, the statistic is compared to percentiles of the standard normal distribution. For small samples, the statistic is compared to what would result if the data were combined into a single data set and assigned at random to two groups having the same number of observations as the original samples.
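For small samples, the random-assignment comparison can be carried out by brute force. The sketch below uses the sum of the first group's ranks as the statistic, which orders the possible assignments the same way as the two sample statistic for a two-sided test. The function names and the data are assumptions made for illustration.

```python
from itertools import combinations


def midranks(values):
    """Ranks 1..n; tied values share the mean of the ranks they span."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def rank_sum_exact(a, b):
    """Return (rank sum of group a, exact two-sided P-value)."""
    ranks = midranks(list(a) + list(b))  # rank the combined data set
    n_a, n = len(a), len(a) + len(b)
    observed = sum(ranks[:n_a])
    expected = n_a * (n + 1) / 2         # mean rank sum under random assignment
    deviation = abs(observed - expected)

    # Enumerate every way of assigning n_a of the n ranks to the first group.
    hits = total = 0
    for idx in combinations(range(n), n_a):
        total += 1
        if abs(sum(ranks[i] for i in idx) - expected) >= deviation - 1e-9:
            hits += 1
    return observed, hits / total


# Hypothetical samples, invented for illustration.
group_a = [1.1, 2.3, 0.7, 1.8]
group_b = [3.5, 2.9, 4.1, 2.0, 3.8]
w, p_exact = rank_sum_exact(group_a, group_b)
print(f"rank sum = {w}, exact two-sided P = {p_exact:.4f}")
```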

Spearman's rho (Spearman rank correlation coefficient) is the nonparametric analog of the usual Pearson product-moment correlation coefficient. It is calculated by converting each variable to ranks and calculating the Pearson correlation coefficient between the two sets of ranks. For small sample sizes, the observed correlation coefficient is compared to what would result if the ranks of the X- and Y-values were random permutations of the integers 1 to n (sample size).
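A minimal sketch of this calculation follows: rank each variable, apply the Pearson formula to the ranks, and, for small n, compare the result with every permutation of one set of ranks. The function names and the data are assumptions made for illustration, and the permutation loop visits n! arrangements, so it is feasible only for small n.

```python
import math
from itertools import permutations


def midranks(values):
    """Ranks 1..n; tied values share the mean of the ranks they span."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Pearson correlation between the ranks of x and the ranks of y."""
    return pearson(midranks(x), midranks(y))


def spearman_exact_p(x, y):
    """Two-sided P-value from all permutations of the y ranks."""
    rx, ry = midranks(x), midranks(y)
    observed = abs(pearson(rx, ry))
    hits = total = 0
    for perm in permutations(ry):
        total += 1
        if abs(pearson(rx, perm)) >= observed - 1e-9:
            hits += 1
    return hits / total


# Hypothetical paired observations, invented for illustration.
x = [10, 20, 30, 40, 50]
y = [2.1, 1.3, 4.8, 3.9, 5.5]
print(f"rho = {spearman_rho(x, y):.2f}, exact two-sided P = {spearman_exact_p(x, y):.4f}")
```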

Since these nonparametric procedures can be viewed as the usual parametric procedures applied to ranks, it is reasonable to ask what is gained by using ranks in place of the raw data.

(1) Nonparametric tests make less stringent demands of the data. For standard parametric procedures to be valid, certain underlying conditions or assumptions must be met, particularly for smaller sample sizes. The one-sample t test, for example, requires that the observations be drawn from a normally distributed population. For two independent samples, the t test has the additional requirement that the population standard deviations be equal. If these assumptions/conditions are violated, the resulting P-values and confidence intervals may not be trustworthy [3]. However, normality is not required for the Wilcoxon signed rank or rank sum tests to produce valid inferences about whether the median of a symmetric population is 0 or whether two samples are drawn from the same population.

(2) Nonparametric procedures can sometimes be used to get a quick answer with little calculation.

Two of the simplest nonparametric procedures are the sign test and median test. The sign test can be used with paired data to test the hypothesis that differences are equally likely to be positive or negative (or, equivalently, that the median difference is 0). For small samples, an exact test of whether the proportion of positives is 0.5 can be obtained by using a binomial distribution. For large samples, the test statistic is

(plus - minus)² / (plus + minus) ,

where plus is the number of positive values and minus is the number of negative values. Under the null hypothesis that the positive and negative values are equally likely, the test statistic follows the chi-square distribution with 1 degree of freedom. Whether the sample size is small or large, the sign test provides a quick test of whether two paired treatments are equally effective simply by counting the number of times each treatment is better than the other.

Example: 15 patients are given both treatments A and B to test the hypothesis that they perform equally well. If 13 patients prefer A to B and 2 patients prefer B to A, the test statistic is (13 - 2)² / (13 + 2) = 8.07, with a corresponding P-value of 0.0045. The null hypothesis is therefore rejected.
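The arithmetic in this example can be checked with a short sketch. The function name is an assumption made here. The chi-square tail area for 1 degree of freedom is obtained from the identity P(Z² > x) = erfc(√(x/2)), and the exact P-value doubles the smaller binomial tail, which is one common two-sided convention.

```python
import math


def sign_test(plus, minus):
    """Return (chi-square statistic, large-sample P, exact binomial P)."""
    n = plus + minus
    chi_sq = (plus - minus) ** 2 / n

    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_large = math.erfc(math.sqrt(chi_sq / 2))

    # Exact two-sided test of proportion 0.5: double the smaller binomial tail.
    k = min(plus, minus)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    return chi_sq, p_large, p_exact


chi_sq, p_large, p_exact = sign_test(13, 2)
print(f"chi-square = {chi_sq:.2f}, large-sample P = {p_large:.4f}, exact P = {p_exact:.4f}")
```

With only 15 pairs the exact binomial P-value (about 0.0074) is somewhat larger than the chi-square approximation, although both lead to the same conclusion here.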

The median test is used to test whether two samples are drawn from populations with the same median. The median of the combined data set is calculated and each original observation is classified according to its original sample (A or B) and whether it is less than or greater than the overall median. The chi-square test for homogeneity of proportions in the resulting 2-by-2 table tests whether the population medians are equal.
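As a sketch, the median test can be applied to the retinyl ester values listed in the Example section of this article. The function name is an assumption made here, observations equal to the overall median are simply left out of the table, and the chi-square statistic is computed without a continuity correction.

```python
import math
from statistics import median


def median_test(a, b):
    """Return (overall median, 2-by-2 table, chi-square, large-sample P)."""
    m = median(list(a) + list(b))
    # Rows: sample A, sample B. Columns: below median, above median.
    table = [[sum(x < m for x in g), sum(x > m for x in g)] for g in (a, b)]
    (p, q), (r, s) = table
    n = p + q + r + s
    # Chi-square test for homogeneity of proportions in a 2-by-2 table
    # (undefined if any row or column total is zero).
    chi_sq = n * (p * s - q * r) ** 2 / ((p + q) * (r + s) * (p + r) * (q + s))
    p_value = math.erfc(math.sqrt(chi_sq / 2))  # chi-square tail, 1 df
    return m, table, chi_sq, p_value


# Retinyl ester concentrations from the Example section of this article.
type_v = [1.4, 2.5, 4.6, 0.0, 0.0, 2.9, 1.9, 4.0, 2.0]
normal = [30.9, 134.6, 13.6, 28.9, 434.1, 101.7, 85.1, 26.5, 44.8]

m, table, chi_sq, p_value = median_test(type_v, normal)
print(f"median = {m:.1f}, table = {table}, chi-square = {chi_sq:.1f}, P = {p_value:.5f}")
```

With cell counts this small an exact test is preferable to the chi-square approximation; the approximation is shown only because it is the calculation described above.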

(3) Nonparametric methods provide an air of objectivity when there is no reliable (universally recognized) underlying scale for the original data and there is some concern that the results of standard parametric techniques would be criticized for their dependence on an artificial metric. For example, patients might be asked whether they feel extremely uncomfortable / uncomfortable / neutral / comfortable / very comfortable. What scores should be assigned to the comfort categories and how do we know whether the outcome would change dramatically with a slight change in scoring? Some of these concerns are blunted when the data are converted to ranks [4].

(4) A historical appeal of rank tests is that it was easy to construct tables of exact critical values, provided there were no ties in the data. The same critical value could be used for all data sets with the same number of observations because every data set is reduced to the ranks 1,...,n. However, this advantage has been eliminated by the ready availability of personal computers [5].

(5) Sometimes the data do not constitute a random sample from a larger population. The data in hand are all there are. Standard parametric techniques based on sampling from larger populations are no longer appropriate. Because there are no larger populations, there are no population parameters to estimate. Nevertheless, certain kinds of nonparametric procedures can be applied to such data by using randomization models.

From Dallal (1988):

Consider, for example, a situation in which a company's workers are assigned in haphazard fashion to work in one of two buildings. After yearly physicals are administered, it appears that workers in one building have higher lead levels in their blood. Standard sampling theory techniques are inappropriate because the workers do not represent samples from a large population--there is no large population. The randomization model, however, provides a means for carrying out statistical tests in such circumstances. The model states that if there were no influence exerted by the buildings, the lead levels of the workers in each building should be no different from what one would observe after combining all of the lead values into a single data set and dividing it in two, at random, according to the number of workers in each building. The stochastic component of the model, then, exists only in the analyst's head; it is not the result of some physical process, except insofar as the haphazard assignment of workers to buildings is truly random.

Of course, randomization tests cannot be applied blindly any more than normality can automatically be assumed when performing a t test. (Perhaps, in the lead levels example, one building's workers tend to live in urban settings while the other building's workers live in rural settings. Then the randomization model would be inappropriate.) Nevertheless, there will be many situations where the less stringent requirements of the randomization test will make it the test of choice. In the context of randomization models, randomization tests are the ONLY legitimate tests; standard parametric tests are valid only as approximations to randomization tests. [6]
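The randomization model in the lead-level example can be sketched directly on the raw values: combine the data, divide them in two in every possible way that preserves the group sizes, and see how often the difference in means is at least as large as the one observed. The function name and the lead levels below are invented for illustration and are not taken from Dallal (1988).

```python
from itertools import combinations


def randomization_test(group_a, group_b):
    """Exact two-sided randomization test based on the difference in means."""
    combined = list(group_a) + list(group_b)
    n_a, n = len(group_a), len(combined)
    total = sum(combined)

    def mean_difference(sum_a):
        # Mean of the first group minus mean of the remaining observations.
        return sum_a / n_a - (total - sum_a) / (n - n_a)

    observed = abs(mean_difference(sum(group_a)))

    # Every division of the combined data into groups of the original sizes.
    hits = count = 0
    for idx in combinations(range(n), n_a):
        count += 1
        if abs(mean_difference(sum(combined[i] for i in idx))) >= observed - 1e-9:
            hits += 1
    return hits / count


# Hypothetical blood lead levels for workers in two buildings.
building_1 = [12, 15, 11, 14]
building_2 = [18, 21, 17, 20, 19]
print(f"exact two-sided P = {randomization_test(building_1, building_2):.4f}")
```

The number of divisions grows quickly with sample size, so for larger data sets a random sample of divisions is usually examined instead of all of them.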

Such a strong case has been made for the benefits of nonparametric procedures that some might ask why parametric procedures aren't abandoned entirely in favor of nonparametric methods!

The major disadvantage of nonparametric techniques is contained in their name. Because the procedures are nonparametric, there are no parameters to describe and it becomes more difficult to make quantitative statements about the actual difference between populations. (For example, when the sign test says two treatments are different, there's no confidence interval and the test doesn't say by how much the treatments differ.) With the right software, it is sometimes possible to compute estimates (and even confidence intervals!) for medians and differences between medians. However, the calculations are often too tedious for pencil and paper; a computer is required. As statistical software goes through its various iterations, such confidence intervals may become readily available, but I'm still waiting! [7]

The second disadvantage is that nonparametric procedures throw away information! The sign test, for example, uses only the signs of the observations. Ranks preserve information about the order of the data but discard the actual values. Because information is discarded, nonparametric procedures can never be as powerful (able to detect existing differences) as their parametric counterparts when parametric tests can be used.

How much information is lost? One answer is given by the asymptotic relative efficiency (ARE) which, loosely speaking, describes the ratio of sample sizes required (parametric to nonparametric) for a parametric procedure to have the same ability to reject a null hypothesis as the corresponding nonparametric procedure. When the underlying distributions are normal (with equal population standard deviations for the two-sample case), the AREs are:

| Procedure | ARE |
|---|---|
| sign test | 2/π = 0.637 |
| Wilcoxon signed-rank test | 3/π = 0.955 |
| median test | 2/π = 0.637 |
| Wilcoxon-Mann-Whitney U test | 3/π = 0.955 |
| Spearman correlation coefficient | 0.91 |

Thus, if the data come from a normally distributed population, the usual z statistic requires only 637 observations to demonstrate a difference when the sign test requires 1000. Similarly, the t test requires only 955 to the Wilcoxon signed-rank test's 1000. It has been shown that the ARE of the Wilcoxon-Mann-Whitney test is always at least 0.864, regardless of the underlying population. Many say the AREs are so close to 1 for procedures based on ranks that they are the best reason yet for using nonparametric techniques!
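The sample-size arithmetic behind these statements is simple enough to check directly. In the sketch below, the helper name `parametric_sample_size` is an assumption made for illustration; it multiplies the nonparametric sample size by the ARE and rounds to the nearest whole observation.

```python
import math

# AREs relative to the normal-theory procedure when the data are normal.
ARE_SIGN_TEST = 2 / math.pi
ARE_WILCOXON = 3 / math.pi


def parametric_sample_size(nonparametric_n, are):
    """Observations the parametric test needs to match the nonparametric test."""
    return round(nonparametric_n * are)


print(f"sign test ARE = {ARE_SIGN_TEST:.3f}, Wilcoxon ARE = {ARE_WILCOXON:.3f}")
print(parametric_sample_size(1000, ARE_SIGN_TEST))  # 637
print(parametric_sample_size(1000, ARE_WILCOXON))   # 955
```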

Other procedures

Nonparametric statistics is a field of specialization in its own right. Many procedures have not been touched upon here. These include the Kolmogorov-Smirnov test for the equality of two distribution functions, Kruskal-Wallis one-way analysis of variance, Friedman two-way analysis of variance, and the logrank test and Gehan's generalized Wilcoxon test for comparing two survival distributions. It would not be too much of an exaggeration to say that for every parametric test there is a nonparametric analogue that allows some of the assumptions of the parametric test to be relaxed. Many of these procedures are discussed in Siegel (1956), Hollander and Wolfe (1973) and Lee (1992).

Example

Ellis et al. (1986) report in summary form the retinyl ester concentrations (mg/dl) of 9 normal individuals and 9 type V hyperlipoproteinemic individuals. Although all of the normal individuals have higher concentrations than those of the abnormals, these data are not quite significant at the 0.05 level according to the t test using Satterthwaite's approximation for unequal variances. But even the lowly median test points to substantial differences between the two groups.

```
Type V hyperlipoproteinemic      Normal

1.4                          30.9
2.5                         134.6
4.6                          13.6
0.0                          28.9
0.0                         434.1
2.9                         101.7
1.9                          85.1
4.0                          26.5
2.0                          44.8

H
H
H
H                              X
H                             XXXXX X            X
min--------------------max    min--------------------max
an H =    2 cases             an X =    2 cases

mean          2.1444          mean        100.0222
SD            1.5812          SD          131.7142
SEM            .5271          SEM          43.9048
sample size        9          sample size        9

statistics           P-value    df

t (separate)    -2.23       .0564     8.0
t (pooled)      -2.23       .0405    16
F (variances) 6938.69       .0000     8,  8

< median   > median
Group 1       9          0
Group 2       0          9           P-value (exact) =  .0000

Wilcoxon-Mann-Whitney test:  P-value =  .0000
Pitman randomization  test:  P-value =  .0000   (data * 1E 0)
```
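The summary statistics and t tests in this output can be reproduced from the listed data. The sketch below is a cross-check, not the program that produced the output; the function name is an assumption made here. It uses the separate-variance t statistic with Satterthwaite's approximate degrees of freedom. With equal group sizes the pooled t statistic has the same value and differs only in its degrees of freedom (16).

```python
import math
from statistics import mean, stdev

type_v = [1.4, 2.5, 4.6, 0.0, 0.0, 2.9, 1.9, 4.0, 2.0]
normal = [30.9, 134.6, 13.6, 28.9, 434.1, 101.7, 85.1, 26.5, 44.8]


def separate_variance_t(a, b):
    """Return (t, Satterthwaite df) for two independent samples."""
    va = stdev(a) ** 2 / len(a)  # squared standard error of each mean
    vb = stdev(b) ** 2 / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


for label, data in (("Type V", type_v), ("Normal", normal)):
    sem = stdev(data) / math.sqrt(len(data))
    print(f"{label}: mean = {mean(data):.4f}, SD = {stdev(data):.4f}, SEM = {sem:.4f}")

t, df = separate_variance_t(type_v, normal)
variance_ratio = (stdev(normal) / stdev(type_v)) ** 2
print(f"t (separate) = {t:.2f}, df = {df:.1f}, F (variances) = {variance_ratio:.2f}")
```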

References
• Bradley JV (1968), Distribution Free Statistical Tests. Prentice Hall: Englewood Cliffs, NJ.
• Dallal GE (1988), "PITMAN: A FORTRAN Program for Exact Randomization Tests," Computers and Biomedical Research, 21, 9-15.
• Ellis JK, Russell RM, Makrauer FL, and Schaefer EJ (1986), "Increased Risk for Vitamin A Toxicity in Severe Hypertriglyceridemia," Annals of Internal Medicine, 105, 877-879.
• Fisher LD and van Belle G (1993), Biostatistics: A Methodology for the Health Sciences. New York: John Wiley & Sons, Inc.
• Hollander M and Wolfe DA (1973), Nonparametric Statistical Methods. New York: John Wiley & Sons, Inc.
• Lee ET (1992), Statistical Methods for Survival Data Analysis. New York: John Wiley & Sons, Inc.
• Lehmann EL (1975), Nonparametrics: Statistical Methods Based on Ranks. San Francisco: Holden-Day, Inc.
• Mehta C and Patel N (1992), StatXact-Turbo: Statistical Software for Exact Nonparametric Inference. Cambridge, MA: CYTEL Software Corporation.
• Siegel S (1956), Nonparametric Statistics. New York: McGraw-Hill Book Company, Inc.
• Velleman PF and Wilkinson L (1993), "Nominal, Ordinal, Interval, and Ratio Typologies Are Misleading," The American Statistician, 47, 65-72.

Notes

1. For example:

Fisher and van Belle (1993, p. 306): A family of probability distributions is nonparametric if the distributions of the family cannot be conveniently characterized by a few parameters. [For example, all possible continuous distributions.] Statistical procedures that hold or are valid for a nonparametric family of distributions, are called nonparametric statistical procedures.

Bradley (1968, p. 15): The terms nonparametric and distribution-free are not synonymous . . . Popular usage, however, has equated the terms . . . Roughly speaking, a nonparametric test is one which makes no hypothesis about the value of a parameter in a statistical density function, whereas a distribution-free test is one which makes no assumptions about the precise form of the sampled population.

Lehmann (1975, p. 58): . . . distribution-free or nonparametric, that is, free of the assumption that [the underlying distribution of the data] belongs to some parametric family of distributions.

2. For small samples, the tables are constructed by straightforward enumeration. For Spearman's correlation coefficient, the possible values of the correlation coefficient are enumerated by holding one set of values fixed at 1,...,n and pairing it with every possible permutation of 1,...,n. For the Wilcoxon signed rank test, the values of the test statistic (whether it be the t statistic or, equivalently, the sum of the positive ranks) are enumerated for all 2^n ways of labelling the ranks with + or - signs. Similar calculations underlie the construction of tables of critical values for other procedures. Because the critical values are based on all possible permutations of the ranks, these procedures are sometimes called permutation tests.
3. On the other hand, a violation of the standard assumptions can often be handled by analyzing some transformation of the raw data (logarithmic, square root, and so on). For example, when the within-group standard deviation is seen to be roughly proportional to the mean, a logarithmic transformation will produce samples with approximately equal standard deviations. Some researchers are unnecessarily anxious about transforming data because they view it as tampering. However, it is important to keep in mind that the point of the transformation is to ensure the validity of the analysis (normal distribution, equal standard deviations) and not to ensure a certain type of outcome. Given a choice between two transformations, one that produced a statistically significant result and another that produced an insignificant result, I would always believe the result for which the data more closely met the requirements of the procedure being applied. This is no different from trusting the results of a fasting blood sample, if that is what is required, when both fasting and non-fasting samples are available.
4. Many authors discuss "scales of measurement," using terms such as nominal, ordinal, interval, or ratio data as guides to what statistical procedure can be applied to a data set. The terminology often fails in practice because, as Velleman and Wilkinson (1993) observe, "scale type...is not an attribute of the data, but rather depends upon the questions we intend to ask of the data and upon any additional information we might have." Thus, patient identification number might be ordinarily viewed as a nominal variable (that is, a mere label). However, IDs are often assigned sequentially and in some cases it may prove fruitful to look for relationships between ID and other important variables. While the ideas behind scales of measurement are important, the terminology itself is best ignored. Just be aware that when you score neutral as 0, comfortable as 1, and very comfortable as 2, you should be wary of any procedure that relies heavily on treating "very comfortable" as being twice as comfortable as comfortable.
5. The ready availability of computers has made much theoretical work concerning approximations and corrections for ties in the data obsolete, too. Ties were a problem because, with ties, a set of n observations does not reduce to the set of ranks 1,...,n. The particular set of ranks depends on the number and pattern of ties. In the past, corrections to the usual z statistic were developed to adjust for tied ranks. Today, critical values for exact nonparametric tests involving data with ties can be calculated on demand by specialized computer programs such as StatXact (Mehta and Patel, 1992).
6. The data need not be converted to ranks in order to perform a permutation test. However, if the raw data are used, a critical value must be calculated for the specific data set if the sample size is small or moderate. (The usual t test has been shown to be a large sample approximation to the permutation test!) At one time, the computational complexity of this task for moderate and even small samples was considered a major disadvantage. It has become largely irrelevant due to specialized computer programs that perform the calculations in an efficient manner.
7. This illustrates an often unspoken aspect of statistical computing: We are prisoners of our software! Most analysts can do only what their software allows them to do. When techniques become available in standard software packages, they'll be used. Until then, the procedures stay on the curio shelf. The widespread availability of personal computers and statistical program packages has caused a revolution in the way data are analyzed. These changes continue with the release of each new package and update.