Significance Tests Simplified

If an observed significance level (P value) is less than 0.05, reject the null hypothesis; if it is greater than 0.05, fail to reject the null hypothesis.

That was the most difficult sentence I've had to write for these notes, not because it's wrong (indeed, it's what good analysts often do when they conduct a statistical test) but because its indiscriminate and blind use is at the root of much bad statistical practice. So, I hate to just come out and say it.

It is what good analysts do! At some point, all of the principles of good study design and execution will have been followed, a well-posed research question will have been developed, and the proper data will have been collected. An essential part of the subsequent analysis will address whether the results are "statistically significant" or just random noise. In classical statistics, significance tests are the way statistical significance is assessed.

There are two main reasons why the indiscriminate and blind use of significance tests is at the root of much bad statistical practice.

Often, investigators are so blinded by statistical significance that they all but ignore practical importance. Any result that is statistically significant is the cause of great joy and celebration, even if it is of no practical importance whatsoever. Groups whose difference is not statistically significant are reported as being "the same" or we are told that "there is no difference between the groups"!
The value 0.05 takes on mystical characteristics. A good analyst knows that 0.05 is not a magic number. There is little difference between P=0.04 and P=0.06. Construct a few 94% and 96% confidence intervals from the same data set and see how little they differ (one is about 10% longer than the other). Significance tests by themselves suggest that the research question is only about whether there is a difference, no matter the size or direction. The confidence intervals corresponding to P=0.04 and P=0.06 will probably show much the same thing, namely, they will rule out differences of practical importance in a particular direction. They will also suggest that, if there is a difference, it may or may not be of practical importance.
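
The suggestion to construct 94% and 96% confidence intervals from the same data can be tried with a few lines of Python. What follows is only a sketch: the data values are made up for illustration, the helper name normal_ci is my own, and the interval is the simple large-sample (z-based) interval for a mean rather than anything specific to a particular study.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_ci(data, level):
    """Large-sample (z-based) confidence interval for the mean of data."""
    center = mean(data)
    se = stdev(data) / sqrt(len(data))          # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)   # two-sided critical value
    return center - z * se, center + z * se


# Hypothetical measurements, invented for this illustration only.
data = [4.1, 5.3, 2.8, 6.0, 3.7, 5.1, 4.4, 3.9, 5.6, 4.8,
        3.2, 4.9, 5.8, 4.0, 4.6, 3.5, 5.2, 4.3, 4.7, 5.0]

low94, high94 = normal_ci(data, 0.94)
low96, high96 = normal_ci(data, 0.96)
ratio = (high96 - low96) / (high94 - low94)

print(f"94% CI: ({low94:.3f}, {high94:.3f})")
print(f"96% CI: ({low96:.3f}, {high96:.3f})")
print(f"96% interval is {100 * (ratio - 1):.1f}% longer")
```

For a z-based interval the ratio of the two lengths depends only on the two critical values, not on the data, so any data set gives the same percentage, a little over 9%. With small samples and t-based intervals the figure shifts slightly, but the two intervals still tell essentially the same story.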

Significance tests are never the whole answer. They are just a piece of the puzzle. Statistical significance is irrelevant if the effect is of no practical importance. That said, significance tests are an important and useful piece of the puzzle. Every so often, a cry is raised that P values should no longer be used because of the way they can be abused. Those who would abandon significance tests entirely because of the potential for misuse make an even greater mistake than those who abuse them.
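
To see how a result can be statistically significant and still of no practical importance, consider the following sketch. Every number in it is hypothetical: two very large groups, a tiny observed difference between their means, and an assumed threshold (smallest_important) for what would matter in practice. The test is the ordinary large-sample z test for two independent means.

```python
from math import sqrt
from statistics import NormalDist

# All numbers below are hypothetical, chosen only for illustration.
n_per_group = 100_000       # observations in each group
observed_diff = 0.1         # difference between the two group means
common_sd = 10.0            # standard deviation within each group
smallest_important = 1.0    # smallest difference that would matter in practice

# Standard error of the difference between two independent means.
se = common_sd * sqrt(2 / n_per_group)

# Two-sided P value from the large-sample z test.
z = observed_diff / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval for the difference.
z_crit = NormalDist().inv_cdf(0.975)
ci_low = observed_diff - z_crit * se
ci_high = observed_diff + z_crit * se

significant = p_value < 0.05
important_ruled_out = max(abs(ci_low), abs(ci_high)) < smallest_important

print(f"P = {p_value:.3f}")
print(f"95% CI for the difference: ({ci_low:.3f}, {ci_high:.3f})")
print("statistically significant:", significant)
print("practically important difference ruled out:", important_ruled_out)
```

With these made-up inputs the P value falls below 0.05, yet the whole confidence interval lies far inside the range of differences too small to matter. The P value alone would have hidden that; the interval, read against the practical threshold, makes it plain.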

Good Analyst, Bad Analyst

The difference between a good analyst and a bad analyst is that the bad analyst wants a P value and pays no attention to the quality of the data. The bad analyst sees only P<0.05 or P>0.05 with no regard for confidence intervals or for the context in which the data were collected.

The good analyst first considers whether all of the principles of good study design have been followed. The good analyst knows what a test procedure requires for the resulting P value to be valid. The good analyst treats the P value as an important part of the analysis, but not as the whole answer.