Significance

2/15/2015

I was struck by this quote from a Forbes piece on the secondhand smoke research.

"there’s no such thing as borderline statistical significance. It’s either significant or it’s not."

It's attributed to a journalist named Christopher Snowdon (I don't know who that is.)

It's false, and I think it's important for us to convey a clearer message to the public about what statistical significance is.

tl;dr: It's a business decision, and by the way, how many fingers do you have? (thumbs included...)

Significance is not a binary or discrete property of a scientific finding. Our convention in social science, and I think in lots of biomedical fields, is the .05 threshold. I'll return to this.

The statistical significance of an effect, its p-value, is the probability of drawing a random sample with the measured characteristics of our sample, or more extreme ones, if the null hypothesis is true. Note that this is not the same thing as saying the probability of our research hypothesis being true is 1 – p, or 95% or greater given our standard .05 threshold. Significance is often mis-explained as the complement of the probability that our hypothesis is true. That's not what it means. And there are other assumptions, particularly regarding normal distributions, that will affect the meaning of all of this.

By measured characteristics, I mean a sample that looks like our sample in the study. So if we conduct a longitudinal study with a large sample of women and track them on variables like lung cancer and passive smoking, we end up with X% having lung cancer, Y% having lived in a home with a smoker, Z% who have lung cancer and lived with a smoker, and variance on other variables like length of time a person has lived with a smoker, age, race, lifestyle, etc.

The null hypothesis is that passive smoking does not cause lung cancer in nonsmokers – that there is no relationship between these variables (that would be one of several hypotheses in the actual study referenced by Forbes, because they also tracked smokers.)

So significance here means the probability of drawing a random sample from the population with differences at least as large as the ones we see in our sample, assuming there is in fact no link between passive smoking and lung cancer rates (that the null hypothesis is true).

We can see a few things here. Given typical sample sizes, if 2.00% of women who lived with a smoker get lung cancer and 2.00% of women who never lived with a smoker get lung cancer, there won't be a significant effect. The core reason is that this is exactly what we'd expect to see if the null hypothesis is true. If there's no actual link between these variables in the population, it's likely that we'd draw a random sample that looked like ours: a sample with little or no difference between the groups. The larger the sample, the closer to zero we'd expect that difference to be. In this kind of scenario, where the groups are identical or nearly so, your p-value might be something like 0.70 or 0.80, or higher. The particular value doesn't matter. What matters is that it's well above our threshold of 0.05 (and more importantly, that there is no difference between groups, no effect).

If we do see differences between groups in our sample, the p-value will be lower, because if the null hypothesis is true, we wouldn't expect to see such differences.

How low that p-value goes will depend on the size of the difference between these groups (the effect size) and the sample size. As the effect size goes up, the p-value goes down because it becomes less and less likely that we'd see such differences in a random sample if the null hypothesis is true.
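To make that concrete, here is a minimal sketch in Python. It is not the analysis from the study Forbes referenced. It assumes two equal-sized groups and uses a pooled two-proportion z-test with a normal approximation; the function name `two_proportion_p_value`, the lung cancer rates, and the group sizes are all made up for illustration.

```python
import math

def two_proportion_p_value(rate_a, rate_b, n_per_group):
    """Two-sided p-value from a pooled two-proportion z-test.

    Assumes equal group sizes and a normal approximation (hypothetical setup).
    """
    pooled = (rate_a + rate_b) / 2  # pooled rate when the groups are the same size
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    if se == 0:
        return 1.0  # no variation at all, so no evidence against the null
    z = abs(rate_a - rate_b) / se
    # Two-sided tail area of the standard normal distribution
    return math.erfc(z / math.sqrt(2))

# Same sample size, growing difference between groups (effect size)
for rate_exposed in (0.025, 0.030, 0.040):
    p = two_proportion_p_value(rate_exposed, 0.020, 5000)
    print(f"exposed rate {rate_exposed:.3f} vs 0.020, n = 5000: p = {p:.4f}")

# Same difference between groups, growing sample size
for n in (1000, 5000, 20000):
    p = two_proportion_p_value(0.030, 0.020, n)
    print(f"0.030 vs 0.020, n = {n}: p = {p:.4f}")
```

In both loops the printed p-values shrink from one line to the next: once because the gap between groups widens, and once because the same gap is measured on more people.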

For a given effect size, if the sample size goes up, the p-value goes down, because a larger sample size reduces the likelihood of random sampling error. It's like flipping a coin over and over. If you flip it only three times, you could easily get three straight tails, but as you keep flipping you'll get to a more or less even split of heads and tails.
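The coin example can be checked with exact binomial arithmetic. This is a small sketch assuming a fair coin; the function name `prob_share_of_tails_at_least` and the 75% cutoff are just illustrative choices.

```python
import math

def prob_share_of_tails_at_least(n_flips, share):
    """Exact probability that a fair coin gives at least `share` tails in n_flips."""
    # Smallest number of tails that reaches the requested share
    # (the tiny offset guards against floating-point noise in share * n_flips)
    needed = math.ceil(share * n_flips - 1e-9)
    favorable = sum(math.comb(n_flips, k) for k in range(needed, n_flips + 1))
    return favorable / 2 ** n_flips

# Three straight tails in three flips is not rare at all
print(prob_share_of_tails_at_least(3, 1.0))  # 0.125

# A lopsided result (75% tails or more) gets rarer as the flips pile up
for n in (4, 20, 100, 400):
    print(n, prob_share_of_tails_at_least(n, 0.75))
```

A lopsided split that is unremarkable in a handful of flips becomes vanishingly unlikely in a few hundred, which is the sense in which a larger sample reduces random sampling error.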

As I said, our threshold is 0.05, meaning a 5% or less chance that we would draw a sample like ours if the null hypothesis were true.

Why .05? At the margins, it's arbitrary. What's the point of a threshold? The point is to have some standard that reduces Type 1 error – detecting an effect that is not real. At the same time, we want to be able to talk about effects and to report findings that are likely to be real.
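Here is a rough simulation of what a threshold buys us, sketched under loud assumptions: every simulated "study" compares two groups drawn from the same population (so the null hypothesis is true by construction), the 30% base rate and group size of 200 are invented, and the test is a pooled two-proportion z-test with a normal approximation. The names `two_sided_p` and `false_positive_rate` are hypothetical helpers, not anything from a stats library.

```python
import math
import random

def two_sided_p(count_a, count_b, n):
    """Two-sided p-value for two equal-sized groups (pooled z-test, normal approx.)."""
    pooled = (count_a + count_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = abs(count_a - count_b) / n / se
    return math.erfc(z / math.sqrt(2))

def false_positive_rate(alpha, n_studies=2000, n_per_group=200, true_rate=0.3, seed=0):
    """Share of null-is-true studies that come out 'significant' at threshold alpha."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(n_studies):
        # Both groups share the same true rate: any difference is sampling error
        a = sum(rng.random() < true_rate for _ in range(n_per_group))
        b = sum(rng.random() < true_rate for _ in range(n_per_group))
        if two_sided_p(a, b, n_per_group) < alpha:
            hits += 1
    return hits / n_studies

for alpha in (0.01, 0.05, 0.10):
    print(alpha, false_positive_rate(alpha))
```

The share of false alarms should land close to whatever threshold we pick, so a stricter threshold means fewer Type 1 errors, at the cost of missing more real effects.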

Scientists could have settled on .10 or .04 or any of a number of values. Like I said, at the margins it's arbitrary. The specific choice of .05 is partly due to the fact that you probably have ten fingers. You might remember I asked you to count them. A lot of our choices of thresholds and rules of thumb are driven by the fact that we use a base-ten number system. Five is half of ten, and so we tend to settle on values that are multiples of five or ten. There was almost no chance we would've chosen .04 or .06. Those numbers don't satisfy us the way fives and tens do. (If humans had eight fingers instead of ten, we might very well have chosen .04.)

As you can infer from above, significance is a continuum. We could have a p = 0.08 situation and that effect could easily be a true effect. In fact, an effect with p = 0.30 could be a true effect. But especially in that case, when we're getting up to a 30% chance of drawing a sample like ours, we don't want to report that as significant. Effects with p-values of 0.06 or 0.09, on the other hand, are often reported, and should be. We report them as something like "this was significant at p = 0.06" or "this was marginally significant at p = 0.09". Note that we're still using the word significant. We can use that word given any p-value, as long as we include the p-value.

That's why Snowdon is wrong. The choice of 0.05 is a business decision that achieves a good tradeoff in our levels of Type 1 error vs Type 2 error (failing to detect a true effect.) But there's nothing natural or inherently meaningful about p = 0.05. It's not a value derived from nature, like Planck's constant. It wasn't discovered. There's no "significance" in nature. Like I said, it's a business decision.

3 Comments

Jonathan Jones

2/15/2015 06:28:04 pm

If you mean Christopher Snowdon then he's a fairly well-known UK journalist and writer, probably best known for "The Spirit Level Delusion". He blogs at http://velvetgloveironfist.blogspot.co.uk/

If humans had eight fingers instead of ten, we might indeed very well have chosen 0.04, but it would have been 0.04 base-8, which is 0.0625 base-10.

Joe Duarte

2/15/2015 07:39:32 pm

Excellent point. I should have clarified that.

For people who aren't following, Jonathan is noting that the numerals 0.04, what linguists call glyphs, represent an entirely different value in a base-8 number system.

The lower the base, the larger the value that the glyph 0.04 is going to represent, which is interesting. In base-7 it equals 0.082 base-10.
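For anyone who wants to check those conversions, here is a small sketch. The helper name `fraction_glyphs_to_decimal` is made up for this example; it just treats each digit after the point as a count of base^-1, base^-2, and so on.

```python
def fraction_glyphs_to_decimal(digits, base):
    """Base-10 value of 0.d1 d2 d3... when the digits are read in `base`."""
    if any(d < 0 or d >= base for d in digits):
        raise ValueError("digit out of range for this base")
    # The digit in position k after the point is worth digit / base**k
    return sum(d / base ** position for position, d in enumerate(digits, start=1))

# The glyphs "0.04" read in base 10, base 8, and base 7
for base in (10, 8, 7):
    print(base, round(fraction_glyphs_to_decimal([0, 4], base), 4))
# 10 0.04
# 8 0.0625
# 7 0.0816
```

The same two glyphs stand for 4/100, 4/64, and 4/49 respectively, which is why the value grows as the base shrinks.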

So having fewer fingers would perhaps increase our Type-1 error rate, up to some limit probably. It would be complicated. Type-1 error at any given threshold is the same across different bases and universes, but a different threshold would change what was published. More effects would be published. 0.06 base-10 would not be marginally significant, but fully "significant" without qualification. So 0.08 and 0.09 base-10 would be more acceptable, equivalent to our 0.06. The base rate of *published* Type-1 errors would increase. The base rate of actual Type-1 errors would be the same unless the different threshold changed the quality of the research in some way.

There would be no 97% in a base-8 world, because the glyphs 8 and 9 would not exist. The psychological equivalent of 97% would be displayed as 75%, probably (100 - 3; 100 base-8 is one higher than 77 base-8), which is 95% base-10 (61/64).

This was fun :-)

Nick

3/5/2015 09:29:31 pm

Excellent point about the arbitrariness of .05 (and, indeed, much else that occurs in research when arbitrary numbers become "facts"). We mentioned this in our debunking of the positivity ratio (http://arxiv.org/abs/1307.7006), where an arbitrary "10" chosen by someone else acquired quasi-axiomatic significance.

I just wish I'd used 7 rather than 8 fingers on each hand, as using 8 introduced a second occurrence of the number 16 for the pair(*) of hands, which managed to confuse at least one non-mathematical reader who was doing his best to understand our arguments!

(*) Another arbitrary number. Maybe that's where Fisher got "half" of .10 from. Compared to 10 fingers, an even larger number of things would be different in science if humans didn't have such overwhelming left-right symmetry.


José L. Duarte
