I was struck by this quote from a Forbes piece on the secondhand smoke research.
"there’s no such thing as borderline statistical significance. It’s either significant or it’s not."
It's attributed to a journalist named Christopher Snowdon (I don't know who that is.)
It's false, and I think it's important for us to convey a clearer message to the public about what statistical significance is.
tl;dr: It's a business decision, and by the way, how many fingers do you have? (thumbs included...)
Significance is not a binary or discrete property of a scientific finding. Our convention in social science, and I think in lots of biomedical fields, is the .05 threshold. I'll return to this.
The statistical significance of an effect – its p-value – is the likelihood of drawing a random sample with the measured characteristics of our sample, or characteristics even more extreme, if the null hypothesis is true. Note that this is not the same thing as saying the likelihood of our research hypothesis being true is 1 – p, or 95% or greater given our standard .05 threshold. Significance is often mis-explained as the inverse likelihood of our hypothesis being true. That's not what it means. And there are other assumptions, particularly regarding normal distributions, that will impact the meaning of all of this.
By measured characteristics, I mean a sample that looks like our sample in the study. So if we conduct a longitudinal study with a large sample of women and track them on variables like lung cancer and passive smoking, we end up with X% having lung cancer, Y% having lived in a home with a smoker, Z% who have lung cancer and lived with a smoker, and variance on other variables like length of time a person has lived with a smoker, age, race, lifestyle, etc.
The null hypothesis is that passive smoking does not cause lung cancer in nonsmokers – that there is no relationship between these variables (that would be one of several hypotheses in the actual study referenced by Forbes, because they also tracked smokers.)
So significance here means the probability of drawing a random sample from the population that shows a link between passive smoking and lung cancer at least as strong as the one we see in our sample (the percentages and so forth), assuming there is in fact no link between passive smoking and lung cancer rates (that the null hypothesis is true.)
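We can make that definition concrete with a small simulation. This is a sketch with made-up numbers, not figures from the actual study: we assume the null hypothesis is true (the same 2% cancer rate in both groups), draw many random samples, and count how often the gap between groups comes out at least as large as a hypothetical "observed" gap. That fraction is an estimate of the p-value.

```python
import random

random.seed(42)

BASE_RATE = 0.02        # hypothetical cancer rate, same in both groups under the null
N_PER_GROUP = 500       # hypothetical sample size per group
OBSERVED_DIFF = 0.02    # hypothetical observed gap between groups (2 percentage points)
N_SIMULATIONS = 2_000

def null_sample_diff():
    """Draw one random sample assuming no real link, return the gap in rates."""
    exposed = sum(random.random() < BASE_RATE for _ in range(N_PER_GROUP))
    unexposed = sum(random.random() < BASE_RATE for _ in range(N_PER_GROUP))
    return abs(exposed - unexposed) / N_PER_GROUP

# The p-value is (approximately) the fraction of null-hypothesis samples
# whose gap is at least as extreme as the one we observed.
extreme = sum(null_sample_diff() >= OBSERVED_DIFF for _ in range(N_SIMULATIONS))
p_value = extreme / N_SIMULATIONS
print(f"simulated p-value: {p_value:.3f}")
```

The p-value is not computed this way in practice (there are closed-form tests), but the simulation is the definition made literal: how often would a world with no real effect hand us data that looks at least this suggestive?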
We can see a few things here. Given typical sample sizes, if 2.00% of women who lived with a smoker get lung cancer and 2.00% of women who never lived with a smoker get lung cancer, there won't be a significant effect. The core reason is that this is exactly what we'd expect to see if the null hypothesis is true. If there's no actual link between these variables in the population, it's likely that we'd draw a random sample that looked like ours – a sample with no differences between the groups. This likelihood goes up as the sample size goes up. In this kind of scenario, your p-value might be something like 0.70 or 0.80. The particular value doesn't matter – what matters is that it's well above our threshold of 0.05 (and more importantly, that there is no difference between groups, no effect.)
If we do see differences between groups in our sample, the p-value will be lower, because if the null hypothesis is true, we wouldn't expect to see such differences.
How low that p-value goes will depend on the size of the difference between these groups (the effect size) and the sample size. As the effect size goes up, the p-value goes down because it becomes less and less likely that we'd see such differences in a random sample if the null hypothesis is true.
If the sample size goes up, the p-value goes down, because a larger sample size reduces the likelihood of random sampling error. It's like flipping a coin over and over. If you flip it only three times, you could easily get three straight tails, but as you keep flipping you'll get to a more or less even split of heads and tails.
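Both of these levers can be seen in a standard two-proportion z-test, hand-rolled here with nothing but the math module (the rates and sample sizes are invented for illustration): hold the gap between groups fixed while the sample grows and the p-value falls, or hold the sample fixed while the gap grows and it falls too.

```python
import math

def two_prop_p_value(p1, p2, n):
    """Two-sided p-value for a difference in proportions between two
    equal-sized groups of n each, using the normal approximation."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = abs(p1 - p2) / se
    # Normal CDF via the error function; two-sided tail probability:
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# Same hypothetical gap (2.0% vs 2.6% cancer rates), growing sample size:
for n in (1_000, 5_000, 20_000):
    print(f"n = {n:>6}: p = {two_prop_p_value(0.020, 0.026, n):.4f}")

# Same sample size, growing gap:
for p2 in (0.022, 0.026, 0.034):
    print(f"rates 2.0% vs {p2:.1%}: p = {two_prop_p_value(0.020, p2, 5_000):.4f}")
```

With these made-up numbers, the identical 2.0% vs 2.6% gap is nowhere near significant with 1,000 women per group but is highly significant with 20,000 per group – the effect didn't change, only our ability to distinguish it from sampling noise.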
As I said, our threshold is 0.05, meaning a 5% or less chance that we would draw a sample showing an effect like ours, or a larger one, if the null hypothesis were true.
Why .05? At the margins, it's arbitrary. What's the point of a threshold? The point is to have some standard that reduces Type 1 error – detecting an effect that is not real. At the same time, we want to be able to talk about effects and to report findings that are likely to be real.
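The threshold pins down the Type 1 error rate directly: if the null hypothesis is true and we cry "effect!" whenever p < .05, we'll be wrong about 5% of the time. A quick simulation sketches this, using the fact that under the null a z statistic is approximately standard normal, so each simulated "study" reduces to one draw:

```python
import random

random.seed(0)

N_STUDIES = 100_000
CRITICAL_Z = 1.96   # |z| > 1.96 corresponds to p < 0.05, two-sided

# Under the null hypothesis the test statistic is (approximately) a
# standard normal draw, so a "study" with no real effect is just one z.
false_alarms = sum(abs(random.gauss(0, 1)) > CRITICAL_Z for _ in range(N_STUDIES))
print(f"Type 1 error rate: {false_alarms / N_STUDIES:.3f}")  # close to 0.05
```

Pick a looser threshold like .10 and the false-alarm rate doubles; pick a stricter one and you start missing real effects instead. That tradeoff, not anything in nature, is what the threshold manages.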
Scientists could have settled on .10 or .04 or any of a number of values. Like I said, at the margins it's arbitrary. The specific choice of .05 is partly due to the fact that you probably have ten fingers. You might remember I asked you to count them. A lot of our choices of thresholds and rules of thumb are driven by the fact that we use a base-ten number system. Five is half of ten, and so we tend to settle on values that are multiples of five or ten. There was almost no chance we would've chosen .04 or .06. Those numbers don't satisfy us the way fives and tens do. (If humans had eight fingers instead of ten, we might very well have chosen .04.)
As you can infer from above, significance is a continuum. We could have a p = 0.08 situation and that effect could easily be a true effect. In fact, an effect with p = 0.30 could be a true effect. But especially in that case, when there's up to a 30% chance of drawing a sample like ours even if the null hypothesis is true, we don't want to report that as significant. Whereas effects with p-values of 0.06 or 0.09 are often reported, and should be. We report them as something like "this was significant at p = 0.06" or "this was marginally significant at p = 0.09". Note that we're still using the word significant. We can use that word given any p-value, as long as we include the p-value.
That's why Snowdon is wrong. The choice of 0.05 is a business decision that achieves a good tradeoff in our levels of Type 1 error vs Type 2 error (failing to detect a true effect.) But there's nothing natural or inherently meaningful about p = 0.05. It's not a value derived from nature, like Planck's constant. It wasn't discovered. There's no "significance" in nature. Like I said, it's a business decision.