Here is a tiny preview of upcoming methodological work relevant to recent events:
The Frog Jump
You have two survey variables:
Variable 1: AB
Variable 2: CD
AB: A and B are opposites sharing a 4-point scale (2 pts for A; 2 pts for B)
CD: C and D are opposites sharing a 4-point scale (2 pts for C; 2 pts for D)
What do I mean by opposites? We have lots of Opposite Scales. Canonical forms are:
Disagreement -- Agreement
Dislike -- Like
Unhappy -- Happy
They can vary in how stark the flip is, but in general one side expresses the substantive opposite of the other, which has profound implications for how we describe things like correlations – or for whether it's even appropriate for a scientist to use correlations on such measures. In our example, it's a stark flip, a virtually dichotomous four-point scale: Absolutely Disagree, Disagree, Agree, and Absolutely Agree. There's not even mild or moderate disagreement or agreement, nor is there a neutral or ambivalent midpoint, nor is there a separate unscored No Opinion option (which I strongly advocate).
99% of the sample is at the D side of CD.
- 99% of people at A are at D (A is large majority of sample)
- 95% of people at B are at D (B is small fraction of sample)
Still, for some reason you decide to run a Pearson correlation between AB and CD.
You get a negative correlation and a nice, mystical, meaningless p-value; sacrifice a goat to the Gods.
Write it up and say "B predicts C."
(Instead of saying AB is negatively correlated with CD, which will be hard because A and B are opposites, and C and D are opposites. Choose to talk only about B and C, even though no one is at C.)
(And don't tell anyone that there's no one at C, or that 95% of people at B are at D.)
Part of the dark magic here is using variance within D – remember D includes two points of the scale (Absolutely Agree and Agree) – to power a negative correlation between AB and CD (there won't always be variance on one side to power a correlation – it just happened to go down that way here), then frog jumping it to C in the write-up. Even though everyone's at D, the unscrupulous researcher only refers to C, which is the substantive opposite of D.
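The mechanism is easy to demonstrate with a quick simulation. The sketch below is a hypothetical illustration, not the actual survey data: I've assumed codings (AB scored 1–4, where 3–4 is the B side; CD scored 1–4, where 3–4 is the D side) and frequencies matching the proportions described above. A-respondents cluster at Absolutely Agree (4) and B-respondents at plain Agree (3), so nearly everyone is at D, yet Pearson's r comes out clearly negative:

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
n = 10_000
ab, cd = [], []
for _ in range(n):
    if random.random() < 0.90:               # A-side respondent (large majority; assumed 90%)
        ab.append(random.choice([1, 2]))     # 1 = Absolutely A, 2 = A
        # 99% of A at the D side, clustered at Absolutely Agree (4)
        cd.append(4 if random.random() < 0.99 else 2)
    else:                                    # B-side respondent (small fraction)
        ab.append(random.choice([3, 4]))     # 3 = B, 4 = Absolutely B
        # 95% of B also at the D side, but as plain Agree (3)
        cd.append(3 if random.random() < 0.95 else 2)

r = pearson(ab, cd)
share_at_d = sum(c >= 3 for c in cd) / n
print(f"r = {r:.2f}, share of sample at D = {share_at_d:.1%}")
```

The negative r is powered entirely by the 4-vs-3 split *within* D. Nothing about the sign of that coefficient licenses any statement about C, where essentially no one is.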
There are several root causes here. One is that disagreement and agreement scales are being wrongly treated as continuous variables. This is why we can't use a single letter for each variable in the example. There are really two variables in each variable: one is disagreement and one is agreement, which are substantive opposites (and there was no midpoint here). This will be less bad if you've got variance across both sides, not a 99%-on-one-side situation. But if all your participants are on one side of a substantively dichotomous scale, and you treat that scale as continuous, run a correlation on it, and then frog jump to the other side of the variable (where no one is) when you write it up, that's dark, dark business. It's wildly irresponsible, and the severity of the misconduct goes up as the amount of harm people could suffer from being falsely linked to that empty side of the opposite variable goes up.
In this case, C was disagreement with "HIV causes AIDS." (D was agreement, where 99% were.)
Running linear correlations on substantively dichotomous variables opens up the opportunity to frog jump from one side to the other when researchers write up the results. It conflates direction (a + or - correlation) with destination (C, when everybody's at D), and creates enormous opportunities for bias and fraud. A researcher could use a correlation to proclaim the opposite of the truth, in this case "B predicts C," when in fact almost all B are at D, not C.
The word "predict" is a common way to describe correlations and regressions, but using it as it was used in this case is false ("B predicts C"). If you wanted to find out whether B in fact predicts C, a linear correlation between AB and CD will not give you the answer. Completely different analyses are needed, which will require real scientists.
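To be concrete about what the most basic honest check would look like: before claiming "B predicts C," look at the conditional proportion – how many B-respondents are actually at C? A minimal sketch, using the same assumed proportions described above (hypothetical numbers for illustration, not real survey data):

```python
import random

random.seed(42)
n = 10_000
rows = []
for _ in range(n):
    if random.random() < 0.90:                        # A-side respondent (assumed 90%)
        side_ab = "A"
        side_cd = "D" if random.random() < 0.99 else "C"   # 99% of A at D
    else:                                             # B-side respondent
        side_ab = "B"
        side_cd = "D" if random.random() < 0.95 else "C"   # 95% of B at D
    rows.append((side_ab, side_cd))

# Conditional proportions among B-respondents only
b_sides = [cd for abv, cd in rows if abv == "B"]
p_c_given_b = b_sides.count("C") / len(b_sides)
p_d_given_b = b_sides.count("D") / len(b_sides)
print(f"P(C | B) = {p_c_given_b:.1%}, P(D | B) = {p_d_given_b:.1%}")
```

One line of arithmetic on the cross-tab shows the claim is backwards: the overwhelming majority of B-respondents are at D. No Pearson r on the full scales can overturn that.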
This might be underexposed, but linear correlation is not an inherently valid methodological decision. An r statistic with a desirable p-value is not inherently meaningful. Nothing is. This is especially true when the variables are opposite variables. We can't use variance on one side of an opposite variable to say anything about the other side.
People who would frog jump on something like "HIV causes AIDS" and falsely link millions of innocent people to denial of that fact should turn in their lab coats. Editors who allow this should edit no more. Journals with the word "science" in their titles and who publish this incredibly harmful, wildly unethical malpractice should vigorously reform and reconnect with their calling.