The Cook et al. (2013) 97% paper included a bunch of psychology studies, marketing papers, and surveys of the general public as scientific endorsement of anthropogenic climate change.
Let's walk through that sentence again. The Cook et al. 97% paper included a bunch of psychology studies, marketing papers, and surveys of the general public as scientific endorsement of anthropogenic climate change. This study was multiply fraudulent and multiply invalid already – e.g. their false claim that the raters were blind to the identities of the authors of the papers they were rating, absolutely crucial for a subjective rating study. (They maliciously and gleefully revealed "skeptic" climate science authors to each other in an online forum, as well as other authors. Since they were random people working at home, they could simply google the titles of papers and see the authors, making blindness impossible to enforce or claim to begin with. This all invalidates a subjective rater study.) But I was blindsided by the inclusion of non-climate papers. I found several of these in ten minutes with their database – there will be more such papers for those who search longer. I'm not willing to spend a lot of time with their data – invalid or fraudulent studies should simply be retracted, because they have no standing. Sifting through all the data is superfluous when the methods are invalid and structurally biased, which is the case here for several different reasons, as I discuss further down.
I contacted the journal – Environmental Research Letters – in June, and called for the retraction of this paper, and it's currently in IOP's hands (the publisher of ERL). I assume they found all these papers already, and more. The authors explicitly stated in their paper (Table 1) that "social science, education and research on people's views" were classified as Not Climate Related, and thus not counted as evidence of scientific endorsement of anthropogenic climate change. All of the papers below were counted as endorsement.
Chowdhury, M. S. H., Koike, M., Akther, S., & Miah, D. (2011). Biomass fuel use, burning technique and reasons for the denial of improved cooking stoves by Forest User Groups of Rema-Kalenga Wildlife Sanctuary, Bangladesh. International Journal of Sustainable Development & World Ecology, 18(1), 88–97. (This is a survey of the public's stove choices in Bangladesh, and discusses their value as status symbols, defects in the improved stoves, the relative popularity of cow dung, wood, and leaves as fuel, etc. They mention climate somewhere in the abstract, or perhaps the word denial in the title sealed their fate.)
Boykoff, M. T. (2008). Lost in translation? United States television news coverage of anthropogenic climate change, 1995–2004. Climatic Change, 86(1-2), 1–11.
De Best-Waldhober, M., Daamen, D., & Faaij, A. (2009). Informed and uninformed public opinions on CO2 capture and storage technologies in the Netherlands. International Journal of Greenhouse Gas Control, 3(3), 322–332.
Tokushige, K., Akimoto, K., & Tomoda, T. (2007). Public perceptions on the acceptance of geological storage of carbon dioxide and information influencing the acceptance. International Journal of Greenhouse Gas Control, 1(1), 101–112.
Egmond, C., Jonkers, R., & Kok, G. (2006). A strategy and protocol to increase diffusion of energy related innovations into the mainstream of housing associations. Energy Policy, 34(18), 4042–4049.
Gruber, E., & Brand, M. (1991). Promoting energy conservation in small and medium-sized companies. Energy Policy, 19(3), 279–287.
Şentürk, İ., Erdem, C., Şimşek, T., & Kılınç, N. (2011). Determinants of vehicle fuel-type preference in developing countries: a case of Turkey. (This was a web survey of the general public in Turkey.)
Grasso, V., Baronti, S., Guarnieri, F., Magno, R., Vaccari, F. P., & Zabini, F. (2011). Climate is changing, can we? A scientific exhibition in schools to understand climate change and raise awareness on sustainability good practices. International Journal of Global Warming, 3(1), 129–141. (This paper is literally about going to schools in Italy and showing an exhibition.)
Palmgren, C. R., Morgan, M. G., Bruine de Bruin, W., & Keith, D. W. (2004). Initial public perceptions of deep geological and oceanic disposal of carbon dioxide. Environmental Science & Technology, 38(24), 6441–6450. (Two surveys of the general public.)
Semenza, J. C., Ploubidis, G. B., & George, L. A. (2011). Climate change and climate variability: personal motivation for adaptation and mitigation. Environmental Health, 10(1), 46. (This was a phone survey of the general public.)
Héguy, L., Garneau, M., Goldberg, M. S., Raphoz, M., Guay, F., & Valois, M.-F. (2008). Associations between grass and weed pollen and emergency department visits for asthma among children in Montreal. Environmental Research, 106(2), 203–211. (They mention in passing that there are some future climate scenarios predicting an increase in pollen, but their paper has nothing to do with that. It's just medical researchers talking about asthma and ER visits in Montreal, in the present. They don't link their findings to past or present climate change at all (in their abstract), and they never mention human-caused climate change – not that it would matter if they did.)
Lewis, S. (1994). An opinion on the global impact of meat consumption. The American Journal of Clinical Nutrition, 59(5), 1099S–1102S. (Just what it sounds like.)
De Boer, I. J. (2003). Environmental impact assessment of conventional and organic milk production. Livestock Production Science, 80(1), 69–77.
Acker, R. H., & Kammen, D. M. (1996). The quiet (energy) revolution: analysing the dissemination of photovoltaic power systems in Kenya. Energy Policy, 24(1), 81–111. (This is about the "dissemination" of physical objects, presumably PV power systems in Kenya. To illustrate the issue here, if I went out and analyzed the adoption of PV power systems in Arizona, or of LED lighting in Lillehammer, my report would not be scientific evidence of anthropogenic climate change, or admissible into a meaningful climate consensus. Concretize it: Imagine a Mexican walking around counting solar panels, obtaining sales data, typing in MS Word, and e-mailing the result to Energy Policy. What just happened? Nothing relevant to a climate consensus.)
Vandenplas, P. E. (1998). Reflections on the past and future of fusion and plasma physics research. Plasma Physics and Controlled Fusion, 40(8A), A77. (This is a pitch for public funding of the ITER tokamak reactor, and compares it to the old INTOR. For example, we learn that the major radius of INTOR was 5.2 m, while ITER is 8.12 m. I've never liked the funding conflict-of-interest argument against the AGW consensus, but this is an obvious case. The abstract closes with "It is our deep moral obligation to convince the public at large of the enormous promise and urgency of controlled thermonuclear fusion as a safe, environmentally friendly and inexhaustible energy source." I love the ITER, but this paper has nothing to do with climate science.)
Gökçek, M., Erdem, H. H., & Bayülken, A. (2007). A techno-economical evaluation for installation of suitable wind energy plants in Western Marmara, Turkey. Energy, Exploration & Exploitation, 25(6), 407–427. (This is a set of cost estimates for windmill installations in Turkey.)
Gampe, F. (2004). Space technologies for the building sector. Esa Bulletin, 118, 40–46. (This is a magazine article – a magazine published by the European Space Agency. Given that the ESA calls it a magazine, it's unlikely to be peer-reviewed, and it's not a climate paper of any kind – after making the obligatory comments about climate change, it proceeds to its actual topic, which is applying space vehicle technology to building design.)
Ha-Duong, M. (2008). Hierarchical fusion of expert opinions in the Transferable Belief Model, application to climate sensitivity. International Journal of Approximate Reasoning, 49(3), 555–574. (The TBM is a theory of evidence and in some sense a social science theory – JDM applied to situations where the stipulated outcomes are not exhaustive, and thus where the probability of the empty set is not zero. This paper uses a dataset (Morgan & Keith, 1995) that consists of interviews with 16 scientists in 1995, and applies TBM to that data. On the one hand, it's a consensus paper (though dated and small-sampled), and would therefore not count. A consensus paper can't include other consensus papers – circular. On the other hand, it purports to estimate the plausible range of climate sensitivity, using the TBM, which could make it a substantive climate science paper. This is ultimately moot given everything else that happened here, but I'd exclude it from a valid study, given that it's not primary evidence, and the age of the source data. (I'm not sure if Ha-Duong is talking about TCS or ECS, but I think it's ECS.))
Douglas, J. (1995). Global climate research: Informing the decision process. EPRI Journal. (This is an industry newsletter essay – the Electric Power Research Institute. It has no abstract, which would make it impossible for the Cook crew to rate it. It also pervasively highlights downward revisions of warming and sea level rise estimates, touts Nordhaus' work, and stresses the uncertainties – everything you'd expect from an industry group. For example: "A nagging problem for policy-makers as they consider the potential costs and impacts of climate change is that the predictions of change made by various models often do not agree." In any case, this isn't a climate paper, or peer-reviewed, and it has no abstract. They counted it as Implicit Endorsement – Mitigation. (They didn't have the author listed in their post-publication database, so you won't find it with an author search.))
(I previously listed two other papers as included in the 97%, but I was wrong. They were rated as endorsement of AGW, but were also categorized as Not Climate Related.)
The inclusion of non-climate papers directly contradicts their stated exclusion criteria. The Not Climate Related category was supposed to include "Social science, education, research about people’s views on climate." (Their Table 1, page 2) Take another look at the list above. Look for social science (e.g. psychology, attitudes), education, and research on people's views...
(To be clear, I'm not at all criticizing the above-listed works. They all look like perfectly fine scholarship, and many are not from fields I can evaluate. My point is that they don't belong in a paper-counting consensus of climate science, even if we were counting directly related fields outside of climate science proper.)
The authors' claim to have excluded these unrelated papers was false, and they should be investigated for fraud. There are more papers like this, and if we extrapolate, a longer search will yield even more. This paper should be retracted post haste, and perhaps the university will conduct a more thorough investigation and audit. There are many, many more issues with what happened here, as detailed below.
Now, let's look at a tiny sample of papers they didn't include:
Lindzen, R. S. (2002). Do deep ocean temperature records verify models? Geophysical Research Letters, 29(8), 1254.
Lindzen, R. S., Chou, M. D., & Hou, A. Y. (2001). Does the earth have an adaptive infrared iris? Bulletin of the American Meteorological Society, 82(3), 417–432.
Lindzen, R. S., & Giannitsis, C. (2002). Reconciling observations of global temperature change. Geophysical Research Letters, 29(12), 1583.
Spencer, R. W. (2007). How serious is the global warming threat? Society, 44(5), 45–50.
There are many, many more excluded papers like these. They excluded every paper Richard Lindzen has published since 1997. How is this possible? He has over 50 publications in that span, most of them journal articles. They excluded almost all of the relevant work of arguably the most prominent skeptical or lukewarm climate scientist in the world. Their search was staggering in its incompetence. They searched the Web of Science for the topics of "global warming" and "global climate change", using quotes, so those exact phrases. I don't know how Web of Science defines a topic, but designing the search that way, constrained to those exact phrases as topics, excluded everything Lindzen has done in the current century, and a lot more.
Anyone care to guess which kinds of papers will tend to use the exact phrase "global warming" as official keywords? Which way do you think such papers will lean? Did no one think about any of this? Their search method excluded vast swaths of research by Lindzen, Spencer, Pielke, and others. I'm not going to do all the math on this – someone else should dig into the differential effects of their search strategy.
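To make the mechanics concrete, here's a minimal sketch of how an exact-phrase topic search behaves. All titles and keywords below are invented for illustration, and the matching function is a crude stand-in for whatever Web of Science actually does with topic fields – the point is only the directional effect:

```python
# Hypothetical illustration: an exact-phrase "topic" search misses papers
# whose titles/keywords use other climate terminology. All entries invented.
papers = [
    {"title": "Does the earth have an adaptive infrared iris?",
     "keywords": ["climate sensitivity", "tropical convection"]},
    {"title": "Taking the measure of the greenhouse effect",
     "keywords": ["radiative forcing", "climate feedback"]},
    {"title": "Public perceptions of global warming",
     "keywords": ["global warming", "survey"]},
]

def topic_match(paper, phrase):
    """Crude stand-in for an exact-phrase topic search:
    the phrase must appear verbatim in the title or a keyword."""
    haystacks = [paper["title"].lower()] + [k.lower() for k in paper["keywords"]]
    return any(phrase in h for h in haystacks)

hits = [p["title"] for p in papers
        if topic_match(p, "global warming") or topic_match(p, "global climate change")]
print(hits)  # only the public-perception survey matches; both physical-science papers are missed
```

Notice which kind of paper survives the filter: the survey of public opinion, not the physical-science work that happens to use terms like "climate sensitivity" or "radiative forcing" instead of the magic phrases.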
However, this doesn't explain the exclusion of the above Spencer paper. It comes up in the Web of Science search they say they ran, yet it's somehow absent from their database. They included – and counted as evidence of scientific endorsement – papers about TV coverage, public surveys, psychology theories, and magazine articles, but failed to count a journal article written by a climate scientist called "How serious is the global warming threat?" It was in the search results, so its exclusion is a mystery. If the idea is that it was in a non-climate journal, they clearly didn't exclude such journals (see above), and they were sure to count the paper's opposite number (as endorsement):
Oreskes, N., Stainforth, D. A., & Smith, L. A. (2010). Adaptation to global warming: do climate models tell us what we need to know? Philosophy of Science, 77(5), 1012–1028.
In any case, they excluded a vast number of relevant and inconvenient climate science papers. In all of this, I'm just scratching the surface.
Let's look at a climate-related paper they did include:
Idso, C. D., Idso, S. B., Kimball, B. A., HyoungShin, P., Hoober, J. K., Balling Jr, R. C., & others. (2000). Ultra-enhanced spring branch growth in CO2-enriched trees: can it alter the phase of the atmosphere’s seasonal CO2 cycle? Environmental and Experimental Botany, 43(2), 91–100.
The abstract says nothing about AGW or human activity. It doesn't even talk about CO2 causing an increase in temperature. In fact, it's the reverse. It talks about increases in temperature affecting the timing of seasonal CO2 oscillations, and talks about the amplitude of such oscillations. It's a focused and technical climate science paper talking about a seasonal mechanism. The raters apparently didn't understand it, which isn't surprising, since many of them lacked the scientific background to rate climate science abstracts – there are no climate scientists among them, although there is at least one scientist in another field, while several are laypeople. They counted it as endorsement of AGW. I guess the mere mention of CO2 doomed it.
Here's another paper, one of only three papers by Judith Curry in the present century that they included:
Curry, J. A., Webster, P. J., & Holland, G. J. (2006). Mixing politics and science in testing the hypothesis that greenhouse warming is causing a global increase in hurricane intensity. Bulletin of the American Meteorological Society, 87(8), 1025–1037.
Among other things, it disputes the attribution of increased hurricane intensity to increased sea surface temperature, stressing the uncertainty of the causal evidence. The Cook crew classified it as "No Position". Note that they had an entire category of endorsement called Impacts. Their scheme was rigged in multiple ways, and one of those ways was their categorization. It was asymmetric, having various categories that assumed endorsement, like Impacts, without any counterpart categories that could offset them – nothing for contesting or disputing attributions of impacts. Such papers could be classified as No Position, and conveniently removed, inflating the "consensus". This is perhaps the 17th reason why the paper is invalid.
There is another pervasive phenomenon in their data – all sorts of random engineering papers that merely mention global warming and then proceed to talk about their engineering projects. For example:
Tran, T. H. Y., Haije, W. G., Longo, V., Kessels, W. M. M., & Schoonman, J. (2011). Plasma-enhanced atomic layer deposition of titania on alumina for its potential use as a hydrogen-selective membrane. Journal of Membrane Science, 378(1), 438–443.
They counted it as endorsement. There are lots of engineering papers like this. They usually classify them as "Mitigation." Most such "mitigation" papers do not represent or carry knowledge, data, or findings about AGW, or climate, or even the natural world. There are far more of these sorts of papers than actual climate science papers in their data. Those who have tried to defend the Cook paper should dig out all the social science papers that were included, all the engineering papers, all the surveys of the general public and op-eds. I've given you more than enough to go on – you're the ones who are obligated to do the work, since you engaged so little with the substance of the paper and apparently gave no thought to its methods and the unbelievable bias of its researchers. The paper will be invalid for many reasons, including the exclusion of articles that took no position on AGW, which were the majority.
Critically, they allowed themselves a category of "implicit endorsement". Combine this with the fact that the authors here were political activists who wanted to achieve a specific result and appointed themselves subjective raters of abstracts, and the results are predictable. Almost every paper I listed above was rated as implicit endorsement. The operative method seemed to be that if an abstract mentioned climate change (or even just CO2), it was treated as implicit endorsement by many raters, regardless of what the paper was about.
There's yet another major problem that interweaves with the above. Counting mitigation papers creates a fundamental structural bias that will inflate the consensus. In a ridiculous study where we're counting papers that endorse AGW and offsetting them only with papers that reject AGW, excluding the vast majority of topical climate science papers in the search results that don't take simple positions, what is the rejection equivalent of a mitigation paper? What is the disconfirming counterpart? Do the math in your head. Start with initial conditions of some kind of consensus in climate science, or the widespread belief that there is a consensus (it doesn't matter whether it's true for the math here.) Then model the outward propagation of the consensus to a bunch of mitigation papers from all sorts of non-climate fields. Climate science papers reporting anthropogenic forcing have clear dissenting counterparts – climate science papers that dispute or minimize anthropogenic forcing (ignore the fallacy of demanding that people prove a negative and all the other issues here.) Yet the mitigation papers generally do not have any such counterpart, any place for disconfirmation (at least not the engineering, survey, and social science papers, which were often counted as "mitigation"). As a category, they're whipped cream on top, a buy-one-get-three-free promotion – they will systematically inflate estimates of consensus.
It's even clearer when we consider social science papers. Like most "mitigation" papers, all that's happening is that climate change is mentioned in an abstract, public views of climate change are being surveyed, etc. In what way could a social science paper or survey of the general public be classified as rejecting AGW by this method? In theory, how would that work? Would we count social science papers that don't mention AGW or climate as rejection? Could a psychology paper say "we didn't ask about climate change" and be counted as rejection? Remember, what got them counted was asking people about climate, or the psychology of belief, adoption of solar panels, talking about TV coverage, etc. What's the opposite of mentioning climate? Would papers that report the lack of enthusiasm for solar panels, or the amount of gasoline purchased in Tanzania, count as evidence against AGW, as rejection? What if, instead of analyzing TV coverage of AGW, a researcher chose to do a content analysis of Taco Bell commercials from 1995-2004? Rejection? And if a physicist calling for funding for a tokamak reactor counts as endorsement, would a physicist not calling for funding of a tokamak reactor, or instead calling for funding of a particle accelerator, count as rejection? I assume this is clear at this point. Counting mitigation papers, and certainly social science, engineering, and economic papers, rigs the results (many of which I didn't list above, because I'm stubborn and shouldn't have to list them all.)
Not surprisingly, I haven't found a single psychology, social science, or survey study that they classified as rejection... (but it's not my job to find them – in a catastrophe like this paper, all burdens shift to the authors. Note that finding a mitigation paper or several that was counted as rejection won't do anything to refute what I just said. First, I doubt you'd find several social science or survey papers that were counted as rejection. Second, you're not going to find more rejections than endorsements in the mitigation category, especially the engineering and social science papers, so the net bias would still be there. E.g. there won't be papers on TV coverage that counted as rejection, in symmetry with those that counted as endorsement. And no amount of counting or math will change anything here. We can't do anything with a study based on political activists rating abstracts on their cause, and we certainly can't do anything with it if they violated blindness, independence, and anonymity, or if their interrater reliability is incredibly low, which it is, or if they excluded neutrality in their measure of consensus, or given the other dozen issues here. We can't trust anything about this study, and if it had been published in a social science journal, I think this would've been obvious to the gatekeepers much sooner.)
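For reference, interrater reliability for categorical ratings like these is typically quantified with Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch, with invented rating pairs:

```python
# Minimal sketch of Cohen's kappa for two raters over categorical ratings.
# The rating pairs below are invented for illustration.
from collections import Counter

rater_a = ["endorse", "neutral", "endorse", "reject", "neutral", "endorse"]
rater_b = ["endorse", "endorse", "neutral", "reject", "neutral", "neutral"]

def cohens_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # expected chance agreement from each rater's marginal frequencies
    expected = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.217 – poor agreement
```

Kappa near 1 means near-perfect chance-corrected agreement; values this low mean the raters are barely beating a coin flip, which is exactly the kind of number a valid rater study cannot survive.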
(Some might argue that choice of research topic, e.g. choosing to research public opinion on climate change, TV coverage of the same, or the proliferation of solar panels, carries epistemic information relevant to scientific consensus on AGW. 1) This misses the point that they claimed not to include such studies, but included them anyway – such a false claim would normally void a paper all by itself. 2) But you can just fast-forward the math in your head. The pool of non-climate science research that doesn't mention AGW overwhelms the pool of non-climate research that does mention it, so it's going to come down to an initial condition of 0% of non-climate science people talking about climate in their research (in the past), bumped up to at most something like 1% (but very likely less) because of a stipulated and imported consensus in climate science, where that delta carries epistemic information – adds to our confidence estimate – which would in part require you to show that this delta is driven by accurate exogenous consensus detection (AECD) by non-climate researchers, out of all the noise and factors that drive research topic choices, ruling out political biases, funding, advisors, self-reinforcement, social acceptance, etc., and that AECD by a micro-minority of non-climate people, combined with their importation of said consensus, adds to our confidence.)
The inclusion of so many non-climate papers is just one of the three acts of fraud in this publication. It might be a fraud record... There's too much fraud in scientific journals, just an unbelievable amount of fraud. Whatever we're doing isn't working. Peer review in its current form isn't working. There's an added vulnerability when journals publish work that is completely outside their field, as when a climate science journal publishes what is essentially a social science study (this fraudulent paper was published in Environmental Research Letters.)
They claimed to use independent raters, a crucial methodological feature of any subjective rater study conducted by actual researchers: "Each abstract was categorized by two independent, anonymized raters." (p. 2)
Here's an online forum where the raters are collaborating with each other on their ratings. The forum directory is here. Whistleblower Brandon Shollenberger deserves our thanks for exposing them.
And here's another discussion.
And another one. Here, Tom Curtis asks:
"Giving (sic) the objective to being as close to double blind in methodology as possible, isn't in inappropriate to discuss papers on the forum until all the ratings are complete?"
The man understands the nature of the enterprise (Curtis keeps coming up as someone with a lot of integrity and ability. He was apparently a rater, but is not one of the authors.) There was no response in the forum (I assume there was some backchannel communication.) The raters carried on in the forum, and in many other discussions, and violated the protocol they subsequently claimed in their paper.
In another discussion about what to do about the high level of rater disagreement on categories and such, one rater said: "But, this is clearly not an independent poll, nor really a statistical exercise. We are just assisting in the effort to apply defined criteria to the abstracts with the goal of classifying them as objectively as possible. Disagreements arise because neither the criteria nor the abstracts can be 100% precise. We have already gone down the path of trying to reach a consensus through the discussions of particular cases. From the start we would never be able to claim that ratings were done by independent, unbiased, or random people anyhow."
Linger on that last sentence. "We would never be able to claim that ratings were done by independent, unbiased, or random people anyhow."
Yet in the paper, they tell us: "Each abstract was categorized by two independent, anonymized raters."
This is a remarkable case of malpractice, and it only gets worse.
When a study uses subjective human raters, those raters must generally be blind to the identities of their participants. Here, where we have humans reading and rating abstracts of journal articles, the raters most definitely need to be blind to the identities of the authors (who are the participants here) to avoid bias. It wouldn't be valid otherwise, not when rating abstracts on a contentious issue like climate change. There are very few studies based on rating abstracts of journal articles, and there may be as many studies on bias and disagreement among abstract raters as there are studies using the method (Cicchetti & Conn, 1976; Schmidt, Zhao, & Turkelson, 2009). Those studies were about scientists or surgeons rating abstracts in their fields – the researchers did not contemplate the idea of a study where laypeople rate scientific abstracts.
The Cook study makes these methodological issues secondary given the inherent invalidity of political activists subjectively rating abstracts on the issue of their activism – an impossible method. In addition to their false claim of independent raters, the authors claimed that the raters were blind to the authors of the papers they rated:
"Abstracts were randomly distributed via a web-based system to raters with only the title and abstract visible. All other information such as author names and affiliations, journal and publishing date were hidden." (p. 2)
They lied about that too. From one of their online discussions:
One rater asks: "So how would people rate this one:..."
After pasting the abstract, he asks:
"Now, once you know the author's name, would you change your rating?"
The phrase "the author's name" was linked as above, to the paper, openly obliterating rater blindness.
It was a Lindzen paper.
The rater openly outed the author of one of the papers to all the other raters. He was never rebuked, and everyone carried on as if fraud didn't just happen, as if the protocol hadn't been explicitly violated on its most crucial feature (in addition to rater independence.)
The rater later said "I was mystified by the ambiguity of the abstract, with the author wanting his skeptical cake and eating it too. I thought, "that smells like Lindzen" and had to peek."
It smelled like Lindzen. They were sniffing out skeptics. Do I have to repeat how crucial rater blindness is to the validity of a subjective rating study?
Another rater says "I just sent you the stellar variability paper. I can access anything in Science if you need others."
She sent the paper to another rater. The whole paper. Meaning everything is revealed, including the authors.
Another rater says, about another paper: "It's a bad translation. The author, Francis Meunier, is very much pro alternative / green energy sources."
Helpfully, another rater links to the entire paper: "I have only skimmed it but it's not clear what he means by that and he merely assumes the quantity of it. Note that it was published in the journal Applied Thermal Engineering."
Cook helpfully adds "FYI, here are all papers in our database by the author Wayne Evans:"
Let's quote the methods section once more: "Abstracts were randomly distributed via a web-based system to raters with only the title and abstract visible. All other information such as author names and affiliations, journal and publishing date were hidden. Each abstract was categorized by two independent, anonymized raters."
They do it over and over and over. People kept posting links to the full articles, revealing their provenance. It wasn't enough to shatter blindness and reveal the sources of the papers they were rating – one rater even linked to Sourcewatch to expose the source and authors of an article, adding "Note the usual suspects listed in the above link."
Another rater openly discussed rigging the second part of the study, which involved e-mailing authors and getting their ratings using the same invalid categorization scheme, by waiting to the end to contact "skeptic" scientists (after all non-skeptic scientists had been contacted): "Just leave the skeptic paper e-mails until right near the end. If they wish to leak this, then they'll do a lot of the publicising for us."
I have no idea if they implemented such methodological innovations, but it's remarkable that no one in the forum censured this rater or voiced any objection to such corruption.
Throughout their rating period – tasked with rating scientific abstracts on the issue of AGW – the raters openly discussed their shared politics and campaign on AGW with each other, routinely linked to some article by or about "deniers", and savored the anticipated results of their "study". They openly discussed how to market those results to the media – in advance of obtaining said results. One rater bragged to the other raters that he had completed 100 ratings without a single rejection, contaminating and potentially biasing other raters to match his feat...
More partial results were posted before ratings were completed. They seemed to find every possible way to bias and invalidate the study, to encourage and reinforce bias.
And because their design rested on a team of activists working from their homes, raters were completely free to google the titles, find the papers and authors at will – that breaks the method all by itself. They exhibited zero commitment to maintaining blindness in the forums, and the design made it unenforceable.
The day after ratings started, Cook alerted the raters to "16 deniers" in the Wall Street Journal – a group of scientists who were authors or signatories on a WSJ op-ed, who argued that global warming had been overestimated by prevailing models (which is simply true), and that the real disagreement between scientists is about the size and nature of the human contribution, not dichotomous yes-no positions on AGW. Cook also called them "climate misinformers." There was no elaboration or explanation that would clue us in to why Cook concluded these scientists were deniers or misinformers, but that's less important than the fact that raters were repeatedly primed and stoked to vilify and marginalize one side of the issue which they were tasked with rating.
After alerting his raters to the latest misinformation from their enemies/participants – the apparent fable that climate science has texture – Cook observed: "This underscores the value in including the category "humans are causing >50% of global warming" as I think it will be interesting to see how many papers come under this option (yeah, yeah, DAWAAR)."
Well... (FYI, I have no idea what DAWAAR means – anyone know?)
He then linked to an actual scientific study conducted by capable researchers commissioned by the American Meteorological Society – whose consensus figure Cook evidently found unacceptably low – and told the raters: "I see the need for TCP everywhere now!"
(TCP is The Consensus Project, their name for this rodeo.)
I was struck by how often they spoke of "deniers" and "denialists" when referring to scientists. They seemed to think that researchers who minimized the degree of anthropogenic forcing were deniers, as when Cook said: "Definitely, we'll show information about the # of denier authors vs endorsement authors." The forums were dripping with this ideological narrative, a completely impossible situation for a valid rater study of climate science abstracts. More broadly, they seemed to genuinely believe that anyone who disagrees with them is a "denier", that reality is arranged this way, consisting merely of their intensely ideological camp and "deniers". They had discussions about how to manage the media exposure and anticipate criticism, where they pre-emptively labeled any future critics as "deniers". (Since there is no psychological scientific case for the use of this term, much less its application to the specific issue of global warming, or even a clear definition of what the hell it means, I'm always going to place it in quotes.)
This all tells us:
1) They blatantly, cavalierly, and repeatedly violated the methods they claimed in their paper, methods that are crucial to the validity of a subjective rater study – maintaining blindness to the authors of papers they rated, and conducting their ratings independently. This destroyed the validity of an already invalid study – more coherently, it destroyed the validity of the study if we assumed it was valid to begin with.
2) These people were not in a scientific mood. They had none of the integrity, neutrality, or discipline to serve as subjective raters on such a study. We could've confidently predicted this in advance, given that they're political activists on the subject of their ratings, had an enormous conflict of interest with respect to the results, and the design placed them in the position to deliver the results they so fervently sought. Just that fact – the basic design – invalidates the study and makes it unpublishable. We can't go around being this dumb. This is ridiculous, letting this kind of junk into a scientific journal. It's a disgrace.
They also claimed to use anonymized raters – this is false too. Anonymity of raters in this context could only mean anonymity to each other, since they didn't interact with participants. They were not anonymous to each other, which is clear in the forums, and were even e-mailing each other. It appears that almost everything they claimed about their method was false.
And given their clear hostility to skeptic authors, and their willingness to openly expose and mock them, we know that we have something here we can't possibly trust, many times over. This is fraud, and we should be careful to understand that it is very unlikely to be limited to the few discussion threads I quoted from. Any fraud is too much fraud, and since they were e-mailing papers to each other, we can't confidently claim to know the limits of their fraud. If we had any reason to entertain the validity of this study to begin with, the burden at this point would rest entirely on the people who behaved this way – who were absurdly biased, incompetent, and fraudulent in these examples – to prove that the fraud was limited to those examples. We need to be serious. We can't possibly believe anything they tell us, absent abundant proof. But again, why would we be okay with a moderate level of fraud, or a severely invalid study? It's never going to be valid – it's not in our power to make this study valid or meaningful.
It won't matter whether the fraud we can see here resulted in a hundred incorrect ratings, or no incorrect ratings. No one should be burdened with validating their ratings. That's a waste of time, a perverted burden, except for post-retraction investigators. These people had no apparent commitment to being serious or careful. Subjective raters in scientific research need to be sober and neutral professionals, not people who sniff out the identities of their participants and chronically violate the method that they would subsequently claim in their paper. (And the math of the dichotomous ratings doesn't matter, because they included a bunch of inadmissible mitigation/engineering papers, excluded all the neutral papers, and because the Sesame Street counting method is inherently invalid for several reasons, which I discuss further down.)
There is a lot of content in this essay. That's the Joe way, and there are lots of issues to address – it's a textbook case. But the vastness of this essay might obscure the simplicity of many of these issues, so let me pause here for those who've made it this far:
When a scientific paper falsely describes its methods, it must be retracted. They falsely described their methods, several times on several issues. The methods they described are critical to a subjective human rater study – not using those methods invalidates this study, even if they hadn't falsely claimed those methods. The ratings were not independent at any stage, nor were they blind. Lots of irrelevant social science, psychology, survey, and engineering papers were included. The design was invalid in multiple ways, deeply and structurally, and created a systematic inflating bias. There is nothing to lean on here. We will know nothing about the consensus from this study. That's what it means to say that it's deeply invalid. The numbers they gave us have no meaning at this point and cannot be evaluated. Fraudulent and invalid papers have no standing – there's no data here to evaluate. If ERL/IOP (or the authors) do not retract, they'd probably want to supply us with a new definition of fraud that would exclude false descriptions of methods, and a new theory of subjective rating validity that does not require blindness or independence.
The journal has all it needs at this point, as does IOP, and I'm not going to do everyone's work for them for free – there are far more invalid non-climate articles beyond the 17 I listed, and more forum discussions to illustrate the false claims of rater blindness and independence. Don't assume I exhausted the evidence from the discussion forums – I didn't. Nothing I've done here has been exhaustive – there is more. My attitude has always been that I'm providing more than enough, and the relevant authorities can do the rest on their own. The issues below tend to be broader.
It should be clear that a survey of what the general public knows or thinks about climate science is not scientific evidence of anthropogenic warming. It doesn't matter what the results of such a survey are -- it has nothing to do with the scientific evidence for AGW. It's not a climate paper. A survey of people's cooking stove use and why they don't like the new, improved cooking stoves, is not scientific evidence of anthropogenic warming. It's not a climate paper. An investigation of the psychology of personal motivation for adaptation or mitigation is not evidence of anthropogenic warming. It's not a climate paper.
This also makes us throw out the claim that 97% of the authors of the papers that were counted as taking a position said that their papers endorsed the consensus. That claim is vacated until we find out how many of those authors were authors of social science, psychology, marketing, and public survey papers like the above. It would be amazing that the authors of such papers responded and said that their papers counted as endorsement, but at this point all bets are off. (Only 14% of the surveyed authors responded at all, making the 97% figure difficult to take at face value anyway. But the inclusion of non-climate science papers throws it out.)
Note that the authors are still misrepresenting their 97% figure as consisting of "published climate papers with a position on human-caused global warming" on their promotional website. Similarly, for an upcoming event, they claim "that among relevant climate papers, 97% endorsed the consensus that humans were causing global warming." Most egregiously, in a bizarre journal submission, Cook and a different mix of coauthors cited Cook et al as having "found a 97% consensus in the peer-reviewed climate science literature." Clearly, this is false. There's no way we can call the above-listed papers on TV coverage, solar panel adoption, ER visits, psychology, and public opinion "relevant climate papers", and certainly not "peer-reviewed climate science." Don't let these people get away with such behavior – call them out. Ask them how psychology papers and papers about TV coverage and atomic layer deposition can be "relevant climate papers". Many of the random papers they counted as endorsement don't take any position on climate change -- they just mention it. Raise your hand at events, notify journalists, etc. Make them defend what they did. Hopefully it will be retracted soon, but until then, shine the light on them. For one thing, Cook should now have to disclose how many psychology and other irrelevant papers were included. Well, even that is pointless given a simultaneously and multiply fraudulent and invalid paper. ERL and IOP are obligated to simply retract it.
For a glimpse of the thinking behind their inclusion of social science papers, see their online discussion here, where they discuss how to rate a psychology paper about white males and "denial" (McCright & Dunlap, 2011). Yes, they're seriously discussing how to rate a psychology paper about white males. These people are deeply, deeply confused. The world thought they were talking about climate science. Most of the raters wanted to count it as endorsement (unlike the psychology papers listed above, the white males study didn't make the cut, for unknown reasons.)
Some raters said it should count as a "mitigation" paper...
"I have classified this kind of papers as mitigation as I think that's mostly what they are related to (climate denial and public opinion are preventing mitigation)."
"I have classified many social science abstracts as mitigation; often they are studying how to motivate people to act to slow AGW."
Cook says "Second, I've been rating social science papers about climate change as Methods."
As a reminder, in their paper they tell us that "Social science, education, research about people's views on climate" were classified as Not Climate Related (see their Table 1.)
Cook elaborates on the forum: "I think either methods or mitigation but not "not climate related" if it's social science about climate. But I don't think it's mitigation. I only rate mitigation if it's about lowering emissions. There is no specific link between emissions & social science studies. So I think methods."
Methods? In what sense would a psychology paper about white males be a methods paper in the context of a study on the scientific consensus on anthropogenic climate change? Methods? What is the method? (The study's authors collected no data – they dug into Gallup poll data, zeroed in on the white male conservatives, and relabeled them deniers and "denialists".)
In their paper, they describe their Methods category as "Focus on measurements and modeling methods, or basic climate science not included in the other categories." They offer this example: "This paper focuses on automating the task of estimating Polar ice thickness from airborne radar data. . . "
We might also attend to the date of their white males discussion: March 22, 2012. This was a full month into the ratings, and most had been completed by then. Yet still people were flagrantly breaking the independence they explicitly claimed in their methods section.
We might also attend to the fact that the poster in that forum, who asked how they should rate the white males paper, provided a link to the full paper, once again shattering blindness to the authors. This suggests that the method these people described in their paper was never taken seriously and never enforced.
I think for some people with a background in physical sciences, or other fields, it might not be obvious why some of these methods matter, so I'll elaborate a bit. I mean people at a journal like ERL, the professional staff at IOP, etc. who don't normally deal with subjective rater studies, or any social science method. (Note that I'm not aware of any study in history that used laypeople to read and interpret scientific abstracts, much less political activists. This method might very well be unprecedented.) I was puzzled that ERL editor Daniel Kammen had no substantive response at all to any of these issues, gave no sign that false claims about methods were alarming to him, or that raters sniffing out authors like Lindzen was alarming, and seemed unaware that blindness and independence are crucial to this kind of study, even absent the false claims of the same.
First, note that our view of the importance of blindness and independence in subjective rating studies won't change the fact that they falsely claimed to have used those methods, which is normally understood as fraud. If we're going to allow people to falsely describe their methods, we should be sure to turn off the lights before locking up. One commenter defended them by quoting this passage from the paper's discussion section: "While criteria for determining ratings were defined prior to the rating period, some clarifications and amendments were required as specific situations presented themselves." This doesn't seem to work as a defense, since it doesn't remotely suggest that they broke blindness, independence, or anonymity, and only refers to the rating criteria, which can be clarified and amended without breaking blindness, independence, or anonymity. They were quite explicit about their methods.
But what about these methods? Why do we care about independence or blindness?
The need for blindness is probably well-understood across fields, and features prominently in both social science and biomedical research (where it often refers to blindness to condition.) As I mentioned, the participants here were the authors of the papers that were rated (the abstracts.) The subjective raters need to be blind to the participants' identities in order to avoid bias. Bias from lack of blindness is a pervasive issue, from the classic studies of bias against female violinists (which was eliminated by having them perform behind a curtain, thus making the hiring listener blind to gender -- see Goldin and Rouse (1997) for a review of impacts), to any study where raters are evaluating authored works. For example, if we conducted a study evaluating the clarity of news reporting on some issue using subjective human raters/readers, the raters would most definitely have to be blind to the sources of the articles they were rating – e.g. they could not know that an article was from the New York Times or the Toledo Blade, nor could they know that the journalist was Tasheka Johnson or James Brubaker III. Without blindness the study wouldn't be valid, and shouldn't be published.
This issue is somewhat comical given the inherent bias of political activists deciding what abstracts mean for their cause, since the bias there likely overwhelms any bias-prevention from blindness to authors. It would be like Heritage staffers rating articles they knew were authored by Ezra Klein, or DNC staffers rating articles they knew were penned by George Will (it's hard to imagine what the point of such studies would be, so bear with me.) The raters here amply illustrated the potential for such bias – remember, they sniffed out Lindzen. I assume I don't have to explain more about the importance of rater blindness to authors. We can't possibly use or trust a study based on ratings of authored works where raters weren't blind to the authors, even if we were in a fugue state where a paper-counting study using lay political activists as raters of scientific abstracts made sense to us.
Independence: It's important for raters to be independent so that they don't have systematic biases, or measurement error coming from the same place. Again, the nature of the raters here makes this discussion somewhat strange – I can't drive home enough how absurd all of this is. This should never have been published in any universe where science has been formalized. But ignoring the raters' invalidating biases, we would need independence for reliable data. This is rooted in the need for multiple observations/ratings. It's the same reason we need several items on a personality scale, instead of just one. (There is interesting work on the predictive validity of one-item scales of self-esteem, and recently, narcissism, but normally one item won't do the job.) It should be intuitive why we wouldn't want just one rater on any kind of subjective rating study. We want several, and the reason is so we have reliable data. But there's no point if they're not independent. Remember, the point is reliability and reducing the risk of systematic bias or measurement error.
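The statistical logic here can be sketched with a toy simulation (my own illustration with made-up numbers, not anything from the study): averaging more independent raters shrinks error toward zero, but a bias shared across raters – the kind a common discussion forum produces – never averages out, no matter how many raters you add.

```python
import random

def mean_rating(k, shared_bias_sd, rng):
    """Average of k raters' scores of one item whose true value is 0.
    Each rater adds independent noise (sd=1); shared_bias_sd > 0 adds a
    bias common to ALL k raters, i.e. non-independence."""
    bias = rng.gauss(0, shared_bias_sd)  # same for every rater on this item
    return sum(bias + rng.gauss(0, 1) for _ in range(k)) / k

def spread(k, shared_bias_sd, trials=4000, seed=1):
    """Empirical standard deviation of the k-rater mean around the truth."""
    rng = random.Random(seed)
    vals = [mean_rating(k, shared_bias_sd, rng) for _ in range(trials)]
    m = sum(vals) / trials
    return (sum((v - m) ** 2 for v in vals) / trials) ** 0.5

# Independent raters: error shrinks roughly as 1/sqrt(k).
print(spread(1, 0.0), spread(10, 0.0))
# Shared bias: error plateaus near the size of the shared bias,
# regardless of how many raters you pile on.
print(spread(1, 1.0), spread(10, 1.0))
```

With independence, ten raters are meaningfully better than one; with a shared bias, ten raters are barely better than one – you've paid for multiple observations and received roughly one.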
So what does it mean if people who are supposed to be independent raters go to a rating forum that should not exist, and ask "Hey guys, how should I rate this one?" (and also "What if you knew the author was our sworn enemy?"...). What really triggers the problem is when other raters reply and chime in, which many did. It means that raters' ratings are now contaminated by the views of other raters. It means you don't have independent observations anymore, and have exposed yourself to systematic bias (in the form of persuasive raters on the forum, among other things). It means you needn't have bothered with multiple raters, and that your study is invalid.
It also means that you can't even measure reliability anymore. Subjective rating studies always measure and report interrater reliability (or interrater agreement) – there are formal statistics for this, formal methods of measuring it validly. This study was notable for not reporting it – I've never encountered a rating study that did not report the reliability of its raters (it's not measured as percentages, which they strangely tried to do, but requires formal coefficients that account for sources of variance.) The right estimate here would probably be Krippendorff's alpha, but again, it's moot, on two counts – the method, and the fraud. We would expect political activists to have high agreement in rating something pertinent to their activism – they're all the same, so to speak, all activists on that issue, and with the same basic position and aims. It wouldn't tell us much if their biased ratings were reliable, that they tended to agree with each other. But here, they shattered independence on their forum, so interrater reliability statistics would be invalid anyway (their reported percentages of disagreement, besides being crude, are invalid anyway since the raters weren't independent -- they were discussing their ratings of specific papers in the forum, contaminating their ratings and making the percentage of agreement or disagreement meaningless.)
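To make the percentages point concrete, here is a minimal sketch (mine, not theirs; Krippendorff's alpha is the fuller tool for many raters and missing data, but two-rater Cohen's kappa shows the principle) of how raw percent agreement can look impressive while the chance-corrected coefficient is at or below zero – exactly what happens when nearly everything gets the same label:

```python
from collections import Counter

def percent_agreement(r1, r2):
    """Raw proportion of items on which two raters gave the same label."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa: agreement corrected for the agreement expected by
    chance, given each rater's marginal label frequencies."""
    n = len(r1)
    po = percent_agreement(r1, r2)  # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    # expected agreement if raters labelled independently at their base rates
    pe = sum((c1[k] / n) * (c2[k] / n) for k in set(r1) | set(r2))
    return (po - pe) / (1 - pe)

# Two hypothetical raters who call almost everything "endorse" and
# disagree only on the rare cases:
r1 = ["endorse"] * 19 + ["reject"]
r2 = ["endorse"] * 18 + ["reject", "endorse"]

pa = percent_agreement(r1, r2)  # 90% raw agreement – looks impressive
kp = cohens_kappa(r1, r2)       # slightly negative – no better than chance
```

Ninety percent "agreement" here is entirely an artifact of the lopsided base rate; the kappa reveals that, on the cases where judgment actually mattered, these raters agreed no better than coin flips.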
Another striking feature of the discussions is that they had no inkling of the proper methods for such a study. No one seemed to take the importance of blindness to author seriously, or independence (except for Curtis, who departed for other reasons.) No one seemed to have heard of interrater reliability. They discussed how to measure agreement from a zero-knowledge starting point, brainstorming what the math might be – no one mentions, at any point, the fact that all sorts of statistics for this exist, and are necessary. In one of the quotes above, a rater said: "Disagreements arise because neither the criteria nor the abstracts can be 100% precise." There was no apparent awareness that one source of variance, of disagreement, in ratings was actual disagreement between the raters – different judgments.
They applied their "consensus" mentality more broadly than I expected, applying it to their own study – they actually thought they needed to arrive at consensus on all the abstracts, that raters who disagreed had to reconsider and hopefully come to agreement. They didn't understand the meaning or importance of independent ratings in multiple ways, and destroyed their ability to assess interrater reliability from the very beginning, by destroying independence in their forums from day one. Then they had a third person, presumably Cook, resolve any remaining disagreements. None of these people were qualified for these tasks, but I was struck by the literal desire for consensus when they discussed this issue, and the belief that the only sources of variance in ratings were the criteria and abstracts. It makes me wonder if they embrace some kind of epistemology where substantive disagreement literally doesn't exist, where consensus is always possible and desirable. Having no account of genuine disagreement would neatly explain their worldview that anyone who disagrees with them is a "disinformer" or "denier". The post-mortem on this hopefully brief period of history during which science was reduced to hewing to a purported consensus vs. denial will be incredibly interesting. What's most offensive about this is how plainly dimwitted it is, how much we'd have to give up to accept a joint epistemology/ethics of consensus-seeking/consensus-obedience.
Phrases: The Mysticism of Consensus
Getting back to the raters' comments, rater Dana Nuccitelli also said the white males paper should count as "methods", and I was struck that he wrote "It's borderline implicit endorsement though, with all the 'climate change denial' phrases. If you read the paper I'd bet it would be an explicit endorsement."
He thinks that if a psychology paper uses the phrase "climate change denial", it could count as scientific endorsement of anthropogenic climate change. We should linger on that. This is a profound misunderstanding of what counts as scientific evidence of AGW. The implied epistemology there is, well, I don't know that it has a name. Maybe it's a view that reality is based on belief, anyone's belief (except for the beliefs of skeptics), perhaps a misreading of Kuhn.
Even if we thought reality was best understood via consensus, a physical climate reality is not going to be created by consensus, and the only consensus we'd care about would be that of climate scientists. That leftist sociologists pepper their paper with the phrase "climate change denial" does not add to our confidence level about AGW, or feed into a valid scientific consensus. It's not evidence of anything but the unscientific assertions of two American sociologists and a failure of peer review. The paper was invalid – they labeled people as "deniers" for not being worried enough about global warming, and for rating the media coverage as overhyped. The authors did nothing to validate this "denial" construct as a psychological reality (to my knowledge, no one has.) Citing a couple of other invalid studies that invalidly invoke "denial" won't do it, and it highlights the perils and circularity of "consensus" in biased fields (although there's no consensus in research psychology about a construct of denial – only a handful of people are irresponsible enough to use it.) They presented no evidence that a rational person must worry a given amount about distant future changes to earth's climate, or that not worrying a given amount is evidence of a process of "denial". They offered no evidence that the appropriateness of the media's coverage of an issue is a descriptive fact that can be denied, rather than a subjective and complex, value-laden judgment, or that it is an established and accessible fact that the media does not overhype AGW, and thus that people who say that it does are simply wrong as a matter of scientific fact, and also deniers.
No one who wants to count a paper about white males as evidence has any business being a rater in a subjective rating-based study of the climate science consensus. There are lots more of these alarming online discussions among raters. This was a disaster. This was an invalid study from the very outset, and with this much clear evidence of fraud, it's a mystery why it wasn't retracted already. We have explicit evidence here that these people had no idea what they were doing, were largely motivated by their ideology, and should probably submit to drug testing. These people had neither the integrity nor the ability to be raters on a scientific study of a scientific consensus on their pet political cause, and need to be kept as far as possible from scientific journals. They simply don't understand rudimentary epistemology, or what counts as evidence of anthropogenic climate change.
We need to keep in mind for the future that if people don't understand basic epistemology, or what counts as evidence, they won't be able to do a valid consensus study – not even a survey of scientists, because the construction of the questions and response scales will require such knowledge, basic epistemological competence, and ideally some familiarity with the research on expert opinion and how to measure and aggregate it. Anyone who conducts a scientific consensus study needs to think carefully about epistemology, needs to be comfortable with it, able to understand concepts of confidence, the nature of different kinds of evidence and claims, and the terms scientists are most comfortable using to describe their claims (and why.)
Let's retrace our steps...
The above papers have nothing to do, epistemologically, with the scientific consensus on global warming. The consensus only pertains to climate science, to those scientists who actually study and investigate climate. To include those papers was either a ridiculous error or fraud. I didn't expect this -- I expected general bias in rating climate papers. I never imagined they'd include surveys of the public, psychology papers, and marketing studies. In retrospect, this was entirely predictable given that the researchers are a bunch of militant anti-science political activists.
As I said, I found those papers in ten minutes with their database. I'm not willing to invest a lot of time with their data. The reason is what I've argued before -- a method like theirs is invalid and perverts the burden of proof. The world is never going to be interested in a study based on militant political activists reading scientific abstracts and deciding what they mean with respect to the issue that is the focus of their activism. That method is absurd on its face. We can't do anything with such studies, and no one should be burdened with going through all their ratings and identifying the bias. We can't trust a study where the researchers are political partisans who have placed themselves in the position to generate the data that would serve the political goals -- the position of being subjective raters of something as complex and malleable as a scientific abstract. That's not how science is normally done. I've never heard of researchers placing themselves in the position of subjectively rating complex text, articles and the like, on an issue on which they happen to be activists.
I don't think I spelled this out before because I thought it was obvious: There is enormous potential for bias in such a method, far more potential than in a normal scientific study where the researchers are collecting data, not creating it. Having human beings read a complicated passage, a short essay, and decide what it means, is already a very subjective and potentially messy method. It's special, and requires special training and guidelines, along with special analyses and statistics. The Cook paper is the only subjective rating study I've ever seen that did not report any of the statistics required of such studies. It was amazing -- they never reported interrater reliability. I can't imagine a rater study that doesn't report the reliability of the raters... This study is a teachable moment, a future textbook example of scientific scams.
But having humans read a scientific abstract and decide what it means is even more challenging than a normal rater study. For one thing, it's very complicated, and requires expert knowledge. And in this case, the researchers/raters were unqualified. Most people aren't going to be able to read the abstracts from any given scientific field and understand them. Climate science is no different from any other field in this respect. The raters here included luggage entrepreneurs, random bloggers, and an anonymous logician known only by his nom de guerre, "logicman", among others. Normally, we would immediately stop and ask how in the hell these people are qualified to read and rate climate science abstracts, or in logicman's case, who these people are. To illustrate my point, here's a sample climate abstract, from LeGrande and Schmidt (2006):
We present a new 3-dimensional 1° × 1° gridded data set for the annual mean seawater oxygen isotope ratio (δ18O) to use in oceanographic and paleoceanographic applications. It is constructed from a large set of observations made over the last 50 years combined with estimates from regional δ18O to salinity relationships in areas of sparse data. We use ocean fronts and water mass tracer concentrations to help define distinct water masses over which consistent local relationships are valid. The resulting data set compares well to the GEOSECS data (where available); however, in certain regions, particularly where sea ice is present, significant seasonality may bias the results. As an example application of this data set, we use the resulting surface δ18O as a boundary condition for isotope-enabled GISS ModelE to yield a more realistic comparison to the isotopic composition of precipitation data, thus quantifying the ‘source effect’ of δ18O on the isotopic composition of precipitation.
Would a smart, educated layperson understand what this means? How? Would they know what the GISS ModelE is? What GEOSECS data and 3-dimensional 1° × 1° gridded data are? Even scientists in other fields wouldn't know what this means, unless they did a lot of reading. This study was ridiculous on these grounds alone. The burden lies with the authors trotting out such a questionable method to first establish that their raters had the requisite knowledge and qualifications to rate climate abstracts. No one should ever publish a study based on laypeople rating scientific abstracts without clear evidence that they're qualified. This is technically ad hominem, but I think ad hominem is a wise fallacy in some specific circumstances, like when it's an accurate probability estimate based on known base rates or reasonable inferences about a person's knowledge, honesty, etc. (it's going to be a technical fallacy in all cases because it's based on exogenous evidence, evidence not in the premises, but it's not always a fallacy in the popular sense of "fallacy" as an invalid or unreliable method of reasoning, an issue I explore in my book on valid reasoning.) Wanting a doctor to have gone to medical school is ad hominem (if we dismiss or downrate the diagnoses of those who haven't gone to medical school.) I'm perfectly fine with laypeople criticizing scientists, and I don't think such criticism should be dismissed out of hand (which I call the Insider Fallacy, an unwise species of ad hominem), but if you give me a study where laypeople rated thousands of abstracts, I'm not going to accept the burden of proving that their ratings were bad, one by one. Thousands of ratings is a much different situation than one critical essay by a layperson, which we can just read and evaluate. With thousands of ratings, I think the burden has to be on the lay researchers to establish that they were qualified.
When we add the fact that the raters were partisan political activists motivated to deliver a particular result from the study, we go home. The normal response might be several minutes of cognitive paralysis over the absurdity of such a method, of such a "study". ERL should be ashamed of what they did here. This is a disgrace. Political activists deciding what abstracts mean on the issue of their activism? Have we decided to cancel science? Are we being serious? It's 2014. We have decades and mountains of research on bias and motivated reasoning, scientific research. The idea of humans reading and rating abstracts on their implications for their political cause sparks multiple, loud, shrieking alarms. A decently bright teenager would be able to identify this method as absurd. This really isn't complicated. It shouldn't be happening in the modern world, or in modern scientific journals. It would pervert the burden if we had to run around validating the divinations of motley teams of lay political activists who designed "studies" where they appointed themselves to read science abstracts and decide what they mean for their political cause. It would pervert the burden if teams of politically-motivated scientists appointed themselves as subjective raters. This method is incompatible with long established and basic norms of scientific objectivity and validity.
I assume this article will be retracted. We need to be able to distinguish our political selves from our scientific selves, and politics should never dislodge our transcendent commitments to integrity, scientific rigor, valid methods – or our basic posture against fraud and in favor of using our brains.
One bat, two bats, three bats! Sesame Street Consensus Counting
Speaking of using our brains, I think we might also want to think about why we would ever count papers, and take a percentage, as a measure of the consensus on some issue in a scientific field. There are several obvious issues there that we'd need to address first. And on this particular topic, it doesn't address the arguments skeptics make, e.g. publication bias. The publication bias argument is unruffled by counting publications. If we care about engaging with or refuting skeptics, this method won't do. But again, there are several obvious issues with counting papers as a method (which is layered on top of the issues with having humans read abstracts and decide what they mean with respect to a contentious environmental issue.) Like soylent green, a consensus is made of people. Papers have no consensus value apart from the people who wrote them. There is enormous epistemic duplication in counting papers – each paper will not have the same true epistemic weight. One reason is that papers by the same authors are often partly duplicative of their earlier work. If Joe said Xa in Paper 1, said Xa in Paper 2, and Xa in Paper 3 (among other things), Joe now has three "consensus" votes. Pedro says Xa in Paper 1, Xb in P2, and Xc in P3. I'll unpack that later. Other duplication will be across authors.
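You can make the duplication point concrete with a toy tally. This is a sketch with hypothetical authors and positions (the names Joe, Pedro, Xa, etc. are just the placeholders from the paragraph above), not a model of the actual Cook data:

```python
# Toy illustration of epistemic duplication in paper counting.
# Each tuple is (author, position_expressed). Entirely hypothetical data.
papers = [
    ("Joe", "Xa"), ("Joe", "Xa"), ("Joe", "Xa"),      # Joe restates Xa three times
    ("Pedro", "Xa"), ("Pedro", "Xb"), ("Pedro", "Xc"),
]

# Naive paper counting: Joe's single view casts three "consensus votes".
paper_votes = sum(1 for _, pos in papers if pos == "Xa")

# Counting people instead: each author's view on Xa counts once.
author_votes = len({author for author, pos in papers if pos == "Xa"})

print(paper_votes)   # 4 paper-votes for Xa
print(author_votes)  # but only 2 people actually hold it
```

The gap between the two tallies is pure duplication, and nothing in a raw paper count corrects for it.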
A huge source of duplication is when related fields import AGW from climate science, and then proceed to work on some tie-in. These are often "mitigation" papers, and they can't be counted without explaining how they are commensurable with climate papers. A critical assumption of the simple counting method is that all these papers are commensurable. This can't possibly be true – it's false. Papers about atomic layer deposition or the ITER are not commensurable with a climate paper simply because they mention climate. This would require a mystical epistemology of incantation. Counting such exogenous mitigation papers would require a rich account of what consensus means, how it's expressed, and what it means when people who don't study climate nod their heads about AGW in their grant applications and articles. We'd need an account of what that represents. What knowledge are they basing their endorsement on? It's very likely from available evidence that they are simply going along with what they believe is a consensus in climate science. It's unlikely most of these researchers have read any consensus papers. It's more likely they just believe there's a consensus, because that's the common view and expressing doubt or even asking questions can get them smeared as "deniers", given the barbarism of the contemporary milieu.
If they're just assuming there's a consensus, then their endorsement doesn't count, unless you can present compelling evidence that the processes by which people in other fields ascertain a consensus in climate science is a remarkably accurate form of judgment – that the cues they're attending to, perhaps nonconsciously, are reliable cues of consensus, and that exaggerated consensuses do not convey those cues. In other words, they'd intuitively sniff out an exaggerated consensus. There are a lot of things that would have to be true here for this kind of model to work out. For example, it couldn't be the case that linking your work to climate change helps you get grants or get published, or at least the implicit processes at work in these people would have to be such that these incentives would be overwhelmed by their accurate detection of a reliable consensus in other fields. But I'm being generous here, just hinting in the direction of what we'd need to count most mitigation papers. Determining that non-climate researchers are accurate in consensus detection wouldn't be enough. Being accurate at detection wouldn't make their membrane engineering papers commensurable with climate science papers. You'd need a comprehensive theory of secondary or auxiliary consensus, and in the best case, it might give you a weighting scheme that gave such papers non-zero weights, but much less than the weights for climate science attribution papers. There's no way to get to full commensurability without a revolutionary discovery about the nature of consensus, but remember that consensus is a probabilistic weight that we admit into our assessment of reality. Any theory of consensus needs to address the reliability of its congruence with reality. Without that, we have nothing.
Third, the practice of "least publishable units" will undermine the Sesame Street counting method, could destroy it all by itself. You can run that model in your head. (One way out here is to show that people who engage in this practice are more likely to be right on the issue, and that the degree of their accuracy advantage tracks to their mean pub count advantage over those less prone to LPU strategies. Not promising.)
We need careful work by epistemologists here, but ultimately I think none of this will work out. Counting papers, even with weighting and fancy algorithms, is a poor proxy for consensus, which is made of people. People can talk, so you might as well just ask them -- and ask them good questions, not the same binary question. Papers represent something different, and there are so many variables that will drive and shape patterns of publication. Papers should be the units of meta-analysis for substantive questions of "What do we know? What is true, and not true?" Meta-analysis requires careful evaluation of the studies, real substantive thinking and judgment criteria – not just counting and numerology. Rating abstracts looks remarkably weak compared to what a meta-analysis does – why would we choose the less rigorous option?
Consensus is a different question. It's not "What is true?" It's "What proportion of experts think this is true?" It's one of several ways of approaching "What is true?", a special and complex type of probabilistic evidence. It won't be measured by counting papers – those aren't the units of consensus. And counting 15 and 20 year old papers is like finding out someone's political party by looking at their voting records from Clinton-Dole, which would be odd, and less accurate, when you can just call them and ask them "'sup?". All sorts of questions could be answered by carefully reviewing literature, possibly some kind of consensus questions, mostly past consensus or trajectories. But to get a present-tense consensus, you would need to do a lot of work to set it up -- it couldn't be anything as simple as counting papers, since consensus is people.
Interestingly, by simply counting papers like Count von Count, they actually applied weights. None of this was specified or seemingly intended. First, their odd search did something. It's unclear exactly what, and someone will need to unpack it and see how that search stripped Lindzen's votes by excluding everything he's published since 1997. But the basic weights are likely to be: those who publish more, especially those whose articles are counted in the Cook search; those who participate in commentaries; older researchers; English-language researchers; those in non-English countries who get good translations vs those who don't; reviewers -- a paper counting study gives a lot of vicarious votes to gatekeepers; journals -- same gatekeeper effect; people who thrive in the politics of science, the ones who get positions in scientific bodies, who gladhand, social climbers, et al – that should make it easier to be published; researchers with more graduate students get more votes, ceteris paribus (this is all ceteris paribus, or ceteris paribus given the weird search method); men over women, probably, given the culture of climate science being like most sciences; researchers who take positions in their abstracts vs those who don't or who tend to have it in the body only; researchers who write clear abstracts vs those who don't; researchers who aren't hated by the Cook team, since they clearly hated Lindzen and savaged a bunch of skeptical scientists on their bizarre website; probably height too, since it predicts success in various fields, and here that would mean publications (I'm being cute on that one, but paper counting will have all sorts of weights, predictors of admission, and I won't be shocked if height was one.) Those are just some of the likely weights, in various directions, enforced by crude counting.
(Some of these predictors might correlate with some concept of credibility, e.g. simple number of papers published. That's fine. If people want to make that case, make it. It will require lots of work, and it's not going to do anything to our assessment of the Cook study.)
There's also a profound structural bias and error in measuring consensus on an issue by counting only those who take a position on it. The framing here demands that scientists prove a negative -- it counts papers that presumably endorse AGW, and the only things that weigh against them are papers that dispute AGW. Papers that take no position (the vast majority in this case) are ignored. This is absurd. It violates the longstanding evidentiary framework of science, our method of hypothesis testing, and really, the culture of science. Such a method is invalid on its face, because it assumes that at any arbitrary timepoint, if the number who support a claim (Y) exceeds those who oppose it (N), this counts as compelling evidence for the claim, irrespective of the number who take no decisive position on the claim (S). That's a massive epistemic information loss -- there's a lot of what we might call epistemic weight in those who take no position, those who don't raise their hands or hold press conferences, a lot of possible reasons and evidence they might be attending to, in what their neutrality or silence represents. This will be very context-sensitive, will vary depending on the type of claim, the nature of the field, and so forth. We should be very worried when S ≫ (Y+N), which it is here. Note that this assumes that counting papers is valid to begin with -- as I said, there are several obvious issues with counting papers, and they'd have to be dealt with before we even get to this point. Validly mapping S, Y, and N, or something like them, to papers instead of people requires some work, and we won't get that kind of work from Cook and company.
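A back-of-the-envelope calculation shows how large the distortion can be. The numbers here are hypothetical round figures chosen only to illustrate the structural point about S, Y, and N, not the actual category counts from the study:

```python
# Hypothetical round numbers, chosen only to illustrate the structure.
Y = 100   # papers endorsing the claim
N = 3     # papers rejecting it
S = 900   # papers taking no position (ignored by the counting method)

share_ignoring_S = Y / (Y + N)        # what the counting method reports
share_with_S = Y / (Y + N + S)        # endorsing share of all papers

print(round(share_ignoring_S, 3))  # 0.971
print(round(share_with_S, 3))      # 0.1
```

The same Y and N yield a headline figure near 97% or near 10% depending entirely on what you do with S, which is exactly the information the method throws away.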
And as I mentioned much earlier, mitigation papers have no obvious rejection counterparts – there is no way for social science papers to count as rejection, for a survey of the general public that is not about AGW to count as rejection, for a report on solar panels or gasoline consumption in Nigeria to count as rejection, for an engineering paper that is not about extracting hydrogen from seawater to count as rejection, for a physicist to not call for funding for a tokamak reactor and count as rejection, for a study of Taco Bell commercials, instead of TV coverage of AGW, to count as rejection, and so on and so forth... You can run the math with the starting conditions and see how counting mitigation papers rigs the results. It will be similar for impacts. Ultimately, this inflates the results by conflating interest with endorsement.
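Running that math with hypothetical starting conditions makes the one-sidedness visible: mitigation papers can only ever land in the endorsement column, so piling them on pushes the reported percentage toward 100% regardless of the underlying split in the core climate literature. The figures below are illustrative, not taken from the study:

```python
# Hypothetical starting conditions: a core literature with some split,
# plus mitigation papers that by construction can only count as
# endorsement (a solar-panel paper has no way to "reject" AGW).
core_endorse, core_reject = 80, 20    # an 80% split in the core literature
reported = []
for mitigation in (0, 100, 500, 1000):
    endorse = core_endorse + mitigation   # mitigation lands only here
    reported.append(round(endorse / (endorse + core_reject), 2))

print(reported)  # the reported share climbs as mitigation papers pile up
```

Whatever the true split in the core literature, adding enough one-sided mitigation papers drives the headline number wherever you like.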
That method also selects for activism. You can model it in your head -- just iterate it a few times, on various issues (yes, I know I keep saying you can model things in your head – I believe it's true.) For example, it's quite plausible that using this absurd method we could report a consensus on intelligent design. Just count the papers that support it (for fun, count religious studies and psychology papers.) Then count the ones that reject it. Ignore all the papers on evolutionary biology that don't talk about ID. Voila! We could easily have a consensus, since I doubt many evolutionary scientists ever talk about ID in journals. (I have no idea if the numbers would come out like this for ID, but it's plausible, and you get my point. The numbers will be very malleable in any case.) At this point, I'm frustrated... This is all too dumb. It sounds mean, I know, but I'm just alarmed by how dumb all of this is. There are too many dumb things here: the use of lay political activists as subjective raters of science abstracts for their implications for their cause, the egregious fraud in falsely claiming to use the methods a subjective rater study requires while violating them with abandon, the inclusion of an alarming number of papers from the early 1990s, the exclusion of a massive number of relevant papers from this century, the inclusion of psychology and marketing papers, windmill cost studies, analyses of TV coverage, surveys of the general public, and magazine articles, the rigged categorization scheme in concert with the fallacy of demanding proof of a negative, the structural asymmetry of including mitigation papers, the calculating of a consensus by simply summing together a bunch of papers that make passing remarks about climate change, along with a few climate science papers (but not Lindzen's), the raters sniffing out skeptic authors and announcing them to all the other raters, raters at home able to google papers and reveal their authors at will, the unprecedented failure of a subjective rater study to report interrater reliability...
I need -- hopefully we need -- science to not be dumb, journals to not be dumb. I'm anti-dumb. I expect people to have a basic grasp of logic, epistemology, and to know the gist of decades of research on bias, or even just what wise adults understand about bias. This whole paper is just too dumb, even if it weren't a fraud case. We've been asleep at the wheel – this was a failure of peer review. And the way people have strained to defend this paper is jaw-dropping. This paper never deserved any defense. The issues with it were too obvious and should not have been ignored, and now the fraud should definitely not be ignored. The issues were also very predictable given the starting conditions: the nature of the raters and the design of a study that empowered them to create the results they wanted. This was a waste of everyone's time. We don't need to be this dumb. Subjective rating studies that use political activists with profound conflicts of interest regarding the outcome, and the power to create said outcome, are bad jokes. We need to be serious. Science has always been associated with intelligence – people think we're smart – but all this junk, these ridiculous junk studies being published and cited, makes science a domain with lower standards of data collection than the market research Jiffy Lube might commission. Science needs to be serious – it can't be this dumb or this junky, and journals and bodies that publish this junk, especially fraudulent junk, need to be penalized. This is not acceptable.
The average layperson can identify this stuff as stupid and invalid, and many of them have – we can't keep doing this, tolerating this. The irony of the barbaric "denier! denier!" epistemology is that the targets of the smears are often smarter than the false ambassadors of science who smear them. It was a terrible mistake for these editors, bodies, and universities to ignore the laypeople who pointed out some of the flaws with this work. The laypeople were right, and the scientific authorities were wrong. This cannot be allowed to become a pattern.
There are no results
Some commenters have tried to apply arithmetic based on the ever-growing list of non-climate papers up top (18 so far) to claim that they only slightly change the results. That's amazing. They've missed the point entirely, actually several points. There are no results. There are no results to change. We never had a 97%. Meaningful results require valid methods. This is what science is about – if you're not taking the validity of methods seriously, you're not taking science seriously. This is a basic principle that's already been explained thousands of times in various contexts.
1. When you have a study that used lay political activists to subjectively rate scientific abstracts on the issue of their activism, empowered to create the results they wanted, there are no results. There is nothing we can do with such an invalid study. To my knowledge, no one has ever done this before, and we'd never accept this as science on any other topic, nor should we here (we don't know what Oreskes did in 2004, because that paper is one page long, and leaves out the details.)
2. When you have a study that falsely describes its methods, which is normally understood as fraud, there are no results. The study goes away. We'd need a new definition of fraud, or extraordinary exonerating circumstances, for it to be retained. They didn't use the methods they claimed – this means we don't know what they did with respect to each rating, how often they communicated with each other about their ratings, how often they shared papers with each other, exposing the authors, etc. For deterrence purposes alone, I think we'd want to retract any study that lies about its methods, and I think that's what we normally do in science. It cannot become acceptable to lie about methods.
3. When you have a study where subjective raters tasked with rating authored works were not blind to authors and were not independent in their ratings, there are no results. A subjective rater study of this sort is not valid without blindness and independence. You certainly can't generate percentages of agreement or disagreement when ratings were not independent, and there are no valid ratings to begin with without blindness to authors. (This is all magnified by point 1 and 2.)
4. When you have a study that counted an unknown number of social science, psychology, and survey studies as scientific evidence of consensus on anthropogenic warming, there are no results – we no longer know what's in the data. None of the results, none of the percentages, are valid in that case, and this voids the second part of the study that collected authors' self-ratings of the papers (with a 14% response rate), because now we don't know how many of those authors were psychologists, social scientists, pollsters, engineers, etc. None of those papers are relevant to this study, and like the false claims about blindness and independence, the authors falsely claimed social science and surveys were excluded. When someone reveals even a couple of such papers in their results, from a few minutes with their database, the proper response is a complete inventory and audit by someone. It's pointless to just recalculate the results, subtracting only those handful of papers, as though we know what's in the data.
5. When you have a study of the consensus that includes mitigation and impacts categories, which lack natural disconfirming counterparts, especially in the case of a large number of engineering papers that were counted as endorsement, this creates a systematic bias, and again, you don't have results given that bias. That issue would have to be dealt with before computing any percentage, and issues 1, 2, and 3 should each void the study by themselves.
6. When you have a study that just counts papers, there's no point touting any percentage, any results, until someone credibly explains why we're counting papers. This has no meaning unless it has meaning. It's not at all obvious why we're counting papers and how this represents consensus. The strange search results would have to be explained, like why there's nothing by Dick Lindzen since 1997. This all requires careful thought, and we don't know what the right answers are here until we think it through. It's not going to be automatically valid to run a sketchy search, add up papers like the Count on Sesame Street, and go home. All the de facto weights that such a method applies need to be addressed (see above). And maybe this sounds wild and crazy, but you might want to define consensus before you start adding papers and calling it consensus. They never define or discuss any of this. It would be purely accidental for counting papers to represent a valid measure of consensus or something like it – there are so many potential biases and unintentional weights in such a method.
And all the other issues... Note that finding an error somewhere in this vast essay isn't going to salvage the study. Salvaging the study requires direct engagement with these issues, requires some sort of argument and evidence that rebuts these issues, and not just one of them. Attacking me won't do it. Saying this all needs to be in a journal to be credible, as the authors claimed, won't do it, especially since their paper was in a journal, so being in a journal doesn't seem to mean much. The claims here need to be addressed directly, and their failure to do so is extremely disappointing and telling. They don't dispute the evidence in the online forums, and anyone who is fluent in English can read their paper and see that they claimed to be blind and independent, that they claimed to exclude social science, can process the other issues, etc. We don't need a journal to tell us what's real – they're taking this epistemology of "consensus" too far. Reality isn't actually determined by what other people think, even journal reviewers – what others think is a crude heuristic useful in particular circumstances. Anyone can read this report and refer back to the paper. In any case, we don't normally handle fraud with journal submissions, and I'm extremely reluctant to legitimize this junk by addressing it in a journal – it should simply be retracted like every other fraud case, or invalidity case. We can't let postmodernism gash science. (Also, look at the Corrigendum in the current issue of ERL – those authors fixed a typo. The 97% authors are guilty of far more than a typo.)
[My fellow scientists, let's huddle up for a minute. What are we doing? What the hell are we doing? I'm mostly speaking to climate scientists, so the "we" is presumptuous. Is this really what you want? Do you want to coarsen science this much? Do we want to establish a scientific culture where scientists must take polar positions on some issue in the field? Do you want to tout a "consensus" that ignores all those who don't take a polar position? Do we want to import the fallacy of demanding that people prove a negative, a fallacy that we often point out on issues like evolution, creationism, religion, and so forth? Modern scientific culture has long lionized the sober, cautious scientist, and has had an aversion to polar positions, simplistic truths, and loyalty oaths. Do we mean to change that culture? Have we tired of it? Are we anti-Popper now? No one is required to be Popperian, but if we're replacing the old man, it should be an improvement, not a step back to the Inquisition. Do we want dumb people who have no idea what they're doing speaking for us? Are you fraud-friendly now, if it serves your talking points? When did we start having talking points?
In any case, what the hell are we doing? What exactly do we want science to be and represent? Do we want "science" to mean mockery and malice toward those who doubt a fresh and poorly documented consensus? Do we want to be featured in future textbooks, and not in a good way? When did we discover that rationality requires sworn belief in fresh theories and models that the presumed rational knower cannot himself validate? When did we discover that rationality requires belief in the rumor of a consensus of researchers in a young and dynamic field whose estimates are under constant revision, and whose predictions center on the distant future? (A rumor, operationally, since laypeople aren't expected to engage directly with the journal articles about the consensus.) Who discovered that rationality entails these commitments, or even argued thusly? Give me some cites, please. When did we discover that people who doubt, or only mildly embrace, the rumor of a consensus of researchers in a young and dynamic field whose estimates are under constant revision, and whose predictions center on distant future developments, are "deniers"? When did science become a church? When did we abandon epistemology? Again, what are we doing?]
Those climate scientists who defended this garbage upset me the most. What are you doing? On what planet would this kind of study be valid or clean? Are you unfamiliar with the nature of human bias? Is this about environmentalism, about being an environmentalist? Do you think being a staunch leftist or environmentalist is the default rational position, or isomorphic with being pro-science? Do you think that environmentalism and other leftist commitments are simply a set of descriptive facts, instead of an optional ideological framework and set of values? Do you understand the difference between 1) descriptive facts, and 2) values and ideological tenets? I'm trying to understand how you came to defend a study based on the divinations of lay political activists interpreting scientific abstracts. Those scientists who endorsed this study are obligated to openly and loudly retract their endorsement, unless you think you can overcome the points raised here and elsewhere. I really want to know what the hell you were thinking. We can't be this sloppy and biased in our read of studies just because they serve our political aims. The publication and promotion of a study this invalid and fraudulent will likely impact the future reception of valid studies of the climate science consensus. You might say that we should've hushed this up for that reason, that I should've remained silent, but that just takes us down another road with an interesting outcome.
As I said, I was puzzled that ERL editor Daniel Kammen did not respond to any of the issues I raised. I contacted him on June 5. For over a month, there was no reply from him, not a word, then he finally emerged to suggest that some of the issues I raised were addressed on the authors' website, but did not specify which issues he was referring to. To my knowledge, none of these issues are addressed on their website, and many are not even anticipated. He's had nothing to say since, has never had any substantive response, has not argued against any allegation, has not corrected any error on my part, has not claimed that they did in fact follow their stated methods, or that the alleged fraud wasn't fraud at all, nothing at all.
Lack of familiarity with subjective rater studies might explain some of the initial reticence, but we can't have science work like this, where editors and other authorities are completely silent in response to such disclosures or allegations, and offer no substantive defense, or even insight, on issues as severe as those here. Everything breaks down if this is how science works, where there's no evidence of any cognitive activity at all in response to such reports. It would anchor institutional bias and corruption, and advantage power and establishment interests. I was surprised to later discover that Dr. Kammen advises President Obama, who widely publicized, benefited from, and misrepresented the results of the study (by ascribing 97% agreement with a dangerousness variable that does not exist in the study.) We need to add political affiliations and ideology to our prevailing accounts of conflicts of interest, since such allegiances are likely to pull as strongly as a few scattered checks from oil companies.
I was further stunned that editor Daniel Kammen promoted, on his blog, the false Presidential tweet that 97% of scientists think AGW is "dangerous", and continues to promote it. The study never measured anyone's views of the dangerousness of AGW, not even as a scam. It was not a variable, nor was any synonym or conceptual cousin of danger, severity and the like – it's simply not in the study. It's incredible that a scientist and editor would behave this way, would promote a politician's manifestly false tweet, and would be comfortable reducing his field to a tweet to begin with. We seem to have suspended our basic scientific norms and standards regarding the accurate representation of the findings of research. This is rather dangerous. Run the model in your head. Iterate it a few times. You can easily see what could happen if it became normal and okay for scientists and editors to falsely assert findings that were never measured, not even mentioned, in an article. Run the model.
Because of Dr. Kammen's non-response, I escalated the retraction request to IOP, the publisher, on July 28, where it currently stands, and asked that they exclude Dr. Kammen from the decision given his profound conflict of interest. No word yet, just a neutral update that they were working on it. IOP seems quite professional to me, and I hope it's retracted. If they didn't retract a study that made false claims about its methods, that made it impossible to calculate interrater agreement, that included a large number of social science, survey, and engineering papers, and whose core methods are invalid, we'd probably want to know what is retraction-worthy. I don't think science can work that way.
Anyone who continues to defend this study should also be prepared to embrace and circulate the findings of Heartland or Heritage if they stoop to using a bunch of political activists to subjectively rate scientific abstracts. If ERL doesn't retract, for some unimaginable reason, they should cheerfully publish subjective rater studies conducted by conservative political activists on climate science, Mormons on the science of gay marriage, and Scientologists on the harms of psychiatry (well, if ERL weren't just an environmental journal...) This ultimately isn't about this study – it's about the method, about the implications of allowing studies based on subjective ratings of abstracts by people who have an obvious conflict of interest as to the outcome. Science critically depends on valid methods, and is generally supposed to progress over time, not step back to a pre-modern ignorance of human bias.
I think some of you who've defended this study got on the wrong train. I don't think you meant to end up here. I think it was an accident. You thought you were getting on the Science Train. You thought these people -- Cook, Nuccitelli, Lewandowsky -- were the science crowd, and that the opposition was anti-science, "deniers" and so forth. I hope it's clear at this point that this was not the Science Train. This is a different train. These people care much less about science than they do about politics. They're willing to do absolutely stunning, unbelievable things to score political points. What they did still stuns me, that they did this on purpose, that it was published, that we live in a world where people can publish these sorts of obvious scams in normally scientific journals. If you got on this train, you're now at a place where you have to defend political activists rating scientific abstracts regarding the issue on which their activism is focused, able to generate the results they want. You have to defend people counting psychology studies and surveys of the general public as scientific evidence of endorsement of AGW. You have to defend false statements about the methods used in the study. Their falsity won't be a matter of opinion -- they were clear and simple claims, and they were false. You have to defend the use of raters who wanted to count a bad psychology study of white males as evidence of scientific endorsement of AGW. You have to defend vile behavior, dishonesty, and stunning hatred and malice as a standard way to deal with dissent.
I think many of you have too few categories. You might have science and anti-science categories, for example, or pro-science and denier. The world isn't going to be that simple. It's never been that simple. Reality is a complicated place, including the reality of human psychology and knowledge. Science is enormously complicated. We can't even understand the proper role of science, or how to evaluate what scientists say, without a good epistemological framework. No serious epistemological framework is going to lump the future projections of a young and dynamic scientific field with the truth of evolution, or the age of the earth. Those claims are very different in terms of their bodies of evidence, the levels of confidence a rational person should have in them, and how accessible the evidence is to inquiring laypeople.
Cognition is in large part categorization, and we need more than two categories to understand and sort people's views and frameworks when it comes to fresh scientific issues like AGW. If our science category or camp includes people like Cook and Nuccitelli, it's no longer a science category. We won't have credibility as pro-science people if those people are the standard bearers. Those people are in a different category, a different camp, and it won't be called "science". Those climate scientists who have touted, endorsed, and defended the Cook et al. study – I suggest you reconsider. I also suggest that you run some basic correction for the known bias and cognitive dissonance humans have against changing their position, admitting they were wrong, etc. Do you really want to be on the historical record as a defender of this absurd malpractice? It won't age well, and as a scientist, certain values and principles should matter more to you than politics.
If you're always on the side of people who share your political views, if you're always on the side of people who report a high AGW consensus figure, no matter what they do, something is wrong. It's unlikely that all the people who share our political perspectives, or all the studies conducted by them, are right or valid -- we know this in advance. We need more honesty on this issue, less political malice, better epistemology. I don't think science has ever been as distrusted in the modern era as it is today. When the public thinks of science, it should not trigger thoughts of liars and people trying to deceive them and take advantage of them. Journals need to take responsibility for what they do, and stop publishing politically motivated junk. Sadly, this paper is self-refuting. A paper-counting study assumes that the papers they're counting are valid and rigorous works, which assumes that peer review screens out invalid, sloppy, or fraudulent work. Yet the Cook paper was published in a peer-reviewed climate journal. That it survived peer review undermines the critical assumption the study rests on, and will be important inductive evidence to outside observers.
So you want to know what the 97% is? You really want to know? It's a bunch of abstracts/grant applications that say: "We all know about global warming. Let me tell you about my atomic layer deposition project." "You all know the earth is melting. Let me tell you about my design for a grapeseed oil powered diesel engine." "We've all heard about global warming. Here we report a survey of the public." "...Denial of improved cooking stoves." Let's call that phenomenon A.
Now let's factor in a bunch of militant political activists rating abstracts on the issue of their activism, and who desire a certain outcome. Call that B.
Let's also factor in the fact these militant political activists are for the most part unqualified laypeople who will not be able to understand many science abstracts, who have no idea how to do a proper literature search or how to conduct a proper subjective rating study, have never heard of interrater reliability or meta-analysis, violate every critical methodological feature their study requires, and lie about it. Call that C.
Then add a politically biased journal editor who has a profound conflict of interest with respect to the findings, as he works for the politician whose aims such findings would serve, and which were widely touted and misrepresented by said politician. Call that D.
A + B + C + D = 97%
"97%" has become a bit of a meme over the past year. I predict that it will in the coming years become a meme of a different sort. "97%" will be the meme for scientific fraud and deception, for the assertion of overwhelming consensus where the reality is not nearly so simple or untextured. It may become the hashtag for every report of fraud, a compact version of "9 out of 10 dentists agree" (well, I'm abusing the definition of meme, but so does everyone else...) Because of this kind of fraud, bias, and incompetence, science is in danger of being associated with people who lie and deceive the public. Excellent. Just fantastic. Politics is eroding our scientific norms, and possibly our brains.
The laypeople who first identified the fraud in these cases and contacted the relevant authorities were roundly ignored. In the two cases I've covered, the evidence is surprisingly accessible, not rocket science, and the Australian universities who hosted the researchers have been inexcusably unwilling to investigate, at least when asked by others. AAAS, who leaned on this fraud, has an Enron culture of denial and buck-passing. These institutions have become part of the story in a way they shouldn't have. The system is broken, at least as far as these politically motivated junk studies are concerned, and most of the responsible gatekeepers have been unwilling to discharge their ethical and scientific responsibilities, and should probably be discharged of those responsibilities. If this is science, then science is less rigorous than any plausible contrast objects we'd set it against – it would be the least rigorous thing we do. Some scientific fields are better than this. They retract papers all the time, often the authors themselves, for non-fraud reasons. Fraud is taken dead seriously, and a fraud report will be investigated thoroughly.
We've bumped into some corruption here. I'm disappointed in the extremely low quality of the arguments from the fraud-defenders and wagon-circlers (calling me a "right-wing extremist" or asking whether other people agree with me won't do it, and the latter scares the hell out of me, as it might signal that an absurdly primitive and simplistic epistemology of consensus actually enjoys a non-zero rate of popularity in academic circles). I'm also disappointed in the more common silence from the erstwhile defenders of this junk. In both cases, no one is refuting anything or presenting any sort of substantive argument. We're taking lots of risks here, potentially rupturing the classic association between science and fact, or between science and the rational mind. We can't allow science to become a recurrent source of deception and falsity, or risk the very concept of science devolving into a label for an alternative lifestyle, a subculture of job security and unreliable claims. That outcome seems distant, but if we don't address behavior like the conduct and publication of this study, we'll probably see more of it.
Eyes in space
One of my major interests outside of social science is space-based observatories and exoplanet detection. I've long thought that we need to figure out a way to manufacture monolithic telescope mirrors in space, and I've thought of a couple of different ways to do that. But I did not think of this:
"Under his plan, a 3-D-manufacturing vendor will fabricate an unpolished mirror blank appropriate for his two-inch instrument. He then will place the optic inside a pressure chamber filled with inert gas. As the gas pressure increases to 15,000 psi, the heated chamber in essence will squeeze the mirror to reduce the surface porosity — a process called hot isostatic pressing."
This is excellent. The reason why we need to manufacture mirrors in space is that we need more aperture, and we're limited in the size of mirror/telescope that we can launch into space. Hubble is 2.4 meters. As far as I know, this is the largest monolithic mirror we have in space. The upcoming James Webb Telescope is 6.5 meters, but it's segmented, not monolithic. The mirror is folded for the trip up. Segmented mirrors allow a larger aperture, but they won't reflect as much light as a monolithic mirror -- I think they might only be as good as a monolithic mirror of about half the size, but I'm not sure if there's a consistent factor conversion there.
For reference, the largest mirrors anyone has made are in the 8.4 meter ballpark that you see in the Large Binocular Telescope, for example. They're incredibly hard to make, and very massive. Earth's gravity is a major constraint on making larger mirrors with the required level of precision and stability. Making them in space eliminates a lot of constraints. (The huge next-generation terrestrial telescopes you hear about all use segmented mirrors, like the creatively named European Extremely Large Telescope and the proposed Overwhelmingly Large Telescope.)
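The aperture arithmetic is worth making concrete: light-gathering power scales with the square of mirror diameter, so modest gains in aperture pay off quadratically. A quick sketch using the diameters mentioned above (geometric collecting area only, ignoring central obstructions and segment gaps):

```python
import math

def collecting_area(diameter_m: float) -> float:
    """Geometric collecting area of a circular mirror, in square meters."""
    return math.pi * (diameter_m / 2) ** 2

hubble = collecting_area(2.4)     # Hubble's monolithic primary
jwst = collecting_area(6.5)       # James Webb's segmented primary, treated as a circle here
lbt_class = collecting_area(8.4)  # one LBT-class monolith, the largest ballpark made to date

print(f"Hubble: {hubble:.1f} m^2")
print(f"JWST vs Hubble: {jwst / hubble:.1f}x the light")
print(f"An 8.4 m monolith vs Hubble: {lbt_class / hubble:.2f}x the light")
```

So a single LBT-class monolith in space would gather over twelve times Hubble's light, which is the whole appeal of manufacturing mirrors up there rather than launching them folded.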
Why do we need larger telescopes in space? So we can see more. If it were in space, and with IR, the LBT would be able to directly image exoplanets and characterize their atmospheres. Space-based telescopes are far more sensitive than terrestrial, since they don't have to deal with the distortion and IR blocking of earth's atmosphere (Hubble is about as good as the LBT, even though it's much smaller, because it's in space.) We might be able to infer, with some confidence, the presence of life on exoplanets by the composition of their atmospheres, depending on a few things (FYI, an exoplanet is a planet outside of our solar system, a planet orbiting another star, or in rare cases flying alone through interstellar space.) Almost nothing would be more significant in human history than the discovery of life elsewhere.
A larger telescope would also allow us to observe much older stars and galaxies, and possibly see the very first galaxies form (astronomy is another word for time travel...) In fact, this is precisely what we expect the James Webb to do. But it would be so much better if it were a monolithic mirror.
Kenny and The Castle
The recent salience of Australia obligates me to apprise you of two of the greatest movies ever made in the Queen's: The Castle and Kenny. Magnificent works, just a beautiful sense of life.
Reality is what it is
Quick but important point of clarification...
Someone sent me this paper as a presumed example of bias or invalid research:
Eidelman, S., Crandall, C. S., Goodman, J. A., & Blanchar, J. C. (2012). Low-effort thought promotes political conservatism. Personality and Social Psychology Bulletin, 38(6), 808-820.
I don't see anything wrong with it. I think it's pretty good.
I should stress this: That a study finds something unfavorable about conservatives does not mean that it's invalid.
Reality is whatever it is. Our job is to go out and discover it. I would never expect conservatism to not be associated with anything unfavorable, and I hope you don't expect that either (and a lot of these "unfavorable" things won't be unfavorable in everyone's eyes. Read Phil Tetlock's work.)
Nor would I expect libertarian, liberal, or Rastafarian views to not correlate with any purportedly negative traits. Social science in our era is somewhat coarse, often about linear correlations between broad constructs like "conservatism" and "low-effort thought". (I don't want to imply that Eidelman and colleagues' research is "coarse" – they were very smart, used a variety of methods, and have better than average ecological validity. "Low-effort thought" seems too vague or broad to hang together as a cohesive construct, but the devil is in the details, and this could be Stage 1 of their research...)
What "low-effort thought" is is worth a lot of effortful thinking, by social psychologists, cognitive psychologists, philosophers, epistemologists, and bartenders. (So is the commensurability of the operationalization of low-effort thought across the four studies in this paper.) That it predicts conservatism may or may not be interesting, and may or may not have any implications for the wisdom of conservatism. (The cheeky conservative response to their Study 1, where blood alcohol level predicts conservative views, is, inevitably, in vino veritas...) But again, reality is whatever it is. We can't just avert our eyes, or run from it. If, circa 2030, we have a large body of evidence that low-effort thought predicts conservatism, and we've sorted out what low-effort thought is to everyone's satisfaction, then that's just the reality. This is what science is for.
Also, please keep in mind that correlations between X and Y do not imply that all members of X are high in characteristic Y. As I say on my Media Tips page (Tip #5), a minority of a sample or group can drive significant effects (e.g. correlations), and often does.
For example, 40% of low-effort thinkers could have endorsed conservative views in their studies, while 25% of controls did, and that kind of difference could easily drive a positive correlation between low-effort thinking and conservative views – even though a majority of low-effort thinkers in the study did not endorse conservative views. I'm just making up those numbers to illustrate and simplify – I have no idea if their data reflects that pattern, and I'm not motivated to dig into their data, or people's data in general, despite recent events.
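That made-up illustration is easy to check directly. Using the hypothetical rates above (40% of a low-effort group endorsing vs. 25% of controls), with 100 people per group (my choice of group size, picked only to keep the arithmetic clean), a clearly positive correlation emerges even though most of the low-effort group did not endorse:

```python
# Hypothetical counts only, matching the illustrative rates in the text:
# 40 of 100 "low-effort" participants endorse, 25 of 100 controls endorse.
group = [1] * 100 + [0] * 100            # 1 = low-effort condition, 0 = control
endorse = [1] * 40 + [0] * 60 + [1] * 25 + [0] * 75

def pearson(xs, ys):
    """Pearson correlation, hand-rolled so the example is dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

r = pearson(group, endorse)
print(f"r = {r:.3f}")  # positive, even though 60% of the low-effort group did not endorse
```

The correlation comes out around 0.16 with these counts: a publishable-looking effect driven by a minority of the sample.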
What's more, even when a majority of people in group X are high on trait Y, please keep in mind that correlations between X and Y say nothing about any individual, including yourself.
I also think people might be implicitly flipping the direction of the Eidelman et al. studies such that being a conservative (X) implies that one is a "low-effort" thinker (Y) – I suspect that's how lots of laypeople will parse the title. They actually describe it as low-effort thought promoting conservative views, which is a subtle, but potentially very important, distinction. (I encourage you to dig into the syllogistic fallacies that address some of the issues that can arise when we interpret findings and flip directions.) Keep in mind that these were experimental manipulations where they subjected people to cognitive load or time constraints, and found that they were subsequently more likely to endorse conservative views. Or they queried drinkers at a bar while also measuring their BAC (Study 1).
I think people are taking these findings personally, like they're being called low-effort thinkers. There's no need to assume that. And we should know, in advance, that scientists will occasionally discover links between one's political affiliations and unflattering things (or at least things that sound unflattering.) That's life.
In any case, this research looks perfectly valid. I'd worry a little bit about putting such a simple and suggestive title out there, especially in this environment when we know we have profound biases as a field with respect to conservatives, when we might want to work to restore our reputation. But the studies look clean to me. You can probably get the paper here if you don't have access to PSPB.
How data can be overrated
This is something I wrote some time ago, and have quickly updated. I'll have a methods paper soon that digs into these issues in much greater detail.
Data is overrated. This is a terrible, terrible thing for a scientist to say. Well, except that it’s true, so it’s a good thing for a scientist to say.
I don’t question, remotely, the central empirical premise of social science or any other science. Social science requires the rigorous collection of data, and always will. This is good and fine, and often thrilling and wonderful. I mean something different.
I think there's a problematic epistemology, particularly in social psychology, that doesn’t distinguish between the kinds of claims that require data and those that don’t. There’s a broad lack of awareness of the epistemic authority of logic – that we can know some very important things through valid reasoning. This has consequences for the validity of research, and what happens after invalid research is published. We've somehow combined rationalism and empiricism, two sides of a classic dichotomy in philosophy. We're rationalistic in our empiricism. Let me explain...
In 2011, I pointed out that Napier and Jost (2008) is invalid. In that paper, they wanted to explain why conservatives are happier than liberals, and asserted that it was because conservatives “rationalize inequality." Strangely, they never measured rationalization of inequality. I don’t mean that they used a measure that I don’t like – I mean they never measured it at all (and they didn’t collect any data, just used public datasets like the National Election Survey and relabeled the variables.) In one study, they took answers to the question of whether hard work tends to pay off in the long run – just that one item – and called it “rationalization of inequality”. In another study, they took a 6-item measure of simple attitudes toward inequality, and called it, again, “rationalization of inequality”. They converted an attitudes measure into a measure of rationalization of those same attitudes – I’ve read a lot of papers, but I’ve never seen a social psychologist do that.
The studies are invalid. They didn’t measure the construct. Rationalization is a cognitive process, one that social psychologists are quite familiar with, dating to classic research on cognitive dissonance. We can’t just assert that people are rationalizing something. We can’t ask people whether they agree with our political views, and call it rationalization if they don’t.
I argued that the paper was completely invalid, that it carries no knowledge pertinent to its claims or hypotheses. (The authors or journal should retract it.) I received lots of supportive e-mails from researchers all over the world, but I was surprised when a couple of them said that my debunking would be even more convincing “with data.” Hold that thought…
I recently wrote about the Lewandowsky, Oberauer, and Gignac (2013) scam (after lots of other people did.) In their title, they linked moon-landing hoaxism with climate science hoaxism, even gave it a causal direction. There was no such relationship. One reason is that virtually no one in the study endorsed the moon-landing hoax – only 10 out of 1145 participants, the lowest level of endorsement anyone has ever reported, to my knowledge. We can’t link moon hoaxism to anything if only 10 of 1145 endorse it (0.8%). Even worse, 7 of those 10 rejected the climate hoax idea, so going with the causal direction in the title, if one believes the moon landing was a hoax, one is unlikely to believe that climate science is a hoax. And if we run a logistic regression – the proper analysis for wildly skewed and substantively dichotomous data, not a Pearson correlation – with moon hoaxism predicting climate hoaxism, there is no effect. However, I don't think any analysis is appropriate when only 10 people endorse the moon hoax – 0.8% is less than the sampling error for the survey, so it's essentially zero. In any case, if we lived in a universe where we could make inferences from 10 people in a sample of 1145, and the titles of our papers were based on 0.8% of our participants, the title here would have to be the opposite, e.g. NASA Faked the Moon Landing, Therefore (Climate) Science is True.
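To see why 10 endorsers can't support a link, we can dichotomize the two items and test the resulting 2×2 table. The moon-hoax row (3 of 10 endorsers also endorsing the climate hoax) comes from the counts described above; the climate-hoax rate among the other 1135 respondents is not reported here, so the 15% baseline below is purely a hypothetical assumption for illustration. A Fisher exact test (which asks the same question as the coefficient in a logistic regression on this dichotomized table) finds nothing, and would find nothing across a wide range of baselines, because n = 10 carries essentially no power:

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for a 2x2 table [[a, b], [c, d]].
    Sums the hypergeometric probability of every table with the same
    margins whose probability is <= that of the observed table."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo = max(0, row1 - (n - col1))
    hi = min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Moon-hoax endorsers: 3 endorse the climate hoax, 7 do not (from the counts
# described in the text). Non-endorsers: a HYPOTHETICAL 15% baseline (170 of 1135).
p = fisher_exact_two_sided(3, 7, 170, 965)
print(f"p = {p:.3f}")  # far from significant
```

Even this test is arguably beside the point: the position argued above is that 10 of 1145 is within sampling error, i.e. effectively zero, so no analysis is appropriate at all.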
(Ethics Pause: Think about how you would relate to someone if you thought they believed the moon landing was a hoax. Or if they disputed that HIV causes AIDS? (11 people in their sample, from which they asserted other effects...) Would you hire them? Would you date them? What Lewandowsky, Oberauer, and Gignac did is a disgrace.)
They made similarly false claims in the abstract and body of the paper, linking people to beliefs that virtually no one in their sample endorsed – numbers far too trivial (11 or 16 people out of the 1145) to be linked to the mainstream views to which they tried to link them. It was a scam. They abused linear correlation statistics driven by variance between people choosing Strongly Agree vs Agree, and told the world that people “rejected” claims that they in fact agreed with.
Once more, a scientist who agreed with me also said that my debunking would be even more convincing “with data.”
What This Is
This is recurrent, and there are many examples of this mindset. It’s a profound epistemological and scientific error.
The examples above are cases where the research is invalid. In the Lewandowsky case, we might just say their claims are false, since they actually measured the variables they talked about, so we can look and see that the claims are false. (There we’d probably say that their analyses were invalid and some of their claims false and unconscionable.)
When people say that “data” would help make a case against these papers, it tells me that they don’t know what I mean when I say that a study is invalid, and I think I mean validity in the normal sense.
I think it comes to a couple of problems in how we as a field understand knowledge as such.
First, a fundamentally invalid study does not represent knowledge. It carries no findings. To say that a study is invalid normally means that the study's method is not a legitimate method of knowing about the things the study wanted to know about. Social psychologists are granting residual epistemic standing to published work, even after it's been shown to be invalid, even if they agree that it's invalid. This might be a crude social proof thing – people might think that if a paper is published it must have some legitimacy or validity. This belief will be false across all scientific fields, and I think it's a very vulnerable assumption to have about social science papers in the current era.
But there’s something else here. Say we have papers like the ones above. Then someone comes along and argues that they’re invalid, or even false by their own data. What sort of knowledge claims are these?
They're claims about the validity of research, or about the data behind claims the authors made. These kinds of claims do not require new data. Let’s linger on this...
Let’s say I measure sunshine and call it self-esteem.
You observe that I measured sunshine and called it self-esteem. You then say, hopefully very stridently, that my study is invalid, as are any claims I made.
What determines whether your argument is true? Whether I indeed measured sunshine and called it self-esteem. That’s what will determine the truth of it.
How would (new) data help your argument? Remember the claim is that I measured sunshine and called it self-esteem. What new data could you collect to further establish that I did so?
There is none. Either I measured sunshine and called it self-esteem, or I did not. If I did, my study is invalid. It doesn’t matter if I got it published in Psychological Science. It’s still invalid. It doesn’t matter if other people cited it, if it helped get me tenure, or if Chris Mooney touted my findings. None of that matters. My study is fundamentally invalid, which makes it meaningless. It does not represent knowledge about self-esteem, does not carry data about the construct. It’s vacated, as a matter of scientific knowledge, regardless of whether I have the integrity to retract it.
Is this clear? The nature of a validity argument is that it rests on what happened in the study – it’s about features of that study. It’s not about the empirical hypotheses. New data is not necessary, and often won’t even strengthen the argument. There are exceptions, but not the examples here. The burden is always and entirely on the researchers to support their claims – it's never on us to go out and collect data to refute an invalid study.
In the Lewandowsky case, what would we refute with data? The claims are false by their own data. That's the claim being made. When we say “10 of 1145 participants were moon hoaxists, and most were not climate hoaxists”, that’s a claim grounded in observations about their data. We don’t need any data of our own to observe their lack of data.
We’re granting an inappropriate privilege to papers and findings just because they’re published. I think some researchers haven’t fully wrapped their heads around the fact that a paper, or a set of studies, can simply be false, or invalid, or both. We need to have those categories, because they describe recurrent phenomena in any science. We don’t have a ready schema for "false paper" or "invalid paper", outside of traditional fraud cases. In the humanities, fields without data in the normal sense, the worst that can happen to a paper (absent fraud) is that it be forgotten, or superseded. There’s no “false” category. The only way to get retracted is plagiarism or fraud in one’s sources. In science, things can be false. Things can be invalid. Papers are retracted all the time, often by their own authors, for non-fraud reasons – for having made a mistake, being false or invalid. But not in research psychology. I never see people retract their papers when someone points out that they’re false or invalid. That’s not a sustainable situation. We can’t be like the humanities in that respect – we need to be like the sciences.
I’d like to zoom back a bit and address another branch of this issue of rationalistic empiricism, and the lack of awareness of the authority of valid reasoning. The two examples here will continue to serve us. In both cases, the research was doomed before it began.
Sometimes social scientists act as though we can’t know anything unless it’s published in one of our journals, comes with a p-value, etc. Now, I don’t think anyone would literally argue this if pressed – it’s just that they implicitly embrace something like this view much of the time. In reality, there is lots of valid knowledge outside of social psychology – the world abounds with knowledge, and sometimes knowledge and data outside our fields will have logical implications for our hypotheses.
For the N&J study, someone might have observed that since conservatives don’t necessarily object to economic inequality (neither do libertarians), they probably can't rationalize it. To rationalize something, it has to be objectionable, and we can't assume that income inequality is objectionable, either in the mind of a participant or as a descriptive fact about the world (the researchers seemed to assume that it was obviously objectionable.)
The Oxford English Dictionary is paywalled, so here's the definition from Dictionary.com: to ascribe (one's acts, opinions, etc.) to causes that superficially seem reasonable and valid but that actually are unrelated to the true, possibly unconscious and often less creditable or agreeable causes.
We definitely can't assume conservatives are doing that on income inequality – that would be biased and question-begging. So it’s not clear that the construct “rationalization of inequality” has any valid application to conservatives or other non-liberals, just as the "rationalization of abortion" would not be valid applied to pro-choice liberals and libertarians. It's a biased, question-begging scam to define "rationalization" around the researcher's problem with a social reality. I’m sure you could induce defensiveness in conservatives by showing them pictures of homeless people or something, but you could induce defensiveness in any camp, on any issue. I’m not sure that it would be interesting or meaningful.
(Note that their hypothesis on why conservatives are happier than liberals was refuted later by Schlenker, Chambers, and Le (2012). Their paper was not debunked – there was never anything to debunk since they didn't measure their construct, and therefore did not carry any knowledge about the hypothesis.)
Similarly, in the Lewandowsky case, we already knew, in advance, that climate skepticism isn’t going to be driven by beliefs like moon-landing hoaxism. We have survey data from professional pollsters. We know that very few people believe the moon landing was a hoax. And we know that a much larger number of people are skeptical of climate science. A recent survey showed that 7% of people believed the moon landing hoax, and 37% thought climate science was a hoax. Best case, 19% of climate hoaxists are moon-landing hoaxists (7/37). (It will be lower, because they won’t completely overlap.) If the vast majority of climate hoaxists/skeptics are not moon-landing hoaxists, which we already know from the polls, there’s no point trying to link them. Unless you’re trying to get at a specific cognitive process (which you won’t do with Lewandowsky’s crude surveys) or a subcategory of climate skeptics, there’s no point. It’s scientifically specious, and I think unethical, to draw links between a group and some far-out, damaging belief when the vast majority of the group simply does not hold that belief.
So in both cases here, reality constrains our research, our hypotheses, in ways that are not readily acknowledged in the field. We’re not granting logic and exogenous facts the authority that they truly have. Our research can be rendered meaningless, or deeply flawed, by facts available outside the field, facts that do not count as “data” by our normal reckoning, in concert with plain logic.
The third category is how we deal with the validity and reliability of scales (surveys). I’ve pointed out to researchers that the Social Dominance Orientation (SDO) scale might be invalid, only to have them respond “But it’s very reliable.” Reliability has nothing to do with validity (well, it actually does, but not for our purposes here.) But reliability comes as a number. As we’ve noted, people like numbers, data. Reliability can be expressed as a Cronbach alpha, or better yet, McDonald’s omega. Validity doesn't come as a number, as “data” (at least not in simple form, in one statistic.) Validity often requires logic, substantive reasoning about what things mean. The SDO is what I call a Caricature Scale. I’ll dig into this much more in an upcoming methods paper, but a Caricature Scale has specific features. Ultimately, they’re traps. They’re very reliable traps, in more ways than one, but they’re not generally valid, and are artifacts of a primitive early stage of social science. The SDO, for example, is written purely in the language of the American academic left (circa 1990.) It’s groups this and groups that, full of caricatured and straw man versions of vaguely conservative positions. It traps conservatives into endorsing the midpoint, perhaps out of confusion, while liberals endorse the low end. Voila! Conservatism is correlated with “Social Dominance Orientation”, even though many or most of them don’t actually endorse the items.
In summary, scientists need to be capable epistemologists. No data is inherently meaningful, and it’s quite easy for it to be meaningless. People are looking at r's and p's, without looking at the data and considering what it means. In our "data, data, data!" mindset, we're not paying enough attention to meaning, and ultimately, to reality. We need to understand that validity is decisively important, that invalid science is ultimately not science at all. We need to understand the epistemic authority and utility of logic and reasoning. We need to be more firmly seated in the sciences, as opposed to the humanities. We need to understand that many published papers do not represent or carry knowledge, at least not the knowledge claimed, and some papers are just false. We need to join our scientific cousins in other fields, and account for these realities by retracting such papers when we discover them. If we don’t, we risk accruing a large number of false claims and findings, which would strain our standing as a science.
* A positive "correlation" between the two variables in the Lewandowsky title has no meaning if only 10 people endorsed the moon hoax, and most of them rejected the climate hoax idea, especially given the explicit causal direction presented in the title, where we start with the moon hoax and look at the probability of endorsing the climate hoax. There we can just look at the probability, which is 0.3, making the title false. The valid analysis for these near-dichotomous items will normally be a logistic regression (see the more detailed posts below), which is not significant here (though even that is not valid here, because there is no data -- 10 people will be 0 for our purposes most of the time, but it's definitely 0 when we're talking about a wide open, sloppy online study with minors and fakes.) A stark 4-point scale of Disagreement and Agreement is not a continuous variable, and this will be elaborated in the literature soon. Agreeing with something is very different from disagreeing with something, and agreeing that the moon landing was a hoax is very, very different from disagreeing that the moon landing was a hoax. So if we only offer the four responses of Strongly Disagree, Disagree, Agree, and Strongly Agree, we cannot leverage variance between Strongly Disagree and Disagree (which is where 1135 out of 1145 participants were) to generate a linear correlation between the two items and talk as though agreement on the moon predicts agreement on the other -- there's no real agreement here. The correlation is actually between those who disagree with the one and disagree with the other. Call this Category 4 of where "data" can be overrated or misused. Correlation statistics sometimes have no meaning.
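The mechanism described in this footnote, a positive Pearson correlation manufactured entirely out of variance within the disagreement side of the scale, can be reproduced with hypothetical response counts shaped like the ones above. The 10 moon-hoax endorsers and the 3-of-10 conditional come from the counts described earlier; the 700/435 split of the remaining 1135 between Strongly Disagree and Disagree is my invention for illustration:

```python
# Hypothetical 4-point responses (1 = Strongly Disagree ... 4 = Strongly Agree).
# 700 people answer (1, 1) and 435 answer (2, 2): perfectly aligned DISAGREEMENT.
# The 10 moon-hoax endorsers answer 3 on the moon item; 7 of them reject the
# climate hoax (1) and 3 endorse it (3), per the counts described in the text.
moon = [1] * 700 + [2] * 435 + [3] * 10
climate = [1] * 700 + [2] * 435 + [1] * 7 + [3] * 3

def pearson(xs, ys):
    """Pearson correlation, hand-rolled so the example is dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

r = pearson(moon, climate)
endorsers = [c for m, c in zip(moon, climate) if m >= 3]
p_climate_given_moon = sum(c >= 3 for c in endorsers) / len(endorsers)

print(f"r = {r:.2f}")                                           # clearly positive
print(f"P(climate hoax | moon hoax) = {p_climate_given_moon}")  # 0.3
```

With these made-up counts the correlation comes out around 0.95, an impressive-looking "link" produced entirely by people disagreeing together, while the conditional probability implied by the paper's title sits at 0.3.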
José L. Duarte
Social Psychology, Scientific Validity, and Research Methods.