A colleague perused the Cook rater forums and sent me this example:
"I got one last night, though, that I thought was funny - I rated it as "Explicit >50%", because it *was* attributing all the warming to human GHG emissions. It just claimed there wouldn't *be* much warming, asserting that the IPCC had grossly over-estimated climate sensitivity & feedbacks. I had to look it up to see who the authors were - Michaels & Knappenberger. I laughed out loud when I found that out. :-D"
Like I've said, we'll never lack for fraud examples here. They claimed in their paper that raters were blind to authors, which they'd have to be to conduct valid subjective ratings, but they weren't. They broke blindness all the time, even divulging the authors to each other in the forums, mocking them, e-mailing entire papers, etc. I'm not sure everyone knew they were supposed to be blind to authors (and journal), or that the paper would subsequently claim that they were, or the crucial importance of independent ratings. They didn't even know what interrater reliability was. And of course the substantive nature of the above rater's decision – and of the coding scheme that empowered him – hits on other invalidity issues.
I stopped going through the discussions a long time ago, because the initial evidence was clear and I'd rather just do my own research. I've repeatedly said "there's much more" and I meant it – I was confused by people who took me to be presenting an exhaustive set of evidence, or tried to recalculate the 97% based on removing the handful of psychology and survey papers I listed. I supplied a batch of evidence that I thought was more than sufficient and expected the journal or IOP to do the rest – the burden of validating the ratings lies entirely with the authors, not anyone else, but that's not possible at this point.
A broader point I'd like to briefly expand on: There are no results. I'm getting other e-mails about them claiming the "results don't change" from the fraud or something like that – the e-mailers understand that this is absurd, but I think my report was too long, and some people might not read the whole thing.
There aren't any results that could change, because there were never any results to begin with. There is no 97%. Or 99% or 82% or 10%. There's no percent, no numbers to evaluate. Normally we'd vacate the paper for fraud, which I assume will ultimately happen here, unless fraud is defined differently at ERL. Even absent fraud, there is no procedure known to science that would enable us to generate meaningful results here. Taking any number from their study is just roulette – the numbers won't be connected to any underlying reality.
The results of a scientific study only have meaning by reference to the methods by which they were obtained. Results cannot be divorced from methods. They falsely described their methods, on almost every substantive feature. This means we no longer have results to speak of – we don't know what we have, because we don't know how the results were obtained. We have a vague sense, and we know that raters were never blind or independent, and that such features could not even be enforced given their at-home procedure and bizarre online forum. We know that we could never do anything with a subjective rating study that employed raters who had a profound conflict of interest with respect to the results. This was a subjective human rater study – the results were entirely in the hands of the raters, like the one we see up top, who were political activists rating abstracts on a dimension that would serve their political aims. We'd have to ignore decades of scientific research on cognitive biases, motivated reasoning, implicit and non-conscious biases, and our basic grasp of conflict-of-interest in order to take this seriously. I think a lot of people would assume this was a prank, maybe like Sokal's – a test to see whether a scientific journal would publish a consensus study based on lay political activists divining the meaning of scientific abstracts for their cause. No one has ever done this.
There are no results here for several other reasons. If you run a casual search on "global warming" and just start counting papers, deciding whether they endorse AGW, even implicitly, that's not anything. It's not any kind of measure of a consensus, not by anyone's definition. Especially if you're including a bunch of social science, polls, mitigation papers, psychology, etc. (They said they excluded social science and polls. They're remarkably casual about the importance of the truthfulness of the methods section of a scientific paper. Scientists would not be nearly so casual.) And we can't calculate interrater reliability here because they broke independence, and their ratings are contaminated by other raters' feedback. A subjective rating study that can't generate interrater reliability estimates no longer has usable data. The lack of meaningful results here, the lack of validity, is multifaceted and impossible to resolve.
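For readers who haven't worked with rating studies, here is a minimal sketch of what an interrater reliability estimate looks like, using Cohen's kappa for two raters. The function name, category labels, and sample ratings are made up for illustration and are not taken from the Cook data. Kappa corrects raw agreement for the agreement you'd expect by chance, and that correction is only interpretable if the two raters worked independently, which is why contaminated ratings leave nothing meaningful to compute.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who each rated the same items.

    Assumes each rater assigned exactly one category per item and that
    the two raters did not see or discuss each other's ratings.
    """
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both raters must rate the same non-empty set of items")

    n = len(ratings_a)

    # Observed agreement: share of items where both raters chose the same category.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: product of each rater's marginal proportions, per category.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Degenerate case: both raters used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical ratings of six abstracts by two raters.
rater_1 = ["endorse", "endorse", "neutral", "neutral", "reject", "endorse"]
rater_2 = ["endorse", "neutral", "neutral", "neutral", "reject", "endorse"]
print(cohens_kappa(rater_1, rater_2))
```

A kappa near 1 indicates agreement well beyond chance and a kappa near 0 indicates agreement no better than chance. If raters compared notes before or during rating, a high value would only reflect that they talked to each other.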
One reason this isn't a measure of consensus is that you're giving some people far more votes than others. Each paper is a vote by this method. There are many obvious biases and weights this will impose, even assuming a neutral or valid search. Old men will get a lot more votes than others, and there will be many other large issues that real researchers would think through and account for.
But the search is invalid too – it gives unacceptable results and was apparently never validated or tested. We could never go with a search that excludes everything published by Dick Lindzen since 1997. You just took away dozens of his votes, for unknown, arbitrary reasons. That discovery would stop researchers in their tracks – they would test and dig into the different results of different searches, and figure out what was happening, and what terms were needed to sweep up all the relevant papers. A study of consensus based on a literature search obviously depends critically on the validity and completeness of that search – if the search is bad, it's over. Knowing what we know about this study and the search, there's nothing we can do with the data. Humans have an anchoring bias, and there's a small epistemological dysfunction in some quarters of science where people think that because something is published, or because it was peer-reviewed, it must be somewhat valid or have some kind of legitimacy (social psychologists call this "social proof".) That's not a survivable claim, and it's important to not let those sorts of biases affect your judgment of published studies. There's nothing we can do with this study.
Someone would have to reconduct the study, using valid methods and avoiding fraud, to generate any numbers we could use. There's just no way to get meaningful numbers in this case. And a reconducted study would have to be a very different study -- it wouldn't actually be a repeat of this design. The set of papers would be very different, given a validated and rigorous search procedure. It would have none of the invalid stuff, and it would include a lot of climate science papers that they excluded. I hope that illustrates my central point -- there's no way to generate results from this study, because the underlying data is invalid in multiple critical ways. Taking a number from the Cook study -- any number -- is just rolling dice. The number won't have any connection to the reality the study was supposed to measure.
It's not clear that we can study a consensus by counting papers. We certainly can't just count papers without any weights, even if we have a valid search. Before we get there, we'd need some kind of theory of consensus, some reason to refer to literature as opposed to careful surveys of climate scientists, some specific hypotheses about what the literature gives us that surveys do not. There are a few possibilities there, but we'd need to think it through. A reasonable theory of literature-based consensus is probably not going to be temporally arbitrary – you won't weigh papers from 1994 as heavily as those from 2009. You won't count everyone's papers additively – you're not going to want to give Dick Lindzen 140 votes and some young buck 4 votes, based simply on paper count. You'll probably decide, based on theory, on a variable weighting system, where someone's 5th through 15th papers on attribution count for the most, but after 15 maybe you taper down the weights, limiting the impact of incumbency. You'd probably think about giving weight to heavily cited papers. You would not include most of the papers Cook et al. included, like mitigation and impacts papers, for reasons detailed in the report (unless you could argue that impacts papers, say, carried epistemic information about human forcing. In that case, it would probably only be certain kinds of impacts research.)
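To make the weighting idea concrete, a rough sketch might look like the following. Every number in it is a placeholder assumption rather than something derived from a theory of consensus: the 0.5 weight for an author's first four papers, the taper of 0.1 per paper after the 15th, the floor of 0.1, and the 10-year half-life for recency. The function names and the sample papers are hypothetical too. The only point is to show how far a weighted share can drift from a raw paper count.

```python
def author_rank_weight(rank):
    """Weight for an author's rank-th paper on attribution (1-based)."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    if rank < 5:
        return 0.5  # assumed: an author's first few papers count for less
    if rank <= 15:
        return 1.0  # the 5th through 15th papers count for the most
    # Assumed taper: lose 0.1 per paper after the 15th, with a floor of 0.1.
    return max(0.1, 1.0 - 0.1 * (rank - 15))


def recency_weight(year, latest_year=2009, half_life_years=10.0):
    """Halve a paper's weight for every half_life_years of age (assumed decay)."""
    age = max(0, latest_year - year)
    return 0.5 ** (age / half_life_years)


def weighted_endorsement_share(papers):
    """papers: iterable of (author_rank, year, endorses) tuples."""
    total = 0.0
    endorsing = 0.0
    for rank, year, endorses in papers:
        weight = author_rank_weight(rank) * recency_weight(year)
        total += weight
        if endorses:
            endorsing += weight
    if total == 0.0:
        raise ValueError("no papers to weigh")
    return endorsing / total


# Hypothetical papers: (author's paper number, publication year, endorses?)
papers = [
    (1, 1994, True),    # an early paper by a new author
    (10, 2009, False),  # a recent paper by an established author
    (40, 2005, True),   # the 40th paper by a prolific incumbent
]
raw_share = sum(1 for _, _, endorses in papers if endorses) / len(papers)
print(raw_share, weighted_endorsement_share(papers))
```

Citation weighting could be folded in as another multiplicative factor, but choosing that factor would need the same kind of theoretical justification as the others.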
A paper-counting method is also extremely vulnerable to historical lag and anchoring effects. They went back to 1991, and there are an alarming number of papers from the early 1990s in their set. This sort of design will not detect recent advances that move the consensus in one direction or another. If we had a breakthrough in estimates of equilibrium climate sensitivity, something that changed the game, something that made scientists a lot more confident in the stability of the new estimates of ECS, ending all these years of constantly revised estimates, this breakthrough might manifest in three or seven papers over the last two years. If it were a significant downward or upward revision, it wouldn't be captured in the bulk of the literature back to 1991. The method is also vulnerable to Least Publishable Units practices and publication bias. The publication bias argument is popular with skeptics – it's an easy argument to make and a hard one to prove, which essentially leaves me with nothing much to say about it at this point.
There are lots of other issues you'd have to think about, but even with rigor and qualified, unbiased raters who are blind and independent, in controlled environments like a lab, and only including climate science or attribution research, this could still all be useless and undiagnostic. We'd want to think about the difference between measuring consensus and measuring reality – the latter is what a meta-analysis does, in a sense. It aggregates findings, with strict methodological inclusion criteria, in an attempt to come to a credible conclusion about true effects (and effect sizes.) It doesn't measure consensus. People in most sciences generally don't talk about consensus much – look at the literature. For example, no one in my field is trying to get anybody to come to consensus or hold a bake sale or anything like that. Cook and company's claims about science as such being driven by consensus are strange – you'd probably have to do some serious empirical work to support such a sweeping positive empirical claim about so many fields. Consensus is unlikely to be a universally reliable, portable heuristic for reality across domains and arbitrary timepoints.
You'd have to really dig in, and have an account for whether consensus will be epistemically diagnostic or reliable for climate science in particular. That's a very hard question. You'd want to consider that dissent in general will cost a person much more than assent or conformity, and there's some reason to believe that dissent in climate science is extraordinarily costly. You'd have to think about how you could account for this in any estimates of consensus, what weighting might be appropriate, whether the dissent in climate science represents ever-present base rates of less competent researchers or contrarian personality traits, or whether it represents something else entirely. For one thing, you'd probably be curious about the dissenters, and might want to talk to them and learn more about their reasoning.
Some scientific truths won't be amenable to discovery by consensus-measurement, certainly not by the methods they used. Sometimes the only way to know something is to know it. Some problems in science might only be understood by 30 people in the world. Or 9. I don't know if that's true of any climate science problems, where the key issues are probably ECS and TCS. Some issues need very careful and advanced reasoning, real substantive engagement. Validity issues are like this. I've never dug into the controversy over climate models and their validity. That's a good example. A scientific method can be completely invalid, though this is probably less likely as time goes on. It's easier to do invalid science than valid science. I doubt the models are invalid, but if there is something fundamentally wrong with how climate models are used by climate scientists, and this leads to a systematic error in their estimates, then the consensus won't matter. If a field makes a fundamental, pervasive validity error, then it will just be wrong. This can definitely happen, though I think social science is much more vulnerable to this than climate science.
More broadly, I don't think people have a full account of all the ways we can be wrong. We can be wrong in our fundamental frameworks, in all sorts of ways that we're not even trained to think about. I doubt that climate science is wrong, but these are the kinds of issues that you'd have to think about before giving much weight to a consensus. Consensus is not necessarily important or interesting – you have to make it so. And you definitely have to develop a valid way of measuring it. There might be better tools than consensus-measurement for public assimilation of scientific realities. For example, well-structured debates might be more informative and valid than consensus. I think there's some emerging research there.
It's alarming that they were able to do this, and get published. The good news, such as it is, is that their false claims were in the area code of reality, as far as anyone knows. Valid studies find consensus in the 80s or maybe 90% (Bray and von Storch are good, as is the AMS report.) Some questions yield much lower estimates, but in general it's up there. 97% is inflated, and in this case the number had no real meaning anyway, just biased paper-counting from an invalid starting set. The inflated values might make a difference in policy choices, which is disturbing. We can't be so risky with science. It horrifies me when people shrug off such malpractice because it's only 10 or 15 percentage points off of reality. The danger of tolerating malpractice because the results are sort of close to accurate seems obvious, and the consequences of that kind of tolerance would likely extend well beyond climate consensus research.