Here's an update on the Cook 97% scam. They apparently have no answers.
Cook and some of his crew had a chat event at Reddit. Last I checked, they never addressed any substantive issues in the report, even though various people asked them about it. They don't argue against the fraud reports, the validity issues, nothing. Instead, they tried to attack me, or issued vague filibusters that had no bearing on anything happening here. To my knowledge ERL has yet to issue a statement or offer an explanation for the paper's comprehensively and substantively false description of its methods.
To review, in their paper, they described their method as: "Abstracts were randomly distributed via a web-based system to raters with only the title and abstract visible. All other information such as author names and affiliations, journal and publishing date were hidden. Each abstract was categorized by two independent, anonymized raters."
All three substantive features of their method are false. Raters were not blind to authors (or any of the other info.) Raters were not independent. Raters were not anonymized.
They falsely described their methods. That is a very, very serious thing. There is no science without an accurate description of methods, and this paper, like all papers, was published on the assumption that they followed the methods they described.
Normally the way science works, that's the end. Nothing else needs to be done by anyone. There are no results to evaluate if they didn't follow their methods. Why? Because valid results critically depended on those methods, and when people don't follow their stated methods, we don't know what they did and thus can't rely on the results. Climate science, or its journals, can't be an exception to this basic norm and epistemic requirement of valid science. Why would they be an exception? (This has nothing to do with the truth of AGW or the reality of a consensus -- this is about a fraudulent and invalid study.)
This is not a pedantic issue. This was a subjective rating study where human raters read authored works and decided what they mean. Such a study could never be valid without blindness to the authors of the works they were rating, since knowing who the authors are is a multifaceted source of bias. Nor could it be valid without independent ratings – raters discussing their ratings in an online forum contaminates those ratings, exposing one rater's ratings to the views of other raters, and makes it completely impossible to calculate interrater agreement or reliability from that point forward. We have no valid numbers with respect to agreement – the crude percentages they strangely offered instead of proper interrater reliability coefficients don't mean anything given that the raters weren't independent. If we can't calculate interrater reliability, we don't have a study anymore.
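To illustrate why this matters, here is a minimal sketch (hypothetical ratings, not the study's data) of the difference between raw percent agreement, which the paper reported, and a chance-corrected coefficient like Cohen's kappa. Two raters who both assign the dominant category most of the time will show high raw agreement by chance alone – and of course neither statistic means anything if the raters conferred on their ratings.

```python
# Minimal sketch: raw percent agreement vs. Cohen's kappa for two raters.
# Ratings below are hypothetical, for illustration only.
from collections import Counter

def percent_agreement(a, b):
    """Fraction of items on which the two raters gave the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement: (observed - expected) / (1 - expected)."""
    n = len(a)
    po = percent_agreement(a, b)  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # expected chance agreement from each rater's marginal frequencies
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

# Both raters use "endorse" 9 times out of 10, so they agree often
# by chance alone; raw agreement overstates reliability.
r1 = ["endorse"] * 9 + ["reject"]
r2 = ["endorse"] * 9 + ["neutral"]

print(percent_agreement(r1, r2))  # 0.9 -- looks impressive
print(cohens_kappa(r1, r2))       # ~0.47 -- much weaker once chance is removed
```

The gap between the two numbers is the point: a crude percentage is inflated by skewed category frequencies, which is exactly why the field uses chance-corrected coefficients, and why those coefficients presuppose independent raters.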
Having humans read complex text and decide what it means, what position it's taking on some issue, is a very special kind of research because it enables researchers to create the data to a degree unparalleled in typical science, and the data is the result of subjective human appraisals of text, which is extremely vulnerable to bias and many sources of error; in some cases the task is not even theoretically possible or coherent. Such research demands great care, and won't be valid if things like blindness and independence are not observed. That these people gave themselves a category of implicit endorsement of AGW only exacerbates this. That they excluded papers they interpreted as taking no position on the issue, which were the majority, and calculated a consensus excluding all the non-polar papers, takes this all further from a universe governed by natural laws. They also thought they could just count papers, which is surprising – that papers, not people, are the unit of consensus – after excluding everything they excluded, and given all the obvious weights and biases that would be applied by a paper-counting method (research interest, funding, maleness, age, connections, whiteness, English-speaking, least publishable unit practices, # of graduate students, position as a reviewer, position as an editor, to name a few...)
Moreover, in their online forum, the third author of the paper said: "We have already gone down the path of trying to reach a consensus through the discussions of particular cases. From the start we would never be able to claim that ratings were done by independent, unbiased, or random people anyhow."
Maybe we should quote their methods section again: "Each abstract was categorized by two independent, anonymized raters."
There appears to be no question that they knew, well before submitting the paper, that they had not implemented independent ratings, since as she mentioned, they were discussing particular papers in the forums the whole time. Yet, they still reported in their article that they used independent raters. What is this?
The percentages for rater agreement that they report in the paper are voided because of this -- those weren't independent ratings, so there's no way to measure agreement between them.
Nor can we assume that the fraud was limited to the forums, since the raters worked at home and could just google the titles of the papers to break blindness and see who wrote them. You can't perform this kind of study in such uncontrolled conditions – there's no way to credibly claim blindness here, which is a crucial feature. In any case, they freely revealed authors of papers to each other on the forum, sometimes with malicious mockery if it was someone they had already savaged on their weird website, like climate scientist Richard Lindzen. It's incredible – they exposed the authors of the articles. They did so repeatedly and without censure, lacked any apparent commitment to blindness, and were e-mailing papers to each other, so it would be their burden to show the fraud was limited, if for some reason we cared that it was limited. Since they were e-mailing each other, they violated their claim of anonymized raters – they used their real names in the forum, had each other's e-mails, knew each other, were political activists from the same partisan website.
That last fact also invalidates the study in advance – we can't have political activists rating scientific abstracts on their implications for their political cause. That's an obvious, profound conflict of interest, and empowers people to deliver exactly the results they desire. I want to stress that this is unprecedented. No one ever does this. It's too absurd and invalid on its face. Studies based on subjective human raters are a small fraction of all social science, and many researchers will never need such a design, but using subjective raters who desire a certain outcome and are in a position to deliver that outcome is simply not an option.
For the defenders of this study, I think the absurdity of the design would be obvious in any other domain – e.g. a group of Mormon activists reading scientific abstracts and deciding what they mean concerning the effects of gay marriage, much less falsely claiming blindness to authors, falsely claiming to have used independent ratings, etc. If ERL or IOP want to argue that it's fine to use subjective raters who have an explicit, known-in-advance conflict of interest with respect to the outcome of the study, and thus their ratings, and who we can see in the forums gleefully anticipating their results in advance, bragging to each other that they've hit 100 abstracts without a single "rejection" of AGW, further biasing and contaminating others' ratings, posting articles and op-eds smearing "deniers" while they rated abstracts on the issue, and the leader of this rodeo telling the raters that this study was especially needed after another study found an unacceptably low consensus, again, while they were conducting ratings -- well, I'd really like to see that argument. It would break new theoretical ground. I'd normally say they need to talk to some experts in the subjective rating methodological literature, but in this case, you really don't.
(I'm not aware of any study that has ever used laypeople to read and interpret scientific abstracts. This method is remarkable, since it seems implausible that laypeople would consistently understand specialized scientific abstracts. As scientists, we won't understand every abstract from our own fields, much less other fields. Some of the explicit violations of the claim of independent ratings in the forum were cases where raters didn't understand what an abstract was saying, which is easily predictable. Then of course others offer their interpretations, and the rater's subsequent rating is essentially someone else's rating. There is some research on bias and interrater reliability where scientists or doctors rated abstracts in their fields, as for a conference, as well as research on reviewer agreement in peer-reviewed journal submissions, which is historically quite low. Those researchers did not contemplate the idea of laypeople or non-experts rating scientific abstracts. You'd need a lot of training, I think, and it's unclear what the point of such a study would be, for reasons I detail in my report.)
Relatedly, on the reddit forum all I saw were bizarre ad hominem attacks. One of the raters said I had no climate papers. That's goofy. You'd need deep climate science expertise to rate climate science abstracts, but you don't need it to point that out. The Cook paper isn't a climate paper – it's not about cloud feedbacks or aerosols. It's methodologically a social science paper, a subjective rating study, which is well within my area code. Calling out the fraud and invalid methods has no more to do with climate science than their paper does.
For example, when someone says "Abstracts were randomly distributed via a web-based system to raters with only the title and abstract visible. All other information such as author names and affiliations, journal and publishing date were hidden. Each abstract was categorized by two independent, anonymized raters."
...you don't need to know anything about climate to read this, read the forums where they disclose the authors of papers and discuss their ratings with each other, refer back to the above text from their Methods section, and observe that they falsely described their methods. It might help to know what blindness means and why it matters for subjective ratings, but that's not hard. Moreover, while I think technical ad hominem is valid in some very restricted cases (like a probabilistic inference that non-experts are unlikely to be able to understand specialized scientific material, such that the burden is on them to show they can), it's never valid to respond to fraud reports by saying "He doesn't have any climate papers." You'd want to respond to the report, to answer it substantively. They did something very serious, and they need to answer for it.
Others said my report needs to be published in a peer-reviewed journal in order to matter. Fraud is not normally debated in journals. I don't want to publish this in a journal (nor do I want any "climate papers".) It might be an example in a future publication, but why kill trees to point out fraud? We can actually see and know reality with our mortal eyes and mortal brains. And since the Cook paper made it through peer review, it's not clear that peer review should carry a lot of epistemic weight for the rational knower, at least in some journals or on some issues. My points are going to be true, false, valid, invalid, or some mix – whether they're published in a journal isn't going to tell us that. You can just read the arguments, consider the evidence, and decide – you don't need other people thinking for you. I could be a welder from Wyoming who just got out for good behavior – it would be remarkable for such a person to blow the whistle on a journal article, maybe worthy of a Lifetime movie, but it wouldn't change a damn thing. It wouldn't change anything about the Cook study, or the validity of my points. Nothing about me alters those realities. Journal publication might not even be a better-than-chance heuristic in such cases. This "consensus" epistemology is taken too far – people forget that consensus is only a heuristic, of variable and context-dependent reliability, not a window into reality.
In any case, I reported it to ERL and IOP, and they should be able to handle it without my needing to write it up for a journal. The authors need to answer some very simple questions. There need to be answers. Without some sort of miracle redemption, some way of the fraud not being fraud, it should be retracted. I don't know if this is a wild and crazy idea to some people, but science would mean nothing if we can falsely describe our methods (and use methods guaranteed to produce a desired outcome.)
Cook responded to questioners by saying "The common thread in all criticisms of our consensus paper is that the 97.1% consensus that we measured from abstracts is biased or inaccurate in some way. Every one of these criticisms fails to address the fact that the authors of those climate papers independently provided 97.2% consensus. This is clear evidence that attacks on our paper are not made in good faith."
This is bizarre. They may be counting on people not reading my report. I explicitly address this issue in the report, so his claim is false. Here are a few reasons why we don't care about the author self-ratings anymore:
1) We learned that Cook included psychology, social science, public survey, and engineering papers in their "consensus". This after they explicitly said in their paper that social science and surveys of people's views were not included as climate papers. Chalk up another false claim. But, since they included a bunch of invalid papers, this means their author survey included the authors of those papers. This in turn means we can no longer speak of the authors' self-ratings percentages – those figures have no meaning anymore, given that we don't know how many were from psychologists, sociologists, pollsters, and engineers.
2) I pointed out in the report that their counting method is invalid because they count mitigation and impacts papers that have no obvious disconfirming counterparts. For example, if an engineering paper counts as endorsement because it mentions climate on its way to discussing an engineering project, how would an engineering paper count as rejection? By not mentioning climate? If a paper about TV coverage of climate news counts as endorsement (in contradiction of their stated criteria), what sort of study of TV coverage would count as rejection? One that doesn't mention climate? An analysis of Taco Bell commercials? Where's the opportunity for disconfirmation? There's no natural rejection counterpart to such categorization (it won't matter if you find a mitigation paper that they counted as rejection -- this is about the systematic bias, and the endorsements will be far greater than the rejections here as a result.) This all means we won't care about the authors' self-ratings, because of this systematic selection bias in the method (anything that biases the selection of articles biases the subsequent pool of authors rating those articles.)
I also pointed out that we can't validly measure consensus by excluding the vast majority of actual climate science papers that do not take polar positions of endorsement or rejection, which is what they did. Consensus cannot exclude neutrality. We can't assume that neutrality represents a consensus, as they do. And we probably can't count papers to begin with. This is all in the report.
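The arithmetic behind that point fits in a few lines. Using hypothetical round numbers (not the paper's actual counts), the headline percentage depends entirely on whether the neutral, no-position abstracts – the majority of the sample – are counted in the denominator:

```python
# Illustrative arithmetic with hypothetical round numbers (not the
# paper's actual counts): how excluding "no position" abstracts
# changes the headline consensus percentage.

def consensus_excluding_neutral(endorse, reject, neutral):
    # Percentage among only the abstracts that take a polar position;
    # the neutral majority is dropped from the denominator.
    return endorse / (endorse + reject)

def consensus_over_all(endorse, reject, neutral):
    # Percentage over every abstract in the sample.
    return endorse / (endorse + reject + neutral)

endorse, reject, neutral = 3900, 100, 8000  # hypothetical counts

print(round(100 * consensus_excluding_neutral(endorse, reject, neutral), 1))  # 97.5
print(round(100 * consensus_over_all(endorse, reject, neutral), 1))           # 32.5
```

The same data yields a near-unanimous figure or a minority figure depending on that one denominator choice, which is why the treatment of neutral papers can't be waved away.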
3) Pointing at squirrels is never good when a study has been rebuked for fraud or invalid methods, both of which are the case here. You cannot redeem the malpractice in the first part of the study -- all their false statements about their methods, the invalidity and meaninglessness of their results -- by talking about a completely different part of the study. That's pure evasion. They need to answer for what they did.
4) There are serious questions about their literature search, such as how it excluded everything Dick Lindzen has done since 1997. There are no results without figuring out what's going on with that search, how it excluded all the modern work of a seminal lukewarm climate scientist. You can't just run a lit search based on a couple of terms and then declare that you've got a valid, representative sample of studies. No way. Science can't be so haphazard – you have to try. There's work involved. New methods need to be validated. We can't begin to talk about percentages and numbers without first establishing that our data is valid, that this literature search is valid (and dealing with all the other issues, the first of which is the fraud.) The search issue will interact with point 2 above. Validating the search will require careful thinking, testing, etc. Some of the methodological meta-analysis literature will have guidelines. How to do a valid search for this purpose is a nontrivial issue – nothing that happens after the search matters if the search isn't valid. You have to figure out if there are selection effects, what you're including, what you're excluding, especially with respect to your hypotheses, what happened to Lindzen's papers, and so forth -- you don't just run a search and start rating papers. This is science, not numerology.
(The equivalent would be if some people rented a couple of hot-air balloons for a few days, took pictures of some clouds, wrote it up for a journal submission on cloud cover in North America, and we simply accepted their method without question. Science can't be that sloppy.)
(And only 14% of authors responded to their survey, and they got about two votes each, which highlights the oddity of simply counting papers. I've already pointed out in my report many of the biases and potential weights that will be applied by a simple paper-counting method.)
Cook's statement is also bizarre because of his conclusion. Saying that people who contest the validity of their ratings never address the purportedly similar results of the authors' self-ratings, and that this "is clear evidence that attacks on our paper are not made in good faith" is so strange. Evidence that people are not acting in good faith? Because they don't address some other part of a study? That's such a strange model of human psychology, and of how science works. They seem to have a remarkable immune system or strategy of not engaging substantive criticism, characterizing it as not in good faith, "denial", misinformation, etc. If genuine, it's a remarkable worldview.
Another Cook tactic was to talk about other studies finding similar results. I was dumbfounded. When someone is referring to your false claims about your methods, and your invalid results, the subject under discussion is your study. Nothing else matters. There could be a thousand 97% papers. It wouldn't matter. That there is a consensus does not matter. The issue is the Cook study, its fraudulent claims, its invalid results. Fraud is fraud. We don't redeem fraud by talking about other people's studies. Some of those studies are almost as bad and unusable, like the one-pager, but that's beside the point. The true, meaningful consensus could be 99.997% – that won't matter here. (We also don't need this study to establish a consensus, just as we don't need a particular hockey stick paper to establish anthropogenic warming. We might consider that the case for the consensus likely weakens if this study is included.)
It's incredibly disturbing to repeatedly see this kind of response from people who were published in a scientific journal. Not only did these people have no idea what they were doing, they betray no sign that they're aware of the basic norms of science, the importance of faithfully describing one's methods, why fraud is a serious thing, what it means to say that results are invalid, or how the existence of other people's studies won't intersect with or override the false claims made about one's own study. If their responses are being validated or accepted by ERL or IOP, we've got a much bigger problem. People might be justified in ignoring climate science if climate journals are fine with what Cook did. We wouldn't have any way of knowing that other research was valid, since climate papers would be much harder for us to evaluate. If ERL is fine with fraud, we wouldn't know what to do with any future ERL articles, or with the field as a whole. That's not anywhere we want to be. Some of the confusion or delay might be due to ERL's unfamiliarity with these methods, the need for blindness, independence, interrater reliability, probably a lack of awareness that they falsely described their methods. But those can't be issues anymore – if people don't get it, don't understand why a subjective rater study necessitates that the ratings be blind and independent, don't understand why this study destroyed our ability to calculate interrater reliability, etc. they can just consult some relevant experts in social science.
Let me tell you something else. Some people are saying this paper won't be retracted because the journal is biased, or because it would look bad for Dr. Kammen (the editor), or because he works with the White House, that science is rigged. We need to cut the crap. I don't care who is who or how much power they have. This was fraud. Falsely describing one's substantive methods is fraud. This paper is invalid in ten different ways, but fraud is fraud. They have not answered for it. They apparently have no answers. If the relevant decisionmakers are thinking that they can just ignore this, or issue some PR blather, they need to think long and hard about what the hell we want science to be. They need to think long and hard about the long-term consequences of failing to retract fraud when the evidence of said fraud is so publicly accessible and straightforward. They might think about how far we have to throttle down our brains in order to believe that this study is remotely valid. They might think about what it will do to the reputation of climate science not only that this study was published in a climate science journal, but that it wasn't retracted when it was revealed to be fraudulent and multiply invalid. There are real consequences here, long-term impacts. We can't be this bad at science, this corrupt. We need to stand for things that aren't in a party platform. These people made a complete mockery of the institutions and safeguards that we take for granted. They torched them.
I think it's worth considering what could happen in the long-term if this nonsense were tolerated, if we couldn't get people to act against fraud if the fraud were politically convenient, in combination with all the conventional fraud and malpractice. Cultures and civilizations evolve in countless ways. The meanings and usages of words can change. There's no a priori hard constraint on this process. Science doesn't have to mean something like rational inquiry, or the systematic, reproducible validation of empirical claims or hypotheses, or any of our other definitions (in capsule form.) The word science could end up meaning something very different. For example, in the worst case, it could end up being used only ironically, as a term for scams. As it stands, we probably use the term too broadly -- science is very diverse, and the issues discussed above will be alien to many scientists. (The idea of gathering some political activists to "rate" scientific abstracts is far removed from most scientists' methods, and the type of fraud we see here is very different from prototypical fraud in cellular biology, for example.)
I don't mean that what we've classically known as science would cease to exist -- that would likely require a large asteroid strike. I mean that science could plausibly evolve into a label for a corrupt and privileged subculture whose divinations are no more tied to reality than a day-old religion. In such a case, actual science as we mean it today would be called something else, perhaps interrogo, ρωτήστε, or bluefresh, and the professionals who practiced it would meet high standards -- the standards people expect of scientists today. I'm not saying this is likely -- I don't quite think it is. But I think it's plausible, and I'd caution you against underestimating the degree of change and evolution that human culture and language can experience, even in our lifetimes. I don't mean to make a trivial point about changes in terminology and phonemes. I mean to illustrate that continuing to generate garbage and fraud under the banner of science will despoil that banner. Circling wagons to protect such fraud and garbage is tautologically incompatible with our prior commitments to scientific integrity. In any case, nothing in life is assured -- not prestige, not funding, not an audience. I don't like being pragmatic about this, since I think idealism should dominate here. We like to say we stand on the shoulders of giants -- we might want to think about whether anyone will be able to stand on ours.
José L. Duarte
Social Psychology, Scientific Validity, and Research Methods.