Has Science Journalism Helped Unmask a “Replication Crisis” in Biomedicine?

By Philip Kitcher

November 28, 2019

Rigor Mortis by Richard Harris
Fraud in the Lab by Nicolas Chevassus-au-Louis

In 1995, Oxford University established a new kind of professorship. Recognizing the importance of scientific research in the modern world, Charles Simonyi endowed a Chair for “the Public Understanding of Science.” His aim, explicitly stated, was to “communicate science to the layman,” and also to “convey to non-scientists some of the excitement” of scientific research. It would be hard to quarrel with the choice of the first Simonyi Professor, Richard Dawkins, since he belongs to a class of people (including Carl Sagan, E. O. Wilson, Stephen Jay Gould, Olivia Judson, and Brian Greene) who have made complex ideas come alive for hundreds of thousands of people.

Yet no small group of scientists can do all the necessary work of informing the public. Many lines of investigation, bearing on human lives and public policies, proliferate results at a rapid pace. It usually falls to journalists to help citizens appreciate issues on which political decisions might turn. Their work, however, is doubly difficult: the science they report on can be fiendishly technical, and they must perforce find ways to attract and retain the public’s attention.

Given this difficulty, it is hardly surprising that some news media simply abandon the project. Except in special instances, science is considered too dull to warrant space. For those venues that do recognize the importance of a scientifically informed citizenry, the temptation is to fall into the trap of using a narrative formula: embellish the story with lavish details of “personal interest”; portray novel research as an exciting contest between radical innovators and conservative dissenters. Even venues as respected as The New York Times are not immune. Sometimes, the tendency to jazz up science reporting proves disastrous. A case in point: For at least 15 years after the climate science community had achieved consensus on global warming caused by human activities, the Times continued to treat the question as a matter of debate.

Charles Simonyi foresaw the danger. A part of his brief for the Professorship warned against “oversimplifying ideas and presenting exaggerated claims.”

¤


During the past eight years, many astute people, inside and outside the scientific community, have worried about the quality of scientific research. They warn of a “replication crisis.” In biomedicine and psychology in particular, it seems that a high proportion of published results cannot be reproduced. The absolute number of retractions of articles in these fields is rising. Whether or not it is right to talk of crisis, it is certainly reasonable to be concerned. What is going on?

Explanations typically fall into three categories. One possibility is that contemporary science, at least in some domains, is full of corrupt and dishonest people who routinely commit fraud, making up data for experiments that were never performed, or misreporting the results they have actually found, or tweaking their graphs and prettifying their images, and so on. In short, these fraudsters intentionally attempt to deceive their colleagues and, ultimately, members of the broader public. A second possibility is that incompetence or sloppiness is at play. As in Nick Carraway’s verdict on Tom and Daisy Buchanan, biomedical and psychological researchers are quite simply careless people who make a mess for others to clear up as best they can. And the third possibility: Neither fraud nor lack of rigor is responsible for the problem. Investigating some kinds of scientific questions may simply be devilishly difficult, sensitive to myriad factors that are hard for scientists to survey and control. In this case, the difficulties of replication represent the growing pains of an area of research as it struggles to achieve stable and reliable findings.

What exactly is known on the subject? In the past century, several famous cases of scientific fraud have been meticulously exposed. Similarly, there are established instances of investigators failing to conduct experiments according to the standards of their fields or of using the wrong statistical tools to analyze their findings. But it is far from obvious that fraud or sloppiness lies behind most cases in which results prove difficult to reproduce. In fact, most scientists can report how, despite admirably conscientious procedures, they themselves have sometimes been unable to replicate experimental results they had obtained in one place or at one time. Relatedly, the tacit or unconscious knowledge of the laboratory investigator can have an impossible-to-discern impact on results. Recognizing the role of this tacit knowledge is one of the great achievements of recent sociological studies of science. However carefully a given researcher tries to describe how she had performed an experiment, the “methods” section of the published article will inevitably omit certain details. Indeed, she may be quite unaware of the tiny, but consequential, features of her laboratory practice that are crucial to the — repeatable — result she has found.

This point is worth further emphasis. It is actually part of almost everyone’s experience: few of us pass through high school science classes without, at some stage, failing to set up and run an experiment appropriately. Similarly, beginner cooks frequently can’t make a recipe work. And novice gardeners may over- or under-water. Most of us can’t assemble furniture from the parts delivered in the box without experiencing some frustration. It’s hardly surprising, then, that everyday difficulties are magnified when scientific investigation is at its frontiers and the experimental work envisaged outruns established conventions. In much biomedical and psychological research, investigators struggle for months and years to obtain acceptable data. Findings obtained on one occasion or in one sample may be at odds with those delivered by others. Only after much adjusting and tinkering do researchers finally arrive at a result they take to be stable. When others then try to repeat what has been done, the would-be replicators sometimes do not invest the time required to generate that same stability. Indeed, even when the original investigators themselves later attempt to redo the experiment, they more often than not have lost the skills they had built up in the initial long process of modification and tweaking. Like a tennis player who returns to the courts after a significant absence, they are rusty.

All three varieties of explanation described above undoubtedly play a role in accounting for failures of replication. The challenge, however, is to identify the relative frequency of each type. How often is the scientist blameworthy, either for carelessness or deliberate dishonesty? How often does the trouble arise from the intrinsic difficulty of the problems under investigation? Nobody knows. Nor is it easy to see how to remedy our ignorance. Designing a research program for investigating the incidence of fraud and carelessness is enormously difficult. Indeed, the very doubts described above apply equally to the social-scientific assumptions used in deciding what qualifies as trustworthy data and to the conclusions drawn from whatever findings can be validated.

Acknowledging ignorance might point us toward a more modest agenda. We might try to impose checks on bad (fraudulent, sloppy) scientists, but without inflicting undue burdens on their upstanding counterparts. The wiser and cooler heads have indeed committed themselves to this modest enterprise. Not all of those who have described the trend, however, are cool and wise.

¤


Richard Harris is the science correspondent for one of the United States’ most admirable news sources, National Public Radio. His book, Rigor Mortis, comes with glowing testimonials from reviewers. His subtitle also glows: How SLOPPY SCIENCE Creates WORTHLESS CURES, CRUSHES HOPE, and WASTES BILLIONS (capitals and boldface in the original). The principal value of the book consists in its assembly of examples. Harris has talked, each time for “at least an hour,” to an impressively large number of people who have expressed their concerns about the current state of biomedical research. His informants earnestly relate their own experiences of particular cases in which published findings turned out to be unreliable. The overall effect is that of an exceptionally anguish-ridden Greek chorus, lamenting the fate of their community.

Skilled in techniques of narration honed over more than three decades of science journalism, Harris knows how to leaven his technical discussions with sprinkles of human interest. He introduces his readers to afflicted patients, allowing us to sympathize — and perhaps become enraged — when biomedicine lets them down. He describes researchers’ working environments and adds small personal details. In this way, as the individual episodes pile up, a spoonful of sweetener offsets the sour taste occasioned by reports of misbegotten molecules and disappointing drugs. Periodically, the stories are punctuated by slogans of righteous protest, or, less often, by brief proposals for reform. The dominant effort of Rigor Mortis, however, is to amass tales of woe. To the extent that diagnosis and treatment recommendations figure in the book, they come in higgledy-piggledy fashion. Probing analysis is in short supply.

Harris’s formulaic stories substitute for any attempt to remedy current ignorance about the causes of the “replication crisis.” He simply assumes that carelessness is widespread (although he takes fraud to be only occasional). Sloppy science is his target tout court. Without presenting a careful argument as to why, he just thinks there is a lot of it about.

But, we should ask, what is “sloppy science” and how should we characterize its rigorous counterpart? Rigor Mortis offers two answers to this question. The first, calculated to make anyone with even a casual acquaintance with the history of science wince, appeals to a general notion of scientific method. The method, allegedly invented by Francis Bacon, is the familiar myth propounded in numerous high school textbooks: hypotheses are framed, tested, revised, and ultimately integrated into broader bodies of theory. Not quite Bacon’s resolute inductivism — nor, indeed, readily identifiable with any 17th-century thinker. Moreover, if this is to be the standard against which research is to be judged as “rigorous” or “sloppy,” it is far too vague to allow any definitive verdict, as many 17th-century critics knew all too well. Harris’s sloppiness about methodology thus leads him to a criterion that is completely impotent to expose the flaws of “sloppy research.”

Fortunately, his second proposal, implicit in some of his discussions, although never formulated clearly and precisely, fares slightly better. A list of questions offered by C. Glenn Begley, an eminent scientist who called attention to difficulties in replicating experiments, yields more definite methodological criteria. More definite, but, as we shall see, not as easily applied as Harris appears to think.

It is important to keep in mind here that different fields of science have accumulated distinctive ways of elaborating the vague ideas, often mutually discordant, advanced by the pioneers of early modern science (Bacon, Descartes, Galileo, Boyle, Newton, and the like). The genius of those pioneers was to use imprecise imperatives — “Apply mathematics!”, “Collect data!” — to explore various aspects of nature. Their cloudy versions of “method” condensed into studies of motion, light, air pressure, and so forth. Building on their early successes, emerging communities of physical scientists then formulated more precise guidelines for further investigations. In learning about aspects of nature, they learned how to learn about related problems (e.g., the experiments performed by Torricelli and Pascal led to Boyle’s air pump). Now, more than four centuries later, the groups of scientists who work in different fields have inherited the results of a virtuous spiral. Across physics, chemistry, biology, earth and atmospheric sciences, neuroscience, psychology, and more, achieving stable reliable results has also delivered methods (plural). Distinct fields and subfields have developed their individual techniques for making probative observations and assessing the evidential impact of data. Undergraduate students start to learn some of these techniques in their “methods” courses, tailored to a particular kind of research. The methods learned by the beginning physicist are different from those needed by the novice geneticist or those taught to the aspiring neuropsychologist. Those who continue to a career in research undergo a further long apprenticeship, acquiring specialized skills needed for the intricate experiments they will design and carry out. Contemporary researchers stand at the end of a long historical process in which the vague “scientific method” inchoately present in the thoughts of 17th-century heroes has been crystallized in successful investigations — i.e., in inquiries generating reliable results — from which specific standards and guidelines have been drawn.

Harris’s advice “to teach scientists how to design experiments properly” recognizes the body of lore against which particular lines of research can be assessed as “rigorous” or “sloppy.” His informants describe occasions when the standards have not been met — and gesture toward training in “the basics of epistemology or logic.” That is to point in the wrong direction. The moral of such failures: even years of learning the techniques that have generated past successes does not demonstrate unambiguously how those techniques can be extended to new problems. Inevitably, the researcher who aims at genuine novelty must creatively apply the methods and skills amassed in the history of the field. The advice Harris offers is fatuous. To be sure, sometimes specific troubles can be foreseen, and the researcher forewarned: be careful to use the appropriate statistical tools in your analyses; think about the number of times you should repeat an experiment before announcing that you have a stable result; watch out for the possibility that your samples have been contaminated. Judging that a scientist has carelessly neglected these admonitions can be justified. But there’s no general algorithm for designing experiments to guard them against failure.

Even the warnings I’ve envisaged have to be interpreted in the context of the research in question. Harris offers an apparently sensible suggestion: researchers should “ship a sample of their cells off to a commercial testing lab” before they run their experiments. Doing so will ensure, he claims, that “the cells are what they expect.” On occasion, this preliminary check might be a good idea. A researcher might suspect her cell lines have been mixed or contaminated. But should the practice be generally applied? Would “commercial testing labs” be able to cope with the flood of arriving samples? Would they be able to generate reliable reassurances (or reliable warnings)? In a timely fashion? At reasonable cost? The proposal smacks of Descartes’s notorious strategy of doubting everything — and, as the philosopher-scientist Charles Sanders Peirce observed, the strategy of everyday inquiry is to respond to genuine doubt: to scratch where it itches. A better strategy for guarding against contamination is to ensure the insulation of samples from environments in which known disruptive factors might intrude. Of course, that will fail to secure the experimental materials against unknown sources of contamination. How to pull that off is something neither Harris nor his sources explain. One might even argue, in the spirit of neurobiologist Stuart Firestein’s celebration of scientific failure (an attitude Harris sternly opposes), that failed experiments sometimes advance science by teaching researchers about new possibilities of contamination. To recognize that point is not, of course, to excuse cases in which investigators are cavalier in allowing known contaminants to intrude.

Rigor Mortis is a passionate book, rightly moved by the plight of people who suffer and whose hopes for biomedical relief have been dashed. It inverts a rosy story. For the past few decades, the public has been told of the wonders of new breakthroughs in molecular medicine. Eminent researchers have offered soaring rhetoric and thus raised false expectations: mapping and sequencing the human genome will fathom the nature of humanity, teach us who we are, crack the code of codes, and deliver cures and treatments galore. Phrases like these were useful in obtaining funds for what has proved a highly successful scientific enterprise. Yet, from the very beginning, commentators pointed out that the project was oversold. The likely immediate benefits would be much-improved diagnostic tools for a large class of human diseases. [1] Genomics (as it came to be called) would also advance biological understanding, paving the way for eventual treatments and cures. The predictions were largely correct. Yet some of the biological advances were more significant than anticipated. None of us foresaw CRISPR.

Harris’s dismal picture rightly emphasizes that the promised cornucopia of cures hasn’t yet been delivered. But his portrayal is an overreaction to inflated rhetoric. Despite its failure to meet unrealistic expectations, biomedicine has in fact improved the lives of many people. The boldface of his subtitle testifies to his book’s thoughtlessness: he substitutes impassioned alarmism for serious analysis, and he deploys sloppy science journalism to rail at supposed widespread scientific sloppiness. His poorly thought-out proposals threaten to handicap the biomedical research community. But should Rigor Mortis be read more generously? One might explain its failures in accordance with the style of explanation Harris withholds from that community. The fact is that figuring out just how widespread unrigorous research is, and how to make it less common, is extraordinarily difficult. Moved by the plight of the sufferers, a well-meaning whistleblower might indeed overlook the complexities of the problem — and thus flounder.

¤


A calmer analyst might choose a less ambitious, more manageable enterprise. Focus on one type of scientific dereliction: laboratory fraud. Explore the various types of scientific dishonesty that arise, and consider how they have been detected. Use this as a basis to suggest modifications of scientific practice that might help to diminish the frequency of fraud and to identify it when it occurs. Try to frame proposals for reform in ways that don’t interfere with the pursuit of research by honest and conscientious investigators. In short: Without speculating grandly about the causes of the “replication crisis,” attempt to advance the modest venture in a specific way.

This is the enterprise undertaken by Nicolas Chevassus-au-Louis in Fraud in the Lab. (The book was originally published in French in 2016; the translation by Nicholas Elliott appears this year.) Chevassus-au-Louis is a journalist with some experience of laboratory work, and, like Harris, he has conducted interviews, drawing from them to tell tales of scientific misconduct. His store of stories is considerably less full. Yet the individual episodes he describes are carefully chosen to illustrate or defend a point. Fraud in the Lab has an analytic structure that builds a patient case.

Chevassus-au-Louis sees himself as engaged in nosography, mapping different kinds of disease, specifically the kinds afflicting unscrupulous scientists. From the very beginning, he frames his quest: the first Figure in the book depicts the increasing incidence of retractions of journal articles in the fields of biomedicine, overall and for suspected or confirmed fraud. Although he is concerned that many instances of scientific fraud go unrecognized, he seems to appreciate the potential fallibility of the sociological research from which estimates of the extent of the problem are derived. He explains how some researchers become “serial cheaters,” apparently addicted to inflating their reputations by publishing articles based on made-up data. He explores the ways in which digital technology has enabled scientists to “tinker with images.” In an interesting discussion of psychological experiments, he shows how two experimental psychologists were able to detect a striking overrepresentation of studies that just marginally met the conventional threshold for statistical significance. His strategy is not to amass examples to swell a woeful chorus, but to use well-chosen instances to make precise points about particular pathologies.
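For readers curious about what such a check involves, here is a minimal sketch, in Python, of the general idea: count how many reported results fall just below the conventional p < .05 threshold versus just above it, and ask whether the imbalance is itself statistically surprising. The p-values are invented for illustration, and the sketch is not a reconstruction of the psychologists’ actual analysis.

    # Illustrative only: invented p-values, not data from the book.
    from scipy.stats import binomtest

    reported_p_values = [0.048, 0.049, 0.047, 0.044, 0.051, 0.046, 0.043,
                         0.052, 0.049, 0.045, 0.041, 0.050, 0.048, 0.046]

    # Count results landing just below versus just above the .05 threshold.
    just_below = sum(1 for p in reported_p_values if 0.040 <= p < 0.050)
    just_above = sum(1 for p in reported_p_values if 0.050 <= p < 0.060)

    # If the threshold exerted no pull, results within this narrow window
    # should fall on either side of .05 at roughly equal rates.
    result = binomtest(just_below, just_below + just_above, p=0.5)
    print(f"just below: {just_below}, just above: {just_above}, "
          f"binomial p = {result.pvalue:.3f}")

A lopsided count of this sort is only a hint, of course; the analyses Chevassus-au-Louis reports are more sophisticated, but they turn on the same intuition.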

As the sequence of chapters unfolds, Chevassus-au-Louis elicits the features that will motivate his concluding proposals. The dodgy characters who figure in his tales are inclined to hyperproduction: their laboratories pour out new articles at an extraordinary rate. Caught up in stringent competition for research funding, they usually fear their work being scooped. Sometimes they may even be tempted to contribute to shady journals and conferences, largely emanating from East Asia, or to deploy software to assemble random combinations of words (many of them technical terms) into potential articles. Among the most horrifying — but also extremely funny — pages in the book are those presenting examples of randomly generated articles and of “contributions” to journals that will, apparently, accept anything for a fee. Simply repeating “Get me off your [expletive deleted] e-mail list” a few dozen times can earn the resumé-builder yet another publication, assuming that he is willing to pay for the privilege. Abstracts allegedly co-authored with long-dead authors and full of incongruous juxtapositions of technical terms may well win you a spot on a conference program.

How can we limit the frequency of the pathologies Chevassus-au-Louis identifies? His own proposals, advanced in the concluding chapter, are charmingly conceived by analogy to the “Slow Food” movement. Chevassus-au-Louis wants to relax the competition among scientists, and to emphasize careful production of significant research. In his view, scientists should collaborate, share data, not worry about the “impact factors” of the journals in which they publish, and be supported and assessed for promotion not on the basis of numerical measures — substitutes for weighing their recent publications on a scale — but on the quality of a small number of pieces of research. As he points out, the publication of raw data has been standard practice in theoretical physics for decades, and data-sharing is becoming increasingly common among climatologists. Yet, as he acknowledges, high levels of cooperation are most likely in domains where the potential for financial reward is lower. Contemporary biomedicine is plainly subject to the lure of the megadollar.

The extent to which biomedical researchers seek economic rewards, rather than academic credit, scientific prestige, or even the improvement of human lives, is, I strongly suspect, not only unknown but extremely hard to determine. One way to encourage cooperation within scientific communities would be to replace the priority rule — the practice of awarding full credit to the first person (or team) to solve an important problem and nothing to also-rans. Recent work in formal social epistemology has recognized both the potential benefits and disadvantages of letting the winner take all. On the one hand, the current competitive environment may generate a welcome diversity of approaches: if I have started to explore a particular line of research, and you are about to begin, your best bet for gaining the laurels may well be to diverge from the approach I have chosen. On the other hand, as both Chevassus-au-Louis and Harris observe, intense competition encourages investigators to cut corners. How do the advantages and downsides of the priority rule trade off in practice?

Nobody knows. Slow science would be an experiment. One thing is relatively clear: anyone who has put in hours on a committee that awards research grants or evaluates scientists for academic promotions will be sympathetic to assessing candidates on the basis of a small number of pieces of work, the best research they have done. Impact factors and h-indices are crude and often misleading ways of distributing rewards. The proposals Chevassus-au-Louis offers are thus worth trying out. In implementing them we might discover unanticipated difficulties. If so, the right response would likely be to modify and correct, to learn from the failures and to seek ways of overcoming them.

¤


Science journalism is crucial to democratic societies, whether it explains the details of new scientific findings or reports on general features of the scientific enterprise. Plato famously thought democracies would end in disaster because the majority of citizens are too unintelligent to think through the issues confronting them. His elitism was wrong. But, as the world has learned, ignorance, often fed by misinformation, can be as toxic as stupidity. Had the message from climate science been clearly enunciated to the public two or three decades ago, our species might well have moved beyond bickering about the reality of anthropogenic global warming. We might now be in the thick of discussions on the hard policy questions that arise in the course of trying to preserve our planet.

Journalism can do much good, but also considerable harm when it lapses. Yet, as I acknowledged, delivering clear messages that capture and retain the attention of lay readers is exceptionally hard. News media and individual journalists are constantly tempted to fall into narrative traps, provide simple slogans, tell catchy stories, add human color, portray research as an exciting horse race — and pretend that issues remain open long after the evidence has closed them. That’s the way to create clickbait, raise newspaper subscriptions, or sell books. As things now stand, science journalism suffers from the same perverse incentives to cut corners that both Chevassus-au-Louis and Harris identify in the social structure of biomedical research. In this case, the corner-cutting consists in not doing anything that might tax the reader. Never analyze. Never present a sustained line of reasoning. Entertainment is everything.

Schools of journalism might try addressing the problem by more actively seeking out students with strong backgrounds in science, offering them rewards for undertaking the training required for writing in ways that are informed, enlightening, and vivid. They might develop and inculcate Slow Science Journalism.

Above all, however, they should hammer home, again and again, Charles Simonyi’s crucial demand: avoid oversimplifying ideas and presenting exaggerated claims.

¤


Philip Kitcher is John Dewey Professor of Philosophy at Columbia University.

¤


[1] See Neil A. Holtzman, Proceed with Caution (Baltimore: Johns Hopkins, 1989); Philip Kitcher, The Lives to Come: The Genetic Revolution and Human Possibilities (New York: Simon & Schuster, 1996).
