The Replication Crisis in Preclinical Research


     
With growing alarm at the failure to replicate published, influential studies, the life sciences community is turning its attention to the causes and consequences of the replication crisis. How can we improve in vivo research to secure the reliability of our results?

Reproducibility in Life Sciences

While some of the most public replication failures came out of the "softer" sciences, such as psychology or economics, the biological and life science disciplines are hardly immune. After all, results from in-house replication studies by pharmaceutical companies have fueled a considerable part of the public conversation around reproducibility.

In a now-infamous 2012 commentary on reproducibility in preclinical oncology research, researchers from Amgen and the University of Texas MD Anderson Cancer Center reported disturbingly low replication rates from internal experiments at Amgen. Amgen researchers were able to confirm only 11% (6 of 53) of the landmark hematology and oncology papers which had previously been used to justify proposed projects.

The commentary cited a 2011 piece by three Bayer Healthcare researchers which called into question the accuracy of published data on potential drug targets. The researchers reported that, in their own validation efforts, "in almost two-thirds of the projects, there were inconsistencies between published data and in-house data that either considerably prolonged the duration of the target validation process or, in most cases, resulted in termination of the projects." They attributed the high failure rate of Phase II drug candidates, at least in part, to poor statistical analysis, errors in study design, and pressure to selectively report results at earlier stages of preclinical research.

These internal studies remain opaque (we don't know which papers Amgen was trying to replicate, for example), but they inspired a healthy and contentious conversation in the scientific community about improving reliability. One resulting effort is the Reproducibility Project: Cancer Biology, which publicly documents its subjects and progress and releases its analyses for open review.

Yet replication studies like these have continued to call influential results into question, with troubling implications for the state of preclinical research methodology.

Possible Causes of the Replication Crisis

In Nature's survey on contributions to the replication crisis, respondents were asked to identify what they believed to be the most significant factors driving low reproducibility within their field. More than half cited studies with low statistical power, while publication bias and selective reporting were each cited by over 60% of respondents.

While it is possible to frame each of these factors as a problem of study design, poor experimental design was less frequently cited as a distinct cause of replication failure.

Publication Bias in Preclinical Research

Publication bias is a persistent influence on in vivo study design, and the pressure to produce results specifically for publication has a distorting effect on scientific inquiry. Researchers are discouraged from the yeoman's work of research in favor of building "mansions of straw."

The surest tell may be in how scientists have come to describe their work when writing for publication. A 2015 analysis in The BMJ found that, between 1974 and 2014, positive words (such as "robust," "novel," "innovative," and "unprecedented") increased in relative frequency in PubMed abstracts by up to 15,000%.

This bias towards sharing positive, audience-friendly results affects not only the studies we undertake and report, but the available information used to make decisions beyond the lab. One analysis of bias in published studies of antidepressant drugs, for example, found that almost a third of FDA-registered studies on twelve antidepressant agents were not published. These were overwhelmingly the studies with negative outcomes; with only the published literature to go by, a clinician would see that 94% of study outcomes for these drugs were positive, rather than the actual figure of 51%.

We note that a 2005 study of publication bias in publications about publication bias found no likely distortion in that area of inquiry.

Selective Reporting of Experimental Results

In one sense, selective reporting is the internal manifestation of publication bias: data go unreported when they deviate from the goals or expectations of the researcher. This can emerge naturally, via cognitive biases, or through external pressures, such as the incentives created by research funding mechanisms.

Selective reporting bias can be systematically mitigated through experimental design, but few published in vivo studies make explicit note of experimental controls to eliminate bias. A review of 2,671 drug efficacy studies in in vivo disease models found "limited reporting of measures to reduce the risk of bias in a random sample of life sciences publications." Randomization was reported in 24.8% of the sample group, blind outcome assessment in 29.5%, and transparent sample size calculation in 0.7%.

Clear declarations of potential conflicts of interest, such as funding sources, were found in only 11.5% of the sample studies overall but increased from 2.3% to 35.1% between 1992 and 2011. This reassuring trend was mirrored in randomization and blind assessment, but not in declared sample size calculation.
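Randomization and blinded outcome assessment, two of the measures the review looked for, are simple to set up before a study begins. The following is a minimal sketch rather than a validated allocation tool: the function name `allocate`, the group labels, and the four-digit code format are all illustrative assumptions.

```python
import random


def allocate(animal_ids, groups=("control", "treatment"), seed=None):
    """Randomly assign animals to near-equal groups and issue blinding codes.

    Returns (blinded, key). `blinded` maps each animal to an opaque code,
    which is all the outcome assessor sees. `key` maps codes to groups and
    should stay with a third party until assessment is complete.
    """
    rng = random.Random(seed)  # record the seed so the allocation is auditable
    ids = list(animal_ids)
    rng.shuffle(ids)  # the shuffled order determines group membership
    codes = rng.sample(range(1000, 10000), len(ids))  # unique 4-digit codes
    blinded, key = {}, {}
    for position, (animal, code) in enumerate(zip(ids, codes)):
        label = f"S{code}"
        blinded[animal] = label
        key[label] = groups[position % len(groups)]  # cycle through groups
    return blinded, key


blinded, key = allocate([f"mouse-{i:02d}" for i in range(1, 13)], seed=2024)
print(blinded)  # hypothetical animal IDs mapped to assessor-facing codes
```

Reporting the method and the seed in the resulting paper is what turns a step like this into the kind of explicit note the review found missing.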

Studies with Low Statistical Power

Naomi Altman, Professor of Statistics at The Pennsylvania State University, and Martin Krzywinski, staff scientist at Canada's Michael Smith Genome Sciences Centre, paint a dire picture of the consequences of low-powered studies. Given a hypothetical set of trials with a 10% chance of experimental effect and a power value of 0.2 — the median for neuroscience studies in their sample — over two-thirds of the positive results will be wrong. At 0.8, one-third of the positive experimental results are likely to be invalid.
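The arithmetic behind these figures can be sketched in a few lines. The example assumes a significance threshold of 0.05, which the passage above does not state, and the function name is illustrative.

```python
def false_discovery_rate(prior: float, power: float, alpha: float = 0.05) -> float:
    """Share of statistically significant results that are false positives.

    prior: fraction of tested hypotheses where a real effect exists
    power: probability of detecting a real effect
    alpha: significance threshold (assumed to be 0.05 here)
    """
    true_positives = prior * power          # real effects that are detected
    false_positives = (1 - prior) * alpha   # null effects that pass the threshold
    return false_positives / (true_positives + false_positives)


for power in (0.2, 0.8):
    fdr = false_discovery_rate(prior=0.10, power=power)
    print(f"power={power:.1f}: {fdr:.0%} of positive results are false")
```

Under these assumptions the sketch gives about 69% at a power of 0.2 and 36% at 0.8, in line with the "over two-thirds" and "one-third" figures above.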

Sample size calculation for meaningful statistical power is extremely important to the design of reproducible in vivo studies. Underpowered studies can inflate the size and statistical significance of effects, leading to the publication of work which cannot be replicated. Conversely, some experimental effects could go unnoticed, introducing confounding factors into the work of successive researchers.
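As a rough illustration of what such a calculation involves, the sketch below estimates group size for a two-group comparison of means. It uses the normal approximation and a standardized effect size (Cohen's d). The function name and default values are assumptions for illustration, and a real study should confirm the result with an exact t-test calculation or a statistician.

```python
from math import ceil
from statistics import NormalDist


def animals_per_group(effect_size: float, power: float = 0.8, alpha: float = 0.05) -> int:
    """Approximate n per group for a two-sided, two-group comparison of means.

    effect_size is the standardized difference between groups (Cohen's d).
    The normal approximation slightly underestimates n for small groups
    compared with an exact t-test calculation.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


for d in (0.5, 1.0, 1.5):
    print(f"d={d}: about {animals_per_group(d)} animals per group")
```

Because the required group size scales with the inverse square of the effect size, halving the expected effect roughly quadruples the number of animals needed, which is why sample size is best fixed before the experiment rather than justified afterward.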

Poor Experimental Design

While these three factors often influence the questions we ask and the results we report, they can further affect the validity of preclinical research through accompanying errors in experimental design.

In some cases, these errors are systemic and have clear analogs in both clinical and preclinical experiments. In vivo studies carried out in a single sex, or without careful attention to a model's genetic background, parallel the effective exclusion of women and minority populations from clinical trials, despite the impact of biological sex and genetic variance on patient outcomes.

Clinical Consequences of Study Design

A massive amount of wasted resources is attributable to errors in preclinical study design, and those same underlying mechanisms can have dangerous, unintended consequences in the clinical population.

Two noteworthy examples are mortality rates in ethnic minority communities and the occurrence of next-day impairment among female users of the sleeping aid zolpidem. Together, these situations will hopefully illustrate how a flawed study process, even when undertaken in good faith by ethical researchers, can produce results which mislead us as to the real-world performance of drug candidates.

Gender Bias in Study Populations

Around 4% of US adults use some form of medication to help them sleep. Several distinct multi-billion-dollar industries exist around assisting, studying, or somehow enhancing sleep. In 2011, Ambien (zolpidem) achieved an impressive 2.8 billion dollars in sales. For people who had previously struggled with long-term sleep disturbances, every dollar was money well spent.

Unfortunately, it wasn't long before evidence emerged of female Ambien users having disturbing experiences while using the medication, including sleepwalking, sleep-driving, and even assaulting others while completely unaware of their actions.

Women metabolize zolpidem at half the rate of male patients and run a much higher risk of remaining under its influence in the morning, despite using the medication as prescribed. The FDA was quick to issue gender-specific dosage guidelines for zolpidem which accounted for the higher "next day" serum levels found in female patients.

No nefarious action was taken to cover up differences in how women and men metabolize zolpidem, and, barring evidence to the contrary, we can assume each step in its preclinical and clinical research was taken in good faith. Given that, however, how did such a significant gender disparity in patient outcomes go unnoticed?

Excluding Female Research Subjects

The assumption that male subjects are a suitable default begins in preclinical research. This was documented in a 2011 survey of in vivo research across ten biological fields, which found a significant bias towards male experimental animals. In pharmaceutical research, 63% of studies used only male animals, while in neuroscience, single-sex studies of males outnumbered those of females 5.5 to 1.

If the sex of research animals is unreported, this can lead to studies which fail to replicate. It can also affect the translatability of research: a treatment for chronic pain tested exclusively in male mice might not work on women at all in clinical trials.

At the level of clinical trials, the exclusion of women was once the official guidance of the FDA, in response to the thalidomide crisis of the 1960s, and "women of childbearing potential" were excluded from participating in Phase I and early Phase II trials. Even before that, however, women were often excluded out of a belief that their more complex hormonal cycles provided too many variables for controlled studies in significant populations.

While that guidance has since been rescinded, newer mandates to study sex-specific impacts of drugs whenever appropriate do not apply to privately funded research. Many researchers either fail to recruit or actively exclude women from clinical studies, whether because of real or potential liability concerns, a lack of relevant physiological data (itself a result of historical exclusion), or economic constraints. Others mix women into the study population as though the genders were medically fungible.

The result is that a new treatment can possibly proceed from in vivo research, through clinical trials, and into commercial circulation without ever being tested on a scientifically significant number of female mammals, or without assessment of gender-variant efficacy and metabolism. When we're fortunate, the bottom-line clinical difference between the sexes is minimal. This was not the case with zolpidem, and gender bias in study populations placed thousands of women at risk.

Mortality Rates in Minority Communities

Mortality rates from asthma improved by one-third between 1990 and 2010, yet this improvement has not been consistent across racial groups. African Americans, in particular, still suffer disproportionately from asthma and receive less effective treatment.

No deliberate, racially-motivated decision-making process is being deployed to exclude African Americans from participating in asthma research. Instead, the issue is one that should be immediately familiar to in vivo researchers: failure to account for the influence of genetic background on drug efficacy and metabolism.

Genetic Background and Drug Efficacy

While the wider concept of race is non-biological, self-reported race is strongly correlated with ancestry and genetic variance. This can sometimes be clinically relevant, as genetic ancestry can alter both susceptibility to disease and responses to treatment. Some Latinas carry genes that make them more resistant to breast cancer, for example, while the antiplatelet drug clopidogrel is significantly less effective at preventing stroke, infarction, or cardiovascular death in Pacific Islanders.

Studies which do not take this genetic influence on treatment outcomes into consideration risk producing a positive result that favors publication, approval, and funding but has less predictive value for efficacy in the population at large.

In the case of clopidogrel, the drug was recommended after a study of 19,185 patients known as the CAPRIE study (Clopidogrel versus Aspirin in Patients at Risk of Ischaemic Events) found it to be more effective than aspirin. It was a randomized, blind trial — of 95% non-Hispanic white patients. In the two years before alternatives were found, twice as many Americans of Pacific Islander descent died following acute myocardial infarction as before the recommendation to treat them with clopidogrel.

With asthma in the African American community, we have a similar case for genetic variance in pathology and response to treatment. Fewer than 5% of NIH-funded respiratory research studies reported inclusion of non-white populations, yet recent genetic analysis has determined that only 5% of the genetic markers understood to correlate with asthma in populations of European descent are present in African Americans. Additionally, African Americans carry a novel asthma-associated variant of the PTCHD3 gene not present in Europeans.

Clinical trials for cardiovascular disease, diabetes, and cancer similarly exclude minority populations, despite those patients bearing the greater disease burden of each condition. Since 1993, in fact, fewer than 2% of more than 10,000 National Cancer Institute-funded studies have met the NIH standards for minority participation.

Improving Replication in In Vivo Research

Systemic efforts to improve reproducibility via increased transparency, bias mitigation, and third-party validation are to be supported, but a great deal of progress can be made by individual in vivo researchers or institutions.

To support those efforts, we put together a list of resources for improving quality and reproducibility in animal studies, covering topics from genetics and health testing to useful listservs in the in vivo research community.

If you would like to add a resource to this page, or have a question you would like to discuss in more detail, please contact our scientific staff.

Improved study design can help ensure rigorous, reliable results — and better outcomes for the patients who ultimately depend on our work.

JF Stackhouse is a content manager and technical writer covering engineering, science, and social systems.
