When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment


Null hypothesis significance testing (NHST) is linked to several shortcomings that are likely contributing factors behind the widely debated replication crisis in psychology and the biomedical sciences. In this article, Denes Szucs and John P. A. Ioannidis review these shortcomings and argue that NHST should no longer be the default, dominant statistical practice of all biomedical and psychological research. If theoretical predictions are weak, scientists should not rely on all-or-nothing hypothesis tests. When NHST is used, its use should be justified, and pre-study power calculations and effect sizes, including negative findings, should be published. The authors ask for hypothesis-testing studies to be pre-registered and, optimally, for all raw data to be published. Scientists should focus on estimating the magnitude of effects and the uncertainty associated with those estimates, rather than on testing null hypotheses.
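To make the contrast concrete, here is a minimal Python sketch, entirely our own illustration rather than anything from the article, comparing an all-or-nothing t-test verdict with an estimate of effect magnitude and its uncertainty; the data, group sizes and underlying effect are invented.

```python
# Minimal sketch (not from the paper): contrast a bare NHST verdict with an
# estimate of effect magnitude and its uncertainty for two hypothetical groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=30)   # invented data
treated = rng.normal(loc=0.4, scale=1.0, size=30)

# NHST view: a single p-value against the null hypothesis of "no difference".
t, p = stats.ttest_ind(treated, control)

# Estimation view: effect size (Cohen's d) plus a 95% confidence interval
# for the mean difference, conveying both magnitude and uncertainty.
diff = treated.mean() - control.mean()
pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = diff / pooled_sd
se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)
lo, hi = diff - 1.96 * se, diff + 1.96 * se

print(f"p = {p:.3f}; mean difference = {diff:.2f}, 95% CI ({lo:.2f}, {hi:.2f}), d = {cohens_d:.2f}")
```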

Enhancing the usability of systematic reviews by improving the consideration and description of interventions

The importance of adequate intervention descriptions in minimizing research waste and increasing reproducibility has gained attention in the past few years. Improving the completeness of intervention descriptions in systematic reviews is likely to be a cost-effective contribution towards facilitating evidence implementation from reviews – a statement that holds for the clinical area as well as for preclinical and basic research.
In this article, Tammy C Hoffmann and colleagues explore the problem and implications of incomplete intervention details during the planning, conduct, and reporting of systematic reviews and make recommendations for review authors, peer reviewers, and journal editors. The authors call on everyone with a role in producing, reviewing, and publishing systematic reviews to commit to helping solve this remediable barrier of inadequate intervention descriptions.

Power-up: a reanalysis of ‘power failure’ in neuroscience using mixture modelling


In 2013, a paper by Katherine S. Button and colleagues entitled ‘Power failure: why small sample size undermines the reliability of neuroscience’ was published in Nature Reviews Neuroscience. It attracted a great deal of attention at the time and has since been cited more than 1700 times. The authors concluded that the average statistical power of studies in the neuroscience field is very low, with consequences that include overestimated effect sizes and low reproducibility of results.
Now, four years later, Camilla Nord et al. have reanalyzed the dataset from the original publication and published their findings in the Journal of Neuroscience. The key finding of the new study is that the field of neuroscience is diverse in terms of power, with some branches doing relatively well. The authors demonstrate, using Gaussian mixture modelling, that the sample of 730 studies included in the analysis comprises several subcomponents; therefore, a single summary statistic is insufficient to characterize the nature of the distribution. This indicates that the notion that studies are systematically underpowered is not the full story and that low power is far from a universal problem. However, do these findings lessen concerns about statistical power in neuroscience? Unfortunately not. In fact, the authors conclude that the highly heterogeneous distribution of power itself demonstrates an undesirable inconsistency, both within and between methodological subfields.
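For readers unfamiliar with the approach, the sketch below shows the general idea of a Gaussian mixture analysis applied to a set of estimated study power values, using simulated numbers and scikit-learn rather than the authors' actual data or code.

```python
# Illustrative sketch only: fit Gaussian mixture models to estimated study
# power values and compare component counts by BIC, in the spirit of (but not
# identical to) the authors' mixture analysis. All numbers are simulated.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Hypothetical power estimates for 730 studies: a large low-power cluster and
# a smaller adequately powered cluster (invented, not the real dataset).
power = np.clip(np.concatenate([rng.normal(0.2, 0.08, 550),
                                rng.normal(0.8, 0.10, 180)]), 0.01, 0.99)
X = power.reshape(-1, 1)

# A single Gaussian (one summary statistic) versus multi-component mixtures.
for k in range(1, 5):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(f"k={k}  BIC={gmm.bic(X):.1f}  component means={np.round(gmm.means_.ravel(), 2)}")
# A lower BIC for k > 1 indicates that the power distribution is better
# described as a mixture of subcomponents than by a single average value.
```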

Science with no fiction: measuring the veracity of scientific reports by citation analysis

The current ‘reproducibility crisis’ in biomedical research is enabled by the lack of publicly accessible information on whether the reported scientific claims are valid. In this paper, published on bioRxiv, Peter Grabitz and colleagues propose an approach to solve this problem that is based on a simple numerical measure of veracity, the R-factor, which summarizes the outcomes of already published studies that have attempted to test a claim. The R-factor of an investigator, a journal, or an institution would be the average of the R-factors of the claims they reported. The authors illustrate this approach using three studies recently tested by the Reproducibility Project: Cancer Biology, compare the results, and discuss how using the R-factor can help improve the integrity of scientific research.
Although calculating the R-factor for a handful of reports is relatively simple, especially for an expert in the field, the question remains: who will calculate the R-factors for the thousands of researchers and their hundreds of thousands, or even millions, of reports?
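The arithmetic itself is simple. The sketch below assumes, per our reading of the preprint, that a claim's R-factor is the fraction of published studies testing the claim that confirmed it, and that an investigator's R-factor is the average over their claims; the citation counts are invented.

```python
# Hedged sketch of the R-factor arithmetic as we understand it from the
# preprint: R(claim) = confirming reports / all reports that tested the claim,
# and an investigator's R-factor is the mean over their claims.
def r_factor(confirming: int, refuting: int) -> float:
    tested = confirming + refuting
    return confirming / tested if tested else float("nan")

# Hypothetical (confirming, refuting) report counts for three claims
# by one investigator.
claims = [(7, 1), (2, 3), (0, 4)]
claim_r = [r_factor(c, r) for c, r in claims]
investigator_r = sum(claim_r) / len(claim_r)
print(claim_r, round(investigator_r, 2))
```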

Further potential shortcomings of the R-factor are discussed here.

Additional reads in July 2017

Further commentaries, articles and blog posts worth reading

Results-free review: impressions from the first published article

Make Data Count: Building a System to Support Recognition of Data as a First Class Research Output

Cancer studies pass reproducibility test

“Aufgebauscht, bis es falsch wird” (“Hyped up until it becomes false”; in German)

Image doctoring must be halted

The Wrong Path

Repeating important research thanks to Replication Studies

Paving the Way to More Reliable Research

Redefine Statistical Significance

When statistician Ronald Fisher introduced the P-value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether the obtained results are ‘worthy of a second look’, and he understood that the ‘threshold’ of 0.05 for defining statistical significance was rather arbitrary.
Since then, the lack of reproducibility of scientific studies has caused growing concern over the credibility of claims of new discoveries based on “statistically significant” findings. A much larger pool of scientists is now asking a much larger number of questions, possibly with much lower prior odds of success.
In this article, 72 renowned statisticians therefore propose to change the default P-value threshold for statistical significance for claims of new discoveries from 0.05 to 0.005. According to the authors and for research communities that continue to rely on null hypothesis significance testing, reducing the P-value threshold to 0.005 is an actionable step that will immediately improve reproducibility.
Importantly, however, the authors also emphasize that their proposal is about standards of evidence, not standards for policy action nor standards for publication. Results that do not reach the threshold for statistical significance (whatever it is) can still be important and merit publication in scientific journals if important research questions are addressed with rigorous methods. LINK
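The authors' reasoning about prior odds can be illustrated with the standard false-positive-risk calculation; the sketch below uses a textbook formula with assumed prior odds (1:10) and power (80%), which are our illustrative choices rather than numbers taken from the article.

```python
# False positive risk among "significant" results, for illustrative priors.
# FPR = alpha * P(H0) / (alpha * P(H0) + power * P(H1)); textbook formula,
# with prior odds and power chosen as assumptions, not taken from the paper.
def false_positive_risk(alpha: float, power: float, prior_odds: float) -> float:
    p_h1 = prior_odds / (1 + prior_odds)   # probability the tested effect is real
    p_h0 = 1 - p_h1
    return (alpha * p_h0) / (alpha * p_h0 + power * p_h1)

for alpha in (0.05, 0.005):
    fpr = false_positive_risk(alpha=alpha, power=0.8, prior_odds=1 / 10)
    print(f"alpha={alpha}: ~{fpr:.0%} of 'significant' findings would be false positives")
# With these assumptions, lowering the threshold from 0.05 to 0.005 cuts the
# false positive risk from roughly 38% to roughly 6%.
```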

Can cancer researchers accurately judge whether preclinical reports will reproduce?

There is currently a vigorous debate about the reproducibility of research findings in cancer biology. Whether scientists can accurately assess which experiments will reproduce original findings is important for determining the pace at which science self-corrects. To address this question, Daniel Benjamin et al. collected forecasts from basic and preclinical cancer researchers on the first 6 replication studies conducted by the Reproducibility Project: Cancer Biology, in order to assess the accuracy of expert judgments on specific replication outcomes. On average, researchers forecasted a 75% probability of replicating the statistical significance and a 50% probability of replicating the effect size, yet none of these studies successfully replicated on either criterion (for the 5 studies with results reported). Accuracy was related to expertise: experts with greater publication impact (as measured by the h-index) provided more accurate forecasts. However, experts did not consistently perform better than trainees, and topic-specific expertise did not improve forecast skill. The authors therefore conclude that experts tend to overestimate the reproducibility of original studies and/or underappreciate the difficulty of independently repeating laboratory experiments from original protocols. These findings may have important implications, as the authors also state that ‘knowing how well biomedical researchers can predict experimental outcomes is crucial for maintaining research systems that set appropriate research priorities, that self-correct, and that incorporate valid evidence into policy making’. LINK
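As a side note on how such forecast accuracy can be quantified, one common option, not necessarily the metric used in the paper, is the Brier score; the forecasts and outcomes below are invented for illustration.

```python
# Brier score: mean squared difference between probabilistic forecasts and
# binary outcomes (0 = did not replicate, 1 = replicated). Lower is better.
# A common way to score replication forecasts; the numbers are invented and
# this is not necessarily the metric used in the paper.
def brier_score(forecasts, outcomes):
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)

forecasts = [0.75, 0.80, 0.70, 0.75, 0.70]  # hypothetical P(significance replicates)
outcomes  = [0, 0, 0, 0, 0]                 # none of the five studies replicated
print(round(brier_score(forecasts, outcomes), 3))   # ~0.55, far from well calibrated
```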

Common pitfalls in preclinical cancer target validation

In this Perspective, published in Nature Reviews Cancer, William G. Kaelin Jr outlines some of the common pitfalls in preclinical cancer target identification and some potential approaches to mitigate them. An alarming number of papers from laboratories nominating new cancer drug targets contain findings that cannot be reproduced by others or are simply not robust enough to justify drug discovery efforts. This problem probably has many causes, including an underappreciation of the danger of being misled by off-target effects when using pharmacological or genetic perturbants in complex biological assays. This danger is particularly acute when, as is often the case in cancer pharmacology, the biological phenotype being measured (for example, decreased proliferation, decreased viability or decreased tumour growth) could simply reflect a nonspecific loss of cellular fitness. These problems are compounded by multiple hypothesis testing, such as when candidate targets emerge from high-throughput screens that interrogate multiple targets in parallel, and by a publication and promotion system that preferentially rewards positive findings.
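To illustrate the multiple-testing point, the simulation below, our own sketch rather than anything from the Perspective, shows how a screen of 1000 targets with no real effects still yields dozens of nominal p < 0.05 hits, and how a standard false-discovery-rate correction (Benjamini-Hochberg) removes most of them.

```python
# Why uncorrected screens mislead: with many parallel tests, raw p < 0.05
# "hits" appear even when no target has a real effect. Benjamini-Hochberg
# control of the false discovery rate (one standard remedy, not specific to
# the article) prunes most of them. All numbers are simulated.
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(2)
p_null = rng.uniform(size=1000)            # 1000 candidate targets, no real effect
print("raw 'hits' at p < 0.05:", (p_null < 0.05).sum())         # roughly 50 false leads

rejected, p_adjusted, _, _ = multipletests(p_null, alpha=0.05, method="fdr_bh")
print("hits after FDR correction:", rejected.sum())             # typically 0
```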
Development of the cancer drugs of tomorrow relies on the target identification and validation studies being carried out today. The author therefore concludes that the academic community needs to set a higher standard with respect to decision-enabling studies, and that we need to remind trainees, and ourselves, that publishing papers is a means to an end and not an end in itself. LINK

Is the staggeringly profitable business of scientific publishing bad for science?

Back in 1988, Robert Maxwell predicted that, in the future, there would be only a handful of immensely powerful publishing companies left, operating in an electronic age with very low printing costs, leading to almost ‘pure profit’. In this Guardian article, Stephen Buranyi describes how the scholarly publishing industry came to be what it is and how academia has been complicit in its development, at the expense of the advancement of knowledge and of the ability of all mankind to benefit from it.
This development is alarming at a time when publishers should actually be taking more responsibility for increasing the integrity of published results and the reproducibility of data. Why do we say that? Previous analyses have suggested that the quality of research conduct is more or less the same for scientifically excellent papers and for papers that are scientifically below average. This is a dangerous situation, because it is the scientifically excellent publications (typically those published in high-impact-factor journals) that trigger follow-up research and may therefore be more likely to lead to wasted time, resources and money.
Thus, there is one obvious solution: make an effort to equip scientifically excellent publications with particularly high quality standards. And publishers are best placed to get this process going. LINK
