‘Redefine statistical significance’

A P-value of < 0.05 is commonly accepted as the borderline between a finding and the empty-handed end of a research project. However, there are problems with that. First, P-values around 0.05 are notoriously irreproducible – as they should be on theoretical grounds (Halsey et al., 2015). Second, P-values around 0.05 are associated with a false discovery rate that can easily exceed 30% (Ioannidis 2005). Based on these considerations, David Colquhoun stated a few years ago: “a p∼0.05 means nothing more than ‘worth another look’. If you want to avoid making a fool of yourself very often, do not regard anything greater than p<0.001 as a demonstration that you have discovered something” (Colquhoun 2014). While many thought that this had to be taken with a grain of salt, a currently circulating preprint challenges the P<0.05 concept even further. It is a consensus statement from more than 70 leading statisticians, representing institutions such as Duke, Harvard and Stanford, and proposes to move to a new standard of P<0.005. Reducing the statistical alpha to a tenth of its current value will certainly reduce false positives in biomedical research, but key questions arise.
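To see where such false-discovery-rate figures come from, here is a minimal sketch of the underlying arithmetic – assuming, purely for illustration, that 10% of tested hypotheses are true and that studies are run at 80% power (neither number comes from the papers cited above):

```python
# Illustrative false-discovery-rate arithmetic (assumed prior and power, for illustration only).
# Of all tested hypotheses, a fraction `prior_true` are actually true effects,
# and true effects are detected with probability `power`.

def false_discovery_rate(alpha, prior_true=0.1, power=0.8):
    """Expected fraction of 'significant' results that are false positives."""
    false_positives = alpha * (1 - prior_true)  # true nulls that cross the threshold
    true_positives = power * prior_true         # real effects that cross the threshold
    return false_positives / (false_positives + true_positives)

for alpha in (0.05, 0.005, 0.001):
    print(f"alpha = {alpha}: expected FDR ~ {false_discovery_rate(alpha):.0%}")
# Under these assumptions, alpha = 0.05 yields an FDR of roughly 36%,
# consistent with the >30% figure quoted above; alpha = 0.005 brings it below 6%.
```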
First, the sample sizes required to power an experiment at a statistical alpha of 0.005 will simply be unfeasible in many if not most experimental models. In other words, feasible n’s will in most cases lead to inconclusive results – or at least to results that carry considerable uncertainty. I wonder whether this would be such a bad thing if discussed transparently. Research always has to handle uncertainty, and researchers should not hide this but rather discuss it. Rather than increasing sample sizes to unfeasible numbers, we should think of alternative approaches such as within-study confirmatory experiments, perhaps with somewhat different designs for added robustness.
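To put a rough number on the sample-size point, the sketch below uses statsmodels’ power calculation for a two-sample t-test, assuming (again purely for illustration) a medium effect size of d = 0.5 and 80% power:

```python
# Rough illustration of how the required group size grows when alpha is tightened.
# The effect size and power are assumptions; real numbers depend on the experimental model.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.005):
    n = analysis.solve_power(effect_size=0.5, alpha=alpha, power=0.8,
                             alternative='two-sided')
    print(f"alpha = {alpha}: ~{n:.0f} subjects per group")
# Under these assumptions, the required n per group rises from about 64 at alpha = 0.05
# to well over 100 at alpha = 0.005; often beyond what is feasible in animal studies.
```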
Second, shifting from 0.05 to 0.005 may simply replace one quasi-mythical value with another. However you set the statistical alpha, you will always be balancing the chance of false positives against that of false negatives. It is unlikely that one size fits all. If there is a big risk, for instance a deadly complication of a new drug or man-made climate change, I’d rather err on the safe side and may take countermeasures at P<0.1. However, in other cases I may be more concerned about false positives, e.g. in genome-wide association studies, where the sheer number of tests means that P-values are reported on a log scale and thresholds far stricter than 0.05 are standard.
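The trade-off can be made concrete with the same kind of power calculation as above: if the sample size is held fixed, tightening alpha buys fewer false positives at the price of more false negatives. A minimal sketch, assuming n = 64 per group and d = 0.5:

```python
# Sketch of the alpha trade-off at a fixed sample size (assumed n and effect size).
# Tightening alpha lowers the false-positive risk but raises the false-negative risk.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.1, 0.05, 0.005):
    power = analysis.power(effect_size=0.5, nobs1=64, alpha=alpha,
                           alternative='two-sided')
    print(f"alpha = {alpha}: power ~ {power:.2f}, false-negative risk ~ {1 - power:.2f}")
# Under these assumptions, power drops from roughly 0.88 at alpha = 0.1 to 0.80 at 0.05
# and to about 0.5 at 0.005: the choice of alpha shifts the error burden, it does not remove it.
```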
Third, a threshold P-value (statistical alpha) turns a grey zone of probabilities into a binary decision of whether to reject the null hypothesis. Such binary decisions can be important, for instance whether to approve a new drug. In most cases of biomedical research, we do not necessarily need such binary decisions but rather a careful weighing of the available data and understanding of the associated uncertainties.
In conclusion, a threshold of P<0.05 is inadequate in many ways. However, only in a few cases will a marked lowering of the threshold for statistical significance be the solution. Rather, what may be required is a more critical interpretation of the data and the associated uncertainty.

Two neglected aspects when discussing research quality

1. Scientific Excellence vs. Research Quality
2. Regulated vs. Non-regulated
Scientific excellence is the key to advancing science and developing novel drugs. However, scientific excellence does not guarantee that the conducted experiments deliver robust results. There are two primary reasons for this.
First, education in science does not always focus enough on the requirements for delivering robust data (e.g. statistical power, blinding, randomization, etc.).
Second, excitement associated with a scientific hypothesis or conveyed by a scientific leader may introduce bias in study design, conduct, analysis and/or reporting.
In the drug discovery and development process, there are several steps that have adequate quality control and are covered by GxP policies (e.g. Good Laboratory Practice, GLP).
For non-regulated areas (most notably the biology and pharmacology of drug discovery projects), GLP-like procedures would not be acceptable and may not even help to secure the quality of research. In fact, one may well imagine a lab running under GLP conditions yet still failing to design and execute robust studies.
Thus, for non-regulated areas of drug discovery, one needs to have a specialized set of Good Research Practice conditions that focus on study design, unbiased conduct, analysis and reporting.

‘German Leibniz Institute director Karl Lenhard Rudolph guilty of misconduct’

On June 15th, 2017, the Leibniz Association, one of the largest networks of non-profit research institutions in Germany, announced a decision in a case of misconduct against the Director of the Leibniz Institute for Aging Research – Fritz-Lipmann-Institut (FLI) in Jena, Professor Dr. Karl Lenhard Rudolph.
There are many cases of fraud and misconduct reported every year, but so far we (PAASP) have tried to stay out of this discussion. First, we believe that outright fraud represents a minor fraction of the problem with data quality in research. Second, our focus is on preventable sources of insufficient research quality that are related to lack of training, missing or ineffective quality assurance processes, etc. In other words, we focus on errors made unintentionally, as opposed to the intentionally biased presentation of results that we believe fraud to be. Poor data quality, in our sense, results from missing quality standards for technical procedures, or from failure to adhere to them, and not from the falsification and fabrication of data.
Yet, we have decided to highlight this new case of misconduct at FLI. Why?
In short, eight of the checked publications from FLI contained errors such as inappropriate duplication of images, presentation of false images, unjustified selection of the data to be presented, etc., which are clearly cases of misconduct. However, and very importantly, the press release additionally mentions deficient documentation (such as protocols and primary data saved in lab journals) as well as questionable reproducibility in four of the evaluated papers, where adherence to good research practice was not evident. In our opinion, this may need to be distinguished from misconduct, since the root cause of such behavior is most of the time ignorance, sloppiness or laziness.
Yet, this case shows that the border between fraud and poor quality in research is not that clear. If this is true, do we (the scientific community) have the correct focus when designing and executing efforts against fraud and misconduct? So far, most universities have research integrity offices and offer research integrity courses to their students and staff. To the best of our knowledge, most of these efforts do not actively promote principles of robust study design, do not require students to be taught best practices in data handling and reporting, and do not mandate the use of tools and resources that could increase adherence to good research practice (e.g. electronic lab notebooks). However, exactly these efforts would bring more transparency and sustainability to the research community, targeting not only the reproducibility crisis but most likely also making it harder for people prone to fraud.
Do we really need to wait until such high-profile misconduct cases occur to reconsider our efforts to prevent inappropriate research practices?