Glyphosate, or N-(phosphonomethyl)glycine, is a widely used broad-spectrum, nonselective herbicide that has been in use since 1974. Glyphosate effectively suppresses the growth of many species of trees, grasses, and weeds. It acts by interfering with the synthesis of the aromatic amino acids phenylalanine, tyrosine, and tryptophan, through the inhibition of the enzyme 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS). Importantly, EPSPS is not present in mammals, which obtain their essential aromatic amino acids from the diet.
Glyphosate is currently marketed under numerous trade names by more than 50 companies in several hundred crop protection products around the world, and more than 160 countries have approved uses of glyphosate-based herbicide products. Glyphosate has become the most heavily used agricultural chemical in history, and its safety profile, including its potential carcinogenicity, has been heavily debated by scientists, public media and regulatory authorities worldwide for the last several years. Given its widespread use, the key question is: could Glyphosate be toxic to humans?
In 2015, the International Agency for Research on Cancer (IARC), a research arm of the WHO, classified Glyphosate as “probably carcinogenic to humans”. It was categorized as ‘2A’, due to sufficient evidence of carcinogenicity in animals and strong evidence for two carcinogenic mechanisms but limited evidence of carcinogenicity in humans.
In contrast, the European Food Safety Authority (EFSA) concluded, based on the Renewal Assessment Report (RAR) for glyphosate that was prepared by the German Federal Institute for Risk Assessment (BfR), that ‘Glyphosate is unlikely to pose a carcinogenic hazard to humans and the evidence does not support classification with regard to its carcinogenic potential’.
Why do the IARC and the EFSA disagree?
To understand this discrepancy, it is important to note that the IARC carried out a hazard assessment, which evaluates whether a substance might pose a danger. The EFSA, on the other hand, conducted a risk assessment, evaluating whether Glyphosate actually poses risks when used appropriately. The differences between these two approaches can be explained by the following example:
Under real-world conditions, eating a normal amount of bacon (and other processed meats) raises the risk of colorectal cancer by an amount far too small to worry about. However, because bacon does appear to raise cancer risk by a tiny but reproducible and measurable amount, it is currently classified in IARC’s category ‘1’ (‘carcinogenic to humans’). The analysis done by the IARC therefore boils down to the question ‘Is there any possible way, under any conditions at all, that Glyphosate could be a carcinogen?’, while the EFSA tries to answer the question ‘Is Glyphosate actually causing cancer in people?’
However, these differences are not communicated clearly, and people are left confused by the contradictory reports. To make matters worse, both parties accuse each other of using inscrutable and misleading (statistical) methods.
Owing to the potential public health impact of glyphosate, which is an extensively used pesticide, it is essential that all scientific evidence relating to its possible carcinogenicity is publicly accessible and reviewed transparently in accordance with established scientific criteria.
As shown above, it is clearly important to understand the implications of the two different scientific questions being asked (is a substance hazardous vs. does it pose a real risk). Furthermore, the Glyphosate story has also demonstrated that science can only move forward through the careful evaluation of data and the rigorous review of findings, interpretations and conclusions. An important aspect of this process is transparency and the ability to question or debate the findings of others. This ensures the validity of the results and provides a strong basis for decisions.
A P-value of < 0.05 is commonly accepted as the borderline between a finding and the empty-handed end of a research project. However, there are problems with that. First, P-values around 0.05 are notoriously irreproducible – as they should be on theoretical grounds (Halsey et al., 2015). Second, P-values around 0.05 are associated with a false discovery rate that can easily exceed 30% (Ioannidis 2005). Based on these considerations, David Colquhoun stated a few years ago: “a p∼0.05 means nothing more than ‘worth another look’. If you want to avoid making a fool of yourself very often, do not regard anything greater than p<0.001 as a demonstration that you have discovered something” (Colquhoun 2014). While many thought that this had to be taken with a grain of salt, a currently circulating preprint further challenges the P<0.05 concept. It is a consensus statement from more than 70 leading statisticians, representing institutions such as Duke, Harvard and Stanford, and proposes moving to a new standard of P<0.005. Reducing the statistical alpha to one tenth of its current value will certainly reduce false positives in biomedical research, but key questions arise.
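To see where a false discovery rate above 30% can come from, here is a minimal sketch of the kind of calculation behind the Ioannidis and Colquhoun arguments; the prior probability that a tested hypothesis is true (10%) and the statistical power (80%) are illustrative assumptions, not figures from this post:

```python
# Sketch: expected false discovery rate at the significance threshold,
# assuming a prior probability of 0.10 that a tested hypothesis is true
# and 80% statistical power (illustrative assumptions, not from the text).
def false_discovery_rate(alpha, power=0.8, prior_true=0.10):
    true_positives = prior_true * power          # real effects detected
    false_positives = (1 - prior_true) * alpha   # null effects "detected"
    return false_positives / (false_positives + true_positives)

for alpha in (0.05, 0.005):
    print(f"alpha = {alpha}: FDR ~ {false_discovery_rate(alpha):.0%}")
# under these assumptions, alpha = 0.05 gives roughly 36%, alpha = 0.005 roughly 5%
```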
First, the sample sizes required to power an experiment for a statistical alpha of 0.005 will simply be infeasible in many, if not most, experimental models. In other words, feasible n’s will in most cases lead to inconclusive results – or at least to results that carry considerable uncertainty. I wonder whether this would be such a bad thing if discussed transparently. Research always has to handle uncertainty, and researchers should not hide this but rather discuss it. Rather than increasing sample sizes to infeasible numbers, we should think of alternative approaches such as within-study confirmatory experiments, perhaps with somewhat different designs for added robustness.
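As a rough illustration of the feasibility problem, the sketch below uses statsmodels’ standard power calculation for a two-sample t-test; the medium effect size (Cohen’s d = 0.5) and 80% power are assumptions chosen for illustration only:

```python
# Sketch: sample size per group needed for a two-sample t-test at
# alpha = 0.05 vs. alpha = 0.005 (assumed Cohen's d = 0.5, power = 0.8;
# these numbers are illustrative, not taken from the text).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.005):
    n = analysis.solve_power(effect_size=0.5, alpha=alpha, power=0.8,
                             alternative='two-sided')
    print(f"alpha = {alpha}: about {n:.0f} subjects per group")
```

Under these assumptions, the required group size roughly doubles when the threshold is tightened – a difference that can decide whether an animal study is realistic at all.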
Second, shifting from 0.05 to 0.005 may simply replace one quasi-mythical value with another. However you set the statistical alpha, you will always be balancing the chance of false positives against that of false negatives, and it is unlikely that one size fits all. If there is a big risk, for instance a deadly complication of a new drug or man-made climate change, I’d rather err on the safe side and may take counter-measures at P<0.1. In other cases, however, I may be more concerned about false positives, e.g. in genome-wide association studies, where P-values are given on a log scale.
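This trade-off can be made concrete with a small simulation: at a fixed, feasible sample size, a lower alpha buys fewer false positives at the cost of more missed real effects. All numbers in the sketch below (n = 10 per group, a true effect of d = 0.8 in half of the simulated experiments) are illustrative assumptions:

```python
# Sketch: at a fixed, feasible n, lowering alpha trades false positives
# for false negatives (all numbers here are illustrative assumptions).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, d, n_sims = 10, 0.8, 20_000

for alpha in (0.05, 0.005):
    false_pos = false_neg = 0
    for i in range(n_sims):
        effect = d if i % 2 else 0.0      # real effect in half of the "experiments"
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        p = ttest_ind(a, b).pvalue
        if effect == 0.0 and p < alpha:
            false_pos += 1                # significant result without a real effect
        elif effect > 0.0 and p >= alpha:
            false_neg += 1                # real effect missed
    print(f"alpha = {alpha}: "
          f"false positive rate {false_pos / (n_sims / 2):.1%}, "
          f"false negative rate {false_neg / (n_sims / 2):.1%}")
```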
Third, a threshold P-value (statistical alpha) turns a grey zone of probabilities into a binary decision of whether to reject the null hypothesis. Such binary decisions can be important, for instance whether to approve a new drug. In most cases of biomedical research, we do not necessarily need such binary decisions but rather a careful weighing of the available data and understanding of the associated uncertainties.
In conclusion, a P<0.05 threshold is inadequate in many ways. However, only in a few cases will a marked lowering of the threshold for statistical significance be the solution. Rather, a more critical interpretation of data and their uncertainty may be required.
Null hypothesis significance testing (NHST) is linked to several shortcomings that are likely contributing factors behind the widely debated replication crisis in psychology and the biomedical sciences. In this article, Denes Szucs and John P. A. Ioannidis review these shortcomings and suggest that NHST should no longer be the default, dominant statistical practice of all biomedical and psychological research. If theoretical predictions are weak, scientists should not rely on all-or-nothing hypothesis tests. When NHST is used, its use should be justified, and pre-study power calculations and effect sizes, including negative findings, should be published. The authors ask for hypothesis-testing studies to be pre-registered and, optimally, for all raw data to be published. Scientists should focus on estimating the magnitude of effects and the uncertainty associated with those estimates, rather than on testing null hypotheses.
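As a minimal sketch of what such estimation-focused reporting could look like (hypothetical data, and a simple percentile bootstrap chosen for illustration rather than any method prescribed by the authors):

```python
# Sketch: report an effect size estimate with its uncertainty instead of
# only a significance verdict (data and method here are hypothetical).
import numpy as np

rng = np.random.default_rng(1)
control = rng.normal(10.0, 2.0, 12)    # hypothetical measurements
treated = rng.normal(12.0, 2.0, 12)

diff = treated.mean() - control.mean()

# Simple percentile bootstrap for a 95% CI of the mean difference
boot = [rng.choice(treated, 12).mean() - rng.choice(control, 12).mean()
        for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"mean difference = {diff:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```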