Error bars can convey misleading information

by Martin C. Michel

The most common type of graphical data reporting is the bar graph depicting means with SEM error bars. Based on simulated data, Weissgerber et al. have argued convincingly that bar graphs are not showing but rather hiding data, as various patterns of underlying data can lead to the same mean value (Weissgerber et al., 2015). Thus, an apparent inter-group difference can represent symmetric variability in both groups, as most would assume. However, it could also be driven by outliers, by a bimodal distribution within each group, or by unequal sample sizes across groups. Each option may reach statistical significance, but the story behind the data may differ considerably. Weissgerber et al. have also shown that the choice of how variability is depicted affects, at least psychologically, how we perceive data. Thus, the SEM (SD divided by the square root of n) yields the smallest error bar, and a small error bar may make even a small group difference look large, even if the overlap between the groups is considerable. To look into this further, I have gone back to previously published real data from my lab (Frazier et al., 2006). That study explored possible differences in the relaxation of the urinary bladder by several β-adrenoceptor agonists between young and old rats. At the time, not knowing any better, we reported means with SEM error bars. In the figure below, I show a bar graph based on means with SEM error bars, as the data had been presented in the paper, along with other types of data representation. Looking at this panel only, it appears that there may be a fairly large difference between the young and old rats, i.e. old rats exhibiting only about half of the maximum relaxation. But if we look at the scatter plot, two problems with this interpretation appear. First, there was one rat among the old rats in which noradrenaline caused hardly any relaxation. It does not look like a major outlier but clearly had an impact on the overall mean.
Second, there is considerable overlap in the noradrenaline effects between the two age groups: only 5 out of 9 measurements in old rats yielded values smaller than the lowest value in the young rats. Thus, these real data confirm that means may hide existing variability in the data and suggest a certainty of conclusions that may not be warranted. As proposed by Weissgerber et al., the scatter plot conveys the real data much better than the bar graph and gives readers the choice to interpret the data as they are. Unless there is a very large number of data points, the scatter plot is clearly superior to the bar graph.

However, when data are not shown in a figure but in the main text, not all data points can be presented and a summarizing number is required. If one looks at the four bar graphs (each showing the same data, only with a different type of error bar), they convey different messages. The graph with an SEM error bar makes it look as if the difference between the two groups is quite robust, as the group difference is more than three times the magnitude of the error bar. However, we have seen from the scatter plot that this is not what the data really say. On the other hand, SD error bars are by definition larger. For normally distributed data, about 95% of all values fall within twice the SD of the mean. Looking at the SD error bars, it is quite clear that the two groups overlap. This is what the raw data say, but not the impression created by the SEM error bars.

There is also a conceptual difference between SD and SEM error bars. The SD describes the variability within the sample, whereas the SEM describes the precision with which the group mean has been estimated. An alternative way to present the precision of the parameter estimate is the 95% confidence interval. In this specific case, it provides a similar message as the SD error bar, i.e. the two populations may differ but probably overlap. Of note, SEM and SD are only meaningful if the samples come from a population with a Gaussian distribution (or at least close to it). In biology, this often is not the case, or we at least do not have sufficient information for an informed decision. It then involves fewer assumptions to show medians. To express the variability of data depicted as medians, the interquartile range is a useful indicator. In this example, it conveys a similar message as the SD or confidence-interval error bars.
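The summary statistics discussed here are straightforward to compute. The sketch below uses invented relaxation values (not the published data) to contrast SD, SEM, a normal-approximation 95% confidence interval, and the median with its interquartile range:

```python
import math
import statistics

def summarize(sample):
    """Return common summary statistics for a small sample.

    SD describes variability within the sample; SEM (= SD / sqrt(n))
    describes the precision of the estimated mean. The 95% CI here uses
    the normal approximation (mean +/- 1.96 * SEM); for small samples a
    t-based interval would be somewhat wider.
    """
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)          # sample SD (n - 1 denominator)
    sem = sd / math.sqrt(n)
    ci95 = (mean - 1.96 * sem, mean + 1.96 * sem)
    q1, med, q3 = statistics.quantiles(sample, n=4)  # quartiles
    return {"mean": mean, "sd": sd, "sem": sem,
            "ci95": ci95, "median": med, "iqr": (q1, q3)}

# Hypothetical relaxation values (% of maximum) for two groups:
young = [85, 90, 78, 92, 88, 95, 81, 87]
old = [45, 70, 62, 5, 58, 75, 66, 52, 60]   # note the one very low value

for label, grp in [("young", young), ("old", old)]:
    s = summarize(grp)
    print(label, round(s["mean"], 1), "SD", round(s["sd"], 1),
          "SEM", round(s["sem"], 1), "median", s["median"])
```

Running this on the two invented groups makes the point of the article visible in numbers: the SEM is always the smallest of the error estimates, so it produces the most optimistic-looking error bars.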

In summary, many different data patterns may lead to similar bar graphs, but a different biology may be hiding behind them in each case. Therefore, the scatter plot (where possible) is clearly the preferred option for showing quantitative data. If means with error bars have to be shown, e.g. within the main text, the SD is the error bar of choice to depict variability and the confidence interval to depict the precision of the parameter estimate. For data from populations with a non-Gaussian distribution, medians with interquartile ranges are the preferred option when scatter plots are not possible.


Frazier EP, Schneider T, Michel MC (2006) Effects of gender, age and hypertension on ß-adrenergic receptor function in rat urinary bladder. Naunyn-Schmiedeberg’s Arch Pharmacol 373: 300-309

Weissgerber TL, Milic NM, Winham SJ, Garovic VD (2015) Beyond bar and line graphs: time for a new data presentation paradigm. PLoS Biol 13: e1002128

What does the Net Present Value have to do with preclinical drug discovery research?

When looking for investors in the life-science sector, entrepreneurial scientists and start-up companies have to deal with an unavoidable question: ‘What is the appropriate valuation of my idea or business?’ Venture capitalists may hesitate to invest in biotechnology if bioentrepreneurs fail to provide or accept realistic estimates of the value of their technologies. One underlying reason is that there is often little intuition about what biotech companies are worth, and numbers can sometimes seem very arbitrary. Furthermore, owing to the complexity and specificity of scientific knowledge, it can be challenging and time-consuming to evaluate the technological and scientific risks associated with an early-stage biotech company.

Traditionally, the typical instruments for valuation in the biomedical biotech area are based on Discounted Cash Flow (DCF) analysis and the Net Present Value (NPV) model. These approaches require revenue and growth projections as well as projections of the potential market share. In addition, the net price of the future drug, the costs per clinical trial and market access are further parameters normally considered. Within these calculations, the assumptions regarding price, peak market share and accessible market have the greatest impact on the venture valuation. By following this type of analysis, investors usually focus more on parameters relevant to the commercial phase of a product than to the R&D phase.

Importantly, additional risk adjustments can be applied to the NPV calculation by weighting future cash flows by the probability of a drug progressing from one development stage to the next, resulting in a risk-adjusted NPV (rNPV). However, the reference data for determining these attrition risks are usually calculated from historical information on the success rate in each development phase for products of a similar category (e.g. type of disease) – without taking pre-clinical data quality and integrity into account.
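As a rough illustration of the principle (with invented phase costs, revenues and transition probabilities, not figures from any real project), an rNPV can be sketched by weighting each year's cash flow by the cumulative probability of reaching that stage and then discounting:

```python
def rnpv(cash_flows, transitions, discount_rate):
    """Risk-adjusted NPV: each year's cash flow is weighted by the
    cumulative probability of having reached that year, then discounted.

    transitions[i] is the probability of progressing from stage i to
    stage i + 1 (use 1.0 where no attrition applies).
    """
    value = 0.0
    p_cum = 1.0
    for year, (cf, p_next) in enumerate(zip(cash_flows, transitions)):
        value += p_cum * cf / (1 + discount_rate) ** year
        p_cum *= p_next  # probability of surviving into the next stage
    return value

# Illustrative numbers only (US$M): R&D outlays, then revenues if launched.
cash_flows = [-10, -20, -40, -60, 150, 200]
transitions = [0.6, 0.35, 0.6, 0.85, 1.0, 1.0]
print(round(rnpv(cash_flows, transitions, 0.12), 1))
```

Because the early, certain outlays are barely discounted while the late, improbable revenues are both discounted and heavily probability-weighted, the same project can look very different under NPV and rNPV.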

At least for early-stage companies (i.e. before clinical Proof-of-Concept), this is quite surprising and risky, as only robust, high-quality pre-clinical data can build a solid foundation for the future success of drug R&D projects. Several steps within the R&D process are covered by GxP-based quality procedures (e.g. GLP, GCP, GMP) that aim to protect the integrity of research. However, these standards cannot be applied to the basic and pre-clinical stages of drug discovery, and, consequently, biotech companies can differ widely in the quality of the data sets they generate. Thus, the quality of pre-clinical research data is crucial and should be taken into account if the venture valuation is done before Proof-of-Concept in Phase II is delivered. It is therefore highly recommended to analyze the likelihood that a given set of preclinical data is robust enough to support a successful clinical drug development project.

Uncertainties regarding data quality and robustness can be reflected by superimposing Monte Carlo (MC) simulations on the rNPV calculations, which returns a range of possible outcomes and, importantly, the probability of their occurrence – rather than providing only a single return-on-investment figure, as the rNPV does. In reality, only a small minority of drug development scenarios have positive cash flows (when a project reaches beyond the pre-registration phase) and most scenarios have, in fact, a negative rNPV (when one of the clinical trials yields a negative result). In contrast to the standard rNPV, more advanced models (e.g. risk-profiled MC valuations) indeed place the focus on clinical Phase I/II failures as the most probable outcome. Hence, the costs and lengths of Phase I/II trials become the most critical parameters with the highest impact on the valuation.

Furthermore, for projects where data robustness and the probability of reproducing preclinical data are low, most of the rNPV range will shift towards a negative mean, providing a more accurate view of the risks involved in pharmaceutical R&D.

Monte Carlo (MC) simulation: The MC calculation is usually repeated hundreds of times, using different input values for each parameter. The rNPV (in US$K) is plotted against the probability for each rNPV value.
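A minimal sketch of such a simulation is shown below, assuming invented phase costs and success probabilities (and omitting discounting for brevity). Each run samples whether each clinical phase succeeds; failure stops the project with the costs spent so far, so the result is a distribution of outcomes rather than a single figure:

```python
import random

def simulate_rnpv(n_runs=10000, seed=42):
    """Monte Carlo sketch of a drug project's value distribution.

    Hypothetical (cost, success probability) per clinical phase; the
    payoff is only earned if all phases succeed. Discounting omitted.
    """
    rng = random.Random(seed)
    phases = [(-10, 0.6), (-30, 0.35), (-80, 0.6)]  # US$M, Phases I-III
    payoff = 400.0                                  # if everything succeeds
    outcomes = []
    for _ in range(n_runs):
        npv = 0.0
        failed = False
        for cost, p_success in phases:
            npv += cost                 # the phase cost is always incurred
            if rng.random() > p_success:
                failed = True           # trial fails; project stops here
                break
        if not failed:
            npv += payoff
        outcomes.append(npv)
    return outcomes

results = simulate_rnpv()
negative_share = sum(1 for v in results if v < 0) / len(results)
print(f"share of runs with negative NPV: {negative_share:.2f}")
```

Even with a large payoff in the success branch, most simulated runs end in a Phase I or Phase II failure and a negative value, which is exactly the pattern the risk-profiled MC valuations emphasize.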

Given the importance of preclinical data for the outcome of all subsequent clinical trials, only a plausible evaluation of the quality, robustness and integrity of all pre-clinical studies, ideally via a third-party assessment, will complete any Due Diligence process and should therefore be a critical and valuable part of the decision-making procedure for modern portfolio management.

Number of citations as a measure of research quality – a dangerous approach

How Many Is Too Many? On the Relationship between Research Productivity and Impact. In this research article, published in PLOS ONE (Larivière V, Costas R (2016) PLoS ONE 11(9): e0162709), V. Larivière and R. Costas analysed the publication and citation records of more than 28 million researchers who published at least one paper between 1980 and 2013. Using this database, the authors tried to understand the relationship between research productivity and scientific impact. They addressed the question of whether incentives for scientists to publish as many papers as possible lead to higher-quality work – or just more publications. They found that, in general, an increasing number of scientific articles per author did not yield lower shares of highly cited publications, or, as Larivière and Costas put it: ‘the higher the number of papers a researcher publishes, the higher the proportion of these papers are amongst the most cited’.

There are two reasons why we find this paper very interesting and worth reading:

On the one hand, here at PAASP, we are very much interested in the reverse relationship – whether quality has an impact on productivity. Indeed, some colleagues are worried that introducing and maintaining higher quality standards in research could mean fewer papers published, fewer opportunities to publish in high-impact-factor journals, or a longer duration of student projects (e.g. for PhD students).

On the other hand, this paper reminds us that using citation numbers as an index of quality is a dangerous approach. For example, we have used data generated by the Reproducibility Project: Psychology (Open Science Framework) to plot citations for papers whose findings were replicated vs. papers with findings that were not replicated (an Excel table with the raw data is available upon request).

As the graph below illustrates, it does not matter how often the senior or first authors have been cited during their careers or how many times a particular paper has been cited: there are no differences between publications whose findings could be replicated and those whose findings could not!

Reproducibility Project Psychology

The False Discovery Rate (FDR) – an important statistical concept

The p-value reports the probability of seeing a difference as large as the observed one, or larger, when the two samples come from populations with the same mean value. However, and in contrast to a common perception, the p-value does not give the probability that an observed finding is true!

When conducting multiple comparisons (e.g. thousands of hypothesis tests are often conducted simultaneously when analyzing results from genome-wide studies), there is an increased probability of false positives. While there are a number of approaches to address the problems of multiple testing, most of them reduce the p-value threshold from 5% to a more stringent value.

In 1995, Benjamini and Hochberg introduced the concept of the False Discovery Rate (FDR) as a way to allow inference when many tests are being conducted. The FDR is the expected ratio of the number of false positive results to the total number of positive test results. The distinction matters: a p-value threshold of 0.05 implies that 5% of all truly null tests will give false positives, whereas an FDR-adjusted p-value (also called a q-value) of 0.05 indicates that 5% of the tests called significant will be false positives. In other words, an FDR of 5% means that, among all results called significant, only 5% are truly null.
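The step-up procedure Benjamini and Hochberg proposed is simple enough to sketch in a few lines (the p-values below are invented for illustration):

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns the (sorted) indices of the tests declared significant while
    controlling the false discovery rate at the given level.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * fdr;
    # all tests with smaller sorted rank are also declared significant.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            k_max = rank
    return sorted(order[:k_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.21, 0.76]
print(benjamini_hochberg(pvals))  # only the two smallest p-values survive
```

Note the step-up behaviour: a p-value that fails its own per-rank threshold can still be declared significant if a larger p-value further down the sorted list passes its threshold.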

The importance of the FDR can be nicely demonstrated by analysing the following scientific publication:

In a study published in Nature Medicine in 2014 (Mapstone et al., 2014; Nature Medicine 20: 415-418), the authors discovered a biomarker panel of ten lipids from peripheral blood that predicted phenoconversion to Alzheimer’s disease within a 2-3 year timeframe. Importantly, the reported sensitivity and specificity of the proposed blood test were over 90%. In general, an accuracy of 90% is considered appropriate for a screening test in normal-risk individuals, and the reported results triggered a high degree of optimism – but is it justified?

This paper may well represent progress towards an AD blood test, but its usefulness depends on the rate of Alzheimer’s in the population being screened:

Given a general Alzheimer’s incidence rate of 1%, out of 10,000 people 100 will have the condition, and a test based on the described biomarker panel will reveal 90 true positive results (box at top right). However, what about the false positive results? Although 9,900 people do not have the condition, the test will show a false positive result for 990 of them (box at bottom right), which leads to a total of 1,080 positive results (990 false positives plus 90 true positives). Of these, 990/1,080 are false positives, resulting in a False Discovery Rate of 92%. That is, over 90% of positive screening results would be false!
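The arithmetic of this example follows directly from Bayes’ rule and is easy to reproduce:

```python
def false_discovery_rate(prevalence, sensitivity, specificity):
    """Fraction of positive test results that are false, via Bayes' rule."""
    true_pos = prevalence * sensitivity            # diseased, test positive
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy, test positive
    return false_pos / (true_pos + false_pos)

# Worked example from the text: 1% incidence, 90% sensitivity/specificity.
fdr = false_discovery_rate(0.01, 0.90, 0.90)
print(f"{fdr:.1%}")  # prints "91.7%": most positive results are false
```

Changing the prevalence argument shows immediately why the same test performs so much better in a high-risk population.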

False Discovery Rate

As a classic example of Bayes’ theorem, calculating the FDR clearly demonstrates that a test with a 90% (true positive) accuracy rate will misdiagnose (supply a false positive to) almost 92% of the people who test positive (if the actual disease incidence rate is 1%). These sorts of calculations are misunderstood even by people who should know better, e.g. physicians. As can be seen from the above example, a key driver of the FDR is the a priori probability of a hypothesis (in this case the known incidence of Alzheimer’s). If the prior probability is low, the FDR will be high for a given p-value; if the prior probability is high, the FDR tends to be lower.

Consequently, compared to the p-value, the FDR has some useful properties. Controlling for the FDR is a way to identify as many significant tests as possible while incurring a relatively low proportion of false positives. Using the FDR allows scientists to decide how many false positives they are willing to accept among all the results that can be called significant.

When deciding on a cut-off or threshold value, the decision should focus on how many false positive results the test will reveal, rather than just picking a p-value of 0.05 and assuming that every comparison with p < 0.05 is significant.

Blinding – does it really have an impact?

Zubin Mehta, conductor of the Los Angeles Symphony from 1964 to 1978 and of the New York Philharmonic from 1978 to 1990, is credited with saying, “I just don’t think women should be in an orchestra.” In 1970, the top five orchestras in the U.S. had fewer than 5% female musicians; this number gradually increased over the years, reaching on average 25-30%. So, what was the source of this change?

Well, blinding seems to be one of the factors!

In the 1970s and 1980s, orchestras began using blind auditions. Candidates and jury members were separated by a curtain in a way that they could not see each other. This blinding process was found to account for at least 30% of the increase in the female proportion of “new hires“ at major symphony orchestras in the US (see Figure below, modified from Goldin & Rouse (2000) American Economic Rev 90: 715).

By the way, the first blinded auditions provided an astonishing result: men were still favoured over women!

It was later discovered that while the screens kept juries from seeing the candidates move into position, the sound of the women’s heels as they entered the stage still influenced the jury without their being aware of it. Once the musicians removed their shoes, almost 50% of the women made it past the first audition.

Scientists believed a whiff of the bonding hormone Oxytocin could increase trust between humans. Then they went back and checked their work…


Over the last two decades, the neuropeptide oxytocin (OT) has been studied extensively, and many articles have been published about its role in humans’ emotional and social lives, e.g. increasing trust and sensitivity to others’ feelings. Even a TED talk has been recorded (Trust, morality – and oxytocin?) with over 1.4 million views.


Structure of Oxytocin

The human trials conducted were based on early animal studies, in which manipulations of the OT system translated into behavioral phenotypes affecting social cognition, bonding and individual recognition.

However, some recent publications question the sometimes bewildering evidence for the role of OT in influencing complex social processes in humans, and have failed to reproduce some of the most influential studies conducted. Furthermore, no elevated cerebrospinal fluid (CSF) OT levels could be detected 45 min after administration, the time window at which most behavioral tasks took place (Striepens et al., 2013); CSF OT concentrations were increased only after 75 minutes, indicating that OT pharmacokinetics is not fully understood. Moreover, it is still unclear whether the doses usually administered in the field (between 24 and 40 IU) can deliver enough OT to the brain to produce significant changes in individuals (Leng et al., 2016).

This ultimately leads to the following question: ‘If the published literature on the OT effects does not reflect the true state of the world, how has the vast behavioral OT literature accumulated (Lane et al., 2016)?’

Several possible scenarios and reasons are currently discussed and analyzed amongst OT researchers, demonstrating the crucial importance of implementing Good Research Practice standards, proper study design and a priori statistical power calculations:


Power analysis:

A meta-analysis of the effects of OT on human behavior found that the average OT study has a statistical power of 16% for healthy individuals and a median sample size of 49 individuals. For clinical trials the statistical power was even lower (12%), given a median sample size of 26 individuals (Walum et al., 2016).

Hence, OT studies in humans are dangerously underpowered: 80% is normally considered the minimum adequate statistical power. Even for the studies with the largest effect and sample sizes (N = 112), the statistical power was lower than 70%. To achieve 80% power for the average effect size reported, a sample size of 352 healthy individuals would be needed (310 individuals for clinical trials).
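Power and sample-size figures of this kind can be approximated with the standard normal-approximation formulas; the sketch below is not the exact method used by Walum et al., and the effect size d = 0.30 is an assumed value for illustration:

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sample comparison
    (normal approximation): n = 2 * ((z_alpha + z_beta) / d) ** 2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

def power_given_n(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided, two-sample test with n per group."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(effect_size * math.sqrt(n / 2) - z_alpha)

# With an assumed small effect (d = 0.30), roughly 25 subjects per group -
# a typical OT study size - gives power far below the 80% convention,
# while achieving 80% power requires a much larger sample.
print(n_per_group(0.30), round(power_given_n(0.30, 25), 2))
```

Plugging typical OT-study sample sizes into `power_given_n` reproduces the order of magnitude of the meta-analysis: power well under 20% for small samples and small effects.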

Statistical power is the probability that a test will reject the null hypothesis when a true effect of a given size exists. In other words, replication attempts of true positive OT studies (with the same sample size) would fail 84% or 88% of the time for healthy and clinical samples, respectively, given the corresponding false negative rates. To further aggravate the problem, the effect size observed in underpowered studies is likely to be highly exaggerated, a phenomenon also known as “the winner’s curse”.

In addition, this meta-analysis also demonstrated that the positive predictive value (PPV) of those studies (using information on power, the pre-study odds and the alpha level) is low. Therefore, it was concluded that most of the reported positive findings in this field are likely to be false positives (Walum et al., 2016).


Publication bias:

Almost all studies (29 out of 33) investigated as part of the meta-analysis (Walum et al., 2016) reported at least one positive result (p-value below 0.05). This huge excess of statistically significant findings clearly points towards a phenomenon referred to as the ‘file-drawer effect’, or publication bias, suggesting that there could be a substantial amount of unpublished negative or inconclusive findings.

In an admirable attempt to investigate whether there is a file-drawer problem in OT research, Anthony Lane at the Catholic University of Louvain analyzed all studies performed in his laboratory from 2009 until 2014, on a total of 453 subjects (Lane et al., 2016). Indeed, he found a statistically significant effect of OT in only one out of 25 tasks. This large proportion of ‘unexpected’ null findings, which were never published, raised concerns about the validity of what is known about the influence of OT on human behavior and cognition. Lane therefore states that ‘our initial enthusiasm for OT has slowly faded away over the years and the studies have turned us from “believers” into “skeptics”’.

This publication bias is further supported by the current publication culture and the strong tendency of journals to favor results that confirm hypotheses and to neglect unconvincing data.


Study design:

In addition to publication bias, the excess of significant effects of OT may also be the result of methodological, measurement or statistical artefacts: Lane’s laboratory also reported a widespread use of ‘between-subject’ designs with relatively small sample sizes (around 30 individuals per study), which carries the risk of attributing effects to OT that are in fact generated by various unobserved factors, e.g. the personality of participants (Lane et al., 2016).

Furthermore, Lane et al. twice failed to replicate their own previous study (Lane et al., 2015), which had shown a powerful effect of OT in increasing the trusting behavior of study participants. Notably, in the original study, OT administration followed a single-blind procedure, where the subject is blind to the treatment condition but the experimenter is not, introducing the risk that the experimenter might unconsciously act differently and thereby influence the subjects’ behavior to confirm the researcher’s hypothesis (unconscious behavioral priming). Both subsequent replication attempts were conducted in a double-blinded manner!

Importantly, the statistical and methodological limitations discussed here are not specific to the OT field and also directly affect other areas of biomedical research. Nevertheless, a systematic change in research practices and in the OT publication process is required to increase the trustworthiness and integrity of the data and to reveal the true state of OT effects. The adherence to detailed Good Research Practices (e.g. a priori power calculations and accurate blinding procedures) and a transparent reporting of methods and findings should therefore be strongly encouraged.

Accurate design of in vitro experiments – why does it matter?

Good statistical design is a key aspect of meaningful research. Elements such as data robustness, randomization and blinding are widely recognized as being essential to producing valid results and reducing biased assessment. Although commonly used in in vivo animal studies and clinical trials, why is it that these practices seem to be so often overlooked in in vitro experiments?

In this thread we would like to stimulate a discussion about the importance of this issue, the various designs available for typical in vitro studies, and the need to carefully consider what is ‘n’ in cell culture experiments.

Let’s consider pseudoreplication, a relatively serious error of experimental planning and analysis that hasn’t received much attention in the context of in vitro research.

The term pseudoreplication was defined by Hurlbert more than 30 years ago as “the use of inferential statistics to test for treatment effects with data from experiments where either treatments are not replicated (though samples may be) or replicates are not statistically independent” (Hurlbert SH, Ecol Monogr. 1984, 54: 187-211). In other words, the exaggeration of the statistical significance of a set of measurements because they are treated as independent observations when they are not.

Importantly, the independence of observations or samples is (in the vast majority of cases) an essential requirement on which most statistical methods rely. Analyzing pseudoreplicated observations ultimately results in erroneous, too-small confidence intervals and inaccurate p-values, as the underlying experimental variability is underestimated and the number of degrees of freedom (independent observations) is incorrect. Thus, statistical significance can be greatly inflated, leading to a higher probability of a Type I error (falsely rejecting a true null hypothesis).

To add to the confusion, the word ‘replication’ is often used in the literature to describe technical replicates or repeated measurements on the same sample unit, but can also be used to describe a true biological replicate, which is characterized as “the smallest experimental unit to which a treatment is independently applied” (Heffner et al., Ecology, 1996, 77 (8) 2558-2562).

To understand pseudoreplication-related issues, it is therefore crucial to carefully define the term biological replicate (= data robustness) in this context and to distinguish it from a technical replicate (= pseudoreplicate). The critical difference (as proposed by M. Clemens) is whether or not the follow-up test should give, in expectation, exactly the same quantitative result as the original study. A technical replication re-analyses the same underlying material as the original study, whereas a biological replicate estimates parameters drawn from different samples. Following this definition, technical replicates do not introduce independence into the experimental system and can mainly be used to measure errors in sample handling, as the new findings should be quantitatively identical to the old results. In contrast, robustness tests represent true biological replicates, because independent raw materials (animals, cells, etc.) are used, and they therefore do not need to give the same results as obtained before. Only a robustness test can analyze whether a system behaves consistently when its variables or conditions are exchanged.

In the following experiment, cells from a common stock are split into two culture dishes and either left untreated (control) or stimulated with a growth factor of interest. The number of cells per dish is then used as the main readout to examine the effect of the treatment. The process of data acquisition will have a decisive impact on the quality and reliability of the final result. Here are different options for how to conduct this experiment:

  1. After a certain period of time, 3 different cover slides are prepared from each dish to calculate cell numbers, resulting in six different values (three per condition).

Sample size equals one

Although there were two culture dishes and six glass slides, the correct sample size here is n=1, as the variability among cell counts reflects technical errors only, and the three values for each treatment condition do not represent robustness tests (= biological replicates) but technical replicates.

  2. A slightly better approach is to perform the same experiment on three different days, counting the cells only once per condition each day.
This experiment indeed has an ‘n’ equal to three.


This approach gives the same number of final values (six); yet independence is introduced (in the form of time) by repeating the experiment on three separate occasions, resulting in a sample size of n = 3. Here, the two values from the same day should be analyzed as paired observations, and a paired-samples t-test could be used for statistical evaluation.

  3. To further increase confidence in the obtained results, the three single experiments should be performed as independently as possible, meaning that cell culture media should be prepared freshly for each experiment, and different frozen cell stocks, growth factor batches, etc. should be used.

It is reasonable to assume that most scientists who have performed in vitro cell-based assays will have considered and applied these precautions. But now we must ask ourselves: do those measurements actually count as real robustness tests? When working with cell-based assays, it is important to consider that, even if a new frozen cell stock was used for each replicate, all cells ultimately originated from the same starting material; therefore no true biological replicates can be achieved.

This problem can only be solved by generating several independent cell lines from several different human/animal tissue or blood samples, which demonstrates that reality often places constraints on what is statistically optimal.

The key questions, thus, are: ‘How feasible is it to obtain true biological replicates and to satisfy all statistical criteria?’ or ‘How much pseudoreplication is still acceptable?’

We all know that cost and time considerations, as well as the availability of biological sample material, are important; quite frequently these factors force scientists to make compromises regarding study design and statistical analysis. Nevertheless, as many medical advances are based on preclinical in vitro research, it is critical to conduct, analyze and report preclinical studies as rigorously as possible. As a minimum requirement, when reporting a study, the design of the experiment, the data collection and the statistical analysis should be described in sufficient detail, including a clear definition of the smallest experimental unit with respect to its independence. Scientists should also be open about the limitations of a research study, and it should be possible to publish a study as preliminary or exploratory (using ‘pseudo-confidence intervals’ instead of ‘true’ confidence intervals where over-interpretation of results must be avoided) or to combine results with others to obtain more informative data sets.

As mentioned above, even if samples are easy to get or inexpensive, it can be dangerous to inflate the sample size by simply increasing the number of technical replicates, which may lead to spurious statistical significance. Ultimately, only a higher number of true biological replicates will increase the power of the analysis and result in quality research.
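The effect of inflating n with technical replicates is easy to see with made-up cell counts: treating all cover slides as independent observations shrinks the apparent standard error, because the divisor becomes sqrt(9) instead of sqrt(3) while day-to-day variability dominates.

```python
import math
import statistics

def sem(values):
    """Standard error of the mean: sample SD / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical cell counts for one condition: three independent
# experiment days, three technical replicates (cover slides) per day.
days = [[102, 98, 100], [115, 118, 113], [91, 94, 95]]

# Correct unit of analysis: one mean per day -> n = 3 biological replicates.
day_means = [statistics.mean(d) for d in days]
correct_sem = sem(day_means)

# Pseudoreplication: pooling all nine slides pretends n = 9 independent
# observations and understates the standard error of the condition mean.
pooled = [x for d in days for x in d]
pseudo_sem = sem(pooled)

print(round(correct_sem, 2), round(pseudo_sem, 2))
```

With these invented numbers, the pseudoreplicated standard error comes out roughly half the size of the correct day-level one, which is exactly how spurious significance arises.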

In this context, and to gauge the extent of the problem, it would be quite informative to perform a detailed meta-analysis of published in vitro research studies to determine the ratio of biological to technical (and unknown) replicates underlying their scientific conclusions.

Decision-enabling studies: Robust enough to support critical decisions?

Decision-making is an essential function of any company and determines its long-term success. But what are the key factors that influence decision-making in the pharmaceutical industry on the way to new products and approved drugs?

To analyze the importance of Good Research Practice (GRP) standards, as well as of the quality and validity of data, for this decision process, we conducted an analysis of 12 drug discovery projects (from preclinical work up to clinical candidate selection) that were licensed over the past two years by three EU pharma companies. A total of 26 studies were identified as ‘critical’ (a consensus decision based on discussions with representatives of the licensee companies).

Post-licensing analysis of these ‘critical’ studies indicated that not every study had been designed in a way consistent with its role in decision making. Only around one third of all studies were properly blinded (see Figure), and only one quarter contained well-defined, pre-specified endpoints, which significantly reduce bias and false experimental outcomes compared with post-hoc or secondary endpoint analyses.

Based on this analysis, PAASP estimates that at least 30% of early-stage innovative drug discovery projects critically depend on data that do not meet minimum quality criteria.

It is quite possible that decisions to license projects are often based on factors other than the quality of the research data, such as timing, organizational and cultural influences, subjective and personal considerations, or political influences.

However, given the overall decline in drug R&D productivity (with poor preclinical data quality as a major contributing factor), decision-making during drug development, as an important organizational element, should not neglect the assessment of data quality and integrity. Instead, the question should be asked whether or not GRP standards were implemented in the decision-enabling studies. Structured and informed decisions will help avoid unnecessary terminations of drugs in Phase II/III development.

Building a nearshoring collaboration: A success story

Starting to work with someone you do not know always carries some risk, especially if that person sits outside your regular circle of collaborators. We often observe that pharma companies are much more open to collaborating with service providers in Far East countries than in Eastern Europe. There is clearly a major trend towards working with partners in China (and, to some extent, India), and the cost benefits appear obvious (or, perhaps better said, used to be obvious). There is also an established market of many hundreds, if not thousands, of companies in Asia offering a variety of services, so starting a collaboration is not much different from going to a shop and picking up a product that looks like what you need.

Building a collaboration with a lab in an Eastern (or Central) European country requires a different approach. The first and most critical step is to gain initial experience that can reveal the advantages of nearshoring (outsourcing to neighbours) and convince internal decision makers to commit to a larger project.

As an example of such a first step, we would like to refer to a project that we have supported. A major pharma company active in the fields of immunology and neuroscience was running a variety of internal drug discovery projects, each requiring the support of its anatomy and histology groups. With all of the company’s main research facilities located in a Western European country, it proved problematic to engage highly experienced and highly qualified internal research staff for large volumes of routine work (essentially preparing tissue for histological analysis). A further challenge was the reluctance of upper management to increase R&D headcount. And, not surprisingly, reducing histology support for the projects, another option to lighten the department’s workload, was fiercely resisted by project leaders. The histology department had clearly become a success-critical project bottleneck, when an interesting alternative presented itself.

Looking for solutions to eliminate this success-critical project bottleneck, we identified a lab in Poland that employed people with highly developed histology skills, had at least 50% of its resources unused (due to limited funding) and was highly motivated to work. Two people from this lab were sent to the pharma company’s research center for a one-month training program. Upon their return, the lab received the basic tools necessary to conduct the work, along with tissue samples to process. We helped both sides address all legal aspects and negotiate terms acceptable to each.

We refer to this collaboration as a success story because, within a fairly short period of time, the project began to grow beyond its originally intended focus on outsourcing routine work. Once the pharma company’s managers had gained this first experience, further nearshoring projects of increasing complexity and diversity followed.