Assessment of pharmacogenomic agreement

In 2013 we published an analysis demonstrating that drug response data and gene-drug associations reported in two independent large-scale pharmacogenomic screens, Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE), were inconsistent. The GDSC and CCLE investigators recently reported that their respective studies exhibit reasonable agreement and yield similar molecular predictors of drug response, seemingly contradicting our previous findings. Reanalyzing the authors’ published methods and results, we found that their analysis failed to account for variability in the genomic data and more importantly compared different drug sensitivity measures from each study, which substantially deviate from our more stringent consistency assessment. Our comparison of the most updated genomic and pharmacological data from the GDSC and CCLE confirms our published findings that the measures of drug response reported by these two groups are not consistent. We believe that a principled approach to assess the reproducibility of drug sensitivity predictors is necessary before envisioning their translation into clinical settings.

Pharmacogenomic studies correlate genomic profiles and sensitivity to drug exposure in a collection of samples to identify molecular predictors of drug response. The success of validation of such predictors depends on the level of noise both in the pharmacological and genomic data. The groundbreaking release of the Genomics of Drug Sensitivity in Cancer 1 (GDSC) and Cancer Cell Line Encyclopedia 2 (CCLE) datasets enables the assessment of pharmacogenomic data consistency, a necessary requirement for developing robust drug sensitivity predictors. Below we briefly describe the fundamental analytical differences between our initial comparative study 3 and the recent assessment of pharmacogenomic agreement published by the GDSC and CCLE investigators 4 .
Which pharmacological drug response data should one use?
The first GDSC and CCLE studies were published in 2012 and the investigators of both studies have continued to generate data and to release them publicly. One would imagine that any comparative study would use the most current versions of the data. However, the authors of the reanalysis used an old release of the GDSC (July 2012) and CCLE (February 2012) pharmacological data, resulting in the use of outdated IC50 values, as well as missing approximately 400 new drug sensitivity measurements for the 15 drugs screened both in GDSC and CCLE. Assessing data that are three years old and which have been replaced by the very same authors with more recent data seems to be a substantial missed opportunity. It raises the question as to whether the current data would be considered to be in agreement and which data should be used for further analysis.

Comparison of drug sensitivity predictors
Given the complexity and high dimensionality of pharmacogenomic data, the development of drug sensitivity predictors is prone to overfitting and requires careful validation. In this context, one would expect the most significant predictors derived in GDSC to accurately predict drug response in CCLE and vice versa. This will be the case if both studies independently produce consistent measures of both genomic profiles and drug response for each cell line. In our comparative study 3 , we made a direct comparison of the same measurements generated independently in both studies by taking into account the noise in both the genomic and pharmacological data (Figure 1a). By investigating the authors' code and methods, we identified key shortcomings in their analysis protocol, which have contributed to the authors' assertion of consistency between drug sensitivity predictors derived from GDSC and CCLE.
For their ANOVA analyses, the authors used drug activity area (1-AUC) values independently generated in GDSC and CCLE, but used the same GDSC mutation data across the two different datasets (Figure 1b; see Methods). By using the same mutation calls for both GDSC and CCLE, the authors have disregarded the noise in the molecular profiles, while creating an information leak between the two studies. For their ElasticNet analysis, the authors followed a similar design by reusing the CCLE genomic data across the two datasets, but comparing different drug sensitivity measures that are IC50 in GDSC vs. AUC in CCLE (Figure 1c; see Methods).
We are puzzled by the seemingly arbitrary choices of analytical design made by the authors, which raise the question as to whether the use of different genomic data and drug sensitivity measures would yield the same level of agreement. Moreover, by ignoring the (inevitable) noise and biological variation in the genomic data, the authors' analyses are likely to yield over-optimistic estimates of data consistency, as opposed to our more stringent analysis design 3 .

What constitutes agreement?
In examining correlation, there is no universally accepted standard for what constitutes agreement. However, the FDA/MAQC consortium guidelines define good correlation for inter-laboratory reproducibility 5-8 to be ≥0.8. The authors of the present study used two measures of correlation, Pearson correlation (ρ) and Cohen's kappa (κ) coefficients, but never clearly defined a priori thresholds for consistency, instead referring to ρ>0.5 as "reasonable consistency" in their discussion. Of the 15 drugs that were compared, their analysis found only two (13%) with ρ>0.6 for AUC and three (20%) above that threshold for IC50. This raises the question whether ρ~0.5-0.6 for one third of the compared drugs should be considered as "good agreement." If one applies the FDA/MAQC criterion, only one drug (nilotinib) passes the threshold for consistency.
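To make the role of the cutoff concrete, the following Python sketch computes the two agreement measures named above (Pearson's ρ for continuous sensitivity values, Cohen's κ for categorical sensitive/resistant calls) and counts how many drugs reach a chosen threshold. It is an illustration only, not the code used in either analysis; the function names, drug labels and numeric values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def cohen_kappa(calls_a, calls_b):
    """Cohen's kappa for two sets of categorical calls on the same items."""
    a, b = list(calls_a), list(calls_b)
    n = len(a)
    if n != len(b) or n == 0:
        raise ValueError("need two non-empty sequences of equal length")
    # Observed agreement: fraction of items given the same call by both studies.
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    labels = set(a) | set(b)
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one label is ever used")
    return (observed - expected) / (1.0 - expected)


def count_passing(per_drug_values, threshold):
    """Number of drugs whose agreement statistic is >= threshold."""
    return sum(1 for v in per_drug_values.values() if v >= threshold)


# Hypothetical per-drug correlations, for illustration only.
example_rho = {"drug_A": 0.85, "drug_B": 0.55, "drug_C": 0.30}
passing_strict = count_passing(example_rho, 0.8)   # stricter cutoff
passing_lenient = count_passing(example_rho, 0.5)  # more lenient cutoff
```

The same set of correlations yields a different count of "consistent" drugs depending on the cutoff, which is why the threshold needs to be fixed before the data are examined.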
Similarly, the authors referred to the results of their new Waterfall analysis as reflective of "high consistency," even though only 40% of drugs had a κ≥0.4, with five drugs yielding moderate agreement and only one drug (lapatinib) yielding substantial agreement according to the accepted standards 9 . Based on these results, the authors concluded that 67% of the evaluable compounds showed reasonable pharmacological agreement, which is misleading as only 8/15 (53%) and 6/15 (40%) drugs yielded ρ>0.5 for IC50 and AUC, respectively. Taking the union of consistency tests is bad practice; adding more sensitivity measures (even at random) would ultimately bring the union to 100% without providing objective evidence of actual data agreement.
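The arithmetic behind this objection can be sketched in a few lines of Python. Under the simplifying assumption (made here purely for illustration) that each sensitivity measure independently clears the threshold for a given drug with probability p, the chance that at least one of k measures passes is 1 - (1 - p)^k, which approaches 1 as k grows. The function names and the drug labels are hypothetical.

```python
def union_pass_probability(p, k):
    """Probability that at least one of k independent tests passes,
    when each passes with probability p (illustrative assumption)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    return 1.0 - (1.0 - p) ** k


def union_fraction(passing_sets, n_drugs):
    """Fraction of drugs that pass at least one of several tests.

    passing_sets: iterable of sets, one per test, holding the drugs that passed.
    """
    if n_drugs <= 0:
        raise ValueError("n_drugs must be positive")
    passed_any = set().union(*passing_sets)
    return len(passed_any) / n_drugs


# The union can only grow as tests are added, whatever the tests measure.
growth = [union_pass_probability(0.4, k) for k in (1, 2, 5, 10)]
```

Because the union is monotone in the number of tests, a high union percentage says more about how many measures were tried than about how well the data agree.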

Consistency in pharmacological data
The authors acknowledged that the consistency of pharmacological data is not perfect due to the methodological differences between protocols used by CCLE and GDSC, further stating that standardization will certainly improve correlation metrics. To test this important assertion, the authors could have analyzed the replicated experiments performed by the GDSC using identical protocols to screen camptothecin and AZD6482 against the same panel of cell lines at the Wellcome Trust Sanger Institute and the Massachusetts General Hospital.
Our re-analyses 3,10 of drug sensitivity data from these drugs found a correlation between GDSC sites on par with the correlations observed between GDSC and CCLE (ρ=0.57 and 0.39 for camptothecin and AZD6482, respectively; Figure 2 a,b). These results suggest that intrinsic technical and biological noise of pharmacological assays is likely to play a major role in the lack of reproducibility observed in high-throughput pharmacogenomic studies, which cannot be attributed solely to the use of different experimental protocols.

Consistency in genomic data
In their comparative study, the authors did not assess the consistency of genomic data between GDSC and CCLE 4 . Consistency of gene copy number and expression data was significantly higher than for drug sensitivity data (one-sided Wilcoxon rank sum test p-value=3×10⁻⁵; Figure 3), while mutation data exhibited poor consistency as reported previously 11 . The very high consistency of copy number data is quite remarkable (Figure 3a) and could be partly attributed to the fact that CCLE investigators used their SNP array data to compare cell line fingerprints with those of the GDSC project prior to publication and removed the discordant cases from their dataset 2 .
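For readers unfamiliar with the test named above, the following Python sketch shows a one-sided Wilcoxon rank sum (Mann-Whitney) test using the normal approximation with a continuity correction and no tie correction. It is a simplified illustration, not the implementation behind the reported p-value; the function name and the input values are hypothetical.

```python
import math


def rank_sum_greater(x, y):
    """One-sided rank sum test of H1: values in x tend to exceed values in y.

    Returns (U, p) where U counts pairs with x > y (ties count 0.5) and p is
    the upper-tail p-value from the normal approximation (no tie correction).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    # Continuity correction for the upper tail.
    z = (u - mean_u - 0.5) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return u, p


# Hypothetical per-feature correlations: genomic data vs. drug sensitivity data.
genomic = [0.90, 0.80, 0.95, 0.85, 0.92, 0.88, 0.91, 0.87, 0.93, 0.89]
pharmacological = [0.30, 0.45, 0.50, 0.20, 0.55, 0.40, 0.35, 0.60, 0.25, 0.10]
u_stat, p_value = rank_sum_greater(genomic, pharmacological)
```

For small samples or data with many ties, an exact or tie-corrected test from a statistics library would be preferable to this approximation.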

Conclusions
We agree with the authors that their and our observations "[…] raise important questions for the field about how best to perform comparisons of large-scale data sets, evaluate the robustness of such studies, and interpret their analytical outputs." We believe that a principled approach using objective measures of consistency and an appropriate analysis strategy for assessing the independent datasets is essential. An investigation of both the methods described in the manuscript and the software code used by the authors to perform their analysis 4 identified fundamental differences in analysis design compared to our previously published study 3 . By taking into account variations in both the pharmacological and genomic data, our assessment of pharmacogenomic agreement is more stringent and closer to the translation of drug sensitivity predictors in preclinical and clinical settings, where zero-noise genomic information cannot be expected.
Our stringent re-analysis of the most updated data from the GDSC and CCLE confirms our 2013 finding that the measures of drug response reported by these two groups are not consistent and have not improved substantially as the groups have continued generating data since 2012 10 . While the authors make arguments suggesting consistency, it is difficult to imagine using these post hoc methods to drive discovery or precision medicine applications.
The observed inconsistency between early microarray gene expression studies served as a rallying cry for the field, leading to an improvement and standardization of experimental and analytical protocols, resulting in the agreement we see between studies published today. We are looking forward to the establishment of new standards for large-scale pharmacogenomic studies to realize the full potential of these valuable data for precision medicine.

Methods
The authors' software source code.

Statistical analysis
All analyses were performed using the most updated version of the GDSC and CCLE pharmacogenomic data based on our PharmacoGx package 12 (version 1.1.4).

Research replicability

The paper highlights the curious lack of rigorous standards for what constitutes 'agreement' or 'consistency' between genomic studies, or more generally, the fundamental issues of 'validation' and 'reproducibility'. The problem is even more serious for results based on high-throughput omics data, as the potential for false positives is substantial. The persistent lack of consensus or standards may partly indicate that these issues are not so straightforward.
The main problem is that when we say we 'validate' a result, this can be done at different strengths. For example, consider the commonly performed method in statistical analyses, the so-called 'cross-validation', where we split our total sample into training and validation sets. If the split is done randomly, then we have only a 'soft validation', since it applies to the same sample (or same lab, same population, same measurement method, etc.), so the 'validation' is internal and corresponds to statistical significance only. In contrast, a scientist may wish for something stronger, an external validation, for example for the 'biological truth' to apply to other populations; thus, one study may be performed in a European population, but the external validation is done in an Asian population. The latter is a stronger validation than the random-split validation, giving a more compelling and general biological story.
What is relevant here is that both validations are commonly done in practice, and both are valid, but they carry different levels of information. I think what matters in practice is that the implication of the validation should always be clear (or clarified), so that the user of the information can judge its relevance.
The key point of Safikhani et al. is that their 2013 validation study of the genomic predictors of drug sensitivity was more stringent than the 2015 validation studies by the GDSC and CCLE investigators. This is clearly highlighted in Figure 1, where the latter used the same molecular data, so the 'validation' is only of the pharmacological data and perhaps (not clear to me) the method of analyses. Which level of validation is more relevant here? Let us imagine how the results (e.g. the genomic predictors) are to be used in patients. The molecular data are likely to be generated and analyzed in a diversity of labs, so the genomic predictors should really be robust to the actual heterogeneity in the molecular data. The results (the genomic predictors) may not survive such stringent requirements, but that is what we need to know. So, overall, I agree with Safikhani et al. that a more stringent validation allowing for variability in both molecular and pharmacological data is more relevant in this context of drug prediction.
(However, reading Haibe-Kains et al., there seemed to be an emphasis that the failure of agreement was due to the high variability in the pharmacological data. So it is possible that the later studies by the GDSC-CCLE investigators responded to this concern only.)
Regarding specific issues in the paper: I do not consider the use of most recent data as a key issue.

I believe that the design, methods and analysis of results are appropriate for the topic being studied, and that for the most part, they were clearly explained. A couple of perceived shortcomings are itemized here.
p.3, column 2, line 2. The "but" would be better replaced by "and".
p.5, Figure 2. The dotted and solid diagonal lines on these plots are not identified in either the caption or the text.
p.5, Figure 3. It is nowhere explained whose Pearson correlations (PCC) are summarized in these box plots. I suppose that some number (to be stated) of cell lines were profiled in both GDSC and CCLE, and that in all cases, the PCC in the box plots are calculated from molecular data from pairs consisting of the data on the same cell line generated in GDSC and in CCLE. A clear statement along these lines would be helpful.
p.6, column 1, lines 1-4. This assertion would have more force if the authors told the reader how many cell lines could have contributed PCC to the box plot of Figure 3a, and how many did do so.
Further, I do believe that the conclusions are sensible, balanced and justified on the basis of the results of the study.
Finally, I understand that all the data used in this study is available, and this is also true for the code used to generate all the results and figures.

There is a lot to take in and digest in the manuscript. I break this story into three parts:
In 2012, both GDSC and CCLE released/published drug sensitivity data (both pharmacological and genomic). In 2013, the authors compared the two studies using the drugs in common between the two. Their analysis was carried out in a direct fashion that accounted for variations in both genomic and pharmacological data from the same site (GDSC or CCLE) and found that the results of the two did not agree.
Recently, GDSC/CCLE did an independent analysis and demonstrated that the agreement between the two is actually higher (using ANOVA) than what the authors reported. They concluded that the results between GDSC and CCLE were consistent. However, the comparison focused only on the pharmacological data because the genomic data used actually came from one site. That means their analysis did not include the noise introduced by both sites in this comparison.
The authors, again, reanalyzed the data by including pharmacological and genomic data from both sites, and the conclusions remain the same as they reported in 2013. I have no problem with their analysis and support their conclusions. With that said, I did find the paper could flow better by moving two sections into the Discussion. These are:
"Which pharmacological drug response data should one use?" - It seems odd and smells bad that GDSC/CCLE used the data published in 2012 and totally ignored the most current data in their analysis. This could be due to many different reasons. Thus, speculation is not necessarily to be considered as "results". I would say this will be better justified as "discussion".
"What constitutes agreement?" - Again, this is a difficult call. I believe there is no single baseline that can be used to justify consistency. Thus, most text in this section will sit better in "discussion".
Overall, I support its indexation with revision by focusing on the flow of the story and the structure of the manuscript.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Competing Interests: No competing interests were disclosed.