Decomposition of mutational context signatures using quadratic programming methods

Methods for inferring signatures of mutational contexts from large cancer sequencing data sets are invaluable for biological research, but impractical for clinical application, where we require tools that decompose the context data for an individual into signatures. One such method has recently been published, using an iterative linear modelling approach. A natural alternative places the problem within a quadratic programming framework and is presented here, where it is seen to offer advantages of speed and accuracy.

Andy G. Lynch (corresponding author: andy.lynch@cruk.cam.ac.uk)

How to cite this article: Lynch AG. Decomposition of mutational context signatures using quadratic programming methods [version 1; referees: awaiting peer review]. F1000Research 2016, 5:1253 (doi: 10.12688/f1000research.8918.1)

Copyright: © 2016 Lynch AG. This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Data associated with the article are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).

Grant information: AGL was supported in this work by a Cancer Research UK programme grant [C14303/A20406] to Simon Tavaré. AGL acknowledges the support of the University of Cambridge, Cancer Research UK and Hutchison Whampoa Limited. Whole-genome sequencing of oesophageal adenocarcinoma was part of the oesophageal International Cancer Genome Consortium (ICGC) project. The oesophageal ICGC project was funded through a programme and infrastructure grant to Rebecca Fitzgerald as part of the OCCAMS collaboration. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: No competing interests were declared.
First published: 07 Jun 2016, 5:1253 (doi: 10.12688/f1000research.8918.1)


Introduction
The existence of context-specific DNA mutational signatures as a response to carcinogens has been known for some time (see e.g. Pfeifer et al. 1 ), but the last three years have seen progress in the bioinformatic inference of mutational signatures from large-scale cancer sequencing studies 2-4 such as TCGA (http://cancergenome.nih.gov/) and ICGC (http://icgc.org).
These methods of signature discovery, while important, do not translate to clinical application. First, they rely on a large corpus of samples for their efficacy, making it impractical to run them repeatedly for each new patient. Second, even with a large corpus, the results for one individual can in theory change depending on the identities of the other patients in the corpus, which is undesirable in practice. There is therefore great value in methods such as those recently presented by Rosenthal et al. 5 that can, for a single sample, break a vector of observed mutation counts into its constituent signature components.
In the Cancer Research UK-funded oesophageal adenocarcinoma ICGC project we have taken a similar view to Rosenthal et al. 5 for the decomposition of a single sample, but rather than decomposing mutational contexts into signatures by fitting iterative linear models (ILM), we have viewed the question as lying within the framework of quadratic programming (QP). By mutational contexts, we commonly mean the 96 trinucleotide contexts consisting of the 6 distinguishable mutations and the 16 combinations of the immediately preceding and following bases. More general definitions are possible 3 and can be accommodated in both the QP and ILM approaches, but we assume the standard 96 in what follows.
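As a concrete illustration (not part of the original article), the 96 standard contexts arise as the 6 distinguishable substitutions crossed with the 4 possible preceding and 4 possible following bases; a minimal sketch in Python:

```python
from itertools import product

# The six distinguishable substitutions (pyrimidine-reference convention).
substitutions = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
bases = ["A", "C", "G", "T"]

# One label per trinucleotide context, e.g. "A[C>T]G" for an A..G flank.
contexts = [f"{five}[{sub}]{three}"
            for sub, five, three in product(substitutions, bases, bases)]
```

The vector m of observed mutation counts is indexed by these 6 × 4 × 4 = 96 labels.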

Methods
In brief, we want to minimize the difference between the normalized observed vector of mutation contexts m (a 96 × 1 vector) and Sw (where S is a 96 × k matrix, each column of which represents the contributions of mutational contexts to one signature, k is the number of known mutational signatures, and w is a k × 1 vector of weights to be estimated). Our problem, then, is to:

minimize (m − Sw)ᵀ(m − Sw) subject to wᵢ ≥ 0 for all i, and Σᵢ wᵢ = 1,

which is the classical quadratic programming problem. This can be solved quickly (given the form of SᵀS) and easily using the core linear algebra functionality of R (version 3.2.4) 6 and the quadprog package (version 1.5-5) 7 , which implements the dual method of Goldfarb and Idnani 8,9 to find the solution. Practical details of the implementation can be found in the 'Data and Software Availability' section of this note.

In most circumstances, both the ILM and QP approaches work well. Illustrating them on an example from the OCCAMS consortium's whole-genome sequencing of oesophageal adenocarcinoma 10 , we see that the ILM and QP approaches are highly concordant (See Figure 1). The ILM approach has the advantages of familiarity of interpretation, and enforcement of parsimony should this be desired (while parsimony is generally desirable if building a predictor, if we are trying to model an underlying truth then it represents a strong assumption). More importantly, taking advantage of the linear modelling framework, it would be easy to generalize this approach to use other error models, or to include additional structure should one e.g. wish to simultaneously investigate several related samples.
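For illustration only: the article's implementation uses R's quadprog package, but the same constrained minimization can be sketched in Python with scipy's general-purpose SLSQP solver (the function name decompose_signatures is mine, not from the article):

```python
import numpy as np
from scipy.optimize import minimize

def decompose_signatures(m, S):
    """Weights w minimising (m - Sw)'(m - Sw) subject to
    w >= 0 and sum(w) = 1, via a general-purpose solver."""
    k = S.shape[1]
    return minimize(
        lambda w: float(np.sum((m - S @ w) ** 2)),
        np.full(k, 1.0 / k),                       # start from a flat mixture
        method="SLSQP",
        bounds=[(0.0, 1.0)] * k,                   # each weight in [0, 1]
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    ).x
```

A dedicated dual QP solver, as used in the article, exploits the positive-definite structure of SᵀS and will be faster; the sketch above only demonstrates the formulation.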

Results
The disadvantage of the ILM approach stems from its need to define a subset of signatures to include in the model. While the signature matrix is of full rank, with noise in the system it is sometimes possible to approximate an observed vector with several different linear combinations of signatures, and an ILM approach is not guaranteed to give consideration to the correct combination of signatures. Even when the correct solution is reached, the ILM approach can be substantially slower. It is not difficult to simulate a combination of signatures that takes thousands of iterations, and thousands of times longer to run, than the QP approach.
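A self-contained sketch of such a two-signature simulation follows. It is not from the article: the signature matrix here is a random stand-in for the published 96 × 27 set, and scipy's nonnegative least squares (with the fitted weights renormalised afterwards) stands in for the full QP solve with its sum-to-one constraint.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(1)

# Random stand-in for the 96 x 27 signature matrix,
# columns normalised to sum to one.
S = rng.random((96, 27))
S /= S.sum(axis=0)

# Equal mixture of two signatures, observed through Poisson counting noise
# on roughly 2000 mutations.
w_true = np.zeros(27)
w_true[[3, 11]] = 0.5
counts = rng.poisson(2000 * S @ w_true)
m = counts / counts.sum()

# Nonnegative least squares as a simple stand-in for the QP solve;
# renormalise the fitted weights to sum to one.
w_hat, _ = nnls(S, m)
w = w_hat / w_hat.sum()
```

Repeating this over every pair of signatures, and comparing the inferred w with w_true, reproduces the style of comparison described in this section.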
If one simulates a flat combination of all available signatures, then the ILM approach performs worse than the QP approach. A fairer comparison is to consider all equal combinations of just two signatures (with noise added). Of the 351 possible such combinations using the Nature 2013 signature set 2,5 , the majority are well inferred using both the ILM and QP approaches, while one (the combination of signatures 1B and 3) performs poorly for both methods. Aside from these, there is a definite set of combinations for which the ILM approach performs markedly worse than the QP approach (See Figure 2). Pairs involving signature 1B, or signature 5, appear to cause the most problems. It is not the case that the problematic pairs are themselves highly correlated, but the 1B and U2 signatures are, possibly explaining the outlying nature of the U2-R2 pair. This exercise took approximately 5 seconds using the QP approach, and approximately 15 minutes using the ILM approach (on a well-specified desktop).

Supplementary material
The R Markdown document (Dataset 1) compiled into a PDF file.

Conclusion
Since the QP approach makes use of well-established core R code in a classical mathematical setting, no new software is required to use it (see Data and software availability and Supplementary material for details of the implementation). The speed and improved performance of the QP approach make it an attractive alternative to the ILM method, complementing the additional functionality of the deconstructSigs package 5 .

Data and Software Availability
F1000Research. Dataset 1: An R Markdown document that, when compiled, will reproduce all the results presented. doi: 10.5256/f1000research.8918.d124181 13 .
The raw oesophageal adenocarcinoma data for library SS6003314, from which some of these counts are derived, are available from the European Genome-phenome Archive (EGA; accession EGAD00001000704).

Competing interests
No competing interests were declared.

Grant information
AGL was supported in this work by a Cancer Research UK programme grant [C14303/A20406] to Simon Tavaré. AGL acknowledges the support of the University of Cambridge, Cancer Research UK and Hutchison Whampoa Limited. Whole-genome sequencing of oesophageal adenocarcinoma was part of the oesophageal International Cancer Genome Consortium (ICGC) project. The oesophageal ICGC project was funded through a programme and infrastructure grant to Rebecca Fitzgerald as part of the OCCAMS collaboration. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.