Supervised topic modeling for predicting molecular substructure from mass spectrometry [version 1; peer review: 2 approved with reservations]

Small-molecule metabolites are principal actors in myriad phenomena across biochemistry and serve as an important source of biomarkers and drug candidates. Given a sample of unknown composition, identifying the metabolites present is difficult given the large number of small molecules both known and yet to be discovered. Even for biofluids such as human blood, building reliable ways of identifying biomarkers is challenging. A workhorse method for characterizing individual molecules in such untargeted metabolomics studies is tandem mass spectrometry (MS/MS). MS/MS spectra provide rich information about chemical composition. However, structural characterization from spectra corresponding to unknown molecules remains a bottleneck in metabolomics. Current methods often rely on matching to pre-existing databases in one form or another.  Here we develop a preprocessing scheme and supervised topic modeling approach to identify modular groups of spectrum fragments and neutral losses corresponding to chemical substructures using labeled latent Dirichlet allocation (LLDA) to map spectrum features to known chemical structures. These structures appear in new unknown spectra and can be predicted. We find that LLDA is an interpretable and reliable method for structure prediction from MS/MS spectra. Specifically, the LLDA approach has the following advantages: (a) molecular topics are interpretable; (b) A practitioner can select any set of chemical structure labels relevant to their problem; (c ) LLDA performs well and can exceed the performance of other methods in predicting substructures in novel contexts. Open Peer Review


Introduction
Liquid chromatography -tandem mass spectrometry (LC-MS/ MS) is a powerful experimental method for identifying the small molecule metabolites in a sample of unknown composition. It provides detailed structural information from a given molecule with the only prerequisite knowledge being the parent molecule's mass-to-charge ratio. This is especially important since a vast portion of naturally occurring small molecules are believed to remain unidentified 1 . Yet identifying the structure of a molecule given its MS/MS remains challenging 2 . Traditionally, such identification is done by hand, which is difficult, time-consuming, and poses significant reproducibility issues. The repetitive nature of this task naturally lends itself well to a computational approach 3 .
Existing tools generally follow one of two trends. In spectral library matching, a query MS/MS spectrum is compared to reference spectra via similarity metric such as a cosine score 4 . For example, the National Institute of Standards and Technology (NIST) Mass Spectral Library, Human Metabolome Database (HMDB), Metlin, and the Global Natural Products Social Molecular Networking (GNPS) databases all offer spectral similarity search functionalities using spectral similarity scores [5][6][7][8] . However, spectral databases tend to be sparse compared to the universe of possible molecular structures 3 . Simulated spectra can supplement databases with new structures, though matching is usually done on entire spectra [9][10][11] . Molecular fingerprint approaches in which vector representations of molecules are predicted directly from MS/MS spectra are increasingly popular methods 12- 16 .
This work focuses on the specific fingerprint task of substructure prediction in MS/MS spectra ( Figure 1A). Chemical substructures are defined structural subunits that appear across . Spectrum fragments and neutral losses provide information relevant to identifying chemical structure. The goal of the supervised topic modeling approach is to assign interpretable substructure scores for a spectrum via LLDA 17 (b) Supervised learning for substructure prediction. A training set is composed of MS/MS spectra with known chemical structures and a choice of substructure (topic) labels. Each spectrum is labeled with its parent molecule's substructures. (c) The LLDA generative model. The model is defined in 17 and labeled here as it relates to MS/MS substructure prediction. The model is trained on a corpus D of MS/MS spectra and set of molecular substructures K. Given the fragment/neutral loss (feature) composition W d,n of each spectrum and the substructure-spectrum label matrix Λ, the model will find the substructure-feature probability matrix β, the spectrum-substructure probability matrix θ, and the feature-substructure assignment matrix Z. α and η are Dirichlet parameters for the distributions from which θ and β are respectively assumed to have been drawn. Φ is a Bernoulli parameter for the distribution from which Λ is assumed to have been drawn. is a model. Figure (c) has been reproduced and modified with labels with permission from 17. different molecules and are useful for identifying their parent molecules 13,18 . Focusing on substructures benefits practitioners because substructures have directly interpretable chemical meaning. Further, predicting chemical substructures rather than relying on spectral library matching allows for novel predictions of molecules not seen in these libraries. An existing approach, MS2LDA, uses topic modeling to find regularly co-occurring sets of MS/MS spectrum fragments and neutral losses 19 . Here we build on this work by proposing and developing a supervised topic modeling approach, using labeled latent Dirichlet allocation (LLDA) 17 , to find co-occurring spectrum features associated with chemical substructures of the user's choosing ( Figure 1B-C). We find that compared to alternative methods for supervised substructure identification, LLDA performs well and provides interpretable substructure predictions. Using cosine distance k-nearest neighbors (k-NN) for spectral library matching, we find that LLDA's relative performance improves as the test set becomes more chemically distinct from the training set and as the substructures being predicted appear with different frequencies between the two sets. This type of generalization is important for applications where the molecular identity of samples is unknown and thought to correspond to novel molecules.

Data sets
In order to facilitate direct comparison to an existing tool, we train and test LLDA using the same data and evaluation metrics as those used in the metabolite substructure auto-recommender (MESSAR) developed in 20. We utilize the MESSAR target training corpus of 3,146 positive mode LC-MS/MS spectra (available here) originally from the Global Natural Products Social Molecular Networking database 8 . For validation, we use the 5,164 MASSBANK 21 validation spectra and the 185 annotated test spectra as our test set both used in 20 and available here. This test dataset contains spectra for 34 drugs and 126 metabolites from MASSBANK and 25 spectra from the Critical Assessment of Small Molecule Identification 2017 contest. The 712 unique substructures provided in the rule database in 20 (available here) are used. All spectra are provided in .mgf format. Copies of the spectra and a cleaned version of the unique substructure set are included in this publication's Underlying data 22 .

Data preprocessing
To model a mass spectrum using LLDA, it is necessary to represent a mass spectrum as a bag-of-words "document" 23 . First, any fragment having a mass-to-charge-ratio (m/z) below 30 is discarded to remove structurally uninformative fragments. Each fragment in the remaining spectrum is then assigned the molecular formula with the closest theoretical m/z and that satisfies two conditions (i) the formula's theoretical m/z is within 0.1 of the peak's m/z and (ii) the formula is a subformula of the parent spectrum's parent molecular formula ( Figure 2A). We assume all MS/MS spectra have a corresponding molecular formula, a reasonable assumption for a spectrum even if its parent chemical structure is unknown 24 . This formula matching is done using the rcdk library (version 3.5.0). Peaks with no formula match are discarded. Next, all possible neutral losses are computed as pairwise differences between kept spectrum fragments. A neutral loss consists of a m/z difference between two fragments, f a and f b . Neutral losses in which f a has both a greater intensity and m/z than f b are assigned the mean of its two respective peak intensities and matched to formulas in the same manner as fragments. Neutral losses with no matching formulas are discarded. The word count Y for a given spectrum feature (fragment or neutral loss) with intensity x is calculated using a generalized logistic function 25 : For representing a mass spectrum as a document, each spectrum fragment corresponding to a peak is first mapped to a molecular formula by finding the formula that matches the following criteria: (i) the theoretical mass-to-charge ratio of the formula is within 0.1 m/z of the observed fragment. (ii) the formula is a subformula of the spectrum's parent formula. The closest such m/z formula is kept, the same process is repeated for neutral losses, and spectrum features with no formula matches are discarded. where Q determines the threshold value at intensity x = 0 and B determines the growth rate of the word count growth rate as the intensity increases. Input intensities of spectrum fragments are normalized such that the maximum raw intensity is 100, with word counts rounded to the nearest integer. An example intensity response function for intensity values in the range [0, 100] is shown in Figure 2B.
Using the data and code provided, this document generation process including formula mapping and bag-of-words conversion can be run using the command: python make_documents.py \ --in_mgf <in_mgf_file> \ --out_dir <out_documents_dir> \ --eval_peak_script evaluate_peak.R \ --n_jobs 1 \ --adduct_element H \ --adduct_element 1 We note that this process can be quite time consuming if run on a single core -preprocessing a single spectrum using a single core can take more than two minutes depending on the spectrum. As such, the preprocessed spectra for the specified data sets have been provided (see Underlying data 22 ). Runtime can be decreased on a multi-core system by increasing the n_jobs input parameter.
Substructure labels for training spectra having known parent chemical structures are assigned using the RDKit library (version 2020.09.5). Training spectra are labeled with the substructures in the user-defined set that are present in the given spectrum's parent structure. Substructure labeling can be run using the provided data and code using the command: Training LLDA and predicting substructures in a new spectrum The LLDA model is implemented using the Python Tomotopy library (version 0.10.2) 26 . The model takes a set of training spectra, test spectra, and substructures as input. The spectra must be in bag-of-words format, and the training spectra must have been labeled with ground truth substructures as described above. The original LLDA model is described in full in 17. We note that every component of LLDA for modeling text documents has an analog useful for modeling a MS/MS spectrum. A document is an MS/MS spectrum, words are spectrum features (fragments and neutral losses), topics are commonly co-occurring spectrum features, and tags are chemical substructures ( Figure 1). LLDA was trained for 2,000 iterations, since model perplexity tended to converge at around 2,000 iterations across various data sets generated using the preprocessing scheme. We note that the number of necessary iterations may differ across data sets. In addition to including a required input argument for the number of iterations in the LLDA model, we also include an optional flag to record and plot model perplexity.
Substructure predictions in test spectra are calculated using cosine similarity. The cosine similarity between a new spectrum i and substructure j is calculated as: Here v k is the word distribution for topic k and v d is the word count in document d that appears in the training corpus. Both vectors inherit their lengths from the number of words present across all documents. In this manner, every substructure is assigned a score for presence/absence in a test spectrum. After preprocessing, the LLDA model can be trained and tested using the commands

K-nearest neighbors
For comparison to the practice of spectral library matching, we also implement a k-nearest neighbors method for substructure prediction. Spectra are represented as vectors by assigning fragments to m/z bins of width 0.1. Only fragments with mass-to-charge ratio in the range [0, 1000] are kept; the vectors representing the spectra are thus of length 10,000. To make a prediction for substructures associated with a new spectrum using the k-nearest neighbors algorithm, the k nearest neighbors in the training set of spectra are computed using the cosine similarity between the vectors corresponding to spectra in the training set. The score for each substructure in the new spectrum is calculated as the mean of the substructure presence and absence labels in the k nearest neighbors' spectra. For example, for a single substructure s and test spectrum d, if 4 of the 5 nearest neighbor spectra contain s, the predicted score for s in d will be 4/5 = 0.8. After data preprocessing, k-NN can be run using the command: python run_knn.py \ --df_ids <df_ids_filename.tsv> --df_labels <df_labels_filename.tsv> \ --k 10 \ --embedded_spectra <embedded_spectra_fileName.npz> \ --out_dir <knn_out_directory> Further code documentation including required libraries and optional arguments can be found in the README file included in this publication's Underlying data 22 . A Docker image containing all necessary software and library requirements is also available at Docker Hub (see Extended data).

Results and discussion
Preprocessing hyperparameter search As described in the preprocessing pipeline, the generalized logistic function (GLF) is used to convert normalized feature intensity values into word counts to represent spectra as documents. To evaluate the hyperparameter choices in the GLF, LLDA was evaluated using a range of Q and B values corresponding to a wide range of intensity response functions. This performance was measured using the same metric as in 20. Specifically, the metric, denoted as t3≥2, measures the number of spectra in which at least two of the top-3 scoring substructures are true positives. This is similar to the standard metric of recall used in recommender systems 27 . In this experiment, LLDA predicted substructures most sensitively for Q and B values corresponding to binarized word counts. In other words, for a given document, words that appear in the given spectrum receive a word count of 1 and all other words receive a word count of 0; this corresponds to Q = 10.0 and B = 0.001 ( Figure 3A). This finding deserves further research and means that in this formulation, intensity information may harm model performance. This could be for a number of reasons. One possibility is that the GLF formulated in Equation 1 is not an optimal choice of function to map intensities to word counts. There are many options of function for this conversion, the GLF was chosen because of its flexibility to produce a diverse set of intensity-word count relationships. Another possibility is that neutral loss intensity encoding must be revisited. There may be better methods of representing a neutral loss's intensity other than taking the mean of its two constituent fragments. Additionally, valuable information might potentially be destroyed when raw ion counts are converted to normalized intensities as is standard practice 14,18 .  Table 1. However, it is unclear whether it is possible to extract an analogue to 'topics' from the k-nearest neighbors algorithm, as is common in LLDA and topic models. Further research is needed to develop methods that perform as well as k-nearest neighbors while retaining the interpretability and modularity of topic modeling approaches to substructure identification, such as LLDA.

Substructure identification benchmark
As the output of machine learning methods for mass spectrometry data is typically inspected by a human in an experimental procedure, developing interpretable methods remains important. We note that the same train spectra, test spectra, and evaluation metrics as used in 20 were used in this work. This was done to maximize the comparability between methods. Future development of community-wide standards for benchmark datasets and evaluation metrics will greatly facilitate development of new methods.

Ablation study
To study how well k-nearest neighbors performs compared to LLDA on novel test molecules poorly represented in the training set, increasingly difficult subsets of the test data were constructed using two notions of how data might be limited in a real-world experimental setting (i) chemical similarity between training and test molecules and (ii) differential train-test appearance for substructures.
For chemical similarity, the RDKit fingerprint similarity was calculated between each test set of spectrum parent structure and all structures in the training set. These similarity scores were then used to set similarity thresholds for selecting increasingly difficult subsets of the test data. For example, a maximum similarity threshold of 0.4 produced a subset of the test spectra in which the maximum pairwise structure fingerprint similarity to the train structures is at most 0.4. For substructure appearance, the number of times a substructure appears in the training set is counted and normalized according to the number of spectra. The same is done for the test set. Next, substructures are ordered by these differences and percentile cutoffs are used to produce subsets of test spectra of increasing difficulty in terms of test -train appearance. For example, a 60th percentile cutoff means testing only substructures whose normalized test minus train appearance values are above the 60th percentile of all such values across all substructures. The performance of the LLDA model compared to the k-NN model was calculated as the LLDA area under the receiver operating characteristic curve (AUC) averaged across the given test set minus the average AUC for the k-NN model.
The results of this study are shown in Figure 3B. Increasing difficulty along either axis of chemical similarity or substructure overlap improved the performance of the LLDA model relative to the k-NN model. This effect was especially pronounced as test-train molecular similarity became more distant. These results indicate use cases in which an approach such as LLDA may be especially useful compared to spectral library searching. LLDA may be able to better recognize novel chemical configurations of substructures in new spectra. As such it may be a better-suited model for characterizing spectra measuring molecules coming from new and understudied areas of chemical space not well represented in spectral libraries.

Conclusions
Metabolites are central in biology, yet the vast majority are likely unidentified 28 . Untargeted metabolomics via LC-MS/ MS is a promising option for identifying new metabolites in a high throughput pipeline. Improved computational methods for identifying chemical structure from MS/MS spectra are needed for this promise to become a reality. With this in mind, we developed, described, and open-sourced a supervised topic modeling method for identifying chemical substructures in tandem mass spectrometry data via LLDA 17 . In a series of empirical studies, this supervised topic model was trained and tested on publicly available benchmark data and substructures, and LLDA was compared to an alternative method, MESSAR 20 . A k-nearest neighbors (k-NN) was also implemented as a means of testing spectral library matching to predict substructure labels based on neighbor averages.
We report several benefits of the LLDA method. First, when trained and tested using the same spectra, LLDA provides interpretable, probabilistic substructure topics and performs well using the same metrics as in 20. These topics can incorporate a large number of spectrum fragments and neutral losses, so the patterns of spectrum fragments and neutral losses associated with substructures can be as complex as necessary for good predictive performance. LLDA is a probabilistic model that can compensate for ambiguity, redundancy, and other noise that arises from computing substructure labels. The advantage of such a probabilistic method is that substructure labels often have significant overlap 12 . Finally, the LLDA model offers a flexible supervised framework. A practitioner may choose any set of substructure labels on which to train the LLDA model, allowing the user to tailor the output of the model to a specific application requiring accurate and interpretable substructure identification. LLDA offers a supervised topic modeling approach that complements both the benefits of MS2LDA 19 , circumventing the need to pick an arbitrary number of unsupervised topics or to map output topics to substructures -this relationship is predetermined by the user.
By systematically exploring the preprocessing pipeline used to map spectrum features to a representation of spectra as documents, we find that this LLDA model performs best when intensity information is hidden from the model and binary word counts are used to represent MS/MS spectra ( Figure 3A). This aligns with existing work in probabilistic topic models used in recommender systems, such as collaborative topic Poisson factorization, where binarized ratings lead to improved predictive performance 29 . However, we also note that this effect may arise from the choice of train, test, and validation sets that were taken from publicly available spectra with potential heterogeneity in collision energies. We further note the possibility that a different function for translating intensities to word counts, especially for neutral losses, may result in better performance. Neutral losses may be more robust against systemic uniform measurement error in a spectrum and as such could represent more stable sources of information. However, the manner in which intensities are assigned to neutral losses and which neutral losses are kept will heavily affect model performance.
The k-nearest neighbors (k-NN) model with k=10 performs very well using the same data and evaluation metrics, raising important considerations about trade-offs between predictive performance of substructure identification and interpretability of results. We find that the k-NN model may perform well for the task of substructure identification in situations in which substructures appear in similar patterns between train and test sets. Similarly, k-NN may perform well when a test spectrum corresponds to a molecule that is similar to those in the train set. But as explored in Figure 3B, the k-NN model suffers when these similarities are reduced, and the test spectra correspond to increasingly different molecules from those in the train set or when the substructure appearance varies between train and test sets. This observation is important to consider in the development and evaluation of machine learning methods for substructure identification; assessing generalization performance in real-world settings should reflect cases in which a new spectrum is coming from an underexplored area of chemical space that is not well represented by existing spectral libraries. We leave to future work the problem of including such quantities into evaluation metrics to more accurately assess generalization. Ablation studies such as the one presented here can provide the foundation for better metrics and ways of construct evaluation sets relevant to a substructure identification task in a specific problem setting.
An example case study: the substructure with identifier 528 in the unfiltered train/test set corresponds to SMARTS string C1Cc2ccccc2CN1 and appears in one test spectrum.
In this test spectrum, substructure 528 appears without many of its co-substructures from the training set, and these co-substructures from the training set appear in the test set without substructure 528. As such, the k-NN (k = 10) model produces false positive predictions for substructure 528 in the test set, resulting in an AUC of 0.43. But the LLDA model picks up on this substructure's presence in the test set without suffering from false positives, achieving an AUC of 0.99. The spectrum (corresponding to spectrum 128 in the test set) containing substructure 528, its structure, and spectrum fragments in the top 0.95 percentile of substructure 528's LLDA topic are shown in Figure 3C.
Further work is required to better optimize LLDA. Additional preprocessing steps can be explored, such as keeping spectrum fragments that do not map to a child molecular formula but appear consistently across spectra (rather than discarding them as described in the Method). The inclusion of such orphaned fragments could improve downstream ranking of substructures. A similar preprocessing step is effective in processing amplicon sequencing data 30 . A number of limitations remain in LC-MS/MS metabolite identification including a paucity of training data 31 , difficulties in selecting a vocabulary of substructures 12 , and the heterogeneity involved in instrument choice, acquisition method, and sample conditions. Performance drops resulting from these issues are often difficult to disentangle from features of the data that result solely from chemical structure. Future work includes optimization of substructure label sets, incorporation of prior knowledge such as ionization mode or instrument type, and testing these models on a larger dataset. The work presented here highlights the increasing interest and relevance of representing MS/MS spectra in manners that can capture more information than a mass-to-charge ratio binning scheme and applying a metric for predictions such as cosine similarity. Another recent example of such an effort can be found in 32.
By releasing all code and preprocessed training data for the methods developed here, we encourage reproducibility efforts alongside the development of machine learning methods to solve de novo identification of unknown metabolites in LC-MS/MS data as computational metabolomics stands to benefit from community engagement, consistent public evaluation methods, and open-source models. This project contains the following underlying data: -readme.md (file containing descriptions, specifications, and instructions for included data and using code).

List of abbreviations
-requirements.txt (exported conda environment containing the necessary Python dependencies -R decencies must be installed separately. See readme.md).
-data.zip (data used in paper's empirical study, see readme.md).
-code.zip (code for preparing data, running LLDA, and running kNN. See readme.md). training and testing. Were any filtering criteria utilized in selecting these sets of spectra? There appears to be a de facto filtering of spectra which do not contain any peaks with associated molecular formulas.
Coding style of the codebase is at the typical research programming level, which needs improvement. Here is a non-exhaustive list of needed improvements. There is no use of python docstrings. Many variable names are not descriptive. There is no main function defined and utilized in the accepted pythonic manner. There is global namespace pollution with some of the import statements. Codebase is also not designed as a full python package that allows both command line execution or use as a library. These comments come from an expectation for industrial standards of coding, even in scientific programming projects.

Are sufficient details provided to allow replication of the method development and its use by others? Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly © 2021 Wandy J. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Joe Wandy
Glasgow Polyomics, University of Glasgow, Glasgow, UK Predicting substructural information from LC-MS/MS fragmentation data is an important problem in untargeted metabolomics. In this work, the authors introduced a novel workflow to preprocess fragmentation data and perform supervised topic modelling approach using labeled latent Dirichlet allocation (LLDA). The LDDA model finds co-occurring fragment and loss features that are associated with chemical substructures provided by users.
The main benefit of incorporating label information through LLDA is that it allows for a list of substructures of interest to be inferred. Topics in LLDA are associated with meaningful substructure labels, aiding in model interpretation and improving from standard LDA where users are required to specify the number of topics in advance and manually annotate topics to provide chemical meaning -a time consuming and laborious process.
In this study, LLDA was evaluated on a benchmark dataset of 185 chemicals having known substructural annotations against a rule-based alternative (MESSAR) and a baseline k-nearest neighbour method, and it was found to perform competitively in retrieving substructural hits particularly when applied to new and unseen test data.
Overall I think the work done in this manuscript is interesting and advances the field of substructural discovery in metabolomics. I am happy to recommend this for indexing as long as my comments below are addressed. The comments are separated into major and minor categories, and further divided by the corresponding sections of the manuscript.
formula enumeration via cdk. A typical fragmentation data set easily consists of a few thousand fragmentation spectra, and at two minutes each, the total feature extraction process could easily run up to days (for a single core) or many hours (for multicores). If true, please state this clearly in the manuscript so the reader is aware of it.
Is performing elemental formula mapping actually necessary? The author comments that ' keeping spectrum fragments that do not map to a child molecular formula but appear consistently across spectra (rather than discarding them as described in the Method)' could be beneficial. This raises the question, is it necessary to perform such expensive procedure to map formulae to fragment and loss features, while the LDA (and LLDA model) could also work on the binned fragment and loss features that appear consistently across spectra, e.g. using 'fragment_132.0152' and 'loss_100.9203' as words of the document, following MS2LDA [19]. Filtering features by formula could also potentially remove frequently occurring fragment or loss features that are important to substructural identification. It would make the manuscript stronger if the authors could investigate how an alternative and simpler feature extraction scheme performs in comparison to what's been done in the manuscript.

Section: Training LLDA and predicting substructures in a new spectrum
Please expand the description of LLDA below.
-The original LLDA model is described in full in 17. We note that every component of LLDA for modeling text documents has an analog useful for modeling a MS/MS spectrum. A document is an MS/MS spectrum, words are spectrum features (fragments and neutral losses), topics are commonly co-occurring spectrum features, and tags are chemical substructures (Figure 1). -I would like to see some textual elaboration of the plate diagram in Figure 1C, in particular the generative process of LLDA (using mass spec vocabularies) and how it differs from standard LDA. What are the advantages of LLDA vs LDA in this application? I had to refer to the original LLDA paper [17] and deduced all these things myself, but really it should be made explicit in the manuscript.

1.
What's the perplexity/log likelihood of LLDA compared to standard LDA on the training and test data sets? Also please report the wall clock of running 2000 iterations of LLDA. And what are the hyperparameters used for alpha and eta in the experiments? Were they using the defaults in tomotopy or were they optimised? How sensitive are the results in Table 1 to the choice of alpha and eta?
The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias • You can publish traditional articles, null/negative results, case reports, data notes and more • The peer review process is transparent and collaborative • Your article is indexed in PubMed after passing peer review • Dedicated customer support at every stage • For pre-submission enquiries, contact research@f1000.com