Positional weight matrices have sufficient prediction power for analysis of noncoding variants

The position weight matrix, also called the position-specific scoring matrix, is the commonly accepted model to quantify the specificity of transcription factor binding to DNA. Position weight matrices are used in thousands of projects and software tools in regulatory genomics, including computational prediction of the regulatory impact of single-nucleotide variants. Yet, recently Yan et al. reported that "the position weight matrices of most transcription factors lack sufficient predictive power" if applied to the analysis of regulatory variants studied with a newly developed experimental method, SNP-SELEX. Here, we re-analyze the rich experimental dataset obtained by Yan et al. and show that appropriately selected position weight matrices in fact can adequately quantify transcription factor binding to alternative alleles.

Any reports and responses or comments on the article can be found at the end of the article.

Introduction
Gene regulatory regions constitute an important part of non-coding DNA which defines both the global development program of a mammal and individual traits of a particular organism. Specific recognition of DNA sites by transcription factors (TFs) provides the gear system linking individual genomic variants to phenotypes. 1 The commonly accepted model to quantify the specificity of transcription factor binding to various DNA sites is the position weight matrix (PWM), which specifies additive contributions of individual nucleotides to the protein-DNA binding energy. 2,3 Recently Yan et al. 4 presented a powerful high-throughput experimental technique, SNP-SELEX, which allows measuring differential TF binding to alternative alleles in vitro. Yan et al. used the experimental data they had obtained for many TFs to assess the performance of PWM in predicting differential TF binding to alternative alleles and compare it to that of deltaSVM, a more complex method based on machine learning. As a result, they reported that in this setting "the position weight matrices of most transcription factors lack sufficient predictive power". Keeping in mind that PWMs are extensively used for prediction of the regulatory potential of single-nucleotide variants 5-8 the finding of Yan et al. could be devastating for a vast array of research projects and software tools.
Yan et al. tend to explain the poor performance of PWMs by model limitations, primarily, arising from the oversimplistic assumption that nucleotides occupying different positions in the binding site provide independent contributions to the binding energy. Here we re-analyze the dataset of Yan et al. and argue that the poor PWM performance in predicting differential transcription factor binding to alternative alleles detected by SNP-SELEX is to a major extent explained not by the principle limitations of PWM as a mathematical construction but rather by particular inadequate PWMs for TFs under study. We show that the careful selection of PWMs of many TFs from a public database quantitatively explains the differential TF binding to allelic variants with reliability comparable to deltaSVM.

Results
To re-assess PWM performance, we used PWMs stored in the CIS-BP database, 9,10 which contains PWMs constructed from data obtained with different experimental techniques for thousands of TFs for different species. With the objective of selecting the PWM appropriate for quantifying differential allele binding of a TF, for each of 129 TFs assessed in Yan et al. we extracted an extended set of candidate PWMs, with a median of 32 PWMs per TF. The overall distribution was non-uniform e.g. there were only 2 candidate PWMs for ZNF396 and over a thousand for FOXA2, see Extended data, Supplementary Table S1.
Through cross-validation on the 1st batch of SNP-SELEX data following the strategy of Yan et al., we selected the best PWM CIS-BP for each TF (see "Selecting the best PWMs and estimating PWM performance with SNP-SELEX data" in the Methods). There was no correlation between the prediction performance (area under precision-recall curve, AUPRC) and the number of tested PWMs per TF (r = À0.07, P = 0.425). Many of the best-performing PWMs were originally constructed from the data related not to the target TF but to other TFs sharing the same DNA-binding domain as the TF of interest. Some PWMs were based upon the TF binding data from different species. We denote by ΔPWM CIS-BP the difference of the allelic scores predicted with PWM CIS-BP and by ΔPWM Mult the results of PWMs obtained from HT-SELEX data using the multinomial algorithm 11 and assessed by Yan et al. 4 Yan et al. provide the experimental differential binding scores for each SNP (the "pbSNP" scores). We compared these scores with ΔPWM CIS-BP predictions. For a complete set of 816594 TF-SNP pairs (Figure 1a), the Pearson correlation coefficient was comparable to that observed by Yan et al.  Figure 2a). Yet, if only SNPs with a strong predicted binding are included (P < 10 -4 for the PWM score of a stronger bound allele) then a much higher correlation (r~0.828) is achieved, see Figure 1b. Finally, if only the most strongly bound TF is considered for each SNP (as in Figure 3a

REVISED Amendments from Version 2
The reviewer has drawn our attention to the missing citation of one of the key PWM benchmarking papers. We also updated the link to supplementary materials.
Any further responses from the reviewers can be found at the end of the article the highest quantile was on par with deltaSVM (see Supplementary Figure S1, compare with Extended data Figure 7 in Yan et al.  In fact, for 34 transcription factors, PWM CIS-BP outperformed advanced models of deltaSVM ( Figure 2c). 5-fold crossvalidation showed that models reaching higher AUPRC simultaneously had a lower variance in prediction quality across individual folds ( Figure 2d). Furthermore, we tested the PWMs on the independent 2nd batch data (

Discussion
Our approach mimics a machine learning setup, where the best model is selected ("trained") through cross-validation on a first experimental data set (1st batch of the SNP-SELEX data), and then additionally independently validated on the second experimental data set (2nd batch of SNP-SELEX data). As we select from a finite and typically small set of candidate PWMs, the risks of overfitting are minimized, and the resulting performance was not correlated with the number of 'candidate' PWMs. The utilized layout allowed us to pick up the best suited PWM independently from the original data or motif discovery method used for PWM construction, yet maintaining the main PWM limitations, such as the assumption of the independent contributions of nucleotides at different TFBS positions.
The lower performance of PWMs for TFBS recognition as compared with more complex models was reported in many publications. [12][13][14] The popular explanation blames the assumption of positional independence of PWM scores, which comes short of taking into account the marked correlations of nucleotides located at neighboring or even distant positions of binding sites. [15][16][17] This shortcoming is also considered over-restrictive for PWM applications in predicting the effects of single nucleotide variants on TF binding, 4,18 which has recently come into the spotlight of modern genetics where the advent of complete genome sequencing brought about the need for interpretation of phenotypes associated with regulatory variants. 19,20 Here we suggest an alternative explanation of inadequate PWM performance in predicting the effects of single-nucleotide variants on TF binding. In many cases, the reason is an inadequate PWM construction or selection procedure.
Besides technical difficulties in proper "training" a PWM through motif discovery from different types of experimental data, the particular experimental context may influence the applicability of the resulting PWM. Careful selection of PWMs from a pool of alternative existing models results in an apparent improvement of the quantitative assessment of preferential binding to single-nucleotide variants, especially for high-scoring TFBS. In many cases, the prediction power becomes comparable to that of the significantly more complex model such as deltaSVM. We specifically emphasize that in our study all selected PWM CIS-BP were genuine PWMs following the classic assumption of the independent contribution of positional scores.
Summing up, our results do not compromise the high performance of deltaSVM, 12  not trivial and its success depends both on the motif discovery algorithm and reliability of the training data. In our case, almost half of the best PWMs were derived from related TFs, including 8 cases of PWMs based on experimental data from other species. The experiments used to obtain the best PWMs were also of different types, including ChIP-Seq, proteinbinding microarrays, and SMiLE-Seq data, see Extended data, Supplementary Table S1. 21 Thus, it is important to consider various sources of PWMs and select those the most suitable by proper benchmarking. In the context of applying PWMs to analyze regulatory variants, SNP-SELEX of Yan et al. provides rich, unique, and practically useful data.
The objective of our study is by no means to undermine the necessity of complex TFBS models with dependent positional contributions. Advanced multiparametric and alignment-free approaches such as deltaSVM appear very likely to shape the oncoming future of transcription factor binding site models. Rather, we want to underline that the prediction performance of transcription factor binding sites in its current stage is more influenced by model training protocols than by model structure restrictions. PWMs still can deliver a solid standard in representation and bioinformatics analysis of the transcription factor binding sites, including assessment of the functional impact of single nucleotide variants in gene regulatory regions. In addition, we underline that better defined 'baseline' PWMs or PWM selection procedures are required for the proper evaluation of advanced models. It is important that such 'baseline' TFBS models, while certainly being handicapped by design, still reach meaningful prediction quality. These are good news for thousands of researchers who still use the 'legacy' PWM scoring for practical applications in regulatory genomics and bioinformatics.

Methods
PWMs used in the study As a source of candidate PWMs we used the CIS-BP (Catalog of Inferred Sequence Binding Preferences) collection 9,10 of pre-made matrices. For each TF, we gathered all PWMs assigned to the TF and added PWMs for related proteins sharing similar DNA binding domains. This was motivated by the results of the benchmarking study of Ambrosini et al. 2 where a PWM for some TF often displayed poorer TFBS recognition power than a PWM for some different TF but with the same DNA-binding domains.
The starting set of position frequency matrices was extracted from TF_Information_all_motifs.txt of CIS-BP 2.0 that includes models derived from direct experimental data for each TF and models that can be inferred given the TF familyspecific threshold on DNA-binding domain similarity, see Ref. 11. In Figure 2 such PWMs are referred to as 'direct' and Determination of transcription factor binding preference using PWMs To assess with a particular PWM whether an SNV affects transcription factor binding, we used PERFECTOS-APE 6 that estimates the log-fold change of motif P-values computed for best PWM hits detected among sites overlapping the first and the second of two alternative alleles. To use the prediction as a binary classifier, we treated the cases with P > 0.005 at both alleles as predicted negatives and used the log-fold change as the prediction score in the remaining cases. The auc function of the sklearn.metrics Python package was used to estimate the area under the precision-recall curve (AUPRC).

Estimating PWM performance with SNP-SELEX data
To provide a fair assessment, we mimicked the benchmarking protocol of Yan et al. Particularly, true positives and true negatives were selected from the SNP-SELEX data as follows. 1st batch data positives: PBS P-value < 0.01 and OBS P-value < 0.05; negatives: PBS P-value > 0.5 and OBS P-value < 0.05. 2nd batch data positives: PBS P-value < 0.01, negatives: PBS P-value > 0.5. For each TF, we tested each CIS-BP PWM from its PWM set. For each TF, the PWM reaching the highest AUPRC on the 1st batch data was selected for evaluation against the best PWM on the 1st batch ( Figure 2a) and against deltaSVM on the 2nd batch of SNP-SELEX data (Figure 2e). Performance estimates for deltaSVM models (used in Figure 2c,e) were extracted from Supplementary  Figure 2a) were kindly shared on our request by the authors. 4 We also mimicked the stratified five- This project contains the following extended data: • Supplementary table S1 (Overview of PWMs and their performance in recognizing SNPs affecting transcription factor binding in SNP-SELEX data.)

Open Peer Review
PWMs are already known to be potentially flawed models, varying from very accurately predicting DNA binding specificity to poorly doing so, with potential confounders like cofactors 1 and indirect binding 3 . The extent to which these issues apply can depend on the method used to determine the PWM. For example, using ChIP-seq data can create a composite motif incorporating part of a cofactor. Using PBM may eliminate this effect, but can produce poor binding specificity in some cases, possibly because either the specificity is mediated by cofactor requirements or because binding is indirect 4,5 .
While I am specifically mentioning older approaches like PBMs here, any in vitro or in silico method potentially has similar issues.
For this reason, I advocate using a range of methods to assess motif quality 2 .
As regards SNPs and other variability, this sort of issue has to be taken into account, otherwise any variation in specificity may not correspond to in vivo reality.
So, back to the approach of this paper: selecting PWMs that match specific criteria for reliability. It is not clear to me that this in any way invalidates the results of Yan et al. as there is variability in the predictive quality of PWMs, given the potential for confounders.
I would like to see a clearer explanation of the extent to which Yan et al. actually diminish the utility of PWMs (noting this is in a specific context, assessing small genomic variants) and the extent to which this review generalises beyond carefully selected PWMs.. We did our best to clarify our claim as it was also questioned by Dr. Levitsky.
PWMs are already known to be potentially flawed models, varying from very accurately predicting DNA binding specificity to poorly doing so, with potential confounders like cofactors 1 and indirect binding 3. The extent to which these issues apply can depend on the method used to determine the PWM. For example, using ChIP-seq data can create a composite motif incorporating part of a cofactor. Using PBM may eliminate this effect, but can produce poor binding specificity in some cases, possibly because either the specificity is mediated by cofactor requirements or because binding is indirect 4,5.
We fully agree that the data source and the computational procedure used to derive the TFBS model would significantly affect the result in terms of whether it reflects the genuine TF binding specificity or significantly depends on confounding factors. In this paper we restricted ourselves to a more specific context of using PWMs for quantifying the variants identified with the SNP-SELEX, which is an in vitro assay, so indirect binding and cofactors do not influence the outcome. To make it clear, we have revised the Introduction section of our manuscript.
For this reason, I advocate using a range of methods to assess motif quality 2.
Indeed. A comprehensive assessment of motif models using different types of experimental data was performed e.g. in Ambrosini et al. 2022. In this study, we did not focus on selecting the optimal PWMs for a wide range of practical applications or in terms of representing in vivo binding. Our aim was to demonstrate that PWMs provide the type of a model, which is able to show a reasonable performance in classifying differentially bound oligonucleotides with single-nucleotide substitutions.

So, back to the approach of this paper: selecting PWMs that match specific criteria for reliability. It is not clear to me that this in any way invalidates the results of Yan et al. as there is variability in the predictive quality of PWMs, given the potential for confounders. I would like to see a clearer explanation of the extent to which Yan et al. actually diminish the utility of PWMs (noting this is in a specific context, assessing small genomic variants) and the extent to which this review generalises beyond carefully selected PWMs.
The same issue was pointed out by Dr. Levistky and we did our best to clarify the aim of the study in the revised version of the manuscript. We believe, that quantifying the effects of single nucleotide variants on TF binding is an important practical problem emerging in the increasingly influential field of personalized genomics, as according to the recent reports up to 80% of causal variants are found in the regulatory regions [see e.g. https://www.medrxiv.org/content/10.1101/2021.06.08.21258515v2]. Even though this is a limited problem, it is worth clarifying the PWM performance for this particular application.
We have added the necessary information in the Introduction section.
SELEX indicated that the G allele bound more strongly than the T allele to HLF. This could be caused by the dinucleotide interdependency between position 2 (the SNP position) and position 10 in the binding site (Fisher's exact test P < 2.2 × 10−16, odds ratio = 3.34)… …(2) PWM performed poorly for SNPs located in low-affinity binding sites of transcription factors. However, this limitation could be overcome by using deltaSVM. When we categorized SNPs into five quantiles on the basis of their binding affinities as measured by OBS, and assessed the performance of PWM and deltaSVM in predicting their allelic binding by fivefold cross-validation or using the novel batch of SNP-SELEX experimental results (Extended Data Fig. 7), deltaSVM outperformed PWM in all quantiles, particularly in the lower quantiles corresponding to weak transcription factor binding sites….. Boytsov et al. concluded …However, properly selected PWMs achieve performance that is very close and in some cases even better than that of deltaSVM. Despite the simplicity of the PWM model, its construction is not trivial and its success depends both on the motif discovery algorithm and reliability of the training data… Any motif discovery algorithm does not use any motif library on the process of de novo search. Usually, motif libraries are applied to interpret enriched motifs (e.g. STREME and Tomtom in meme suite, https://meme-suite.org/meme/index.html) Hence, application of motifs library is not a step in de novo process. At this step, I again does not understand why Boytsov et al. compared Figure 1b with Fig. 2b   Yet, we strongly disagree with the authors' conclusion on the PWM performance ("lack sufficient predictive power") and believe that their comparison of the performance of the PWM and deltaSVM became one-sided in favor of deltaSVM due to an accidental selection of particular PWMs in their study. We believe that public databases contain PWM which can display much better performance in quantifying allele-specific binding and put it explicitly in the Introduction that "We show that the careful selection of PWMs of many TFs from a public database quantitatively explains the differential TF binding to allelic variants with reliability comparable to that of deltaSVM." Again, we put into Discussion that "Summing up, our results do not compromise the high performance of deltaSVM, used by Yan et al. as an advanced substitution of position weight matrices (PWMs). However, properly selected PWMs achieve performance that is very close and in some cases even better than that of deltaSVM." We have also added a detailed discussion on the subject into the Discussion section (see the response to the next question of the reviewer).
Since, in particular, Yan et al. also wrote that …(1) We reasoned that the poor performance of many PWMs was probably because they did not take into account dinucleotide interdependency in transcription factor-DNA interactions and the influence of flanking DNA sequences… …(2) PWM performed poorly for SNPs located in low-affinity binding sites of transcription factors.

Does Boytsov et al. not agree with Yan et al. in these two points?
We agree with the theoretical limitations of position weight matrices regarding their inability to account for non-additive contributions of particular nucleotides into protein affinities. Yet, it is important to distinguish between the general and well-known limitations of the PWM as a model and the low performance of particular matrices. In our opinion the poor performance of PWMs used in the study of Yan et. al. was not due to the intrinsic inability of PWMs to classify TF binding preferences at particular TFBS, but rather due to the way the PWMs used in this study were selected/constructed. We prove it by providing the alternative PWMs that belong to the same class of mononucleotide models but perform better than the PWMs of Yan et al. and comparably to deltaSVM. We have added an explicit statement on this matter in the revised version of the manuscript. We used PWMs with the same DNA binding domains to increase the repertoire of candidate PWMs, from which the best PWM for assessing variants identified by SNP-SELEX experiments can be selected. To avoid confusion we have added two subsections to the Methods section: "PWMs used in the study" and "Selection of the best PWM for a TF". In fact, carefully selected PWMs outperformed deltaSVM models for 34 of 129 TFs (see paragraph 3 of Introduction in the manuscript and Figure 2), and many of these PWMs were initially constructed for different TFs and even different species (see Supplementary Table S1). This does not compromise better deltaSVM performance for other TFs (see Fig. 1e).

Hence, Boytsov et al. proved that BEESEM realization may be better if we incorporate CIS-BP data or what?
We did not test BEESEM or other types of motif discovery software or alternative PWM-like motif representations, and thus don't know if they provide even better PWMs than we found in CIS-BP. We only used the existing published PWMs available in the CIS-BP database. SNP-SELEX provides a rich data source to test various types of models in the task of predicting rSNP effects on transcription factor binding, but such testing does not fit the scope of our manuscript. We did our best to better highlight the key idea of the study in the revised version of the manuscript and added an extensive Discussion section.

The TF classification by family is wrongly described. … This description does not explain several pair from Supplementary Data (Overview of best CIS-BP PWMs), e.g.ETV2 & FLI1
In the pool of possible PWMs for each TF, we included CIS-BP 'inferred' PWMs (as described in Methods) which belonged to TFs with a similar DNA-binding domain, hence there is no contradiction. We revised the Methods section, see the subsection "PWMs used in the study." The correct and default approach was described in the previous publication CIS-BP classification of DNA-binding domains is very general and leads to very wide sets of PWMs potentially applicable to a particular TF, if all PWMs across the TF family are considered. To reduce computational complexity, we made a compromise of including 'inferred' motifs (see above) but only for related proteins by matching gene names and not relying on the detailed TF family annotation. Even with this simplification in the PWM selection, which greatly reduced the number of available PWMs, the resulting performance of the best PWMs was significantly better than the PWM performance reported by Yan et al. for the same TF.