On the limits of inferring biophysical parameters of RBP-RNA interactions from in vitro RNA Bind’n Seq data

Niels Schlusser; Mihaela Zavolan

doi:10.12688/f1000research.135164.1

Method Article

[version 1; peer review: 1 approved with reservations, 1 not approved]

Niels Schlusser¹, Mihaela Zavolan¹

PUBLISHED 26 Jun 2023

¹ Biozentrum, Universität Basel, Basel, Basel-Stadt, 4056, Switzerland

Niels Schlusser
Roles: Data Curation, Formal Analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Mihaela Zavolan
Roles: Conceptualization, Methodology, Project Administration, Resources, Supervision, Validation, Writing – Review & Editing

This article is included in the Bioinformatics gateway.

Abstract

We develop a thermodynamic model describing the binding of RNA binding proteins (RBP) to oligomers in vitro. We apply expectation-maximization to infer the specificity of RBPs, represented as position-specific weight matrices (PWMs), by maximizing the likelihood of RNA Bind’n Seq data from the ENCODE project. We demonstrate that the model can reproduce known specificities for well-studied proteins and that in some cases we predict novel, longer binding motifs. However, the model does not recover all the motifs that are in principle known, indicating that the data is not well explained by a single underlying biophysical model. Our code is publicly available.

Keywords

Systems biology, bioinformatics, computational biology, machine learning, maximum entropy method, Bayesian statistics, RNA binding proteins, RNA Bind'n'Seq

Corresponding author: Mihaela Zavolan

Competing interests: No competing interests were disclosed.

Grant information: This work was funded by the Swiss National Fund under grant number 310030_204517.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2023 Schlusser N and Zavolan M. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Schlusser N and Zavolan M. On the limits of inferring biophysical parameters of RBP-RNA interactions from in vitro RNA Bind’n Seq data [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2023, 12:742 (https://doi.org/10.12688/f1000research.135164.1) First published: 26 Jun 2023, 12:742 (https://doi.org/10.12688/f1000research.135164.1) Latest published: 29 May 2024, 12:742 (https://doi.org/10.12688/f1000research.135164.2)

1. Introduction

RNA-binding proteins (RBPs) interact with RNAs at every step of their life cycle. Their modular structure, usually an assortment of RNA-binding domains, underlies their ability to interact with both RNAs and proteins, and to couple various layers of gene expression.¹ While $\sim 2500$ RBPs are currently known, most remain to be functionally characterized. A first step in this process is to determine the interaction partners and the sequence/structure specificity of the RBP. Many RBPs recognize their targets in a sequence-specific manner, although the accessibility of binding sites within the targets also plays a role.² The sequence specificity is usually represented as a position weight matrix (PWM), which specifies the probability of finding each of the four nucleotides at each position in the RBP binding site. This is an obvious simplification, as dependencies between positions in the binding site likely occur. However, training more complex models requires substantially more data, which are often not available. Moreover, the improvement in binding site prediction achieved by more complex models is modest, at least in the case of another class of nucleic acid binding proteins, the transcription factors.³ With the realization that the presence of canonical RNA-binding domains is not necessary for a protein to bind RNAs⁴ came a pressing need to identify the determinants of RNA-RBP interactions and the sequence/structure specificity of the proteins newly found to interact with RNAs.

The past two decades have seen the development and broad application of experimental methods for RBP target identification. They include in vivo high-throughput approaches such as HITS-CLIP, PAR-CLIP, iCLIP and eCLIP (reviewed in Ref. 5), and more recently developed in vitro approaches such as RNA Bind’n Seq.⁶ While the CLIP methods rely on the sequencing of RNAs that interact with, and can therefore be crosslinked to, RBPs in vivo, RNA Bind’n Seq relies on the affinity-dependent interaction of RBPs with random RNAs in vitro. The oligonucleotides that interact with the RBP (or domain thereof) of interest are computationally analyzed to identify short sequence motifs that are enriched in the affinity-selected pool of RNAs. So far, analyses of such data have involved the identification of enriched k-mers (short oligonucleotide sequences of a specified length $k$), followed by a greedy alignment procedure that yields PWM representations of the RBP binding motifs. This leaves open the question of whether the derived PWMs accurately predict the interaction energies of RBPs with their binding sites. In contrast, the aim of our work was to develop a biophysics-anchored method to directly infer the PWMs from RNA Bind’n Seq data. Our paper is organized as follows: Sec. 2 explains how we derive our thermodynamic model. We comment on the practical implementation of this model in Sec. 3, where we also explain how we account for sequence composition biases in the pool of oligomers. Results for different RBPs are presented in Sec. 4, where we also comment on the accuracy of the results obtained from this type of data for different RBPs. Concluding remarks are given in Sec. 5, and a list of publicly available data sets that we analyzed is provided in Sec. 6.

2. Model

Our model is an adaptation of a Bayesian, thermodynamic model that was constructed to infer di-nucleotide weight tensors from SELEX data.⁷ In the following, we derive the log-likelihood of Bind’n Seq data given the PWM for the RBP of interest, which will be inferred by expectation-maximization as described in Sec. 3.

We assume that an RBP binds an oligomer over a binding site $s$ of length $L_{w}$ and that the likelihood of the binding taking place, according to Boltzmann’s law, is proportional to $\exp (\sum_{i = 1}^{L_{w}} E_{i}^{s_{i}}) \equiv e^{E (s)}$ , where $s_{i}$ is the nucleotide at position $i$ , so $s_{i} \in \{A, C, G, T\}$ . Therefore, each element of the position weight matrix (PWM) $M$ can be identified with $m_{i}^{α} \equiv \exp (E_{i}^{α})$ , with the columns being normalized as $\sum_{α} m_{i}^{α} = 1 \; \forall i = 1, \dots, L_{w}$ .
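As an illustration of this definition, the following minimal Python sketch evaluates $e^{E (s)}$ as the product of the PWM entries selected by a site. The PWM values and the function names `site_weight` and `site_energy` are hypothetical and chosen only for this example; they are not taken from our published code.

```python
import math

# Toy PWM of length L_w = 3; every column (position) sums to 1.
# The numbers are illustrative assumptions, not fitted values.
PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]


def site_weight(site, pwm):
    """Return e^{E(s)} = prod_i m_i^{s_i} for a site of length L_w."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal the PWM length")
    weight = 1.0
    for column, base in zip(pwm, site):
        weight *= column[base]
    return weight


def site_energy(site, pwm):
    """Return E(s) = sum_i log m_i^{s_i}; this is <= 0 for a normalized PWM."""
    return math.log(site_weight(site, pwm))


print(site_weight("TGT", PWM))   # product of the three entries matching T, G, T
print(site_energy("AAA", PWM))   # a poorly matching site has a lower E(s)
```

Because each column is normalized, the weights of all $4^{L_{w}}$ possible sites sum to one, so $e^{E (s)}$ can be read as a probability over sites.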

Additionally, we account for the fact that there are genuinely two different ways of binding: sequence-specific binding as described by the PWM, and unspecific binding to RNAs with a probability $\exp (E_{0})$ . Combining these two possibilities, we arrive at the probability for an RBP binding to a site $s$

(2.1)

$$P_{b} (s| c| M| E_{0}) = \frac{c (e^{E (s)} + e^{E_{0}})}{1 + c (e^{E (s)} + e^{E_{0}})},$$

where the 1 in the denominator represents the (constant) chance of an RBP being unbound, which is set to 1 by normalizing the protein concentrations $c$ accordingly. Note that $E_{0} \leq 0$ needs to be satisfied since $e^{E_{0}}$ is a probability which, in turn, needs to satisfy $e^{E_{0}} \leq 1$ . Exploiting the fact that the binding of RBPs to oligomers is not saturated, i.e. $c \ll 1$ , we can linearize (2.1):

(2.2)

$$P_{b} (s| c| M| E_{0}) \approx c (e^{E (s)} + e^{E_{0}}) .$$
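The quality of the linearization can be checked numerically. The sketch below compares (2.1) and (2.2) for a few concentrations; the site weight, the value of $E_{0}$ and the concentrations are arbitrary illustrative assumptions, and the function names are hypothetical.

```python
import math


def p_bound(site_weight, c, e0):
    """Eq. (2.1): probability that a site with weight e^{E(s)} is bound."""
    x = c * (site_weight + math.exp(e0))
    return x / (1.0 + x)


def p_bound_linear(site_weight, c, e0):
    """Eq. (2.2): first-order approximation of Eq. (2.1), valid for c << 1."""
    return c * (site_weight + math.exp(e0))


w, e0 = 0.343, -5.0   # illustrative values only
for c in (1e-4, 1e-2, 1.0):
    exact = p_bound(w, c, e0)
    approx = p_bound_linear(w, c, e0)
    rel_err = (approx - exact) / exact
    print(f"c={c:g}  exact={exact:.6g}  linear={approx:.6g}  rel.err={rel_err:.3g}")
```

The relative error of the linear form equals $c (e^{E (s)} + e^{E_{0}})$ itself, so the approximation degrades as soon as this product is no longer small.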

Consequently, the chance of an RBP being bound somewhere on a longer oligomer $S$ with $L_{S} \geq L_{w}$ is

(2.3)

$$P_{b} (S| c| M| E_{0}) = \sum_{s \in S} P_{b} (s| c| M| E_{0}) \approx c (e^{E (S)} + (L_{S} - L_{w} + 1) e^{E_{0}}),$$

where $e^{E (S)} \equiv \sum_{s \in S} e^{E (s)}$ and the sum runs over all possible $L_{w}$ -mers $s \in S$ . The probability of each read $S$ in the pool of oligomers that are washed over the RBP is

(2.4)

$$P_{IP} (S| c| M| E_{0}) = \frac{f_{S} P_{b} (S| c| M| E_{0})}{\sum_{σ \in D} f_{σ} P_{b} (σ| c| M| E_{0})} \approx \frac{f_{S} (e^{E (S)} + (L_{S} - L_{w} + 1) e^{E_{0}})}{\sum_{σ \in D} f_{σ} (e^{E (σ)} + (L_{σ} - L_{w} + 1) e^{E_{0}})},$$

with $D$ being the data set containing all the reads at hand, and $f_{S}$ a frequency prior that corrects for the fact that the pool of oligomers has a non-uniform nucleotide composition. Note that, due to the linearization in $c$ , $P_{IP}$ is independent of the concentration $c$ since it cancels as an overall prefactor in both numerator and denominator. Equation (2.4) is essentially a formulation of Bayes’ theorem with the conditional probability $P_{b} (S| c| M| E_{0})$ of having a read $S$ bound by an RBP, the likelihood $f_{S}$ of finding a read $S$ in the pool washed over the RBP, and an overall normalization (denominator).

Finally, the log-likelihood of our library of oligomers $D$ reads

(2.5)

$$\log P (D) \approx \sum_{S \in D} n_{S} \log \left[ \frac{f_{S} (e^{E (S)} + (L_{S} - L_{w} + 1) e^{E_{0}})}{\sum_{σ \in D} f_{σ} (e^{E (σ)} + (L_{σ} - L_{w} + 1) e^{E_{0}})} \right],$$

where $n_{S}$ is the number of copies of read $S$ in our library.
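To make (2.3) to (2.5) concrete, the following Python sketch evaluates the log-likelihood for a toy library. It is a sketch under stated assumptions and not our published implementation: the reads, counts, PWMs and uniform priors are invented for illustration, and the normalization is taken over the distinct reads in $D$, which is one possible reading of the denominator in (2.5).

```python
import math


def site_weight(site, pwm):
    """e^{E(s)} as the product of the PWM entries selected by the site."""
    w = 1.0
    for column, base in zip(pwm, site):
        w *= column[base]
    return w


def read_weight(read, pwm, e0):
    """e^{E(S)} + (L_S - L_w + 1) e^{E_0}, the bracket appearing in Eq. (2.3)."""
    l_w = len(pwm)
    n_windows = len(read) - l_w + 1
    specific = sum(site_weight(read[i:i + l_w], pwm) for i in range(n_windows))
    return specific + n_windows * math.exp(e0)


def log_likelihood(counts, priors, pwm, e0):
    """Eq. (2.5); counts maps read -> n_S and priors maps read -> f_S."""
    weights = {S: priors[S] * read_weight(S, pwm, e0) for S in counts}
    norm = sum(weights.values())   # sum over distinct reads (assumption)
    return sum(n * math.log(weights[S] / norm) for S, n in counts.items())


# Toy data (hypothetical): one read containing TGT dominates the library.
counts = {"TGTAA": 8, "ACGTA": 1, "CCCCC": 1}
priors = {S: 1.0 for S in counts}            # uniform f_S for simplicity
pwm_tgt = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
pwm_flat = [{b: 0.25 for b in "ACGT"} for _ in range(3)]

print(log_likelihood(counts, priors, pwm_tgt, -5.0))
print(log_likelihood(counts, priors, pwm_flat, -5.0))
```

In this toy example the PWM that matches the over-represented read yields the larger log-likelihood, whereas the flat PWM assigns the same probability to every read.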

3. Implementation

Our goal is to optimize the parameters in (2.5) such that they maximize the likelihood of observing our library. As a side note, it is equivalent to optimize $P (D)$ or its logarithm, because the logarithm is a strictly increasing function of its argument. Since the libraries are typically quite big, it is beneficial to maximize the logarithm in order to keep the numbers manageable. While the library $D$ , the copy number $n_{S}$ of a read, the read and binding site lengths $L_{S}$ and $L_{w}$ , and – with some limitations – the frequency priors $f_{S}$ are given by our data, the position-specific binding encoded by the PWM and the position-unspecific binding $e^{E_{0}}$ have to be obtained during the optimization process. Eventually, we want to obtain the PWM, whereas $e^{E_{0}}$ represents a hidden parameter which will be inferred via the expectation-maximization procedure. In principle, this would also apply to the concentration $c$ , but none of our final expressions depend on $c$ any more due to the linearization. Before diving into the details of the EM procedure’s implementation, we would like to comment on how to infer the frequency priors $f_{S}$ .

3.1. Construction of the frequency priors $f_{S}$ from a Markov model

RNA Bind’n Seq data comprise not only libraries of pulled-down, RBP-bound reads at different, non-vanishing RBP concentrations, but also control experiments that do not contain any RBPs. The oligonucleotides that were used for RBP affinity-based selection were short, typically 20 nucleotides in length (cf. Ref. 8). The number of possible 20mers is $4^{20} \approx 10^{12}$ , much larger than the library sizes of $\sim 10^{7}$ . Thus, even in the absence of selection ( $c = 0$ ), the expected overlap of two libraries is extremely small.
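The size of this effect can be estimated with a few lines of Python. The calculation below assumes that reads are drawn uniformly and independently from all 20mers, which is an idealization made only for this estimate; the library size is the order of magnitude quoted above.

```python
n_possible = 4 ** 20       # number of distinct 20mers
library_size = 10 ** 7     # order of magnitude of a library

# Probability that one given read also occurs in a second library of the
# same size, assuming uniform and independent sampling (idealization).
p_seen = 1.0 - (1.0 - 1.0 / n_possible) ** library_size

print(n_possible)                      # 1099511627776
print(f"{p_seen:.2e}")                 # fraction of reads expected to be shared
print(f"{p_seen * library_size:.0f}")  # expected number of shared reads
```

Under these assumptions a given read has a chance of roughly $10^{-5}$ of also appearing in a second library, so almost all foreground reads have no direct counterpart in the background sample.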

To preserve the statistical power of the foreground pool, i.e. to use all the reads detected in the foreground sample in the analysis even if they were not represented in the background sample, we have to predict the frequency of foreground reads under the assumption of no selection for binding the RBP. A commonly used approach for this type of problem is to train a Markov model on the background pool and to compute the expected frequency of each read in the foreground from the trained model, as in Ref. 9. For a completely unbiased process of oligomer synthesis and capture, the order $d$ of the Markov model would be $0$ , i.e. each base would be equally likely to occur at any position in the oligomer, and all 20mers would have the same prior frequency of occurrence $f_{S}$ . However, biases in the capture and sequencing of oligomers could lead to some sequences, with a specific composition of short nucleotide motifs, being captured more often than others. To account for this possibility we trained Markov models of different orders and found that, in general, the higher the order of the model trained on the background sequences, the better the prediction of the likelihood of sequences in the foreground samples. Thus, we used a Markov model of order $d = 14$ , which allows the most accurate prediction of background read frequencies with our computational resources.
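The construction of $f_{S}$ can be sketched as follows. This is a minimal illustration under assumptions that are not specified in the text: the toy background reads, the add-one pseudocount, the treatment of the first bases of a read with shortened contexts, and the function names `train_markov` and `log_prior` are all hypothetical choices. The order is set to 2 here only to keep the example small, whereas the analysis above uses $d = 14$.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(reads, order):
    """Count (context -> next base) transitions in the background reads."""
    counts = defaultdict(lambda: defaultdict(float))
    for read in reads:
        for i in range(len(read)):
            context = read[max(0, i - order):i]   # shorter context at the read start
            counts[context][read[i]] += 1.0
    return counts


def log_prior(read, counts, order, pseudocount=1.0):
    """log f_S of a read under the trained model, with pseudocount smoothing."""
    total = 0.0
    for i in range(len(read)):
        context = read[max(0, i - order):i]
        row = counts.get(context, {})             # unseen context -> empty row
        numerator = row.get(read[i], 0.0) + pseudocount
        denominator = sum(row.values()) + pseudocount * len(ALPHABET)
        total += math.log(numerator / denominator)
    return total


# Hypothetical background reads, for illustration only.
background = ["ACGTACGT", "ACGTTTGA", "TTTTACGA"]
model = train_markov(background, order=2)

print(log_prior("ACGT", model, 2))   # composition typical of the background
print(log_prior("GGGG", model, 2))   # composition absent from the background
```

Since every conditional distribution is normalized over the four bases, the resulting priors sum to one over all reads of a given length, so they can be used directly as $f_{S}$ in (2.5).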

3.2. Inferring PWMs from the expectation maximization algorithm

Having constructed our model, with the final expression (2.5), and having constructed the background frequencies $f_{S}$ as described in the subsection above, the main remaining question is how to optimize the PWM and $E_{0}$ such that the likelihood of the observed data (cf. (2.5)) is maximized. To this end, we rely on the expectation maximization algorithm.¹⁰,¹¹ Given that only some of our model parameters can be directly inferred from the data, the algorithm optimizes the “hidden” parameters to maximize (2.5). The expectation-maximization (EM) procedure can be divided into the following steps:

1. Initialize $E_{0}$ and the PWM elements $m_{i}^{α}$ with admissible real numbers, i.e. $E_{0} \in (- \infty, 0]$ and $\sum_{α} m_{i}^{α} = 1 \; \forall i = 1, \dots, L_{w}$ . This can be done either in an entirely unbiased way or by fixing some motif positions in advance and initializing the remaining positions of the PWM randomly or uniformly.
2. Optimize $E_{0}$ to maximize (2.5) holding the PWM fixed.
3. Update the PWM with the new $E_{0}$ from the previous step. The update of the PWM works by splitting each read $S$ of the data set into $L_{w}$ -mers $s$ and adding the weight

(3.1)

$$\frac{P (s| c| M| E_{0})}{P (S| c| M| E_{0})} = n_{S} \frac{e^{E (s)}}{e^{E (S)} + e^{E_{0}} (L_{S} - L_{w} + 1)}$$

to all entries in the PWM corresponding to $s$ . Repeat this for all $s$ in $S$ and for all $S$ in $D$ . Renormalize the PWM by enforcing $\sum_{α} m_{i}^{α} = 1 \; \forall i = 1, \dots, L_{w}$ .

4. Repeat the previous two steps until convergence. We terminate the iteration when the quadratic difference between the current and the updated PWM is less than $10^{- 6}$ on average per entry, i.e. for $L_{w} = 5$ the quadratic difference is less than $5 \times 4 \times 10^{- 6}$ . Usually, this takes $O (10)$ iterations.
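The four steps above can be sketched in Python as follows. This is a simplified illustration and not our published C++/Python implementation. The following choices are assumptions made for this sketch only: uniform priors $f_{S}$, a grid search for the optimization of $E_{0}$, a small pseudocount that keeps PWM entries strictly positive, and the toy reads and all function names.

```python
import math

BASES = "ACGT"


def site_weight(site, pwm):
    w = 1.0
    for col, base in zip(pwm, site):
        w *= col[base]
    return w


def windows(read, l_w):
    return [read[i:i + l_w] for i in range(len(read) - l_w + 1)]


def read_weight(read, pwm, e0):
    sites = windows(read, len(pwm))
    return sum(site_weight(s, pwm) for s in sites) + len(sites) * math.exp(e0)


def log_likelihood(counts, pwm, e0):
    # Eq. (2.5) with uniform priors f_S (assumption of this sketch).
    w = {S: read_weight(S, pwm, e0) for S in counts}
    norm = sum(w.values())
    return sum(n * math.log(w[S] / norm) for S, n in counts.items())


def optimize_e0(counts, pwm, grid):
    # Step 2: maximize Eq. (2.5) over E_0 <= 0 with the PWM held fixed.
    return max(grid, key=lambda e0: log_likelihood(counts, pwm, e0))


def update_pwm(counts, pwm, e0, pseudocount=1e-3):
    # Step 3: accumulate the weights of Eq. (3.1), then renormalize columns.
    new = [{b: pseudocount for b in BASES} for _ in pwm]
    for S, n in counts.items():
        denom = read_weight(S, pwm, e0)
        for s in windows(S, len(pwm)):
            weight = n * site_weight(s, pwm) / denom
            for i, base in enumerate(s):
                new[i][base] += weight
    for col in new:
        total = sum(col.values())
        for b in BASES:
            col[b] /= total
    return new


def run_em(counts, pwm, max_iter=200, tol=1e-6):
    grid = [-12.0 + 0.25 * k for k in range(49)]   # E_0 from -12.0 to 0.0
    e0 = grid[0]
    for _ in range(max_iter):
        e0 = optimize_e0(counts, pwm, grid)
        new = update_pwm(counts, pwm, e0)
        diff = sum((new[i][b] - pwm[i][b]) ** 2
                   for i in range(len(pwm)) for b in BASES)
        pwm = new
        if diff < tol * len(pwm) * len(BASES):     # Step 4: convergence test
            break
    return pwm, e0


# Step 1: "guided" initialization, weakly biased towards TGT (toy example).
counts = {"ATGTC": 30, "CTGTA": 30, "GGTGT": 30, "ACACA": 3, "CCGAC": 3}
init = [{b: 0.2 for b in BASES} for _ in range(3)]
for i, b in enumerate("TGT"):
    init[i][b] = 0.4
pwm, e0 = run_em(counts, init)
consensus = "".join(max(col, key=col.get) for col in pwm)
print(consensus, e0)
```

In this toy library the reads that share TGT carry most of the counts, so the weakly biased starting PWM sharpens towards that motif. On real data the outcome depends strongly on the initialization, as discussed in Sec. 4.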

Our code is written in C++ and Python and is publicly available.¹²

4. Results

In analyzing Bind’n Seq datasets for various RBPs, we found that only a small subset of random initializations leads to a convergent EM process. The larger $L_{w}$ is, the larger is the space of possible initializations, and it becomes increasingly unlikely to hit by chance a region of initialization from which the algorithm converges. This could be compensated for by increasing the number of runs by a factor of $4$ for each additional position in the PWM. To avoid that, one can use the knowledge from previous runs done with shorter PWMs and initialize the longer PWM from the shorter PWM, filling the additional entries with randomly initialized values to check whether the shorter PWM is part of a longer PWM. We carry on with this procedure until the EM algorithm no longer finds any maximum among 12 different random initializations of $E_{0}$ and of the PWM at the additional positions. Sometimes, neither a fully random initialization nor an initialization of the PWM “guided” by prior knowledge leads the EM algorithm to converge to a maximum of the log-likelihood (2.5). The relative efficiency of the algorithm in finding true maxima is displayed in Figure 1 for the RBPs discussed in the following subsections. We consider an outcome of the algorithm to be a “true” maximum if the posterior log-likelihood is larger than the initial one and the algorithm is not stuck in a region where $E_{0}$ is large, meaning that the unspecific binding dominates. The maximization algorithm is terminated after at most 200 iterations. For readability, we list the Bind’n Seq data files from Ref. 8 that we used in Table 1 in Sec. 6.
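The “guided” initialization described above can be sketched as follows. The helper names, the way random columns are drawn, the placement of the new columns on either flank, and the range from which $E_{0}$ is sampled are assumptions made for this illustration; only the number of 12 random starts is taken from the text.

```python
import random

BASES = "ACGT"


def random_column(rng):
    """Draw a random probability vector over the four bases."""
    raw = [rng.random() for _ in BASES]
    total = sum(raw)
    return {b: x / total for b, x in zip(BASES, raw)}


def extend_pwm(short_pwm, n_left, n_right, rng):
    """Pad a previously obtained shorter PWM with random flanking columns."""
    left = [random_column(rng) for _ in range(n_left)]
    right = [random_column(rng) for _ in range(n_right)]
    core = [dict(col) for col in short_pwm]   # copy, input stays untouched
    return left + core + right


def random_e0(rng, lowest=-10.0):
    """Random non-positive starting value for E_0 (the range is an assumption)."""
    return rng.uniform(lowest, 0.0)


rng = random.Random(0)
short = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
# Twelve starting points for an EM run with L_w = 4: the known 3-mer core
# plus one randomly initialized column on the right.
starts = [(extend_pwm(short, 0, 1, rng), random_e0(rng)) for _ in range(12)]
print(len(starts), len(starts[0][0]))
```

Each starting point would then be passed to the EM procedure of Sec. 3.2, and only runs that end in a “true” maximum in the sense defined above would be retained.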

Figure 1. Summary of fraction of convergent outcomes for different investigated RBPs and binding site length $L_{w}$ .

RBPs and binding site lengths with no convergent maxima of the log-likelihood (as described in the corresponding subsections, e.g. subsection 4.6) were discarded. While $E_{0}$ is always initialized with a negative random number, the PWM can be initialized either “guided” by already obtained shorter motifs or literature motifs (c.f. corresponding subsection of Sec. 4), or with a fully “random” PWM.

4.1 Benchmark: PWM of length 6 for RBFOX2

To benchmark our method, we started our evaluation with RBFOX2, a key regulator of alternative splicing¹³ that has been extensively studied with a variety of methods (e.g. Ref. 14). The RBFOX2 Bind’n Seq dataset⁸ consists of nine libraries at nine different protein concentrations and two protein-free control libraries, all containing reads of 50 nucleotides (nts) in length, including the adaptor. RBFOX2 is widely used to benchmark computational analysis methods (cf. Ref. 15), and the corresponding dataset was therefore carefully generated to include multiple high-quality libraries. Established techniques like kmer-enrichment analysis and the streaming-kmer-algorithm (SKA) predict a consensus 6mer TGCATG as the most prominent motif, followed by other GCATG-containing 6mers.¹⁵ Our results in Figure 2 reproduce the predicted TGCATG 6mer as part of the motif in Figure 2(a). Moreover, we find the subdominant PWM in Figure 2(b), which overlaps substantially with Figure 2(a) in the first four positions. We therefore consider our code to have passed an important test on real-world data.

Figure 2. Findings of our model for PWMs of $L_{w} = 6$ .

(a) The consensus motif that features a higher $\log P (D)$ than (b). However, (b) is also present.

4.2 Other PWMs found for RBFOX2

The large overlap of the motifs in Figure 2 suggests also searching for longer motifs, which may subsume the shorter ones. Indeed, the motif shown in Figure 3(d) contains both 6mers. Along with that, we find local maxima in the probability landscape, i.e. PWMs of $L_{w} \leq 9$ (see Figure 3). All motifs have the consensus TGCATG in common. For RBFOX2, we found no evidence for our model to converge beyond $L_{w} = 9$ . The posterior probability $\log \hat{P} (D)$ – (2.5) at the optimized parameters – serves as a measure to compare and rank different motifs at equal $L_{w}$ . The Bayesian Information Criterion (BIC)¹⁶ estimates the information content of every obtained local maximum,

(4.1)

$$\mathrm{BIC} = k \log n - 2 \log \hat{P} (D),$$

with the number of degrees of freedom $k = (4 - 1) L_{w} + 1$ (four nucleotides minus one for the normalization per position, one extra for $E_{0}$ ), and the number of data points $n = \sum_{S} n_{S} (L_{S} - L_{w} + 1)$ , i.e. the number of possible binding sites in the entire foreground pool. Closely related is the Akaike Information Criterion (AIC)¹⁷

(4.2)

$$\mathrm{AIC} = 2 k - 2 \log \hat{P} (D),$$

which penalizes additional parameters less strongly than the BIC whenever $\log n > 2$ . Both criteria rank the longer PWMs as having the higher information content, while we rather expect to find an optimum in information content with respect to $L_{w}$ . Therefore, we compare posterior probabilities only among equal $L_{w}$ in the following and leave the search for a comparison criterion among different $L_{w}$ for future work. Having found that the model retrieves the expected motif for a well-studied RBP, we sought to further explore its performance for others.
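Both criteria are straightforward to evaluate once $\log \hat{P} (D)$ is known. The Python sketch below implements (4.1) and (4.2) with the definitions of $k$ and $n$ given above; the toy read counts and the value of the log-likelihood are invented for illustration and do not correspond to any result reported here.

```python
import math


def n_parameters(l_w):
    """k = (4 - 1) L_w + 1: three free entries per PWM column plus E_0."""
    return (4 - 1) * l_w + 1


def n_data_points(counts, l_w):
    """n = sum_S n_S (L_S - L_w + 1): number of possible binding sites."""
    return sum(n * (len(S) - l_w + 1) for S, n in counts.items())


def bic(log_p, l_w, counts):
    """Eq. (4.1)."""
    return n_parameters(l_w) * math.log(n_data_points(counts, l_w)) - 2.0 * log_p


def aic(log_p, l_w):
    """Eq. (4.2)."""
    return 2.0 * n_parameters(l_w) - 2.0 * log_p


counts = {"ATGTC": 30, "CTGTA": 30}   # hypothetical foreground library
log_p_hat = -100.0                    # hypothetical optimized log-likelihood
print(bic(log_p_hat, 3, counts), aic(log_p_hat, 3))
```

For this toy input $k = 10$ and $n = 180$, so the BIC penalty $k \log n$ exceeds the AIC penalty $2 k$, as is the case whenever $\log n > 2$.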

Figure 3. Non-consensus PWMs of different $L_{w}$ for RBFOX2.

4.3 CELF1

CELF1 is an RBP of the CUG-binding CELF family.¹⁸ CELF1 participates in multiple steps of post-transcriptional processing of RNAs, including splicing, translation and decay,¹⁹ and requires UGU motifs for high-affinity interaction with RNAs.²⁰ The corresponding Bind’n Seq dataset⁸ consists of libraries generated for seven different RBP concentrations, each containing $\sim 2 \times 10^{7}$ reads of $L_{S} = 40$ . Since 40 runs with completely random PWM initialization for $L_{w} = 3, 4, 5$ did not yield any local optima of the probability landscape, we decided to test whether a biased initialization of the PWM with the known motif (UGU/TGT), which was found as an enriched 3-mer in RNA Bind’n Seq,⁸ enables the recovery of longer motifs. Indeed, our procedure yielded multiple extended versions of TGT (Figure 4) with $L_{w}$ up to $8$ . The context in which the short motif occurs is A/T-rich, and the presence of an A immediately upstream indicates that CELF1 could recognize the AUG start codon. This could be interesting in light of CELF1 being a translational regulator of epithelial-mesenchymal transition via the binding of both cap-binding EIF4E and the poly(A)-binding protein.²¹

Figure 4. PWMs of different $L_{w}$ for CELF1.

4.4 HNRNPD

Within the class of heterogeneous nuclear ribonucleoproteins (hnRNPs), hnRNPD (also known as AUF1) is a well-known A/U-rich element RNA binding protein with an important role in RNA decay.²² HNRNPD has been reported to bind clusters of AUUUA elements.²² The ENCODE database⁸ lists AUAAU as another possible binding site for hnRNPD. While entirely random initializations do not deliver any convergent runs, we recover both AUAAU and, with a smaller binding log-likelihood, AUUUA as binding sites. Building on these shorter motifs enables the discovery of UAAAU-containing longer motifs that can be extended up to $L_{w} = 14$ (see Figure 5), the greatest length for which we found convergent results. We also found PWMs with $L_{w} = 7, \dots, 13$ , which we omitted from Figure 5 since they are parts of the two $L_{w} = 14$ PWMs.

Figure 5. PWMs of different $L_{w}$ for HNRNPD.

The benchmark cases $L_{w} = 5$ are shown in (a) and (b), whereas (c) and (d) show the two motifs of $L_{w} = 14$ , the longest motifs that our algorithm found.

4.5 HNRNPK

We were also interested in determining whether we can recover G/C-rich binding motifs from the data and therefore applied the model to heterogeneous nuclear ribonucleoprotein K (HNRNPK), a member of the poly(C) binding family of proteins.²³ We could only recover one of the two consensus motifs reported in the ENCODE analysis of these data (GCCCA, from SKA⁸) when initialized with it. The second reported motif, with the CACGC consensus, could not be found by our algorithm even when the PWM was initialized with the motif itself and even when sequences containing the first motif were eliminated, indicating that this motif does not correspond to a local maximum of the likelihood function. We did not find any PWMs of $L_{w} > 5$ in this data set, whether we used random initialization or shorter motif-guided initialization.

4.6 Other RBPs

There are other proteins covered in the Bind’n Seq data⁸ whose specificity was studied before. For example, we analyzed the data corresponding to MBNL1,²⁴ hnRNPL,²⁵ FUS²⁶ and TAF15.²⁷ For these, our model did not deliver any convergent results, even if the PWM was directly initialized with the expected consensus motif. This indicates that the enrichment did not work equally well for all the RBPs studied with the Bind’n Seq method. Interestingly, expected motifs were identified for these proteins with another method, the so-called kmer-enhancement that underlies most of the consensus motifs reported in the ENCODE database.⁸ Kmer enhancements are computed by counting the number of occurrences of every possible kmer in the foreground samples (RBP concentration $\neq 0$ ) and in the background samples (RBP concentration $= 0$ ), and then normalizing the foreground abundances by the background abundances to extract the respective enhancement. The higher the enhancement of a given kmer, the higher the likelihood of it being bound by the RBP used in the experiment is thought to be. We computed these enhancements for 6mers, as done in the ENCODE studies. The results, shown in Figure 6, indicate that only RBFOX2 has a few highly enhanced 6mers with a clear hierarchy of enrichment, while all other investigated RBPs show a much flatter hierarchy of motif enhancements. An analysis of the Levenshtein distance of these motifs showed no clear difference in the pattern of distances among the leading motifs across the investigated RBPs. This suggests that these motifs correspond to many local maxima of comparable height, which precludes our algorithm from finding clear PWMs representing the binding sites. It is therefore unclear whether the specificity of these RBPs would be well represented as weight matrices, or whether another model, e.g. clusters of short, degenerate motifs, may better represent the specificity of these RBPs.
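The kmer enhancement described above can be sketched in a few lines of Python. The toy reads, the pseudocount and the normalization to relative frequencies are assumptions of this sketch, since the exact normalization used in the ENCODE analyses is not specified here; the function names are hypothetical.

```python
import math
from collections import Counter


def kmer_counts(reads, k):
    """Count all overlapping k-mers in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_enhancement(foreground, background, k, pseudocount=1.0):
    """Natural log of foreground over background k-mer frequency."""
    fg = kmer_counts(foreground, k)
    bg = kmer_counts(background, k)
    kmers = set(fg) | set(bg)
    fg_total = sum(fg.values()) + pseudocount * len(kmers)
    bg_total = sum(bg.values()) + pseudocount * len(kmers)
    return {
        kmer: math.log(((fg[kmer] + pseudocount) / fg_total)
                       / ((bg[kmer] + pseudocount) / bg_total))
        for kmer in kmers
    }


# Hypothetical reads: the foreground shares TGCATG, the background does not.
foreground = ["ATGCATGA", "CTGCATGT", "TGCATGCC"]
background = ["ACGTACGT", "TTGACCAG", "GGCATCAA"]

enh = kmer_enhancement(foreground, background, k=6)
ranked = sorted(enh, key=enh.get, reverse=True)
print(ranked[0], enh[ranked[0]])
```

Ranking the kmers by this quantity gives curves of the kind summarized in Figure 6. A single steep leading kmer, as in this toy example, corresponds to the RBFOX2-like situation, whereas a flat ranking offers little guidance for a PWM search.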

Figure 6. Logarithmic (base e) enhancement (foreground counts normalized by background counts) of all $4^{6} = 4096$ possible 6mers of all investigated RBPs, ranked by enhancement.

The top most enhanced motifs are RBFOX2: TGCATG, FUS: GCGCGC, hnRNPL: CACACA, MBNL1: GCTGCT, TAF15: GGGGGG, followed by variants thereof. All except the RBFOX2 motif are repeats of shorter oligomers.

5. Conclusion

We constructed a thermodynamic model that can be used to infer characteristic position weight matrices for the binding domains of RNA binding proteins from data obtained from affinity-based enrichment of oligonucleotides. Since we directly model the RBP-binding specificity as PWMs, our method bypasses arbitrary choices in the alignment of k-mers found to be individually enriched in the data. We evaluate our model on data in the public domain⁸ using expectation-maximization. For the benchmark case of RBFOX2, where very high-quality data is available, our model reproduces the known binding motif TGCATG, where the first position features an almost uniform superposition of T and A in our result. Subdominantly, we find another PWM of $L_{w} = 6$ as well as longer PWMs for RBFOX2. Unfortunately, our principled model does not robustly recover the binding motifs of other RBPs. For a few, e.g. CELF1, HNRNPD and HNRNPK, we can still recover the motifs as well as some longer variants if the search starts from a PWM closely matching the expected motif. However, for most of the data sets our model did not deliver any prediction. Rather, other motifs, e.g. poly(A), often show higher enrichment in the data than the expected motifs. This indicates that experimental details unrelated to the affinity of the RBP for oligomers affect the frequency of oligomer capture in the data, complicating its analysis and raising questions about the biophysical realism of the motifs derived from the data. The motifs of the RBPs for which we did not recover a PWM tend to be more degenerate than those of RBPs for which some motif emerged. They consist of repeated occurrences of mono-, di- or trinucleotides. It is likely that for these motifs, it is crucially important to construct an appropriate background model. How to best do this remains to be determined in future work.
Of note, while crosslinking and immunoprecipitation data is available for the proteins studied here, PWMs with enriched binding sites were also not recovered.²⁸ Thus, it will be interesting to explore models that allow more flexible spacing of RBP contact points on RNAs in the future.

Data availability

Underlying data

Table 1. ENCODE file accession IDs and data repository DOIs of RNA Bind’n Seq samples from Ref. 8 used for motif prediction.

| RBP | DOI of data repository | Samples with protein | Control samples |
|---|---|---|---|
| RBFOX2 | https://doi.org/10.17989/ENCSR441HLP | ENCFF002DHA, ENCFF002DHB, ENCFF002DHC, ENCFF002DHD, ENCFF002DHE, ENCFF002DHF, ENCFF002DHG, ENCFF002DHI | ENCFF002DHJ, ENCFF002DGZ |
| HNRNPD | https://doi.org/10.17989/ENCSR175OMA | ENCFF002DFU, ENCFF002DFV, ENCFF002DFW, ENCFF002DFX, ENCFF002DFY | ENCFF002DFT, ENCFF002DFZ |
| MBNL1 | https://doi.org/10.17989/ENCSR006QKZ | ENCFF002DGQ, ENCFF002DGR, ENCFF002DGS, ENCFF002DGT, ENCFF002DGU, ENCFF002DGV, ENCFF002DGW, ENCFF002DGX | ENCFF002DGO, ENCFF002DGY |
| HNRNPC | https://doi.org/10.17989/ENCSR569UIU | ENCFF083JEO, ENCFF161YRQ, ENCFF235WCJ, ENCFF304NRR, ENCFF366SWO | ENCFF866DQR, ENCFF658BYB |
| TAF15 | https://doi.org/10.17989/ENCSR827QYL | ENCFF032XJW, ENCFF206DZS, ENCFF304HMY, ENCFF662TOD, ENCFF686CWM | ENCFF325RIN, ENCFF160CZF |
| CELF1 | https://doi.org/10.17989/ENCSR992NHR | ENCFF002DFL, ENCFF002DFM, ENCFF002DFN, ENCFF002DFO, ENCFF002DFP, ENCFF002DFQ, ENCFF002DFR | ENCFF002DFK, ENCFF002DFS |
| HNRNPK | https://doi.org/10.17989/ENCSR368NMO | ENCFF004OKN, ENCFF261ZHZ, ENCFF281CNE, ENCFF674RCB, ENCFF732XFP | ENCFF537ORU |
| HNRNPL | https://doi.org/10.17989/ENCSR954TYO | ENCFF004ECZ, ENCFF066AFW, ENCFF086FCD, ENCFF407TKA, ENCFF929YDP | ENCFF395BBW |
| FUS | https://doi.org/10.17989/ENCSR936LOF | ENCFF375AIG, ENCFF448YEM, ENCFF680AGJ, ENCFF692LOC, ENCFF739HXK | ENCFF441MLG, ENCFF871MJA |

Software availability

Source code available from: https://git.scicore.unibas.ch/zavolan_group/pipelines/bind-n-seq-pwms

Archived source code at time of publication: https://doi.org/10.5281/zenodo.8028034 ²⁹

License: MIT

Acknowledgements

We would like to thank Erik van Nimwegen for useful conversations and fruitful suggestions. Calculations were performed at sciCORE (http://scicore.unibas.ch/) scientific computing core facility at University of Basel.

References

1. Lunde BM, Moore C, Varani G: RNA-binding proteins: modular design for efficient function. Nat. Rev. Mol. Cell Biol. 2007; 8: 479–490.
2. Kazan H, Ray D, Chan ET, et al.: RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins. PLoS Comput. Biol. 2010; 6: e1000832.
3. Weirauch MT, Cote A, Norel R, et al.: Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 2013; 31: 126–134.
4. Hentze MW, Castello A, Schwarzl T, et al.: A brave new world of RNA-binding proteins. Nat. Rev. Mol. Cell Biol. 2018; 19: 327–341.
5. Imig J, Brunschweiger A, Brümmer A, et al.: miR-CLIP capture of a miRNA targetome uncovers a lincRNA H19-miR-106a interaction. Nat. Chem. Biol. 2015; 11: 107–114.
6. Lambert N, Robertson A, Jangi M, et al.: RNA Bind-n-Seq: quantitative assessment of the sequence and structural binding specificity of RNA binding proteins. Mol. Cell. 2014; 54: 887–900.
7. Omidi S, Zavolan M, Pachkov M, et al.: Automated incorporation of pairwise dependency in transcription factor binding site prediction using dinucleotide weight tensors. PLoS Comput. Biol. 2017; 13: 1.
8. Luo Y, Hitz BC, Gabdank I, et al.: New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 2020; 48: D882–D889.
9. Shannon CE: A mathematical theory of communication. Bell Syst. Tech. J. 1948; 27: 379–423.
10. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B Methodol. 1977; 39: 1–22.
11. van Nimwegen E: Finding regulatory elements and regulatory motifs: a general probabilistic framework. BMC Bioinformatics. 2007; 8 Suppl 6: S4.
12. Schlusser N: Bind’n Seq PWMs. 2022.
13. Ponthier JL, Schluepen C, Chen W, et al.: Fox-2 splicing factor binds to a conserved intron motif to promote inclusion of protein 4.1R alternative exon 16. J. Biol. Chem. 2006; 281: 12468–12474.
14. Van Nostrand EL, Pratt GA, Shishkin AA, et al.: Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nat. Methods. 2016; 13: 508–514.
15. Lambert NJ, Robertson AD, Burge CB: RNA Bind-n-Seq: Measuring the Binding Affinity Landscape of RNA-Binding Proteins. Methods Enzymol. 2015; 558: 465.
16. Schwarz G: Estimating the Dimension of a Model. Ann. Stat. 1978; 6: 461.
17. Akaike H: A new look at the statistical model identification. IEEE Trans. Autom. Control. 1974; 19: 716–723.
18. Ladd AN, Charlet N, Cooper TA: The CELF family of RNA binding proteins is implicated in cell-specific and developmentally regulated alternative splicing. Mol. Cell. Biol. 2001; 21: 1285–1296.
19. Dembowski JA, Grabowski PJ: The CUGBP2 splicing factor regulates an ensemble of branchpoints from perimeter binding sites with implications for autoregulation. PLoS Genet. 2009; 5: e1000595.
20. Marquis J, Paillard L, Audic Y, et al.: CUG-BP1/CELF1 requires UGU-rich sequences for high-affinity binding. Biochem. J. 2006; 400: 291–301.
21. Chaudhury A, Pal R, Kongchan N, et al.: CELF1 is an EIF4E binding protein that promotes translation of epithelial-mesenchymal transition effector mRNAs. bioRxiv. 2019.
22. Xu N, Chen CY, Shyu AB: Versatile role for hnRNP D isoforms in the differential regulation of cytoplasmic mRNA turnover. Mol. Cell. Biol. 2001; 21: 6960–6971.
23. Swanson MS, Dreyfuss G: Classification and purification of proteins of heterogeneous nuclear ribonucleoprotein particles by RNA-binding specificities. Mol. Cell. Biol. 1988; 8: 2237–2241.
24. Miller JW, Urbinati CR, Teng-Umnuay P, et al.: Recruitment of human muscleblind proteins to (CUG)(n) expansions associated with myotonic dystrophy. EMBO J. 2000; 19: 4439–4448. Publisher Full Text
25. Hahm B, Cho OH, Kim JE, et al.: Polypyrimidine tract-binding protein interacts with HnRNP L. FEBS Lett. 1998; 425: 401–406. PubMed Abstract | Publisher Full Text
26. Iko Y, Kodama TS, Kasai N, et al.: Domain architectures and characterization of an RNA-binding protein, TLS. J. Biol. Chem. 2004; 279: 44834–44840. PubMed Abstract | Publisher Full Text
27. Wang Z, Morris GF, Rice AP, et al.: Wild-type and transactivation-defective mutants of human immunodeficiency virus type 1 Tat protein bind human TATA-binding protein in vitro. J. Acquir. Immune Defic. Syndr. Hum. Retrovirol. 1996; 12: 128–138. PubMed Abstract | Publisher Full Text
28. Katsantoni M, van Nimwegen E , Zavolan M: Improved analysis of (e) CLIP data with RCRUNCH yields a compendium of RNA-binding protein binding sites and motifs. Genome Biol. 2023; 24: 77. PubMed Abstract | Publisher Full Text | Free Full Text
29. Schlusser N: PWMs from RNA Bind’n’Seq data (1.0). Zenodo. 2023. Publisher Full Text


Competing interests

No competing interests were disclosed.

Grant information

This work was funded by the Swiss National Fund under grant number 310030_204517.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Article Versions (2)

version 2

Revised

Published: 29 May 2024, 12:742

https://doi.org/10.12688/f1000research.135164.2

version 1

Published: 26 Jun 2023, 12:742

https://doi.org/10.12688/f1000research.135164.1

© 2023 Schlusser N and Zavolan M. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


how to cite this article

Schlusser N and Zavolan M. On the limits of inferring biophysical parameters of RBP-RNA interactions from in vitro RNA Bind’n Seq data [version 1; peer review: 1 approved with reservations, 1 not approved]. F1000Research 2023, 12:742 (https://doi.org/10.12688/f1000research.135164.1)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.


Open Peer Review

Version 1


Reviewer Report 22 Nov 2023

Jun Zhang, Texas Tech University Health Science Center El Paso (TTUHSCEP), El Paso, USA

Not Approved

https://doi.org/10.5256/f1000research.148267.r215390

The manuscript presents a thermodynamic model for scrutinizing RNA motifs associated with RNA-binding proteins, utilizing position-specific weight matrices. Although the method effectively characterizes the RNA-binding specificity of RBFOX2, its application to other RNA-binding proteins, such as CELF1, HNRNPD, and HNRNPK, proves unsuccessful. The manuscript, categorized as a method paper, lacks comprehensive details about the model's development and fails to adequately address why the method specifically succeeds for RBFOX2.

Key areas of improvement are outlined below:

Unclear Definition of Parameter PIP:
- The manuscript lacks a clear definition of the parameter PIP. Providing a concise explanation will enhance reader understanding and facilitate the application of the method.
Determination of Parameter c and its Relation to Protein Concentrations:
- Clarify how the parameter c is determined and elaborate on its relationship with the concentrations of RNA-binding proteins. A more in-depth explanation will contribute to the method's transparency.
Thermodynamic Model Explanation:
- Offer a more detailed explanation of the thermodynamic model to eliminate the necessity for readers to consult the original reference. This will enhance accessibility and comprehension for a wider audience.
Derivation of Equation 2.1:
- Clearly outline the derivation of Equation 2.1 to provide readers with insights into the model's foundational principles. This will aid in understanding the method's inner workings.
Computation Time Information:
- Include information about the expected computation time, as this is crucial for users assessing the feasibility of implementing the method in their own studies.
Evidence for Data Quality Variation:
- While the manuscript attributes the method's failure for CELF1, HNRNPD, and HNRNPK to lower data quality, provide concrete evidence supporting this claim. Offer a detailed analysis of the data quality disparities between RBFOX2 and the other proteins.
Consideration of Variable Spacer Length in RNA-Binding Proteins:
- Acknowledge the fact that many RNA-binding proteins, including CELF1, HNRNPD, and HNRNPK, bind to RNA with variable spacer lengths. The manuscript should discuss why the PWM method, assuming a fixed RNA motif length, may be inadequate for proteins with multiple RNA-binding domains connected by long linkers. This acknowledgment will help users understand the method's limitations and guide appropriate applications.
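
The spacer-length concern in the last point can be illustrated with a minimal sketch. It assumes a purely hypothetical bipartite site made of two `UGU` half-sites separated by a variable number of nucleotides; the function name `find_bipartite_sites`, the half-sites, and the toy reads are illustrative assumptions, not taken from the manuscript or its code. A single fixed-width PWM would have to commit to one spacer length, whereas the sketch enumerates all spacers up to a maximum.

```python
import re

def find_bipartite_sites(seq, half_a, half_b, max_spacer):
    """Return (start, spacer_length) for each half_a ... half_b pair
    separated by 0..max_spacer arbitrary nucleotides."""
    sites = []
    # A lookahead finds every (possibly overlapping) occurrence of half_a.
    for match in re.finditer(f"(?={re.escape(half_a)})", seq):
        start = match.start()
        after_a = start + len(half_a)
        for spacer in range(max_spacer + 1):
            if seq.startswith(half_b, after_a + spacer):
                sites.append((start, spacer))
    return sites

# Toy reads (hypothetical), each carrying two UGU half-sites.
reads = ["AAUGUCCUGUAA", "AAUGUGGGGGUGUAA", "UGUUGUCCCC"]
for read in reads:
    print(read, find_bipartite_sites(read, "UGU", "UGU", max_spacer=6))
```

In these toy reads the two half-sites are separated by 2, 5, and 0 nucleotides, respectively, so no single contiguous motif of fixed width covers all three sites.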

In conclusion, addressing these points will significantly enhance the manuscript's clarity, transparency, and utility, providing a more comprehensive understanding of the developed method and its limitations.

Is the rationale for developing the new method (or application) clearly explained?
Partly

Is the description of the method technically sound?
Yes

Are sufficient details provided to allow replication of the method development and its use by others?
No

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
No source data required

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: RNA-binding protein, structural biology.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.


Author Response 23 Jul 2024

Niels Schlusser

Dear Reviewer,

We apologize for the delay in submitting the answer to your valuable comments. The delay was caused by a technical issue with the web portal.
Thank you for the insightful comments and questions related to our paper, which we have
taken into careful consideration in revising our manuscript. We believe the manuscript is
much improved as a result. Below we comment on the most significant changes.

As a result of the revised terminology and calculation of the frequency priors f_S, some of the comments have become superfluous. Nevertheless, the main comments were addressed as follows:

1. We clarified the definition of P_IP in and around Eq. (2.4), and the notation has changed.

2. Around Eq. (2.1), we elaborated on the role of c, the ratio of the concentrations of bound and unbound RBP.

3. We explained the derivation of the thermodynamic model in greater detail (cf. the comments above), so that the reader no longer needs to consult the cited reference [1].

4. We also gave more detail around Eq. (2.1) (cf. item 2).

5. Information about the computation time is given in the first paragraph of Sec. 4, “Results”.

6. Regarding the evidence for data quality variation: we have applied the model to the entire set of RBPs assayed by RNA Bind’n Seq and, with the revised prior calculation, we recovered the consensus motifs for 48 out of 82 RBPs. In response to the comments of reviewer #1, we also tried initialization with the consensus motif or enriched k-mers, but the results did not improve for the 34 proteins for which random initialization failed to recover a meaningful, non-poly(A) motif. Thus, our model is, in principle, able to recover the expected binding motifs.

7. The reviewer suggested that the PWM model may be inadequate for many RBPs that bind to discontinuous motifs. While a large body of prior work uses the PWM representation of RBP binding sites, there are known examples where structure plays a role. How to appropriately represent structured binding sites is an open problem that we think is beyond the scope of our work. Nevertheless, we added a comment in the discussion about this possibility as an explanation for some of our results.

We hope we have adequately responded to your comments.

Yours sincerely,
The authors

References
[1] S. Omidi, M. Zavolan, M. Pachkov, J. Breda, S. Berger, and E. van Nimwegen, Automated incorporation of pairwise dependency in transcription factor binding site prediction using dinucleotide weight tensors, PLOS Computational Biology 13 (2017) 1.
Competing Interests: No competing interests were disclosed.

Reviewer Report 31 Aug 2023

Johannes Söding, Campus-Institut Data Science (CIDAS), Georg-August-Universität Göttingen, Göttingen, Lower Saxony, Germany; Quantitative and Computational Biology, Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany

Approved with Reservations

https://doi.org/10.5256/f1000research.148267.r192424

The manuscript describes a method to learn position-weight matrices (PWM) to describe the sequence specific binding of an RBP given Bind'n'seq data for this RBP. The method is based on a biophysically motivated, statistical model for the likelihood of Bind'n'seq data given the PWM to describe the sequence specific binding of the RBP to the library DNA oligomers. The model parameters (PWM and non-specific binding energy E₀) are learned using an expectation maximization algorithm. The method is applied to Bind'n'seq data for 9 RBPs. For some RBPs correct PWMs are retrieved. For others, the most likely PWM does not describe the known binding motif but is rather an unrelated, low-complexity motif that appears to reflect simple sequence biases in the experiment.

The statistical model is very sound and the manuscript is well written. The results are disappointing, in particular since it is not clear whether the negative results for a good part of the data sets are due to the inadequacy of the statistical model or some bug in the implementation.

1. My guess is the following. The method uses a Markov model of order d to model the prior probability P(S) = f_S for a sequence S to be part of the background library (before enrichment). It was found that "the better the prediction of likelihood of sequences in the foreground samples." This led the authors to set d=14. A Markov model of order d=14 has 3*4^14 ~ 10^9 parameters, around the same number as 14-mers in the entire sequencing library of around 10^7 reads. This means the Markov model is hopelessly over-parameterized, and the reason that the likelihood was increasing for increasing d is simply that the overlap between the (d+1)-mers in the background and foreground libraries decreases with increasing d. A reasonable choice of d is around 4, certainly not more than 8. The overfitting of the prior probability model could explain the many weird local optima observed by the authors, since it adds enormous noise to the sequence space such that a single mutation can change f_S dramatically, resulting in many spurious PWM minima.

2. The notation in equations (2.1) to (2.4) is quite "unstatistical" and confusing. The way (2.1) and (2.3) are written, the left hand side would need to sum to one when summing over all sequences s or S, respectively. This is of course not the case. What the authors rather meant to write on the left hand side of eq. (2.1) is P(bound | s,c,M,E₀), with a variable bound in {0,1} indicating whether the sequence on the right side of the conditioning is bound by the RBP or not. The left hand side of eq. (2.3) should read P(bound | S,c,M,E₀).

With this corrected notation and using f_S = P(S) for the prior probability of finding a sequence S in the background library (with ∑_S P(S) = 1), equation (2.4) turns into the correct Bayes theorem,

P(S | bound,c,M,E₀) = P(bound | S,c,M,E₀) P(S) / ∑_σ P(bound | σ,c,M,E₀) P(σ) .

Please correct the text accordingly:
"(2.4) is essentially a formulation of Bayes’ theorem with conditional probability P_b(S|c,M,E₀) of having a read S bound by an RBP, the likelihood of finding a read S in the pool washed over the RBP, fS, and an overall normalization (denominator)."
=>
"(2.4) is essentially a formulation of Bayes’ theorem with the likelihood P(bound | S,c,M,E₀) of having a read S bound by an RBP, the likelihood of finding a read S in the pool washed over the RBP, P(S)=f_S, and an overall normalization in the denominator."

3. To improve the search for the global optimum, the authors could compute the most highly enriched 6-mers and start their PWM optimization from a soft version of each of the top 20 or so 6-mers.

4. Please derive the EM update equations transparently.
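
The Bayes form written out in comment 2 can be checked numerically with a minimal sketch. The binding probabilities and priors below are made-up illustrative numbers, and the function name `posterior_given_bound` is an assumption for this example; it is not the authors' implementation. The sketch shows that P(bound | S) need not sum to one over sequences, while the posterior P(S | bound) does.

```python
def posterior_given_bound(p_bound_given_seq, prior):
    """Bayes' theorem over a finite set of sequences:
    P(S | bound) = P(bound | S) P(S) / sum_sigma P(bound | sigma) P(sigma)."""
    if set(p_bound_given_seq) != set(prior):
        raise ValueError("both inputs must cover the same sequences")
    joint = {s: p_bound_given_seq[s] * prior[s] for s in prior}
    evidence = sum(joint.values())  # normalization in the denominator
    return {s: value / evidence for s, value in joint.items()}

# Hypothetical prior f_S = P(S) in the background pool (sums to one).
prior = {"UGCAUG": 0.2, "AAAAAA": 0.5, "CCCCCC": 0.3}
# Hypothetical binding probabilities P(bound | S); these do NOT sum to one.
p_bound = {"UGCAUG": 0.9, "AAAAAA": 0.1, "CCCCCC": 0.05}

posterior = posterior_given_bound(p_bound, prior)
print(sum(p_bound.values()))    # not constrained to equal 1
print(sum(posterior.values()))  # normalized over sequences
print(posterior)
```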

Is the rationale for developing the new method (or application) clearly explained?
Yes

Is the description of the method technically sound?
Yes

Are sufficient details provided to allow replication of the method development and its use by others?
No

If any results are presented, are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions about the method and its performance adequately supported by the findings presented in the article?
No

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Biophysics, statistical modeling

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.


Author Response 21 Jun 2024

Niels Schlusser

Dear reviewer,

Thank you for the insightful comments and questions related to our paper, which we have
taken into careful consideration in revising our manuscript. We believe the manuscript is
much improved as a result. Below we comment on the most significant changes.

Addressing your feedback:

1. You suggested that we do not over-parameterize the background model, but
rather use a Markov model of order 4−8. We followed this suggestion, setting d = 4 and
adapting the normalization of frequency priors accordingly. While implementing this
suggestion we also noticed a mistake in the initial implementation of the normalization,
which likely contributed to higher order background giving improved results initially.
As a result of these changes we found non-trivial motifs for a larger proportion of the
RBPs, which led us to apply the model to all RBP RNA Bind’n Seq datasets in ENCODE
(111 in total).

2. You also pointed out some sloppiness in our initial notation, which we have
now corrected, along with the text referring to the respective equation (2.4).

3. We did try various ways of initializing the PWM from enriched kmers and consensus
motifs, but the number of cases where this led to the convergence to the expected motif,
while the random initialization did not, was very small.

4. We elaborated more on the derivation of the EM update equations, in particular, we
give the derivative whose root is calculated.
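
The order-d background model discussed in item 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the revised implementation: the function names, the add-one pseudocount, the toy reads, and the choice to condition on the first `order` bases rather than model them are all assumptions made for this example. The helper `n_free_parameters` reproduces the reviewer's counting argument, 3 * 4^d free parameters for order d.

```python
import math
from collections import defaultdict

ALPHABET = "ACGU"

def n_free_parameters(order):
    """Each of the 4**order contexts has 4 probabilities summing to one,
    hence 3 free parameters per context."""
    return 3 * 4 ** order

def train_markov(reads, order, pseudocount=1.0):
    """Estimate P(base | preceding `order` bases) from background reads."""
    counts = defaultdict(lambda: defaultdict(float))
    for read in reads:
        for i in range(order, len(read)):
            counts[read[i - order:i]][read[i]] += 1.0

    def cond_prob(context, base):
        ctx = counts.get(context, {})  # unseen contexts fall back to uniform
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return cond_prob

def log_prob(seq, cond_prob, order):
    """Log-probability of seq, conditioning on its first `order` bases."""
    return sum(math.log(cond_prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

# Toy background reads (hypothetical); a real library has millions of reads.
background = ["ACGUACGUACGU", "UUUUACGUAAAA"]
model = train_markov(background, order=2)

print(n_free_parameters(4), n_free_parameters(14))
print(log_prob("ACGUACGU", model, order=2))
```

With d = 4 the model has 768 free parameters, compared with roughly 8 * 10^8 for d = 14, which makes the over-parameterization argument concrete.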

We hope we have adequately responded to your comments.

Yours sincerely,
Mihaela Zavolan and Niels Schlusser
Competing Interests: No competing interests were disclosed.

Open Peer Review

Reviewer Status

Reviewer Reports

	Invited Reviewers
	1	2	3	4
Version 2 (revision) 29 May 24	read	read	read	read
Version 1 26 Jun 23	read	read

Johannes Söding, Georg-August-Universität Göttingen, Göttingen, Germany; Max Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
Jun Zhang, Texas Tech University Health Science Center El Paso (TTUHSCEP), El Paso, USA
Junbai Wang, University of Oslo, Campus AHUS/Oslo, Norway
Andreas Schlundt, University of Greifswald, Greifswald, Germany

Reviewer Report

24 Jul 2024 | for Version 2

Jun Zhang, Texas Tech University Health Science Center El Paso (TTUHSCEP), El Paso, USA

Approved

The revision has addressed the issues in the first version.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

RNA-binding protein, structural biology.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

22 Jul 2024 | for Version 2

Andreas Schlundt, Institute for Biochemistry, University of Greifswald, Greifswald, Germany

Approved With Reservations

This manuscript by Schlusser and Zavolan provides a bioinformatic approach to infer PWM representations in motifs targeted by RNA-binding proteins. It uses available RBnS data from the ENCODE database. Its suggested strength and improvement over the canonical way to infer consensus motifs in RBnS is that it bypasses the prior definition of kmers for the alignment procedure of enriched sequences.
As a result the authors present convergent and specific motifs for 82 RBPs (out of 111 in the database). For 48 of these there is complete and for 14 partial agreement with previously reported motifs considering the RBnS database as the correct reference.

The current manuscript version has undergone a prior round of revision with visible improvement. Although I am not an expert in the bioinformatic background, I can nevertheless clearly see the (functionally relevant) motivation of the approach and highly appreciate it in light of the need to exploit the in-depth information of HTS-derived motifs of RBPs beyond the current standards. Still, I feel the approach fails to address, or at least discuss, the most obvious challenge (which could be implemented in the longer run?): the majority of RBPs use more than one domain for interacting with RNA, and individual domains may contribute with orders-of-magnitude differences in affinity to the combined specific target motif, while only this combined motif is the functionally relevant, specific one. While this is an ongoing challenge in the interpretation of all CLIP and RBnS data, I wonder how much the approach presented here may address this issue more realistically.
Together with this, here are the concrete points I suggest addressing and clarifying:
1) The authors themselves mention the possible role of spaced elements targeted by multiple domains within one RBP. Similar to RBnS, a proper analysis in this direction is challenging, but could e.g. include the consideration of multiple motifs. Have the authors thought about this? I suggest a bit more discussion in this direction. Mixed motifs with weighted contributions from relative affinities of domains will remain a major point to address in the future.

2) In line with above: Considering multiple binding sites of an RBP on an oligo may mean more than one RBP binding to it OR multiple domains of one RBP binding to one combined element. I may not have fully understood how the estimation of Kd values according to the formula (2.6) may take this into consideration?

3) Why (page 5) is RBP binding assumed to be usually of low affinity in order to simplify the equation? Many RBPs bind in the low nM range, which is pretty strong, be it specific or not.

4) Is there any particular way to treat motifs that are apparently part of an RNA structure (Rc3h1)? This should be highly converging and specific regarding the central 5mer, but likely not in a larger context (i.e. the stem beneath the 3-5 nt loop). There may be fewer such structured motifs according to our expectations, but on the other hand there could be more of them hidden in the RBnS data, and we just neglect their fold context.

5) For the less converging/specific RBPs: is there a correlation with the number of domains present? And adding to that: seeing the motifs shown in Fig. 2, some seem to show large differences between the low vs. high K_D. Could this correlate with the number of RBDs in the RBP, such that the high-affinity ones come from the strong-binding domains and the others from the subordinate domains? And: could the occurrence of polyA also relate to the number of domains and thus too many parallel motifs (e.g. for the IGF2BPs)? Apparently exactly those would need the definition of a motif cluster, while current approaches merely provide a weighted motif, which in the end is not precisely the right one for any of the six domains.

6) Have the authors specifically taken into consideration the different concentrations of RBPs in the RBnS data? Esp. when there are multiple concentrations tested? Sorry if I have missed this.

7) Could the authors in the end provide a fair discussion of how their inferred motifs should now be seen relative to RBnS motifs? Do they suggest questioning the RBnS logos (which in the end could still be "wrong"), or do they just claim to have provided a faster, more efficient, and user-friendly way to derive logos from RBnS data?

Is the rationale for developing the new method (or application) clearly explained?

Yes
Is the description of the method technically sound?

Yes
Are sufficient details provided to allow replication of the method development and its use by others?

Partly
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Structural biochemistry/biophysics of multi-domain RNA-binding proteins and their target RNAs centered around solution methods NMR and SAXS

Reviewer Report

09 Jul 2024 | for Version 2

Junbai Wang, University of Oslo, Campus AHUS/Oslo, Norway

Not Approved

The manuscript by Schlusser N and Zavolan M describes a biophysical model for inferring motifs of RNA-binding proteins. The authors describe the derivation from the log-likelihood (LL) function to the proposed EM algorithm for predicting enriched RNA binding motifs in sequences. The method was tested on RNA Bind’n Seq data from ENCODE, with success on 7 RNA-binding proteins but failure on the other 4 RNA-binding proteins mentioned in the main text. In total, the authors tested the method on 111 RNA-binding datasets; 82 of them yielded results, of which 48 show good agreement and 34 show either marginal or poor agreement. In summary, the success rate of this new method on the 111 datasets is only around 43% (48 of 111), which is very low compared to many public tools for DNA/RNA motif analysis. There are three major problems in the manuscript:

1) The theory behind the proposed biophysical model for motif finding is very unclear to me. The authors claim that their method is based on the Boltzmann machine, which has been used for many decades in protein-DNA interaction theory: for example, in a classical biophysics paper published in 1987 by Berg OG et al. (Ref 1), in several applications to DNA sequence analysis by Djordjevic M et al., 2003 (Ref 2) and Foat BC et al., 2006 (Ref 3), and in the more advanced Fermi-Dirac form of protein-DNA interactions, which considers protein concentration with a Bayesian solution for DNA motif finding, by Wang J et al., 2009 (Ref 4) and Yang M et al., 2023 (Ref 5). From my point of view, there is no clear association between the authors’ sections (2.1, 2.2) and the biophysical theory of the aforementioned works, which have been applied successfully to numerous datasets and applications. In particular, I am completely lost when the authors write “combine (2.9) with (2.5), the output of the optimization procedure, and the reference (2.10) allows us to compute the logarithm of the dissociation constant of RBP-RNA binding …” in section 2.2, because I do not understand the reason for combining these two equations. After going through sections 3.1 and 3.2, I feel that the proposed method is actually analogous to k-mer motif enrichment tests such as that of Ghandi M et al., 2014 (Ref 6).

2) The manuscript describes motif prediction for RNA-binding proteins, but all of the figures show DNA binding motifs, where T should be replaced by U for RNA sequences. It is very weird that the paper describes RNA-binding proteins but all sequence logos are DNA binding motifs. The authors have to correct these errors in all sequence logo figures. As we all know, methods for motif finding in DNA and RNA sequences are almost the same, but the prediction results (e.g., DNA-binding motif, RNA-binding motif) are not the same in biology.

3) I am not able to compile and run the cpp code that the authors provided on GitHub, nor can I find demo input data to reproduce their results. The authors need to provide compiled cpp binaries (e.g., Linux and Mac versions) as well as relevant demo data so that readers can reproduce some of the predictions in the manuscript.
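
Regarding the T/U issue raised above, relabeling a logo from the DNA to the RNA alphabet leaves the inferred weights untouched; only the T column is renamed to U. The following is a minimal Python sketch, assuming a PWM stored as a list of per-position dictionaries. This layout and the function names are hypothetical and chosen for illustration; they are not the format used by the authors' cpp code, and the numbers are made up.

```python
def dna_to_rna_pwm(pwm):
    """Return a copy of the PWM with the 'T' column relabeled as 'U'.

    `pwm` is assumed to be a list of dicts mapping A/C/G/T to probabilities
    (a hypothetical layout used only for this illustration).
    """
    return [
        {("U" if base == "T" else base): weight for base, weight in column.items()}
        for column in pwm
    ]


def consensus(pwm):
    """Most probable base at each position, joined into a string."""
    return "".join(max(column, key=column.get) for column in pwm)


# Toy two-position matrix (made-up numbers, not taken from the manuscript).
toy_pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
rna_pwm = dna_to_rna_pwm(toy_pwm)
```

The input matrix is not modified, so the same weights can be rendered with either alphabet.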

Is the rationale for developing the new method (or application) clearly explained?

No
Is the description of the method technically sound?

No
Are sufficient details provided to allow replication of the method development and its use by others?

No
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

No
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

No

References

1. Berg OG, von Hippel PH: Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol. 1987; 193 (4): 723-50.
2. Djordjevic M, Sengupta AM, Shraiman BI: A biophysical approach to transcription factor binding site discovery. Genome Res. 2003; 13 (11): 2381-90.
3. Foat BC, Morozov AV, Bussemaker HJ: Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics. 2006; 22 (14): e141-9.
4. Wang J, Morigen: BayesPI - a new model to study protein-DNA interactions: a case study of condition-specific protein binding parameters for Yeast transcription factors. BMC Bioinformatics. 2009; 10: 345.
5. Yang M, Ali O, Bjørås M, Wang J: Identifying functional regulatory mutation blocks by integrating genome sequencing and transcriptome data. iScience. 2023; 26 (8): 107266.
6. Ghandi M, Lee D, Mohammad-Noori M, Beer MA: Enhanced regulatory sequence prediction using gapped k-mer features. PLoS Comput Biol. 2014; 10 (7): e1003711.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Theoretical Physics, Computational biology, Bioinformatics, Applied Mathematics, Algorithm

Reviewer Report

05 Jul 2024 | for Version 2

Approved With Reservations

I am glad to see that the method appears to work well after the adjustment to the background model.

1. What other methods or software tools exist that can analyze Bind’n’seq data?

2. How does your tool compare (quantitatively) with others to analyze Bind’n’seq data?

Minor comments:

3. Equation (2.4): P(IG|S,c,M,E_0) = … => P(S | bound, c, M, E₀) =
As a simple general rule, when you sum over all possible outcomes of all variables on the left of the conditioning bar |, you have to get 1. Here, summing over all possible sequences S in your library gives 1, thanks to the normalization in the denominator from Bayes’ rule.

Same notation in the text a few lines below the equation.

4. A bit below: P(S) = f_s is *not* a likelihood but a prior probability.
5. In the conclusion, you wrote “In addition, it is possible that the binding sites of these proteins are not contiguous, linear motifs, but rather contain variable length spacers of form structures such as G-quadruplexes.” You could cite these manuscripts here:
(Jolma A, et al., 2020 [Ref 1]), (Sohrabi-Jahromi S, 2021 [Ref 2])
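
The normalization rule in comment 3 can be checked numerically. The Python sketch below assumes a generic Bayes-rule form, P(S | bound) = P(bound | S) f_S / Σ_S' P(bound | S') f_S', on a toy library with made-up numbers. It is not the manuscript's Eq. (2.4), whose exact form is not reproduced here, and the function name is hypothetical.

```python
def posterior_over_sequences(p_bound_given_s, prior):
    """Bayes' rule: P(S | bound) = P(bound | S) * f_S / sum_S' P(bound | S') * f_S'."""
    joint = {s: p_bound_given_s[s] * prior[s] for s in prior}
    evidence = sum(joint.values())  # P(bound), the normalizing denominator
    return {s: joint[s] / evidence for s in joint}


# Toy library of three sequences (illustrative values only).
prior = {"GCAUG": 0.2, "AAAAA": 0.5, "CUCUC": 0.3}            # f_S, sums to 1
p_bound_given_s = {"GCAUG": 0.9, "AAAAA": 0.1, "CUCUC": 0.2}  # need not sum to 1

posterior = posterior_over_sequences(p_bound_given_s, prior)
total_left = sum(posterior.values())         # sum over S, the variable left of the bar
total_right = sum(p_bound_given_s.values())  # sum over S when S is right of the bar
```

Summing the posterior over S gives 1 because S stands to the left of the conditioning bar, whereas summing P(bound | S) over S is not constrained to 1, which is why the notation matters.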

References

1. Jolma A, Zhang J, Mondragón E, Morgunova E, et al.: Binding specificities of human RNA-binding proteins toward structured and linear RNA sequences. Genome Res. 2020; 30 (7): 962-973.
2. Sohrabi-Jahromi S, Söding J: Thermodynamic modeling reveals widespread multivalent binding by RNA-binding proteins. Bioinformatics. 2021; 37 (Suppl_1): i308-i316.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Biophysics, statistical modeling

Reviewer Report

22 Nov 2023 | for Version 1

Jun Zhang, Texas Tech University Health Science Center El Paso (TTUHSCEP), El Paso, USA

Not Approved

Unclear Definition of Parameter PIP:
- The manuscript lacks a clear definition of the parameter PIP. Providing a concise explanation will enhance reader understanding and facilitate the application of the method.
Determination of Parameter c and its Relation to Protein Concentrations:
- Clarify how the parameter c is determined and elaborate on its relationship with the concentrations of RNA-binding proteins. A more in-depth explanation will contribute to the method's transparency.
Thermodynamic Model Explanation:
- Offer a more detailed explanation of the thermodynamic model to eliminate the necessity for readers to consult the original reference. This will enhance accessibility and comprehension for a wider audience.
Derivation of Equation 2.1:
- Clearly outline the derivation of Equation 2.1 to provide readers with insights into the model's foundational principles. This will aid in understanding the method's inner workings.
Computation Time Information:
- Include information about the expected computation time, as this is crucial for users assessing the feasibility of implementing the method in their own studies.
Evidence for Data Quality Variation:
- While the manuscript attributes the method's failure for CELF1, HNRNPD, and HNRNPK to lower data quality, provide concrete evidence supporting this claim. Offer a detailed analysis of the data quality disparities between RBFOX2 and the other proteins.
Consideration of Variable Spacer Length in RNA-Binding Proteins:
- Acknowledge the fact that many RNA-binding proteins, including CELF1, HNRNPD, and HNRNPK, bind to RNA with variable spacer lengths. The manuscript should discuss why the PWM method, assuming a fixed RNA motif length, may be inadequate for proteins with multiple RNA-binding domains connected by long linkers. This acknowledgment will help users understand the method's limitations and guide appropriate applications.

Is the rationale for developing the new method (or application) clearly explained?

Partly
Is the description of the method technically sound?

Yes
Are sufficient details provided to allow replication of the method development and its use by others?

No
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

No source data required
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

RNA-binding protein, structural biology.

Author Response

23 Jul 2024

Niels Schlusser,

Dear Reviewer,

We apologize for the delay in submitting the answer to your valuable comments. The delay was caused by a technical issue with the web portal.
Thank you for the insightful comments and questions related to our paper, which we have
taken into careful consideration in revising our manuscript. We believe the manuscript is
much improved as a result. Below we comment on the most significant changes.

As a result of the revised terminology and calculation of frequency priors f_S, some of the
comments have become superfluous. Nevertheless, the main comments were addressed as
follows:

1. We clarified the definition of P_IP in and around Eq. (2.4), and the notation has changed.
2. Around Eq. (2.1), we elaborated on the role of c, the ratio of concentrations of bound and unbound RBP.
3. We explained the derivation of the thermodynamic model in greater detail (cf. the comments above) so that the reader does not need to consult the cited reference [1].
4. We also gave more detail around Eq. (2.1) (cf. item 2).
5. Information about the computation time is given in the first paragraph of Sec. 4, “Results”.
6. Regarding the evidence for data quality variation: we have applied the model to the entire set of RBPs assayed by RNA Bind’n Seq and, with the revised prior calculation, we recovered the consensus motifs for 48 out of 82 RBPs. While in response to the comments of reviewer #1 we also tried initialization with the consensus motif/enriched kmers, the results for the 34 proteins for which random initialization did not lead to the recovery of a meaningful, non-poly(A) motif have not improved. Thus, our model is, in principle, able to recover the expected binding motifs.
7. The reviewer suggested that the PWM model may be inadequate for many RBPs that bind to discontinuous motifs. While a large body of prior work uses the PWM representation of RBP binding sites, there are known examples where structure plays a role. How to appropriately represent structured binding sites is an open problem that we think is beyond the scope of our work. Nevertheless, we added a comment to the discussion about this possibility as an explanation for some of our results.

We hope we have adequately responded to your comments.

Yours sincerely,
The authors

References
[1] S. Omidi, M. Zavolan, M. Pachkov, J. Breda, S. Berger, and E. van Nimwegen, Automated
incorporation of pairwise dependency in transcription factor binding site prediction using
dinucleotide weight tensors, PLOS Computational Biology 13 (2017) 1.

Competing Interests

No competing interests were disclosed.

Reviewer Report

31 Aug 2023 | for Version 1

Approved With Reservations

Is the rationale for developing the new method (or application) clearly explained?

Yes
Is the description of the method technically sound?

Yes
Are sufficient details provided to allow replication of the method development and its use by others?

No
If any results are presented, are all the source data underlying the results available to ensure full reproducibility?

Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article?

No

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Biophysics, statistical modeling

Author Response

21 Jun 2024

Niels Schlusser,

Dear reviewer,

Thank you for the insightful comments and questions related to our paper, which we have
taken into careful consideration in revising our manuscript. We believe the manuscript is
much improved as a result. Below we comment on the most significant changes.

Addressing your feedback:

1. You suggested that we do not over-parameterize the background model, but
rather use a Markov model of order 4−8. We followed this suggestion, setting d = 4 and
adapting the normalization of frequency priors accordingly. While implementing this
suggestion we also noticed a mistake in the initial implementation of the normalization,
which likely contributed to higher-order background models giving improved results initially.
As a result of these changes we found non-trivial motifs for a larger proportion of the
RBPs, which led us to apply the model to all RNA Bind’n Seq datasets in ENCODE
(111 in total).

2. You also pointed out some sloppiness in our initial notation, which we have
now corrected, along with the text referring to the respective equation (2.4).

3. We did try various ways of initializing the PWM from enriched kmers and consensus
motifs, but the number of cases where this led to the convergence to the expected motif,
while the random initialization did not, was very small.

4. We elaborated more on the derivation of the EM update equations; in particular, we
give the derivative whose root is calculated.
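
As an illustration of point 1, a Markov background model of order d estimates the probability of each nucleotide given the preceding d nucleotides and scores a sequence as the product of these conditionals. The Python sketch below is a generic version of this idea and not the implementation in the published C++ code: the pseudocount, the uniform treatment of the first d positions, the toy reads, and all function names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import product
from math import exp, log

ALPHABET = "ACGU"


def train_markov_background(sequences, order, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from background reads."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def conditional(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return conditional


def log_prior(seq, conditional, order):
    """log f_S: uniform over the first `order` bases, Markov chain afterwards."""
    logp = min(order, len(seq)) * log(1.0 / len(ALPHABET))
    for i in range(order, len(seq)):
        logp += log(conditional(seq[i - order:i], seq[i]))
    return logp


# Toy background reads (made up); d = 2 here for brevity instead of d = 4.
reads = ["ACGUACGUAC", "UUUUACGGCA", "GGCAUGCAUG"]
d = 2
cond = train_markov_background(reads, d)

# The priors over all sequences of a fixed length should sum to 1.
total = sum(exp(log_prior("".join(s), cond, d)) for s in product(ALPHABET, repeat=4))
```

With this construction the priors are normalized over all sequences of a given length by design, which is the property the normalization of the frequency priors has to guarantee.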

We hope we have adequately responded to your comments.

Yours sincerely,
Mihaela Zavolan and Niels Schlusser

Competing Interests

No competing interests were disclosed.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

1. Lunde BM, Moore C, Varani G: RNA-binding proteins: modular design for efficient function. Nat. Rev. Mol. Cell Biol. 2007; 8: 479–490.
2. Kazan H, Ray D, Chan ET, et al.: RNAcontext: a new method for learning the sequence and structure binding preferences of RNA-binding proteins. PLoS Comput. Biol. 2010; 6: e1000832.
3. Weirauch MT, Cote A, Norel R, et al.: Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 2013; 31: 126–134.
4. Hentze MW, Castello A, Schwarzl T, et al.: A brave new world of RNA-binding proteins. Nat. Rev. Mol. Cell Biol. 2018; 19: 327–341.
5. Imig J, Brunschweiger A, Brümmer A, et al.: miR-CLIP capture of a miRNA targetome uncovers a lincRNA H19-miR-106a interaction. Nat. Chem. Biol. 2015; 11: 107–114.
6. Lambert N, Robertson A, Jangi M, et al.: RNA Bind-n-Seq: quantitative assessment of the sequence and structural binding specificity of RNA binding proteins. Mol. Cell. 2014; 54: 887–900.
7. Omidi S, Zavolan M, Pachkov M, et al.: Automated incorporation of pairwise dependency in transcription factor binding site prediction using dinucleotide weight tensors. PLoS Comput. Biol. 2017; 13: 1.
8. Luo Y, Hitz BC, Gabdank I, et al.: New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Res. 2020; 48: D882–D889.
9. Shannon CE: A mathematical theory of communication. Bell Syst. Tech. J. 1948; 27: 379–423.
10. Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B Methodol. 1977; 39: 1–22.
11. van Nimwegen E: Finding regulatory elements and regulatory motifs: a general probabilistic framework. BMC Bioinformatics. 2007; 8 Suppl 6: S4.
12. Schlusser N: Bind’n Seq PWMs. 2022.
13. Ponthier JL, Schluepen C, Chen W, et al.: Fox-2 splicing factor binds to a conserved intron motif to promote inclusion of protein 4.1R alternative exon 16. J. Biol. Chem. 2006; 281: 12468–12474.
14. Van Nostrand EL, Pratt GA, Shishkin AA, et al.: Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nat. Methods. 2016; 13: 508–514.
15. Lambert NJ, Robertson AD, Burge CB: RNA Bind-n-Seq: Measuring the Binding Affinity Landscape of RNA-Binding Proteins. Methods Enzymol. 2015; 558: 465.
16. Schwarz G: Estimating the Dimension of a Model. Ann. Stat. 1978; 6: 461.
17. Akaike H: A new look at the statistical model identification. IEEE Trans. Autom. Control. 1974; 19: 716–723.
18. Ladd AN, Charlet N, Cooper TA: The CELF family of RNA binding proteins is implicated in cell-specific and developmentally regulated alternative splicing. Mol. Cell. Biol. 2001; 21: 1285–1296.
19. Dembowski JA, Grabowski PJ: The CUGBP2 splicing factor regulates an ensemble of branchpoints from perimeter binding sites with implications for autoregulation. PLoS Genet. 2009; 5: e1000595.
20. Marquis J, Paillard L, Audic Y, et al.: CUG-BP1/CELF1 requires UGU-rich sequences for high-affinity binding. Biochem. J. 2006; 400: 291–301.
21. Chaudhury A, Pal R, Kongchan N, et al.: Celf1 is an eif4e binding protein that promotes translation of epithelial-mesenchymal transition effector mrnas. bioRxiv. 2019.
22. Xu N, Chen CY, Shyu AB: Versatile role for hnRNP D isoforms in the differential regulation of cytoplasmic mRNA turnover. Mol. Cell. Biol. 2001; 21: 6960–6971.
23. Swanson MS, Dreyfuss G: Classification and purification of proteins of heterogeneous nuclear ribonucleoprotein particles by RNA-binding specificities. Mol. Cell. Biol. 1988; 8: 2237–2241.
24. Miller JW, Urbinati CR, Teng-Umnuay P, et al.: Recruitment of human muscleblind proteins to (CUG)(n) expansions associated with myotonic dystrophy. EMBO J. 2000; 19: 4439–4448.
25. Hahm B, Cho OH, Kim JE, et al.: Polypyrimidine tract-binding protein interacts with HnRNP L. FEBS Lett. 1998; 425: 401–406.
26. Iko Y, Kodama TS, Kasai N, et al.: Domain architectures and characterization of an RNA-binding protein, TLS. J. Biol. Chem. 2004; 279: 44834–44840.
27. Wang Z, Morris GF, Rice AP, et al.: Wild-type and transactivation-defective mutants of human immunodeficiency virus type 1 Tat protein bind human TATA-binding protein in vitro. J. Acquir. Immune Defic. Syndr. Hum. Retrovirol. 1996; 12: 128–138.
28. Katsantoni M, van Nimwegen E, Zavolan M: Improved analysis of (e)CLIP data with RCRUNCH yields a compendium of RNA-binding protein binding sites and motifs. Genome Biol. 2023; 24: 77.
29. Schlusser N: PWMs from RNA Bind’n’Seq data (1.0). Zenodo. 2023.

