Keywords
RNA Binding Proteins, CLIP, RBP, Web-server
This article is included in the Bioinformatics gateway.
RNA binding proteins (RBPs) are key players in a broad spectrum of RNA regulation, including all stages of the RNA lifecycle (Gerstberger et al. 2014; Gebauer et al. 2021). Eukaryotic genomes typically encode hundreds of RBPs. For example, over 1500 human RBPs involved in the maturation, transport, stability and translation of coding and non-coding RNA were recently characterised and manually curated (Gerstberger et al. 2014). Each RBP can typically target hundreds of RNAs in a complex coordinated fashion (Hogan et al. 2008). The general transcriptomic locations of thousands of RNA binding sites corresponding to hundreds of RBPs have been identified using a family of experimental techniques based on RBP CrossLinking, ImmunoPrecipitation and Sequencing (CLIP-Seq) (Chi et al. 2009; Licatalosi et al. 2008; Moore et al. 2014; Hafner et al. 2010). A thorough in vitro exploration of the binding characteristics of tens of RBPs has shown that RBPs can differentiate their binding sites through context preferences that go beyond a narrowly defined sequence motif and secondary structure, often involving complex binding configurations (Dominguez et al. 2018). Among the experimental techniques available to date, the enhanced CLIP (eCLIP) protocol (Van Nostrand et al. 2016) is particularly important, as it significantly reduces the required amplification and increases specificity in identifying authentic binding sites. This improves the efficiency and accuracy of RBP binding site identification and allows for a deeper understanding of the role of RBPs in gene regulation.
A large number of data points produced with the same experimental technique is very beneficial for applications such as machine learning, as it allows researchers to train and test models with more confidence. Here we present RBP-Tar (Raček 2023), a centralised and searchable database of experimentally identified RBP binding sites that can significantly facilitate the study of RBP-mediated gene regulation. Using RBP-Tar, researchers can quickly and cleanly retrieve RBP binding sites constrained by both genomic location and associated RBP for hundreds of RBPs.
Pipeline for reproducible data download and annotation
We have developed a reproducible and easy-to-use pipeline for downloading and annotating RBP eCLIP data from the ENCODE (Luo et al. 2020) database.
Metadata for eCLIP experiments, as well as additional files containing genomic coordinates are downloaded from the ENCODE database with the following parameters:
status=released&
internal_tags=ENCORE&
assay_title=eCLIP&
biosample_ontology.term_name=K562&
biosample_ontology.term_name=HepG2&
files.file_type=bed+narrowPeak&
type=Experiment
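As a minimal sketch of how such a query string can be assembled programmatically, the following Python snippet encodes the parameters listed above with the standard library. The base URL is the ENCODE metadata endpoint quoted in the data availability statement; the constant and function names are illustrative and are not taken from the pipeline's scripts.

```python
from urllib.parse import urlencode

# Metadata endpoint as quoted in the data availability statement.
ENCODE_METADATA_URL = "https://www.encodeproject.org/metadata/"

# A list of pairs (rather than a dict) preserves the repeated cell-line key.
QUERY_PARAMS = [
    ("status", "released"),
    ("internal_tags", "ENCORE"),
    ("assay_title", "eCLIP"),
    ("biosample_ontology.term_name", "K562"),
    ("biosample_ontology.term_name", "HepG2"),
    ("files.file_type", "bed narrowPeak"),  # the space is encoded as "+"
    ("type", "Experiment"),
]


def build_metadata_url(base=ENCODE_METADATA_URL, params=QUERY_PARAMS):
    """Return the metadata URL with the parameters encoded as a query string."""
    return base + "?" + urlencode(params)


query_url = build_metadata_url()
print(query_url)
```

Keeping the parameters in one list makes it straightforward to add further filters, such as the two additional ones that appear in the full URL given in the data availability statement.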
Following the download, information about the chromosome, start position, end position and strand of reads are extracted for each RBP binding site. Binding sites are filtered by length, excluding ones shorter than 20 and longer than 100 nucleotides. As a last step, genomic sequences of binding sites are retrieved.
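The extraction and length-filtering steps can be sketched as follows. This is an illustration and not the pipeline's actual code: it assumes the standard narrowPeak column order (chromosome, start and end in the first three columns, strand in the sixth), treats the 20 and 100 nucleotide bounds as inclusive, and uses made-up peak records. Sequence retrieval from the reference genome is omitted.

```python
import csv
import io

MIN_LEN, MAX_LEN = 20, 100  # bounds in nucleotides; inclusivity is an assumption


def filter_binding_sites(lines, min_len=MIN_LEN, max_len=MAX_LEN):
    """Yield (chrom, start, end, strand) for peaks whose length is within bounds."""
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and optional header/comment lines.
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        chrom, start, end, strand = row[0], int(row[1]), int(row[2]), row[5]
        # BED intervals are half-open, so the length is simply end - start.
        if min_len <= end - start <= max_len:
            yield chrom, start, end, strand


# Made-up narrowPeak records, for illustration only.
toy_peaks = (
    "chr1\t100\t150\tpeakA\t1000\t+\t3.2\t5.1\t-1\t-1\n"   # length 50, kept
    "chr1\t200\t210\tpeakB\t1000\t-\t3.0\t4.8\t-1\t-1\n"   # length 10, dropped
    "chr2\t300\t450\tpeakC\t1000\t+\t4.1\t6.0\t-1\t-1\n"   # length 150, dropped
    "chr2\t500\t520\tpeakD\t1000\t-\t3.7\t5.5\t-1\t-1\n"   # length 20, kept
)

sites = list(filter_binding_sites(io.StringIO(toy_peaks)))
print(sites)
```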
The described pipeline is implemented as a set of Python scripts and is freely available on GitHub.
Using this pipeline, data for 167 RBPs on two cell lines (K562, HepG2) were downloaded. In total, 1.8 Gb of data, representing more than 25 million binding sites, were thus processed.
Web server
RBP-Tar is a web server that provides access to the curated dataset of RBP binding sites described above (https://ncbr.muni.cz/RBP-Tar). It was built with Python (RRID:SCR_008394), the Flask web development framework, and a simple SQLite database (RRID:SCR_017672). The application's source code can be found on the project's GitHub page, along with the requirements and deployment instructions for users who want to run the application locally.
The web user interface allows searching and filtering based on the start and end position of the binding site, strand, chromosome, and protein name. The filtered data resulting from a search can then be downloaded. Due to the size of the dataset, the view is limited to 10 000 results. However, the whole dataset can be conveniently downloaded as a gzipped CSV (579 MB) from https://rbp-tar.ncbr.muni.cz/download_all.
The RBP-Tar web server can take as input any of the following user-provided parameters: [Start min, Start max] denote the limits of the low genomic coordinate of the locus of interest. Similarly, [End min, End max] denote the limits of the high coordinate. [Strand] and [Chromosome] can be used to narrow down the search to only one strand and a specific chromosome. Using these combinations of parameters, a user can easily search for binding sites on their favourite gene, exon, or even a whole chromosome. The last parameter is [Protein name], which brings out a drop-down menu of all the RBPs in the database. If this parameter is not set, all RBPs are queried.
Results are shown as a table with the [Chromosome, Start, End, Strand] genomic location of the binding site, followed by the [Protein name] of the associated RBP and the [Sequence] column, which contains the genomic sequence of the binding site. Results can be viewed online in table format or downloaded as a CSV file with a single click. We expect most users to download the results and use them for further downstream analyses.
The first and potentially most common use case would be the query of all known RBP binding sites on a gene of interest. For example, we can query our web server with the coordinates of the Fused in Sarcoma (FUS) gene (chr16: 31180139-31191605, +) and leave the protein field empty. After this search, all 5 045 known RBP binding sites on this gene are returned and can be easily downloaded in a CSV file for further analysis (Figure 1).
In fact, the gene we used encodes the RBP FUS, which plays important roles in, among others, neurodegeneration and cancer progression. We can use the [Protein name] filter to look for potential self-targeting of FUS. Indeed, this query returns 72 potential FUS self-targeting binding sites identified via eCLIP (Grešová & Raček 2023) (Figure 2). Of course, assessing the biological relevance of this type of finding is left to the users, as is the further validation of binding sites derived from high-throughput experiments.
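For users who run the application locally, the two FUS queries above could be expressed against a SQLite database roughly as sketched below. The table name, column names and toy rows are hypothetical and do not reflect the actual RBP-Tar schema or data; how the gene coordinates map onto the search fields (gene start as Start min, gene end as End max) is likewise an assumption.

```python
import sqlite3


def search_sites(conn, start_min=None, start_max=None, end_min=None,
                 end_max=None, strand=None, chromosome=None, protein=None,
                 limit=10_000):
    """Return binding sites matching every filter that is not None."""
    conditions = [
        ("start_pos >= ?", start_min),
        ("start_pos <= ?", start_max),
        ("end_pos >= ?", end_min),
        ("end_pos <= ?", end_max),
        ("strand = ?", strand),
        ("chromosome = ?", chromosome),
        ("protein_name = ?", protein),
    ]
    active = [(clause, value) for clause, value in conditions if value is not None]
    sql = ("SELECT chromosome, start_pos, end_pos, strand, protein_name, sequence "
           "FROM binding_sites")
    if active:
        sql += " WHERE " + " AND ".join(clause for clause, _ in active)
    sql += " ORDER BY chromosome, start_pos LIMIT ?"
    # Values are passed as bound parameters, never interpolated into the SQL.
    return conn.execute(sql, [value for _, value in active] + [limit]).fetchall()


# Hypothetical schema and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE binding_sites (chromosome TEXT, start_pos INTEGER, "
             "end_pos INTEGER, strand TEXT, protein_name TEXT, sequence TEXT)")
conn.executemany("INSERT INTO binding_sites VALUES (?, ?, ?, ?, ?, ?)", [
    ("chr16", 31180200, 31180250, "+", "FUS", "ACGT"),
    ("chr16", 31185000, 31185060, "+", "OTHER_RBP", "GGCC"),
    ("chr16", 31185000, 31185060, "-", "FUS", "TTAA"),
    ("chr16", 31200000, 31200050, "+", "FUS", "CCGG"),
    ("chr1", 31180200, 31180250, "+", "FUS", "AATT"),
])

gene = dict(start_min=31180139, end_max=31191605, strand="+", chromosome="chr16")
all_sites = search_sites(conn, **gene)                 # protein field left empty
fus_sites = search_sites(conn, protein="FUS", **gene)  # FUS sites on the FUS gene
print(len(all_sites), len(fus_sites))
```

The default limit of 10 000 rows mirrors the cap on the online view; leaving the protein argument unset queries all RBPs, as in the web interface.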
A potential user may want to develop a machine learning tool that predicts RBP binding sites from sequence. For such a tool, it is important to make sure that the training and testing sets do not overlap. Using our web server, the user can download all binding sites for a specific RBP on, for example, chromosome 1 and use them as a training set (Figure 3). They can then do the same for chromosome 2 and use those sites as an independent testing set, thus ensuring that the training and testing sets do not overlap.
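The same split can also be made locally from a downloaded CSV file. The sketch below assumes column names ("chromosome", "protein_name", and so on) and chromosome labels that may differ from those in the actual export, and uses made-up rows in place of a real download.

```python
import csv
import io


def split_by_chromosome(rows, train_chroms, test_chroms, protein=None):
    """Split binding-site rows into training and testing sets by chromosome."""
    train_chroms, test_chroms = set(train_chroms), set(test_chroms)
    if train_chroms & test_chroms:
        raise ValueError("training and testing chromosomes must be disjoint")
    train, test = [], []
    for row in rows:
        if protein is not None and row["protein_name"] != protein:
            continue
        if row["chromosome"] in train_chroms:
            train.append(row)
        elif row["chromosome"] in test_chroms:
            test.append(row)
    return train, test


# Made-up rows with assumed column names, for illustration only.
toy_csv = (
    "chromosome,start,end,strand,protein_name,sequence\n"
    "chr1,100,150,+,FUS,ACGT\n"
    "chr1,900,960,-,FUS,GGCC\n"
    "chr1,400,450,+,OTHER_RBP,TTAA\n"
    "chr2,200,240,+,FUS,CCGG\n"
    "chr3,300,350,-,FUS,AATT\n"
)

rows = csv.DictReader(io.StringIO(toy_csv))
train, test = split_by_chromosome(rows, {"chr1"}, {"chr2"}, protein="FUS")
print(len(train), len(test))
```

Rejecting overlapping chromosome sets up front guards against accidentally placing the same binding site in both sets.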
Recent advancements in experimental techniques, such as eCLIP, have greatly expanded our understanding of RBP binding preferences and their role in gene regulation. The development of centralised and searchable databases of experimentally identified RBP binding sites allows researchers to access and analyse the binding preferences of RBPs easily. This information can be used to identify known binding sites on genes of interest and aid in training machine-learning models for RBP binding site prediction. This paper presents RBP-Tar, a centralised web server that can retrieve RBP Target sites with location and RBP constraints. RBP-Tar has been designed to be easily accessible by non-experts. It is still confined to a single source of data, which is helpful for avoiding experimental design effects, but makes its scope limited. We plan to expand the web server with other sources of data, as well as ways for the user to be able to take into account provenance and experimental variation.
All data used in RBP-Tar were downloaded from the ENCODE and Ensembl projects in April 2022. RBP eCLIP metadata were downloaded from https://www.encodeproject.org/metadata/?status=released&internal_tags=ENCORE&assay_title=eCLIP&biosample_ontology.term_name=K562&biosample_ontology.term_name=HepG2&files.file_type=bed+narrowPeak&type=Experiment&files.analyses.status=released&files.preferred_default=true. A list of all downloaded files containing genomic coordinates (hundreds of filenames) can be found at https://github.com/ML-Bioinfo-CEITEC/rbp_encode_eclip/blob/main/csv/coord_links.txt. The reference genome was downloaded from http://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
Columns ‘chrom’, ‘chromStart’, ‘chromEnd’ and ‘strand’ from downloaded files containing genomic coordinates can be found here: https://github.com/ML-Bioinfo-CEITEC/rbp_encode_eclip/tree/main/csv. The whole dataset (curated and including sequence) can be downloaded from https://rbp-tar.ncbr.muni.cz/ as a gzipped CSV (579 MB) (https://rbp-tar.ncbr.muni.cz/download_all).
Software: https://ncbr.muni.cz/RBP-Tar
Source code: https://github.com/sb-ncbr/RBP-Tar/tree/v1.0.0
Archived source code at time of publication: https://doi.org/10.5281/zenodo.7807678 (Raček 2023)
License: MIT
Source code: https://github.com/ML-Bioinfo-CEITEC/rbp_encode_eclip
Archived source code at the time of publication: https://doi.org/10.5281/zenodo.7802803 (Grešová & Raček 2023)
License: Apache 2.0
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
No
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Epigenetics, Chromatin analysis, Gene regulation, Single cell
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
References
1. Gerstberger S, Hafner M, Tuschl T: A census of human RNA-binding proteins. Nat Rev Genet. 2014; 15 (12): 829-45
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: The authors need solve some issues before published
Version 1: 27 Jun 23
Version 2 (revision): 12 Aug 24
Version 3 (revision): 25 Nov 24