RVDB-prot, a reference viral protein database and its HMM profiles

We present RVDB-prot, a database corresponding to the protein equivalent of the nucleic acid reference virus database RVDB. Protein databases can be helpful to perform more sensitive protein sequence comparisons. Like its homologous public repository, RVDB-prot aims to provide reliable and accurately annotated unique entries, and it also includes a database of Hidden Markov Model (HMM) protein profiles for distant protein searching.


Introduction
Sequence assignation often uses similarity criteria to infer homology, and hence taxonomy and/or protein function. Searching for this similarity requires reliable, accurate and comprehensive databases. When trying to characterize sequences present in a metagenomics sample, searching first for related sequences in a viral database can rapidly identify a known virus (high identity between the query sequence and the one in the database), or point to potential new species (low identity with any known sequence). Such hits must be further characterized on more comprehensive databases to increase the robustness of taxonomic assignations.
In the specific field of viruses, several solutions are available but their ability to provide valid results is highly dependent on the goal of the study and on the available computer resources. Using a database with a high number of sequences, such as NCBI nr/nt may seem appropriate, but it implies an increased computation time and annotation quality is not always optimal. Similarly, UniProtKB 1 contains numerous viral sequences (4 497 049 in total, including 17 008 (0.38%) reviewed) that could, as for NCBI/nr, increase computation time when thousands of sequences have to be analyzed concomitantly, which is routinely practiced in metagenomics analyses. RefSeq, on the other hand, is generally better curated but contains only full-length genomes, which reduces the diversity of available sequences, and also rarely includes the latest discoveries. RefSeq contains 13 180 virus sequences. Other specialized databases provide only specific groups of taxa for specific purposes, for instance, virus families responsible for infectious diseases like HIV or influenza viruses.
Thus, the need for a better, well-annotated and comprehensive public viral database that can be used for the identification of viruses by high-throughput sequencing led Goodacre et al. to propose their Reference Viral DataBase (RVDB) 2 . This database consists of a collection of all currently known viral genomes and virus-related nucleic sequences retrieved from NCBI/nr or RefSeq and includes a specific reviewing process, both manual and computational, as well as four updates of the contents per year. The reviewing process eliminates a great quantity of unwanted non-viral sequences such as cloning vectors, endogenous sequences, and sequences that were wrongly annotated as viral but were actually of cellular origin. This high level of curation makes RVDB quite attractive for the virology research community; in June 2020, version 19.0 was released.
Since viral genomes mainly consist of coding sequences, the need for an equivalent reference database that provides the protein version of these sequences may prove quite advantageous.
Indeed, protein sequences are useful when searching for distant homologs: their substitution rates are much lower than those of nucleic sequences. Additionally, proteins can be efficiently clustered according to their similarity, and the resulting clusters can then be used to build Hidden Markov Model (HMM) profiles in order to identify more evolutionarily distant proteins. In fact, programs like HMMER 3 allow the building of HMM profiles from a multiple sequence alignment of proteins. Such a profile can then help recognize proteins based on complex position-specific models of sequence conservation and evolution, and it does so more accurately than a classic sequence alignment. Therefore, we propose a protein sequence version of RVDB whose update will be synchronized with the original nucleotide RVDB release. Here we describe the conversion from the nucleotide version of RVDB to the protein version RVDB-prot, as well as the clustering process leading to the HMM profiles.

Methods
Conversion from RVDB nucleic database to RVDB-prot
The current version of RVDB, v19.0 4 , consists of a collection of 3 084 319 nucleic sequences 2 . The accession numbers were extracted in order to gather the corresponding database entries in Genbank format. From these entries, the corresponding coding domain protein sequences, descriptions, and protein accession numbers were automatically recognized and copied into the protein collection. The process relies on the amino acid sequences and information provided initially in the nucleic entry annotations. The resulting protein file contains the nucleic sequence reference, for traceability purposes. The sequence names are formatted in the following way: where:
p_bank is the bank in which the protein can be found
p_acc is the accession number corresponding to the protein sequence
n_bank is the bank in which the original nucleotide sequence was found
n_acc is the original information found in the nucleic database

Amendments from Version 1
This second version takes into account the comments of the reviewers. We also updated the metrics and the available files to reflect version 19.0 of RVDB. The code of the pipeline itself was updated. In the introduction, we exemplified usages of the database, introduced the UniProtKB virus database, and explained why the curation of RVDB is so important. In the methods, we detailed where the protein sequences coded by nucleic sequences are found. We specified that we used the default parameters of two programs used by the pipeline (Silix and HMMER). Regarding the database annotation, we clarified how each cluster is described, and we added a new feature: annotation keywords are not only gathered from sequence descriptions, but now also from PFAM clusters matching the cluster sequences.

descr is the description of the protein sequence as found in the database entry
sp is the species name.
This process produces a file containing 4 705 359 protein sequences.
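As an illustration of the conversion step, the following minimal sketch reads the translations already present in the CDS features of a Genbank flat-file entry, using only the standard `/protein_id`, `/product` and `/translation` qualifiers. It is not the code of the RVDB-prot pipeline: the function names `iter_cds_blocks` and `extract_cds_proteins` are hypothetical, and the parsing is deliberately simplified (for instance, escaped double quotes inside a qualifier value are not handled).

```python
import re

# Standard qualifiers of a Genbank CDS feature that are of interest here.
_QUALIFIER = re.compile(r'/(protein_id|product|translation)="([^"]*)"')


def iter_cds_blocks(genbank_text):
    """Yield the text of each CDS feature found in a Genbank flat-file entry."""
    block = None  # lines of the CDS feature being accumulated, or None
    for line in genbank_text.splitlines():
        # A feature key starts at column 6; qualifier lines are indented further.
        is_feature_start = line.startswith("     ") and len(line) > 5 and line[5] != " "
        if is_feature_start or not line.startswith(" "):
            # A new feature or a new top-level section closes the current block.
            if block is not None:
                yield "\n".join(block)
            is_cds = is_feature_start and line.split()[0] == "CDS"
            block = [line] if is_cds else None
        elif block is not None:
            block.append(line)
    if block is not None:
        yield "\n".join(block)


def extract_cds_proteins(genbank_text):
    """Return one dict per CDS feature that carries a /translation qualifier."""
    proteins = []
    for block in iter_cds_blocks(genbank_text):
        qualifiers = dict(_QUALIFIER.findall(block))
        if "translation" not in qualifiers:
            continue  # e.g. pseudogenes have no translation to copy
        proteins.append({
            "protein_id": qualifiers.get("protein_id", ""),
            # Qualifier values may wrap over several lines: normalise whitespace.
            "product": " ".join(qualifiers.get("product", "").split()),
            "sequence": "".join(qualifiers["translation"].split()),
        })
    return proteins
```

Reading the translations directly from the nucleic entries avoids a second lookup in a protein database, and keeps the link between each protein and its nucleotide accession.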

Generation of HMM profiles
The HMM generation rationale was inspired by vFam (the database of HMM profiles built from all the viral proteins present in RefSeq, discontinued since 2014) 5 , but was entirely recoded as a Snakemake pipeline 6 , using different tools for some key steps (clustering, alignment). The protein sequences were clustered with a 100% identity criterion to remove duplicates using CD-Hit 4.7.0 7 . Then, the sequences were processed using Blast 2.2.26 8 , performing an all-against-all comparison. These comparisons allowed Silix 1.2.6 9 (using default parameters) to define clusters of sequences according to their similarity. This step produced a text file in which each sequence was associated with one cluster. Each cluster containing at least four sequences was then written to a fasta file containing all the sequences within the cluster. Then, sequences were aligned using Mafft 7.023 10 in auto mode. The multiple sequence alignments were processed by HMMER 3.2.1 3 (hmmbuild, using default parameters) in order to obtain the HMM profiles. The HMM profiles were finally grouped into a single file.
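The step that turns the clustering result into per-cluster fasta files can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline code: the input is assumed to hold one `cluster_id sequence_id` pair per line (the column order is an assumption), and the names `read_clusters`, `cluster_fasta` and `MIN_CLUSTER_SIZE` are hypothetical.

```python
from collections import defaultdict

MIN_CLUSTER_SIZE = 4  # the text keeps clusters containing at least four sequences


def read_clusters(lines):
    """Group sequence identifiers by cluster.

    Each non-empty line is assumed to hold 'cluster_id sequence_id',
    separated by whitespace (assumed column order).
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        cluster_id, seq_id = line.split()[:2]
        clusters[cluster_id].append(seq_id)
    return clusters


def cluster_fasta(clusters, sequences, min_size=MIN_CLUSTER_SIZE):
    """Return {cluster_id: fasta text} for clusters with at least min_size members."""
    fasta = {}
    for cluster_id, members in clusters.items():
        if len(members) < min_size:
            continue  # clusters that are too small are not turned into profiles
        fasta[cluster_id] = "".join(
            f">{member}\n{sequences[member]}\n" for member in members
        )
    return fasta
```

Each fasta text produced this way would then be aligned and passed to hmmbuild, as described above.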

Annotation of HMM profiles
A cluster is defined as a set of sequences, among which each sequence is characterized by its taxonomy (i.e. a virus species) and is associated with a description of its putative function, when it is known. In order to describe the different clusters, this information and other indicators (such as the cluster length and number of sequences) are combined into an annotation database, in SQLite format. The schema of this database is shown in Figure 1.
The first type of data associated to a cluster is a set of keywords describing the putative function of the proteins present in a given cluster. These keywords correspond to the union of all names of the significant sequences found in PFAM 11 (with --cut_ga parameter which tells HMMER to trust the cutoff defined by PFAM) using all the sequences of the cluster as queries, weighted according to their frequencies, and excluding trivial words. We also produce a complementary word frequency count using sequence descriptions. These keywords are stored separately from the PFAM ones. Despite the fact that sequence descriptions can be vague or inaccurate, they are a good fallback in case the cluster had no match with any PFAM one.
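The keyword counting can be illustrated with a minimal sketch. The exact weighting and the list of trivial words used by the pipeline are not given in the text, so both are assumptions here: `TRIVIAL_WORDS` is a hypothetical stop list, and each keyword is simply weighted by its relative frequency among the retained words.

```python
import re
from collections import Counter

# Hypothetical stop list; the words actually excluded by the pipeline are not specified.
TRIVIAL_WORDS = {"protein", "putative", "hypothetical", "of", "the", "and"}


def keyword_frequencies(names):
    """Return {keyword: relative frequency} over a list of names or descriptions."""
    counts = Counter()
    for name in names:
        # Lower-case the text and split it on anything that is not a letter or digit.
        for word in re.findall(r"[a-z0-9]+", name.lower()):
            if word not in TRIVIAL_WORDS:
                counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders keywords from the most to the least frequent.
    return {word: count / total for word, count in counts.most_common()}
```

The same function could be applied once to the PFAM names and once to the sequence descriptions, keeping the two keyword sets separate as described above.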
In addition to the protein description, the database stores the virus taxonomy associated with all the taxa, referring to NCBI TaxIDs. For each cluster, the taxonomic information is summarized by a Last Common Ancestor (LCA) that corresponds to the taxon in the tree of life to which all the sequence taxa belong; this LCA can be close to the root of the tree (Viruses), but is usually specific to a family.
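A naive LCA of this kind can be sketched as the longest prefix shared by the lineages of all the sequences in a cluster. The sketch below assumes that each lineage is given as a list of taxon names ordered from the root; the function name `last_common_ancestor` is hypothetical, and the lineages in the usage example are simplified for illustration.

```python
def last_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages, or None if there is none.

    Each lineage is a list of taxon names ordered from the root to the species.
    """
    if not lineages:
        return None
    common = None
    # zip() walks the lineages rank by rank and stops at the shortest one.
    for taxa_at_rank in zip(*lineages):
        if len(set(taxa_at_rank)) != 1:
            break  # the lineages diverge at this rank
        common = taxa_at_rank[0]
    return common


# Two members of the same family share the family as LCA.
same_family = [
    ["Viruses", "Circoviridae", "Circovirus"],
    ["Viruses", "Circoviridae", "Cyclovirus"],
]
# Adding a member of another family pushes the LCA up to the root.
mixed = same_family + [["Viruses", "Geminiviridae"]]
```

With these inputs, `last_common_ancestor(same_family)` gives the family and `last_common_ancestor(mixed)` falls back to `Viruses`, which is the imprecise case mentioned above.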
Finally, the database also provides the length (number of amino acid positions in the multiple sequence alignment) and the number of sequences in each cluster.
This database is available in SQLite format and, to provide more direct access, flat text files are also provided. A text file for each cluster, identified by its cluster number, contains all the information related to it.

Software availability
The different steps explained above are performed using a Snakemake pipeline 6 , available at Institut Pasteur's Gitlab.
• Pipeline available from https://gitlab.pasteur.fr/tbigot/rvdb-prot/.
Table 1 shows some summary metrics for the entries of this release and the different resources.

Open Peer Review
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
Author Response 26 Aug 2020
Thomas Bigot, Institut Pasteur, Paris, France
We would like to thank the Reviewer. Please find below our line-by-line responses.
In order to ease reproducibility, the parameters used for the different programs (e.g. HMMER, SiLiX) of the pipeline should be provided.
Done. Parameters are now specified in the new version of the manuscript. Actually, for both of these programs, we use default parameters.

Why is it required to locally translate the Coding DNA Sequences (CDS) from the original RVDB nucleotide database instead of downloading them from the resource?
We have clarified this point in the new version: actually, we use the translations provided in the entry of each nucleic sequence. They are provided in the raw data of the original database (Genbank, RefSeq) along with the accession number. What we do amounts to retrieving, by accession number, all protein sequences corresponding to a nucleic sequence from the protein database, but doing it directly from the nucleic database is faster.

I have a question about the taxonomic assignation to the Last Common Ancestor (LCA) when building the clusters. How are possible contradictions within a cluster handled? More precisely, what is done if sequences that belong to distantly related taxa are clustered together? If a strict LCA rule is applied, then it would be possible to have a really imprecise assignation (something like "virus" and that's it).
Indeed, we use naïve LCA assignation, and it can lead to imprecise assignation (some clusters can be tagged as Viruses). As we do not have other information about the cluster we characterize, we chose not to avoid this possibility. We have added a clarification about this case in the new version of the manuscript: "For each cluster, the taxonomic information is summarized by a Last Common Ancestor (LCA) that corresponds to the taxon in the tree of life to which all the sequence taxa belong; this LCA can be close to the root of the tree (Viruses), but is usually specific to a family."

© 2019 le Mercier P. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Philippe le Mercier
Swiss-Prot Group, CMU, Swiss Institute of Bioinformatics, Geneva, Switzerland
In this article, the authors present RVDB-prot, a reference viral protein database and its HMM profiles. The purpose of this approach is to provide a complete reference database of viral proteins to identify new sequences. Their database is based on the nucleotide Reference Viral Database (RVDB). The rationale of this work is that protein sequences can be better than nucleotides for searching distant homologs.
In brief, their approach was as follows: the RVDB database was converted to proteins, thereby creating a new dataset of 3,899,699 proteins. The proteins were clustered, and these clusters were used to create HMMs. Words frequently present in sequence names of a cluster were used to annotate HMM profiles. The software, pipeline and final database are all available.
The final data are of good quality, will hopefully be maintained along with RVDB and offer a new approach for protein virus reference.
While the article is well written and the method is well described, there are a number of issues that need to be addressed:
○ The database is described as facilitating sequence assignation. This seems a bit vague, a sentence describing possible applications could help.
○ The introduction may describe better the current state of research in the field. UniProtKB should be cited in "existing databases" for viral proteins, and authors may add citations of its use in virus detection. (ex: UniRef90 used with success to create synthetic human virome 1 (PMID: 26045439)). This would also highlight new potential applications for RVDB-prot.
○ UniProtKB provides data for 3,972,271 viral proteins, a bit more than RVDB-Prot (3,899,699). RVDB-prot data is based on similarity gathering of sequences with viral RefSeq, which has the advantage of ignoring any taxonomical issues. On the other hand, many RefSeq are provisional, and those are not free of errors. The authors may discuss the advantage of their method over existing protein datasets.
○ Similarly, UniRef90 contains 577,105 clusters of proteins, which could be compared to the 489,207 unique proteins of RVDB-prot. Further discussion may help understanding the advantages of these two datasets.
○ The paper could provide more details on parameters used for defining clusters with Silix.
○ Minor remark: The naming system may be perfected. Although imaginative and automatic, it seems to be limited. Actually, the name used to create RVDB-pro keywords is not clearly defined. The names of GenBank protein entries are not very consistent and this may explain these problems. Maybe using pfam or any other method of identification over the clusters may help naming them in a more consistent way.

Thomas Bigot, Institut Pasteur, Paris, France
We would like to thank the Reviewer. Please find below our line-by-line responses.
The database is described as facilitating sequence assignation. This seems a bit vague, a sentence describing possible applications could help.
We have added the following sentence to exemplify possible applications: "When trying to characterize sequences present in a metagenomics sample, searching first for related sequences in a viral database can lead to identify rapidly a known virus (high identity between the query sequence and the one in the database), or identify potential new species (low identity with any known sequence). Such hits must be further characterized on more comprehensive databases to increase the robustness of taxonomic assignations."
The introduction may describe better the current state of research in the field. UniProtKB should be cited in "existing databases" for viral proteins, and authors may add citations of its use in virus detection. (ex: UniRef90 used with success to create synthetic human virome1 (PMID: 26045439)). This would also highlight new potential applications for RVDB-prot.
UniProtKB provides data for 3,972,271 viral proteins, a bit more than RVDB-Prot (3,899,699). RVDB-prot data is based on similarity gathering of sequences with viral RefSeq, which has the advantage of ignoring any taxonomical issues. On the other hand, many RefSeq are provisional, and those are not free of errors. The authors may discuss the advantage of their method over existing protein datasets.
We have updated the introduction, introducing UniProtKB viral sequences: "UniProtKB11 contains numerous viral sequences (4 497 049 in total, including 17 008 (0.38%) reviewed ones) that could, as for NCBI/nr, increase computation time when thousands of sequences have to be analyzed concomitantly, which is routinely practiced in metagenomics analyses." We have also updated the description of RefSeq (with updated contents) and better exemplified the benefit of RVDB over these two databases.
Similarly, UniRef90 contains 577,105 clusters of proteins, which could be compared to the 489,207 unique proteins of RVDB-prot. Further discussion may help understanding the advantages of these two datasets.
We stressed the first asset of RVDB: the curation performed on the sequences is unique and increases confidence that all the sequences in the database are genuine viral sequences.
The paper could provide more details on parameters used for defining clusters with Silix.
Done. We used the default parameters of Silix.
The naming system may be perfected. Although imaginative and automatic, it seems to be limited. For example cluster 77 in the -prot-hmm-txt.zip (v 15.1) contains 398 sequences, which are obviously rep proteins for ssDNA viruses circo, gemini and their satellites, but the names fished out by the authors' method are not clear. This can be problematic for sequence assignation. Actually, the name used to create RVDB-pro keywords is not clearly defined. The names of GenBank protein entries are not very consistent and this may explain these problems. Maybe using pfam or any other method of identification over the clusters may help naming them in a more consistent way.
We are grateful for this remark, which helped us make the naming system clear. Indeed, the pipeline now queries PFAM to annotate sequences. As explained in the new version of the manuscript, for each cluster, we query PFAM with every sequence of this cluster, using the --cut_ga option of HMMER (this option makes HMMER trust the PFAM GA bitscore cutoff defined for each cluster). We kept the original system (using sequence descriptions) despite the fact that they are inaccurate, since we sometimes do not find homologous clusters in PFAM.

Competing Interests:
No competing interests were disclosed.