ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Software Tool Article

Restriction site-associated DNA from Python-implemented Digestion Simulations (RApyDS): a companion tool for RAD sequencing experimental design

[version 1; peer review: awaiting peer review]
PUBLISHED 07 May 2021
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS AWAITING PEER REVIEW

This article is included in the Cell & Molecular Biology gateway.

This article is included in the Bioinformatics gateway.

This article is included in the Python collection.

Abstract

Reduced representation sequencing is a practical approach for obtaining genetic variations from a random subsample of the genome. RADseq (Restriction Site-Associated DNA Sequencing), as one of the more popular reduced representation approaches, is currently being used in a wide array of applications including marker development, phylogenetics, and population genomics. A crucial step in designing a RADseq experiment is the selection of one or a pair of restriction enzymes (RE) that will result in sufficient density of loci to meet the objectives of the study, which is not straightforward because of difficulties in obtaining a standard set of REs that can generally be applied to RADseq experimental designs. Here we present RApyDS, a simulation tool that provides users with evaluation metrics to aid in choosing suitable REs based on their target RADseq design. RApyDS can perform simulations for single- or double-digest RADseq, preferably with a supplied reference genome. The tool outputs an overview page, electrophoresis visualization, mapping of restriction cut sites, and RAD loci density across the genome. If supplied with an annotation file, the program can also output evaluation metrics for a specified genomic feature. The tool is currently available at https://github.com/pgcbioinfo/rapyds.

Keywords

RApyDS, Restriction Site-Associated DNA Sequencing, RADseq, simulation, companion tool, experimental design

Introduction

Next-generation sequencing (NGS) technologies provide an avenue to obtain a broader view of the genetic landscape of individual species, substantially impacting various applications such as molecular marker discovery (Shendure & Ji, 2008). However, not all approaches and studies require sequencing of complete genomes, which may be cost-prohibitive and more difficult to analyze (Etter et al., 2011; Meger et al., 2019). In dealing with whole-genome resequencing studies, reduced representation genomic libraries (RRLs) are alternatively used to randomly subsample the genome for the discovery of genetic markers and variations. Instead of resequencing at a lower depth of coverage to reduce the sequencing cost, RRLs can increase the effective coverage depth but only for certain regions of the genome. Amongst RRLs, Restriction Site-Associated DNA Sequencing (RADseq) has gained a lot of attention, as this approach utilizes the power of high-throughput sequencing to efficiently discover polymorphic markers on organisms even in the absence of a reference genome (Etter et al., 2011).

RADseq, which utilizes restriction enzymes to sample a subset of the genome that is flanked by restriction sites, is commonly used today for SNP discovery and genotyping (Davey et al., 2013). RADseq protocols may vary according to the number of restriction enzymes used and the frequency of cut sites across the length of the DNA. In its original formulation, RADseq relies on the fragments produced from the cut sites of a single restriction enzyme; whereas other protocols such as double-digest RADseq (ddRAD) produce fragments with known distances from the cut-sites of two different restriction enzymes (Andrews et al., 2016). Briefly, the process begins with the digestion of high-quality genomic DNA by the chosen restriction enzyme. Molecular identifiers and adapters are ligated to the sticky ends of the fragments. Samples are then pooled and randomly sheared. Another set of adapters are attached to the sheared fragments to ensure proper amplification. Finally, size selection is performed for sequence-ready fragments. Random shearing is eliminated in the ddRAD protocol as it uses another restriction enzyme for a second round of digestion instead (Davey & Blaxter, 2010).

Restriction enzymes are the key elements in RADseq. These are essential in determining the genomic regions to be sequenced, since the patterns of genetic variations to be observed will occur adjacent to the corresponding restriction sites (Andrews et al., 2016). The choice of which enzyme to use thus determines the number of markers that can be obtained and assayed, which in turn determines the amount of sequencing required, the number of samples, the cost, and ultimately the success of the project (Herrera et al., 2015). The lack of prior estimates on the number of restriction sites, fragments that can be produced, density of RAD loci that can be sequenced, or other projected effects of using one or a combination of restriction enzymes are some of the common impediments in successfully designing and performing reduced representation library preparations for whole-genome resequencing. It is therefore of utmost importance to have a prior idea on the distribution and frequency of cut-sites and fragments that will be produced by a given restriction enzyme.

To address the aforementioned impediments, we developed RApyDS (Restriction Site-Associated DNA from Python-implemented Digestion Simulation), a tool that aims to provide evaluation metrics to aid researchers in choosing the best set of restriction enzymes for a RADseq study. It is a Python-based tool that simulates digestion of a given reference genome in silico either by original RADseq or ddRAD protocols. The simulation tool generates estimates for the following: (i) the number of restriction sites, (ii) the number of fragments after digestion, (iii) the number of fragments after shearing and size selection (RAD loci), (iv) the number of RAD loci within repeat regions, (v) the cut-site distribution, and (vi) the projected RAD loci density across the genome. A gel electrophoresis view of the restriction fragments is also shown to simulate the common laboratory method used for initial evaluation of restriction digestion. When supplied with an annotation file for the reference genome, the tool will also provide the number of RAD loci observed within annotated regions. All of this information, taken together, will enable the user to better determine the top restriction enzyme candidates and RADseq experimental design to implement for a particular RADseq or resequencing study.

Methods

Implementation

Figure 1 shows an overview of the implementation workflow of the RApyDS tool. Two distinct RADseq protocols are implemented in RApyDS for simulation: the original RADseq protocol (Baird et al., 2008), and the double-digest RADseq or ddRAD (Peterson et al., 2012). Global splitting was implemented to represent the digestion of DNA sequence based on the selected restriction enzymes; one RE for original RAD and two REs for ddRAD. In order to obtain simulation metrics based on the mapping of RAD loci to the reference genomic sequence, we used the alignment tool BWA (BWA, RRID:SCR_010910) (Li & Durbin, 2009). At present, the primary output of RApyDS is an interactive HTML file that can be viewed properly using the Mozilla Firefox browser.

e3cac9c2-de02-46ce-9c13-ef181d0858d5_figure1.gif

Figure 1. Summary of RApyDS workflow.

Original RADseq Simulation. In this protocol, we simulate the original RADseq workflow by initially segmenting the reference sequence based on the presence of restriction enzyme cut sites. The procedure therefore assumes a complete digestion of the intact genome. The resulting fragments are then further sheared and size-selected according to a user-defined insert size range. This process results into three fragment types: (1) those that are deemed too small after shearing and size selection, (2) those that are within the insert size range but do not contain a sticky end (remnant of the restriction cut site), and (3) those that are within the defined insert size range and contains both sticky and sheared ends. Type 1 fragments are filtered out by the program to simulate the size selection process in sequencing library preparation. Type 2 fragments, on the other hand, are also filtered out by the program because the sticky end resulting from the restriction enzyme activity is needed for the attachment of the P1 sequencing adapter. Thus, only fragments containing a restriction enzyme cut site in one of the ends and whose lengths are within the insert size range (Type 3) will be considered as putative RAD loci for downstream analysis.

Double-digest RADseq Simulation. For the ddRAD protocol, the process similarly begins with the segmentation of the genome based on the positions of the first restriction enzyme cut sites. However, instead of implementing a shearing step afterwards, a second round of digestion based on the cut sites of the second restriction enzyme is done. The resulting fragments after the two rounds of digestion are then size selected according to the user-defined insert size range. This procedure will generate the following fragment types: (1) those that are shorter than the minimum insert size, (2) those that are within the insert size range but contain sticky ends from only a single enzyme, and (3) those that are within the insert size range and contain sticky ends from both of the enzymes used. Again, to simulate size selection during library preparation, Type 1 fragments are filtered out in the process. Type 2 fragments will also be filtered out because the sticky ends from both enzymes are necessary for the proper attachment of the sequencing adaptor pairs. At the end of this step, only the Type 3 fragments will remain as putative RAD loci.

Operation

RApyDS software runs on a Linux OS computing machine with at least Python 3 installed and a minimum system requirement of 500 MB free memory (RAM) and a single CPU. Resource consumption will depend on the size of the input genome. Installation of the Python module dependencies can be done using pip. A complete guide on installation, usage, software requirements, and specifications can be accessed at https://github.com/pgcbioinfo/rapyds.

Use cases

Program inputs

The program requires a genome or a DNA sequence in FASTA format as input. Whenever a corresponding annotation in GFF/GTF format is available, this can also be supplied together with the sequence to enable counting of RAD loci within target regions. In cases where a reference genome is not available, the tool can randomly generate a hypothetical reference sequence based on a user-defined GC content and sequence length using the algorithm implemented in SimRAD (Lepais & Weir, 2014). By default, the tool will evaluate a total of 56 commonly used restriction enzymes which recognition sequences were acquired from New England Biolabs Inc. . The users may also select any subset of these enzymes for evaluation and are given the option to add new restriction enzyme information if these are not yet included in the current list.

Outputs and implementation features

A successful RApyDS run will generate a compressed ZIP file as final output. This archived file contains simulation reports in hypertext markup language (HTML) format, all of which can be accessed interactively through the main file index.html. The landing page (Overview tab) shows a table containing various simulation metrics for each of the restriction enzymes tested (Figure 2). These metrics are described in more detail in Table 1. Briefly, the table contains information on the number of predicted RAD loci, the breadth of reference sequence covered by these loci, single-copy and repetitive loci, and loci within annotated regions if an annotation file has been supplied. To further facilitate assessment of the results, the table can be sorted in increasing or decreasing order based on the values from a particular column by clicking on the column header.

e3cac9c2-de02-46ce-9c13-ef181d0858d5_figure2.gif

Figure 2. Overview report of an Original RAD protocol simulation of the C. elegans reference genome.

Table 1. Description of column headers in the overview page of RApyDs simulation report.

Column headersDescription
Restriction enzymeRestriction enzymes tested
Fragments after
digestion
Number of fragments produced after enzymatic digestion. For ddRAD this accounts for the first and
second digestion.
Fragments after size
selection (RAD loci)
Number of fragments produced after size selection. This is equivalent to the number of RAD loci produced
Percent breadth of
coverage
Percent of the input sequence length covered by the RAD loci
Single-copy RAD lociNumber of RAD loci found only once in the reference sequence. If at least one of the read pairs
representing the RAD loci maps uniquely to the reference genome, the loci is considered to be single-copy.
Repetitive RAD lociNumber of RAD loci found multiple times in the reference sequence. If none of the read pairs representing
the RAD loci maps uniquely to the reference genome, the loci is considered to be repetitive.
Repeat regions with
RAD loci
Number of repeat regions in the reference sequence harboring RAD loci
RAD loci in specified
feature
Number of RAD loci that are found within the specified sequence feature
Specified feature with
RAD loci
Number of specified sequence feature in the reference harboring RAD loci
Percent specified
feature with RAD loci
Percentage of specified sequence feature with RAD loci based on total number of specified feature

We note here that the number of single-copy and repetitive RAD loci is based on alignment results. In certain cases, the alignment step may take a substantial amount of time to complete. Thus, we included a parameter flag to skip the alignment step if the user prefers to speed up the simulation process at the expense of repeat information. Table 2 shows the average processing times for the following simulations: (i) with alignment and reference indexing, (ii) with alignment only wherein an indexed reference is supplied, and (iii) without alignment. A reduction in processing time of around two to five times can be obtained when running RApyDS using the latter options.

Table 2. Average running time of RApyDS (with histogram graphing skipped) on each species using a server with 16 CPU Cores of Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz and 50GB of memory allocation.

SpeciesFASTA file
size (MB)
Average Processing Time (sec)
With Alignment
and Reference
Indexing
With
Alignment
Only
Without
Alignment
Eescherichia colia4.5016.8914.566.41
Caenorhabditis elegansa97.00656.18530.86132.85
Brassica napusb943.002093.52722.63356.56
Aedes aegyptib1200.004092.681684.06584.77
Zea maysb2000.008291.834311.70899.99
Callithrix jacchusb2700.008063.472165.911288.30

a RApyDS default parameters (original RAD, entire enzyme database)

b RApyDS ddRAD parameters (enzymes used: NlaIII-MluCI)

In addition to the overview metrics, graphical views of the results in various forms are also made available to the users. The first is a virtual gel electrophoresis output based on the obtained fragments after digestion simulations (Figure 3). For this visualization, the user can specify the sizes of the DNA ladder rungs to display (Lane 1), and can also choose up to five enzymes for which the simulated banding patterns will be shown (Lanes 2 to 6). This particular view can be helpful to researchers who are more adept at assessing the utility of restriction enzymes based on gel electrophoresis profiles.

e3cac9c2-de02-46ce-9c13-ef181d0858d5_figure3.gif

Figure 3. Gel visualization of the simulation results on C. elegans reference genome using an Original RAD protocol with BamHI, BstYI, PstI, SbfI and NotI as REs.

To give the users information on the distribution and density of the enzyme cut sites across the entire length of the reference sequence, the tool also outputs a view of the cut site distribution for each of the enzymes included in the simulation (Figure 4). This view can be magnified to focus on specific regions with greater resolution. A histogram of the enzyme cut site density per binned region is also generated.

e3cac9c2-de02-46ce-9c13-ef181d0858d5_figure4.gif

Figure 4. Cut site and RAD loci density distribution of the restriction enzyme Acc65I on chromosome I of C. elegans.

Sample application

In general, the users can do RAD simulations for a given target fragment size range to be generated from one or more REs available in the tool’s database. Alternatively, if a particular RE is not included in the database, information on this RE can be manually added and subsequently used. It is also recommended for the users to have a prior idea of an appropriate reference genome to be used in estimating the distribution of target features across the entire genome length. For instance, in capturing polymorphic sites, the number of RAD loci may drastically change when using common cutter enzymes exclusively or in combination with rare cutter enzymes.

Certain resequencing studies also involve calling variants using whole genome data sets sequenced at low coverage depth in an attempt to reduce the overall sequencing cost. However, a RADseq approach can result in a similar cost reduction (O’Rourke et al., 2011) without sacrificing coverage depth and variant loci density if an appropriate set of restriction enzymes is used. Specifically, enzymes that can generate a relatively even fragment length distribution across the entire genome are generally desirable (Roszik et al., 2017). Thus, RApyDS can be utilized to evaluate which restriction enzyme or restriction enzyme pair can generate the desired RAD loci density and distribution given a range of insert sizes and within certain genomic features (e.g., genes, exons) if an annotation file is supplied. By default, the tool will also output the number of single- and multi-copy RAD loci based on mapping information, which can also be a consideration in deciding which particular restriction enzyme to use.

To demonstrate the use of RapyDS, an original RAD simulation was implemented on the C. elegans reference genome using all the REs present in the tool’s database. Figure 2 shows a portion of the results arranged alphabetically based on RE names. We can infer from these results that using AciI on C. elegans chromosome I produced the highest number of RAD loci (16,082), covering 32% of the reference sequence length with the default size selection range of 200-300 bp. Based on alignment results, AciI enzyme is predicted to yield mostly single-copy RAD loci (15,815), with only 1.66% (267) repetitive RAD loci detected and 5.32% (856) occurring at repeat regions. The number of sequenced repetitive RAD loci and/or RAD loci repeat regions can pose issues in downstream analysis that are heavily dependent on mapping and alignment approaches.

Future developments

There are certain biological factors not accounted for in the current implementation of RApyDS that may affect actual RADSeq experiments. For instance, certain restriction enzymes can be sensitive to the methylation state of a particular nucleotide (Mülhardt & Beese, 2007), which is usually not reflected in the reference sequence. Single nucleotide polymorphisms in the enzyme recognition sites can also result in allele drop-outs that can further lead to a systematic underestimation of the overall diversity in population studies, particularly for organisms with relatively high mutation rates (Andrews et al., 2016). We will try to further incorporate some of these biological considerations in future releases of RApyDS in order to obtain simulation metrics that can more closely reflect the results from real experimental data sets.

Data availability

Underlying data

NCBI Assembly: AaegL5.0. RefSeq assembly accession: GCF_002204515.2.

NCBI Assembly: ASM584v2. RefSeq assembly accession: GCF_000005845.2

NCBI Assembly: B73 RefGen_v4. RefSeq assembly accession: GCF_000005005.2.

NCBI Assembly: Bra_napus_v2.0. RefSeq assembly accession: GCF_000686985.2.

NCBI Assembly: Callithrix jacchus-3.2. RefSeq assembly accession: GCF_000004665.1.

NCBI Assembly: WBcel235. RefSeq assembly accession: GCF_000002985.6.

Software availability

Source code available from: https://github.com/pgcbioinfo/rapyds.

License: GNU General Public License v3 (GPLv3).

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 07 May 2021
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Gabriel KA, Nepacina MR, Tablizo F et al. Restriction site-associated DNA from Python-implemented Digestion Simulations (RApyDS): a companion tool for RAD sequencing experimental design [version 1; peer review: awaiting peer review]. F1000Research 2021, 10:360 (https://doi.org/10.12688/f1000research.52141.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status:
AWAITING PEER REVIEW
AWAITING PEER REVIEW
?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 07 May 2021
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.