Keywords
drosophila, sequencing, data, dimorphic traits
This article is included in the Data: Use and Reuse collection.
drosophila, sequencing, data, dimorphic traits
In this version, a mistake in the title of the Version 2 has been corrected - the words 'population sample' have been added to the title.
See the authors' detailed response to the review by Stephen Richards
See the authors' detailed response to the review by Geraldine A. Van der Auwera
As part of a study on the molecular genetics of sexually dimorphic complex traits, we used hemiclonal analysis in conjunction with high-throughput sequencing1 to characterise molecular genetic variation across the genome, from an outbred laboratory-adapted population of Drosophila melanogaster, LHM2,3. The hemiclone experimental design allows the repeated phenotyping of multiple individuals, each with the same unrecombined haplotype on a different random genetic background. This method has been used to investigate standing genetic variation and intersexual genetic correlations for quantitative traits2 and gene expression4, but it has not yet been used to obtain genomic data.
The 220 hemiclone females that were sequenced in the present study have a maternal haplotype, from the dm6 reference assembly strain (BDGP6+ISO1 mito/dm6, Bloomington Drosophila Stock Center no. 2057)5,6, and have a different paternal genome each, sampled using cytogenetic cloning from the LHM base population (See Figure 1). All non-reference genotypes in the sequenced LHM hemiclones were expected to be heterozygous and in-phase, except in rare instances where the in-house dm6 reference strain also had the same non-reference allele.
1 Capture - Single wild-type male from the base-population (open chromosomes) was crossed to five clone generator female, which harbour a fused double-X chromosome (DX), a Y chromosome, as well as a marked translocation of chromosomes 2 and 3 (hatched chromosomes; see text for full genotype). This cross captured a single wild-type haplotype. 2 . Amplify - A single heterozygous male was then crossed again to a group of clone-generator females to amplify the number stock hemiclonal males. This cross can be repeated, to replenish the stock hemiclone males. 3 Express - Finally, a single reference genome female (filled chromosomes) was crossed to a single stock hemiclonal male to produce a female that harbours the original wild-type haplotype (excluding the 4th dot chromosome, which remains uncontrolled throughout), in a reference genome genetic background. Cytoplasmic factors in the offspring also derive from the reference genome stock (black outline). DNA is extracted from a single hemiclone female from each line and sequenced.
Previous studies indicate that the limits for DNA quantity in next-generation sequencing are 50–500ng7. We sequenced individual D. melanogaster, rather than pools of clones, because more biological information can be obtained, and because modern transposon-based library preparation allows accurate sequencing at low concentrations of DNA. D. melanogaster is a small insect (~1μg) although this problem is off-set by the reduced proportion of repetitive intergenic sequence, and small genome size relative to other insects (170Mb verses ~500Mb)7.
We mapped reads to the D. melanogaster dm6 reference assembly using a BWA-Picard-GATK pipeline, and called nucleotide variants using both HaplotypeCaller, and Genomestrip, the latter of which detects copy-number variation up to 1Mb in length. A graphic representation of the data analysis pipeline is provided in Figure 2. We have made the mapped sequencing data, and genotype data publicly-available on NCBI, and additionally have made the meta-data, analysis code and logs publicly-available on Zenodo. This is the first report of a study which uses methods for detecting both SNPs, indels and structural variants (deletions and duplications >1Kb in length), genome-wide in next-generation sequencing data, and the first report of whole genome resequencing in hemiclonal individuals.
Data file types are indicate at the left side. Nodes are also coloured by file type. Software platforms are in typed in italics. Values in parentheses indicate the size of the data uploaded at different stages. Latex code for generating this figure is available at https://doi.org/10.5281/zenodo.168582.
The base population (LHM) was originally established from a set of 400 inseminated females, trapped by Larry Harshman in a citrus orchard near Escalon, California in 19913. It was initially kept at a large size (more than 1,800 reproducing adults) in the lab of William Rice (University College Santa Barbara, USA). In 1995 (approximately 100 generations since establishment) the rearing protocol was changed to include non-overlapping generations, and a moderate rearing density with 16 adult pairs per vial (56 vials in total) during 2 days of adult competition, and 150–200 larvae during the larval competition stage3. In 2005, a copy of LHM population sample was transferred to Uppsala University, Sweden (approximately 370 generations since establishment), and in 2012, to the University of Sussex (UK), when the current set of 223 haplotypes were sampled. At the point of sampling we estimate that the population had undergone 545 generations under laboratory conditions, 445 of which had been using the same rearing protocol.
Hemiclonal lines were established by mating groups of five clone-generator females (C(1)DX,y,f ; T(2;3) rdgC st in ri pP bwD) with 230 individual males sampled from the LHM base population (see 2). A single male from each cross was then mated again to a group of five clone-generator females in order to amplify the number of individuals harbouring the sampled haplotype. Seven lines failed to become established at this point. The remaining 223 lines were maintained in groups of up to sixteen stock hemiclonal males in two vials that were transferred to fresh vials each week. Stock hemiclonal males were replenished every six weeks by mating with groups of clone-generator females. A stock of reference genome flies (Bloomington Drosophila Stock Center no. 2057) was established and maintained initially using five rounds of sib-sib matings before expansion. 223 virgin reference genome females were then collected and mated to a single male from each of 223 hemiclonal lines. Female offspring from this cross therefore have one copy of the reference genome and one copy of the hemiclonal haplotype. Groups of these hemiclonal females were collected as virgins, placed in 99% ethanol and stored at -20°C prior to DNA extraction.
One virgin female per hemiclonal line, was homogenised with a microtube pestle, followed by 30-minute mild-shaking incubation in proteinase K. DNA was purified using the DNeasy Blood and Tissue Kit (Qiagen, Valencia, CA), according to manufacturer’s instructions. Volumes were scaled-down according to mass of input material. Barrier pipette tips were used throughout, in order to minimise cross-contamination of DNA. Template assessment using the Qubit BR assay (Thermo Fischer, NY, USA) indicated double-stranded DNA, 10.4Kb in length at concentrations of 2–4 ng/μl (total quantity 50–100ng).
Sequencing was performed under contract by Exeter Sequencing service, University of Exeter, UK. The sonication protocol for shearing of the DNA was optimised for low concentrations to generate fragments 200–500bp in length. Libraries were prepared and indexed using the Nextera Library Prep Kit (Illumina, San Diego, USA). All samples were sequenced on a HiSeq 2500 (Illumina), with five individuals per lane. We also sequenced DNA from two individuals from the in-house reference line (Bloomington Drosophila Stock Centre no. 2057). One was prepared as the hemiclones, using the Illumina Nextera library (sample RGil), and the other using an older, Illumina Nextflex method (sample RGfi). The median number of read pairs across all samples was 29.23×106 (IQR 14.07×106). Quality metrics for the sequencing data were generated with FastQC v0.10.0 by Exeter Biosciences, and used to determine whether results were suitable for further analyses. For twelve samples with less than 8×106 reads, sequencing was repeated successfully (H006, H041, H061, H084, H086, H087, H092, H098, H105), with a further three samples omitted entirely (H015, H016, H136), leaving 220 hemiclonal samples in total. As shown in Figure 3A and 3B, the read quality score and quality-per-base for the samples taken forward for genotyping in this study were well within acceptable standards, and similar across all samples.
Raw data (fastq files) were stored and processed in the Linux Sun Grid Engine in the High-Performance Computing facility, University of Sussex. Adaptor sequences (Illumina Nextera N501-H508 and N701-N712), poor quality reads (Phred score <7) and short reads were removed using Fastq-mcf (ea-utils v.1.1.2). Settings were: log-adapter minimum-length-match: 2.2, occurrence threshold before adapter clipping: 0.25, maximum adapter difference: 10%, minimum remaining length: 19, skew percentage-less-than causing cycle removal: 2, bad reads causing cycle removal: 20%, quality threshold causing base removal: 10, window-size for quality trimming: 1, number of reads to use for sub-sampling: 3×105.
Cleaned sequence reads were mapped to the D. melanogaster genome assembly, release 6.0 (Assembly Accession GCA_000001215.46) using Burrows-Wheeler Aligner mem (version 0.7.7-r441)8, with a mapping quality score threshold of 20. Remaining reads were remapped using Stampy v1.0.24, which is slower but more precisely maps reads which are divergent from the reference genome assembly9. This method was used previously for the Drosophila Genome Nexus10. Removal of duplicate reads, indexing and sorting was performed with Picard-Tools v1.77. Re-mapping of sequence reads around insertion-deletion polymorphisms was performed using Genome Analysis Tool-Kit (GATK) v3.2.2, as a recommended standard practice11.
The median depth of coverage across all samples used for genotyping was 31X (IQR 14, see Figure 3C). As shown in Figure 3D, the mean nucleotide mis-match rate to the dm6 reference assembly for the LHM hemiclones was 3.27×10-3 per PCR cycle (IQR 0.2×10-3), contrasting with the two reference line samples for which the mis-match rate was 0.89-1.10×10-3 per cycle. We observed spikes of nucleotide mismatches in some PCR cycles for some samples, which are likely to be errors rather than true sequence variation.
A: Sequence read quality for each sample sequenced. Y-axis scale is logarithmic. B: Quality of sequences by nucleotide base position for each sample. C: Read depth of coverage distribution across each sample. Colouring corresponds to the order which which the samples were originally sequenced. D: Mis-matches to the dm6 reference genome assembly, by PCR cycle-number. Colouring is by sample as in plot C. The two red lines with visibly-lower mismatch rates than the others correspond to the two in-house BDGP/dm6 reference lines that were sequenced. Data and R code for this figure are located at https://doi.org/10.5281/zenodo.159282.
Single-nucleotide polymorphisms (SNPs) and insertion/deletions (indels) <200bp in length, were detected and genotyped relative to the BDGP+ISO1/dm6 assembly, on chromosomes 2,3,4,X, and mitochondrial genome using Haplotyper Caller (GATK v3.4-0)12. Individual bam files were genotyped, omitting reads with a mapping quality under 20, stand call and emit confidence thresholds of 31, then combined and genotyped again. 143,726,002 bases of genomic sequence were analysed from which 1,996,556 variant loci were identified consisting of 1,581,341 SNPs, 196,582 deletions, and 218,633 insertions. Functional annotation was added using SNPeff v4.113.
We used hard-filtering to remove variants generated by error, because the alternative ’variant recalibration’ requires prior information on variant positions from a similar population or parents. Quality filtering thresholds were decided following inspection of the various sequencing metrics associated with each variant locus, and by software developers’ recommendations12. The filtering thresholds were: Quality-by-depth >2, strand bias <50 (Phred-scaled p-value from Fisher’s Exact test), mapping quality >58, mapping quality rank sum >-7.0, read position rank sum >-5.0, combined read depth <15000, and call rate >90%. This filtering removed 167,319 variants (8.3%), leaving 1,829,237. Summary values for the variant quality metrics are shown in Table 1. Distributions of quality metrics for Haplotype Caller variants are shown in Supplementary Figure 2. The density of sequence variants, measured as the median for windows of 10Kb in length across the genome, was 75 per for biallelic SNPs, 1 for multi-allelic SNPs, 6 for bi-allelic indels, and 3 for multi-allelic indels (see Figure 4A). Mean separation between variants of any type or allele frequency was 78bp. As shown in Figure 4B the allele frequency distribution for bi-allelic SNPs and indels was similar, and broadly within expectations for an out-bred diploid population sample. The two in-house reference line individuals had 515 homozygous and 3171 heterozygous mutations from the reference assembly. The median genotype counts for the 220 LHM hemiclone individuals, were 585 homozygous, 728,214 heterozygous and 4963 no-call (IQR 400, 36707 and 7876). Genotype counts for each individual are shown in Figure 4C.
Values show the total number of variants, median (and IQR) for each metric. Data generated from vcf file using GATK VariantsToTable, on the quality-filtered data. *Strand bias refers to Phred-scaled p-value from Fisher’s Exact Test. Code and data used to generate this table located at https://doi.org/10.5281/zenodo.159282.
A: Density of common variants across the genome (MAF>0.05 (Variants from the in-house reference line are included but account for less than 3,686 of the 1,825,917 common variants plotted (<0.2%). B: Allele frequency distribution by variant type. *MAF values were calculated from the count of heterozygous calls, and so for multi-allelic variants, the MAF is derived from the combined count of both alternate alleles. C: Genotype counts per individual genotyped. Data generated using GATK/3.4 VariantEvaluation function. Data and R code for this figure are located at https://doi.org/10.5281/zenodo.159282.
For data submission to NCBI dbSNP, we were obliged to exclude 44,644 indels that were multi-allelic or greater than 50bp in length, and a further 57,662 SNPs and indels situated within deletions. Variants greater than 50bp in length were submitted to the NCBI structural variant database dbVar. The genotype data submitted to dbSNP consists of 1,726,931 quality-filtered, functionally-annotated variant records (1,423,039 SNPs and 303,892 short, biallelic insertion and deletion variants) corresponding to 383,378,682 individual genotype calls.
Large genomic variants – deletions and duplications, between 1Kb and 100Kb in length – were detected and genotyped using GenomeStrip v2.014. One of the reference strain individuals (sample RGfi) was omitted from the this analysis because a different sequencing library preparation method was used from the other samples (see above). We included the following settings (according to developers’ guidelines): Sex-chromosome and k-mer masking when estimating sequencing depth, computation of GC-profiles and read counts, and reduced insert size distributions. Large variant discovery and genotyping was performed only on chromosomes 2, 3, 4 and X, omitting the mitochondrial genome and unmapped scaffolds.
We used the Genomestrip CNV Discovery pipeline with the settings: minimum refined length 500, tiling window size 1000, tiling window overlap 500, maximum reference gap length 1000, boundary precision 100, and genotyped the results with the GenerateHaploidGenotypes R script (genotype likelihood threshold 0.001, R version 3.0.2). Following visualisation of the genotype results and comparison with the bam sequence alignment files using the Integrated Genomics Viewer (IGV) v2.3.7215, we excluded telomeric and centromeric regions where the sequencing coverage was fragmented, and six regions of multi-allelic gains of copy-number with dispersed breakpoints, previously reported to undergo mosaic in vivo amplification prior to oviposition16 (see Supplementary Table 1 for genomic positions, and Supplementary Figure 3 for visualisation of in vivo amplification in a sequence alignment file). We excluded 6 samples (H082, H083, H090, H097, H098, H153) for which 80–90% of the genome was reported by Genomestrip to contain structural variation, which we regarded as error. Most these samples were grouped by the order in which they were processed for DNA extraction and sequencing, so this may have been caused partly by a batch-effect leading to differences in read pair separation, depth-of-coverage, and response to normal fluctuations in GC-content. Following removal of these samples, there were 2897 CNVs (1687 deletions, 877 duplications, and 333 of the ’mixed’ type), ranging in size from 1000bp to 217,707bp. We observed eight regions, for which Genomestrip identified multiple adjacent CNVs in single individuals, but which are likely single CNVs, 100Kb to 1.3Mb in length (Supplementary Table 2).
Quality-filtering for structural variants detected by Genomestrip analysis of whole-genome resequencing data are not thoroughly established. We visually inspected, in the bam read alignment files using the Integrated Genomics Viewer15, reported structural variants which were most likely to be artefacts. Specifically these were variants with: i) Extreme values for quality-score, GC-content or cluster separation, ii) Any homozygous non-reference genotypes (not expected with our breeding design), iii) Type ’mixed’. Following this, we used the following criteria for quality filtering: Quality score >15, cluster separation <17, GC-fraction >0.33, no mixed types (deletions and duplications only), homozygous non-reference genotype count >0, and heterozygous genotype count <200. Summaries of the quality metrics for quality-filtered data are shown in Table 2 and Supplementary Figure 2. We applied an upper limit to the cluster separation to remove groups of outliers in the upper end of the distribution, although this may have excluded many true, low-frequency variants. However, data on rare variants are not directly useful for our further investigations.
After filtering, 167 CNVs remained (78 deletions and 89 duplications, size range 1Kb-26.6Kb). The positions and genotypes of these CNVs for each individual are shown in Figure 5. The genotype data for quality-filtered CNVs were combined with the data from 2252 indels >50bp from the Haplotype Caller pipeline, and a total of 2419 variants were uploaded to the public database on structural variation, NCBI dbVar. Although we have used methods for detecting SNPs, indels and CNVs, variants between 200bp and 1Kb are not reported by either HaplotypeCaller or Genomestrip. Additionally, sequence inversions are not detected by these methods and the upper limits to CNV detection using Genomestrip, based on the parameters and results of this study, are 100Kb-1Mb.
Each row corresponds to an individual sequenced (in order originally sequenced from top to bottom, with the reference line at the bottom). Image generated using R/3.3.1 (package ggplot v2.1.0) with data generated by GATK VariantsToTable with individual genotypes as copy-numbers. Data and R code for this figure are located at https://doi.org/10.5281/zenodo.159282.
Initial validation of our methods can be seen by lack of variants in the two reference line individuals compared with the LHM hemiclones (3,686 verses a median of 728,799 per sample). For a more thorough test of the genotyping and hemiclone method reproducibility, we sequenced an additional hemiclone individual from three of the LHM lines, and mapped the reads to the reference genome assembly as before. For HaplotypeCaller, we generated gVCF files for each sample, and then performed genotyping and quality-filtering as described above, except that the original three samples were replaced with the replication test samples. Similarly, for Genomestrip, we performed structural variant discovery and genotyping on all of the same samples as before, replacing three original samples with the replication test samples. We then used the GATK Genotype Concordance function to generate counts of genotype differences between the three pairs of samples. Overall results are presented in Table 3. Genotype reproducibility for quality-filtered biallelic SNPs was 98.5–99.5%, going down to 89.1–93.2% for filtered multi-allelic indels. Reproducibility of structural variant genotype calls was 95.6–100.0%, although we noted that for one individual (H119) filtering actually reduced the reproducibility rate from 99.7% to 95.6%. Full code, logs and numerical results can be found at http://doi.org/10.5281/zenodo.160539.
*Presented values are the overall genotype concordance, as generated using GATK/3.4 Genotype Concordance function. Code, logs and output data are available at http://doi.org/10.5281/zenodo.160539.
Although these results indicate that our genotype accuracy is very good, there are several caveats to consider. In the quality-filtered small-variant data, seven samples (H034, H035, H040, H038, H039, H188, H174) had prominently higher genotype drop-out rates than the others (of 2–7%), as well as a higher proportion of homozygous non-reference genotypes (2–4%; See Figure 4C). Additionally two samples had prominently more heterozygous variants (H072:885,551 and H093:955,148 verses the other LHM hemiclones: mean 710,934).
Although the genotype replication rate for the structural variants was also very high, we cannot exclude the possibility that, due to incomplete masking of hard-to-sequence regions of the reference assembly, variants which are artefacts reported in the original genotype data, may also be present in the replication genotype data.
All publicly-available records are for 220 LHM hemiclone individuals and 2 in-house reference line individuals, with the exception of the large-variant data for which one in-house reference line sample and six LHM hemiclones were omitted. The NCBI BioProject identifier is PRJNA282591. Code, logs and quality control data for each dataset, and for generating the figures and tables in this manuscript are publicly-available at Zenodo, https://zenodo.org/, ’Sussex Drosophila Sequencing’ community. Use of the files uploaded to Zenodo is under Creative Commons 4.0 license.
Raw fastq sequence reads, and bam alignment files for D. melanogaster are publicly-available at the NCBI Sequence Read Archive, accession number SRP058502 (https://www.ncbi.nlm.nih.gov/sra/?term=SRP058502). The code for read-mapping, alongside the run logs and quality-control data are available at https://doi.org/10.5281/zenodo.159251. Additionally the sequence alignment files for the corresponding Wolbachia have accession number SRP091004 (https://www.ncbi.nlm.nih.gov/sra/?term=SRP091004), with further supporting files at https://doi.org/10.5281/zenodo.15978416.
Records of quality-filtered sequence variants identified by GATK HaplotypeCaller in the LHM hemiclones, and in the in-house reference line, are available from the NCBI dbSNP, https://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewBatch.cgi?sbid=1062461, handle: MORROW_EBE_SUSSEX. In compliance with NCBI dbSNP criteria, variants >50bp in length, multi-allelic indels, and variants with a null alternate allele are excluded. More extensive genotype data (unfiltered, quality-filtered, and formatted for NCBI dbSNP) are available at https://doi.org/10.5281/zenodo.15927217. Also included is the code used for variant discovery and genotyping, quality-filtering and formatting, alongside run logs and quality-control data. Further filtering of this dataset may be necessary to remove localised areas of artefact SNPs in single samples. We have also released a gvcf genotypes file which contains an ’all-sites’ record of the sample genotypes, available for download at https://doi.org/10.5281/zenodo.198880.
Records of quality-filtered variants detected by GenomeStrip, and variants >50bp detected by Haplotype Caller are publicly-available at NCBI dbVar, accession number nstd134, https://www.ncbi.nlm.nih.gov/dbvar/studies/nstd134/. Unfiltered and filtered genotype data, code for CNV discovery and genotyping using Genomestrip/2.0, run logs, and summary data are publicly-available at https://doi.org/10.5281/zenodo.15947217.
Run code and logs for performing the genotyping using Haplotype Caller and Genomestrip when three samples are replaced by hemiclones from the same line, code for comparing the genotype calls between pairs of hemiclones, and results tables are located at https://doi.org/10.5281/zenodo.160539.
Input data, code and logs for generating the figures and tables used in this manuscript are located at https://doi.org/10.5281/zenodo.159282. The LATEXcode for generating the flow-diagram for Figure 2 is available from https://doi.org/10.5281/zenodo.168582. Code and logs for the generation of the input data is provided in the data releases pertaining to each process.
EM conceived and supervised the experiment. EM, TP, IF, MW and WG designed the experiment. TP and IF established and maintained the lines, and carried out the DNA extractions. WG analysed the sequencing and genotype data. WG and MW developed the read-mapping and variant-calling procedures. WG and EM wrote the manuscript.
Funding was provided to EM by a Royal Society University Research Fellowship, the Swedish Research Council (No. 2011-3701), and by the European Research Council (No. 280632).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Sequencing was performed under contract by Exeter University, DNA Sequencing Service (UK), who also provided analysis advice, http://www.exeter.ac.uk/business/facilities/sequencing/. Crucial computational support was provided by Jeremy Maris at the Centre for High-Performance Computing, University of Sussex, http://www.sussex.ac.uk/its/services/research/highperformance. Bob Handsaker (Harvard Medical School, USA) provided analysis advice for use of Genomestrip for structural variant detection.
Supplementary information for: "Whole genome resequencing of a laboratory-adapted Drosophila melanogaster population".
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Competing Interests: Not really a competing interest as such, but I'd like to disclose that I am a member of the group that develops the GATK software so I have strong opinions about how it should be run.
Competing Interests: No competing interests were disclosed.
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||
---|---|---|
1 | 2 | |
Version 3 (revision) 22 Dec 16 |
||
Version 2 (revision) 20 Dec 16 |
||
Version 1 07 Nov 16 |
read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)