Whole genome resequencing of a laboratory-adapted population sample Drosophila melanogaster

As part of a study into the molecular genetics of sexually dimorphic complex traits, we used next-generation sequencing to obtain data on genomic variation in an outbred laboratory-adapted fruit fly (Drosophila melanogaster) population. We successfully resequenced the whole genome of 220 hemiclonal females that were heterozygous for the same Berkeley reference line genome (BDGP6/dm6), and a unique haplotype from the outbred base population (LHM). The use of a static and known genetic background enabled us to obtain sequences from whole genome phased haplotypes. We used a BWA-Picard-GATK pipeline for mapping sequence reads to the dm6 reference genome assembly, at a median depth of coverage of 31X, and have made the resulting data publicly-available in the NCBI Short Read Archive (Accession number SRP058502). We used Haplotype Caller to discover and genotype 1,726,931 small genomic variants (SNPs and indels, <200bp). Additionally we detected and genotyped 167 large structural variants (1-100Kb in size) using GenomeStrip/2.0. Sequence and genotype data are publicly-available at the corresponding NCBI databases: Short Read Archive, dbSNP and dbVar (BioProject PRJNA282591). We have also released the unfiltered genotype data, and the code and logs for data processing and summary statistics ( ). https://zenodo.org/communities/sussex_drosophila_sequencing/ William P. Gilks ( ), Edward H. Morrow ( ) Corresponding authors: wpgilks@gmail.com ted.morrow@sussex.ac.uk Gilks WP, Pennell TM, Flis I How to cite this article: et al. Whole genome resequencing of a laboratory-adapted Drosophila melanogaster 2016, :2644 (doi: ) population sample [version 1; referees: 2 approved] F1000Research 5 10.12688/f1000research.9912.1 © 2016 Gilks WP . This is an open access article distributed under the terms of the , which Copyright: et al Creative Commons Attribution Licence permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Funding was provided to EM by a Royal Society University Research Fellowship, the Swedish Research Council (No. Grant information: 2011-3701), and by the European Research Council (No. 280632). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: No competing interests were disclosed. 07 Nov 2016, :2644 (doi: ) First published: 5 10.12688/f1000research.9912.1 1 1 1 2


Introduction
As part of a study on the molecular genetics of sexually dimorphic complex traits, we used hemiclonal analysis in conjunction with next-generation sequencing to characterise molecular genetic variation across the genome, from an outbred laboratory-adapted population of Drosophila melanogaster, LH M 1,2 . The hemiclone experimental design allows the repeated phenotyping of multiple individuals, each with the same unrecombined haplotype on a different random genetic background. This method has been used to investigate standing genetic variation and intersexual genetic correlations for quantitative traits 1 and gene expression 3 , but it has not yet been used to obtain genomic data.
The 220 hemiclone females that were sequenced in the present study have a maternal haplotype, from the dm6 reference assembly strain (BDGP6+ISO1 mito/dm6, Bloomington Drosophila Stock Center no. 2057) 4,5 , and have a different paternal genome each, sampled using cytogenetic cloning from the LH M base population. All non-reference genotypes in the sequenced LH M hemiclones were expected to be heterozygous and in-phase, except in rare instances where the in-house dm6 reference strain also had the same nonreference allele.
Previous studies indicate that the limits for DNA quantity in next-generation sequencing are 50-500ng 6 . We sequenced individual D.melanogaster, rather than pools of clones, because more biological information can be obtained, and because modern transposon-based library preparation allows accurate sequencing at low concentrations of DNA. D. melanogaster is a small insect (∼1μg) although this problem is off-set by the reduced proportion of repetitive intergenic sequence, and small genome size relative to other insects (170Mb verses ∼500Mb), 6 .
We mapped reads to the D. melanogaster dm6 reference assembly using a BWA-Picard-GATK pipeline, and called nucleotide variants using both HaplotypeCaller, and Genomestrip, the latter of which detections copy-number variation up to 1Mb in length. We have made the mapped sequencing data, and genotype data publicly-available on NCBI, and additionally have made the metadata, analysis code and logs publicly-available on Zenodo. This is the first report of a study which uses methods for detecting both SNPs, indels and structural variants (deletions and duplications >1Kb in length), genome-wide in next-generation sequencing data, and the first report of whole genome resequencing in hemiclonal individuals.

Study samples
The base population (LH M ) was originally established from a set of 400 inseminated females, trapped by Larry Harshman in a citrus orchard near Escalon, California in 1991 2 . It was initially kept at a large size (more than 1,800 reproducing adults) in the lab of William Rice (University College Santa Barbara, USA). In 1995 (approximately 100 generations since establishment) the rearing protocol was changed to include non-overlapping generations and a moderate rearing density with 16 adult pairs per vial (56 vials in total) during 2 days of adult competition, and 150-200 larvae during the larval competition stage 2 . In 2005, a copy of LH M population sample was transferred to Uppsala University, Sweden (approximately 370 generations since establishment), and in 2012, to the University of Sussex (UK), when the current set of 223 haplotypes were sampled. At the point of sampling we estimate that the population had undergone 545 generations under laboratory conditions, 445 of which had been using the same rearing protocol.
Hemiclonal lines were established by mating groups of five clone-generator females (C(1)DX,y,f ; T(2;3) rdgC st in ri p P bw D ) with 230 individual males sampled from the LH M base population (see 1). A single male from each cross was then mated again to a group of five clone-generator females in order to amplify the number of individuals harbouring the sampled haplotype. Seven lines failed to become established at this point. The remaining 223 lines were maintained in groups of up to sixteen stock hemiclonal males in two vials that were transferred to fresh vials each week. Stock hemiclonal males were replenished every six weeks by mating with groups of clone-generator females. A stock of reference genome flies (Bloomington Drosophila Stock Center no. 2057) was established and maintained initially using five rounds of of sib-sib matings before expansion. 223 virgin reference genome females were then collected and mated to a single male from each of 223 hemiclonal lines. Female off-spring from this cross therefore have one copy of the reference genome and one copy of the hemiclonal haplotype. Groups of these hemiclonal females were collected as virgins, placed in 99% ethanol and stored at -20°C prior to DNA extraction.

DNA extraction
One virgin female per hemiclonal line, was homogenised with a microtube pestle, followed by 30-minute mild-shaking incubation in proteinase K. DNA was purified using the DNeasy Blood and Tissue Kit (Qiagen, Valencia, CA), according to manufacturer's instructions. Volumes were scaled-down according to mass of input material. Barrier pipette tips were used throughout, in order to minimise cross-contamination of DNA. Template assessment using the Qubit BR assay (Thermo Fischer, NY, USA) indicated double-stranded DNA, 10.4Kb in length at concentrations of 2-4ng/μl (total quantity 50-100ng).

Whole-genome resequencing
Sequencing was performed under contract by Exeter Sequencing service, University of Exeter, UK. The sonication protocol for shearing of the DNA was optimised for low concentrations to generate fragments 200-500bp in length. Libraries were prepared and indexed using the Nextera Library Prep Kit (Illumina, San Diego, USA). All samples were sequenced on a HiSeq 2500 (Illumina), with five individuals per lane. We also sequenced DNA from two individuals from the in-house reference line (Bloomington Drosophila Stock Centre no. 2057). One was prepared as the hemiclones, using the Illumina Nextera library (sample RGil), and the other using an older, Illumina Nextflex method (sample RGfi). The median number of read pairs across all samples was 29.23×10 6 (IQR 14.07×10 6 ). Quality metrics for the sequencing data were generated with FastQC v0.10.0 by Exeter Biosciences, and used to determine whether results were suitable for further analyses. For twelve samples with less than 8×10 6 reads, sequencing was repeated successfully (H006, H041, H061, H084, H086, H087, H092, H098, H105), with a further three samples omitted entirely (H015, H016, H136), leaving 220 hemiclonal samples in total. As shown in Figure 1A & B, the read quality score and quality-per-base for the the samples taken forward for genotyping in this study were were well within acceptable standards, and similar across all samples.
Cleaned sequence reads were mapped to the D. melanogaster genome assembly, release 6.0 (Assembly Accession GCA_ 000001215.4 5 ) using Burrows-Wheeler Aligner mem (version 0.7.7-r441) 7 , with a mapping quality score threshold of 20. Fine mapping was performed with both Stampy v1.0.24 8 and the Genome Analysis Tool-Kit (GATK) v3.2.2 9 (following 10 ). Removal of duplicate reads, indexing and sorting was performed with Picard-Tools v1.77 and SamTools v1.0. The median depth of coverage across all samples used for genotyping was 31X (IQR 14, see Figure 1C). As shown in Figure 1D, the mean nucleotide mis-match rate to the dm6 reference assembly for the LH M hemiclones was 3.27×10 -3 per PCR cycle (IQR 0.2×10 -3 ), contrasting Colouring corresponds to the order which which the samples were originally sequenced. D: Mis-matches to the dm6 reference genome assembly, by PCR cycle-number. Colouring is by sample as in plot C. The two red lines with visibly-lower mismatch rates than the others correspond to the two in-house BDGP/dm6 reference lines that were sequenced. Data and code for this figure is located at https://doi.org/10.5281/ zenodo.159282.
with the two reference line samples for which the mis-match rate was 0.89 -1.10×10 -3 per cycle. We observed spikes of nucleotide mis-matches in some PCR cycles for some samples, which are likely to be errors rather than true sequence variation.
We used hard-filtering to remove variants generated by error, because the alternative 'variant recalibration' requires prior information on variant positions from a similar population or parents. Quality filtering thresholds were decided following inspection of the various sequencing metrics associated with each variant locus, and by software developers' recommendations 11 . The filtering thresholds were: Quality-by-depth >2, strand bias (-log 10. p Fisher ) <50, mapping quality >58, mapping quality rank sum >-7.0, read position rank sum >-5.0, combined read depth <15000, and call rate >90%. This filtering removed 167,319 variants (8.3%), leaving 1,829,237. Summary values for the variant quality metrics are shown in Table 1. Distributions of quality metrics for Haplotype Caller variants are shown in Supplementary Figure 2. The density of sequence variants, measured as the median for windows of 10Kb in length across the genome, was 75 per for biallelic SNPs, 1 for multi-allelic SNPs, 6 for biallelic indels, and 3 for multiallelic indels (see Figure 2A). Mean separation between variants of any type or allele frequency was 78bp. As shown in Figure 2B the allele frequency distribution for bi-allelic SNPs and indels was similar, and broadly within expectations for an out-bred diploid population sample. The two in-house reference line individuals had 515 homozygous and 3171 heterozygous mutations from the reference assembly. The median genotype counts for the 220 LH M hemiclone individuals, were 585 homozygous, 728,214 heterozygous and 4963 no-call (IQR 400, 36707 and 7876). Genotype counts for each individual are shown in Figure 2C.
For data submission to dbSNP, we removed 44,644 indels that were multi-allelic or greater than 50bp in length, and a further 57,662 variants that had null alternate alleles (likely due to being situated within a deletion). The genotype data submitted to dbSNP consists of 1,726,931 quality-filtered, functionally-annotated variant records (1,423,039 SNPs and 303,892 short, biallelic insertion and deletion variants) corresponding to 383,378,682 individual genotype calls.

Structural-variant detection methods
Large genomic variants -deletions and duplications, between 1Kb and 100Kb in length -were detected and genotyped using Genom-eStrip v2.0 13 . One of the reference strain individuals (sample RGfi) was omitted from the this analysis because a different sequencing library preparation method was used from the other samples (see above). We included the following settings (according to developers' guidelines): Sex-chromosome and k-mer masking when estimating sequencing depth, computation of GC-profiles and read counts, and reduced insert size distributions. Large variant discovery and genotyping was performed only on chromosomes 2, 3, 4 and X, omitting the mitochondrial genome and unmapped scaffolds.
We used the Genomestrip CNV Discovery pipeline with the settings: minimum refined length 500, tiling window size 1000, tiling window overlap 500, maximum reference gap length 1000, boundary precision 100, and genotyped the results with the Generate-HaploidGenotypes R script (genotype likelihood threshold 0.001, R version 3.0.2). Following visualisation of the genotype results and comparison with the bam sequence alignment files using the Integrated Genomics Viewer (IGV) v2.3.72 14 , we excluded telomeric and centromeric regions where the sequencing coverage was  fragmented, and six regions of multi-allelic gains of copy-number with dispersed break-points, previously reported to undergo mosaic in vivo amplification prior to oviposition 15 (see Supplementary Table 1 for genomic positions, and Supplementary Figure 3 for visualisation of in vivo amplification in a sequence alignment file). We excluded 6 samples (H082, H083, H090, H097, H098, H153) for which 80-90% of the genome was reported by Genomestrip to contain structural variation, which we regarded as error. Most these samples were grouped by the order in which they were processed for DNA extraction and sequencing, so this may have been caused partly by a batch-effect leading to differences in read pair separation, depth-of-coverage, and response to normal fluctuations in GC-content. Following removal of these samples, there were 2897 CNVs (1687 deletions, 877 duplications, and 333 of the 'mixed' type), ranging in size from 1000bp to 217,707bp. We observed eight regions, for which Genomestrip identified multiple adjacent CNVs in single individuals, but which are likely single CNVs, 100Kb to 1.3Mb in length (Supplementary Table 2).
Quality-filtering for structural variants detected by Genomestrip analysis of whole-genome resequencing data are not thoroughly established. We visually inspected, in the bam read alignment files using the Integrated Genomics Viewer 14 , reported structural variants which were most likely to be artefacts. Specifically these were variants with: i) Extreme values for quality-score, GC-content or cluster separation, ii) Any homozygous non-reference genotypes (not expected with our breeding design), iii) Type 'mixed'. Following this, we used the following criteria for quality filtering: Quality score >15, cluster separation <17, GC-fraction >0.33, no mixed types (deletions and duplications only), homozygous non-reference genotype count >0, and heterozygous genotype count <200. Summaries of the quality metrics for quality-filtered data are shown in Table 2 and Supplementary Figure 2. We applied an upper limit to the cluster separation to remove groups of outliers in the upper end of the distribution, although this may have excluded many true, low-frequency variants. However, data on rare variants are not directly useful for our further investigations.
After filtering, 167 CNVs remained (78 deletions and 89 duplications, size range 1Kb-26.6Kb). The positions and genotypes of these CNVs for each individual are shown in Figure 3. The genotype data for quality-filtered CNVs were combined with the

Dataset validation
Initial validation of our methods can be seen by lack of variants in the two reference line individuals compared with the LH M hemiclones (3,686 verses a median of 728,799 per sample). For a more thorough test of the genotyping and hemiclone method reproducibility, we sequenced an additional hemiclone individual from three of the LH M lines, and mapped the reads to the reference genome assembly as before. For HaplotypeCaller, we generated 'g.vcf' files for each sample, and then performed genotyping and qualityfiltering as described above, except that the original three samples were replaced with the replication test samples. Similarly, for Genomestrip, we performed structural variant discovery and genotyping on all of the same samples as before, replacing three original samples with the replication test samples. We then used the GATK Genotype Concordance function to generate counts of genotype differences between the three pairs of samples. Overall results are presented in Table 3. Genotype reproducibility for quality-filtered biallelic SNPs was 98.5-99.5%, going down to 89.1-93.2% for filtered multi-allelic indels. Reproducibility of structural variant  Although these results indicate that our genotype accuracy is very good, there are several caveats to consider. In the quality-filtered small-variant data, seven samples (H034, H035, H040, H038, H039, H188, H174) had prominently higher genotype drop-out rates than the others (of 2-7%), as well as a higher proportion of homozygous non-reference genotypes (2-4%; See Figure 2C). Additionally two samples had prominently more heterozygous variants (H072:885,551 and H093:955,148 verses the other LH M hemiclones: mean 710,934).
Although the genotype replication rate for the structural variants was also very high, we cannot exclude the possibility that, due to incomplete masking of hard-to-sequence regions of the reference assembly, variants which are artefacts reported in the original genotype data, may also be present in the replication genotype data.

Data availability
All  Overall this is a very solid technical note. The methods seem sound, the descriptions are fairly straightforward, and caveats are properly acknowledged. The authors have provided detailed descriptions of what was done (including software tool versions) and made data and code available to reproduce not only the dataset but also all figures in the paper itself.
Regarding the experimental design, I think it's great that Gilks et al. chose to perform sequencing on individual flies rather than pooled samples. It takes some extra effort to deal with the small amounts of starting material involved, but the resulting dataset is that much more valuable.
It's also nice to see a study looking at CNVs and short variants together. As tools in this space improve and enable greater integration I look forward to seeing more analysis of how the different variant types relate to each other (e.g. looking at which short variants might be amplified or suppressed by CNV events).

Request for additional figures
I would recommend including diagrams of the hemiclonal experimental design and of the analysis workflows to maximize clarity. In particular, I think it could be made more obvious that the HaplotypeCaller workflow was run using the GVCF pathway for joint analysis.

Minor comments
I prefer "high-throughput sequencing" to "next generation sequencing" (this technology was "next-gen" ten years ago, now it's just the current standard).
On page 3, does "Fine mapping" refer to realignment around indels or equivalent processes?
On page 4, I would express "strand bias" as "FisherStrand estimation of strand bias" to avoid ambiguity with other estimators like Strand Odds Ratio, SOR).