Whole genome resequencing of a laboratory-adapted Drosophila melanogaster population sample

As part of a study into the molecular genetics of sexually dimorphic complex traits, we used high-throughput sequencing to obtain data on genomic variation in an outbred laboratory-adapted fruit fly ( Drosophila melanogaster) population. We successfully resequenced the whole genome of 220 hemiclonal females that were heterozygous for the same Berkeley reference line genome (BDGP6/dm6), and a unique haplotype from the outbred base population (LH M). The use of a static and known genetic background enabled us to obtain sequences from whole-genome phased haplotypes. We used a BWA-Picard-GATK pipeline for mapping sequence reads to the dm6 reference genome assembly, at a median depth-of coverage of 31X, and have made the resulting data publicly-available in the NCBI Short Read Archive (Accession number SRP058502). We used Haplotype Caller to discover and genotype 1,726,931 small genomic variants (SNPs and indels, <200bp). Additionally we detected and genotyped 167 large structural variants (1-100Kb in size) using GenomeStrip/2.0. Sequence and genotype data are publicly-available at the corresponding NCBI databases: Short Read Archive, dbSNP and dbVar (BioProject PRJNA282591). We have also released the unfiltered genotype data, and the code and logs for data processing and summary statistics ( https://zenodo.org/communities/sussex_drosophila_sequencing/).


Introduction
As part of a study on the molecular genetics of sexually dimorphic complex traits, we used hemiclonal analysis in conjunction with high-throughput sequencing 1 to characterise molecular genetic variation across the genome, from an outbred laboratory-adapted population of Drosophila melanogaster, LH M 2,3 . The hemiclone experimental design allows the repeated phenotyping of multiple individuals, each with the same unrecombined haplotype on a different random genetic background. This method has been used to investigate standing genetic variation and intersexual genetic correlations for quantitative traits 2 and gene expression 4 , but it has not yet been used to obtain genomic data.
The 220 hemiclone females that were sequenced in the present study have a maternal haplotype, from the dm6 reference assembly strain (BDGP6+ISO1 mito/dm6, Bloomington Drosophila Stock Center no. 2057) 5,6 , and have a different paternal genome each, sampled using cytogenetic cloning from the LH M base population (See Figure 1). All non-reference genotypes in the sequenced LH M hemiclones were expected to be heterozygous and in-phase, except in rare instances where the in-house dm6 reference strain also had the same non-reference allele.
Previous studies indicate that the limits for DNA quantity in nextgeneration sequencing are 50-500ng 7 . We sequenced individual D. melanogaster, rather than pools of clones, because more biological information can be obtained, and because modern transposonbased library preparation allows accurate sequencing at low concentrations of DNA. D. melanogaster is a small insect (1µg) although this problem is off-set by the reduced proportion of repetitive intergenic sequence, and small genome size relative to other insects (170Mb verses 500Mb) 7 .
We mapped reads to the D. melanogaster dm6 reference assembly using a BWA-Picard-GATK pipeline, and called nucleotide variants using both HaplotypeCaller, and Genomestrip, the latter of which detects copy-number variation up to 1Mb in length. A graphic representation of the data analysis pipeline is provided in Figure 2. We have made the mapped sequencing data, and genotype data publicly-available on NCBI, and additionally have made the metadata, analysis code and logs publicly-available on Zenodo. This is the first report of a study which uses methods for detecting both SNPs, indels and structural variants (deletions and duplications >1Kb in length), genome-wide in next-generation sequencing data, and the first report of whole genome resequencing in hemiclonal individuals.

Study samples
The base population (LH M ) was originally established from a set of 400 inseminated females, trapped by Larry Harshman in a citrus orchard near Escalon, California in 1991 3 . It was initially kept at a large size (more than 1,800 reproducing adults) in the lab of William Rice (University College Santa Barbara, USA). In 1995 (approximately 100 generations since establishment) the rearing protocol was changed to include non-overlapping generations, and a moderate rearing density with 16 adult pairs per vial (56 vials in total) during 2 days of adult competition, and 150-200 larvae during the larval competition stage 3 . In 2005, a copy of LH M population sample was transferred to Uppsala University, Sweden (approximately 370 generations since establishment), and in 2012, to the University of Sussex (UK), when the current set of 223 haplotypes were sampled. At the point of sampling we estimate that the population had undergone 545 generations under laboratory conditions, 445 of which had been using the same rearing protocol.
Hemiclonal lines were established by mating groups of five clonegenerator females (C(1)DX,y,f ; T(2;3) rdgC st in ri p P bw D ) with 230 individual males sampled from the LH M base population (see 2). A single male from each cross was then mated again to a group of five clone-generator females in order to amplify the number of individuals harbouring the sampled haplotype. Seven lines failed to become established at this point. The remaining 223 lines were maintained in groups of up to sixteen stock hemiclonal males in two vials that were transferred to fresh vials each week. Stock hemiclonal males were replenished every six weeks by mating with groups of clonegenerator females. A stock of reference genome flies (Bloomington Drosophila Stock Center no. 2057) was established and maintained initially using five rounds of sib-sib matings before expansion. 223 virgin reference genome females were then collected and mated to a single male from each of 223 hemiclonal lines. Female offspring from this cross therefore have one copy of the reference genome and one copy of the hemiclonal haplotype. Groups of these hemiclonal females were collected as virgins, placed in 99% ethanol and stored at -20°C prior to DNA extraction.

DNA extraction
One virgin female per hemiclonal line, was homogenised with a microtube pestle, followed by 30-minute mild-shaking incubation in proteinase K. DNA was purified using the DNeasy Blood and Tissue Kit (Qiagen, Valencia, CA), according to manufacturer's instructions. Volumes were scaled-down according to mass of input material. Barrier pipette tips were used throughout, in order to minimise cross-contamination of DNA. Template assessment using the Qubit BR assay (Thermo Fischer, NY, USA) indicated double-stranded DNA, 10.4Kb in length at concentrations of 2-4 ng/µl (total quantity 50-100ng).

Whole-genome resequencing
Sequencing was performed under contract by Exeter Sequencing service, University of Exeter, UK. The sonication protocol for shearing of the DNA was optimised for low concentrations to generate fragments 200-500bp in length. Libraries were prepared and indexed using the Nextera Library Prep Kit (Illumina, San Diego,

Amendments from Version 2
In this version, a mistake in the title of the Version 2 has been corrected -the words 'population sample' have been added to the title.

REVISED
USA). All samples were sequenced on a HiSeq 2500 (Illumina), with five individuals per lane. We also sequenced DNA from two individuals from the in-house reference line (Bloomington Drosophila Stock Centre no. 2057). One was prepared as the hemiclones, using the Illumina Nextera library (sample RGil), and the other using an older, Illumina Nextflex method (sample RGfi). The median number of read pairs across all samples was 29.23×10 6 (IQR 14.07×10 6 ). Quality metrics for the sequencing data were generated with FastQC v0.10.0 by Exeter Biosciences, and used to determine whether results were suitable for further analyses. For twelve samples with less than 8×10 6 reads, sequencing was repeated successfully (H006, H041, H061, H084, H086, H087, H092, H098, H105), with a further three samples omitted entirely (H015, H016, H136), leaving 220 hemiclonal samples in total. As shown in Figures 3A and 3B, the read quality score and quality-per-base for the samples taken forward for genotyping in this study were well within acceptable standards, and similar across all samples.

Read mapping
Raw data (fastq files) were stored and processed in the Linux Sun Grid Engine in the High-Performance Computing facility, University of Sussex. Adaptor sequences (Illumina Nextera N501-H508 and N701-N712), poor quality reads (Phred score <7) and short reads were removed using Fastq-mcf (ea-utils v.1.1.2). Settings were: log-adapter minimum-length-match: 2.2, occurrence threshold before adapter clipping: 0.25, maximum adapter difference: 10%, minimum remaining length: 19, skew percentageless-than causing cycle removal: 2, bad reads causing cycle population (open chromosomes) was crossed to five clone generator female, which harbour a fused double-X chromosome (DX), a Y chromosome, as well as a marked translocation of chromosomes 2 and 3 (hatched chromosomes; see text for full genotype). This cross captured a single wild-type haplotype. 2 . Amplify -A single heterozygous male was then crossed again to a group of clone-generator females to amplify the number stock hemiclonal males. This cross can be repeated, to replenish the stock hemiclone males. 3 Express -Finally, a single reference genome female (filled chromosomes) was crossed to a single stock hemiclonal male to produce a female that harbours the original wild-type haplotype (excluding the 4th dot chromosome, which remains uncontrolled throughout), in a reference genome genetic background. Cytoplasmic factors in the offspring also derive from the reference genome stock (black outline). DNA is extracted from a single hemiclone female from each line and sequenced.
Cleaned sequence reads were mapped to the D. melanogaster genome assembly, release 6.0 (Assembly Accession GCA_000001215.4 6 ) using Burrows-Wheeler Aligner mem (version 0.7.7-r441) 8 , with a mapping quality score threshold of 20. Remaining reads were remapped using Stampy v1.0.24, which is slower but more precisely maps reads which are divergent from the reference genome assembly 9 . This method was used previously for the Drosophila Genome Nexus 10 . Removal of duplicate reads, indexing and sorting was performed with Picard-Tools v1.77. Re-mapping of sequence reads around insertion-deletion polymorphisms was performed using Genome Analysis Tool-Kit (GATK) v3.2.2, as a recommended standard practice 11 . The median depth of coverage across all samples used for genotyping was 31X (IQR 14, see Figure 3C). As shown in Figure 3D, the mean nucleotide mis-match rate to the dm6 reference assembly for the LH M hemiclones was 3.27×10 -3 per PCR cycle (IQR 0.2×10 -3 ), contrasting with the two reference line samples for which the mis-match rate was 0.89-1.10×10 -3 per cycle. We observed spikes of nucleotide mismatches in some PCR cycles for some samples, which are likely to be errors rather than true sequence variation.
We used hard-filtering to remove variants generated by error, because the alternative 'variant recalibration' requires prior information on variant positions from a similar population or parents. Quality filtering thresholds were decided following inspection of the various sequencing metrics associated with each variant locus, and by software developers' recommendations 12 . The filtering thresholds were: Quality-by-depth >2, strand bias <50 (Phred-scaled p-value from Fisher's Exact test), mapping quality >58, mapping quality rank sum >-7.0, read position rank sum >-5.0, combined read depth <15000, and call rate >90%. This filtering removed 167,319 variants (8.3%), leaving 1,829,237. Summary values for the variant quality metrics are shown in Table 1. Distributions of quality metrics for Haplotype Caller variants are shown in Supplementary  Figure 2. The density of sequence variants, measured as the median for windows of 10Kb in length across the genome, was 75 per for biallelic SNPs, 1 for multi-allelic SNPs, 6 for bi-allelic indels, and Colouring corresponds to the order which which the samples were originally sequenced. D: Mis-matches to the dm6 reference genome assembly, by PCR cycle-number. Colouring is by sample as in plot C. The two red lines with visibly-lower mismatch rates than the others correspond to the two in-house BDGP/dm6 reference lines that were sequenced. Data and R code for this figure are located at https://doi.org/10.5281/ zenodo.159282.
3 for multi-allelic indels (see Figure 4A). Mean separation between variants of any type or allele frequency was 78bp. As shown in Figure 4B the allele frequency distribution for bi-allelic SNPs and indels was similar, and broadly within expectations for an out-bred diploid population sample. The two in-house reference line individuals had 515 homozygous and 3171 heterozygous mutations from the reference assembly. The median genotype counts for the 220 LH M hemiclone individuals, were 585 homozygous, 728,214 heterozygous and 4963 no-call (IQR 400, 36707 and 7876). Genotype counts for each individual are shown in Figure 4C.
For data submission to NCBI dbSNP, we were obliged to exclude 44,644 indels that were multi-allelic or greater than 50bp in length, and a further 57,662 SNPs and indels situated within deletions. Variants greater than 50bp in length were submitted to the NCBI structural variant database dbVar. The genotype data submitted to dbSNP consists of 1,726,931 quality-filtered, functionallyannotated variant records (1,423,039 SNPs and 303,892 short, biallelic insertion and deletion variants) corresponding to 383,378,682 individual genotype calls.

Structural-variant detection methods
Large genomic variants -deletions and duplications, between 1Kb and 100Kb in length -were detected and genotyped using Genom-eStrip v2.0 14 . One of the reference strain individuals (sample RGfi) was omitted from the this analysis because a different sequencing library preparation method was used from the other samples (see above). We included the following settings (according to developers' guidelines): Sex-chromosome and k-mer masking when estimating sequencing depth, computation of GC-profiles and read counts, and reduced insert size distributions. Large variant discovery and genotyping was performed only on chromosomes 2, 3, 4 and X, omitting the mitochondrial genome and unmapped scaffolds.
We used the Genomestrip CNV Discovery pipeline with the settings: minimum refined length 500, tiling window size 1000, tiling window overlap 500, maximum reference gap length 1000, boundary precision 100, and genotyped the results with the GenerateHap-loidGenotypes R script (genotype likelihood threshold 0.001, R version 3.0.2). Following visualisation of the genotype results and comparison with the bam sequence alignment files using the Integrated Genomics Viewer (IGV) v2.3.72 15 , we excluded telomeric and centromeric regions where the sequencing coverage was fragmented, and six regions of multi-allelic gains of copy-number with dispersed breakpoints, previously reported to undergo mosaic in vivo amplification prior to oviposition 16 (see Supplementary Table 1 for genomic positions, and Supplementary Figure 3 for visualisation of in vivo amplification in a sequence alignment file). We excluded 6 samples (H082, H083, H090, H097, H098, H153) for which 80-90% of the genome was reported by Genomestrip to contain structural variation, which we regarded as error. Most these samples were grouped by the order in which they were processed for DNA extraction and sequencing, so this may have been caused partly by a batch-effect leading to differences in read pair separation, depth-of-coverage, and response to normal fluctuations in GC-content. Following removal of these samples, there were 2897 CNVs (1687 deletions, 877 duplications, and 333 of the 'mixed' type), ranging in size from 1000bp to 217,707bp. We observed eight regions, for which Genomestrip identified multiple adjacent CNVs in single individuals, but which are likely single CNVs, 100Kb to 1.3Mb in length (Supplementary Table 2).
Quality-filtering for structural variants detected by Genomestrip analysis of whole-genome resequencing data are not thoroughly established. We visually inspected, in the bam read alignment files using the Integrated Genomics Viewer 15 , reported structural variants which were most likely to be artefacts. Specifically these were variants with: i) Extreme values for quality-score, GC-content or cluster separation, ii) Any homozygous non-reference genotypes (not expected with our breeding design), iii) Type 'mixed'. Following this, we used the following criteria for quality filtering: Quality score >15, cluster separation <17, GC-fraction >0.33, no mixed types (deletions and duplications only), homozygous nonreference genotype count >0, and heterozygous genotype count <200. Summaries of the quality metrics for quality-filtered data are shown in Table 2 and Supplementary Figure 2. We applied an upper limit to the cluster separation to remove groups of outliers in the upper end of the distribution, although this may have excluded many true, low-frequency variants. However, data on rare variants are not directly useful for our further investigations.
After filtering, 167 CNVs remained (78 deletions and 89 duplications, size range 1Kb-26.6Kb). The positions and genotypes of these CNVs for each individual are shown in Figure 5. The genotype data for quality-filtered CNVs were combined with the data from 2252 indels >50bp from the Haplotype Caller pipeline, and a total of 2419 variants were uploaded to the public database on structural variation, NCBI dbVar. Although we have used methods for detecting SNPs, indels and CNVs, variants between 200bp and 1Kb are not reported by either HaplotypeCaller or Genomestrip. Additionally, sequence inversions are not detected by these methods and the upper limits to CNV detection using Genomestrip, based on the parameters and results of this study, are 100Kb-1Mb.  Values show the total number of variants, median (and IQR) for each metric. Data generated from vcf file using GATK VariantsToTable, on the quality-filtered data. *No CNVs in the quality-filtered samples had a 'no-call' or homozygous nonreference genotype.

Dataset validation
Initial validation of our methods can be seen by lack of variants in the two reference line individuals compared with the LH M hemiclones (3,686 verses a median of 728,799 per sample). For a more thorough test of the genotyping and hemiclone method reproducibility, we sequenced an additional hemiclone individual from three of the LH M lines, and mapped the reads to the reference genome assembly as before. For HaplotypeCaller, we generated gVCF files for each sample, and then performed genotyping and qualityfiltering as described above, except that the original three samples were replaced with the replication test samples. Similarly, for Genomestrip, we performed structural variant discovery and genotyping on all of the same samples as before, replacing three original samples with the replication test samples. We then used the GATK Genotype Concordance function to generate counts of genotype differences between the three pairs of samples. Overall results are presented in Table 3. Genotype reproducibility for quality-filtered biallelic SNPs was 98.5-99.5%, going down to 89.1-93.2% for filtered multi-allelic indels. Reproducibility of structural variant genotype calls was 95.6-100.0%, although we noted that for one individual (H119) filtering actually reduced the reproducibility rate from 99.7% to 95.6%. Full code, logs and numerical results can be found at http://doi.org/10.5281/zenodo.160539.
Although these results indicate that our genotype accuracy is very good, there are several caveats to consider. In the quality-filtered small-variant data, seven samples (H034, H035, H040, H038, H039, H188, H174) had prominently higher genotype drop-out rates than the others (of 2-7%), as well as a higher proportion of homozygous non-reference genotypes (2-4%; See Figure 4C). Additionally two samples had prominently more heterozygous variants (H072:885,551 and H093:955,148 verses the other LH M hemiclones: mean 710,934).
Although the genotype replication rate for the structural variants was also very high, we cannot exclude the possibility that, due to incomplete masking of hard-to-sequence regions of the reference assembly, variants which are artefacts reported in the original genotype data, may also be present in the replication genotype data.

Data availability
All publicly-available records are for 220 LH M hemiclone individuals and 2 in-house reference line individuals, with the exception of the large-variant data for which one in-house reference line sample and six LH M hemiclones were omitted. The NCBI BioProject identifier is PRJNA282591. Code, logs and quality control data for each dataset, and for generating the figures and tables in this   Overall this is a very solid technical note. The methods seem sound, the descriptions are fairly straightforward, and caveats are properly acknowledged. The authors have provided detailed descriptions of what was done (including software tool versions) and made data and code available to reproduce not only the dataset but also all figures in the paper itself.
Regarding the experimental design, I think it's great that Gilks et al. chose to perform sequencing on individual flies rather than pooled samples. It takes some extra effort to deal with the small amounts of starting material involved, but the resulting dataset is that much more valuable.
It's also nice to see a study looking at CNVs and short variants together. As tools in this space improve and enable greater integration I look forward to seeing more analysis of how the different variant types relate to each other (e.g. looking at which short variants might be amplified or suppressed by CNV events).

Request for additional figures
I would recommend including diagrams of the hemiclonal experimental design and of the analysis workflows to maximize clarity. In particular, I think it could be made more obvious that the HaplotypeCaller workflow was run using the GVCF pathway for joint analysis.

Minor comments
I prefer "high-throughput sequencing" to "next generation sequencing" (this technology was "next-gen" ten years ago, now it's just the current standard).
On page 3, does "Fine mapping" refer to realignment around indels or equivalent processes?
On page 4, I would express "strand bias" as "FisherStrand estimation of strand bias" to avoid ambiguity with other estimators like Strand Odds Ratio, SOR).