The complete genome sequence of elite bread wheat cultivar, “Sonmez”

High-yielding crop varieties will become critical in meeting the future food demand in the face of worsening weather extremes and threatening biotic stressors. The bread wheat cultivar Sonmez-2001 is a registered variety that is notable for its performance under low-irrigation conditions, which further improves upon irrigation. Additionally, Sonmez-2001 is resilient against certain biotic stressors, particularly soil-borne pathogens. Here, we provide a reference-guided whole genome sequence of Sonmez-2001, assembled into 21 chromosomes of the A, B and D genomes and totaling 13.3 gigabase-pairs in length. Additionally, a de novo assembly of an additional 1.05 gigabase-pairs was generated that represents either Sonmez-specific sequences or sequences that considerably diverged between Sonmez and Chinese Spring. Within this de novo assembly, we identified 35 gene models, of which 11 were high-confidence, that may contribute to the favorable traits of this high-performing variety. We identified up to 24 million sequence variants, of which up to 2.4% reside in coding sequences, that can be used to develop molecular markers that should be of immediate use to the cereal community.


Introduction
Triticum aestivum cv. Sonmez-2001 (Sonmez, hereafter) is a registered, elite bread wheat variety that has been bred particularly for drylands. Accordingly, Sonmez exhibits remarkable tolerance against drought and performs considerably better than its ancestor, Bezostaya-1, in terms of yield, stress tolerance and disease resistance. The Sonmez variety is notable for high yield and grain quality, building up to ≈15% protein content, under rain-fed conditions, both of which further improve with supplemental irrigation. Sonmez is also highly resistant against causal agents of devastating diseases, in particular, cereal cyst nematode and yellow rust. Sonmez has superior resistance against soil-borne pathogens and exhibits good tolerance against diseases affecting leaves and inflorescence. Due to these attributes, Sonmez is the cultivar of choice for most of the Central Anatolian Plateau. Facing a fast-growing world population, estimated to reach over 9 billion people in the next three decades, and changing climate trends with destructive effects on agriculture, securing the food demand of upcoming generations will require extensive improvements in crop yields. With cereals being the staple food for the developing world, Sonmez is a promising candidate that can contribute to meeting this demand. Here, we report a reference-guided sequence of the Sonmez genome, and its comparative analysis with the reference genotype, Triticum aestivum Chinese Spring (CS), for which extensive data, including a high-quality genome sequence, are available.
Sequence variations, including single nucleotide polymorphisms (SNPs) and insertion/deletion polymorphisms (indels), were called by BCFtools v1.3.1 on pileups generated by SAMtools v1.3.1 [4]. Homozygous SNP and indel variants were identified using GATK's SelectVariants to retain only variants with no support for the CS reference allele at a series of read depth thresholds (1, 5, 10, 20, 30 and 40). The intersect tool of BEDTools v2.26.0 was used to identify overlaps between gene annotation coordinate ranges and the identified variants. Homozygous variants were analysed by SnpEff v4.3i [5] to estimate their impact in the context of the CS RefSeq v1.0 High Confidence gene annotations, excluding intergenic regions (-no-intergenic). Using all identified homozygous variants, we recalled the reference to generate a "Sonmez genome sequence v1.0". Where there was no coverage of the CS reference, we soft-masked the Sonmez genome sequence. It should be noted that these soft-masked bases could represent regions that are either deletions in Sonmez or insertions in CS.
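The filtering and recalling steps described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (which relied on BCFtools, GATK and related tools); the `Variant` record, the function names and the toy sequence are hypothetical, and the sketch assumes that each depth threshold applies to the number of reads supporting the alternate allele. Only single-base substitutions are applied, because indels would shift coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical, simplified variant record (not a VCF parser)."""
    chrom: str
    pos: int        # 0-based position on the reference
    ref: str
    alt: str
    ref_depth: int  # reads supporting the reference allele
    alt_depth: int  # reads supporting the alternate allele


def homozygous_variants(variants, min_depth):
    """Keep variants with no reference-allele support and enough alt reads."""
    return [v for v in variants
            if v.ref_depth == 0 and v.alt_depth >= min_depth]


def recall_reference(ref_seq, snps, covered):
    """Substitute SNP alleles into ref_seq and soft-mask uncovered bases.

    covered holds one boolean per reference base (True = at least one read).
    Uncovered bases are written in lowercase, i.e. soft-masked.
    """
    seq = list(ref_seq.upper())
    for v in snps:
        # Assumption: only 1 bp substitutions are handled in this sketch.
        if len(v.ref) == 1 and len(v.alt) == 1:
            seq[v.pos] = v.alt.upper()
    return "".join(b if c else b.lower() for b, c in zip(seq, covered))


# Toy data, invented for illustration only.
calls = [
    Variant("chr1A", 2, "A", "G", ref_depth=0, alt_depth=12),
    Variant("chr1A", 5, "C", "T", ref_depth=3, alt_depth=20),  # ref support
    Variant("chr1A", 7, "T", "C", ref_depth=0, alt_depth=4),
]
# One filtered set per depth threshold, as in the text.
per_threshold = {d: homozygous_variants(calls, d) for d in (1, 5, 10)}
recalled = recall_reference("ACATGCTTAA", per_threshold[1],
                            covered=[True] * 8 + [False] * 2)
```

In this toy case the second call is dropped at every threshold because three reads still support the reference allele, and the last two bases of the recalled sequence are lowercase because no read covers them.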
Finally, the read pairs that remained unmapped following the two-step alignment approach were assembled de novo to uncover Sonmez-specific genomic contigs. k-mers of length 71 bp and occurring ≥ 9 times in the unmapped reads were extracted using KMC v3.0.1 [6]. These extracted k-mers were assembled into contigs using the kextend command of merutensils v0.7.15; contigs < 250 bp in length were filtered out. This assembly approach ensures that contig extension only occurs if there is an unambiguous 1 bp extension possible in the input k-mer data set. Methylobacterium species are well-documented, common contaminants of reagents used in Illumina sequencing. As such, contigs showing high sequence identity to one of several Methylobacterium genomes (NZ_CP006992.1, NC_010511.1, NZ_CP017640.1, CP001029.1, AP014813.1, AP014810.1) or phiX (NC_001422.1) were also filtered out. These de novo assembled sequences are referred to as "Sonmez-specific contigs" hereafter.
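The idea of extending a contig only when a single 1 bp extension is possible can be sketched in Python as follows. This is a toy illustration of the principle, not a reimplementation of KMC or of the merutensils kextend command, whose internals are not described here. The function names are hypothetical, reverse complements are ignored for brevity, and k is assumed to be at least 2.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def extend(seed, kmers, k, used):
    """Grow seed in both directions while exactly one unused k-mer fits."""
    contig = seed
    used.add(seed)
    while True:  # extend to the right
        suffix = contig[-(k - 1):]
        nxt = [suffix + b for b in "ACGT" if suffix + b in kmers]
        if len(nxt) != 1 or nxt[0] in used:
            break  # dead end, ambiguous branch, or k-mer already placed
        used.add(nxt[0])
        contig += nxt[0][-1]
    while True:  # extend to the left
        prefix = contig[:k - 1]
        prv = [b + prefix for b in "ACGT" if b + prefix in kmers]
        if len(prv) != 1 or prv[0] in used:
            break
        used.add(prv[0])
        contig = prv[0][0] + contig
    return contig


def assemble(reads, k, min_count, min_len):
    """Assemble contigs from k-mers seen at least min_count times."""
    counts = count_kmers(reads, k)
    solid = {kmer for kmer, c in counts.items() if c >= min_count}
    used, contigs = set(), []
    for seed in sorted(solid):  # sorted for reproducible output
        if seed not in used:
            contig = extend(seed, solid, k, used)
            if len(contig) >= min_len:
                contigs.append(contig)
    return contigs


# Toy reads: two haplotypes that diverge after "AAACCC", plus one rare read.
toy_reads = ["AAACCCG"] * 3 + ["AAACCCT"] * 3 + ["GGGGTTTT"]
toy_contigs = assemble(toy_reads, k=4, min_count=3, min_len=5)
```

With the parameters reported in the text, the call would correspond to k=71, min_count=9 and min_len=250. In the toy example, extension stops at the point where two different bases could follow, which is why a conservative approach of this kind tends to produce short but unambiguous contigs.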

Results
In total, 13.3 Gbp (91.51%) of the 14.5 Gbp CS reference genome assembly were covered by Sonmez reads, with a mean depth of coverage of ≈50×, enabling an almost complete, first construction of the Sonmez genome. Additionally, sequences that are either unique to Sonmez (e.g. introgressions) or significantly divergent compared to CS were used to build a de novo assembly. This assembly totaled 1.05 Gbp in length, with the longest contig being 15,887 bp (N50 = 427 bp, N90 = 269 bp). An updated version (v5.3p01) of the TriAnnot pipeline [7], optimized for wheat, was used to generate similarity-based and ab initio gene models and to annotate repetitive elements on contigs longer than 10 kilobases. Although the de novo assembly was highly fragmented compared to the recalled Sonmez genome, we were still able to identify 35 gene models, of which 11 were high-confidence (Extended data [8]).
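For readers unfamiliar with the contiguity statistics quoted above, the following Python sketch shows one common way of computing N50 and N90 from a list of contig lengths. The function name `nxx` and the example lengths are hypothetical and are not taken from the Sonmez assembly.

```python
def nxx(lengths, percent):
    """Return the Nxx statistic for a collection of contig lengths.

    Nxx is the length of the shortest contig in the smallest set of
    longest contigs that together cover at least `percent` % of the
    total assembly length (percent=50 gives N50, percent=90 gives N90).
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the cutoff.
        if running * 100 >= percent * total:
            return length
    raise ValueError("lengths must contain at least one contig")


# Invented contig lengths (bp), for illustration only.
example_lengths = [2, 3, 4, 5, 6]
n50 = nxx(example_lengths, 50)
n90 = nxx(example_lengths, 90)
```

An N50 close to the minimum contig length retained in an assembly indicates that most of the assembly consists of contigs only slightly longer than the length cutoff.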
We identified between 3.15 and 23.96 million variants, depending on the coverage threshold used, of which between 0.03% and 3.23% were indel variants (Extended data [9,10]). We found that 1.47% to 2.39% of all variants fell within the RefSeq v1.0 High Confidence gene annotations (Extended data [9]). Of these, approximately 40% fell within coding regions. Of the homozygous variants supported by ≥ 5 reads, we observed approximately one variant per 500 bp in the A and B genomes and approximately one variant per 4,000 bp in the D genome.
Here, we present the complete genome of the elite wheat variety Sonmez, notable for its performance under low-irrigation conditions. In the face of climatic extremes and other factors that challenge the food security of upcoming generations, genome sequences of multiple genotypes, varieties and close relatives will not only help us understand complex traits, such as yield and stress responses, but also enable us to efficiently explore the genetic diversity within germplasms for favorable genotypes and/or traits for crop improvement through the use of molecular tools.

Data availability
Underlying data
Sonmez complete genome sequence v1.0 and de novo assembly are available from the dedicated URGI database.
Interestingly, the Sonmez cultivar is resilient against certain biotic stressors, particularly soil-borne pathogens. In addition, it is notable for its performance under low-irrigation conditions. Therefore, this work is particularly relevant given the current trend of global warming and climate change, including worldwide drought. Comparison of wheat genomes will help to decipher complex traits, such as responses to abiotic and biotic stresses. This should allow the development of molecular markers for wheat breeding. These developments will help to address future food demand for a growing worldwide population.