First draft genome assembly of the Argane tree ( Argania spinosa)

Background: The Argane tree ( Argania spinosa L. Skeels) is an endemic tree of mid-western Morocco that plays an important socioeconomic and ecologic role for a dense human population in an arid zone. Several studies confirmed the importance of this species as a food and feed source and as a resource for both pharmaceutical and cosmetic compounds. Unfortunately, the argane tree ecosystem is facing significant threats from environmental changes (global warming, over-population) and over-exploitation. Limited research has been conducted, however, on argane tree genetics and genomics, which hinders its conservation and genetic improvement. Methods: Here, we present a draft genome assembly of A. spinosa. A reliable reference genome of A. spinosa was created using a hybrid de novo assembly approach combining short and long sequencing reads. Results: In total, 144 Gb Illumina HiSeq reads and 7.6 Gb PacBio reads were produced and assembled. The final draft genome comprises 75 327 scaffolds totaling 671 Mb with an N50 of 49 916 kb. The draft assembly is close to the genome size estimated by k-mers distribution and covers 89% of complete and 4.3 % of partial Arabidopsis orthologous groups in BUSCO. Conclusion: The A. spinosa genome will be useful for assessing biodiversity leading to efficient conservation of this endangered endemic tree. Furthermore, the genome may enable genome-assisted cultivar breeding, and provide a better understanding of important metabolic pathways and their underlying genes for both cosmetic and pharmacological.

Any reports and responses or comments on the article can be found at the end of the article.

Introduction
Argania spinosa (L. Skeels) is a tree endemic to the Middle West of Morocco and occupying arid and semi-arid regions totaling up to around 900,000 ha 1 . The argane tree forest was recognized as biosphere reserve (Arganeraie Biosphere Reserve) by UNESCO in 1998 2 . It is the only unique member of the tropical Sapotaceae family in Morocco 3 . In addition to its ecological role in preventing soil erosion and desertification, the argane tree has great cultural and socio-economic importance. The oil extracted from the seed is considered the most expensive edible oil in the world with great cosmetic value and therapeutic potential 4-6 . Argane oil represents a significant source of dietary fatty acids, while the Argane fruit is used as livestock feed by the local population 7-9 . Phytochemical composition of Argane fruits reveals different classes of bioactive compounds, including essential oils, fatty acids, triacylglycerols, flavonoids and their acylglycosyl derivatives, monophenols, phenolic acids, cinnamic acids, saponins, triterpenes, phytosterols, ubiquinone, melatonin, new aminophenols, and vitamin E. Argane oil contains high levels of antioxidant compounds. The long-chain fatty acids in Argane oil are primarily represented by unsaturated oleic acid, then linoleic acid, palmitic acid and stearic acid 10 .
The distribution area of the Argane forest decreased drastically during the 18th century. Furthermore, about 44 % of the forest was again lost between 1970 and 2007. While there are multiple causes, desertification and overgrazing form the main pressures on the Argane forest 11,12 . Therefore, the management and conservation of the remaining genetic resources of Argane forest are urgent priorities. In recent decades, several studies have been conducted to evaluate the genetic diversity of the Argane tree using morphological 13,14 , chemical 10,15 , biochemical 6,16 and standard molecular marker techniques, all with the aim of describing the genetic diversity of Argane trees and addressing ecological and conservation issues [17][18][19][20][21][22][23][24][25] . The karyotype of A. spinosa (L.) is constituted of ten pairs of chromosomes (2n =2x =20) 3 . Until now, no reference genome was available of the A. spinosa species. Here, we present the Argane tree genome assembled from short and long DNA reads using a hybrid assembly strategy.

Plant material
The Argane tree (Argania spinosa, taxid 85883, Sapotaceae, family, order Ericales), named Argane AMGHAR, to be sequenced was selected for its biological and ecological characteristics ( Figure 1). This was a 9-year-old shrub, with weeping form (geotropic, unlike the erect have) with only one main trunk 3 m in height. The ripe fruits has a rounded shape.  (Table 1).

Amendments from Version 1
We appreciate the interest that the editor and reviewers have taken in our manuscript and the constructive criticism they have given. We have addressed all the concerns of the reviewers. These changes have clearly improved our manuscript. We have also included a point-by-point response to the reviewers in addition to making the changes in the manuscript. Our main findings remain unchanged: The final genome comprises 75 327 scaffolds totaling 671 Mb with an N50 of 49 916 kb. The draft assembly is close to the genome size estimated by k-mers distribution and covers 89% of complete and 4.3 % of partial "embryophyta_odb9" lineage orthologous groups in BUSCO. This A. spinosa draft genome will be useful for assessing biodiversity leading to efficient conservation of this endangered endemic tree. Furthermore, the genome may enable genome-assisted cultivar breeding, and provide a better understanding of important metabolic pathways and their underlying genes for both cosmetic and pharmacological purposes.  In silico genome size estimation Trimmed reads from the Illumina platform were subjected to k-mer frequency distribution analysis with JELLYFISH v2.1.4 software 26,27 . Analysis parameters were set at -k 21 and 25, and the final result was plotted as a frequency graph ( Figure 2). Two distinctive modes were observed from the distribution curve: the higher peak at a depth of 44 and reflecting the high heterozygosity of the Argane genome; the lower peak provided a peak depth of 87 for the estimation of the genome size 28 . Based on the total number of k-mers obtained, the Argane genome size was calculated to be approximately 573 Mb and 615 Mb, for 21-and 25-mers respectively, using the following formula: total number of k-mer / Peak depth. The double peak of k-mer distribution indicates heterozygosity whose rate is estimated to be 1.58 % and the duplicated fraction of the genome is estimated to be 3,19% ( Figure 2 Table 2). The scaffolding was done using initial contigs and implemented in MaSuRCA v3.2.4 assembler script using Celera Assembler v8.3. The GC content was estimated to be 33%. The assembly was screened by VecScreen to look for and remove remaining vector contamination. Based on the VecSreen report, contigs containing mitochondrial/chloroplast were also removed. Trimmed PE reads were mapped on the final assembly using CLC genomics (v11.0, CLCbio, Arhus, Denmark) with 0.8 in length and 0.9 in sequence similarity. In total, 94% of the reads were mapped against the Argane genome. The 6% reads that were unmapped may result from the stringency of mapping criteria used.
The difference between the genome size estimation and assembly size may be due to the use of parameters excluding extremely high frequency k-mers. They often represent organelle sequences, eventual contaminants inflating the genome size estimation 30 , or the high-frequency of repetitive regions found in plant genomes. Furthermore, the genome is highly heterozygous and different allelic regions would inflate assembly size. To assess the completeness of the final assembly, a Benchmarking Universal Single-Copy Orthologs (BUSCO) v3 software approach was used with "embryophyta_odb9" lineage-specific orthologous groups 31 . Thus from a total of 1,440 BUSCO genes, 1291 genes (89%) were complete (1179 in single copy and 112 duplicated), 62 genes (4.3%) were represented partially while 87 genes (6%) were missing" from the assembly.

Conclusions
This draft genome assembly is a first step towards a global and integrative omics strategy for exhaustive characterization of the Argane tree. In particular, future work will focus on structurally annotating the genome using predictive tools and transcriptome analysis. Other future work will focus on functional gene annotation, finding evidence for genome duplication and comparative genome evolution. A reliable annotation is highly dependent   on transcriptomic data, analysis and research. Sequencing of different parts and developmental stages of the Argane transcriptome is ongoing. The metabolome, and analysis of Argane oil biosynthesis, as well as the tree's microbiome should also be analyzed. To this end, and in order to coordinate the strong interests of the Plant Genomics community for this precious tree, the International Argane Genome Consortium (IAGC) and a resource website has been created (www.arganome.org).

Data availability
All In the article "First draft genome assembly of the Argane tree (Argania spinosa)", the authors present the draft assembly of the Argane tree using a combination of Illumina and PacBio reads. An argument for the importance of the Argane tree to the agriculture and economy of the endemic region is given. The assertion that "the genome may enable genome-assisted cultivar breeding, and provide a better understanding of important metabolic pathways and their underlying genes for both cosmetic and pharmacological purposes" seems reasonable based on the use of previous genome sequences for important agricultural resources. This article does not claim much beyond providing a reasonable draft genome for A. spinosa. The primary evidence for reasonable quality is three-fold: a k-mer analysis of genome size using Jellyfish, 94% of Illumina reads mapping back to the assembly, and a BUSCO analysis showing 89% full length and 4.3% partial length gene matches for these single copy genes. This is probably sufficient to claim reasonable quality and the usefulness of the genome for downstream research given the contiguity shown by the N50 contig size.
The authors do miss an opportunity to look more closely at the haplotype separation that may be occurring as they speculate possibly leading to a somewhat larger genome size than indicated by Jellyfish. I'm not sure why the number of BUSCO genes with more than one full or partial length match was not given -perhaps there were none? The implication that heterozygosity is uniform "The double peak of k-mer distribution indicates heterozygosity whose rate is estimated to be 1.58%" seems unfounded without more evidence nor is it clear that is the model resulting in this estimate.
Lastly the grammar would benefit from editing. For example, "A reliable annotation is highly dependent on transcriptomic research, and sequencing of Argane transcriptome analysis of different parts and developmental stages or the plant is ongoing." Might be better as "A reliable annotation is highly dependent on transcriptomic data, analysis and research. Sequencing of different parts and developmental stages of the Argane transcriptome is ongoing.".

If applicable, is the statistical analysis and its interpretation appropriate? Not applicable
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Whole genome shotgun assembly, pan-genome analysis

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Author Response 21 Apr 2020

Rachid Mentag
Answers to Reviewer #2 1-The authors do miss an opportunity to look more closely at the haplotype separation that may be occurring as they speculate possibly leading to a somewhat larger genome size than indicated by Jellyfish. I'm not sure why the number of BUSCO genes with more than one full or partial length match was not given -perhaps there were none? The implication that heterozygosity is uniform "The double peak of k-mer distribution indicates heterozygosity whose rate is estimated to be 1.58%" seems unfounded without more evidence nor is it clear that is the model resulting in this estimate. Answer. We agree with the reviewer. The manuscript has been corrected: "..which showed that the assembly contained 89% (1271 genes) of complete and 4.3% (62 genes) of partial sequences that were Arabidopsis orthologs." has been replaced by ". Thus, from a total of 1,440 BUSCO genes, 1291 genes (89%) were complete (1179 in single copy and 112 duplicated), 62 genes (4.3%) were represented partially while 87 genes (6%) were missing" from the assembly. ○ 2-Lastly the grammar would benefit from editing. For example, "A reliable annotation is highly dependent on transcriptomic research, and sequencing of Argane transcriptome analysis of different parts and developmental stages or the plant is ongoing." Might be better as "A reliable annotation is highly dependent on transcriptomic data, analysis and research. Sequencing of different parts and developmental stages of the Argane transcriptome is ongoing.". Answer. We agree with the reviewer. The manuscript has been corrected: " A reliable annotation is highly dependent on transcriptomic research, and sequencing of Argane transcriptome analysis of different parts and developmental stages or the plant is ongoing" has been replaced by " A reliable annotation is highly dependent on transcriptomic data, analysis and research. Sequencing of different parts and developmental stages of the Argane transcriptome is ongoing."

Amit Sinha
New England Biolabs Inc., Ipswich, MA, USA In the current manuscript, the authors present the first draft genome assembly for the argan tree Argania spinosa, a tree with significant ecological as well as commercial importance. The authors have used a hybrid assembly approach, combing data from both Illumina short-reads and PacBio long-reads technology. Use of such hybrid approaches is welcome, as they can combine the advantages offered by different technological platforms. Based on a k-mer analysis, the authors also provide a reasonable estimate of the genome-size and the level of heterozygosity. These are important metrics to report, as they can guide the methods to be used for genome assembly, as well as for interpretation of the assembled genome sequence. The draft genome has been satisfactorily assembled, using state of the art computational methods, and it contains 89% of the predicted core genes as defined in the BUSCO software.
It is hoped that in their future studies, the authors will follow up with more improvements to the draft genome, including gene annotations using transcriptomic data.
Overall, manuscript provides valuable information and sequence resources that will be useful to the scientific community. I have only 1 major and 6 minor comments/suggestions that I think the authors should be able to address.
1. Both Illumina as well as PacBio reads have been processed and trimmed before being used as an input to the MaSuRCA software. However, the instructions for the MaSuRCA software explicitly ask for NOT using any third-party tools, or performing any trimming, cleaning or error correction on Illumina reads, as it will most likely lead to a deteriorated assembly (See https://github.com/alekseyzimin/masurca#overview). It is therefore strongly suggested that the authors try an assembly with the raw Illumina reads directly, as it might lead to a better assembly. If such an analysis was already performed but the assembly was worse than the current version, please provide this information in the manuscript.
Minor comment/suggestion: 4. In addition to the observed heterozygosity, the repeat content of the genome could be an important reason for a fragmented assembly. The k-mer analysis can also be used to estimate the % of genome that is comprised of repeats. It will be great if this estimate is provided.
5. For the BUSCO analysis, the authors report using the Arabidopsis lineage-specific orthologous groups. However, from the BUSCO manual, the only valid value for the lineage parameter seems to be "embryophyta_odb9", which internally uses Arabidopsis hmms for Augustus-based gene predictions, but the member proteins are form multiple species. If this is the lineage parameter the authors used, please specifiy it as "embryophyta_odb9" rather than Arabidopsis, or provide more details on how the Arabidopsis BUSCO groups were determined.
6. The genome sequences of the organelles mitochondria and chloroplast are also very important in investigating the biology of any species, and also serve as important resource for evolutionary and phylogenetic studies. Therefore, it will be great if instead of completely removing the organellar genomes, the authors provide them as separate set of sequences, in separate files and ideally as separate GenBank entries.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Yes 2. The authors report that the genomic DNA was size-selected for fragments above 2kb. What was the approximate maximum size of input DNA for PacBio library preparation. Also, what was the maximum size of PacBio reads (e.g after the CCS step) ?

Response:
DNA extracted from the lyophilized leaves was divided into HMW and LMW fractions, and each was used separately to make PacBio libraries. The best PacBio SMRT cell yielded 4G of data on 163,000 spots, with an average read size of 24,468 bases.
3. The raw coverage obtained from Illumina data is reported as 160X. But as per my calculations, based on number of trimmed bases in table 1 and an estimated genome size of 573 Mb, I obtain 135,539,587,270 / 573,000,000 = 236X. Please correct or explain further.

Response:
We agree with the reviewer. The manuscript have been corrected:"These data was trimmed of adapters and low-quality sequences, yielding a clean set of 936,053,040 reads, representing 160× genome coverage," has now been replaced by "These data was trimmed of adapters and low-quality sequences, yielding a clean set of 936,053,040 reads, representing 236x genome coverage," 4. In addition to the observed heterozygosity, the repeat content of the genome could be an important reason for a fragmented assembly. The k-mer analysis can also be used to estimate the % of genome that is comprised of repeats. It will be great if this estimate is provided.

Response:
Based on the 21-mer distribution analysis, the duplicated fraction of the genome is estimated to be 3,19%. The revised version of the manuscript has been corrected: "The double peak of k-mer distribution indicates heterozygosity whose rate is estimated to be 1.58 % (Figure 2)." Has been replaced by "The double peak of k-mer distribution indicates heterozygosity whose rate is estimated to be 1.58 % and the duplicated fraction of the genome is estimated to be 3,19% (Figure 2)."

5.
For the BUSCO analysis, the authors report using the Arabidopsis lineage-specific orthologous groups. However, from the BUSCO manual, the only valid value for the lineage parameter seems to be "embryophyta_odb9", which internally uses Arabidopsis hmms for Augustus-based gene predictions, but the member proteins are form multiple species. If this is the lineage parameter the authors used, please specifiy it as "embryophyta_odb9" rather than Arabidopsis, or provide more details on how the Arabidopsis BUSCO groups were determined.

Response:
We agree. The lineage parameter used is "embryophyta_odb9". (The manuscript has been corrected): "To assess the completeness of the final assembly, a Benchmarking Universal Single-Copy Orthologs (BUSCO) v3 software approach was used with Arabidopsis lineage-specific orthologous groups31," has been replaced by "To assess the completeness of the final assembly, a Benchmarking Universal Single-Copy Orthologs (BUSCO) v3 software approach was used with "embryophyta_odb9" lineage-specific orthologous groups31," 6. The genome sequences of the organelles mitochondria and chloroplast are also very important in investigating the biology of any species, and also serve as important resource for evolutionary and phylogenetic studies. Therefore, it will be great if instead of completely removing the organellar genomes, the authors provide them as separate set of sequences, in separate files and ideally as separate GenBank entries.

Response:
We agree with the reviewer. The work is in progress and we plant to publish the organellar genomes soon in a separate manuscript.