Keywords
high-throughput sequencing data, population genetics, genotype likelihoods, Julia language, pooled sequencing, polyploidy, aneuploidy
This article is included in the RPackage gateway.
This article is included in the Cell & Molecular Biology gateway.
high-throughput sequencing data, population genetics, genotype likelihoods, Julia language, pooled sequencing, polyploidy, aneuploidy
In the revised version, we provide users with a comprehensive online documentation of APIs and CLIs at https://ngsjulia.readthedocs.io. Additionally, all functions have docstring documentation. The methodology behind some of the implementations has been clarified and references on the applicability of said methods have been added. We also rephrased few sentences for clarity as suggested by the reviewer, and fixed few typos. We believe that these changes significantly enhance the user’s experience and simplify the interpretation of results obtained using ngsJulia.
See the authors' detailed response to the review by Lindsay Clark
See the authors' detailed response to the review by Stéphane De Mita
Population genetics, i.e., the study of genetic variation within and between groups, plays a central role in evolutionary inferences. The quantification of genetic diversity serves the basis for the inference of neutral1 and adaptive2 events that characterised the history of different populations. Additionally, the comparison of allele frequencies between groups (i.e. cases and controls) is an important aspect in biomedical and clinical sciences.3
In the last 20 years, next-generation sequencing (NGS) technologies allowed researchers to generate unprecedented amount of genomic data for a wide range of organisms.4 This revolution transformed population genetics (therefore also labelled as population genomics) to a data-driven discipline. Data produced by short-read sequencing machines (still the most accessible platform worldwide) consist of a collection of relatively short (approx. 100 base pairs) fragments of DNA which are then mapped or de novo assembled to form a contiguous sequence.4 At each genomic position, all observed sequenced reads are used to infer the per-sample genotype (an operation called ‘genotype calling’) and the inter-samples variability, i.e., whether a particular site is polymorphic (an operation called ‘single-nucleotide polymorphism (SNP) calling’).5
To this end, several software packages have been implemented to perform genotype and SNP calling from NGS data, the most popular ones being samtools/bcftools,6 GATK,7 and freeBayes.8 When, on average, few reads map at each genomic position (a scenario referred to as ‘low-coverage’ or ‘low-depth’), genotypes and SNPs cannot be assigned with confidence due to the high data uncertainty.9,10 Under these circumstances, statistical methods that integrate data uncertainty into genotype likelihoods and propagate it to downstream analyses have been proposed.5 Software packages like ANGSD11 and ngsTools,12 among others reviewed by Lou et al.,13 implement a statistical framework to estimate population genetic metrics from low-coverage sequencing data. Similarly, an affordable generation of sequencing data from large sample sizes can be obtained via pooled sequencing experiments, where assignment of individual samples is typically not retained.14 Several new and popular software for the analysis of pooled sequencing have been proposed in recent years.15,16
Despite these advances, most of these implementations are either tuned and suitable for model organisms only (e.g., with haploid or diploid genomes) or not easily adaptable to novel applications. Therefore, an accessible computational framework for building and testing ad hoc population genetic analyses from NGS data is in dire need. Among programming languages, Julia17 has emerged as both a powerful and easy-to-use dynamically typed language that is widely used in many fields of data sciences, including genomics.18 While several Julia packages are currently available for both population genetic and bioinformatic analyses (e.g., BioJulia), to our knowledge, a suitable framework for custom population genetic analysis from NGS data is not available yet.
Here we present ngsJulia, a set of templates and functions in Julia language to process NGS data and create custom analyses in population genetics. To illustrate its applicability, we further introduce two implementations, ngsPool and ngsPloidy, for the analysis of pooled sequencing data and polyploid genomes, respectively. By extensive simulations, we show the performance of several methods implemented in these programs under various experimental conditions. We also introduce novel statistical methods to estimate population genetic parameters from NGS data and demonstrate their applicability and suggest optimal experimental design. We finally discuss further directions and purposes for ngsJulia and bioinformatics for NGS data analysis.
ngsJulia was built in Julia language (Julia Programming Language, RRID:SCR_021666) and requires the packages ‘GZip’, ‘Combinatorics’, and ‘ArgParse’. Auxiliary scripts to process output files were built in R (R Project for Statistical Computing, RRID:SCR_001905) version 3.6.3 and require the package ‘getopt’. ngsJulia receives gzipped mpileup input files which can be generated using samtools.19 Output files are in text file format and can be easily parsed for producing summary plots and for further analyses. Scripts for some downstream analyses are provided in ngsJulia.
ngsJulia is compatible with all major operating systems and is maintained at https://github.com/mfumagalli/ngsJulia. Documentation and tutorials are available via this GitHub repository and archived at Zenodo20 at the time of writing. All analyses in the manuscript are performed in Julia language and R.
ngsJulia implements functions to read and parse gzipped mpileup files and to output gzipped text files on various calculations (e.g., genotype and allele frequency likelihoods) and estimations (e.g., allele frequencies), as requested by the user. ngsJulia also allows for several data filtering options, including on global and per-sample depth, proportion or count of minor allele, and base quality. Finally, several options for SNP and biallelic and triallelic polymorphisms calling are available.
ngsJulia provides utilities to calculate nucleotide and genotype likelihoods, i.e., the probability of observed sequencing data given a specific nucleotide or genotype,21 for an arbitrary ploidy level, as in Soraggi et al.22 We now describe how such quantities are calculated in ngsJulia.
Following the notation in Soraggi et al.,22 for one sample and one site, we let be the observed NGS data, the ploidy, and the genotype. Therefore, has values in , i.e., the number of derived (or alternate) alleles.
In the simplest form, genotype likelihoods can be calculated by considering individual base qualities as probabilities of observing an incorrect nucleotide.21 We adopt the calculation of genotype likelihoods for an arbitrary ploidy level as proposed in Soraggi et al.22 From the genotype likelihoods with , the two most likely alleles are identified by sorting values after pooling all sequencing reads together across all samples. This operation will restrict the range of possible genotypes to biallelic variation only. Note that this calculation is still valid for monomorphic sites, although the actual assignment of the minor allele is meaningless.
We now describe how to estimate , the frequency of allele at site . Similarly to Kim et al.,23 the log-likelihood function for is given by:
To perform SNP calling, we implement a likelihood-ratio test (LRT) with one degree of freedom with null hypothesis and alternate hypothesis , as described by Kim et al.23 Additionally, we develop a test for a site being biallelic or triallelic. The former can be interpreted as a further evidence of polymorphism, while the latter as a condition not to be met for the site being included in further estimations, as our models assume at most two alleles. The log-likelihood of site being biallelic is equal to while the log-likelihood being triallelic is equal to , with , , and being the most, second most, and third most likely allele with (i.e. haploid genotype ), respectively. An LRT with one degree of freedom can be conducted to assess whether is significantly greater than , or the latter is significantly greater than .
Several estimators of population parameters from pooled sequencing data are implemented in ngsPool, a separate program which uses functions in ngsJulia. We now describe the statistical framework for the analysis of pooled sequencing data.
In case of data with unknown sample size, the MLE of the population allele frequency is calculated as in Equation 1 with . With known sample size, the same equation is used to calculate sample allele frequency likelihoods ,24 for instance with for samples of equal ploidy . From these likelihoods, we can calculate both the MLE and the expected value with uniform prior probability as estimators of .
A simple estimator of the site frequency spectrum (SFS) from pooled sequencing data is obtained by counting point-estimates of across all sites. We propose a novel estimator of the SFS implemented in ngsPool. Under the standard coalescent model with infinite sites mutations, we let the probability of derived allele frequency in a sample of genomes to be proportional to with .25 The parameter determines whether the population is deviating from a model of constant effective population size. For instance, is equal to the expected distribution of under constant population size, while models a population shrinking and population growth.
We optimise the value of to minimise the Kullback-Liebrel divergence between the expected distribution of and the observed SFS. The latter can be obtained by either counting across all sites or by integrating over the sample allele frequency probabilities (i.e. with a uniform prior distribution). A threshold can be set to ignore allele frequencies with low probability to improve computing efficiency and reduce noise. Within this framework, folding spectra can be generated in case of unknown allelic polarisation.
Finally, we introduce a strategy to perform association tests from pooled sequencing data. Similarly to Kim et al.,23 we propose an LRT with one degree of freedom for null hypothesis and alternate hypothesis . The likelihood of each hypothesis is calculated from and, therefore, this strategy avoids the assignment of counts or per-site allele frequencies. A statistically significant LRT with one degree of freedom suggests a difference in allele frequencies between cases and controls, and possible association between the tested phenotype and alleles.
We now describe the statistical framework implemented in the program ngsPloidy to estimate ploidy levels and test for multiploidy. When multiple samples are available, two scenarios can be envisaged: (i) all samples have the same ploidy, (ii) each sample can have a different ploidy (multiploidy, i.e. as in tumor genomes).26
The log-likelihood function for a vector of ploidy levels for samples and sites is defined as:
With being the MLE of allele frequency at site . is the genotype likelihood while is the genotype probability given the ploidy and allele frequency at site .
Once is estimated and the inbreeding is known (or under the assumption of Hardy-Weinberg Equilibrium (HWE)), then genotype probabilities are fully defined. Equation 2 is optimised by maximising the marginal likelihood of each sample separately, assuming independence among samples and sites.
With limited sample size, is not a good estimator of the population allele frequency and therefore genotype probabilities may not be well defined. In the simplest scenario, genotype probabilities can be set as uniformly distributed, with all genotypes being equally probable. However, the assignment of alleles into ancestral (e.g., wild-type) and derived (e.g., mutant) states is particularly useful to inform on genotype probabilities. Recalling Equation 2, we can substitute with , where is the expected allele frequency of , as introduced previously. Note that does not depend on the sequencing data for each specific site .
Note that, in practice, Equation 2 is a composite likelihood function, as samples and sites are not independent observations due to shared population history and linkage disequilibrium, respectively. A solution to circumvent this issue is to perform a bootstrapping procedure, by sampling with replacement segments of the chromosome and estimate ploidy for each bootstrapped chromosome. The distribution of inferred ploidy levels from bootstrapped chromosomes provides a quantitative measurement of confidence in determining the chromosomal ploidy. It is not possible to calculate the likelihood of ploidy equal to one after SNP calling. Nevertheless, the identification of haploid genomes from sequencing data is typically trivial.
So far, we assumed to know which allele can be assigned to an ancestral state, and which one to a derived state. However, in some cases, such assignment is either not possible or associated with a certain level of uncertainty due to, for instance, ancestral polymorphisms or outgroup sequence genome from a closely related species not being available. Under these circumstances, we extend our formulation by adding a parameter underlying the probability that the assigned ancestral state is incorrectly identified.
Let us define as the ancestral state and as any possible allele in . In practice, can take only two possible values as we select only the two most likely alleles. We label this set of the two most common alleles as and we assume that the true ancestral state is included in such set. The log-likelihood function of ploidy for a single sample under unknown ancestral state is:
Where indicates the probability that allele is the ancestral state, and it is invariant across sites. If , then the equation refers to the scenario of folded allele frequencies, where each allele is equally probable to be the ancestral state.
Finally, note that in a sufficiently large sample size, the major allele is more probable to represent the ancestral state. This probability depends on the shape of the site frequency spectrum, and it is equal to the cumulative distribution of evaluated at . We can extend Equation 3 to reflect this parameter uncertainty with being the major allele in .
We introduce a novel test for multiploidy. If all samples have the same ploidy , then is true for all . We propose an LRT for multiploidy with null hypothesis and alternate hypothesis A large value of LRT is suggestive of multiploidy. Statistical significance can be assessed with the LRT and degrees of freedom.
To benchmark the performance of methods implemented in ngsJulia, we simulated NGS data following a strategy previously proposed by Fumagalli et al.27 available as a stand-alone R script. Briefly, individual genotypes are drawn according to probabilities depending on input parameters. The number of mapped reads at each position is modelled with a Poisson distribution and sequenced bases are sampled with replacement with a probability given by the quality score. As an illustration, the following code
Rscript simulMpileup. R --out test.txt --copy 2x10 --sites 1000 \\--depth 20 --qual 20 ---pool|gzip > test.mpileup.gz
will simulate 10 diploid genomes (--copy 2x10), base pairs each (--sites 1000) with an average sequencing depth of and base quality of in Phred score (--depth 20 --qual 20) from pooled sequencing (--pool) with results stored in test.mpileup.gz file.
For the analysis of pooled sequencing data, we simulated independent sites at sample sizes , , and from a diploid population with a constant effective population size of . We imposed the average per-sample sequencing depth to be , , , or with an average base quality of in Phred score. To assess the performance of ngsPool, we calculated bias and root mean squared error (RMSE) between the estimated value of the true value, either from the sample or the whole population. While both metrics measure the distance with the true value, the bias retains the direction of the error (i.e. over- or under-estimation). To quantify the accuracy of SNP calling, we calculated F1 scores (the harmonic mean of precision and recall rates).
To simulate data for association test, we assumed an equal number of cases and controls (equal to ) and 200 SNPs, either causal or non causal. For non causal sites, cases and controls have the same population allele frequency of . For causal sites, cases and controls have a population allele frequency of and , respectively. These conditions are derived assuming a high risk allele frequency of , prevalence of , genotypic relative risk for the heterozygote of , genotypic relative risk for the homozygous state of . The sample size simulated guarantees at least power with a false positive rate of .
To illustrate the usage of ngsPloidy, we simulated NGS data of genomes with different ploidy levels (one haploid, eight triploids, one tetraploid) at sites. NGS data was simulated assuming an average depth of at the haploid level. Code and simulated data sets analysed are available in ngsJulia GitHub repository.
To showcase the applicability of ngsJulia to real data, we deployed to two distinct empirical data sests. To test ngsPool, we used mapped reads of Arabidopsis lyrata produced in a previous study.28 Specifically, We used data belonging to the population B where 25 individuals of A. lyrata have been sequenced by both pool-sequencing and genotyping-by-sequencing. Since GBS harbours data on a fraction of the genome and Pool-seq is whole genome, We restricted the analysis to the base pairs (bp) sequenced by both techniques, for a total of 15202 bp. The mpileup files have been generated with samtools6 (v1.14) and SNP calling in GBS data has been performed with Varscan29 (v2.4.2), resulting in 333 SNPs.
To test ngsPloidy, we reanalysed whole-genome sequencing data of the human fungal pathogen Candida auris from a previous study.30 We processed data for 21 samples, each one consisting of 17 contigs. As the null hypotheses is that these samples are haploid, we tested ploidy levels from 1 to 4. We filtered out sites with a minimum depth lower than 5 and a proportion of minor alleles lower than 0.15. The number of sites analysed per contig ranged from 555 to 754.
ngsJulia implements data structures and functions for an easy calculation of nucleotide and genotype likelihoods (of arbitrary ploidy) which serve the basis of genotype and SNP calling and for the estimation of allele frequencies and other summary statistics. It is particularly suitable for low-coverage sequencing data and for cases when there is high data uncertainty. To demonstrate the use of ngsJulia, we provide two custom applications from its templates and functions.
We used ngsJulia to implement a separate program, called ngsPool, to perform population genetic analysis from pooled sequencing data. Specifically, ngsPool implements established and novel statistical methods to estimate allele frequencies and site frequency spectra (SFS) and perform association tests from pooled sequencing data.
ngsPool uses functions in ngsJulia to parse mpileup files as input. As an illustration, the following code
julia ngsPool.jl --fin test.mpileup.gz --fout test.out.gz --lrtSnp 6.64
will parse test.mpileup.gz file and write estimates of allele frequencies in test.out.gz file from unknwon sample size after performing SNP calling with an LRT value of 6.64 (--lrtSnp 6.64, equivalent to a p-value of ).
Depending on the options selected by the user, output files contain, various results are printed on the screen, including
• inferred major allele,
• inferred minor allele,
• LRT statistic for SNP calling,
• LRT for bi- and tri-allelic calling,
• three estimators of the minor allele frequency at each site.
Additionally, ngsPool can output a file with per-site sample allele frequency likelihoods. The following code
julia ngsPool.jl --fin test.mpileup.gz --fout test.out.gz \\--nChroms 20 --fsaf test.saf.gz
will produce estimates of allele frequencies from the known sample size (specified by --nChroms 20) and allele frequency likelihoods in test.saf.gz file.
These files can then be used to estimate the SFS and perform an association test using two scripts provided in ngsPool. For instance, the code
Rscript poolSFS. R test.saf.gz > sfs.txt
will estimate the SFS, while the code
Rscript poolAssoc. R test.cases.saf.gz test.controls.saf.gz > assoc.txt
will perform an association test assuming two sets of allele frequency likelihood files, one from cases (test.cases.saf.gz) and one from controls (test.controls.saf.gz).
To illustrate the usage of ngsPool, we estimated allele frequencies based on simulated data at different experimental conditions. We sought to compare the performance among different estimators implemented in the program. If the sample size is not provided, ngsPool provides a simple MLE of the population allele frequency assuming haploidy. Alternatively, if the sample size is provided by the user, ngsPool calculates sample allele frequency likelihoods and returns both the MLE and the expected value of the allele frequency using a uniform prior probability.
Results show that the error of estimating allele frequencies decreases with increasing depth and sample size, as expected (Figure 1). Likewise, the error for estimating the population allele frequency is more pronounced for lower sample sizes. MLE values tend to be less biased than expected values and sample estimates appear to be unbiased even at depth 1 for moderate sample size (Figure 1).
Bias and root mean square error (RMSE) against the reference true value from either the population or sample are provided for each value of depth (the average number of sequenced reads per base pair) and sample size (on column panels) tested. Three estimators of allele frequencies are considered: a population maximum likelihood estimate (MLE) from unknown sample size and the expected value and MLE from known sample size.
Furthermore, we simulated NGS data at fixed population allele frequency and compared the distribution of true sample allele frequencies and estimated values using a MLE approach from known sample size. Figure 2 shows that most of the deviation from the true population allele frequencies occurs at intermediate frequencies ( equal to 0.5). This effect is more evident for low depth and low sample size (Figure 2).
True sample allele frequencies (SAF) and maximum likelihood estimates (MLE) are shown at different fixed population allele frequencies F and sample sizes (on rows, 20 and 50) and depths (the average number of sequenced reads per base pair, on columns, 0.5, 1, 2, and 5).
We then assessed the effect of low-frequency variants on SNP calling. Specifically, we calculated F1 scores for SNP calling when the population allele frequency is 0 (not a SNP) or greater than 0 (a SNP). Figure 3 shows how the prediction performance increases with the population allele frequency (), sample size, and depth (). For instance, an F1 score greater than 0.75 is obtained with 20 samples only for and . On the other hand, the same F1 score is achieved with 50 samples even with and also at but only if .
F1 scores (harmonic means of precision and recall rates) for predicting either a SNP (true population allele frequency greater than 0) or not () are reported are various values of , sample sizes (on columns, 20 and 50) and depths (the average number of sequenced reads per base pair).
SNP calling under-performs when there is no variation in the population (), with sequencing errors and sampling statistical uncertainty generating estimates of greater than 0.
ngsPool implements several methods to estimate the SFS from sample allele frequencies. As described in the methods, a simpler estimator is based on assigning the most likely sample allele frequency at each site (labelled count). ngsPool implements novel estimators of SFS from pooled sequencing data as described in the method section. A script implements an algorithm to fit the theoretical SFS to the observed SFS. The latter can be calculated either by assigning per-site MLE of allele frequencies (labelled Fit_count) or by integrating the uncertainties across all sample allele frequency likelihoods (labelled Fit_afl).
Figure 4 shows the error in estimating either the population or sample SFS at various settings with different methods. The error decreases with increasing depth and sample size. Estimating SFS by fitting the theoretical SFS without assignment of allele frequencies generally outperforms other tested strategies (Figure 4).
Estimates based on counting allele frequencies (Count) or fitting from counted allele frequencies (Fit Count) or from allele frequency likelihoods (Fit afl) are calculated. Root mean square error (RMSE) values between true (either population or sample, on rows) and estimated SFS are reported at various depths D and sample sizes (on columns).
Notably, the novel approach implemented in ngsPool to estimate the parameter of the SFS distribution (see Methods) allow us to directly quantify the error in inferring demographic events. In fact, all simulations assumed constant population size, equivalent to . Figure 5 shows the estimated values of by fitting the SFS either by assigning (counting) allele frequencies (Fit_count) or by using allele frequency likelihoods (Fit_afl). For low-to-moderate depth and sample size, estimates of tend to suggest population expansion (), possibly due to an over-estimation of the abundance of low-frequency alleles. However, the error is reduced when integrating the data uncertainty with sample allele frequency likelihoods, as estimates of values tend to be closer to the true simulated value of (Figure 5).
Estimates based on fitting from counted allele frequencies (Fit count) and from allele frequency likelihoods (Fit afl) are reported. Note that the true simulated value of K is 1.
ngsPool implements a script to perform association tests from pooled sequencing data. Specifically, the script calculates an LRT statistic, with null hypothesis being that allele frequencies of cases and controls (or any two groups) are the same, as used by Kim et al.23 It uses sample allele frequency likelihoods and, therefore, it maintains data uncertainty and avoids the assignment of counts or per-site allele frequencies. An LRT statistics significantly greater than 0 indicates a difference in allele frequencies between cases and controls.
Figure 6 compares the distribution of LRT statistics between causal and non causal sites at different experimental scenarios. The distribution of LRT at causal SNPs is skewed towards higher values for increasing depth, indicating more support to find phenptype-SNP association. Nevertheless, a clear separation between the distributions of causal and non causal SNPs is observed at low depth (Figure 6).
The distribution of likelihood-ratio test (LRT) statistics for casual and non causal single-nucleotide polymorphism (SNP) is reported at different depths (the average number of sequenced reads per base pair, on columns).
To illustrate the application of ngsPool to real data, we reanalysed genomic data from Arabidopsis lyrata.28 In this study, authors generated data both by genotype-by-sequencing and by pooled-sequencing, with the former providing ground-truth values for genotype and allele calls. Estimates of minor allele frequencies using ngsPool yield a lower RMSE (0.09372) than estimates from Varscan (0.09783) across all SNPs analysed.
We further utilised ngsJulia to implement an additional program, ngsPloidy, for the estimation of ploidy from unknown genotypes. The method implemented is similar to the one proposed by Soraggi et al.22 with some notable differences on the calculation of genotype probabilities (see Methods). Additionally, ngsPloidy includes a novel method to test for multiploidy in the sample. As an illustration, the following code
julia ngsPloidy.jl --fin test.mpileup.gz --nSamples 20 > test.outRscript ploidyLRT. R test.out
will estimate ploidy levels for genomes (--nSamples 20) from test.mpileup.gz file and return LRT values.
Following Equation 2, the genotype probabilities for each tested ploidy are pre-calculated using a script provided in ngsPloidy. This script takes as input the value of parameter (the shape of the expected SFS), the effective population size, and the probability that the major allele is the ancestral allele. The latter can be either set by the user (e.g., a value of 0.5 would be equivalent to unknown polarisation, as in a folded SFS) or be calculated from the expected population SFS itself. If genotype probabilities are not set, a uniform distribution is assigned. Further options allow for estimation only on called SNPs and/or genotypes.
ngsPloidy uses functions in ngsJulia to parse mpileup files as input. At the end of the computation, various results are printed on the screen, including
• the number of analysed sites that passed filtering for each sample,
• a matrix of ploidy log-likelihoods for each sample,
• the log-likelihood and MLE of the ploidy vector (i.e., the individually estimated ploidy for each sample),
• LRT scores for the test of multiploidy against all tested ploidy levels.
Additionally, if requested by the user, ngsPloidy can generate output files with several statistics for each site (e.g., estimate allele frequency), and all per-site genotype likelihoods for each sample and the tested ploidy.
To illustrate the usage of ngsPloidy, we deployed it to simulated data of an multiploid sample consisting of one diploid, eight triploid and one tetraploid genomes. We compared the performance of ploidy and multiploidy inference among different choices of genotype probabilities. The latter were derived either from the expected folded population site frequency, from the estimated ancestral population allele frequency, or from the calculated sample allele frequency.
In all tested cases, we inferred the correct vector of marginal ploidy levels. We therefore assessed the confidence in such inference by calculating the LRT statistics of ploidy and multiploidy inferences. Both were calculated by comparing the most against the second most likely vector of ploidy levels or the most likely vector of equal ploidy levels, respectively.
Results show that using the per-site estimate sampled allele frequency yields higher LRT statistics (and therefore confidence) than using expected population allele frequencies (Table 1). We reiterate that for all three cases we correctly identified the patterns of multiploidy. However, we should caution that with lower sample sizes we do not expect inferences using estimated sample allele frequencies to perform well.
Likelihood-ratio test (LRT) statistics using different methods of incorporating allele frequencies are reported.
Allele frequency | LRT – ploidy | LRT - multiploidy |
---|---|---|
Folded | 34.19 | 252.72 |
Ancestral | 31.49 | 277.91 |
Sample | 37.83 | 373.07 |
To illustrate the applicability of ngsPloidy to real empirical data, we inferred ploidy levels on 21 isolates from an outbreak of Candida auris.30 In all contigs analysed, haploid was inferred as the most likely ploidy and multiploidy was rejected in all samples. These results are consistent with findings from the original study, although the identification of polyploidy in other isolates may be associated with multidrug-resistant phenotypes.31
Analyses presented here provide further support for the use of genotype and allele frequency likelihoods in the analysis of NGS data.5 Notably, we demonstrated how probabilistic estimates of population genetic parameters can be obtained in case of pooled sequencing data and short-read data from polyploid genomes. Additionally, we motivated the inference of SFS from allele frequency likelihoods as a direct way to infer demography from raw sequencing data.
ngsJulia offers new possibilities of software prototyping for custom analyses of NGS data for population genetic applications. Furthermore, it facilitates experimental design as it provides a platform to benchmark the efficacy of population genetic analysis from competing sequencing experiments. Finally, ngsJulia has accessible documentation and tutorials to inform users on the theory underpinning the implemented methods. We envisage that further improvements in ngsJulia will include the expansion of suitable input formats and data file types, and the compatibility with additional NGS data type, including from long-read sequencing experiments.32
In this study, we introduce ngsJulia, a series of templates and functions in Julia language to analyse NGS data for population genetic purposes. We present two implementations for the analysis of pooled sequencing data and polyploid genomes, with the inclusion of novel methods. ngsJulia is a suitable framework for prototyping new software and for custom population genetic analyses from NGS data.
Simulated data and pipeline to reproduce all results presented here are available at https://doi.org/10.5281/zenodo.5886879.20
Data are available under the terms of the Creative Commons Attribution 4.0 International Public License.
• Source code available from: https://github.com/mfumagalli/ngsJulia
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.588687920
• License: Creative Commons Attribution 4.0 International Public License
We are thankful to Johanna Rhodes for assistance gathering sequencing data for Candida auris.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Partly
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Population and quantitative genetics of crops, Plant breeding, cereals
Is the rationale for developing the new software tool clearly explained?
Partly
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: bioinformatics, genomics, medical genetics, ancient DNA
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Population genomics, phytopathology, programming (Python, C, C++).
Is the rationale for developing the new software tool clearly explained?
Yes
Is the description of the software tool technically sound?
Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Yes
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Population genomics, phytopathology, programming (Python, C, C++).
Is the rationale for developing the new software tool clearly explained?
Partly
Is the description of the software tool technically sound?
Partly
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?
Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?
Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?
Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Bioinformatics, polyploidy, population genetics
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | ||||
---|---|---|---|---|
1 | 2 | 3 | 4 | |
Version 3 (revision) 14 Jul 23 |
read | read | read | |
Version 2 (revision) 29 Nov 22 |
read | |||
Version 1 31 Jan 22 |
read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)