ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Software Tool Article

cophesim: A comprehensive phenotype simulator for testing novel association methods

[version 1; peer review: 2 approved]
PUBLISHED 01 Aug 2017
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

Abstract

Simulation is important in evaluating novel methods when input data is not easily obtainable or specific assumptions are needed. We present cophesim, a software to add the phenotype to generated genotype data prepared with a genetic simulator. The output of cophesim can be used as a direct input for different genome wide association study tools. cophesim is available from https://bitbucket.org/izhbannikov/cophesim.

Keywords

Phenotype simulation, GWAS

Introduction

Genome-wide association studies (GWAS) are routine in population research. New methods are being developed for better accessing complex associations between genotypes and phenotypes, uncovering genotype structures or testing evolutionary hypotheses. Testing the novel methods requires experimental data, which may not be easily obtainable. One solution is to use artificial data simulated with specific assumptions.

The best existing phenotype simulators, such as: GENOME1, Plink2, phenosim3, CoaSim4, Fregene5, ForSim6, QuantiNemo7, GCTA8, HapGen9, SeqSimla10, and SimRare11 offer qualitative and dichotomous simulated phenotype. But the known phenotype simulation software tools have some limitations, which may prevent customers from using them: (i) the majority, if not all, of the phenotype simulation software tools do not offer simulation of survival traits/time-to-event outcome, making it impossible to test respective hypotheses of associations; (ii) some of the tools are not easy to use, due to wide range of parameters, which the user has to provide and control (rather than calculate them automatically), making them unnecessarily difficult to use and preventing the user from future use of the tool; (iii) phenotype simulation is often offered as an auxiliary part of the genetic simulation routine, and therefore the user first has to perform a time-consuming unavoidable genetic simulation in order to obtain the phenotype; (iv) in situations when the genetic data is already simulated from other tools, only phenosim and GCTA offer adding simulated phenotype to such data. Consequently, it is necessary to have a new, simple and flexible phenotype simulation tool with plain algorithmic assumptions.

Consequently, we present cophesim, a comprehensive phenotype simulation tool that was developed to add a phenotype to corresponding genotypes simulated by other simulation tool (Table S1). cophesim offers simulation of continuous, dichotomous and survival traits, with different (user-provided) effect sizes of causal variants, with the ability to simulate epistatic interactions. It also can simulate phenotype within gene-environment interaction assumptions using up to 10 covariates.

Methods

Implementation

The workflow (see Figure 1) includes the following stages: (i) Input data pre-processing; (ii) phenotype simulation; (iii) generation of final output files.

139a6dfe-3ead-4a3c-b6cd-eae867165496_figure1.gif

Figure 1.

Workflow of cophesim has three stages: (1) Input stage, where the input data (can be provided in one of the three formats: Plink, MS and GENOME, see the user manual - Supplementary File 1) along with the other input parameters (such as causal variants with size effects, output format, etc.) is prepared for phenotype simulation; (2) Phenotype simulation stage, where different types of phenotypic traits are simulated: dichotomous, continuous and time-to-event (‘survival’); (3) Output stage – the final stage, where simulated phenotype data are packed to various formats in order to be directly usable by six GWAS tools: EMMAX, BLOSSOC, Plink, QTDT, TASSEL and GenABEL. Summary statistics are generated at the output stage as well.

Input data

Currently cophesim accepts the genotype output data from Plink, MS12 and GENOME software applications. Phenotypes (dichotomous, continuous and survival) are then added according to the following simulation scenarios.

Dichotomous phenotype

Dichotomous phenotype for ith individual (i = 1...N, where N is the total number of individuals in a dataset) is simulated according to the logistic model (if the user provided effect sizes for causal variants):

pi=11+ezi(1)
where pi is the probability of a particular outcome. In cophesim, it is a probability of a “case” (cases are marked by “1”, and “0” are controls in simulated dichotomous phenotype) for ith individual. If pi is greater than the some threshold p0 (we use p0 ~ U(0, 1)), then the phenotype for ith individual is set to “1” and to “0” otherwise. The variable z is determined with the following equation:
zi=j=1MEjgij+j=1KαjXij+ϵi(2)
Ej – effect size for jth variant, user-defined; gij – value of jth genetic marker for ith individual; αj - effect size for jth covariate and Xij is a value of jth covariate for a ith individual (the term j=1KαjXij is added to represent gene-environment iterations); ϵi – a standard normal residual, ϵi N (0, 1), computed for ith individual, M is a total number of genetic variants and K is a total number of covariates used.

If the user did not provide the effect sizes for causal variants, the following strategy is then used:

zi=j=1Mwij+j=1KαjXij+ ϵi(3)

Here wij is a weight and computed as follows: wij=gij2MAFj(2MAFj(1MAFj))1/2 (a standardization procedure, and the matrix W containing element wij is called a standardized genotype matrix8; MAFj – a minor allele frequency for jth genetic variant, and the other values are the same as described above. This strategy allows using defined genetic architecture in a simulated population.

Continuous phenotype

Qualitative (continuous) phenotype for ith individual is simulated according to the linear regression scenario according to the equations (2) or (3) (in case if effect sizes were not supplied).

Inverse Probability method

We model a survival phenotype from the proportional hazards model using the inverse probability method13: if U is uniform in (0, 1) and if S(·|z) is the conditional survival function derived from the proportional hazards model: S(t|z) = eH0(t)ez, then the random variable

T=S1(|z)=H0(t)1(log(U)ez)(4)
has survival function S(·|z). In this equation, H0(t) is a cumulative baseline hazard. By default, we use the Weibull cumulative baseline hazard: H0(t) = λρtρ–1; z is the same parameter that defined above, for each individual, and depends on whether the user provided effect sizes for causal variants or not. We also implemented exponential and Gompertz hazards.

Linkage Disequilibrium

The simplest way to simulate collinearity between two SNPs, g1 and g2, with effect sizes E1 and E2 is to replace some portion of g2 with g1 values according to provided r122 coefficient, which reflects a correlation between two SNPs. We also consider applying other techniques, such as copulas, in order to simulate LD.

Epistatic interactions

These are modeled with the following equation for ith individual:

zi=E1g1i+E2g2i+E12g1ig2i+j=1kαiXji+ϵi(5)
where the term E12 g1i g2i is the interaction term in which E12 is the epistatic effect size (user-defined, zero by default); αj is the effect size for jth covariate X.

Output files

Output files are in the formats as the direct inputs for the following tools: EMMAX14, Blossoc4, Plink (.ped file), QTDT15, TASSEL16, GenABEL17 (see Table 1).

Table 1. Output file formats supported by phenotype simulator cophesim.

Applying one of the options shown below controls the output format. Each output format has a special suffix type, which defines the file format. These output formats are concordant to those used in phenosim.

ApplicationOptionCommentary
EMMAX-emmaxSuffices .emma_geno, .emma_pheno
BlOSSOC-blossocSuffices .blossoc_pos, .blossoc_geno
PLINK-plinkUsed by default across all phenotypes,
except survival. Suffices .ped, .map,
.pheno.
QTDT-qtdtSuffices .ped, .map, .dat
TASSEL-tasselSuffices .poly, .trait
GenABEL-This format is used in simulation of
survival phenotype.

Operation

cophesim is freely available for download from the following link: https://bitbucket.org/izhbannikov/cophesim. Requirements: Python v2.7.10 and newer, plinkio v0.9.6, R v3.2.4 and newer, Plink v1.07, - in order to run the examples. The user manual is provided in a separate file “cophesim.pdf” located in the program directory and is also available as Supplementary File 1.

Use case

Below we present an example that shows simulation of genetic data and then simulation of three different phenotypic traits. Other examples and installation instructions are provided at the program website and also in the user manual. Refer to the user manual for description of input parameters.

#-------------------------------------------Example begins---------------------------------------#
#Step 1: genetic data simulation:
plink --simulate-ncases 5000 --simulate-ncontrols 5000 --simulate wgas.sim --out sim.plink --make-bed
#Step 2: Convert .bed to .ped:
plink --bfile sim.plink --recode --out sim.plink
#Step3: phenotype simulation from previously made genetic data:
python cophesim.py -i sim.plink -o testout -itype plink -otype plink -c -ce effects.txt -s -gomp
#-------------------------------------------End of example---------------------------------------#

In this example, we first (Step 1) simulate genetic data using Plink. We simulate N.cases = N.control = 5,000 cases and controls and 1,000 SNPs (defined in wgas.sim file, refer to the Plink website to see documentation for this type of file). Then (Step 2) we convert a binary sim.plink.bed file to sim.plink.ped (option --recode in Plink). This step is not required since cophesim can handle binary Plink files (.bed, .bim, .fam), but we provide this step in order to show the ability of the program to deal with Plink PED format. Finally (Step 3), we simulate dichotomous (by default), continuous (option -c) and survival (option -s) traits from previously simulated data stored in files sim.plink.ped and sim.plink.map. Note that we simulate survival trait with Gompertz hazard function (option -gomp); effect sizes for causal variants are provided in file effects.txt (to include this file we use option -ce).

ROC curves

We provide Receiver-Operating Characteristic (ROC) curves (Figure 2) constructed from association tests performed on a simulated dataset. Simulation and association testing were performed with Plink suite. The following parameters were used: N = 10,000 individuals, N.snp.c = 100 causal, with total N.snp = 1,000 variants. Causal variants were labeled with ‘1’ and the other (neutral) variants were labeled with ‘0’. These labels are then used later as true identifiers during calculation of TPR (true positive rate) and FPR (false positive rate). Dichotomous, continuous and survival phenotypic traits were simulated with cophesim. Then association tests were performed with Plink for dichotomous and continuous traits (using Plink flags –logistic and -regression, respectively). Association tests for survival trait were performed with the R package GenABEL. Then calculated p-values provided by association tests for each variant were compared to the significance threshold. Those variants passed the threshold were recognized as causal and associated with simulated phenotype. These classification results later were compared to the true identifiers (defined above) in order to obtain TPR and FPR. For all these tests, we varied the significance threshold from 0 to 1 with the increment of 0.001.

139a6dfe-3ead-4a3c-b6cd-eae867165496_figure2.gif

Figure 2. ROC curves constructed from results of association tests performed on a simulated dataset of N = 10,000 individuals, 100 causal and 1,000 of total SNP sites.

TPR: True Positive Ratio, FPR: False Positive Ratio. These results were calculated for dichotomous, continuous and survival traits. The dashed, 45 degrees line represents random guessing.

The R code to construct ROC curves is provided in the file “roc.R”. This file is attached to this computer note and also in the data repository: https://bitbucket.org/izhbannikov/cophesim_data/ROC/roc.R

Conclusion

In this work we presented the cophesim for phenotype simulation from genetic data obtained either from simulation or real data collecting. cophesim makes it possible to simulate various demographic models under user-defined scenarios.

Software and data availability

Tool and source code available from: https://bitbucket.org/izhbannikov/cophesim

Archived source code as at time of publication: doi:10.5281/zenodo.81019518

License: MIT

The example script and output files for the software are available at: https://doi.org/10.5281/zenodo.80409019.

To test the cophesim we provided a repository “cophesim_data”: https://bitbucket.org/izhbannikov/cophesim_data. Download or clone this repository to be able to run tests.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 01 Aug 2017
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Zhbannikov IY, Arbeev KG and Yashin AI. cophesim: A comprehensive phenotype simulator for testing novel association methods [version 1; peer review: 2 approved]. F1000Research 2017, 6:1294 (https://doi.org/10.12688/f1000research.11968.1)
NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 01 Aug 2017
Views
10
Cite
Reviewer Report 18 Sep 2017
Lars Rönnegård, Dalarna University, Falun, Sweden 
Elena - Flavia Mouresan, Department of Animal Breeding and Genetics, Swedish University of Agricultural Sciences, Uppsala, Sweden 
Approved
VIEWS 10
Zhbannikov and co-workers present the cophesim software that they have developed for simulating phenotypic data using genotype information. The input and output file formats are compatible with many of the most commonly used computer programs for genome wide association studies. ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Rönnegård L and Mouresan EF. Reviewer Report For: cophesim: A comprehensive phenotype simulator for testing novel association methods [version 1; peer review: 2 approved]. F1000Research 2017, 6:1294 (https://doi.org/10.5256/f1000research.12938.r25816)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
7
Cite
Reviewer Report 08 Aug 2017
Arnold B. Mitnitski, Department of Medicine, Dalhousie University, Halifax, NS, Canada 
Approved
VIEWS 7
In the manuscript “Cophesim: a comprehensive phenotype simulator for testing novel association methods”, I. Zhbannikov and colleagues from the Duke University, presented a software that allowed to generate genotype data prepared with a genetic simulator for the use in the ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Mitnitski AB. Reviewer Report For: cophesim: A comprehensive phenotype simulator for testing novel association methods [version 1; peer review: 2 approved]. F1000Research 2017, 6:1294 (https://doi.org/10.5256/f1000research.12938.r24718)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 01 Aug 2017
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.