A Genome-Wide Association Study of spontaneous preterm birth in a European population

Background: Preterm birth is defined as a birth prior to 37 completed weeks’ gestation. It affects more than 10% of all births worldwide, and is the leading cause of neonatal mortality in non-anomalous newborns. Even if the preterm newborn survives, there is an increased risk of lifelong morbidity. Despite the magnitude of this public health problem, the etiology of spontaneous preterm birth is not well understood. Previous studies suggest that genetics is an important contributing factor. We therefore employed a genome-wide association approach to explore possible fetal genetic variants that may be associated with spontaneous preterm birth.
Methods: We obtained preterm birth phenotype and genotype data from the National Center for Biotechnology Information Genotypes and Phenotypes Database (study accession phs000103.v1.p1). This dataset contains participants collected by the Danish National Birth Cohort and includes 1000 preterm births and 1000 term births as controls. Whole genomes were genotyped on the Illumina Human660W-Quad_v1_A platform, which contains more than 500,000 markers. After data quality control, we performed genome-wide association studies for the 22 autosomal chromosomes.
Results: No single nucleotide polymorphism reached genome-wide significance after Bonferroni correction for multiple testing.
Conclusion: We found no evidence of genetic association with spontaneous preterm birth in this European population. Approaches that facilitate detection of both common and rare genetic variants, such as evaluation of high-risk pedigrees and genome sequencing, may be more successful in identifying genes associated with spontaneous preterm birth.

Preterm birth (PTB), defined as birth prior to 37 completed weeks' gestation, is a major public health problem that affects more than 10% of all births worldwide 1,2. Globally, an estimated 15 million babies are born prematurely each year 1,2. Despite substantial public health efforts over the past several decades, the U.S. PTB rate remained at 11.72% in 2011 3,4.

Definition of spontaneous preterm birth
Some PTBs are iatrogenic and can be attributed to obstetric intervention aimed at reducing maternal and/or fetal risk. The remaining PTBs are known as spontaneous PTB (SPTB) and are the focus of research efforts to identify genetic and environmental risk factors. Although SPTB is a pressing health issue, the incomplete understanding of its biology has inhibited development of effective prevention and treatment strategies.

Our genome-wide approach for spontaneous preterm birth
Here, we employed an unbiased, genome-wide approach to search for possible candidate genes associated with SPTB. Genome-wide association studies (GWAS), or whole genome association studies, are a commonly used genetic approach to study a disease or a trait. GWAS compares thousands or even millions of common genetic markers, mainly single nucleotide polymorphisms (SNPs), across individuals with and without a disease or trait. There are 1,688 publications identifying 11,299 SNPs that are significant for diseases or traits in 17 different categories 37,38. We obtained the SPTB phenotype and genotype data deposited in the National Center for Biotechnology Information (NCBI) Genotypes and Phenotypes Database (dbGaP) 39,40, study accession phs000103.v1.p1, to perform a GWAS to explore genetic variants associated with SPTB.

Data application and approval
We applied to and received approval from dbGaP 39,40 for access to a dataset of SPTB phenotypes and genotypes (study accession phs000103.v1.p1). We followed the Data Use Certification Agreement that we signed during the application.
To access the data, one must apply and agree to the dbGaP terms of usage. Detailed instructions and procedures for application can be obtained at https://dbgap.ncbi.nlm.nih.gov/.

Data content
This dataset contains participants collected by the Danish National Birth Cohort (DNBC) 41. DNBC is a prospective cohort that enrolled more than 100,000 pregnant women in the first trimester, before most adverse outcomes occurred, and therefore is free from sampling or collection bias. In this dataset, there are approximately 1000 preterm births and 1000 term births as controls. These study subjects were collected from 1997 to 2003. All are singleton gestations. Each birth is recorded as a mother-child pair. With the exception of 24 children with one or two grandparents from other Nordic countries, all children in the dataset had parents and all four grandparents born in Denmark.
The case (preterm) group contains births delivered before 37 gestational weeks. The control (term) group contains births delivered at approximately 40 weeks' gestation. In both preterm and term groups, children born with any recognized congenital or genetic abnormality were excluded. Pregnancies with maternal conditions known to be associated with PTB, iatrogenic or spontaneous (placenta previa, placental abruption, hydramnios, isoimmunization, placental insufficiency, pre-eclampsia/eclampsia), were also excluded by DNBC.
A blood sample (buffy coat) was collected for each mother-child pair.
Whole genomes were genotyped on the Illumina Human660W-Quad_v1_A platform (Illumina, Inc., San Diego, California, USA), which contains more than 500,000 markers. Genotyping was performed by the Johns Hopkins University Center for Inherited Disease Research (Baltimore, Maryland, USA). Further data cleaning and harmonization were done at the GENEVA Coordinating Center at the University of Washington (Seattle, Washington, USA).

Quality control
In this study, we focused only on fetal genomes for further analysis. Individuals with more than 3% missing genotypes were filtered out. In addition, individuals with a heterozygosity rate deviating more than 3 standard deviations from the mean were excluded from further analysis 42. After per-individual quality control, we performed per-SNP filtering. SNPs with more than 3% missing genotypes were excluded 42. SNPs with a significantly different genotype missing rate between the case and control groups were also eliminated, using a conservative cut-off of p < 1×10⁻⁵ 42. We also excluded SNPs that deviated significantly from Hardy-Weinberg equilibrium in the control group (p < 1×10⁻⁵) 42,43.
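The filters described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline used in the study: the function names `individual_qc` and `hwe_chi2_p` are hypothetical, the sample standard deviation is assumed for the heterozygosity filter, and the Hardy-Weinberg check is written as a 1-degree-of-freedom chi-square test, whereas the text does not say which test (chi-square or exact) was applied.

```python
from math import erfc, sqrt
from statistics import mean, stdev


def individual_qc(missing_rate, het_rate, max_missing=0.03, n_sd=3.0):
    """Return indices of individuals that pass both per-individual filters."""
    mu = mean(het_rate)
    sd = stdev(het_rate)  # sample standard deviation (an assumption)
    keep = []
    for i, (miss, het) in enumerate(zip(missing_rate, het_rate)):
        if miss > max_missing:
            continue  # more than 3% of genotypes missing
        if abs(het - mu) > n_sd * sd:
            continue  # heterozygosity more than 3 SD from the mean
        keep.append(i)
    return keep


def hwe_chi2_p(n_aa, n_ab, n_bb):
    """P value of a 1-df chi-square test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0  # no genotypes observed: nothing to test
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic marker: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return erfc(sqrt(chi2 / 2.0))  # chi-square survival function for 1 df
```

Under these assumptions, a SNP would be dropped when `hwe_chi2_p` computed on the control genotype counts falls below 1×10⁻⁵.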

Data overview
We started with a dataset comprising 1,900 children. There were 31 children with a missing genotype rate greater than 3%. The average heterozygosity was 0.3238, with a standard deviation of 0.0059. There were also 31 children whose heterozygosity deviated more than 3 standard deviations from the mean. There was considerable overlap between these two filtering criteria (Figure 1): 26 individuals were identified by both. During per-individual quality control, a total of 36 individuals were excluded, leaving 1,864 individuals. Of these, 66 had a missing phenotype, yielding a final total of 849 cases and 949 controls (Table 1).
Among the 560,768 markers, 1,933 SNPs exceeded the missing rate threshold of 3%. There were 367 SNPs with a significantly different missing rate between the case group and the control group (P value < 1×10⁻⁵). Further, 885 SNPs deviated significantly from Hardy-Weinberg equilibrium in the control group (P value < 1×10⁻⁵). These three criteria also identified some overlapping SNPs; a total of 2,670 SNPs were excluded using all criteria. These quality control steps left 558,098 SNPs in the dataset. Of these, 544,675 are located on the 22 autosomes and were included in the analysis (Table 1).
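The counts reported in this section can be cross-checked with simple inclusion-exclusion arithmetic. The short sketch below uses only the numbers given in the text; the variable names are illustrative.

```python
# Per-individual quality control
n_start = 1900
n_high_missing = 31      # missing genotype rate > 3%
n_het_outlier = 31       # heterozygosity more than 3 SD from the mean
n_both = 26              # flagged by both criteria
n_excluded = n_high_missing + n_het_outlier - n_both
n_after_qc = n_start - n_excluded
n_missing_phenotype = 66
n_final = n_after_qc - n_missing_phenotype
n_cases, n_controls = 849, 949

print(n_excluded)   # 36
print(n_after_qc)   # 1864
print(n_final)      # 1798
print(n_final == n_cases + n_controls)  # True

# Per-SNP quality control
snps_start = 560_768
flags = 1_933 + 367 + 885        # flags raised by the three criteria
snps_excluded = 2_670            # unique SNPs removed
snps_after_qc = snps_start - snps_excluded
surplus_flags = flags - snps_excluded  # flags on SNPs caught more than once

print(snps_after_qc)   # 558098
print(surplus_flags)   # 515
```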

Allelic test
We carried out GWAS for these 1,798 individuals, of which 849 are SPTB cases and 949 are term controls, over 544,675 SNPs on the 22 autosomal chromosomes. An allelic test was carried out first, and no SNP reached genome-wide significance after Bonferroni correction for multiple testing (Manhattan plot, Figure 2; QQ plot, Figure 3).
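As a rough illustration of the allelic test and the Bonferroni threshold, the sketch below computes a 1-degree-of-freedom Pearson chi-square on a 2×2 table of allele counts (cases versus controls) and the genome-wide threshold for 544,675 tests. It is a minimal sketch, not the software used in the study; the function name `allelic_chi2` and the example allele counts are hypothetical.

```python
from math import erfc, log10, sqrt


def allelic_chi2(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) on a 2x2 allele-count table.

    Returns (chi2, p_value). Counts are alleles, i.e. two per individual.
    """
    observed = ((case_a, case_b), (ctrl_a, ctrl_b))
    rows = (case_a + case_b, ctrl_a + ctrl_b)
    cols = (case_a + ctrl_a, case_b + ctrl_b)
    n = rows[0] + rows[1]
    if 0 in rows or 0 in cols:
        return 0.0, 1.0  # degenerate table: no test possible
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2, erfc(sqrt(chi2 / 2.0))  # chi-square survival function, 1 df


N_TESTS = 544_675
ALPHA = 0.05
bonferroni_p = ALPHA / N_TESTS        # per-SNP significance level
threshold = -log10(bonferroni_p)      # same threshold on the -log10 scale
print(round(threshold, 2))            # 7.04

# Hypothetical allele counts for one SNP: 849 cases and 949 controls
# contribute 1,698 and 1,898 alleles respectively.
chi2, p = allelic_chi2(400, 1298, 400, 1498)
print(p < bonferroni_p)
```

A SNP is declared genome-wide significant only if its P value falls below `bonferroni_p`, which is equivalent to its -log10(P) exceeding the line drawn at 7.04 in the Manhattan plots.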

Other genetic models
We then performed GWAS with different genetic models. We tested three classical Mendelian inheritance models. The recessive model assumes that two copies of the variant allele are required to alter the phenotype, whereas in the dominant model one copy of the variant allele has the same effect on the phenotype as two copies. The additive model assumes that heterozygotes present a phenotype intermediate between the two homozygotes and thus considers the three genotypes separately. Manhattan plots for the additive model (Figure 4), dominant model (Figure 5), and recessive model (Figure 6) are shown. The dominant and recessive models refer to the action of the minor allele. Under these genetic models, no SNP reached genome-wide significance after Bonferroni correction for multiple testing.
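One common way to express these three models is as different numeric codings of the same genotype, counted as copies of the minor allele. The sketch below illustrates that convention; the function name `encode_genotype` is hypothetical, and the 0/1/2 dosage coding for the additive model is an assumption rather than a description of the software used here.

```python
def encode_genotype(minor_allele_count, model):
    """Map a genotype (0, 1 or 2 copies of the minor allele) to a predictor."""
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("minor_allele_count must be 0, 1 or 2")
    if model == "additive":
        return minor_allele_count                    # heterozygote is intermediate
    if model == "dominant":
        return 1 if minor_allele_count >= 1 else 0   # one copy is sufficient
    if model == "recessive":
        return 1 if minor_allele_count == 2 else 0   # two copies are required
    raise ValueError(f"unknown model: {model}")


for model in ("additive", "dominant", "recessive"):
    print(model, [encode_genotype(g, model) for g in (0, 1, 2)])
# additive [0, 1, 2]
# dominant [0, 1, 1]
# recessive [0, 0, 1]
```

The dominant and recessive codings collapse the three genotypes into two groups, so each reduces to a 2×2 comparison of cases and controls, while the additive coding keeps all three genotype classes distinct.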

Discussion
Here we describe a negative result for associations between SPTB and genetic polymorphisms on the 22 autosomal chromosomes in a homogeneous European population. Myking et al. reported a GWAS focusing on the X chromosome 47 that incorporated Danish cases and controls from DNBC 41,47, in addition to participants enrolled from the Norwegian Mother and Child Cohort Study (MoBa) 48. Even with a larger sample size (DNBC + MoBa) and fewer independent tests, limited to the markers on the X chromosome, no SNP reached genome-wide significance after Bonferroni correction 47.
One way to decrease the probability of a negative result is to increase the sample size, either by recruiting more cases and controls directly, or by combining different studies and conducting a meta-analysis 49,50. Instead of sequencing thousands or millions of sporadic cases plus controls, another approach is to study SPTB using a family-based design, i.e., to identify high-risk pedigrees in which a genetic mutation is more likely to be present in multiple individuals. Pedigree studies have the additional advantage of reduced phenotypic heterogeneity 49. Several loci associated with SPTB have been identified by family-based linkage studies 51,52.

Another approach is to employ whole genome or whole exome sequencing. This will help to identify rare genetic variants with potentially larger effect sizes.
Another approach that may increase statistical power is to analyze the SPTB phenotype as a quantitative trait instead of a dichotomous one. The distribution of gestational age in the population is approximately normal 53. Therefore, to analyze SPTB as a quantitative trait (i.e., gestational age), samples should be drawn randomly from the population of newborns.

Conclusion
We found no evidence of genetic association with SPTB in a Danish population using an unbiased genome-wide approach. A family-based design in a high-risk pedigree, and whole genome or exome sequencing, may yield higher detection rates of both common and rare variants associated with SPTB.

Open Peer Review
I have now read the paper and as it stands it is ok. I would however recommend several things that may improve it:
State that the data generally suggest that most effects are maternal, so that this may be a hard task.
Note whether any of the candidate genes that have even nominal data fall in regions of statistical significance. Although not proof positive, such findings would support an association, albeit not a novel one.
I always like to ask if the cases are in HWE as well as the controls, as there are several papers I could cite, including some of our own, that indicate this is a good way to detect effects.
Lastly, I am concerned that their Bonferroni correction is too conservative, as the ~550,000 tests are not all independent. We have a paper in press on this point. That said, nothing the authors have done is incorrect, only what I would consider to be a too limited exploration of the data. And since this type of study is by nature exploratory, I would focus on lowering type 2 errors as opposed to type 1 (e.g., Bonferroni corrections).
From David Olson: Preterm birth is a very complex and heterogeneous problem; there are many causes, and these can vary according to gestational age. I would have expected the outcomes observed (no significant SNPs) because the authors did not stratify their preterm birth subjects into homogeneous groups. They should have separated single-fetus pregnancies from multiple-fetus pregnancies, excluded those with other predisposing problems such as pre-eclampsia, preterm premature rupture of membranes, incompetent cervix, etc., and they should have stratified according to gestational age: <28 weeks, 28-30, 30-34 and 34-37. I would suggest they re-analyze their data according to such stratification in an attempt to create more homogeneous groups.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, we have significant reservations, as outlined above.
Competing Interests: No competing interests were disclosed.
In their paper, "A Genome-Wide Association Study of spontaneous preterm birth in a European population", the authors perform a GWAS on spontaneous preterm birth in a European population. The GWAS is based on 849 cases (preterm births) and 949 controls (term births), and uses a dense genotyping panel that contains over 500,000 SNPs. The paper is nicely written and the analyses are well done. However, I have several issues that need to be taken care of. In its current form, this paper lacks additional analyses. Below are my non-exhaustive comments:
In this study, no SNP reaches the genome-wide significance threshold. However, as we can see in the Manhattan plot (Figure 2), there are several suggestive signals, especially on chromosomes 12, 15, and 3. The authors should investigate these signals and try to find the nearby genes.
They perform four models and show the corresponding Manhattan plots. First, they do not explain why they perform these four models. Second, the allelic test should give similar results to the additive model test. The authors do not compare these models, at least for the most significant signals. Otherwise, these analyses seem to be redundant.
Providing the Manhattan plots is far from enough. We cannot see the names of the most significant SNPs. In addition, they should mention whether a GWAS for preterm birth has been performed previously, and if so, they should compare their results with what can be found in the literature. Moreover, they mention that there is evidence of a candidate gene. They should say how this gene was discovered, and show their SNP results in this gene or its region. In addition, they might consider performing the haplotypic test in this gene or in the region of the most significant SNPs. They should present these results in informative tables.
In the introduction, the authors say that there are genetic factors that contribute to SPTB. It is important to mention the heritability estimates or the familial recurrence risk estimate (if there is any) for SPTB.
The authors talk about negative results in their GWAS. They should discuss why they did not discover any genome-wide significant SNP (lack of power?). They also need to discuss the weaknesses of this study, and what could be done in the future.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard; however, I have significant reservations, as outlined above.

Figure 1. Per-individual quality control. The X axis is the missing genotype rate for each individual. The Y axis is the heterozygosity rate for each individual. Each dot represents a person. The vertical dashed line is the cut-off for the per-individual missing rate: 3%. Individuals with a missing rate greater than 3% are excluded. The two horizontal dotted lines represent the mean heterozygosity ± 3 standard deviations. Individuals whose heterozygosity deviates more than 3 standard deviations from the mean are also excluded. The two criteria overlap largely in the lower right part of the graph.

Figure 2. Manhattan plot for allelic test. The X axis is the position of each SNP, grouped by chromosome and presented in different colors. The Y axis is the P value of the test for each SNP, on a -log10 scale. The horizontal blue line indicates the threshold of genome-wide significance after Bonferroni correction. The threshold is at 7.04, which corresponds to -log10 (0.05/544,675) for 544,675 independent tests. No SNP reached genome-wide significance.

Figure 3. QQ plot for allelic test. The X axis indicates the expected P value, on a -log10 scale. The Y axis indicates the observed P value from the allelic test, also on a -log10 scale. The red diagonal line is the line Y=X, where observed equals expected. The genome-wide significance threshold on the -log10 (P) scale is at 7.04, which corresponds to -log10 (0.05/544,675) for 544,675 independent tests. No SNP reached genome-wide significance.

Figure 4. Manhattan plot for additive model. The X axis is the position of each SNP, grouped by chromosome and presented in different colors. The Y axis is the P value of the test for each SNP, on a -log10 scale. The horizontal blue line indicates the threshold of genome-wide significance after Bonferroni correction. The threshold is at 7.04, which corresponds to -log10 (0.05/544,675) for 544,675 independent tests. No SNP reached genome-wide significance.

Figure 5. Manhattan plot for dominant model. This is the GWAS result assuming the dominant action of minor alleles. The Y axis is the P value of the test for each SNP, on a -log10 scale. The horizontal blue line indicates the threshold of genome-wide significance after Bonferroni correction. The threshold is at 7.04, which corresponds to -log10 (0.05/544,675) for 544,675 independent tests. No SNP reached genome-wide significance.

Figure 6. Manhattan plot for recessive model. This is the GWAS result assuming the recessive action of minor alleles. The Y axis is the P value of the test for each SNP, on a -log10 scale. The horizontal blue line indicates the threshold of genome-wide significance after Bonferroni correction. The threshold is at 7.04, which corresponds to -log10 (0.05/544,675) for 544,675 independent tests. No SNP reached genome-wide significance.
Department of Obstetrics and Gynecology, University of Alberta, Edmonton, AB, Canada
Department of Genetics, Institute of Quantitative Biomedical Sciences, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
From Scott Williams:
of Medical Genetics, University of Washington, Seattle, WA, USA

Table 1. Number of participants and markers in the dataset.
sd: standard deviation.