Characterization of population-based variation and putative functional elements for the multiple-cancer susceptibility loci at 5p15.33

Background: TERT encodes the telomerase reverse transcriptase, which is responsible for maintaining telomere ends by addition of (TTAGGG)n nucleotide repeats at the telomere. Recent genome-wide association studies have found common genetic variants at the TERT-CLPTM1L locus (5p15.33) associated with an increased risk of several cancers. Results: Data were acquired for 1627 variants in 1092 unrelated individuals from 14 populations within the 1000 Genomes Project. We assessed the population genetics of the 5p15.33 region, including recombination hotspots, diversity, heterozygosity, differentiation among populations, and potential functional impacts. There were significantly lower polymorphism rates, divergence, and heterozygosity for the coding variants, particularly for non-synonymous sites, compared with non-coding and silent changes. Many of the cancer-associated SNPs had differing genotype frequencies among ancestral groups and were associated with potential regulatory changes. Conclusions: Surrogate SNPs in linkage disequilibrium with the majority of cancer-associated SNPs were functional variants with a likely role in regulation of TERT and/or CLPTM1L. Our findings highlight several SNPs that future studies should prioritize for evaluation of functional consequences.


Introduction
The 5p15.33 locus includes the TERT (human telomerase reverse transcriptase) and the CLPTM1L (alias CRR9; cleft lip and palate transmembrane 1 like) genes. Telomerase reverse transcriptase (TERT) is the essential catalytic component of the telomerase holoenzyme responsible for maintaining telomere ends. Telomerase compensates for DNA polymerase's inability to fully replicate the lagging DNA strand by adding hexanucleotide (5'-TTAGGG-3')n repeats to the 3' end of chromosomes using a template sequence within the RNA component (TERC) of the enzyme 1 . Telomeres, consisting of these hexanucleotide repeats and several associated proteins, are responsible for preserving chromosomal stability by protecting chromosomes from end-to-end fusion, atypical recombination, and degradation 2 . In normal differentiated cells, expression of telomerase is very low or absent and telomeres erode by 50 to 200 base pairs with each cell division 1 . When the telomeres become critically short, they act as a cellular clock and signal cellular senescence and apoptosis 3,4 . In contrast, telomerase activity has been detected in 90% of human cancers 5,6 and allows these malignant cells to continually divide by bypassing cellular crisis 7 .
CLPTM1L is located approximately 23 kilobases (kb) centromeric of TERT. Little is known about the function of the CLPTM1L protein. It is a predicted transmembrane protein that is expressed in a range of normal and malignant tissues including skin, lung, breast, ovary and cervix, and has been shown to sensitize ovarian cancer cells to cisplatin-induced apoptosis 8 .
The clinically related telomere biology disorders (TBDs), such as pulmonary fibrosis or aplastic anemia, are associated with germline mutations causing amino acid substitutions, additions, deletions, and frameshift mutations within TERT 9,10 . Patients with the more severe TBD, dyskeratosis congenita (DC), have very high risks of bone marrow failure and cancer, and have telomeres below the 1st percentile for their age 11 . DC represents the most clinically severe outcome of germline TERT mutations and often presents in childhood. Individuals with isolated aplastic anemia or pulmonary fibrosis due to TERT mutations tend to manifest clinical symptoms in adulthood.
Both TERT and CLPTM1L are evolutionarily conserved across diverse species, which suggests their functional importance 8,27,28 . TERT has low nucleotide diversity, and common SNPs in this gene region show low levels of differentiation among populations and high ancestral allele frequencies 28,29 ; this pattern of low overall diversity suggests that TERT may be constrained 29 .
The 1000 Genomes Project Consortium has reported that different populations have different profiles of rare and common variants, and varying degrees of purifying selection at functionally relevant low-frequency sites, which lead to substantial local population differentiation 30 . Large surveys of human genetic variation have described an excess of rare genetic variants as a result of a recent population expansion and weak purifying selection 31-33 , particularly for variants in disease genes and for individuals of European ancestry 33 .
In order to better understand the population genetics underlying the 5p15.33 locus associated with cancer, we conducted a detailed analysis of allele frequency patterns among ancestral groups, levels of differentiation, and recombination at the 5p15.33 locus using 1000 Genomes Project 34 data. We retrieved data for the TERT-CLPTM1L genes and flanking regions for 1092 individuals from 14 populations. Analyses were focused on understanding how allele frequencies differ between populations, and on evaluating the cancer-associated SNPs and their surrogate markers for potential functional elements.

Data analysis
The package ARLEQUIN version 3.5 35 was used to compute F_ST values, diversity, AMOVA, and heterozygosity. F_ST values based on allele frequencies were calculated as a measure of population differentiation, with significance estimated from 10,000 permutations, and were compared to the genome-wide average for autosomal SNPs (F_ST ≈ 0.1 [36][37][38][39] ). The population of African-Americans in the Southwestern United States (ASW) was grouped with the two populations of West African ancestry (Luhya in Kenya [LWK] and Yoruba in Nigeria [YRI]) because our population-level analyses found them to be most closely related to these individuals of African ancestry, as previously observed 40 .
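As an illustration of what an allele-frequency-based F_ST measures, the following minimal Python sketch computes Wright's F_ST = (H_T - H_S) / H_T for a single biallelic SNP. It is not the ARLEQUIN estimator used in this study (ARLEQUIN's estimators and permutation testing are more involved); the function names and the sample-size weighting are assumptions made for the example.

```python
from typing import Sequence


def heterozygosity(p: float) -> float:
    """Expected heterozygosity 2p(1-p) at a biallelic site with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_biallelic(freqs: Sequence[float], sizes: Sequence[int]) -> float:
    """Wright's F_ST = (H_T - H_S) / H_T for one biallelic SNP.

    freqs: frequency of the same allele in each population (hypothetical input).
    sizes: number of chromosomes sampled per population, used as weights.
    """
    total = sum(sizes)
    # Pooled allele frequency across populations, weighted by sample size.
    p_bar = sum(p * n for p, n in zip(freqs, sizes)) / total
    h_t = heterozygosity(p_bar)
    # Mean within-population expected heterozygosity, weighted by sample size.
    h_s = sum(heterozygosity(p) * n for p, n in zip(freqs, sizes)) / total
    if h_t == 0.0:
        return 0.0  # monomorphic site: no differentiation can be measured
    return (h_t - h_s) / h_t
```

Under this definition, identical allele frequencies in all populations give a value near 0, and fixation of alternative alleles in two populations gives 1; values close to the genome-wide average of about 0.1 quoted above indicate typical differentiation.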
In order to apportion the fraction of the genetic variance due to differences between and within ancestral groups (European, East Asian, West African, and American) and infer the genetic structure of the populations, AMOVA was performed with 10,000 permutations. HAPLOVIEW version 4.1 41 was used to determine the degree of linkage disequilibrium (LD) and minor allele frequency (MAF). The GLU genetics ld.tagzilla module was used for the tag analysis with a pairwise LD r^2 threshold of 0.8. Pairwise LD was analyzed separately for the four ancestral groups and used to select tag SNPs for each region.
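For readers unfamiliar with the pairwise r^2 statistic used for the LD and tag analyses, the sketch below computes it for two biallelic SNPs from phased haplotypes coded 0/1. It is a textbook calculation given only for illustration, not the HAPLOVIEW or ld.tagzilla implementation; the function name and the input encoding are assumptions.

```python
from typing import Sequence


def ld_r2(hap_a: Sequence[int], hap_b: Sequence[int]) -> float:
    """Pairwise LD r^2 between two biallelic SNPs.

    hap_a, hap_b: alleles (0 or 1) carried at SNP A and SNP B on the same
    phased haplotypes, in the same order (hypothetical input format).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n  # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n  # frequency of allele 1 at SNP B
    # Frequency of haplotypes carrying allele 1 at both SNPs.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # LD coefficient D
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        return 0.0  # at least one SNP is monomorphic in this sample
    return d * d / denom
```

A pair of SNPs would pass the tagging threshold described above when `ld_r2(...) >= 0.8`.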
SNPs within TERT and CLPTM1L were grouped by functional category (i.e., coding vs. non-coding, and synonymous vs. non-synonymous variants), and tested for significant differences in the normalized number of variant sites, allelic frequency divergence, heterozygosity, minor allele frequency (MAF), and levels of differentiation among populations; significant differences would suggest that these functional categories of loci were not affected similarly, contrary to what is expected under the assumption of neutrality. The allelic frequency divergence between ancestral groups was computed using: $d = 1 - [(x_1 y_1)^{1/2} + (x_2 y_2)^{1/2}]$, where $x_1$ and $y_1$ are the frequencies of the first allele and $x_2$ and $y_2$ are the frequencies of the second allele 42 . The normalized number of variant sites was calculated as: $\hat{\theta} = K / \left( L \sum_{i=1}^{n-1} i^{-1} \right)$, where $K$ is the number of variant sites, $n$ is the number of chromosomes, and $L$ is the total sequence length. Differences between the SNP functional categories were tested for significance with a two-tailed t-test. SIFT (Sorts Intolerant From Tolerant) and PolyPhen-2 (Polymorphism Phenotyping v2) were used to predict the potential impact of an amino acid substitution 43,44 .
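The two summary statistics defined above can be sketched in a few lines of Python. This is an illustrative transcription of the formulas, not the code used for the analysis; the function names are hypothetical, and the divergence function assumes a biallelic site so that the second-allele frequencies are 1 - x_1 and 1 - y_1.

```python
import math


def allelic_divergence(x1: float, y1: float) -> float:
    """d = 1 - [sqrt(x1*y1) + sqrt(x2*y2)] for a biallelic site.

    x1, y1: frequency of the first allele in the two groups being compared.
    Assumes two alleles, so x2 = 1 - x1 and y2 = 1 - y1.
    """
    x2, y2 = 1.0 - x1, 1.0 - y1
    return 1.0 - (math.sqrt(x1 * y1) + math.sqrt(x2 * y2))


def normalized_variant_sites(k: int, n: int, length: int) -> float:
    """theta-hat = K / (L * sum_{i=1}^{n-1} 1/i).

    k: number of variant sites, n: number of chromosomes sampled,
    length: total sequence length in base pairs.
    """
    if n < 2 or length <= 0:
        raise ValueError("need at least two chromosomes and a positive length")
    harmonic = sum(1.0 / i for i in range(1, n))  # sum over i = 1 .. n-1
    return k / (harmonic * length)
```

Identical allele frequencies in two groups give d = 0, and fixed differences give d = 1; the harmonic-sum correction makes the per-site estimate comparable across groups with different numbers of sampled chromosomes.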
To identify recombination hotspots in this region, we used SequenceLDhot 45 , a program that uses the approximate marginal likelihood method 46 and calculates likelihood ratio statistics at a set of possible hotspots. We used the four ancestral groups [European (EUR; n=379), East Asian (EA; n=286), American (AM; n=184), and African (AFR; n=246)] to calculate background recombination rates using PHASE v2.1 47,48 . A likelihood ratio statistic of 12 predicts the presence of a hotspot with a false-positive rate of 1 in 3,700 independent tests.
Putative functional elements were assessed using the UCSC genome browser (http://genome.ucsc.edu/), a publicly available bioinformatics website, for ENCODE Regulation and Comparative Genomics tracks for all of the cancer-associated SNPs and their surrogates for each ancestral group. SNPs were considered surrogates for cancer-associated SNPs for each ancestral group if the r^2 ≥0.60, the inter-marker distance ≤200kb, and the MAF ≥0.05. We assessed potential regions of open chromatin with DNase hypersensitivity; potential regulatory histone marks (H3K4Me1, H3K4Me3, H3K27Ac); protein binding sites; regulatory motifs; CpG islands; conserved mammalian microRNA regulatory binding sites; and evolutionary conservation among placental mammals using the phyloP basewise conservation measurement 49 . Functional elements were also assessed using RegulomeDB, an integrated database that annotates SNPs with known or predicted regulatory DNA elements, including DNase hypersensitivity, transcription factor binding sites, and promoter regions that regulate transcription using data from GEO, ENCODE, and published literature 50 . RegulomeDB scores are a heuristic scoring system based on confidence that a variant is located in a functional region and likely results in a functional consequence; these are used to assist comparison among annotations 50 . Lower scores indicate increased evidence: category 2 scores are variants likely to affect binding; category 3 scores are less likely to affect binding; and scores of 4, 5, or 6 are variants with minimal binding evidence.
Table 1. The majority of SNPs in TERT and CLPTM1L were in intronic regions (N=903); only 72 were exonic (49 in TERT and 18 in CLPTM1L). 46 of the exonic variants were synonymous changes (32 in TERT and 9 in CLPTM1L) and 26 were non-synonymous protein altering variants (PAV) (17 in TERT and 9 in CLPTM1L).
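The surrogate criteria given in the methods above (r^2 ≥0.60, inter-marker distance ≤200kb, MAF ≥0.05) amount to a simple filter over pairwise LD records. The following Python sketch shows one way such a filter could be written; the record layout, field names, and SNP identifiers are hypothetical and are not taken from the study's pipeline.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PairwiseLD:
    """One candidate SNP paired with a cancer-associated SNP (hypothetical layout)."""
    snp_id: str        # identifier of the candidate surrogate
    r2: float          # pairwise r^2 with the cancer-associated SNP
    distance_bp: int   # inter-marker distance in base pairs
    maf: float         # minor allele frequency of the candidate


def is_surrogate(pair: PairwiseLD,
                 min_r2: float = 0.60,
                 max_distance_bp: int = 200_000,
                 min_maf: float = 0.05) -> bool:
    """Apply the three surrogate thresholds stated in the text."""
    return (pair.r2 >= min_r2
            and pair.distance_bp <= max_distance_bp
            and pair.maf >= min_maf)


def select_surrogates(pairs: Iterable[PairwiseLD]) -> List[str]:
    """Return identifiers of candidates passing all thresholds, in input order."""
    return [p.snp_id for p in pairs if is_surrogate(p)]
```

Because LD was analyzed separately for each ancestral group, a filter of this kind would be run once per group on that group's r^2 and MAF values.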
The SNPs previously associated with cancer at 5p15.33 25 are all located in the intronic regions of TERT or CLPTM1L or intergenic between these genes, except for one which is a coding synonymous SNP in TERT (rs2736098; Table 2).

Results
Since there were so few coding variants in the TERT and CLPTM1L loci, we combined them for the following analyses. The normalized number of variant sites, heterozygosity, and MAFs were significantly different by functional SNP category in TERT and CLPTM1L (P values <0.01; Table 1). Specifically, the non-coding SNPs (compared with coding SNPs) and synonymous SNPs (compared with non-synonymous SNPs) had significantly higher numbers of variant sites, heterozygosity, and MAFs (Table 1). These trends were consistent in all ancestral groups (Figure 1A). The most significant differences between coding and non-coding SNPs were in African populations (non-coding average MAF 9.8% vs. coding average MAF 0.9%), and the most significant differences between synonymous (syn.) versus non-synonymous (non-syn.) SNPs were in East Asian populations (syn. average MAF 4.8% vs. non-syn. average MAF 0.2%) (Figure 1A). There were significantly different levels of differentiation among ancestral groups for coding versus non-coding and synonymous versus non-synonymous SNPs (Figure 1B).

Protein altering variation
All PAVs were present at a rare or low frequency (Figure 1C). European ancestry individuals had higher MAFs for many of the PAVs in TERT and CLPTM1L, and there were significant MAF differences among ancestral groups for rs35719940, rs61748181, rs33955038, and rs113203740 (Figure 1C). Nine (53%) of the 17 PAVs observed in TERT and three (33%) of the nine PAVs observed in CLPTM1L were reported to be damaging by PolyPhen and/or SIFT (two in silico approaches; underlined in Figure 1C). Most of these potentially damaging variants were only observed in one individual. However, three possibly damaging variants in TERT were observed in multiple individuals [rs34094720 (N=3), rs61748181 (N=31), rs200843534 (N=5)] (Figure 1C).
Patterns of diversity and recombination among ancestral groups
A summary of the variation by ancestral group for this region is given in Table 3. There was low nucleotide diversity (average of 5. There was little to no LD in the TERT gene region but high LD was present in the CLPTM1L gene region (Figure 2 and Supplementary Figure 1). There were 4-5 main recombination hotspots in TERT and between TERT and CLPTM1L; there were no hotspots located within CLPTM1L (Supplementary Table 1). The greatest recombination was observed in individuals with African ancestry (5 recombination hotspots), and the lowest recombination in individuals with East Asian ancestry (4 recombination hotspots and lower likelihood ratio statistics) (Figure 2 and Supplementary Figure 1).

Cancer-associated SNPs
Twenty-three SNPs significantly associated with cancer at 5p15.33 25 were included in the analysis (Table 2). Many of the cancer-associated SNPs in this region had differing allele frequencies and heterozygosity among ancestral groups and populations, and had F_ST values close to or greater than 0.1 (Table 2 and Supplementary Table 4). The risk allele was the rare allele at all of these SNPs, except at rs4246742 (associated with lung cancer; Table 2). Most of the cancer-associated SNPs in the CLPTM1L gene region are in regions of high LD and therefore have many surrogates (25-54 surrogate SNPs) with r^2 ≥0.6 (Table 4 and Supplementary Table 2). In contrast, most of the SNPs in the TERT gene region are in a region of low LD and have no or few surrogates (0-5 surrogate SNPs) with r^2 ≥0.6 (Table 4 and Supplementary Table 2). In East Asian ancestry individuals, SNPs in the CLPTM1L gene region are particularly highly correlated, and even some of the SNPs within TERT are in high LD in these individuals (i.e., rs10069690, rs2242652, and rs13167280; Supplementary Figure 1).

Potential regulatory changes
All previously reported cancer-associated SNPs and all possible surrogates at r^2 ≥0.6 were assessed for the presence of potential regulatory elements and evolutionary conservation among mammalian species (summarized in Table 4 and Supplementary Table 3). Surprisingly, none of the cancer-associated SNP surrogates were located in the coding regions of TERT or CLPTM1L. Many of these SNPs are associated with open chromatin (DNase hypersensitivity) and/or regulatory histone marks (H3K4Me1, H3K4Me3, H3K27Ac) in multiple cell types, and/or alter known regulatory motifs and/or protein binding sites. One of the surrogate SNPs in the putative promoter region of TERT, rs2853669, is a conserved binding site for POLR2A, as were six other surrogate SNPs located intergenic between TERT and CLPTM1L, within the CLPTM1L gene region, and in the putative promoter region of CLPTM1L. One of the cancer-associated SNPs, rs2736098, and three surrogate SNPs in the 5' region and putative promoter region of TERT were C>T SNPs located in the CpG island. Clusters of several surrogate SNPs located within CLPTM1L and just 3' and 5' of CLPTM1L were associated with many histone marks and open chromatin, and/or altered regulatory motifs and protein binding sites. None of the cancer-associated SNPs or their surrogates were associated with microRNA binding sites.
[Table legend] RegulomeDB score indicates: 4 = TF binding + DNase peak, 5 = TF binding or DNase peak, 6 = motif hit, - = no data available; Highlighted rows indicate that one or more surrogates for this SNP results in a likely functional consequence (RegulomeDB score of 2); Mammal Conserv. = measurement of evolutionary placental mammal basewise conservation, the conserved sites are indicated.
We used the RegulomeDB scoring system to compare and prioritize potential functional consequences of these SNPs. The cancer-associated SNPs in the 5' region of TERT, most of the intergenic cancer-associated SNPs, and all the cancer-associated SNPs within CLPTM1L had surrogates with a likely functional consequence of affecting binding, indicated by a category 2 score (highlighted in Table 4 and Supplementary Table 3). None of the SNPs were identified to be associated with changes in expression of these genes.

Discussion
Data from the 1000 Genomes Project 34 on 1627 variants at 5p15.33 for 1092 unrelated individuals were used to describe the population genetic patterns in this region. We evaluated differentiation among ancestral groups, allele frequency patterns, and the cancer-associated SNPs and surrogates for potential regulatory elements.
We have previously shown that there is low nucleotide diversity and differentiation among populations in TERT and suggested that TERT may be constrained 28,29 ; however, our previous population genetics study focused on telomere genes as a gene set and was limited to only four SNPs located within the TERT gene 29 . In this study with better coverage of the TERT-CLPTM1L region, we determined that there is low nucleotide diversity across the 5p15.33 region in all ancestral groups and low differentiation among groups. As expected, African populations had more diversity, specifically at non-coding SNPs, compared to the other ancestral groups. However, East Asian populations had greater diversity at synonymous SNPs, and Europeans had the greatest frequency of non-synonymous changes. European and American ancestry individuals had very similar allele frequency patterns, as others have observed 51 .
The significantly reduced normalized number of variant sites, heterozygosity, and MAFs, and low differentiation among ancestral groups for the coding sites, particularly for non-synonymous sites, compared with non-coding and silent changes suggest purifying selection in TERT and CLPTM1L. African ancestry individuals had the greatest difference between the frequencies of non-coding vs. coding variants, consistent with stronger purifying selection; in contrast, European ancestry individuals had an excess of potentially deleterious non-synonymous SNPs. These observations are consistent with reports of genes important in cancer and complex disease 42,[52][53][54] and recent genomic reports 30-33 . European ancestry individuals have been reported to have an excess of recently arisen potentially deleterious variants in disease genes 33 . American and East Asian ancestry individuals also had an excess of coding variants compared to African ancestry individuals, suggesting weaker purifying selection in these populations as well. East Asian individuals had a particular excess of synonymous variants and very few non-synonymous variants. For the cancer-associated SNPs in this region, the risk allele was primarily the rare allele, which provides additional support for the hypothesis of constraint in this region. This evidence of purifying selection supports the importance of TERT and CLPTM1L in disease, and the variation by ancestry suggests the level of selection differs by geographic region.
We found that several of the 23 SNPs that have been significantly associated with cancer at 5p15.33 [Reviewed in 25] had differing MAFs and heterozygosity among ancestral groups. Europeans and Americans had the most similar MAFs and heterozygosity estimates, which suggests significant admixture. These differences, reflected in the high F_ST values, may correlate with varying disease incidence rates among ancestral groups. For example, the breast cancer associated SNP, rs10069690 23 , had significantly different minor allele frequencies among ancestral groups; the homozygous risk allele genotype was significantly more common in African ancestry individuals (genotype frequency of 40% vs. 2.4% in East Asian, 6.8% in American, and 8.4% in European ancestry individuals) and less common in East Asian ancestry individuals. This difference may be associated with the higher incidence of breast cancer in African ancestry individuals (particularly for estrogen receptor-negative breast cancer) and lower incidence in East Asian individuals.
Many of the cancer-associated SNPs and surrogate SNPs were associated with potential regulatory elements, including histone marks, open chromatin, transcription factor binding sites, and/or regulatory motifs. There were only a few surrogates for the SNPs located within TERT and just 5' of TERT due to the low levels of LD in these regions. In contrast, there were a large number of surrogates for the SNPs located close to and within CLPTM1L, where LD was strong and recombination low; most of these surrogates were shared among the cancer-associated SNPs in this region. Many of the surrogate markers were located in the putative promoter regions of TERT and CLPTM1L and may affect gene regulation. The RegulomeDB scoring approach allowed us to classify variants based on all of the regulatory information. This approach determined that surrogate SNPs for many of the cancer-associated SNPs are functional variants with a likely role in regulation; these should be prioritized for functional assays.

Conclusions
Our analysis of diversity in this important cancer-associated region of 5p15.33 provides background information for understanding variation in the general population. The functional impact of common variation in this region needs to be examined experimentally, but we could speculate that the diversity of coding variants among different ethnicities could have mild effects on the phenotype disparity observed among these populations. Many of the cancer-associated SNPs and/or surrogates at 5p15.33 are associated with regulatory changes and are candidates for evolutionary selection. Evidence of purifying selection in TERT and CLPTM1L highlights their functional importance and associations with complex disease. We have identified SNPs in this region that are likely involved in regulation of the TERT and/or CLPTM1L genes. Future studies of the functional consequences of the 5p15.33 variants will be required to understand their contribution to cancer etiology.

Author contributions
Project design was carried out by S.A.S., L.M., and M.Y.
Genotyping data were retrieved by C.C.C.
Analyses were performed by L.M.
The manuscript was written by L.M. and S.A.S., and reviewed by all co-authors.

The rationale for setting the threshold for marker surrogacy at r = 0.6 (p7) while using r = 0.8 for LD calculations (p3) should be explained.
In summary, this is a well designed and presented study that demonstrates the potential of using high throughput sequencing data together with growing resources such as ENCODE to enhance understanding of traditional genome-wide genotyping experiments. The title reflects the contents well, the abstract is appropriate, and the conclusions are justified and balanced.
We have read this submission. We believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
No competing interests were disclosed.

Duncan Baird
Institute of Cancer and Genetics, Cardiff University, Cardiff, UK
Numerous studies have identified variation at the TERT-CLPTM1L locus as conferring an increased risk of many different cancer types.
Here the authors have examined the genetic architecture of the TERT-CLPTM1L locus using sequence data from the 1000 Genomes Project. Given the potential significance of this locus, this type of work is important as it has the potential to identify functional variants that might not have been uncovered with the various GWAS undertaken to identify risk variants. Thus far none of the risk variants identified at this locus with GWAS results in non-synonymous protein changes; however, this study provides data to indicate that some of these variants may be associated with regulatory sequences and chromatin marks. This study also identified 26 variants that result in non-synonymous protein changes in the hTERT or the CLPTM1L genes. This is a well written manuscript and the conclusions are appropriately backed up by the data provided. The title is appropriate and the abstract adequately summarises the article. Overall this manuscript provides useful information that will underpin future work to establish the importance of this locus in conferring cancer risk.
I have no major criticisms of this work; however, I recommend that a more rigorous statistical review of this manuscript than I am able to provide is undertaken.
I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Competing Interests: No competing interests were disclosed.