Epigenetic germline variants predict cancer prognosis and risk and distribute uniquely in topologically associating domains

Shervin Goudarzi; Meghana Pagadala; Adam Klie; James V Talwar; Hannah Carter

doi:10.12688/f1000research.139476.2

Research Article

Revised

[version 2; peer review: 2 approved, 2 approved with reservations]

Shervin Goudarzi¹, Meghana Pagadala², Adam Klie²,³, James V Talwar²,³, Hannah Carter³–⁵

PUBLISHED 24 Jul 2025

Author details

¹ Canyon Crest Academy, San Diego, California, 92130, USA
² Biomedical Sciences Program, University of California San Diego, La Jolla, California, 92093, USA
³ Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, California, 92093, USA
⁴ Medicine, University of California San Diego, La Jolla, California, 92093, USA
⁵ Moores Cancer Center, La Jolla, California, 92093, USA

Shervin Goudarzi
Roles: Conceptualization, Data Curation, Formal Analysis, Software, Validation, Visualization, Writing – Original Draft Preparation, Writing – Review & Editing

Meghana Pagadala
Roles: Conceptualization, Data Curation, Methodology, Software

Adam Klie
Roles: Data Curation, Software

James V Talwar
Roles: Data Curation, Software

Hannah Carter
Roles: Methodology, Supervision, Writing – Original Draft Preparation, Writing – Review & Editing

This article is included in the Bioinformatics gateway.

This article is included in the Bioinformatics in Cancer Research collection.

Abstract

Background

Methylation quantitative trait loci (meQTLs) associate with different levels of local DNA methylation in cancers. Here, we investigated whether the distribution of cancer meQTLs reflected functional organization of the genome in the form of chromatin topologically associating domains (TADs) and evaluated whether cancer meQTLs near known driver genes have the potential to influence cancer risk or progression.

Methods

Published cancer meQTLs were analyzed according to their location in transcriptionally active or inactive TADs and TAD boundary regions. Cancer meQTLs near known cancer genes were analyzed for association with cancer risk in the UKBioBank and with prognosis in The Cancer Genome Atlas (TCGA).

Results

In TAD boundary regions, the density of cancer meQTLs was higher near inactive TADs. Furthermore, we observed an enrichment of cancer meQTLs in active TADs near tumor suppressors, whereas there was a depletion of such meQTLs near oncogenes. Several meQTLs were associated with cancer risk in the UKBioBank, and we were able to reproduce breast cancer risk associations in the DRIVE cohort. Survival analysis in TCGA implicated a number of meQTLs in 13 tumor types. In 10 of these, polygenic cancer meQTL scores were associated with increased hazard in a CoxPH analysis. Risk and survival-associated meQTLs tended to affect cancer genes involved in DNA damage repair and cellular adhesion and reproduced cancer-specific associations reported in prior literature.

Conclusions

This study provides evidence that genetic variants that influence local DNA methylation are affected by chromatin structure and can impact tumor evolution.

Keywords

meQTLs, TAD, Cancer, Polygenic Risk Score, XGBoost, Machine learning

Corresponding author: Hannah Carter

Competing interests: No competing interests were disclosed.

Grant information: This work was supported by an NIH National Cancer Institute grant (R01CA269919) to HC and a National Institute of General Medical Sciences infrastructure grant (2P41GM103504-11).

Copyright: © 2025 Goudarzi S et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Goudarzi S, Pagadala M, Klie A et al. Epigenetic germline variants predict cancer prognosis and risk and distribute uniquely in topologically associating domains [version 2; peer review: 2 approved, 2 approved with reservations]. F1000Research 2025, 12:1083 (https://doi.org/10.12688/f1000research.139476.2) First published: 01 Sep 2023, 12:1083 (https://doi.org/10.12688/f1000research.139476.1) Latest published: 24 Jul 2025, 12:1083 (https://doi.org/10.12688/f1000research.139476.2)

Revised Amendments from Version 1

This version of the manuscript includes clarifications requested by the reviewers. This resulted in updates to Figures 1, 2 and 5, the addition of two supplementary figures, and some added text and references.

See the authors' detailed response to the review by Charu Mehta
See the authors' detailed response to the review by Chiara Herzog

Introduction

Cancer is a heterogeneous disease and common treatments like chemotherapy have only a 55% response rate.¹ Precision medicine and biomarker analysis can tailor treatment options and optimize outcomes. Genetic factors, such as germline and somatic mutations, contribute to heterogeneous disease risk and progression. For example, germline variants in the BRCA2 gene can greatly increase the risk of developing breast and ovarian cancer.² Epigenetic factors including DNA methylation, histone modification, and acetylation also play a key role in cancer progression. Recently, promising therapeutics have been developed that inhibit DNA methyltransferases (DNMTs), reducing tumor growth in breast cancer and highlighting the importance of DNA methylation and other epigenetic factors in carcinogenesis.²,³ However, the interplay between epigenetics and genetics in cancer risk and progression remains mostly elusive.

Methylation quantitative trait loci, or meQTLs, are single nucleotide polymorphisms (SNPs) that significantly correlate with DNA methylation at CpG sites. These SNPs provide a bridge between genetic variation and corresponding epigenetic effects shown to correlate with cancer risk.⁴ Disruptions in DNA methylation are well known in the context of cancer; DNA is frequently hypermethylated at promoter regions of tumor suppressor genes while hypomethylated at the promoters of oncogenes, and there is an inverse correlation with gene expression.⁵ Promoter hyper- and hypomethylation have been of specific interest due to their role in regulating the expression of cancer genes, including suppression of tumor suppressor genes like BRCA⁶ and the expression of oncogenes like LINE1.⁷ Subsequently, germline SNPs that acted as meQTLs were shown to predict risk in many cancer types like breast and lung, regulating expression and methylation of genes like FBXO-18.⁴

The organization of the genome into 3D structures may further modify the potential of genetic variants to interact with epigenetic factors in a disease-specific manner.⁸ Topologically associating domains (TADs) are isolated regions of highly interacting, folded chromatin separated by insulator proteins. TADs are important for maintaining controlled patterns of local gene regulation and provide a framework for transcriptionally similar genes and SNPs to interact with one another.⁹ Because TADs have been found to be highly stable across tissue types, they provide valuable context for understanding the genome’s functional landscape, allowing the study of genetic variation in the context of 3D chromatin structure.¹⁰ The burden of somatic mutations in cancer has been shown to correlate with TADs.¹¹ In addition, genes within TADs demonstrate correlated gene expression and histone modification,¹²,¹³ allowing us to group similarly acting genes and SNPs and narrow the search for potentially cancer-related SNPs.

In this study, we integrate genetic correlates of DNA methylation across 23 cancer types (i.e. cancer meQTLs) with TADs to better understand how 3D chromatin structure might determine the potential of meQTLs to influence cancer risk and survival. We focus on meQTLs near TADs containing key cancer-related genes. Analyzing the location and distribution of such variants across the genome, we find that methylation-related germline variants in cancer are not distributed uniformly across the genome and that the occurrence of TAD boundaries correlates with significant cancer meQTL presence. In addition, meQTLs closely related to cancer progression show a specific nonrandom distribution in TADs. We then assess whether meQTLs near cancer genes can predict cancer survival and risk, and find significant predictive power for these meQTLs across multiple cancer types. Our study suggests that the potential of meQTLs to contribute to cancer risk and progression depends in part on local genome architecture and chromatin state.

Results

Active TADs are associated with less DNA methylation at cancer meQTLs

We identified 1100 TADs shared across 5 cell lines (GM12878, HMEC, HUVEC, IMR90, and NHEK) and categorized them into “Mixed”, “Inactive-1”, “Inactive-2”, “Active-1”, and “Active-2” groups using chromatin state information (Figure 1A). Combining the active and inactive groups resulted in 222 active, 626 inactive and 252 mixed TADs. DNA methylation is linked with TAD activity via nucleosome positioning and chromatin condensation¹⁴ as well as with regulation of gene expression, where promoter CpG methylation is associated with gene silencing.¹⁵ We compared our categorization of TAD activity with genome-wide DNA methylation in promoter regions defined by the ENCODE SCREEN pipeline. Promoters in active TADs showed overall lower levels of methylation, whereas those in inactive TADs had higher levels of methylation (Kruskal-Wallis, p-value<0.001) (Figure 1B). This supports the view that silencing by promoter methylation aligns with the categorization of TADs into transcriptionally “active” and “inactive” groups.

Figure 1. Evaluating DNA methylation and meQTL burden in topologically associating domains (TADs).

(A) Five-cluster K-means clustering of TADs shared between 5 human cell lines (GM12878, HMEC, HUVEC, IMR90, and NHEK). Shared TADs are on the y-axis (n=1100) and are grouped according to 15 chromatin states (x-axis). Purple indicates TADs classified as “Mixed”, gray as “Inactive-1”, light blue as “Active-1”, orange as “Active-2”, and red as “Inactive-2”. Combining active and inactive categories leads to 222 Active, 626 Inactive, and 252 Mixed TADs. (B) On average, inactive TADs have higher DNA methylation levels than active TADs (Kruskal-Wallis test, p-value<0.001). These results are supported by previous literature concerning promoter methylation and transcriptional activity. (C) Number of meQTLs across inactive TADs versus active TADs. meQTL counts per TAD were normalized by TAD length in base pairs. Active TADs show on average a larger normalized burden of meQTLs than inactive TADs (Student's t-test, p<0.05).

Cancer meQTLs are more abundant in inactive domains

Next, we measured the overall burden of independent cancer meQTLs (i.e. meQTLs deemed to represent distinct haplotypes based on the level of linkage disequilibrium; LD) across TAD categories, normalized by TAD length in base pairs. To obtain independent meQTLs, we clumped related meQTLs from Gong et al.¹⁶ based on LD using PLINK. Of the 1.2 million SNPs, 60,602 remained after LD clumping (Table 1). We observed a slightly increased number of cancer meQTLs in inactive domains relative to active regions (Student's t-test, p-value<0.05; Figure 1C).
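As an illustration of the length normalization described above, the following minimal sketch counts meQTL positions inside each TAD on a single chromosome and scales the count to a per-megabase burden. The function name `burden_per_mb`, the half-open interval convention and the toy coordinates are assumptions made for this example; they are not taken from the authors' code.

```python
from bisect import bisect_left

def burden_per_mb(tads, snp_positions):
    """Return the meQTL count per megabase for each (start, end) TAD.

    Assumes all TADs and SNPs lie on one chromosome and that a TAD
    covers the half-open interval [start, end).
    """
    positions = sorted(snp_positions)
    burdens = []
    for start, end in tads:
        # Binary search gives the number of positions with start <= pos < end.
        n = bisect_left(positions, end) - bisect_left(positions, start)
        burdens.append(n * 1_000_000 / (end - start))
    return burdens

# Hypothetical toy data: a 2 Mb TAD with 2 meQTLs and a 0.5 Mb TAD with 3.
tads = [(0, 2_000_000), (2_000_000, 2_500_000)]
snps = [100, 1_500_000, 2_100_000, 2_200_000, 2_499_999]
print(burden_per_mb(tads, snps))  # [1.0, 6.0]
```

Normalizing by length keeps long TADs from appearing enriched simply because they span more sequence.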

Table 1. General Information on meQTL number across TADs and multiple analyses.

Each row shows the number of meQTLs remaining after each filtering step, by TAD type. The rows are: all meQTLs without filtering, meQTLs retained after LD clumping with PLINK (p<1×10^-5), and clumped meQTLs associated with a CpG probe in a cancer driver gene promoter region. “Other” indicates meQTLs that are in an inter-TAD region but do not fall within a boundary region as defined.

meQTL filtration methods	Active TAD meQTLs	Inactive TAD meQTLs	Boundary meQTLs	Other	Total
All meQTLs	30,210	70,101	56,304	1,079,527	1,236,142
Clumped meQTLs	1,159	4,490	2,763	52,190	60,602
Cancer gene-related clumped meQTLs	21	8	20	107	156

We also evaluated cancer meQTLs at TAD boundaries, considering four categories of boundary based on the category of the flanking TADs: “Active-Boundary-Active”, “Inactive-Boundary-Inactive”, “Active-Boundary-Inactive”, and “Inactive-Boundary-Active”. To allow aggregation across variable-length regions, we divided each boundary region into 40 equal genomic bins and calculated the number of meQTLs in each. We then compared the observed density of meQTLs to that obtained by randomizing flanking TAD categories 100 times. Comparing the density of meQTLs in each boundary category to the randomized equivalent, the active-active (Student's t-test, p<0.01), active-inactive (p<0.01), and inactive-active boundaries (p<0.01) all showed a difference in distribution from random, while inactive-inactive (p=0.089) did not (Figure 2A-D). Distributions suggested an increase in density of clumped meQTLs when transitioning from active to inactive regions and, conversely, a decrease from inactive to active regions (Kruskal-Wallis ANOVA, p-value<0.05) when compared to the randomly shuffled distribution, but no shift in density for the Active-Boundary-Active and Inactive-Boundary-Inactive categories (Figure 2B-D).
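The binning and label-shuffling steps described above can be sketched as follows. This is a simplified illustration under stated assumptions: `bin_counts` and `shuffled_labels` are hypothetical helper names, regions are treated as half-open intervals, and the statistical comparison against the shuffled distribution is left out.

```python
import random

def bin_counts(start, end, positions, n_bins=40):
    """Count positions falling in each of n_bins equal-width bins of [start, end)."""
    counts = [0] * n_bins
    width = (end - start) / n_bins
    for pos in positions:
        if start <= pos < end:
            # Clamp guards against floating-point rounding at the right edge.
            counts[min(int((pos - start) / width), n_bins - 1)] += 1
    return counts

def shuffled_labels(labels, n_iter=100, seed=0):
    """Yield n_iter random permutations of the flanking-TAD category labels."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        perm = list(labels)
        rng.shuffle(perm)
        yield perm

# Hypothetical region of 400 bp, so each of the 40 bins spans 10 bp.
counts = bin_counts(0, 400, [0, 5, 10, 399])
print(counts[0], counts[1], counts[39])  # 2 1 1
```

Dividing every region into the same number of bins is what allows profiles from regions of different lengths to be averaged position by position.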

Figure 2. Normalized burden of meQTLs in adjacent TADs.

The binned average normalized meQTL burden distribution is shown across boundaries between consecutive TADs, grouped by transition category: active to active, active to inactive, inactive to active, and inactive to inactive. The start/end of active and inactive TADs are shown in red and blue, respectively. Distributions are smoothed by a rolling average for visualization purposes. The graphs show a non-uniform distribution of meQTL burden across consecutive TADs rather than an even spread. The dotted brown line represents the distribution for randomly shuffled TADs, which acts as a control. (A) Active-active (p=3.51×10^-10), (B) active-inactive (p=3.45×10^-46), and (C) inactive-active (p=1.65×10^-25) boundaries all showed a clear difference in distribution from random, while (D) inactive-inactive (p=0.089) did not.

Oncogene and tumor suppressor gene-related cancer meQTLs cluster differentially in TADs

Clumped cancer meQTLs were further narrowed to those whose corresponding affected CpG probes were within the promoter regions of cancer driver genes, including oncogenes and tumor suppressor genes (TSGs), from the COSMIC database.¹⁷ In total, 103 oncogenes and 223 TSGs were used for this analysis, of which only 67 contained meQTL-associated CpG probes in their promoter regions (i.e. 49 TSGs and 18 oncogenes). Of the 60,602 clumped meQTLs, 156 significantly affected CpG probes located in promoter regions of cancer driver genes (driver meQTLs; Table 1). Overall, we saw an overwhelming bias for driver meQTLs to occur in active regions, followed by boundary and then inactive regions (Figure 3A). To understand whether the observed distribution of driver meQTLs was expected, we selected equivalent numbers of meQTLs at random and evaluated their distribution across region types. We did this separately for meQTLs associated with oncogenes versus TSGs, as meQTLs might have different implications in the context of selection for gain versus loss of function. In the oncogene case, meQTLs were depleted relative to random in active TADs and enriched relative to random in inactive TADs, with no difference in boundary regions. Conversely, for TSGs, there was a significant enrichment of cancer-related meQTLs in active TADs and boundary regions, but a depletion in inactive TADs (Figure 3B-C). These opposing trends could suggest that genes with the potential to be oncogenes or tumor suppressors (i.e. growth promoting versus limiting) are under different constraints with respect to the propensity for methylation to accumulate in their promoter regions.

Figure 3. Expected versus observed occurrence of Driver meQTLs for oncogenes and TSGs by region type.

(A) The number of driver meQTLs per Mb is plotted by the category of TAD in which they are located. Counts were normalized by the total region size in each category. (B-C) A randomization analysis of the burden of non-cancer meQTLs, normalized by the number of base pairs in each region, was conducted to obtain the expected number of cancer meQTLs per Mb. To model random expectation, (B) 54 non-cancer meQTLs (i.e. the number of oncogene-proximal meQTLs) and (C) 102 non-cancer meQTLs (i.e. the number of TSG-proximal meQTLs) were sampled 1000 times for oncogenes and TSGs, respectively. Bar graphs are drawn with standard errors. The actual observed cancer meQTL burden is shown as a red dot.
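The randomization described in this caption amounts to repeatedly drawing the same number of background meQTLs and tallying the region types they fall in. The sketch below illustrates that idea only; `expected_counts` is a hypothetical helper, the background is represented as a plain list of region labels, and the per-Mb normalization and standard errors used in the figure are omitted.

```python
import random

def expected_counts(background_regions, n_draw, n_iter=1000, seed=0):
    """Mean number of sampled meQTLs per region type over n_iter random draws.

    background_regions holds one region-type label per background meQTL;
    n_draw is the number of meQTLs to sample without replacement per draw.
    """
    rng = random.Random(seed)
    totals = {}
    for _ in range(n_iter):
        for region in rng.sample(background_regions, n_draw):
            totals[region] = totals.get(region, 0) + 1
    return {region: total / n_iter for region, total in totals.items()}

# Hypothetical background: 10 meQTLs in active TADs, 30 in inactive TADs.
background = ["active"] * 10 + ["inactive"] * 30
expected = expected_counts(background, n_draw=8)
# An observed count well above or below expected[region] would suggest
# enrichment or depletion in that region type.
print(sorted(expected))  # ['active', 'inactive']
```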

Assessment of driver meQTL association with cancer risk and overall survival across tumor types

We next evaluated the potential for driver meQTLs to have clinical relevance. A principal component analysis (PCA) was first conducted on the 156 driver meQTLs across individuals in the TCGA (Extended Data Figure 1).⁶¹ The principal components (PCs) that explained more than 1% of the variance were assessed for association with clinical covariates by linear regression. We noted some association of PCs with tumor type, age at diagnosis and tumor stage at diagnosis, suggesting that cancer meQTLs could have tumor-type specific implications for risk and prognosis. Interestingly, further examining the 10 meQTLs with the strongest loadings in PCs correlated with tumor type, we found that the meQTLs disproportionately affected oncogenes, suggesting that tumor types differ more in oncogene effects than in tumor suppressor effects of DNA methylation.

We then evaluated the driver meQTLs for cancer risk associations using the UKBioBank. In the initial PheWAS analysis of UKBioBank participants, 86 of the 155 driver meQTLs tested (1 SNP was not in the UKBioBank registry) showed a nominal association with one or more cancer ICD10 codes (p-value<0.05), with 5 SNPs passing a Benjamini-Hochberg FDR threshold of 0.05 (Table 2). In total, meQTLs were associated with risk of 15 different cancer types as described by ICD10 codes (Table 3). We focused on C50-C50 (malignant neoplasm of the breast) as this tumor type had a large sample size in UKBioBank (n=11,188) and other large cohorts exist to support validation studies.
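For readers unfamiliar with the Benjamini-Hochberg procedure used for the FDR threshold above, a minimal standard-library sketch is given below. It computes adjusted p-values from a list of raw p-values; the function name and the toy p-values are illustrative and do not reproduce the study's data.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values; a test is called significant if its adjusted value is < 0.05.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([q < 0.05 for q in adjusted])  # [True, True, True, True]
```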

Table 2. List of meQTLs significantly affecting risk and survival in a pan-cancer model (Benjamini-Hochberg FDR<0.05).

The beta value is the correlation coefficient of the meQTLs with DNA methylation at the promoter region of the probe gene. The TAD type in which each meQTL resides is also shown.

rsid	SNP	p-value	TAD type	Probe gene	Risk/Survival
rs6500442	16:89828862:T:C	0	Active	FANCA	Survival
rs36083956	16:74679883:C:T	0	Boundary-Active	RFWD3	Survival
rs1163248	10:104896563:A:G	0	Neither	NT5C2	Survival
rs1006548	16:89844043:T:C	0	Active	FANCA	Survival
rs8047581	16:89884502:C:T	0	Active	FANCA	Survival
rs17581498	17:73794047:G:T	0	Inter-TAD	H3F3B	Survival
rs62051918	16:74613781:T:C	0	Active	RFWD3	Survival
rs3935784	16:74604841:G:A	0	Active	RFWD3	Survival
rs8046036	16:74552127:C:T	0.00000000252	Active	RFWD3	Survival
rs36030784	2:178119204:A:C	0	Inter-TAD	NFE2L2	Survival
rs1407920	9:10389328:C:G	0.00000628	Inter-TAD	PTPRD	Survival
rs1725213	7:5584599:A:G	0.00000952	Active	RAC1	Survival
rs11859725	16:74384296:C:T	0	Inter-TAD	RFWD3	Survival
rs4265826	16:74723707:A:G	0.0000000519	Neither	RFWD3	Survival
rs12441344	15:67447895:A:G	0.000000629	Inter-TAD	SMAD3	Survival
rs200282	16:74222799:C:G	0	Inter-TAD	RFWD3	Survival
rs6679323	1:15914135:A:G	0	Active	CASP9	Survival
rs3743861	16:89818340:G:C	0.0000000349	Active	FANCA	Survival
rs10999617	10:72723176:G:A	0.00000886	Inactive	PRF1	Risk
rs12597188	16:68814826:G:A	0	Inter-TAD	CDH1	Risk
rs7554885	1:18247811:G:T	0.000000215	Inter-TAD	SDHB	Risk
rs10845664	12:13043119:C:T	0.000000288	Inter-TAD	CDKN1B	Risk
rs741482	3:185903412:C:G	0.0000052	Inter-TAD	MAP3K13	Risk

Table 3. ICD 10 codes.

The ICD 10 codes used by UKBioBank for the risk analysis are shown alongside their definitions.

ICD 10 Codes	Definitions
C00-C14	Malignant neoplasms of lip, oral cavity and pharynx
C15-C26	Malignant neoplasms of digestive organs
C30-C39	Malignant neoplasms of respiratory and intrathoracic organs
C40-C41	Malignant neoplasms of bone and articular cartilage
C43-C44	Melanoma and other malignant neoplasms of skin
C45-C49	Malignant neoplasms of mesothelial and soft tissue
C50-C50	Malignant neoplasms of breast
C51-C58	Malignant neoplasms of female genital organs
C60-C63	Malignant neoplasms of male genital organs
C64-C68	Malignant neoplasms of urinary tract
C69-C72	Malignant neoplasms of eye, brain and other parts of central nervous system
C73-C75	Malignant neoplasms of thyroid and other endocrine glands
C76-C80	Malignant neoplasms of ill-defined, other secondary and unspecified sites
C7A-C7A	Malignant neuroendocrine tumors
C7B-C7B	Secondary neuroendocrine tumors

To further assess the relevance of driver meQTLs to cancer risk, we used them to predict breast cancer status alongside clinical covariates using the approach described by Elgart et al.¹⁸ We first performed feature selection by LASSO on nominally significant driver meQTLs and available clinical factors (age, and ancestry as represented by the top 10 genotype-derived PCs); LASSO regularization removed ancestry and some meQTLs. Selected features were then used to train an XGBoost classifier on 189,022 examples derived from UKBioBank breast cancer cases and non-cancer controls (Methods). The score produced by the trained XGBoost model was used as the polygenic risk score (PRS). We applied the trained model to predict breast cancer status for individuals in the DRIVE dataset, comprising 26,374 breast cancer cases and 32,428 controls (ROC AUC: 0.5534, 95% CI [0.5505, 0.5563]). As expected, PRS values were significantly higher in breast cancer cases than in controls (Mann-Whitney U, p-value<0.001) (Figure 4A). In both the UKBioBank and DRIVE datasets, the incidence of breast cancer was significantly higher among individuals in the top 20% of PRS values than among those in the bottom 20% (Fisher’s exact test, UKBioBank: p=4.25×10^-7; DRIVE: p=1.47×10^-13), suggesting that a higher burden of meQTLs impacts breast cancer risk (Figure 4B-C).
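The ROC AUC reported above can be read as the probability that a randomly chosen case receives a higher score than a randomly chosen control, which is also what the Mann-Whitney U statistic measures. The sketch below computes it by direct pairwise comparison; it is meant only to make the definition concrete, uses hypothetical toy scores, and is quadratic in sample size, so it is not how one would evaluate a cohort the size of DRIVE.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of case/control pairs where the case scores higher.

    labels use 1 for cases and 0 for controls; tied pairs count as 0.5.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

# Hypothetical scores for two controls (0) and two cases (1).
print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

An AUC of 0.5 corresponds to no discrimination, which helps put the modest but significant value of 0.5534 in context.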

Figure 4. XGBoost validation of breast cancer risk in DRIVE dataset.

(A) An XGBoost classifier trained to predict incidence of breast cancer in the UKBioBank was applied to predict cancer risk in the DRIVE cohort. PRS scores provided by the model were higher for individuals diagnosed with breast cancer (Mann-Whitney U p=2.4×10^-19). (B-C) Plots showing the odds ratio of a breast cancer diagnosis across 10% quantiles of the XGBoost-predicted PRS in the UKBioBank and DRIVE cohorts, respectively. Risk increased from an odds ratio of ~0.8 to ~1.1 between the 0th and 90th PRS percentiles, supporting that cancer meQTLs impact breast cancer risk. C50-C50: ICD10 code for malignant neoplasms of the breast.

We extracted feature importances from the UKBioBank-trained PRS to better understand the driver meQTLs underlying breast cancer risk (Figure 5A). Overall, cancer meQTLs near 29 cancer genes were included in the model. The most predictive driver meQTL was associated with MSH2, a gene associated with Lynch syndrome and increased risk of breast cancer.¹⁹ Polymorphic variation affecting the expression of EZH2, the second most informative feature, has also been linked to breast cancer risk.²⁰ ASXL2 may be required for estrogen receptor alpha (ERα) activation in ERα-positive breast cancers.²¹ Notably, EZH2 overexpression has been linked more strongly to triple negative breast cancer,²² suggesting that the model includes features predictive of multiple subtypes. More direct mechanistic insight might be gained by studying expression, genotype and methylation in healthy and pre-cancerous breast tissues and cell types. Studying the average expression of MSH2, EZH2, and ASXL2 within TCGA patients stratified by meQTL PRS suggested a potential decrease in expression of ASXL2 and EZH2 in the highest PRS quantile relative to the lowest, while MSH2 did not show much difference (Figure 5B). However, this difference needs to be studied further with more specific tumor subtype stratification and cell type-specific expression. Indeed, classic polygenic risk scores for breast cancer have shown bias for predicting certain subtypes.²³ Lakeman et al.²⁴ demonstrated that women in the highest 1% of risk showed a 4.37-fold increased risk for ER-positive disease but only a 2.78-fold increased risk for ER-negative disease compared to the middle quintile, indicating bias toward certain subtypes.

Figure 5. Feature importances for breast cancer risk classifier.

(A) Features are ranked according to their contribution to classifier predictive performance. Total importances sum to 1. (B) Average expression of ASXL2, EZH2 and MSH2 in TCGA breast cancer samples, stratified by PRS quantile.

Finally, we evaluated the implications of driver meQTLs for prognosis. We first removed one meQTL, 2:209220238:C:G, that had a minor allele frequency <1% across TCGA samples, then conducted a Kaplan-Meier analysis for the remaining meQTLs separately for each tumor type with at least 100 samples. Of the 155 SNPs, 21 passed a Benjamini-Hochberg FDR threshold of 0.05 (Table 2). To assess the overall contribution of driver meQTLs to survival, we built polygenic survival scores (PSS) using XGBoost and incorporated them into Cox proportional hazards (PH) models alongside relevant covariates. Here we only evaluated tumor types that had at least 5 SNPs implicated as nominally significant by Kaplan-Meier analysis (n=23 tumor types). Nominally significant driver meQTLs for each tumor type were subjected to selection by LASSO and used to train XGBoost models to predict a binary survival outcome (binarized based on median time to an event) separately for each tumor type. Of the 23 tumor types, 13 had a higher XGBoost classification AUC when both SNPs and clinical covariates were combined than when only clinical covariates were used. These were BLCA, BRCA, PAAD, PRAD, UCEC, OV, STAD, SKCM, PCPG, LUSC, KIRC, HNSC and ESCA. This suggests that for these tumor types, meQTLs contributed survival-relevant information beyond the covariates (i.e. age, sex, and tumor stage in some cases). For these tumor types, we trained XGBoost models using only meQTLs to obtain tumor type-specific PSS that were then included alongside covariates (tumor stage, age at diagnosis and sex) in Cox PH models to predict overall survival time in months (Methods).
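As background for the Kaplan-Meier analysis mentioned above, the following sketch implements the standard product-limit estimator using only the Python standard library. It is a generic illustration with hypothetical follow-up times, not the authors' pipeline, and it omits the log-rank comparison between genotype groups that a per-SNP survival test would require.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    events use 1 for an observed death and 0 for a censored observation.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        # Group all subjects that share this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored subjects leave the risk set.
        at_risk -= removed
    return curve

# Hypothetical follow-up in months; the second subject is censored.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))  # [(1, 0.75), (3, 0.375), (4, 0.0)]
```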

PSS values made a significant contribution to predicting overall survival time for all cancer types except BRCA and SKCM (Figure 6). PSS had the highest hazard ratios compared to other covariates for most cancer types, including ESCA, BLCA, KIRC, LUSC, OV, PAAD, PCPG, PRAD, STAD and UCEC. PSS was also predictive of disease-free interval in KIRC, PCPG, LUSC, HNSC and UCEC (Extended Data Figure 2).⁶² Most covariates behaved as expected in the analysis, with tumor stage having one of the highest hazard ratios. However, it is difficult to assess the generalizability of the estimated effect sizes in the absence of independent validation cohorts with both genotype and survival measured in the same cancer types. Nonetheless, to further investigate the prognostic implications of driver meQTLs, we analyzed their feature importances in their respective XGBoost models (Figure 7). The number of meQTLs contributing to tumor type-specific PSS ranged from 2 to 12, often with 1 or 2 meQTLs dominating the model.

Figure 6. CoxPH Hazard Ratios and 95% confidence interval of PSS and covariates in TCGA overall survival.

The hazard ratios and 95% confidence intervals associated with various covariates are shown across 13 cancer types: BLCA, BRCA, PAAD, PRAD, UCEC, OV, STAD, SKCM, PCPG, LUSC, KIRC, HNSC, ESCA. Due to limitations in availability of data some tumor types lacked covariates like tumor stage. Sex was excluded for tumors that only occur in males or females. ER: Estrogen receptor, PR: Progesterone Receptor.

Figure 7. Feature importance of SNPs in XGBoost polygenic survival scores.

A heatmap of the feature importances of SNPs for the cancer type specific XGBoost survival classifiers is shown. For each model across the 13 tumor types, the feature importances sum to 1 with red demonstrating larger importance of a SNP and blue demonstrating lesser importance.

Focusing on the most informative tumor type-associated meQTLs, we investigated the relevance of the associated oncogenes to cancer progression. In many cases, the identified genes were supported by previous studies. For example, PTPRD loss in melanoma was shown to cause disruption of desmosomes, resulting in increased invasive potential.²⁵ Polymorphisms in the DNA repair helicase ERCC2 have also been found to modify melanoma prognosis²⁶ and have been linked to prostate cancer progression as well.²⁷ In pancreatic cancer, RFWD3 expression quantitative trait loci (eQTLs) are associated with survival.²⁸ RFWD3 is an E3 ubiquitin-protein ligase important for the DNA damage response and has been shown to stabilize TP53 in response to DNA damage.²⁹ We note that RFWD3 meQTLs were among the informative features for many other tumor types as well (Figure 7). RAC1 has previously been shown to determine the metastatic potential of renal cell carcinoma (KIRC).³⁰ Reduced expression of CDKN1B is a known risk factor for PCPG and is common in this disease but usually cannot be explained by somatic alterations, though cases of allelic imbalance have been noted.³¹ CASP9 promoter polymorphisms confer increased risk of breast cancer,³² and higher expression of CASP9 was associated with better survival.³³ Downregulation of ERCC5 is associated with longer progression-free survival in ovarian cancer treated with platinum therapy,³⁴ as is the case for OV in TCGA. In head and neck cancer, the most informative driver meQTL was associated with ETNK1, a cancer gene more commonly associated with myeloid neoplasms,³⁵ though there is increasing evidence that it may contribute to dysregulation of phospholipid metabolism in multiple tumor types.³⁶

Discussion

There is an increasing appreciation that both genome structure³⁷–⁴¹ and common genetic variants¹⁶,⁴²–⁴⁸ modify the potential for carcinogenesis. However, the interplay between these factors is not well understood. To start to understand this, we investigated the relationship between the cancer meQTLs recently reported by Gong et al. and 3D genome structure in the form of TADs. To determine the relevance to cancer, we further investigated cancer meQTLs near driver genes for their potential to modify cancer risk and progression. We took advantage of a recently introduced modeling strategy that first performs feature selection on a set of nominally associated SNPs, then trains a non-linear XGBoost model based on those features.¹⁸ Feature importances can be extracted from the trained model to gain insight into which features were most influential, suggesting biological hypotheses that can be further investigated.

We observed higher levels of promoter methylation in inactive versus active TADs, slightly more meQTLs in active TADs and higher densities of meQTLs in boundary regions proximal to inactive versus active TADs. Furthermore, analyzing meQTL distribution across TAD boundaries revealed a non-uniform pattern, suggesting that TAD boundaries affected distributional burden of meQTLs. It is of note that TAD boundaries conserved across cell types are reportedly highly enriched for evolutionary constraint and complex trait heritability.¹⁰ Our data suggest that variability in gene expression due to meQTLs is also evolutionarily more constrained in and around active TADs and their boundaries, consistent with these TAD boundaries playing a critical role in development.⁴⁹ These results may suggest that TAD boundaries play a role in making the recruitment of regulatory machinery more specific, particularly as it pertains to DNA methylation.

Interestingly, we found that meQTLs associated with driver genes showed patterns of enrichment or depletion in a manner dependent on the activity state of the TAD in which the meQTLs occurred. Cancer meQTLs are polymorphic sites that associate with differences in the level of DNA methylation found in tumors. In active TADs, germline meQTLs affecting oncogenes were depleted, whereas such meQTLs affecting tumor suppressor genes were enriched. This could suggest that the potential to modulate tumor suppressor gene expression through methylation is evolutionarily advantageous, whereas modulating oncogene expression by promoter methylation may be less so. These trends point to evolutionary constraints on the distribution of meQTLs, imposed by 3D genome architecture, that could set the stage for genomic vulnerabilities to later malignancy.

Focusing on meQTLs near known driver genes, we evaluated the potential of meQTLs to modify cancer risk or progression. We found a number of meQTLs associated with cancer risk in the UKBioBank and were able to validate a polygenic score constructed from these meQTLs in the independent DRIVE cohort. The inclusion of genes linked to distinct breast cancer subtypes among the features that most contributed to classifier performance suggests that cancer meQTLs may differentially affect risk of developing different forms of breast cancer and raises the possibility that subtype-specific meQTL-based risk classifiers may outperform a generic model. The meQTLs most strongly predictive of prognosis tended to occur near cancer genes that were also associated with risk or prognosis in the same tumor type. However, we saw cases such as ETNK1 in head and neck cancer, where meQTLs implicated a gene that has not been considered a factor promoting progression. This could point to a new therapeutic opportunity in this disease. Further studies are merited to determine whether the observed associations result from meQTLs being in linkage with eQTLs or coding variants that contribute to risk or progression, or whether meQTLs themselves make it easier or more difficult for genes to be modulated through DNA methylation. Interestingly, we noted that multiple independent meQTLs for the same cancer gene were informative in predictive models. This suggests that, at least in some cases, the cumulative burden of meQTLs near driver genes could further alter gene function to exacerbate risk or progression. While we focused on cancer genes, other studies have more broadly implicated meQTLs in cancer survival, supporting expanded analyses in the future.

This study has several limitations. First, the meQTLs utilized for this study are derived from a study of tumors,¹⁶ which could be biased toward detecting meQTLs associated with DNA methylation events that are positively selected in tumors. For risk prediction, we focused on meQTLs and their corresponding CpG probes that overlap the promoter regions of known cancer genes. However, we cannot be sure that these meQTLs are not also affecting other genes in the region, for example through effects on enhancer activity. Second, once we focus on specific tumor types, the number of samples available to predict prognosis is relatively small, and some samples were missing tumor stage or age at diagnosis data, which are key clinical features for survival prediction. In addition, we lacked independent cohorts to validate the generalizability of polygenic survival scores based on meQTLs, so some of our results may reflect overfitting, as suggested by the large hazard ratios observed in CoxPH analysis. This validation should be a priority as suitable data sets become available. We also made a few assumptions. We only considered TADs common across multiple human cell lines, which could have removed some important cell-type-specific TADs, though our methodology follows what other studies¹¹,⁴⁹ have done. For predicting prognosis, we assumed that TADs from healthy human cell lines would also apply to cancer patients and thus did not consider events where TAD structure could change. This decision is supported by previous studies showing that TADs are overwhelmingly similar across cancer and non-cancer patients.⁴⁹ In future studies, it would be of interest to study meQTL trends in normal tissue samples to see if enrichment patterns associated with cancer genes are driven by selection in tumors, or highlight evolutionary constraints more broadly associated with human health that coincidentally are advantageous for tumor development.

This study investigated the relationship between epigenetic factors like chromatin structure and DNA methylation and genetic variation in the context of cancer, and established the potential for cancer gene associated meQTLs to uncover cancer-specific modifiers of risk and progression. There are also a number of non-genetic risk factors that act by modifying DNA methylation levels and which could interact with genetic regulation. These include aging, exercise, stress, diet and obesity, and a broad variety of environmental exposures. In our analysis, age had the highest impact on DNA methylation modulation. However, as age and sex were the only clinical factors for the majority of our study, future analysis of other non-genetic factors in relation to genetic regulators of DNA methylation is merited. Future efforts could integrate dynamic methylation changes due to these non-genetic factors with static polygenic scores such as we describe here to provide a more accurate estimate of risk. This type of approach could benefit in particular from non-invasive biomarkers, such as cell-free DNA methylation from blood, though studies will be needed to establish the cumulative effect of dynamic exposures and the extent to which they can be accurately evaluated from cell-free DNA.⁵⁰

Methods

TCGA and promoter data

TCGA meQTL data were obtained from Gong et al.¹⁶ TCGA outcome and survival data alongside RNA-seq expression data were obtained from the TCGA pan-cancer atlas, Liu et al.⁵¹ Illumina 450k DNA methylation data were also obtained from the TCGA pan-cancer atlas.⁵¹ The promoter data were obtained from the ENCODE SCREEN pipeline.⁵²,⁵³

UKBioBank data

Genotypes and ICD-10 codes were obtained for 394,034 samples across 40 ICD-10 codes from the UK BioBank.⁵⁴ For the C50-C50 analysis, only exclusive cases and controls were considered: patients who were diagnosed only with a breast neoplasm were compared with controls who were not diagnosed with any neoplasm. This reduced the sample size to 189,022 for the breast cancer risk analysis.

DRIVE breast cancer data

Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) (dbGaP Study Accession: phs001265.v1.p1)⁵⁵ was used to validate the risk outcome analysis of our XGBoost model. There were 60,231 breast cancer cases and controls with genotype data alongside outcome, age, and ancestry principal components.

TAD identification and clustering based on chromHMM and DNA methylation

Topologically associating domain (TAD) regions from the GM12878, HMEC, HUVEC, IMR90, and NHEK cell lines were downloaded from Rao et al.¹² Only TADs common to all 5 cell lines, identified using a previously described 20% overlap algorithm, were considered for the rest of the analysis. TADs were grouped into 5 clusters (“Active-1”, “Active-2”, “Inactive-1”, “Inactive-2”, and “Mixed”) through K-means clustering using a 15-chromatin-state model derived from the Roadmap Epigenomics Project.⁵⁶ For most of the analysis, the two active and two inactive groups were combined for simpler visualization, and mixed regions were ignored due to their biological ambiguity. The boundary of each TAD was defined as the 50 kb region upstream and downstream of each TAD endpoint (i.e. 100 kb long boundaries), with the exception of consecutive TADs separated by a region smaller than 100 kb. In those cases, the boundary was defined as the proximal half of the intervening region for each of the two TADs. This TAD boundary definition, a 100 kb window spanning ±50 kb from the start and end of a TAD, is supported by previous literature.¹⁰
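The boundary rule described above can be illustrated with a minimal sketch. It is not the authors' code: the function name `tad_boundaries` and the input format (a sorted, non-overlapping list of `(start, end)` coordinates for one chromosome) are assumptions made for illustration. The text does not say whether the 50 kb inside the TAD is retained when the gap between two TADs is short; this sketch assumes only the proximal half of the gap is used in that case.

```python
from typing import List, Tuple

Interval = Tuple[int, int]
FLANK = 50_000  # 50 kb either side of a TAD endpoint, as stated in the Methods


def tad_boundaries(tads: List[Interval], flank: int = FLANK) -> List[Tuple[Interval, Interval]]:
    """Return (left_boundary, right_boundary) for each TAD on one chromosome.

    `tads` is assumed to be sorted by start coordinate and non-overlapping.
    """
    result = []
    for i, (start, end) in enumerate(tads):
        # Default case: a 100 kb window centred on each TAD endpoint.
        left = (max(0, start - flank), start + flank)
        right = (end - flank, end + flank)

        if i > 0:
            gap = start - tads[i - 1][1]
            if gap < 2 * flank:
                # Short gap: this TAD takes the half of the gap nearest to it.
                left = (start - gap // 2, start)
        if i < len(tads) - 1:
            gap = tads[i + 1][0] - end
            if gap < 2 * flank:
                # The two halves meet at the same coordinate, so they never overlap.
                right = (end, end + gap - gap // 2)
        result.append((left, right))
    return result
```

For two TADs separated by 40 kb, each TAD receives the adjacent 20 kb of the gap as its boundary, while endpoints with wider gaps keep the full 100 kb window.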

DNA methylation levels were compared across TADs as follows. DNA methylation levels were summarized at promoters identified by ENCODE’s SCREEN pipeline for human hg38. We compared the methylation beta values (i.e. the proportion of methylated region) using TCGA’s DNA methylation data, and averaged these beta values for all promoter regions across Active-1, Active-2, Inactive-1, Inactive-2, and Mixed regions. The hypothesis that methylation levels in promoter regions of actively transcribed TADs would be lower than in inactive TADs was tested by a Kruskal-Wallis test.

meQTL distribution within TADs

We retrieved 1,236,142 unique cis-meQTLs across 23 cancer types from the Pancan-meQTL database.¹⁶ meQTLs were further clumped by linkage disequilibrium (LD) to obtain independent associations using the PLINK⁵⁷ clumping function, with association p-values derived from the Pancan-meQTL database as input and default parameters (p1=0.0001, p2=0.01, r²=0.5, kb=250). These clumped, independent meQTLs were used for all subsequent analyses. First, the burden of clumped meQTLs across Active, Inactive, and Mixed TAD regions was measured. The burden was normalized by the length in base pairs of each region. To understand how meQTLs are distributed across the genome and whether TADs have an effect on the distribution of meQTLs, we analyzed the distributional burden of meQTLs within consecutive TADs. We compared the average meQTL density across different TAD transitions (i.e. Active-Boundary-Active, Active-Boundary-Inactive, Inactive-Boundary-Active and Inactive-Boundary-Inactive) by binning the genome between two TADs into 40 equal-sized bins and calculating the average burden of meQTLs within these bins, normalized by the bin size in base pairs. Resulting graphs were smoothed by a rolling average for visualization purposes. To evaluate whether the distribution reflected an association with transitions in TAD activity status, we shuffled the labels (i.e. “Active”, “Inactive”, etc.) of the TADs 100 times while preserving the number of transition categories (i.e. “Active-Active”, “Inactive-Active”, etc.), and ran the distribution analysis again on these randomly shuffled TADs, taking an average over all trials. Significance was assessed by comparing the observed difference in density between the TADs to the average of the 100 randomized trials using a Student’s t-test.
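As a rough illustration of the binning and smoothing steps, the following sketch computes meQTL density in 40 equal-width bins over a genomic span and applies a centred rolling average. The function names, the half-open interval convention, and the smoothing window of 3 bins are assumptions; the text does not report the window size used.

```python
from typing import List, Sequence


def binned_density(positions: Sequence[int], start: int, end: int,
                   n_bins: int = 40) -> List[float]:
    """meQTL count per base pair in each of `n_bins` equal-width bins over [start, end)."""
    width = (end - start) / n_bins
    counts = [0] * n_bins
    for pos in positions:
        if start <= pos < end:
            # Guard against floating-point rounding at the upper edge.
            idx = min(int((pos - start) / width), n_bins - 1)
            counts[idx] += 1
    # Normalize by bin size in base pairs.
    return [c / width for c in counts]


def rolling_mean(values: Sequence[float], window: int = 3) -> List[float]:
    """Centred rolling average; bins at the edges average over the neighbours available."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Averaging the per-bin densities over all spans that share a transition category (for example Active-Boundary-Inactive) would then give one curve per category.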

Randomized distribution of cancer-gene-clumped meQTLs

Clumped meQTLs were annotated according to LD with CpG probes located in the promoter regions of cancer driver genes, including oncogenes and tumor suppressor genes (TSGs) from the COSMIC database.¹⁷ A total of 231 oncogenes and TSGs were used for this analysis, and the promoter regions used were those identified by ENCODE’s SCREEN pipeline.⁵⁸ To evaluate whether active/inactive TADs or boundary regions harboring cancer genes showed enrichment or depletion for meQTLs, we conducted a randomization analysis with 1000 trials. In each trial, we chose a random sample of meQTLs associated with non-cancer genes with minor allele frequency matching (±5%) that of the set of cancer-gene-associated meQTLs, while also matching the number of randomly sampled meQTLs. We then mapped genes with meQTLs to active or inactive TADs and TAD boundaries, summed the meQTLs in each and normalized by the size of the region. The standard error was plotted alongside the true burden to assess whether the burden across TADs was significantly different from random.
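One trial of the MAF-matched sampling could look like the sketch below. The function name and the dictionary input are hypothetical, and two details are assumptions because the text does not specify them: the ±5% tolerance is treated as an absolute difference in allele frequency (0.05), and background meQTLs are drawn without replacement.

```python
import random
from typing import Dict, List, Optional, Sequence


def maf_matched_sample(target_mafs: Sequence[float],
                       background: Dict[str, float],
                       tol: float = 0.05,
                       rng: Optional[random.Random] = None) -> List[str]:
    """For each cancer-gene meQTL MAF, draw one unused background meQTL within ±tol."""
    rng = rng or random.Random()
    available = dict(background)  # copy so the caller's dict is not modified
    picked = []
    for maf in target_mafs:
        # Sorting makes the draw reproducible for a seeded generator.
        candidates = sorted(s for s, m in available.items() if abs(m - maf) <= tol)
        if not candidates:
            continue  # no match within tolerance; this sketch simply skips the target
        choice = rng.choice(candidates)
        picked.append(choice)
        del available[choice]
    return picked
```

Repeating the draw 1000 times and computing the normalized burden per TAD category on each draw would yield the null distribution against which the observed burden is compared.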

Correlation of meQTL profiles with clinical characteristics in TCGA

We conducted a principal component analysis of TCGA genotype at the 156 meQTLs in European ancestry samples (n=8217), evaluating association of meQTL genotype-based PCs with clinical covariates. meQTL SNPs were quantified by the number of minor alleles carried (0, 1, 2). PCs explaining more than 1% of the genotypic variance across individuals were regressed with clinical variables including sex, age at diagnosis, tumor stage, and tumor type.

Machine-learning for meQTL-based risk and survival prediction

For both risk and survival analysis, we combined LASSO regularization as a feature selector with an XGBoost classifier as the machine learning predictor, as described fully in Elgart et al.¹⁸ Specifically, after a preliminary association analysis, SNPs achieving a nominal p-value<0.05 were further selected by LASSO, and the selected SNPs were used to train an XGBoost model on a predictive task (e.g. cancer versus no cancer for risk, or high versus low survival split at the median overall survival time), using a set of training samples. The probabilities output by the XGBoost classifier were then used to create a polygenic risk score (PRS) or polygenic survival score (PSS). Predictive performance was evaluated using cross validation for survival analysis and using an independent cohort of matched tumor types for the risk analysis.
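The two-stage idea (LASSO to select SNPs, then a boosted-tree classifier whose predicted probability serves as the score) can be sketched as follows. This is an illustration under stated assumptions, not the pipeline of Elgart et al. or of this study: it uses scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost so that the example depends on a single library, it uses small tree settings for speed rather than the hyperparameters reported in this study, and the function name `lasso_then_boost` is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def lasso_then_boost(X: np.ndarray, y: np.ndarray, alpha: float = 0.001, seed: int = 0):
    """Select features with LASSO, then fit a boosted-tree classifier on them.

    Returns the indices of retained features, the fitted model, and the
    predicted probability of class 1 for each training sample.
    """
    # Stage 1: LASSO on z-scored features; drop features with a zero coefficient.
    Xz = StandardScaler().fit_transform(X)
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(Xz, y)
    keep = np.flatnonzero(lasso.coef_ != 0)

    # Stage 2: boosted trees on the retained features (stand-in for XGBoost).
    model = GradientBoostingClassifier(
        n_estimators=50, learning_rate=0.1, max_depth=3, random_state=seed
    )
    model.fit(X[:, keep], y)

    # The class-1 probability plays the role of the polygenic score.
    scores = model.predict_proba(X[:, keep])[:, 1]
    return keep, model, scores
```

Scores computed on the training samples, as here, are optimistic; held-out folds or an independent cohort are needed to estimate predictive performance.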

UKBioBank risk

To determine the association of meQTLs with risk of developing cancer, we conducted a phenome-wide association study (PheWAS) for each meQTL using the PLATO⁵⁹ software. The genotype and phenotype data of 487,409 patients harboring the 156 cancer-related clumped meQTLs were retrieved from the UKBioBank,⁵⁴ and genotype at each meQTL was evaluated for association with all cancer phenotypes while controlling for covariates including age and ancestry. Individuals with multiple cancer diagnoses were excluded from the analysis, leaving 189,022 patients for risk analysis.

UKBioBank PRS construction and breast cancer DRIVE validation

Nominally significant SNPs (p-value<0.05) were used for polygenic risk modeling with LASSO plus XGBoost. Of the resulting tumor types where meQTLs were associated with risk, we pursued breast cancer (ICD-10: C50-C50) due to the abundance of validation data. Of the 189,022 UKBioBank individuals analyzed, 177,834 were non-cancer controls and 11,188 were breast cancer cases. An initial 10% quantile plot from the PheWAS analysis in UKBioBank was created using the PRS and the odds ratio for C50-C50, to compare the odds ratio of the 0th quantile PRS group with that of the 90th quantile PRS group.

To create a polygenic risk score (PRS), we utilized the approach described above in the “Machine-learning for meQTL-based risk and survival prediction” section. Out of the tumor types that had nominally significant (p<0.05) risk-related SNPs (i.e. C64-C68, C40-C41, C69-C72, C00-C14, C15-C26, C81-C96, C50-C50, C43-C44, C45-C49, C76-C80, C60-C63, C51-C58, C97-C97, C73-C75, C30-C39), we chose the breast cancer outcome (C50-C50) for validation in an external cohort, DRIVE, due to an abundance of validation data. Similar to the survival analysis, we considered SNPs nominally associated with cancer risk using the associations from the PheWAS (p<0.05) for the rest of the analysis. We included other covariates including age and the first 10 principal components to represent population substructure in UKBioBank. Due to the class imbalance of the UKBioBank cohort (10,840 cases, 94,871 controls), we oversampled the cases to obtain a 1:1 case-control ratio, resulting in a dataset size of 189,742 rows. Furthermore, we only included samples without any neoplasm diagnosis as controls to minimize confounding by other tumor types.
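The oversampling step can be illustrated with a short sketch. The text does not state which oversampling method was used; random resampling of cases with replacement is assumed here, and the function name is hypothetical. The reported dataset size is consistent with a 1:1 ratio, since 2 × 94,871 controls = 189,742 rows.

```python
import random
from typing import List, Optional, Sequence


def oversample_cases(labels: Sequence[int],
                     rng: Optional[random.Random] = None) -> List[int]:
    """Return row indices in which cases (label 1) are resampled to match controls (label 0).

    Assumes cases are the minority class; extra case rows are drawn with replacement.
    """
    rng = rng or random.Random()
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]
    n_extra = len(controls) - len(cases)
    extra = [rng.choice(cases) for _ in range(n_extra)]
    return controls + cases + extra
```

Because duplicated case rows can land in both training and evaluation folds, oversampling is usually applied only to the training portion of the data.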

We trained our XGBoost classification model on the entirety of the UKBioBank dataset. First, the UKBioBank cohort (i.e. training cohort) was input into a LASSO regression model with α=0.001 (based on Ref. 18) to predict the intended phenotype. SNPs were further filtered to remove those that had a LASSO coefficient of 0. The filtered feature set was then used to train an XGBoost model on the entire UKBioBank cohort (n_estimators=500, learning_rate=0.1, max_depth=9). The probability of trees voting for either class (i.e. 0: no cancer, 1: cancer) was used as a polygenic risk score. We validated the breast cancer risk association of meQTLs alongside covariates using the Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE⁵⁵) validation cohort. This validation cohort consists of 32,428 controls and 26,374 breast cancer cases for a total of 58,802 patients. Before validation, we compared the MAF of each SNP between UKBioBank and DRIVE, and removed SNPs whose MAF values differed by more than 2 standard deviations. PRS scores were predicted based on individual genotypes in DRIVE using the UKBioBank-trained XGBoost model (as described in Ref. 18). We compared score distributions across cases and controls in DRIVE using a Mann-Whitney U test. We also compared the incidence of breast cancer by partitioning the UKBioBank and DRIVE probabilities into 10% quantiles of the PRS. We plotted the 10% quantiles using the min-max normalized XGBoost-derived PRS scores.
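The quantile comparison can be sketched as follows: individuals are ranked by PRS, split into ten equal-sized groups, and the fraction of cases in each group is reported. The function name is hypothetical and ties are broken by input order, which is an assumption. Min-max normalization is monotone, so it does not change group membership and is omitted here.

```python
from typing import List, Sequence


def decile_incidence(scores: Sequence[float], labels: Sequence[int],
                     n_groups: int = 10) -> List[float]:
    """Fraction of cases (label 1) in each score quantile, from lowest to highest score.

    Assumes at least `n_groups` individuals so that no group is empty.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    n = len(order)
    rates = []
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        rates.append(sum(labels[i] for i in members) / len(members))
    return rates
```

An incidence that rises from the lowest to the highest group indicates that the score stratifies risk in that cohort.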

Prediction of survival time in TCGA tumor types

Survival was modeled separately for each of 20 tumor types in TCGA (BLCA, CESC, KIRC, KIRP, PAAD, BRCA, HNSC, LGG, SKCM, PRAD, OV, UCEC, THCA, LUAD, LUSC, COAD, STAD, LIHC, SARC). Cancer meQTLs were included in predictive modeling if they were present with at least 1% minor allele frequency in the specific tumor type and were nominally significant in Kaplan-Meier analysis. Tumor types where fewer than 5 meQTLs showed a nominal association with overall survival, or that had fewer than 100 patients in TCGA, were excluded from the analysis. For the remaining tumor types, we divided the analysis into three categories: a clinical group containing only clinical features (sex, age, and, in certain cancer types, tumor stage; only cancer types with >100 patients with non-null tumor stage included stage as a covariate), a clinical plus SNP group, and a SNP-only group. For each of the categories, SNPs were selected by LASSO and the complete dataset was then used to train an XGBoost model, with 5-fold cross validation used to estimate the generalization error and generate an AUC value. Specifically, for each individual we simplified the genotypes to a binary feature valued 1 if the patient was heterozygous or homozygous for the meQTL allele and 0 otherwise. Binarized genotypes were then z-score normalized and input into a LASSO regularization model (α=0.001). Features with a LASSO coefficient of 0 (i.e. non-informative features) were removed, and the LASSO-filtered SNP set was used to train an XGBoost classifier (n_estimators=500, learning_rate=0.1, max_depth=9) to predict binarized median overall survival (OS, 1=low survival<median survival, 0=high survival>median survival). Only cancer types with a higher AUC value in the clinical+SNP group than in the clinical group were considered for the SNP-only analysis. A higher AUC in the combined group could suggest that SNPs bring additive information. The output of the SNP-only XGBoost model was used as a non-linear polygenic survival score (PSS). Before input into the Cox model, the PSS was scaled using the min-max algorithm and outliers were removed using a 1.5*(interquartile range) threshold.
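The simple preprocessing steps in this section (genotype binarization, the median split of overall survival, and min-max scaling followed by IQR-based outlier removal) can be sketched as below. The function names are hypothetical. Three details are assumptions not specified in the text: patients exactly at the median are assigned to the high-survival class, censoring is ignored in the median split, and quartiles are computed with the default method of Python's `statistics.quantiles`.

```python
import statistics
from typing import List, Sequence


def binarize_genotypes(dosages: Sequence[int]) -> List[int]:
    """1 if the patient carries one or two copies of the meQTL allele, else 0."""
    return [1 if d > 0 else 0 for d in dosages]


def median_survival_labels(os_times: Sequence[float]) -> List[int]:
    """1 = overall survival below the cohort median (low survival), 0 otherwise."""
    med = statistics.median(os_times)
    return [1 if t < med else 0 for t in os_times]


def scale_and_trim(pss: Sequence[float], k: float = 1.5) -> List[float]:
    """Min-max scale scores to [0, 1], then drop values beyond k * IQR from the quartiles."""
    lo, hi = min(pss), max(pss)
    scaled = [(s - lo) / (hi - lo) for s in pss]
    q1, _, q3 = statistics.quantiles(scaled, n=4)
    iqr = q3 - q1
    return [s for s in scaled if q1 - k * iqr <= s <= q3 + k * iqr]
```

In practice the trimmed scores would be kept aligned with patient identifiers so that the same patients are dropped from the survival table before model fitting.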

Cox proportional hazard using PSS

We used Cox proportional hazards models to evaluate the meQTL-based PSS as a predictor of survival interval across tumor types in TCGA. We combined the PSS with clinical features including sex, age at diagnosis and tumor stage in a multivariable Cox proportional hazards model to predict OS, and evaluated the hazard ratios and 95% confidence intervals for each covariate. We repeated this analysis for disease-free interval (DFI).

Author contributions

Original concept by SG and MP. HC supervised the project. SG performed computational data processing and analysis. MP, AK, JT provided support with data set preparation and contributed to computer code. SG, HC wrote the manuscript.

Data availability

Source data

Data were obtained from public sources including The Cancer Genome Atlas (TCGA; dbGaP: phs000178.v11.p8) and Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE; dbGaP: phs001265.v1.p1). dbGaP requires an application to access data; applicants will need to create an eRA Commons account and begin a project request. Senior Investigators and NIH Investigators are eligible to apply to access.

We use data from the UKBiobank resource under application number 37671 for this work. All bona fide researchers can apply to use the UK Biobank resource for health-related research that is in the public interest. Further information on the application process is available from the UK Biobank website.

meQTLs were obtained from Gong et al.¹⁶ (http://bioinfo.life.hust.edu.cn/Pancan-meQTL/). TADs were obtained from Rao et al.¹² (https://doi.org/10.1016/j.cell.2014.11.021).

Software availability

Source code available from: https://github.com/cartercompbio/meQTLs.

Archived source code at time of publication: https://doi.org/10.5281/zenodo.8168488.⁶⁰

License: MIT.

Extended data

Extended data Figure 1 can be found in Figshare at https://doi.org/10.6084/m9.figshare.29610992.v1.⁶¹

Extended data Figure 2 can be found in Figshare at https://doi.org/10.6084/m9.figshare.29610998.v1.⁶²

Data are available under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0).

Acknowledgements

We would like to acknowledge Rany M Salem for providing access to UKBioBank data and TJ Sears for helpful scientific discussion. This research has been conducted using the UK Biobank Resource under Application Number 37671. The results shown here are also based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga. OncoArray genotyping and phenotype data harmonization for the Discovery, Biology, and Risk of Inherited Variants in Breast Cancer (DRIVE) breast-cancer case control samples was supported by X01 HG007491 and U19 CA148065 and by Cancer Research UK (C1287/A16563). Genotyping was conducted by the Center for Inherited Disease Research (CIDR), Centre for Cancer Genetic Epidemiology, University of Cambridge, and the National Cancer Institute. The following studies contributed germline DNA from breast cancer cases and controls: the Two Sister Study (2SISTER), Breast Oncology Galicia Network (BREOGAN), Copenhagen General Population Study (CGPS), Cancer Prevention Study 2 (CPSII), The European Prospective Investigation into Cancer and Nutrition (EPIC), Melbourne Collaborative Cohort Study (MCCS), Multiethnic Cohort (MEC), Nashville Breast Health Study (NBHS), Nurses Health Study (NHS), Nurses Health Study 2 (NHS2), Polish Breast Cancer Study (PBCS), Prostate Lung Colorectal and Ovarian Cancer Screening Trial (PLCO), Studies of Epidemiology and Risk Factors in Cancer Heredity (SEARCH), The Sister Study (SISTER), Swedish Mammography Cohort (SMC), Women of African Ancestry Breast Cancer Study (WAABCS), Women’s Health Initiative (WHI).

References

1. Iyer JG, et al.: Response rates and durability of chemotherapy among 62 patients with metastatic Merkel cell carcinoma. Cancer Med. 2016; 5: 2294–2301.
2. Gayther SA, et al.: Variation of risks of breast and ovarian cancer associated with different germline mutations of the BRCA2 gene. Nat. Genet. 1997; 15: 103–105.
3. Chequin A, et al.: Antitumoral activity of liraglutide, a new DNMT inhibitor in breast cancer cells in vitro and in vivo. Chem. Biol. Interact. 2021; 349: 109641.
4. Heyn H, et al.: Linkage of DNA methylation quantitative trait loci to human cancer risk. Cell Rep. 2014; 7: 331–338.
5. Irizarry RA, et al.: The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat. Genet. 2009; 41: 178–186.
6. Esteller M, et al.: Promoter hypermethylation and BRCA1 inactivation in sporadic breast and ovarian tumors. J. Natl. Cancer Inst. 2000; 92: 564–569.
7. Wolff EM, et al.: Hypomethylation of a LINE-1 promoter activates an alternate transcript of the MET oncogene in bladders with cancer. PLoS Genet. 2010; 6: e1000917.
8. Jablonski KP, et al.: Contribution of 3D genome topological domains to genetic risk of cancers: a genome-wide computational study. Hum. Genomics. 2022; 16: 2.
9. Dixon JR, et al.: Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012; 485: 376–380.
10. McArthur E, Capra JA: Topologically associating domain boundaries that are stable across diverse cell types are evolutionarily constrained and enriched for heritability. Am. J. Hum. Genet. 2021; 108: 269–283.
11. Akdemir KC, et al.: Somatic mutation distributions in cancer genomes vary with three-dimensional chromatin structure. Nat. Genet. 2020; 52: 1178–1188.
12. Rao SSP, et al.: A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014; 159: 1665–1680.
13. Nora EP, et al.: Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature. 2012; 485: 381–385.
14. Li S, Peng Y, Panchenko AR: DNA methylation: Precise modulation of chromatin structure and dynamics. Curr. Opin. Struct. Biol. 2022; 75: 102430.
15. Curradi M, Izzo A, Badaracco G, et al.: Molecular mechanisms of gene silencing mediated by DNA methylation. Mol. Cell. Biol. 2002; 22: 3157–3173.
16. Gong J, et al.: Pancan-meQTL: a database to systematically evaluate the effects of genetic variants on methylation in human cancer. Nucleic Acids Res. 2019; 47: D1066–D1072.
17. Tate JG, et al.: COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 2019; 47: D941–D947.
18. Elgart M, et al.: Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Commun. Biol. 2022; 5: 856.
19. Sheehan M, et al.: Investigating the Link between Lynch Syndrome and Breast Cancer. Eur. J. Breast Health. 2020; 16: 106–109.
20. Ma S-J, Liu Y-M, Zhang Y-L, et al.: Correlations of and gene polymorphisms with breast cancer susceptibility and prognosis. Biosci. Rep. 2018; 38.
21. Park U-H, et al.: ASXL2 promotes proliferation of breast cancer cells by linking ERα to histone methylation. Oncogene. 2016; 35: 3742–3752.
22. Wang X, et al.: Clinical and prognostic relevance of EZH2 in breast cancer: A meta-analysis. Biomed. Pharmacother. 2015; 75: 218–225.
23. Mavaddat N, Michailidou K, Dennis J, et al.: Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am. J. Hum. Genet. 2019; 104: 21–34.
24. Lakeman IMM, Rodríguez-Girondo M, Lee A, et al.: Validation of the BOADICEA model and a 313-variant polygenic risk score for breast cancer risk prediction in a Dutch prospective cohort. Genet. Med. 2020; 22: 1803–1811.
25. Walia V, et al.: Mutational and functional analysis of the tumor-suppressor PTPRD in human melanoma. Hum. Mutat. 2014; 35: 1301–1310.
26. Schrama D, et al.: ERCC5 p.Asp1104His and ERCC2 p.Lys751Gln polymorphisms are independent prognostic factors for the clinical course of melanoma. J. Invest. Dermatol. 2011; 131: 1280–1290.
27. Henríquez-Hernández LA, et al.: Single nucleotide polymorphisms in DNA repair genes as risk factors associated to prostate cancer progression. BMC Med. Genet. 2014; 15: 143.
28. Zhu Y, et al.: Systematic analysis on expression quantitative trait loci identifies a novel regulatory variant in ring finger and WD repeat domain 3 associated with prognosis of pancreatic cancer. Chin. Med. J. 2022; 135: 1348–1357.
29. Fu X, et al.: RFWD3-Mdm2 ubiquitin ligase complex positively regulates p53 stability in response to DNA damage. Proc. Natl. Acad. Sci. U. S. A. 2010; 107: 4579–4584.
30. Dasgupta P, et al.: LncRNA CDKN2B-AS1/miR-141/cyclin D network regulates tumor progression and metastasis of renal cell carcinoma. Cell Death Dis. 2020; 11: 660.
31. Pellegata NS, et al.: Human pheochromocytomas show reduced p27Kip1 expression that is not associated with somatic gene mutations and rarely with deletions. Virchows Arch. 2007; 451: 37–46.
32. Theodoropoulos GE, et al.: Caspase 9 promoter polymorphisms confer increased susceptibility to breast cancer. Cancer Genet. 2012; 205: 508–512.
33. Rodriguez-Ruiz ME, et al.: Apoptotic caspases inhibit abscopal responses to radiation and identify a new prognostic biomarker for breast cancer patients. Oncoimmunology. 2019; 8: e1655964.
34. Walsh CS, et al.: ERCC5 is a novel biomarker of ovarian cancer prognosis. J. Clin. Oncol. 2008; 26: 2952–2958.
35. Shuai W, et al.: ETNK1 mutation occurs in a wide spectrum of myeloid neoplasms and is not specific for atypical chronic myeloid leukemia. Cancer. 2023; 129: 878–889.
36. Stoica C, Ferreira AK, Hannan K, et al.: Bilayer Forming Phospholipids as Targets for Cancer Therapy. Int. J. Mol. Sci. 2022; 23.
37. Ahmed M, et al.: CRISPRi screens reveal a DNA methylation-mediated 3D genome dependent causal mechanism in prostate cancer. Nat. Commun. 2021; 12: 1781.
38. Xia J-H, Wei G-H: Enhancer Dysfunction in 3D Genome and Disease. Cells. 2019; 8.
39. Fudenberg G, Pollard KS: Chromatin features constrain structural variation across evolutionary timescales. Proc. Natl. Acad. Sci. U. S. A. 2019; 116: 2175–2180.
40. Rovirosa L, Ramos-Morales A, Javierre BM: The Genome in a Three-Dimensional Context: Deciphering the Contribution of Noncoding Mutations at Enhancers to Blood Cancer. Front. Immunol. 2020; 11: 592087.
41. Valton A-L, Dekker J: TAD disruption as oncogenic driver. Curr. Opin. Genet. Dev. 2016; 36: 34–40.
42. Pagadala M, et al.: Germline modifiers of the tumor immune microenvironment implicate drivers of cancer risk and immunotherapy response. Nat. Commun. 2023; 14: 2744.
43. Zhang P, et al.: Germline and Somatic Genetic Variants in the p53 Pathway Interact to Affect Cancer Risk, Progression, and Drug Response. Cancer Res. 2021; 81: 1667–1680.
44. Sayaman RW, et al.: Germline genetic contribution to the immune landscape of cancer. Immunity. 2021; 54: 367–386.e8.
45. Carter H, et al.: Interaction Landscape of Inherited Polymorphisms with Somatic Events in Cancer. Cancer Discov. 2017; 7: 410–423.
46. Dworkin AM, et al.: Germline variation controls the architecture of somatic alterations in tumors. PLoS Genet. 2010; 6: e1001136.
47. Li Q, et al.: Expression QTL-based analyses reveal candidate causal genes and loci across five tumor types. Hum. Mol. Genet. 2014; 23: 5294–5302.
48. Li W, et al.: Cis- and Trans-Acting Expression Quantitative Trait Loci of Long Non-Coding RNA in 2,549 Cancers With Potential Clinical and Therapeutic Implications. Front. Oncol. 2020; 10: 602104.
49. Akdemir KC, et al.: Disruption of chromatin folding domains by somatic genomic rearrangements in human cancer. Nat. Genet. 2020; 52: 294–305.
50. Yousefi PD, Suderman M, Langdon R, et al.: DNA methylation-based predictors of health: applications and statistical considerations. Nat. Rev. Genet. 2022; 23: 369–383.
51. Liu J, et al.: An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell. 2018; 173: 400–416.e11.
52. Kazachenka A, et al.: Identification, Characterization, and Heritability of Murine Metastable Epialleles: Implications for Non-genetic Inheritance. Cell. 2018; 175: 1717.
53. Inoue F, et al.: A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res. 2017; 27: 38–52.
54. Bycroft C, et al.: The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018; 562: 203–209. PubMed Abstract | Publisher Full Text | Free Full Text
55. Amos CI, et al.: The OncoArray Consortium: A Network for Understanding the Genetic Architecture of Common Cancers. Cancer Epidemiol. Biomark. Prev. 2017; 26: 126–135. PubMed Abstract | Publisher Full Text | Free Full Text
56. Roadmap Epigenomics Consortium et al.: Integrative analysis of 111 reference human epigenomes. Nature. 2015; 518: 317–330.
57. Purcell S, et al.: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007; 81: 559–575. PubMed Abstract | Publisher Full Text | Free Full Text
58. ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489: 57–74. PubMed Abstract | Publisher Full Text | Free Full Text
59. Hall MA, et al.: PLATO software provides analytic framework for investigating complexity beyond genome-wide association studies. Nat. Commun. 2017; 8: 1167. PubMed Abstract | Publisher Full Text | Free Full Text
60. Goudarzi S, Hcarter: cartercompbio/meQTLs: Initial release (v1.0.0). Zenodo. 2023. Publisher Full Text
61. Carter H: Extended_Data_Figure_1.pdf. figshare. Figure. 2025. Publisher Full Text
62. Carter H: Extended Data Figure 2. figshare. Figure. 2025. Publisher Full Text

Competing interests

No competing interests were disclosed.

Grant information

This work was supported by an NIH National Cancer Institute grant (R01CA269919) to HC and a National Institute of General Medical Sciences infrastructure grant (2P41GM103504-11).

Article Versions (2)

version 2 (Revised). Published: 24 Jul 2025, 12:1083. https://doi.org/10.12688/f1000research.139476.2

version 1. Published: 01 Sep 2023, 12:1083. https://doi.org/10.12688/f1000research.139476.1

© 2025 Goudarzi S et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

how to cite this article

Goudarzi S, Pagadala M, Klie A et al. Epigenetic germline variants predict cancer prognosis and risk and distribute uniquely in topologically associating domains [version 2; peer review: 2 approved, 2 approved with reservations]. F1000Research 2025, 12:1083 (https://doi.org/10.12688/f1000research.139476.2)

NOTE: If applicable, it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to Reviewer Statuses

Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.

Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.

Version 2

PUBLISHED 24 Jul 2025 (Revised)

Reviewer Report 16 Jan 2026

Kosuke Yamaguchi, Molecular Cell Engineering Laboratory, National Institute of Genetics, Mishima, Shizuoka, Japan

Approved with Reservations

https://doi.org/10.5256/f1000research.185383.r446700

Methylation quantitative trait loci (meQTLs) are single nucleotide polymorphisms (SNPs) that are statistically associated with variation in DNA methylation levels at specific CpG sites. Such genetic-epigenetic interactions have attracted considerable interest in cancer research, as altered DNA methylation patterns can influence the expression of tumor suppressor genes and oncogenes. In addition, higher-order genome organization, including topologically associating domains (TADs), may further modulate the functional impact of genetic variants on epigenetic regulation in a context- and disease-specific manner.

In this study, the authors integrate cancer-associated meQTLs with TAD organization. They report that transcriptionally active TADs tend to exhibit relative DNA hypomethylation, whereas inactive TADs are associated with DNA hypermethylation (Fig. 1). Furthermore, the authors observe that meQTLs are enriched at TAD boundaries, particularly at active-boundary-active, inactive-boundary-active, and active-boundary-inactive configurations (Fig. 2). The authors also report that oncogene-associated meQTLs are preferentially located within inactive TADs, whereas tumor suppressor gene-associated meQTLs are more frequently found in active TADs, based on randomization analyses (Fig. 3).

The authors further identify 156 cancer gene-related clustered meQTLs (Table 1), among which 23 are reported to be significantly associated with cancer risk and survival across multiple cancer types in a pan-cancer model (Table 2). Breast cancer is highlighted as a representative example, supported by relatively large sample sizes from the UK Biobank and additional cohorts (Table 2, C50-C50). Using an XGBoost-trained model, the authors calculate a polygenic risk score (PRS) and suggest that this score may be associated with cancer risk (Fig. 4). The authors additionally attempt to isolate key contributing factors (MSH2, EZH2, and ASXL2) for cancer risk prediction; however, the expression levels of these genes do not show a significant correlation with PRS values (Fig. 5). Finally, the study proposes polygenic survival scores (PSS) derived from these meQTLs, suggesting that these scores may stratify cancer risk and survival beyond conventional clinical parameters (Fig. 6 and 7).

Overall, this study provides conceptually interesting insights into the integration of genetic variation, DNA methylation, and three-dimensional genome organization in cancer risk assessment. However, several conclusions are not fully supported by the presented data, and important controls or additional analyses appear to be lacking. This reviewer therefore suggests that additional experiments and/or analyses would substantially strengthen the manuscript.

Major Comments
1. Figure and table labeling issues:
Several figures and tables are not correctly labeled or lack essential information. The authors should carefully review all figures and tables to ensure clarity and completeness. Examples identified by this reviewer include:

Figure 1A: The meaning of the X-axis label is unclear and should be explicitly defined.
Table 2: No false discovery rate (FDR) or q-value information is provided; only p-values associated with ICD-10 codes are shown. If these p-values are intended to represent FDR-adjusted values, they should be clearly labeled as such.
Table 2: The table legend describes beta values as correlation coefficients between meQTLs and promoter DNA methylation; however, it is not sufficiently clear from the table how these beta values should be interpreted in relation to the reported cancer risk and survival associations. Additional clarification would improve the readability of the table.
Figure 5B: All Y-axis labels are shown as "MSH2 Expression (Z-score)," which appears to be incorrect.
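
For readers less familiar with the adjustment the reviewer asks about for Table 2, the following is a minimal sketch of the Benjamini-Hochberg procedure that turns raw p-values into FDR-adjusted q-values. It is an illustration only, not the authors' pipeline: the function name `benjamini_hochberg` and the example p-values are hypothetical and are not taken from the manuscript.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.60]
example_q = benjamini_hochberg(example_p)
# A test "passes" at FDR < 0.05 when its q-value is below 0.05.
significant = [q < 0.05 for q in example_q]
```

Reporting the q-value column next to the raw p-value column makes it unambiguous which of the two a significance threshold refers to.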

2. DNA methylation source in Fig. 1B:
In Fig. 1B, the authors analyze DNA methylation levels across 1,100 TADs shared among five cell lines (GM12878, HMEC, HUVEC, IMR90, and NHEK). As all these cell lines are derived from normal tissues, it is critical to clarify whether normal tissue DNA methylation data were used in this analysis. This reviewer requests explicit confirmation that normal (non-tumor) DNA methylation data from the TCGA database were used, rather than cancer-derived samples.

3. Interpretation of Table 2 and biological linkage:
Table 2 presents a list of meQTLs reported to significantly affect cancer risk and survival in a pan-cancer model. This reviewer assumes that the authors aim to establish a mechanistic link between meQTLs, TAD organization, promoter DNA methylation of cancer-related genes, gene expression changes, and cancer risk or survival. To strengthen this interpretation, this reviewer suggests:

Adding gene expression comparisons between cancer and normal tissues for the genes listed in Table 2.
Explicitly annotating whether each gene in Table 2 is classified as a tumor suppressor gene or an oncogene.

4. Evaluation of PRS specificity in Fig. 4:
In Fig. 4, the authors show that a PRS derived from 23 TAD-associated meQTLs predicts breast cancer risk. However, meQTLs themselves have been reported as cancer-associated variants independent of TAD context. To specifically demonstrate the added value of TAD information, this reviewer recommends performing parallel PRS analyses using:

Pan-cancer gene-related meQTLs, or
All identified meQTLs,
and directly comparing the prediction performance between TAD-associated meQTLs and these broader meQTL sets.

5. PSS comparison in Fig. 5:
Similarly, for Fig. 5, the authors should calculate PSS values using pan-cancer gene-related meQTLs and compare their predictive performance with TAD-associated meQTL-based PSS. This comparison is necessary to demonstrate that incorporating TAD information provides added predictive value.
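
One common way to compare the predictive performance of two survival scores, as requested here, is Harrell's concordance index. The sketch below is a plain-Python illustration under stated assumptions and is not the method used in the manuscript: the function name `concordance_index`, the two score vectors `pss_a` and `pss_b`, and all data values are hypothetical.

```python
def concordance_index(times, events, scores):
    """Harrell's C: among comparable pairs, the fraction in which the
    individual with the shorter observed survival has the higher score.
    Tied scores count as half-concordant."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had an observed event
            # (events[i] == 1) strictly before j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: follow-up times, event flags (1 = death observed,
# 0 = censored), and two competing survival scores.
times = [5, 8, 12, 20, 25]
events = [1, 1, 0, 1, 0]
pss_a = [0.9, 0.7, 0.4, 0.3, 0.1]
pss_b = [0.2, 0.7, 0.4, 0.9, 0.1]

c_a = concordance_index(times, events, pss_a)
c_b = concordance_index(times, events, pss_b)
```

A value near 0.5 indicates no discrimination and a value near 1.0 indicates that higher scores consistently go with shorter survival, so evaluating two score definitions on the same held-out individuals gives a direct comparison.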

6. Discussion: relevance of DNA methylation-CTCF interactions:
Recent studies have reported that DNA methylation can directly affect CTCF binding, a key regulator of 3D genome organization. Incorporating this literature would strengthen the conceptual framework of the manuscript. The following references are suggested for discussion:

PMID: 10839546
PMID: 10839547
PMID: 12461525
PMID: 26257180
PMID: 30948436
PMID: 39180406

Minor Comments
Introduction, second paragraph:
"L1NE1" appears to be a typographical error and should be corrected to "LINE1."

Figures 3B and 3C:
The distinction between observed and random values is difficult to interpret. Using bar plots for both values with a clear legend would improve readability.

Figure 4:
The term polygenic risk score (PRS) is not defined prior to the use of the abbreviation. For readers outside the field, the full term should be introduced before abbreviation.

Figure 4A:
The X-axis labels "0" and "1" are not explained. It would be clearer to label these as "control" and "cancer," respectively.

Text related to Fig. 5:
Brief functional descriptions of MSH2 and EZH2 would improve clarity. MSH2 is a key component of DNA mismatch repair, and EZH2 is a core subunit of the PRC2 complex responsible for H3K27 methylation, both of which are closely linked to cancer biology.

Page 9, Table 2 inconsistency:
The manuscript states: "Out of the 155 SNPs, 21 passed the Benjamini-Hochberg adjusted FDR of less than 0.05 (Table 2)." However, Table 2 lists 23 SNPs. This discrepancy should be resolved. If an additional table exists, it should be included in the manuscript.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Partly
Are sufficient details of methods and analysis provided to allow replication by others? Partly
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Partly
Are the conclusions drawn adequately supported by the results? Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Epigenetics, cell biology, molecular cell biology.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Reviewer Report 07 Jan 2026

Busra Unal, Umraniye Training and Research Hospital, Istanbul, Turkey

Approved

https://doi.org/10.5256/f1000research.185383.r432966

The authors present an interesting and carefully conducted study. The manuscript is overall well written and provides valuable insights. However, several issues would benefit from clarification and further consideration.
1. Across the manuscript, the terms “epigenetic germline variants” and “cancer meQTLs” appear to be used somewhat interchangeably. It is not always clear whether the authors are referring to germline variants with stable methylation effects across tissues and disease states, or to tumor-context–dependent meQTLs that may emerge within the altered epigenetic landscape of cancer. More consistent terminology would help clarify which findings reflect underlying germline regulatory structure and which are tumor-specific. This clarification is important because the manuscript’s conclusions regarding cancer risk inference may have potential clinical implications.
2. The primary datasets used in this study (TCGA, UK Biobank, and DRIVE) are all substantially enriched for individuals of European ancestry. However, the manuscript does not evaluate whether the reported findings generalize beyond these populations. Additionally, the manuscript does not fully address how ancestry-related differences in linkage disequilibrium structure, baseline methylation landscapes, and chromatin organization may influence both meQTL discovery and model performance. Considering that epigenetic regulation and variant–methylation coupling can differ across populations, the absence of ancestry-stratified analyses or a conceptual consideration of these issues limits the generalisability of the work. I would recommend that the authors either incorporate ancestry-aware analyses or provide a discussion of how population structure may affect meQTL architecture and the broader applicability of their conclusions.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility? Partly
Are the conclusions drawn adequately supported by the results? Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: My area of research is hereditary cancer predisposition syndromes and disparities in genomic medicine application.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report 12 Aug 2025

Charu Mehta, Basic Sciences Division, Fred Hutchinson Cancer Center, Seattle, Washington, USA

Approved

https://doi.org/10.5256/f1000research.185383.r400300

Version 1

PUBLISHED 01 Sep 2023

Reviewer Report 08 Feb 2024

Chiara Herzog, Universitat Innsbruck, Innsbruck, Tyrol, Austria

Approved with Reservations

https://doi.org/10.5256/f1000research.152752.r233315

The study by Goudarzi et al. provides interesting and new insights into the association of meQTLs and TAD regions of the genome, and investigates the capacity of meQTLs to predict cancer status and survival.

Overall the study is well done and clearly presented but I have a few comments and suggestions for improvement.

Major:
- In the introduction and discussion the authors state “This study investigated the relationship between epigenetic factors like chromatin structure and DNA methylation and genetic variation in the context of cancer”. While the authors indeed investigate methylation and TADs in the first part of the manuscript, the majority of the predictors focus on genetic loci and not epigenetics or their interaction. Arguably, one of the most interesting aspects of meQTLs in their capacity for risk prediction are their modulated methylation levels and potential to reflect the integration of genetic and dynamic nonheritable factors (such as due to aging or lifestyle factors), but this was not looked at in detail. Could the authors comment on how meQTLs might be modulated by nonheritable factors as well as genetic factors, and e.g. look into methylation at these sites in cancers or samples preceding cancer?
- Along these lines, it might in the future be interesting to develop dynamic cancer risk predictors as opposed to static tools (such as the PRSs), which might be enabled by nongenetic ‘omics’. Could the authors discuss the potential of these and how their findings might contribute to this (i.e. how meQTLs might contribute to dynamic risk monitoring)?

Minor:

- The authors describe a PCA but do not show any figures or supporting data. Could the authors either add a statement that no data are shown in the text, or (preferred) provide these data in the supplementary information?
- Previous breast cancer PRSs (not based on meQTL) such as the PRS313 have already shown that they may be biased towards certain subtypes - it might be worth mentioning these prior models (e.g. Mavaddat et al 2019) when discussing the current study’s findings in context.
- What was the ROC AUC (and 95% CI) of the cancer risk score (Figure 4A)?
- Can the authors explain the discrepancy of the more ‘linear’ increase in risk in the UKBB compared to DRIVE (Figure 4b versus 4c)?
- Figure 6/TCGA survival: It might also be interesting to look at recurrence-free survival in addition to overall survival.
- In the section Oncogenes and tumor suppressor gene-related (…): capitalise L in ‘Clumped cancer mQTls’
- Figures:
    - Figure 1b: Could the authors also indicate in the Figure legend that this is a Kruskal-Wallis p value.
    - Figure 1c - For interpretability, it might be helpful to add at least y axis grid lines behind box plots. The effects may be significant and are visualised with the violin density plot, but are difficult to see using box plots.
    - Figure 2 - takes some time to understand upon first reading. It might be helpful to label the blue bars in B and C with a legend ‘expected’ and the red dot as ‘observed’ to make it easier to grasp quickly.
    - Figure 4: The text in the caption for A refers to a ‘Figure 8’ that does not exist. Please check this.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Partly

Competing Interests: C.H. is the author of a patent on epigenetics-based breast cancer risk prediction using meQTLs, the WID™-qtBC test

Reviewer Expertise: epigenetics and cancer risk prediction

Author Response 09 Aug 2025

Hannah Carter, Moores Cancer Center, La Jolla, CA 92093, USA

The study by Goudarzi et al. provides interesting and new insights into the association of meQTLs and TAD regions of the genome, and investigates the capacity of meQTLs to predict cancer status and survival.
Overall the study is well done and clearly presented but I have a few comments and suggestions for improvement.

Major:
- In the introduction and discussion the authors state “This study investigated the relationship between epigenetic factors like chromatin structure and DNA methylation and genetic variation in the context of cancer”. While the authors indeed investigate methylation and TADs in the first part of the manuscript, the majority of the predictors focus on genetic loci and not epigenetics or their interaction. Arguably, one of the most interesting aspects of meQTLs in their capacity for risk prediction are their modulated methylation levels and potential to reflect the integration of genetic and dynamic non heritable factors (such as due to aging or lifestyle factors), but this was not looked at in detail. Could the authors comment on how meQTLs might be modulated by non heritable factors as well as genetic factors, and e.g. look into methylation at these sites in cancers or samples preceding cancer?

We agree with the reviewer that it is important to consider potential for interaction with non-heritable factors. We have modified our discussion to acknowledge the need to evaluate non-heritable factors as follows:

“There are also a number of non-genetic risk factors that act by modifying DNA methylation levels and which could interact with genetic regulation. These include aging, exercise, stress, diet and obesity, and a broad variety of environmental exposures. In our analysis, age had the highest impact on DNA methylation modulation, however, as age and sex were the only clinical factors for the majority of our study, future analysis of other non-genetic factors in relation to genetic regulators of DNA methylation are merited.”

- Along these lines, it might in the future be interesting to develop dynamic cancer risk predictors as opposed to static tools (such as the PRSs), which might be enabled by nongenetic ‘omics’. Could the authors discuss the potential of these and how their findings might contribute to this (i.e. how meQTLs might contribute to dynamic risk monitoring)?

This is a nice suggestion. We have added the following to the discussion:

“Future efforts could integrate dynamic methylation changes due to these non-genetic factors with static polygenic scores such as we describe here to provide a more accurate estimate of risk. This type of approach could benefit in particular from non-invasive biomarkers, such as cell free DNA methylation from blood, though studies will be needed to establish the cumulative effect of dynamic exposures and the extent to which they can be accurately evaluated from cell free DNA.”

Ref:
Yousefi, P.D., Suderman, M., Langdon, R. et al. DNA methylation-based predictors of health: applications and statistical considerations. Nat Rev Genet 23, 369–383 (2022). https://doi.org/10.1038/s41576-022-00465-w

Minor:
- The authors describe a PCA but do not show any figures or supporting data. Could the authors either add a statement that no data are shown in the text, or (preferred) provide these data in the supplementary information?

We have added the PCA figure as Supplementary Figure 1.

- Previous breast cancer PRSs (not based on meQTL) such as the PRS313 have already shown that they may be biased towards certain subtypes - it might be worth mentioning these prior models (e.g. Mavaddat et al 2019) when discussing the current study’s findings in context.

We have now included the following text:

“Indeed, classic polygenic risk scores for breast cancer have shown bias for predicting certain subtypes (Mavaddat et al). Lakeman et al 2020 demonstrated that women in the highest 1% of risk showed a 4.37-fold increased risk for ER-positive disease but only a 2.78-fold increased risk for ER-negative disease compared to the middle quintile showing bias in certain subtypes.”

Refs
Mavaddat N, Michailidou K, Dennis J, Lush M, Fachal L, Lee A, Tyrer JP, Chen TH, Wang Q, Bolla MK, Yang X. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. The American Journal of Human Genetics. 2019 Jan 3;104(1):21-34.

Lakeman, I.M.M., Rodríguez-Girondo, M., Lee, A. et al. Validation of the BOADICEA model and a 313-variant polygenic risk score for breast cancer risk prediction in a Dutch prospective cohort. Genet Med 22, 1803–1811 (2020). https://doi.org/10.1038/s41436-020-0884-4

- What was the ROC AUC (and 95% CI) of the cancer risk score (Figure 4A)?

The DRIVE AUC was 0.5534, with a 95% CI of [0.5505, 0.5563]. This has now been added to the manuscript. We note that this PRS is based solely upon meQTLs near driver genes and is not expected to be a strong predictor of risk relative to more comprehensive breast cancer polygenic scores. Rather, we sought to reproduce effects on risk attributable solely to driver meQTLs in an independent cohort.
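
As an illustration of how a ROC AUC and a percentile-bootstrap 95% confidence interval of this kind can be obtained, a minimal standard-library sketch follows. It is not the code used for the DRIVE analysis, and the response does not state how the reported interval was computed: the function names `roc_auc` and `bootstrap_ci`, the resampling settings, and the toy labels and scores are all assumptions made for the example.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control
    (ties count one half); equivalent to the Mann-Whitney U statistic."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (resampling individuals)."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        # Skip resamples that happen to contain only one class.
        if 0 < sum(ys) < n:
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    low = aucs[int((alpha / 2) * n_boot)]
    high = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


# Hypothetical toy data: 1 = case, 0 = control, with one risk score each.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.6, 0.7, 0.9]

auc = roc_auc(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores, n_boot=200)
```

The pairwise AUC loop is quadratic in cohort size, so for cohorts as large as DRIVE or UK Biobank a rank-based implementation would be preferable; the sketch favours readability.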

- Can the authors explain the discrepancy of the more ‘linear’ increase in risk in the UKBB compared to DRIVE (Figure 4b versus 4c)?

This discrepancy could be due to inherent differences in the composition of the two datasets. UKBB comes predominantly from volunteers in the UK and is more representative of disease incidence in the general population. The DRIVE study was designed specifically to analyze breast cancer risk, and therefore individuals were included to either represent breast cancer or serve as non-breast cancer controls. Breast cancer status in the UKBB was defined based on ICD-10 codes, and we excluded individuals from the controls if they had any ICD-10 code associated with neoplasms, but the number of cases relative to controls was unbalanced for UKBB whereas it was balanced for DRIVE. The difference in linearity could be due to differences in genotypic and phenotypic diversity between the UKBB and DRIVE cohorts. There could also be discrepancies in environmental risk factors between these cohorts.

- Figure 6/TCGA survival: It might also be interesting to look at recurrence-free survival in addition to overall survival.

We now include an analysis of disease free intervals as Figure SX. The methods have been updated accordingly.

- In the section Oncogenes and tumor suppressor gene-related (…): capitalise L in ‘Clumped cancer mQTls’

We have now corrected this.

- Figures:

- Figure 1b: Could the authors also indicate in the Figure legend that this is a Kruskal-Wallis p value.

We have updated the figure legend to reflect the test used.

- Figure 1c - For interpretability, it might be helpful to add at least y axis grid lines behind box plots. The effects may be significant and is visualised with the violin density plot, but are difficult to see using box plots.

We added gridlines as suggested.

- Figure 2 - takes some time to understand upon first reading. It might be helpful to label the blue bars in B and C with a legend ‘expected’ and the red dot as ‘observed’ to make it easier to grasp quickly.

We have updated the legends to read observed and expected.

- Figure 4: The text in the caption for A refers to a ‘Figure 8’ that does not exist. Please check this.

Thank you for catching this. The erroneous figure reference has been removed from the caption.
The study by Goudarzi et al. provides interesting and new insights into the association of meQTLs and TAD regions of the genome, and investigates the capacity of meQTLs to predict cancer status and survival.
Overall the study is well done and clearly presented but I have a few comments and suggestions for improvement.

Major:
- In the introduction and discussion the authors state “This study investigated the relationship between epigenetic factors like chromatin structure and DNA methylation and genetic variation in the context of cancer”. While the authors indeed investigate methylation and TADs in the first part of the manuscript, the majority of the predictors focus on genetic loci and not epigenetics or their interaction. Arguably, one of the most interesting aspects of meQTLs in their capacity for risk prediction are their modulated methylation levels and potential to reflect the integration of genetic and dynamic non heritable factors (such as due to aging or lifestyle factors), but this was not looked at in detail. Could the authors comment on how meQTLs might be modulated by non heritable factors as well as genetic factors, and e.g. look into methylation at these sites in cancers or samples preceding cancer?

We agree with the reviewer that it is important to consider potential for interaction with non-heritable factors. We have modified our discussion to acknowledge the need to evaluate non-heritable factors as follows:

“There are also a number of non-genetic risk factors that act by modifying DNA methylation levels and which could interact with genetic regulation. These include aging, exercise, stress, diet and obesity, and a broad variety of environmental exposures. In our analysis, age had the highest impact on DNA methylation modulation, however, as age and sex were the only clinical factors for the majority of our study, future analysis of other non-genetic factors in relation to genetic regulators of DNA methylation are merited.”

- Along these lines, it might in the future be interesting to develop dynamic cancer risk predictors as opposed to static tools (such as the PRSs), which might be enabled by nongenetic ‘omics’. Could the authors discuss the potential of these and how their findings might contribute to this (i.e. how meQTLs might contribute to dynamic risk monitoring)?

This is a nice suggestion. We have added the following to the discussion:

“Future efforts could integrate dynamic methylation changes due to these non-genetic factors with static polygenic scores such as we describe here to provide a more accurate estimate of risk. This type of approach could benefit in particular from non-invasive biomarkers, such as cell free DNA methylation from blood, though studies will be needed to establish the cumulative effect of dynamic exposures and the extent to which they can be accurately evaluated from cell free DNA.”

Ref:
Yousefi, P.D., Suderman, M., Langdon, R. et al. DNA methylation-based predictors of health: applications and statistical considerations. Nat Rev Genet 23, 369–383 (2022). https://doi.org/10.1038/s41576-022-00465-w

Minor:
- The authors describe a PCA but do not show any figures or supporting data. Could the authors either add a statement that no data are shown in the text, or (preferred) provide these data in the supplementary information?

We have added the PCA figure as Supplementary Figure 1.

- Previous breast cancer PRSs (not based on meQTL) such as the PRS313 have already shown that they may be biased towards certain subtypes - it might be worth mentioning these prior models (e.g. Mavaddat et al 2019) when discussing the current study’s findings in context.

We have now included the following text:

“Indeed, classic polygenic risk scores for breast cancer have shown bias for predicting certain subtypes (Mavaddat et al). Lakeman et al 2020 demonstrated that women in the highest 1% of risk showed a 4.37-fold increased risk for ER-positive disease but only a 2.78-fold increased risk for ER-negative disease compared to the middle quintile, showing bias toward certain subtypes.”

Refs
Mavaddat N, Michailidou K, Dennis J, Lush M, Fachal L, Lee A, Tyrer JP, Chen TH, Wang Q, Bolla MK, Yang X. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. The American Journal of Human Genetics. 2019 Jan 3;104(1):21-34.

Lakeman, I.M.M., Rodríguez-Girondo, M., Lee, A. et al. Validation of the BOADICEA model and a 313-variant polygenic risk score for breast cancer risk prediction in a Dutch prospective cohort. Genet Med 22, 1803–1811 (2020). https://doi.org/10.1038/s41436-020-0884-4

- What was the ROC AUC (and 95% CI) of the cancer risk score (Figure 4A)?

The DRIVE AUC was 0.5534 (95% CI: [0.5505, 0.5563]). This has now been added to the manuscript. We note that this PRS is based solely upon meQTLs near driver genes and is not expected to be a strong predictor of risk relative to more comprehensive breast cancer polygenic scores. Rather, we sought to reproduce effects on risk attributable solely to driver meQTLs in an independent cohort.
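For readers who want to see how an AUC and its interval can be obtained, the following is a minimal sketch only. The response does not state how the confidence interval was computed, so the percentile bootstrap used here is an assumption, and the function names (`roc_auc`, `bootstrap_ci`) and the synthetic cohort are hypothetical.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties between a case and a control count as 0.5.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for d in controls:
            wins += 1.0 if c > d else 0.5 if c == d else 0.0
    return wins / (len(cases) * len(controls))


def bootstrap_ci(labels, scores, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, resampling individuals)."""
    n = len(labels)
    if not 0 < sum(labels) < n:
        raise ValueError("need at least one case and one control")
    rng = random.Random(seed)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one class entirely
            stats.append(roc_auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic cohort (not DRIVE data): a weak score shift in cases.
rng = random.Random(1)
labels = [i % 2 for i in range(100)]
scores = [rng.gauss(0.2 * y, 1.0) for y in labels]
auc = roc_auc(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores)
print(f"AUC = {auc:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

With a synthetic sample this small the interval is wide; interval width generally shrinks as cohort size grows.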

- Can the authors explain the discrepancy of the more ‘linear’ increase in risk in the UKBB compared to DRIVE (Figure 4b versus 4c)?

This discrepancy could be due to inherent differences in the composition of the two datasets. UKBB comes predominantly from volunteers in the UK and is more representative of disease incidence in the general population. The DRIVE study was designed specifically to analyze breast cancer risk, and therefore individuals were included either to represent breast cancer or to serve as non-breast cancer controls. Breast cancer status in the UKBB was defined based on ICD-10 codes, and we excluded individuals from the controls if they had any ICD-10 code associated with neoplasms, but the number of cases relative to controls was unbalanced for UKBB whereas it was balanced for DRIVE. The difference in linearity between the cohorts could be due to differences in diversity in genotype and phenotype in the UKBB cohort compared to the DRIVE cohort. There could also be discrepancies in environmental risk factors between these cohorts.

- Figure 6/TCGA survival: It might also be interesting to look at recurrence-free survival in addition to overall survival.

We now include an analysis of disease-free intervals as Figure SX. The methods have been updated accordingly.

- In the section Oncogenes and tumor suppressor gene-related (…): capitalise L in ‘Clumped cancer mQTls’

We have now corrected this.

- Figures:

- Figure 1b: Could the authors also indicate in the Figure legend that this is a Kruskal-Wallis p value.

We have updated the figure legend to reflect the test used.

- Figure 1c - For interpretability, it might be helpful to add at least y axis grid lines behind box plots. The effects may be significant and is visualised with the violin density plot, but are difficult to see using box plots.

We added gridlines as suggested.

- Figure 2 - takes some time to understand upon first reading. It might be helpful to label the blue bars in B and C with a legend ‘expected’ and the red dot as ‘observed’ to make it easier to grasp quickly.

We have updated the legends to read observed and expected.

- Figure 4: The text in the caption for A refers to a ‘Figure 8’ that does not exist. Please check this.

Thank you for catching this. The erroneous figure reference has been removed from the caption.
Competing Interests: No competing interests.

Reviewer Report 17 Jan 2024

Charu Mehta, Basic Sciences Division, Fred Hutchinson Cancer Center, Seattle, Washington, USA

Approved with Reservations

https://doi.org/10.5256/f1000research.152752.r233320

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
I cannot comment. A qualified statistician is required.

Are all the source data underlying the results available to ensure full reproducibility?
Partly

Are the conclusions drawn adequately supported by the results?
Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: My area of research is gene regulation. I am able to assess significant parts of this manuscript, however, I am not a statistician or computational biologist, so I cannot speak to the soundness of their methods.

Author Response 09 Aug 2025

Hannah Carter, Moores Cancer Center, La Jolla, CA 92093, USA

"To obtain independent meQTLs, we clumped related meQTLs based on linkage disequilibrium using PLINK. Out of the 1.2 million SNPs, 60,602 remained after LD pruning (Table 1)."

Comment: Cite the database used as a source of meQTLs? In the discussion, authors cited a database (46) but it is unclear if this is the same database they used to identify meQTLs.

The meQTLs were obtained from the Pancan-meQTL database, cited as reference 46 [Gong J, et al.: Pancan-meQTL: a database to systematically evaluate the effects of genetic variants on methylation in human cancer. Nucleic Acids Res. 2019; 47: D1066–D1072].

We have modified the text to clarify this as follows:

“To obtain independent meQTLs, we clumped related meQTLs from Gong et al (46) based on linkage disequilibrium using PLINK. Out of the 1.2 million SNPs, 60,602 remained after LD pruning (Table 1).”
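To make the clumping step concrete, here is a minimal sketch of greedy LD clumping. It is not PLINK's implementation, and the response does not report the PLINK parameters used: the `r2_threshold` and `window_bp` values, the SNP identifiers, and the pairwise r² table below are all hypothetical.

```python
def clump(snps, r2, r2_threshold=0.5, window_bp=250_000):
    """Greedy LD clumping (illustrative thresholds, not the authors' settings).

    snps: list of (snp_id, chrom, pos, p_value)
    r2:   dict mapping frozenset({id_a, id_b}) -> pairwise r^2;
          missing pairs are treated as r^2 = 0
    Returns the index SNP ids, one per clump, most significant first.
    """
    ranked = sorted(snps, key=lambda s: s[3])  # smallest p-value first
    claimed = set()
    index_snps = []
    for snp_id, chrom, pos, _ in ranked:
        if snp_id in claimed:
            continue  # already absorbed into an earlier clump
        index_snps.append(snp_id)
        claimed.add(snp_id)
        for other_id, other_chrom, other_pos, _ in ranked:
            if other_id in claimed:
                continue
            nearby = other_chrom == chrom and abs(other_pos - pos) <= window_bp
            linked = r2.get(frozenset((snp_id, other_id)), 0.0) >= r2_threshold
            if nearby and linked:
                claimed.add(other_id)
    return index_snps


# Hypothetical toy input.
toy_snps = [
    ("rs_a", "1", 1_000, 1e-8),
    ("rs_b", "1", 2_000, 1e-5),    # close to rs_a and in high LD with it
    ("rs_c", "1", 900_000, 1e-6),  # in LD with rs_a but outside the window
    ("rs_d", "2", 1_500, 1e-4),    # different chromosome
]
toy_r2 = {
    frozenset(("rs_a", "rs_b")): 0.9,
    frozenset(("rs_a", "rs_c")): 0.9,
}
independent = clump(toy_snps, toy_r2)
print(independent)
```

In this toy input `rs_b` is absorbed into the clump led by `rs_a`, while `rs_c` and `rs_d` remain as separate index SNPs.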

"(A) 5 state-based K-Means clustering of common TAD domains (n=1100) between 5 human cell lines (GM12878, HMEC, HUVEC, IMR90, and NHEK). Purple indicates TADs classified as a “Mixed”, Gray as “Inactive-1”, Light Blue as “Active-1”, Orange as “Active-2”, and Red as “Inactive-2”. Combining active and inactive categories leads to 222 Active, 626 Inactive, and 252 Mixed TADs."

Comment: what are the x- and y-axes labels in 1A? Where are the five cell types indicated?

The x-axis shows the 15 chromatin states that were used to cluster the TADs. The y-axis shows all 1100 TADs that were used for the analysis, and the 5 cell types are indicated through the legend shown on the left. The different colors represent the “Mixed”, “Inactive-1”, “Active-1”, “Active-2”, and “Inactive-2” clusters.

We have updated the figure legend as follows:

“(A) 5 state-based K-Means clustering of common TAD domains shared across 5 human cell lines (GM12878, HMEC, HUVEC, IMR90, and NHEK). Shared TAD domains are on the y-axis (n=1100) and are grouped according to 15 chromatin states (x-axis). K-means clusters are shown as a side bar along the y-axis. Purple indicates TADs classified as a “Mixed”, Gray as “Inactive-1”, Light Blue as “Active-1”, Orange as “Active-2”, and Red as “Inactive-2”. Combining active and inactive categories leads to 222 Active, 626 Inactive, and 252 Mixed TADs.”
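The clustering described in this legend can be illustrated with a small K-means sketch. This is a toy under stated assumptions, not the authors' pipeline: the legend describes 5 clusters over 15 chromatin states, whereas the example below uses 2 clusters over 2 made-up feature columns, and how the per-TAD features were actually computed is not given here.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; rows are per-TAD feature vectors."""
    rng = random.Random(seed)
    centers = [list(row) for row in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each TAD goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical features: (fraction in active states, fraction in inactive states).
toy_tads = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
toy_labels, toy_centers = kmeans(toy_tads, k=2)
print(toy_labels)
```

Cluster numbers returned by K-means are arbitrary, so names such as “Active-1” or “Inactive-2” would have to be assigned afterwards by inspecting which chromatin states dominate each cluster.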

Comment: What does ‘other’ mean in Table 1?

“Other” represents all meQTLs in the inter-TAD region that do not fall within the boundary region, which we defined as +/- 50kb around the TAD boundaries. We have added a note to the table legend to clarify this.

“Other indicates meQTLs that are in the inter-TAD region but do not fall within the boundary region as defined.”
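The three categories can be sketched as a simple position lookup. Only the +/- 50kb flank comes from the response. The function name, the toy coordinates, and the choice to let the boundary window take precedence over TAD membership are assumptions made for illustration.

```python
from collections import Counter

BOUNDARY_FLANK_BP = 50_000  # +/- 50 kb around each TAD boundary, as stated in the response


def classify_position(pos, tads, flank=BOUNDARY_FLANK_BP):
    """Classify one position on a single chromosome.

    tads: list of (start, end) TAD intervals on that chromosome.
    Assumption: a position within the flank of any TAD edge is 'Boundary',
    even if it also lies inside the TAD.
    """
    for start, end in tads:
        if abs(pos - start) <= flank or abs(pos - end) <= flank:
            return "Boundary"
    for start, end in tads:
        if start < pos < end:
            return "TAD"
    return "Other"  # outside every TAD and outside every boundary window


# Hypothetical TADs and meQTL positions.
toy_tads = [(1_000_000, 2_000_000), (2_300_000, 3_000_000)]
toy_positions = [1_500_000, 1_980_000, 2_040_000, 2_150_000]
toy_classes = [classify_position(p, toy_tads) for p in toy_positions]
print(Counter(toy_classes))
```

Here the position at 2,150,000 lies between the two toy TADs but more than 50 kb from either edge, so it is counted as “Other”.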

"Distributions suggested an increase in density of clumped meQTLs when transitioning from active to inactive regions, and conversely, a decrease from inactive to active regions (Kruskal-Wallis ANOVA, p-value<0.05) when compared to the randomly shuffled distribution, but no shift in density for Active-Boundary-Active and Inactive-Boundary-Inactive categories (Figure 2B-D)."

Comment: So what does any of this suggest??? expand?

We have tried to further expand our interpretation of this observation in the discussion as follows:

“It is of note that TAD boundaries conserved across cell types are reportedly highly enriched for evolutionary constraint and complex trait heritability (10). Our data suggest that variability in gene expression due to meQTLs is also evolutionarily more constrained in and around active TADs and their boundaries, consistent with these TAD boundaries playing a critical role in development (47). These results may suggest that TAD boundaries play a role in making the recruitment of regulatory machinery more specific, particularly as it pertains to DNA methylation.”

"In total, 103 oncogenes and 223 TSGs were used for this analysis, where only 67 of them contained meQTL-affecting CpG probes in their promoter regions (i.e. 49 TSGs and 18 oncogenes)."

Comment: This suggests CpGs affect meQTLs but it’s the other way round.

We changed “meQTL-affecting” to “meQTL-associated”.

"Clumped cancer meQTls were further narrowed to those associated with the methylation status of CpG probes located within the promoter regions of cancer driver genes including oncogenes and tumor suppressor genes (TSGs) from the COSMIC database.”

Comment: Also clarify if the correlation between methylation status of CpG vs meQTLs is also observed in normal tissues or only cancer tissues?

The meQTLs generated by Gong et al were based on methylation status at CpG markers as measured in bulk tumor tissues which generally include a mixture of tumor and normal cells (stroma and immune infiltrates). Nonetheless, it is not clear whether the effects detected here are biased toward effects on methylation that are positively selected in tumors which might not be reflected in normal tissues. This limitation is described in the discussion:

“First, the meQTLs utilized for this study are derived from a study of tumors (46), which could be biased toward detecting meQTLs associated with DNA methylation events that are positively selected in tumors.”

and

“In future studies, it would be of interest to study meQTL trends in normal tissue samples to see if enrichment patterns associated with cancer genes are driven by selection in tumors, or highlight evolutionary constraints more broadly associated with human health that coincidentally are advantageous for tumor development.”

We also believe that our original phrasing was confusing here. We have rephrased as follows:

"Clumped cancer meQTLs were further narrowed to those whose corresponding affected CpG probes were within the promoter regions of cancer driver genes including oncogenes and tumor suppressor genes (TSGs) from the COSMIC database.”

"Overall, cancer meQTLs near 29 cancer genes were included in the model. The most predictive driver meQTL was associated MSH2, a gene associated with Lynch syndrome and increased risk of breast cancer.

Polymorphic variation affecting the expression of EZH2, the second most informative feature, has also been linked to breast cancer risk. ASXL2 may be required for estrogen receptor alpha (ERα) activation in ERα positive breast cancers. Notably, EZH2 overexpression has been linked more strongly to triple negative breast cancer, suggesting that the model includes features predictive of multiple subtypes.”

Comment:
1. Since some of these meQTLs lie close to genes involved in epigenetic modifications --- have you looked if these are in the enhancer or otherwise defined regulatory domains?

The challenge here is that we do not know if the tag meQTL SNP is actually the causal SNP. We focused on meQTLs that affect CpG probes within the promoter regions of cancer genes but that does not preclude the possibility that they are affecting an enhancer. We have added this as a limitation of our study:

“For risk prediction, we focused on meQTLs and their corresponding CpG probes that overlap the promoter regions of known cancer genes; however, we cannot be sure that these meQTLs are not also affecting other genes in the region, for example through effects on enhancer activity.”

2. Are these genes (MSH2, EZH2, ASXL2) known to be upregulated or downregulated in these risk cases?
Does that agree with the prediction according to meQTLs?

Establishing the mechanism by which meQTLs drive risk would require tissue-relevant gene expression, methylation measurements and genotypes in a group of individuals prior to them developing breast cancer. In established breast cancers in TCGA, the relationship of gene expression with methylation is confounded by somatic alterations, copy number events, and general dysregulation of gene expression networks, making it difficult to determine what proportion of gene expression is attributable to meQTLs. We have added a statement about the need to further investigate meQTL effects on oncogene and tumor suppressor gene expression in healthy breast tissue to establish how these effects relate to cancer risk.

“More direct mechanistic insight might be gained by studying expression, genotype and methylation in healthy and pre-cancerous breast tissues and cell types. Studying the average expression of MSH2, EZH2, and ASXL2 within TCGA patients stratified by meQTL risk PRS suggested a potential decrease in expression of ASXL2 and EZH2 in the highest PRS quantile relative to the lowest, while MSH2 did not show much difference (Figure 5B). However, this difference needs to be studied further with more specific tumor sub-type stratification and cell type-specific expression.”
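The stratification described in this passage can be sketched as follows. This is an illustration under assumptions: the number of quantiles is not stated in the response (4 is used here), the function names are hypothetical, and the scores and expression values are invented rather than taken from TCGA.

```python
def assign_quantiles(scores, n_quantiles):
    """Rank-based quantile index per sample (0 = lowest scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * n_quantiles // len(scores)
    return bins


def mean_by_quantile(values, bins, n_quantiles):
    """Average a per-sample value (e.g. gene expression) within each quantile."""
    sums = [0.0] * n_quantiles
    counts = [0] * n_quantiles
    for value, b in zip(values, bins):
        sums[b] += value
        counts[b] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Invented per-patient risk scores and expression of one gene.
toy_prs = [0.1, 0.9, 0.4, 0.3, 0.8, 0.2, 0.7, 0.6]
toy_expression = [10.0, 4.0, 8.0, 9.0, 5.0, 11.0, 6.0, 7.0]

toy_bins = assign_quantiles(toy_prs, n_quantiles=4)
toy_means = mean_by_quantile(toy_expression, toy_bins, n_quantiles=4)
print(toy_bins)   # quantile index for each patient
print(toy_means)  # mean expression per quantile, lowest PRS first
```

Comparing the first and last entries of the per-quantile means corresponds to the lowest-versus-highest contrast described in the quoted text. The downward trend in this toy output is built into the invented data and says nothing about the real result.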

Figure 5 and caption have been updated as follows:

“A) Features are ranked according to their contribution to classifier predictive performance. Total importances sum to 1. B) Average expression of ASXL2, EZH2 and MSH2 in TCGA breast cancer samples, stratified by PRS quantile.”
Competing Interests: No competing interests.

COMMENTS ON THIS REPORT

Author Response 09 Aug 2025

Hannah Carter, Moores Cancer Center, La Jolla, CA 92093, USA

"To obtain independent meQTLs, we clumped related meQTLs based on linkage disequilibrium using PLINK. Out of the 1.2 million SNPs, 60,602 remained after LD pruning (Table 1)."

Comment: Cite the database used as a source of meQTLs? In the discussion, authors cited a database (46) but it is unclear if this is the same database they used to identify meQTLs.

The meQTLs were obtained from the Pancan-meQTL database, cited as reference 46 [Gong J, et al.: Pancan-meQTL: a database to systematically evaluate the effects of genetic variants on methylation in human cancer. Nucleic Acids Res. 2019; 47: D1066–D1072].

We have modified the text to clarify this as follows:

“To obtain independent meQTLs, we clumped related meQTLs from Gong et al (46) based on linkage disequilibrium using PLINK. Out of the 1.2 million SNPs, 60,602 remained after LD pruning (Table 1).”
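
For readers unfamiliar with clumping, the idea can be sketched as a greedy procedure: variants are visited from most to least significant, each unclaimed variant becomes an index variant, and nearby unclaimed variants in strong LD with it are folded into its clump. The sketch below is illustrative only. The function name, the toy r² lookup, and the thresholds (250 kb window, r² of at least 0.5) are assumptions; they are not the settings used in the manuscript, and the code is not PLINK's implementation.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Variant(NamedTuple):
    snp_id: str
    chrom: str
    pos: int      # base-pair position
    pval: float   # association p-value of the meQTL


def greedy_clump(
    variants: List[Variant],
    r2: Dict[FrozenSet[str], float],
    window_bp: int = 250_000,   # assumed window, not taken from the paper
    r2_threshold: float = 0.5,  # assumed LD threshold, not taken from the paper
) -> List[Variant]:
    """Return the index variants left after greedy LD-based clumping."""
    claimed = set()
    index_variants = []
    # Visit variants from most to least significant.
    for v in sorted(variants, key=lambda x: x.pval):
        if v.snp_id in claimed:
            continue
        index_variants.append(v)
        claimed.add(v.snp_id)
        # Fold nearby, correlated variants into this clump.
        for other in variants:
            if other.snp_id in claimed or other.chrom != v.chrom:
                continue
            if abs(other.pos - v.pos) > window_bp:
                continue
            if r2.get(frozenset((v.snp_id, other.snp_id)), 0.0) >= r2_threshold:
                claimed.add(other.snp_id)
    return index_variants


# Toy example with hypothetical SNP identifiers.
toy = [
    Variant("snpA", "1", 1_000, 1e-8),
    Variant("snpB", "1", 5_000, 1e-5),  # in LD with snpA
    Variant("snpC", "1", 9_000, 1e-6),  # not in LD with snpA
    Variant("snpD", "2", 1_000, 1e-4),  # different chromosome
]
toy_r2 = {frozenset(("snpA", "snpB")): 0.9, frozenset(("snpA", "snpC")): 0.1}
kept = [v.snp_id for v in greedy_clump(toy, toy_r2)]
print(kept)
```

In this toy case snpB is absorbed into the snpA clump, while snpC and snpD remain as independent index variants.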

"(A) 5 state-based K-Means clustering of common TAD domains (n=1100) between 5 human cell lines (GM12878, HMEC, HUVEC, IMR90, and NHEK). Purple indicates TADs classified as a “Mixed”, Gray as “Inactive-1”, Light Blue as “Active-1”, Orange as “Active-2”, and Red as “Inactive-2”. Combining active and inactive categories leads to 222 Active, 626 Inactive, and 252 Mixed TADs."

Comment: what are the x- and y-axes labels in 1A? Where are the five cell types indicated?

The x-axis shows the 15 chromatin states used to cluster the TADs. The y-axis shows all 1100 TADs used in the analysis, and the 5 cell types are indicated in the legend on the left. The colors represent the “Mixed”, “Inactive-1”, “Active-1”, “Active-2”, and “Inactive-2” clusters.

We have updated the figure legend as follows:

“(A) 5 state-based K-Means clustering of common TAD domains shared across 5 human cell lines (GM12878, HMEC, HUVEC, IMR90, and NHEK). Shared TAD domains are on the y-axis (n=1100) and are grouped according to 15 chromatin states (x-axis). K-means clusters are shown as a side bar along the y-axis. Purple indicates TADs classified as a “Mixed”, Gray as “Inactive-1”, Light Blue as “Active-1”, Orange as “Active-2”, and Red as “Inactive-2”. Combining active and inactive categories leads to 222 Active, 626 Inactive, and 252 Mixed TADs.”
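
As a rough illustration of the clustering described in this legend, each TAD can be represented as a vector of chromatin-state proportions and grouped with k-means. The sketch below is a plain Lloyd's algorithm written for readability. The function name, the deterministic initialisation from the first k points, and the toy data (3 states and 2 clusters instead of 15 states and 5 clusters) are assumptions and do not reproduce the authors' analysis.

```python
from typing import List, Sequence


def kmeans(points: Sequence[Sequence[float]], k: int, n_iter: int = 50) -> List[int]:
    """Plain Lloyd's k-means; returns one cluster label per point.

    Centroids start at the first k points so the sketch is deterministic
    (an assumption; real analyses typically use random restarts).
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels


# Toy TADs: fraction of each TAD covered by three hypothetical chromatin states.
toy_tads = [
    (0.9, 0.1, 0.0),  # mostly "active-like"
    (0.1, 0.8, 0.1),  # mostly "inactive-like"
    (0.8, 0.2, 0.0),
    (0.0, 0.9, 0.1),
]
tad_labels = kmeans(toy_tads, k=2)
print(tad_labels)
```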

Comment: What does ‘other’ mean in Table 1?

“Other” represents meQTLs in inter-TAD regions that fall outside the boundary region, which we defined as +/- 50kb around the TAD boundaries. We have added a note to the table legend to clarify this:

“Other indicates meQTLs that are in the inter-TAD region but do not fall within the boundary region as defined.”
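
The three location categories can be illustrated with a small position classifier. This is a sketch under the definition given above (boundary region = +/- 50 kb around each TAD boundary). The function name, the rule that a boundary label takes precedence over a TAD-interior label, and the toy coordinates are assumptions made for illustration.

```python
from typing import List, Tuple

BOUNDARY_HALF_WIDTH = 50_000  # +/- 50 kb around each TAD boundary


def classify_position(pos: int, tads: List[Tuple[int, int]]) -> str:
    """Label a position on one chromosome as 'boundary', 'TAD', or 'other'.

    `tads` holds (start, end) coordinates of TADs on the same chromosome.
    Assumption: a position near a boundary is labelled 'boundary' even if
    it also lies inside a TAD.
    """
    for start, end in tads:
        if abs(pos - start) <= BOUNDARY_HALF_WIDTH or abs(pos - end) <= BOUNDARY_HALF_WIDTH:
            return "boundary"
    for start, end in tads:
        if start <= pos <= end:
            return "TAD"
    # Inter-TAD position that is not within 50 kb of any boundary.
    return "other"


toy_tad_coords = [(1_000_000, 2_000_000), (2_500_000, 3_000_000)]
region_labels = [
    classify_position(p, toy_tad_coords) for p in (1_500_000, 2_030_000, 2_250_000)
]
print(region_labels)
```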

"Distributions suggested an increase in density of clumped meQTLs when transitioning from active to inactive regions, and conversely, a decrease from inactive to active regions (Kruskal-Wallis ANOVA, p-value<0.05) when compared to the randomly shuffled distribution, but no shift in density for Active-Boundary-Active and Inactive-Boundary-Inactive categories (Figure 2B-D)."

Comment: So what does any of this suggest??? expand?

We have tried to further expand our interpretation of this observation in the discussion as follows:

“It is of note that TAD boundaries conserved across cell types are reportedly highly enriched for evolutionary constraint and complex trait heritability (10). Our data suggest that variability in gene expression due to meQTLs is also evolutionarily more constrained in and around active TADs and their boundaries, consistent with these TAD boundaries playing a critical role in development (47). These results may suggest that TAD boundaries play a role in making the recruitment of regulatory machinery more specific, particularly as it pertains to DNA methylation.”

"In total, 103 oncogenes and 223 TSGs were used for this analysis, where only 67 of them contained meQTL-affecting CpG probes in their promoter regions (i.e. 49 TSGs and 18 oncogenes)."

Comment: This suggests CpGs affect meQTLs but it’s the other way round.

We changed “meQTL-affecting” to “meQTL-associated”.

“Clumped cancer meQTLs were further narrowed to those associated with the methylation status of CpG probes located within the promoter regions of cancer driver genes including oncogenes and tumor suppressor genes (TSGs) from the COSMIC database.”

Comment: Also clarify if the correlation between methylation status of CpG vs meQTLs is also observed in normal tissues or only cancer tissues?

The meQTLs generated by Gong et al were based on methylation status at CpG markers as measured in bulk tumor tissues which generally include a mixture of tumor and normal cells (stroma and immune infiltrates). Nonetheless, it is not clear whether the effects detected here are biased toward effects on methylation that are positively selected in tumors which might not be reflected in normal tissues. This limitation is described in the discussion:

“First, the meQTLs utilized for this study are derived from a study of tumors (46), which could be biased toward detecting meQTLs associated with DNA methylation events that are positively selected in tumors.”

and

“In future studies, it would be of interest to study meQTL trends in normal tissue samples to see if enrichment patterns associated with cancer genes are driven by selection in tumors, or highlight evolutionary constraints more broadly associated with human health that coincidentally are advantageous for tumor development.”

We also believe that our original phrasing was confusing here. We have rephrased as follows:

"Clumped cancer meQTLs were further narrowed to those whose corresponding affected CpG probes were within the promoter regions of cancer driver genes including oncogenes and tumor suppressor genes (TSGs) from the COSMIC database.”

"Overall, cancer meQTLs near 29 cancer genes were included in the model. The most predictive driver meQTL was associated MSH2, a gene associated with Lynch syndrome and increased risk of breast cancer.

Polymorphic variation affecting the expression of EZH2, the second most informative feature, has also been linked to breast cancer risk. ASXL2 may be required for estrogen receptor alpha (ERa) activation in ERa positive breast cancers. Notably, EZH2 overexpression has been linked more strongly to triple negative breast suggesting that the model includes features predictive of multiple subtypes.”

Comment:
1. Since some of these meQTLs lie close to genes involved in epigenetic modifications, have you looked at whether they are in enhancers or otherwise defined regulatory domains?

The challenge here is that we do not know if the tag meQTL SNP is actually the causal SNP. We focused on meQTLs that affect CpG probes within the promoter regions of cancer genes but that does not preclude the possibility that they are affecting an enhancer. We have added this as a limitation of our study:

“For risk prediction, we focused on meQTLs and their corresponding CpG probes that overlap the promoter regions of known cancer genes; however, we cannot be sure that these meQTLs are not also affecting other genes in the region, for example through effects on enhancer activity.”

2. Are these genes (MSH2, EZH2, ASXL2) known to be upregulated or downregulated in these risk cases?
Does that agree with the prediction according to meQTLs?

Establishing the mechanism by which meQTLs drive risk would require tissue-relevant gene expression, methylation measurements and genotypes in a group of individuals prior to them developing breast cancer. In established breast cancers in TCGA, the relationship of gene expression with methylation is confounded by somatic alterations and copy number events and general dysregulation of gene expression networks making it difficult to determine what proportion of gene expression is attributable to meQTLs. We have added a statement about the need to further investigate meQTL effects on oncogene and tumor suppressor gene expression in healthy breast tissue to establish how these effects relate to cancer risk.

“More direct mechanistic insight might be gained by studying expression, genotype and methylation in healthy and pre-cancerous breast tissues and cell types. Studying the average expression of MSH2, EZH2, and ASXL2 within TCGA patients stratified by meQTL risk PRS suggested a potential decrease in expression of ASXL2 and EZH2 in the highest PRS quantile relative to the lowest, while MSH2 did not show much difference (Figure 5B). However, this difference needs to be studied further with more specific tumor sub-type stratification and cell type-specific expression.”
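
As an illustration of the stratified comparison described above, the sketch below assigns samples to PRS quantile groups and averages a gene's expression z-score within each group. The choice of quartiles, the function name, and the toy values are assumptions made for illustration; they are not the authors' code or data.

```python
from statistics import mean, quantiles
from typing import Dict, List


def mean_expression_by_prs_quantile(
    prs: List[float], expression: List[float], n_groups: int = 4
) -> Dict[int, float]:
    """Average expression within PRS quantile groups (1 = lowest PRS)."""
    if len(prs) != len(expression):
        raise ValueError("prs and expression must have the same length")
    cuts = quantiles(prs, n=n_groups)  # n_groups - 1 cut points
    groups: Dict[int, List[float]] = {g: [] for g in range(1, n_groups + 1)}
    for score, expr in zip(prs, expression):
        # Group index = 1 + number of cut points strictly below the score.
        group = 1 + sum(score > c for c in cuts)
        groups[group].append(expr)
    return {g: mean(vals) for g, vals in groups.items() if vals}


# Toy data: eight hypothetical samples with PRS values and expression z-scores.
toy_prs = [1, 2, 3, 4, 5, 6, 7, 8]
toy_expr = [1.0, 0.8, 0.5, 0.4, 0.1, -0.2, -0.6, -1.0]
by_quantile = mean_expression_by_prs_quantile(toy_prs, toy_expr)
print(by_quantile)
```

Comparing the first and last groups of the returned dictionary corresponds to the "highest relative to lowest PRS quantile" contrast in the text.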

Figure 5 and caption have been updated as follows:

“A) Features are ranked according to their contribution to classifier predictive performance. Total importances sum to 1. B) Average expression of ASXL2, EZH2 and MSH2 in TCGA breast cancer samples, stratified by PRS quantile.”
Competing Interests: No competing interests.

Version 2

VERSION 2 PUBLISHED 01 Sep 2023

Open Peer Review

Reviewer Reports

Invited Reviewers                 1     2     3     4
Version 2 (revision), 24 Jul 25   read  -     read  read
Version 1, 01 Sep 23              read  read  -     -

Charu Mehta, Fred Hutchinson Cancer Center, Seattle, USA
Chiara Herzog, Universitat Innsbruck, Innsbruck, Austria
Busra Unal, Umraniye Training and Research Hospital, Istanbul, Turkey
Kosuke Yamaguchi, National Institute of Genetics, Misima, Japan

Reviewer Report

16 Jan 2026 | for Version 2

Kosuke Yamaguchi, Molecular Cell Engineering Laboratory, National Institute of Genetics, Misima, Shizuoka, Japan

Approved With Reservations

Figure 1A: The meaning of the X-axis label is unclear and should be explicitly defined.
Table 2: No false discovery rate (FDR) or q-value information is provided; only p-values associated with ICD-10 codes are shown. If these p-values are intended to represent FDR-adjusted values, they should be clearly labeled as such.
Table 2: The table legend describes beta values as correlation coefficients between meQTLs and promoter DNA methylation; however, it is not sufficiently clear from the table how these beta values should be interpreted in relation to the reported cancer risk and survival associations. Additional clarification would improve the readability of the table.
Figure 5B: All Y-axis labels are shown as "MSH2 Expression (Z-score)," which appears to be incorrect.
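
To make the FDR point concrete, the following is a minimal sketch of the Benjamini-Hochberg adjustment, one common way to derive q-values that could be reported alongside the raw p-values in Table 2. The function name and the toy p-values are illustrative assumptions; they are not values from the manuscript, and the authors may prefer a different correction.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


toy_p = [0.01, 0.04, 0.03, 0.20]  # hypothetical raw p-values
qvals = benjamini_hochberg(toy_p)
print([round(q, 4) for q in qvals])
```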

Adding gene expression comparisons between cancer and normal tissues for the genes listed in Table 2.
Explicitly annotating whether each gene in Table 2 is classified as a tumor suppressor gene or an oncogene.

Pan-cancer gene-related meQTLs, or
All identified meQTLs,
and directly comparing the prediction performance between TAD-associated meQTLs and these broader meQTL sets.

PMID: 10839546
PMID: 10839547
PMID: 12461525
PMID: 26257180
PMID: 30948436
PMID: 39180406

Is the work clearly and accurately presented and does it cite the current literature?
Partly

Is the study design appropriate and is the work technically sound?
Partly

Are sufficient details of methods and analysis provided to allow replication by others?
Partly

If applicable, is the statistical analysis and its interpretation appropriate?
Partly

Are all the source data underlying the results available to ensure full reproducibility?
Partly

Are the conclusions drawn adequately supported by the results?
Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Epigenetic, Cell biology, Molecular cell biology.

Reviewer Report

07 Jan 2026 | for Version 2

Busra Unal, Umraniye Training and Research Hospital, Istanbul, Turkey

Approved

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
I cannot comment. A qualified statistician is required.

Are all the source data underlying the results available to ensure full reproducibility?
Partly

Are the conclusions drawn adequately supported by the results?
Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

My area of research is hereditary cancer predisposition syndromes and disparities in genomic medicine application.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

12 Aug 2025 | for Version 2

Charu Mehta, Basic Sciences Division, Fred Hutchinson Cancer Center, Seattle, Washington, USA

Approved

Thanks for your email. I have attached just a couple of minor comments.

In Fig 5A: the meQTL associated with/near MSH2 has the highest “relative importance”. However, the authors mention that MSH2 expression levels are not correlated with cancer risk in TCGA patients. Although the authors mention this in the discussion, it would be useful to mention the possibility of long-range interactions between the meQTL and its target gene (e.g. through enhancer activity) here as well.
Missing word on page 8: “predictive driver meQTL was associated with MSH2…”

Competing Interests

No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

08 Feb 2024 | for Version 1

Chiara Herzog, Universitat Innsbruck, Innsbruck, Tyrol, Austria

Approved With Reservations

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
Yes

Are all the source data underlying the results available to ensure full reproducibility?
Yes

Are the conclusions drawn adequately supported by the results?
Partly

Competing Interests

C.H. is the author of a patent on epigenetics-based breast cancer risk prediction using meQTLs, the WID™-qtBC test.

Reviewer Expertise

epigenetics and cancer risk prediction

Author Response

09 Aug 2025

Hannah Carter, Moores Cancer Center, La Jolla, CA 92093, USA

The study by Goudarzi et al. provides interesting and new insights into the association of meQTLs and TAD regions of the genome, and investigates the capacity of meQTLs to predict cancer status and survival.
Overall the study is well done and clearly presented but I have a few comments and suggestions for improvement.

Major:
- In the introduction and discussion the authors state “This study investigated the relationship between epigenetic factors like chromatin structure and DNA methylation and genetic variation in the context of cancer”. While the authors indeed investigate methylation and TADs in the first part of the manuscript, the majority of the predictors focus on genetic loci and not epigenetics or their interaction. Arguably, one of the most interesting aspects of meQTLs in their capacity for risk prediction is their modulated methylation levels and potential to reflect the integration of genetic and dynamic non-heritable factors (such as aging or lifestyle factors), but this was not looked at in detail. Could the authors comment on how meQTLs might be modulated by non-heritable factors as well as genetic factors, and e.g. look into methylation at these sites in cancers or samples preceding cancer?

We agree with the reviewer that it is important to consider potential for interaction with non-heritable factors. We have modified our discussion to acknowledge the need to evaluate non-heritable factors as follows:

“There are also a number of non-genetic risk factors that act by modifying DNA methylation levels and which could interact with genetic regulation. These include aging, exercise, stress, diet and obesity, and a broad variety of environmental exposures. In our analysis, age had the highest impact on DNA methylation modulation; however, as age and sex were the only clinical factors for the majority of our study, future analysis of other non-genetic factors in relation to genetic regulators of DNA methylation is merited.”

- Along these lines, it might in the future be interesting to develop dynamic cancer risk predictors as opposed to static tools (such as the PRSs), which might be enabled by nongenetic ‘omics’. Could the authors discuss the potential of these and how their findings might contribute to this (i.e. how meQTLs might contribute to dynamic risk monitoring)?

This is a nice suggestion. We have added the following to the discussion:

“Future efforts could integrate dynamic methylation changes due to these non-genetic factors with static polygenic scores such as we describe here to provide a more accurate estimate of risk. This type of approach could benefit in particular from non-invasive biomarkers, such as cell free DNA methylation from blood, though studies will be needed to establish the cumulative effect of dynamic exposures and the extent to which they can be accurately evaluated from cell free DNA.”

Ref:
Yousefi, P.D., Suderman, M., Langdon, R. et al. DNA methylation-based predictors of health: applications and statistical considerations. Nat Rev Genet 23, 369–383 (2022). https://doi.org/10.1038/s41576-022-00465-w

Minor:
- The authors describe a PCA but do not show any figures or supporting data. Could the authors either add a statement that no data are shown in the text, or (preferred) provide these data in the supplementary information?

We have added the PCA figure as Supplementary Figure 1.

- Previous breast cancer PRSs (not based on meQTL) such as the PRS313 have already shown that they may be biased towards certain subtypes - it might be worth mentioning these prior models (e.g. Mavaddat et al 2019) when discussing the current study’s findings in context.

We have now included the following text:

“Indeed, classic polygenic risk scores for breast cancer have shown bias for predicting certain subtypes (Mavaddat et al). Lakeman et al 2020 demonstrated that women in the highest 1% of risk showed a 4.37-fold increased risk for ER-positive disease but only a 2.78-fold increased risk for ER-negative disease compared to the middle quintile, showing bias toward certain subtypes.”

Refs
Mavaddat N, Michailidou K, Dennis J, Lush M, Fachal L, Lee A, Tyrer JP, Chen TH, Wang Q, Bolla MK, Yang X. Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. The American Journal of Human Genetics. 2019 Jan 3;104(1):21-34.

Lakeman, I.M.M., Rodríguez-Girondo, M., Lee, A. et al. Validation of the BOADICEA model and a 313-variant polygenic risk score for breast cancer risk prediction in a Dutch prospective cohort. Genet Med 22, 1803–1811 (2020). https://doi.org/10.1038/s41436-020-0884-4

- What was the ROC AUC (and 95% CI) of the cancer risk score (Figure 4A)?

The DRIVE AUC was 0.5534, with a 95% CI of [0.5505, 0.5563]. This has now been added to the manuscript. We note that this PRS is based solely upon meQTLs near driver genes and is not expected to be a strong predictor of risk relative to more comprehensive breast cancer polygenic scores. Rather, we sought to reproduce effects on risk attributable solely to driver meQTLs in an independent cohort.
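
For readers who want to reproduce this kind of estimate, a ROC AUC with a percentile bootstrap 95% CI can be computed along the following lines. This is a minimal sketch and not the authors' code: the response does not state how the reported interval was obtained, and the function names (`roc_auc`, `bootstrap_auc_ci`), the number of resamples, and the toy data are assumptions made for illustration.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a random case outscores a random control.

    Uses the rank-sum (Mann-Whitney U) identity; tied scores get average ranks.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one control")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (resampling individuals with replacement)."""
    n = len(labels)
    if not 0 < sum(labels) < n:
        raise ValueError("need at least one case and one control")
    rng = random.Random(seed)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Toy data (hypothetical): 1 = case, 0 = control.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

With cohorts of very different size and case/control balance, as discussed for DRIVE and UKBB below, the width of such an interval depends strongly on sample size, so a narrow CI around a modest AUC is not contradictory.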

- Can the authors explain the discrepancy of the more ‘linear’ increase in risk in the UKBB compared to DRIVE (Figure 4b versus 4c)?

This discrepancy could be due to inherent differences in the composition of the two datasets. UKBB comes predominantly from volunteers in the UK and is more representative of disease incidence in the general population. The DRIVE study was designed specifically to analyze breast cancer risk, and therefore individuals were included to either represent breast cancer or serve as non-breast cancer controls. Breast cancer status in the UKBB was defined based on ICD-10 codes, and we excluded individuals from the controls if they had any ICD-10 code associated with neoplasms, but the number of cases relative to controls was unbalanced for UKBB whereas it was balanced for DRIVE. The difference in linearity between the cohorts could be due to differences in genotypic and phenotypic diversity in the UKBB cohort compared to the DRIVE cohort. There could also be discrepancies in environmental risk factors between these cohorts.

- Figure 6/TCGA survival: It might also be interesting to look at recurrence-free survival in addition to overall survival.

We now include an analysis of disease free intervals as Figure SX. The methods have been updated accordingly.

- In the section Oncogenes and tumor suppressor gene-related (…): capitalise L in ‘Clumped cancer mQTls’

We have now corrected this.

- Figures:

- Figure 1b: Could the authors also indicate in the Figure legend that this is a Kruskal-Wallis p value.

We have updated the figure legend to reflect the test used.

- Figure 1c - For interpretability, it might be helpful to add at least y-axis grid lines behind the box plots. The effects may be significant and are visualised with the violin density plot, but are difficult to see using box plots.

We added gridlines as suggested.

- Figure 2 - takes some time to understand upon first reading. It might be helpful to label the blue bars in B and C with a legend ‘expected’ and the red dot as ‘observed’ to make it easier to grasp quickly.

We have updated the legends to read observed and expected.

- Figure 4: The text in the caption for A refers to a ‘Figure 8’ that does not exist. Please check this.

Thank you for catching this. The erroneous figure reference has been removed from the caption.

Competing Interests

No competing interests.

Reviewer Report

17 Jan 2024 | for Version 1

Charu Mehta, Basic Sciences Division, Fred Hutchinson Cancer Center, Seattle, Washington, USA

Approved With Reservations

Is the work clearly and accurately presented and does it cite the current literature?
Yes

Is the study design appropriate and is the work technically sound?
Yes

Are sufficient details of methods and analysis provided to allow replication by others?
Yes

If applicable, is the statistical analysis and its interpretation appropriate?
I cannot comment. A qualified statistician is required.

Are all the source data underlying the results available to ensure full reproducibility?
Partly

Are the conclusions drawn adequately supported by the results?
Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

My area of research is gene regulation. I am able to assess significant parts of this manuscript, however, I am not a statistician or computational biologist, so I cannot speak to the soundness of their methods.

Author Response

09 Aug 2025

Hannah Carter, Moores Cancer Center, La Jolla, CA 92093, USA

"To obtain independent meQTLs, we clumped related meQTLs based on linkage disequilibrium using PLINK. Out of the 1.2 million SNPs, 60,602 remained after LD pruning (Table 1)."

Comment: Cite the database used as a source of meQTLs? In the discussion, authors cited a database (46) but it is unclear if this is the same database they used to identify meQTLs.

The meQTLs were sourced from the Pancan-meQTL database, cited as reference 46 [Gong J, et al.: Pancan-meQTL: a database to systematically evaluate the effects of genetic variants on methylation in human cancer. Nucleic Acids Res. 2019; 47: D1066–D1072].

We have modified the text to clarify this as follows:

“To obtain independent meQTLs, we clumped related meQTLs from Gong et al (46) based on linkage disequilibrium using PLINK. Out of the 1.2 million SNPs, 60,602 remained after LD pruning (Table 1).”
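
The general idea behind LD-based clumping can be illustrated with a short greedy procedure: rank variants by significance, keep the top one as an index variant, and discard nearby variants in high LD with it. The sketch below is only an illustration of that idea, not PLINK's implementation and not the authors' pipeline; the `clump` function, the r² threshold, the window size, and the toy variant names are all hypothetical, since the response does not state the parameters used.

```python
def clump(snps, r2, r2_threshold=0.5, window_bp=250_000):
    """Greedy LD clumping on a single chromosome.

    snps: list of (snp_id, position, p_value) tuples
    r2:   dict mapping frozenset({id_a, id_b}) -> pairwise r^2
          (pairs missing from the dict are treated as r^2 = 0)
    Returns the ids of the retained index variants, most significant first.
    """
    remaining = sorted(snps, key=lambda s: s[2])  # smallest p-value first
    index_snps = []
    while remaining:
        lead = remaining.pop(0)
        index_snps.append(lead[0])
        # Keep only variants that are far from the lead or in weak LD with it.
        remaining = [
            s for s in remaining
            if abs(s[1] - lead[1]) > window_bp
            or r2.get(frozenset((lead[0], s[0])), 0.0) < r2_threshold
        ]
    return index_snps


# Toy input (hypothetical ids, positions, p-values and r^2 values).
snps = [
    ("rsA", 100_000, 1e-8),
    ("rsB", 120_000, 1e-6),
    ("rsC", 150_000, 1e-5),
    ("rsD", 900_000, 1e-7),
]
r2 = {
    frozenset(("rsA", "rsB")): 0.9,  # close and in strong LD: rsB is absorbed by rsA
    frozenset(("rsA", "rsC")): 0.1,  # close but in weak LD: rsC is kept
    frozenset(("rsA", "rsD")): 0.8,  # strong LD but outside the window: rsD is kept
}
kept = clump(snps, r2)
print(kept)  # → ['rsA', 'rsD', 'rsC']
```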

"(A) 5 state-based K-Means clustering of common TAD domains (n=1100) between 5 human cell lines (GM12878, HMEC, HUVEC, IMR90, and NHEK). Purple indicates TADs classified as a “Mixed”, Gray as “Inactive-1”, Light Blue as “Active-1”, Orange as “Active-2”, and Red as “Inactive-2”. Combining active and inactive categories leads to 222 Active, 626 Inactive, and 252 Mixed TADs."

Comment: what are the x- and y-axes labels in 1A? Where are the five cell types indicated?

The x-axis shows the 15 chromatin states that are used to cluster the TADs. The y-axis shows all 1100 TADs that were used for the analysis, and the 5 cell types are indicated in the legend shown on the left. The different colors represent the “Mixed”, “Inactive-1”, “Active-1”, “Active-2”, and “Inactive-2” clusters.

We have updated the figure legend as follows:

“(A) 5 state-based K-Means clustering of common TAD domains shared across 5 human cell lines (GM12878, HMEC, HUVEC, IMR90, and NHEK). Shared TAD domains are on the y-axis (n=1100) and are grouped according to 15 chromatin states (x-axis). K-means clusters are shown as a side bar along the y-axis. Purple indicates TADs classified as “Mixed”, Gray as “Inactive-1”, Light Blue as “Active-1”, Orange as “Active-2”, and Red as “Inactive-2”. Combining active and inactive categories leads to 222 Active, 626 Inactive, and 252 Mixed TADs.”

Comment: What does ‘other’ mean in Table 1?

Other represents all meQTLs that are in the inter-TAD region that aren’t technically in the boundary region, since we defined the boundary region as +/- 50kb around the TAD boundaries. We have added a note to the table legend to clarify this.

“Other indicates meQTLs that are in the inter-TAD region but do not fall within the boundary region as defined.”
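
The region categories described above can be made concrete with a small classifier. This is a sketch under stated assumptions rather than the authors' code: it assumes TADs are given as (start, end) coordinates on a single chromosome, and it assumes the ±50 kb boundary window takes precedence over the TAD interior, which the response does not specify. The function name `classify_position` and the toy coordinates are hypothetical.

```python
BOUNDARY_HALF_WIDTH = 50_000  # +/- 50 kb around each TAD boundary, as stated in the response


def classify_position(pos, tads, half_width=BOUNDARY_HALF_WIDTH):
    """Return 'Boundary', 'TAD', or 'Other' for a base-pair position.

    tads: list of (start, end) tuples on one chromosome.
    Assumption of this sketch: a position inside a TAD but within the
    boundary window is labelled 'Boundary'.
    """
    for start, end in tads:
        if abs(pos - start) <= half_width or abs(pos - end) <= half_width:
            return "Boundary"
    for start, end in tads:
        if start < pos < end:
            return "TAD"
    return "Other"  # inter-TAD region outside every boundary window


# Toy example with two TADs separated by a 400 kb gap.
tads = [(1_000_000, 2_000_000), (2_400_000, 3_000_000)]
labels = [classify_position(p, tads) for p in (1_500_000, 2_030_000, 2_200_000)]
print(labels)  # → ['TAD', 'Boundary', 'Other']
```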

"Distributions suggested an increase in density of clumped meQTLs when transitioning from active to inactive regions, and conversely, a decrease from inactive to active regions (Kruskal-Wallis ANOVA, p-value<0.05) when compared to the randomly shuffled distribution, but no shift in density for Active-Boundary-Active and Inactive-Boundary-Inactive categories (Figure 2B-D)."

Comment: So what does any of this suggest??? expand?

We have tried to further expand our interpretation of this observation in the discussion as follows:

“It is of note that TAD boundaries conserved across cell types are reportedly highly enriched for evolutionary constraint and complex trait heritability (10). Our data suggest that variability in gene expression due to meQTLs is also evolutionarily more constrained in and around active TADs and their boundaries, consistent with these TAD boundaries playing a critical role in development (47). These results may suggest that TAD boundaries play a role in making the recruitment of regulatory machinery more specific, particularly as it pertains to DNA methylation.”

"In total, 103 oncogenes and 223 TSGs were used for this analysis, where only 67 of them contained meQTL-affecting CpG probes in their promoter regions (i.e. 49 TSGs and 18 oncogenes)."

Comment: This suggests CpGs affect meQTLs but it’s the other way round.

We changed “meQTL-affecting” to “meQTL-associated”.

"Clumped cancer meQTls were further narrowed to those associated with the methylation status of CpG probes located within the promoter regions of cancer driver genes including oncogenes and tumor suppressor genes (TSGs) from the COSMIC database.”

Comment: Also clarify if the correlation between methylation status of CpG vs meQTLs is also observed in normal tissues or only cancer tissues?

The meQTLs generated by Gong et al were based on methylation status at CpG markers as measured in bulk tumor tissues which generally include a mixture of tumor and normal cells (stroma and immune infiltrates). Nonetheless, it is not clear whether the effects detected here are biased toward effects on methylation that are positively selected in tumors which might not be reflected in normal tissues. This limitation is described in the discussion:

“First, the meQTLs utilized for this study are derived from a study of tumors (46), which could be biased toward detecting meQTLs associated with DNA methylation events that are positively selected in tumors.”

and

“In future studies, it would be of interest to study meQTL trends in normal tissue samples to see if enrichment patterns associated with cancer genes are driven by selection in tumors, or highlight evolutionary constraints more broadly associated with human health that coincidentally are advantageous for tumor development.”

We also believe that our original phrasing was confusing here. We have rephrased as follows:

"Clumped cancer meQTLs were further narrowed to those whose corresponding affected CpG probes were within the promoter regions of cancer driver genes including oncogenes and tumor suppressor genes (TSGs) from the COSMIC database.”

"Overall, cancer meQTLs near 29 cancer genes were included in the model. The most predictive driver meQTL was associated MSH2, a gene associated with Lynch syndrome and increased risk of breast cancer.

Polymorphic variation affecting the expression of EZH2, the second most informative feature, has also been linked to breast cancer risk. ASXL2 may be required for estrogen receptor alpha (ERa) activation in ERa positive breast cancers. Notably, EZH2 overexpression has been linked more strongly to triple negative breast suggesting that the model includes features predictive of multiple subtypes.”

Comment:
1. Since some of these meQTLs lie close to genes involved in epigenetic modifications --- have you looked if these are in the enhancer or otherwise defined regulatory domains?

The challenge here is that we do not know if the tag meQTL SNP is actually the causal SNP. We focused on meQTLs that affect CpG probes within the promoter regions of cancer genes but that does not preclude the possibility that they are affecting an enhancer. We have added this as a limitation of our study:

“For risk prediction, we focused on meQTLs and their corresponding CpG probes that overlap the promoter regions of known cancer genes; however, we cannot be sure that these meQTLs are not also affecting other genes in the region, for example through effects on enhancer activity.”

2. Are these genes (MSH2, EZH2, ASXL2) known to be upregulated or downregulated in these risk cases?
Does that agree with the prediction according to meQTLs?

Establishing the mechanism by which meQTLs drive risk would require tissue-relevant gene expression, methylation measurements and genotypes in a group of individuals prior to them developing breast cancer. In established breast cancers in TCGA, the relationship of gene expression with methylation is confounded by somatic alterations, copy number events, and general dysregulation of gene expression networks, making it difficult to determine what proportion of gene expression is attributable to meQTLs. We have added a statement about the need to further investigate meQTL effects on oncogene and tumor suppressor gene expression in healthy breast tissue to establish how these effects relate to cancer risk.

“More direct mechanistic insight might be gained by studying expression, genotype and methylation in healthy and pre-cancerous breast tissues and cell types. Studying the average expression of MSH2, EZH2, and ASXL2 within TCGA patients stratified by meQTL risk PRS suggested a potential decrease in expression of ASXL2 and EZH2 in the highest PRS quantile relative to the lowest, while MSH2 did not show much difference (Figure 5B). However, this difference needs to be studied further with more specific tumor sub-type stratification and cell type-specific expression.”

Figure 5 and caption have been updated as follows:

“A) Features are ranked according to their contribution to classifier predictive performance. Total importances sum to 1. B) Average expression of ASXL2, EZH2 and MSH2 in TCGA breast cancer samples, stratified by PRS quantile.”
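
The stratified comparison described for panel B (average expression per PRS quantile) can be sketched as follows. This is an illustrative outline, not the authors' analysis code: the number of quantile bins is an assumption (the response does not state it), and the function name `mean_expression_by_quantile` and the toy values are hypothetical.

```python
from statistics import mean


def mean_expression_by_quantile(prs, expression, n_bins=4):
    """Split samples into equal-sized PRS bins (lowest PRS first) and
    return the mean expression of one gene within each bin.

    Requires at least n_bins samples so that no bin is empty.
    """
    if len(prs) != len(expression):
        raise ValueError("prs and expression must have the same length")
    if len(prs) < n_bins:
        raise ValueError("need at least n_bins samples")
    order = sorted(range(len(prs)), key=lambda i: prs[i])
    n = len(order)
    means = []
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        means.append(mean([expression[i] for i in members]))
    return means


# Toy data (hypothetical): expression falls as PRS rises.
prs = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.6]
expr = [5.0, 1.0, 4.0, 2.0, 5.0, 1.0, 4.0, 3.0]
print(mean_expression_by_quantile(prs, expr))  # → [5.0, 4.0, 2.5, 1.0]
```

Comparing only the first and last entries of the returned list mirrors the "highest versus lowest quantile" contrast in the quoted text; as the authors note, such a contrast in tumor samples is confounded by somatic events and does not by itself establish mechanism.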

Competing Interests

No competing interests.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the paper's academic merit.

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

[1] Iyer JG, et al.: Response rates and durability of chemotherapy among 62 patients with metastatic Merkel cell carcinoma. Cancer Med. 2016; 5: 2294–2301.

[2] Gayther SA, et al.: Variation of risks of breast and ovarian cancer associated with different germline mutations of the BRCA2 gene. Nat. Genet. 1997; 15: 103–105.

[3] Chequin A, et al.: Antitumoral activity of liraglutide, a new DNMT inhibitor in breast cancer cells in vitro and in vivo. Chem. Biol. Interact. 2021; 349: 109641.

[4] Heyn H, et al.: Linkage of DNA methylation quantitative trait loci to human cancer risk. Cell Rep. 2014; 7: 331–338.

[5] Irizarry RA, et al.: The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nat. Genet. 2009; 41: 178–186.

[6] Esteller M, et al.: Promoter hypermethylation and BRCA1 inactivation in sporadic breast and ovarian tumors. J. Natl. Cancer Inst. 2000; 92: 564–569.

[7] Wolff EM, et al.: Hypomethylation of a LINE-1 promoter activates an alternate transcript of the MET oncogene in bladders with cancer. PLoS Genet. 2010; 6: e1000917.

[8] Jablonski KP, et al.: Contribution of 3D genome topological domains to genetic risk of cancers: a genome-wide computational study. Hum. Genomics. 2022; 16: 2.

[9] Dixon JR, et al.: Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012; 485: 376–380.

[10] McArthur E, Capra JA: Topologically associating domain boundaries that are stable across diverse cell types are evolutionarily constrained and enriched for heritability. Am. J. Hum. Genet. 2021; 108: 269–283.

[11] Akdemir KC, et al.: Somatic mutation distributions in cancer genomes vary with three-dimensional chromatin structure. Nat. Genet. 2020; 52: 1178–1188.

[12] Rao SSP, et al.: A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014; 159: 1665–1680.

[13] Nora EP, et al.: Spatial partitioning of the regulatory landscape of the X-inactivation centre. Nature. 2012; 485: 381–385.

[14] Li S, Peng Y, Panchenko AR: DNA methylation: Precise modulation of chromatin structure and dynamics. Curr. Opin. Struct. Biol. 2022; 75: 102430.

[15] Curradi M, Izzo A, Badaracco G, et al.: Molecular mechanisms of gene silencing mediated by DNA methylation. Mol. Cell. Biol. 2002; 22: 3157–3173.

[16] Gong J, et al.: Pancan-meQTL: a database to systematically evaluate the effects of genetic variants on methylation in human cancer. Nucleic Acids Res. 2019; 47: D1066–D1072.

[17] Tate JG, et al.: COSMIC: the Catalogue Of Somatic Mutations In Cancer. Nucleic Acids Res. 2019; 47: D941–D947.

[18] Elgart M, et al.: Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations. Commun. Biol. 2022; 5: 856.

[19] Sheehan M, et al.: Investigating the Link between Lynch Syndrome and Breast Cancer. Eur. J. Breast Health. 2020; 16: 106–109.

[20] Ma S-J, Liu Y-M, Zhang Y-L, et al.: Correlations of and gene polymorphisms with breast cancer susceptibility and prognosis. Biosci. Rep. 2018; 38.

[21] Park U-H, et al.: ASXL2 promotes proliferation of breast cancer cells by linking ERα to histone methylation. Oncogene. 2016; 35: 3742–3752.

[22] Wang X, et al.: Clinical and prognostic relevance of EZH2 in breast cancer: A meta-analysis. Biomed. Pharmacother. 2015; 75: 218–225.

[23] Mavaddat N, Michailidou K, Dennis J, et al.: Polygenic risk scores for prediction of breast cancer and breast cancer subtypes. Am. J. Hum. Genet. 2019; 104: 21–34.

[24] Lakeman IMM, Rodríguez-Girondo M, Lee A, et al.: Validation of the BOADICEA model and a 313-variant polygenic risk score for breast cancer risk prediction in a Dutch prospective cohort. Genet. Med. 2020; 22: 1803–1811.

[25] Walia V, et al.: Mutational and functional analysis of the tumor-suppressor PTPRD in human melanoma. Hum. Mutat. 2014; 35: 1301–1310.

[26] Schrama D, et al.: ERCC5 p.Asp1104His and ERCC2 p.Lys751Gln polymorphisms are independent prognostic factors for the clinical course of melanoma. J. Invest. Dermatol. 2011; 131: 1280–1290.

[27] Henríquez-Hernández LA, et al.: Single nucleotide polymorphisms in DNA repair genes as risk factors associated to prostate cancer progression. BMC Med. Genet. 2014; 15: 143.

[28] Zhu Y, et al.: Systematic analysis on expression quantitative trait loci identifies a novel regulatory variant in ring finger and WD repeat domain 3 associated with prognosis of pancreatic cancer. Chin. Med. J. 2022; 135: 1348–1357.

[29] Fu X, et al.: RFWD3-Mdm2 ubiquitin ligase complex positively regulates p53 stability in response to DNA damage. Proc. Natl. Acad. Sci. U. S. A. 2010; 107: 4579–4584.

[30] Dasgupta P, et al.: LncRNA CDKN2B-AS1/miR-141/cyclin D network regulates tumor progression and metastasis of renal cell carcinoma. Cell Death Dis. 2020; 11: 660.

[31] Pellegata NS, et al.: Human pheochromocytomas show reduced p27Kip1 expression that is not associated with somatic gene mutations and rarely with deletions. Virchows Arch. 2007; 451: 37–46.

[32] Theodoropoulos GE, et al.: Caspase 9 promoter polymorphisms confer increased susceptibility to breast cancer. Cancer Genet. 2012; 205: 508–512.

[33] Rodriguez-Ruiz ME, et al.: Apoptotic caspases inhibit abscopal responses to radiation and identify a new prognostic biomarker for breast cancer patients. Oncoimmunology. 2019; 8: e1655964.

[34] Walsh CS, et al.: ERCC5 is a novel biomarker of ovarian cancer prognosis. J. Clin. Oncol. 2008; 26: 2952–2958.

[35] Shuai W, et al.: ETNK1 mutation occurs in a wide spectrum of myeloid neoplasms and is not specific for atypical chronic myeloid leukemia. Cancer. 2023; 129: 878–889.

[36] Stoica C, Ferreira AK, Hannan K, et al.: Bilayer Forming Phospholipids as Targets for Cancer Therapy. Int. J. Mol. Sci. 2022; 23.

[37] Ahmed M, et al.: CRISPRi screens reveal a DNA methylation-mediated 3D genome dependent causal mechanism in prostate cancer. Nat. Commun. 2021; 12: 1781.

[38] Xia J-H, Wei G-H: Enhancer Dysfunction in 3D Genome and Disease. Cells. 2019; 8.

[39] Fudenberg G, Pollard KS: Chromatin features constrain structural variation across evolutionary timescales. Proc. Natl. Acad. Sci. U. S. A. 2019; 116: 2175–2180.

[40] Rovirosa L, Ramos-Morales A, Javierre BM: The Genome in a Three-Dimensional Context: Deciphering the Contribution of Noncoding Mutations at Enhancers to Blood Cancer. Front. Immunol. 2020; 11: 592087.

[41] Valton A-L, Dekker J: TAD disruption as oncogenic driver. Curr. Opin. Genet. Dev. 2016; 36: 34–40.

[42] Pagadala M, et al.: Germline modifiers of the tumor immune microenvironment implicate drivers of cancer risk and immunotherapy response. Nat. Commun. 2023; 14: 2744.

[43] Zhang P, et al.: Germline and Somatic Genetic Variants in the p53 Pathway Interact to Affect Cancer Risk, Progression, and Drug Response. Cancer Res. 2021; 81: 1667–1680.

[44] Sayaman RW, et al.: Germline genetic contribution to the immune landscape of cancer. Immunity. 2021; 54: 367–386.e8.

[45] Carter H, et al.: Interaction Landscape of Inherited Polymorphisms with Somatic Events in Cancer. Cancer Discov. 2017; 7: 410–423.

[46] Dworkin AM, et al.: Germline variation controls the architecture of somatic alterations in tumors. PLoS Genet. 2010; 6: e1001136.

[47] Li Q, et al.: Expression QTL-based analyses reveal candidate causal genes and loci across five tumor types. Hum. Mol. Genet. 2014; 23: 5294–5302.

[48] Li W, et al.: Cis- and Trans-Acting Expression Quantitative Trait Loci of Long Non-Coding RNA in 2,549 Cancers With Potential Clinical and Therapeutic Implications. Front. Oncol. 2020; 10: 602104.

[49] Akdemir KC, et al.: Disruption of chromatin folding domains by somatic genomic rearrangements in human cancer. Nat. Genet. 2020; 52: 294–305.

[50] Yousefi PD, Suderman M, Langdon R, et al.: DNA methylation-based predictors of health: applications and statistical considerations. Nat. Rev. Genet. 2022; 23: 369–383.

[51] Liu J, et al.: An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell. 2018; 173: 400–416.e11.

[52] Kazachenka A, et al.: Identification, Characterization, and Heritability of Murine Metastable Epialleles: Implications for Non-genetic Inheritance. Cell. 2018; 175: 1717.

[53] Inoue F, et al.: A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res. 2017; 27: 38–52.

[54] Bycroft C, et al.: The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018; 562: 203–209.

[55] Amos CI, et al.: The OncoArray Consortium: A Network for Understanding the Genetic Architecture of Common Cancers. Cancer Epidemiol. Biomark. Prev. 2017; 26: 126–135.

[56] Roadmap Epigenomics Consortium et al.: Integrative analysis of 111 reference human epigenomes. Nature. 2015; 518: 317–330.

[57] Purcell S, et al.: PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 2007; 81: 559–575.

[58] ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489: 57–74.

[59] Hall MA, et al.: PLATO software provides analytic framework for investigating complexity beyond genome-wide association studies. Nat. Commun. 2017; 8: 1167.

[60] Goudarzi S, Hcarter: cartercompbio/meQTLs: Initial release (v1.0.0). Zenodo. 2023.

[61] Carter H: Extended_Data_Figure_1.pdf. figshare. Figure. 2025.

[62] Carter H: Extended Data Figure 2. figshare. Figure. 2025.
