A putative role for genome-wide epigenetic regulatory mechanisms in Huntington ’ s disease : A computational assessment

Huntington's Disease (HD) is an incurable disease of the adult Background: brain. Massive changes in gene expression are a prominent feature. Epigenetic effects have been reported to be implicated in HD, but the role of chromatin is not well understood. We tested if the chromatin state of dysregulated genes in HD is affected at a genome-wide scale and examined how epigenetic processes are associated with CpG-island-mediated gene expression. Our general approach incorporates computational and functional Methods: analysis of public data before embarking on expensive wet-lab experiments. We compared the location in the genome of the genes that were deregulated in HD human brain, obtained from public gene expression data, to the location of particular chromatin marks in reference tissues using data from the ENCODE project. We found that differentially expressed genes were enriched in the Results: active chromatin state, but not enriched in the silent state. In the caudate nucleus, the most highly affected brain region in HD, genes in the active state were associated with transcription, cell cycle, protein transport and modification, RNA splicing, histone post-translational modifications and RNA processing. Genes in the repressed state were linked with developmental processes and responses related to zinc and cadmium stimulus. We confirmed that genes within CpG-islands are enriched among HD dysregulated genes in human and mouse in HD. Epigenetic processes were associated more with genes that overlap with CpG-islands than genes that do not. Our results suggest that massive transcriptional dysregulation in Conclusion: HD is not matched by large-scale relocation of gene activity, i.e. inactive chromatin regions are altered into actively expressed chromatin regions and vice versa. We expect that changes in epigenetic chromatin state might occur at the level of single genes (e.g. promoters, gene body) and scattered genomic sites (e.g. CTCF sites, enhancer regions) instead of large-scale genomic regions.

Huntington's Disease (HD) is a complex disease of the brain associated with massive changes in gene expression. The genetic cause was identified in 1993 1 , but no successful treatment has been found yet. HD is a dominantly inherited neurodegenerative disease that affects 1 -10/100.000 individuals, making it the most common heritable neurodegenerative disorder 2 .
Although HD is considered a monogenic disease 3 , extensive research since 1993 into the underlying pathology suggests that the disease mechanisms are more complex than originally considered. Transcriptional dysregulation is a widespread phenomenon in HD that can be observed well before the first clinical symptoms appear 4 . It suggests that mutant huntingtin causes a broad and complex cascade of downstream effects. There are several ways in which mutant huntingtin can interfere with the transcriptional machinery and alter gene expression 5,6 . Given the extent of the transcriptional changes, the question arises if epigenetic mechanisms that operate at a genome-wide scale are also involved in HD. There is increasing evidence for epigenetic mechanisms playing a role in HD in human and model systems. For instance, earlier computational analysis of HD gene expression data showed that expression is deregulated in large genomic regions, indicative of a coordinated genome-wide mechanism 7 . These studies did not determine whether genomewide alterations in gene expression are associated with changes in the composition of histone modifications.
More recently, H3K4me3 was shown to be enriched in 136 loci in an HD case/control study, which included genes that may affect the neuronal epigenome at large 8 . In HD mice, down-regulated genes were found to be associated with a selective decrease of H3K27ac and RNA Polymerase II 9 . Inhibitors of HDAC1, broad-range regulators of chromatin structure, have indeed been shown to be effective in HD mice 10 . The interplay of enzymes involved in post-translational histone modifications like histone deacetylaces (HDACs), may make chromatin less accessible in many places and therefore alter gene expression patterns. In Drosophila, the homolog of htt was found to suppress position effect variegation (PEV), possibly by influencing PEV modifier genes 11 .
A role for epigenetic mechanisms in HD is further corroborated by numerous neurodevelopmental and neurodegenerative disorders that have been associated with an altered chromatin structure 12-14 . Neuroepigenetics has therefore become a prime topic of interest 15 , with researchers seeking to identify epigenomic signatures and how they can contribute to brain health or brain disease 16 .
Considering the growing body of evidence that epigenetic regulation is involved in neurological pathology, we asked if the massive transcriptional dysregulation in HD coincides with large scale changes in chromatin state across the genome. Under that assumption, we hypothesize that significant numbers of differentially expressed genes in HD will be found in regions that are not normally associated with active chromatin states.
Here we report on a computational test of our hypothesis before laboratory experiments using only publicly available data sets: HD gene expression from the Gene Expression Omnibus (GEO) 17 and chromatin state from ENCODE 18 . We assessed the overlap between differentially expressed genes and chromatin state, and we applied literature-based concept profile analysis (CPA) to interpret our findings 19,20 .
Our results indicate that the massive transcriptional dysregulation in HD is not matched by a significant largescale change of activity of genes that are part of inactive chromatin states in reference tissue. Our report includes a functional characterization of differentially expressed genes in HD in relation to huntingtin, chromatin state and CpG islands, based on a literature-based semantic analysis.
The analysis we performed to test our hypothesis is part of an interdisciplinary research approach where computational analysis is used to help steer laboratory experiments to increase the overall efficiency of a research laboratory 21 .

Concept profile analysis
In Concept Profile Analysis (CPA), the vector space model is used to associate two concepts mined from the literature with each other. Advantages of this model include efficient and transparent comparisons, and the possibility of attaching a weight to the association 20 . The CPA algorithms have previously been used for a range of different gene expression data analysis purposes such as functional annotation 22 , comparison of studies 23 , prediction of novel interactions 24 , generation of gene sets 19,25 , and association with chemical structures 26 ). The methodology has been described previously 27 . In short: In our database, every concept is associated with PubMed records using the indexing engine Peregrine (https://trac.nbic. nl/data-mining/) which is equipped with an in-house thesaurus of biomedical and chemical concepts that have been prepared for text mining 28,29 . For all concepts except genes and Gene Ontology (GO) terms the PubMed records are comprised of the texts in which the concept is mentioned. For genes, only a subset of PubMed records are used in order to limit the impact of ambiguous terms and distant homologs. GO terms are sometimes given as words or phrases that are infrequently found in the normal texts. To still provide broad coverage of GO terms, the PubMed records that were used as evidence for annotating genes with the GO term are added. For every concept in the thesaurus that is associated to at least five PubMed records, a vector containing all concepts related to the main concept (direct co-occurrence), weighted by the symmetric uncertainty coefficient is created. We call this a "Concept Profile". Concept profiles are matched to identify similarities via their shared concepts (indirect relations). Any distance measure can be used for this matching such as the mutual information, inner product, cosine angle, Euclidean distance or Pearson's correlation. The CPA Web Services that we used for our analysis use an inner product measure 30 . These web services can be found in the BioCatalogue web service registry https://www.biocatalogue.org/services/3559.

Data analysis and interpretation
For data analysis and interpretation we implemented a series of workflows using the Taverna  Workflows for data analysis The first Taverna workflow load_data_identify_DE_genes_Array_ A, was implemented to examine differential gene expression between control and HD samples. Required workflow inputs were two data files with gene expression values and the phenotype information that describes the samples from the microarray experiment. Differential expression was computed using moderated t statistics with the package limma 35 (version 3.14.1), which is provided by the bioconductor project 36 , R version https://www.bioconductor. org/. We analyzed each brain region separately, because previous analysis 37 revealed regional patterns in gene expression. The workflow maps the expression data from probes to entrez gene ids using the Affymetrix Human Genome U133 set annotation data, (packages hgu133a and hgu133b, version 2.8.0; https://bioconductor.org/ packages/hgu133a.db; https://bioconductor.org/packages/hgu133b. db/). When multiple probe names map to the same gene id, the ones exhibiting the most significant changes were used for further analysis. Final outcome of this workflow is a report. Each row is composed of a gene id, a fold change and its corresponding P-value indicating the significance of every change in gene expression, between HD and controls for each brain region. Adjusted P-values, generated by Benjamini and Hochberg's method for multiple testing correction, are also included 38 . This workflow can be adjusted to compute differential gene expression between other variables such as male/female or the grade of disease pathology by editing the nested workflow "compute_DE_limma" within the R workflow component. An additional workflow create_exprs_obj_down-load_files is included in the myExperiment pack that was used to download data from the ArrayExpress repository. The workflow saves the gene expression data and the corresponding phenotype file in the directory indicated in the workflow input.
We note that this particular microarray experiment was composed of two microarrays, Human Genome U133A and U133B. For convenience we added the workflow: Get differentially expressed genes for Array B one brain region, but in principle the first workflow could be reused by adjusting the "libraries" component.
The second workflow map_genes_on_chromosome uses the output from the first workflow in order to map genes to their corresponding genomic location. The workflow uses the Biomart 39 service within R, to obtain information regarding the position of each gene at the chromosome, HGNC gene symbols 40 , transcription start and end site and the transcription strand. The database that was used was the Ensembl genes 68 from Sanger institute and the Homo Sapiens dataset GRCh37.p8. The mouse assembly that was used to map genes to their chromosomal location was Dec 2011, GRCm38mm10.
The last workflow get_promoter_region_calculate_overlaps, first computes a promoter region for each gene and then operates on genomic intervals to compute gene promoters that overlap with a genomic region. The promoter region is computed for each gene, according to prespecified values, indicating the number of base pairs (bp) upstream and downstream of the transcriptional start site (TSS); for the CpG island analysis 5000bp upstream and 2000bp downstream and for the chromatin states analysis 50bp upstream and 50bp downstream was used. The decision for the promoter size in each case was taken after discussing with the domain experts and from knowledge acquired from previous experiments using the ENCODE data from Ernst et al. 41 . Using those data we performed multiple runs with different input values for the "upstream" and "downstream" variables, and the overlap parameter (Supplementary File 1).
When genes had multiple transcription start sites, we computed a promoter region for every TSS. Next part of this workflow is to compute overlapping regions between the input datasets. It includes a two sample Kolmogorov -Smirnov test to compare the empirical cumulative distribution functions (ecdf) of the P-values between the gene promoters that overlap with a specific genomic region and the ones that do not. The null hypothesis tested here was that there is no difference between the two groups. The workflow returns two lists of genes, one for the genes that overlap with a particular genomic region and another that does not. Furthermore, the results of the statistical test are reported: the ks test statistic (maximum distance D between the ecdf of the two samples) and the P-value of the test.
If P-value < 0.05 we reject the null hypothesis.

Workflows for data interpretation
The workflows that we implemented for gene interpretation and gene prioritization are based on the workflow pack at Zenodo, https://doi.org/10.5281/zenodo.164198 (see also, at myExperiment: http://www.myexperiment.org/packs/368), and are implemented using CPA web services 30 .
The CPA workflow Annotate gene list with top ranking concepts annotates a gene list with top ranking concepts by matching concept profiles of genes with for example in our case the concept set of Biological Processes. The web services part of this workflow query the Anni database 27 that stores the concept profiles for each concept of interest. The first web service mapDatabaseIDListToConceptIDs maps a list of concepts, in our case Entrez gene identifiers, to their corresponding concept profile ids. Necessary inputs are a concept list (gene list) in a comma separated file, and the database identifier of the gene list necessary for the mapping (EG for Entrez Gene, see here: https://www.biocatalogue.org/soap_operations/41197 for more details on database identifiers). The next web service getSimilarConceptProfilesPredefined matches our gene list with the predefined concept set "Biological Process" (ID = "5"), and gives the top scoring biological processes that describe our gene list. For a complete list of predefined concept sets, the workflow List Concept Sets (provided in the current pack) can be run to choose the ID of the predefined concept set of interest. The web service getConceptName, gives the complete (human readable) names of the top matching biological processes. Lastly, the workflow Explain score between two concepts can be included to the analysis to provide evidence for the association between each gene and the annotations of the biological processes. The evidence reported is a list of concepts that link one concept with another and the contribution to the overall strength of the association. In addition, the corresponding concept ids are reported.
The workflow Prioritize gene list can prioritize a set of genes with respect to their association with particular concepts, in our case the HTT concept and epigenetics (concept profile: "epigene"). In order to obtain the concept profile identifiers the workflow getConceptSuggestionsFromTerm needs to run first.
Data obtained from public sources Human brain data. The HD human brain data that was used in this analysis was originally produced and analyzed by A. Hodges and co-workers 37 . This experiment contains 44 HD positive cases and 36 age and sex matched controls. The processed data are available from the public repository NCBI Gene Expression Omnibus, entry GSE3790. Three brain regions were included and analyzed; the caudate nucleus, frontal cortex and cerebellum, with an Affymetrix Microarray GeneChip (Human Genome U133A and U133B). Furthermore, the HD positive cases were further classified based on whether symptoms were present or absent and according to Vonsattel grade of disease pathology (scale = 0 -4). In our analysis we used the processed data and performed our own differential gene expression analysis (Dataset 1 42 ) with the workflow that performs differential gene expression analysis that was described previously in Methods. This folder contains the gene expression data for the three human brain regions and the three mouse brain regions. This folder contains the CpG island information both for human and the mouse data.
Chromatin states data. The chromatin marks were obtained from the encode project 45 . The chromatin states were part of an integrative analysis of 111 reference human epigenomes profiled for histone modification patterns based on DNA accessibility, DNA methylation and RNA expression. We used the two cell types that were more suitable for our analysis; the anterior caudate and dorsolateral prefrontal cortex (Dataset 3 46 and Dataset 4 47 , respectively). The chromatin states we used for the current analysis were: active transcription start site proximal promoter (TssA), bivalent regulatory region (TssBiv), heterochromatin (Het) and repressed Polycomb (ReprPC). Mouse data. The mouse brain data was taken from a published study 10 . There, total RNA was extracted from cortex, striatum, and cerebellum from WT and R62 transgenic mice. This study examines the effect of the HDACi 4b inhibitor on the disease phenotype. However only the data from animals treated with vehicle was used (Dataset 1 42 ). This study used Illumina Mouse Mouseref-8 Expression Beadchips v1. The raw data were analyzed using the Bioconductor packages and contrast analysis of differential expression was performed by using the LIMMA package. The differential expression values are available in the supplemental material of that publication.

Chromatin state analysis and semantic interpretation
To test if and how differential gene expression in HD is associated with particular chromatin states we used publicly available datasets from the GEO and ENCODE public repositories. We selected HD gene expression data from three regions of the brain made available by Hodges and coworkers 37 , and data from the ENCODE consortium carrying information about four chromatin states. These were: active TSS proximal promoter (TssA), bivalent regulatory region (TssBiv), heterochromatin (Het) and repressed Polycomb (ReprPC) 45 . Briefly, the first state pertains to active genes, the second to repressed genes that are ready to be activated, and the latter two represent repressed genes. The chromatin states were part of an integrative analysis of 111 reference human epigenomes profiled for histone modification patterns, DNA accessibility, DNA methylation and RNA expression. The different chromatin states were defined by a computational model that is based on a multivariate Hidden Markov Model 48 .
First we confirmed the results reported by the previous study 37 : large and numerous changes in transcriptional activity in the caudate nucleus brain region and notably smaller changes in the frontal cortex and cerebellum. More specifically, the number of differentially expressed genes was 5219, 127 and 96 for each brain region respectively with a FDR of 0.05. These results confirm previous observations illustrating that specific HD-affected brain regions exhibit defined changes in gene expression, in line with observable physiological effects 37 .
Secondly, we paired the data describing gene expression changes in the caudate nucleus and frontal cortex from the Hodges study to chromatin state data from the anterior caudate and dorsolateral prefrontal cortex from ENCODE. We considered these brain regions as the most comparable between the two studies. The cerebellum was excluded from this part of our analysis due to the absence of chromatin state data at ENCODE for the cerebellum. We determined the reference chromatin state of the genes from the gene expression study by determining the overlap of their promoter regions with the start and end positions of the chromatin state from the aforementioned brain regions used by ENCODE. The effect of a chromatin state on differential gene expression in HD was assessed by comparing the distribution of expression levels of genes overlapping with a particular chromatin state with the distribution of expression of non-overlapping genes. If chromatin state has substantial functional impact on the genes that are differentially expressed in HD, then these are expected to be significantly different. More details on the promoter region calculation and the overlap between the genomic regions can be found in the Methods section.
Specifically, we compared the distribution of the P-values for differential expression between these two groups of genes and assessed the difference by a KolmogorovSmirnov test, in the two brain regions and for each chromatin state ( Figure 1). In the caudate region, we found an enrichment of genes overlapping with the active TSS and a depletion of genes overlapping with ReprPC ( Figure 1). Heterochromatic or bivalent chromatin were not significantly associated with a genomic location of differentially expressed genes. A similar, but less pronounced pattern was observed for the frontal cortex for both the active TSS and ReprPC. In addition, in the frontal cortex the number of genes overlapping with the bivalent state was reduced. The magnitude of the association of a brain region with each chromatin state is in line with the HD neurodegeneration pattern, where the caudate nucleus exhibits the largest gene expression changes while the frontal cortex exhibits an intermediate to low pathology. In summary, we found enrichment of differentially expressed genes in the active chromatin state of reference tissue, but no strong evidence for significant enrichment in chromatin regions that are associated with repressed gene expression activity.
To further interpret the results from the chromatin state analysis, we used literature information (CPA 19,27 ) to assess the biological For each brain region, the P-value distribution for differential expression in HD patients compared to controls was compared between genes overlapping with a specific chromatin state and all other genes, in the anterior caudate and dorsolateral prefrontal cortex cells. The plot displays the maximum absolute distance D between the cumulative distribution function of the processes within the lists of genes per chromatin state that were associated with gene deregulation (Dataset 5 49 ).
Using CPA, we found that for the caudate nucleus, genes in the active TSS state are mainly associated with processes related to transcription, cell cycle, protein transport and modification, RNA splicing, chromatin modifications and RNA processing. Genes in the repressed polycomb chromatin state are mainly associated with (brain) developmental processes and responses related to zinc and cadmium stimulus.
For the frontal cortex, genes in the active TSS state are associated with protein transport and modification, cell cycle, RNA splicing and signal transduction pathways (MAPK, notch, smoothened). Genes in the bivalent TSS are associated with brain development, neurogenesis, synapse and responses related to zinc and cadmium stimulus. We found that the genes that are part of the polycomb repressed group to be associated with similar functions as the genes in the bivalent state: (brain) developmental processes, responses related to zinc, cadmium and copper stimuli.
We note that at this point of our analysis, we filtered out genes of which the promoters were labelled with more than one state, using the criterion of at least 50 bps overlap in a promoter region of 100 bps. Interpretation of this set of genes would be ambivalent.

Overlap of HD deregulated genes with CpG islands and semantic interpretation
Because CpG island methylation is a known epigenetic regulatory mechanism that is ubiquitous in the human genome and may be a target for genome-wide regulation 50,51 , we also applied our approach to test if genes within CpG islands are overrepresented among differentially expressed genes in HD. We measured this by similar KS test statistics as in the previous section. We found that genes overlapping with CpG-islands in their promoter region, were significantly enriched in the group of HD-deregulated genes in all three brain regions (Figure 2A; P-value < 0.05), which supports taking this mechanism into account to formulate hypotheses when studying gene deregulation in HD.
To further analyze and verify the association of gene deregulation in HD with the presence of CpG islands, we also analysed overrepresentation in data obtained from a HD transgenic mouse model 10 . Our results based on the mouse data corroborated those from the human data ( Figure 2B). Comparable to the results for caudate nucleus in human data, the most highly affected mouse brain region, the striatum, showed the biggest overrepresentation of CpG islands among deregulated genes.
We next examined the biological processes that are associated with the differentially expressed genes that do or do not overlap with CpG islands by CPA. We inspected the top 50 annotations of each group. We identified many similarities between these two groups, but also annotations that were specific to each group (Dataset 6 52 ). For example, concepts related to chromatin alterations, such as chromatin remodelling and chromatin (dis)assembly, and histone post-translational modifications, such as (de)acetylation, and (de)methylation, were only found with a high rank in the list of genes containing CpG islands in their promoters. Conversely, lymphocyte activation, angiogenesis, antigen presentation and neurogenesis were only high ranking associations for the non-CpG containing genes.
Some annotations were found in both groups but in a different ranking order. For example, the rankings of gene silencing, RNA splicing and phosphorylation were increased for genes within CpG islands while transcriptional activation and mitotic cell cycle rankings were higher for non-CpG containing genes (Dataset 6 52 ). The top 50 annotations characterising each group of genes with CpG islands and without, resulted from the semantic analysis of those two groups of genes with Biological Processes.
Semantic analysis to identify proteins associated with HTT and epigenetics CPA can also be used to prioritize findings by a specific biological interest. In our case, we aimed to prioritise the list of genes overlapping with CpG islands that were also differentially expressed in HD caudate nucleus by their association with HTT and epigenetics. Here, we prioritised 100 proteins based on their association with HTT and epigenetics and absence of a direct relation with HTT by CPA (Dataset 7 53 ). Such relations are typically novel, or lost in tables or figures. If a novel relation is found (i.e. the relation is not found in our database of relations in MedLine abstracts), then CPA also provides intermediate concepts that link the two concepts.
In Table 1, we present the top 5 novel proteins that have the strongest association, and the intermediate concepts that link each prioritized protein to HTT and epigenetics. The intermediate concepts are grouped under the semantic categories "General", "Biological Processes", "Disease or Syndrome", "Homo Sapiens proteins" and "Molecular Functions".
Grouping the evidence provides a more complete insight into the processes involved in gene deregulation in HD mediated by CpG islands. For example, for one of our candidates, the amyloid beta (A4) precursor protein-binding, family A, member 1 (APBA1), we found nerve tissue (General), endocytosis (Biological Process), Alzheimer's Disease (disease or syndrome), GRIN2B (H. sapiens Genes) and kinesin activity (Molecular Function) as intermediate links with HTT. This suggests that mechanisms involving APBA1 in HD share common components with the mechanisms involved in endocytosis and Alzheimer's Disease and those involving GRIN2B and kinesin activity. Intermediate links with epigenetics were respectively: epigene (General), methylation (Biological Process), hypomyelination and congenital cataract (disease or syndrome), CDKN2A (H. sapiens Genes) and MGMT -O6-alkylguanine DNA alkyltransferase -(Molecular Function). Accordingly, these concepts provide suggestions about the epigenetic role of APBA1, which can be taken into account when further studying the role of APBA1 in HD. Top 100 novel proteins, resulted from the semantic analysis associated with HTT and epigenetics.

Validation of concept profile gene prioritization
We next investigated whether the prioritized genes reflect valid biological knowledge. Figure 3 shows that CPA is able to prioritize true associations with huntingtin as measured by a gene expression experiment, but that combining experimental (differential expression) measurements and literature evidence enables to select even more specific HD signatures. We used our concept profile technology to match all genes in our database that have a concept profile (12,391 genes) to the "huntingtin" concept profile (black line). We then compared the distribution of those CPA scores to the CPA scores of genes that were found to be differentially expressed in the caudate nucleus (p value < 0.05). We included two gene lists in our analysis: the top 100 most differentially expressed genes (red line) and the top 1000 (green line). The shift in the distribution of CPA match scores between the differentially expressed genes (top 100 and top 1000) and the scores of all genes reflects the added value of CPA (CPA scores of top 100 and top 1000 can be found in Dataset 8 54 ). We found a significant shift in the scores when comparing all CPA scores from our database with the top 100 differentially expressed genes: p = 2.67e−08 and the top 1000 p < 2.2e − 16.
Also the top 100 and top 1000 differ significantly (p = 0.03184), showing that it is useful to narrow down on the top ranks for follow-up research. In principle, more extreme p-values are associated with higher CPA scores. In addition, to show that our list of 100 prioritized gene-HTT CPA match scores would not have been found by chance, we assessed the percentile score of our list when compared to the frequency distribution of 100 match scores of randomly sampled gene-HTT concept pairs out of the 12,391 genes in our concept profile database (Dataset 9 55 ). All genes were in the top 95 percentile, except NTRK3 (55 percentile).  This folder contains the CPA scores per gene list (top 100 differentially expressed genes in the caudate nucleus and top 1000 differentially expressed from the same brain region) that were obtained by matching the gene lists against the HTT concept profile. This document presents the percentile score of the prioritized gene list when compared to the frequency distribution of 100 match scores of randomly sampled gene-HTT concept pairs out of the 12,391 genes in our concept profile database.

Discussion
The computational analysis with public data that we present in this paper shows that there is no strong evidence for genome-wide relocalization of gene activity to repressed chromatin states, at least not at a scale that could explain the massive transcriptional deregulation that we observe in HD. Most of the deregulated genes mapped to the active chromatin state of our reference tissue and were underrepresented in silenced states of chromatin ( Figure 1). Previous reports supported the implication of large scale chromatin alterations in gene deregulation in HD. For instance, Anderson et al. reported that gene expression is deregulated in large genomic regions in blood and post mortem tissue of HD patients 7 . The authors inferred a relation with repressed and active chromatin, but did not use chromatin annotation data directly. Our results suggest that the association with genome clusters is mostly within the active state and does not extend to disruption of chromatin states at large scales.
Active chromatin is normally more prone to regulation and deregulation. Our results suggest that the epigenetic mechanisms that have been observed in HD are mainly bound to this fraction of chromatin. We speculate that the effects are more closely associated with transcription regulation of individual genes, than with large scale higher-order rearrangements of chromatin structure 56 . Our CPA analysis of deregulated genes with CpG islands corroborates this suggestion: chromatin-related concepts from our CPA ranked high in this set of genes, while CpG-islands are mostly known for their direct role in transcription initiation of the proximal gene. Nevertheless, CpG-mediated changes in gene expression are a common mechanism for many genes and deregulation of this mechanism is thus likely to have a genome-wide effect. It was unexpected that the differentially expressed genes are significantly depleted in the bivalent chromatin state in the prefrontal cortex. The bivalent state is expected to be associated with genes that are prone to become active. Our expectation was that deregulated genes in this brain region would be weakly enriched in this state, as we did find for the caudate nucleus. We currently lack a good explanation for this result.
We incorporated the computational analysis on public data in our research strategy to advise on next experiments. However, working with public data has its limitations. Here, we had to rely on reference tissue for the chromatin state in order to compare it with gene expression. In addition, the reference tissue chromatin state was measured in healthy individuals. Therefore, we cannot rule out that large scale chromatin effects will be observed when chromatin state and differential expression are measured together in new HD specific experiments. This could reveal new evidence for chromatin state deregulation in HD and give insight in the relationship with transcriptional deregulation, possibly at higher resolution than we could achieve in our current study. However, our results suggest that this is not very likely, leading us to advise fellow HD researchers to prioritise experiments that assess the role of epigenetic mechanisms in HD at the scale of individual genes or small clusters of genes, and within the active fraction of chromatin.
Our results can be used to refine hypotheses about molecular mechanisms involved in HD. For instance, we surmise that the reported association of DNA methylation and chromatin organisation, and the effects of HDAC on the HD phenotype in mice 57 are bound to the active fraction of chromatin. Furthermore, it appears that CpG islands located within the promoter region of a gene increase the probability that genes in such genomic regions are deregulated in HD. This is in accordance with a study where changes in DNA methylation were observed in cells expressing mutant huntingtin 58 . This in turn suggests that DNA methylation in promoters is implicated in alterations in the brain, which is in accordance with a study that noted changes in DNA methylation in cells expressing mutant huntingtin 58 . Based on our semantic analysis chromatin remodelling, chromatin (dis)assembly, and histone modification were associated with altered gene expression profiles. In contrast, non-CpG containing genes are more likely involved in immune response and neurogenesis, which represents functionally linked processes 59 .
Furthermore, we used CPA as a means to prioritise genes for our hypotheses. For instance, we prioritised genes overlapping with CpG islands by their association with HTT, assuming that CPA rank scores are higher for genes that are higher-up in the cascade of events caused by the mutant protein. We recently showed that this is a fair assumption for CPA, although literature bias cannot be completely mitigated 60 . Similarly, we further prioritized candidate genes in terms of their association with epigenetics. Genes such as APBA1, KAT2A, CARM1, SLIT2 and BNIP3 came forward as the most likely candidates to play a role in HD in the context of downstream effects of HTT involving epigenetic mechanisms. These candidates were filtered on potential novelty: only those genes were reported for which there was no direct association found in our database of PubMed abstracts. This does not exclude associations that were reported in tables and supplemental material that are much harder to mine for technical and legal reasons. Our study also retrieved several well established associations consistent with earlier studies.
In summary, our results show how literature information in combination with data analysis present useful tools for exploration of hypotheses for possible future experiments.

Conclusions
Our methodology offers support for hypothesis generation to elucidate missing links in mechanisms involved in a complex disease such as HD. We have shown how the analysis of microarray data and the integration of publicly available datasets and literature information enables prioritization of associations, such as proteins and mechanisms, that are likely to be involved in HD. In addition, we were able to focus on mechanisms that are associated with epigenetic regulation that may regulate changes that are part of the disease pathology. We argue that such a methodology can be of great value to the scientific community for narrowing down the amount of possible associations but also providing evidence to support a particular hypothesis.

Supplementary File 1: Decision about the promoter region of the chromatin states analysis and CpG islands.
This document is included as a reference to describe the decisions that were made concerning the promoter size using an older version of epigenetic data from ENCODE. In this file we present the additional runs of the workflow compute_overlaps that were performed with different parameters in order to test and decide for the best promoter region and overlap parameters.
Click here to access the data The purpose of this paper is to leverage publicly-available data to investigate the association between chromatin state and Huntington's Disease (HD). The authors do this by identifying genes that are differentially expressed in individuals with HD relative to healthy individuals and identifying the locations of these genes in the genome and the biological processes associated with these genes. They find that many of these genes' promoters are in the active chromatin state in healthy individuals and in CpG islands. They also find that many of these genes are related to biological processes related to HD and that some are in chromatin modification biological processes. Although this study suggests that there may be an association between chromatin state and HD, the nature of that association remains unclear.
: Major Comments I appreciate how the authors integrated existing literature with differential gene expression results to prioritize biological processes, diseases, genes, and molecular functions. In addition, defining the similarity between concepts based on the number of shared concepts is similar to approaches that have been used for community detection in social networks (Blondel ., et al Journal of , 2008) and, more recently, for clustering cells based Statistical Mechanics: Theory and Experiment on protein expression (Levine ., , 2015) (I do not think that the authors need to cite these et al Cell papers), so I am not surprised that it worked well. I hope that the author's use of this approach will inspire others to use such methods for comparing biological concepts in literature and encourage future researchers to directly integrate literature with differential gene expression. I found many of the results difficult to interpret because the authors seem to have done all of the analyses on the set of all differentially expressed genes. My expectations for up-regulated genes are different from those for down-regulated genes. In the Minor Comments, I point out specific analyses for which I think that separating the genes based on the direction of the differential expression would be helpful. If the authors did use only down-regulated or only up-regulated genes, it would be great if they could make this clear in the Methods section and include what fold-change cutoff they used.
Page 4: The authors state which human and mouse assemblies they used, but they had not previously stated that their analysis included data from mouse. It would be helpful if the authors could state exactly what species they are using for each part of their analysis earlier in the manuscript. Page 4: The authors state that they used a Kolmogorov-Smirnov test to compare p-value distributions. It was not clear to me where these p-values come from. Are they the p-values for differential expression of the genes corresponding to the promoters? It would be helpful if the authors could clarify this. Page 4: It was not clear how many Kolmogorov-Smirnov tests were done. The authors said that they rejected the null hypothesis if the p-value was < 0.05. If they did more than one test, then they should do multiple hypothesis correction. Page 4: It would be helpful if the authors could clarify the purpose of the concept ids. Page 4: It would be helpful if the authors could provide a more detailed explanation of how the concept linking is done. Page 5: It would be helpful if the authors could define the HTT concept and explain why they used it for prioritizing genes. Page 5: It would be helpful if the authors could explain why they decided to prioritize genes based on the "epigene" concept. It seems like the authors are interested in genes that affect epigenetics, such as demethylases or histone modifiers. It is not clear to me how this selection relates to the hypothesis that was described in the introduction. Page 5: It would be helpful if the authors could clarify exactly what differential expression tests were done with the human brain data and what the categories were for each test. Page 5: It would be helpful if the authors could clarify whether the human brain data described here was the only human data used for differential expression analysis and, if it was not, what other data was used. Page 5: It would be helpful if the authors could briefly describe how the CpG island data fits into the rest of the analysis. Page 5: It would be helpful if the authors could explain why they selected the two cell types and four chromatin states that they used in the Methods section. Page 5: I think that it might make sense to incorporate additional chromatin states, such as quiescent, weak repressed Polycomb, and enhancer, as strong repression is not always the cause of a promoter's inactivity. Page 5: It would be helpful if the authors could clarify why they used only the mouse data from animals treated with the vehicle. My intuition is that it would make more sense to use the animals that did not receive the HDACi 4b inhibitor since the human subjects did not receive any kind of treatment. It is possible that I misunderstood the purpose of the mouse analysis. :

Results
Page 6: It is not clear to me why a difference in distribution of expression levels between genes overlapping a chromatin state and genes not overlapping that chromatin state implies that chromatin state has an effect on HD. I think that the authors mean that, if the in gene expression between difference individuals with and without HD is higher for genes overlapping a specific chromatin state than overlapping other chromatin states, then there is an between the chromatin state and HD. association Page 6: It would be helpful to split Figure 1 into two parts, one for genes that have higher expression in people with HD and another for genes that have lower expression for people with HD. My intuition is that most of the differences in p-value distribution are coming from the second category because, since the chromatin state data comes from people without HD, I would expect that genes in an active chromatin state would have higher expression in healthy individuals. Adding onto that, regions of closed chromatin cannot decrease because the genes are not expressed. Regions of open chromatin could either increase or decrease, potentially leading to more variability. Page 6: It would be helpful to have a supplemental figure with all chromatin states because it is not clear from Figure 1 if the differences occur for TSS's in all active chromatin states (including inactive genes that are acting as enhancers for other genes) or only from genes that are transcribed.
are acting as enhancers for other genes) or only from genes that are transcribed. Page 6: It would be helpful if the authors could clarify if the overlaps in Figure 1 are done using the entire gene, only the TSS, or the gene's promoter. Page 7: For the biological process analyses, I think that using a tool for differential enrichment between the two groups of genes would provide more interpretable results than comparing the top hits from CPA because such a tool looks for terms that are significantly enriched in one gene set relative to another. An example of such as tool is CompGO (Waardenberg ., , 2015). et al BMC Bioinformatics Page 8: It would be helpful if the authors clarified what they mean by "top novel protein." Does novel mean that the gene had not been associated with HD in a previous paper? Page 8: It was not clear why Figure 3 shows that CPA is able to prioritize true associations with huntington as measured by a gene expression experiment and why combining differential expression measurements and literature evidence enables the selection of even more specific HD signatures. It would be great if the authors could clarify this. Page 8: It would be helpful if the authors could include the direction of the CPA score shifts for the different groups of differentially expressed genes. Page 8: The authors say that "the top 100 and top 1000 differ significantly." It would be helpful if they stated the way in which these gene sets differ. Page 11: It would be helpful if the authors could clarify what x is in Figure 3. :

Discussion
Page 11: I am not sure if the paper provides a lack of evidence for genome-wide re-localization of gene activity to repressed chromatin states. The paper combined all of the up-regulated and down-regulated genes instead of separating them. If the paper had shown that the genes that are up-regulated in people with HD are not found in repressive chromatin states in healthy individuals, then I would be more convinced of this lack of re-localization. However, I would not be fully convinced because changes in chromatin state do not always cause changes in gene expression. For example, a previous study showed that most single nucleotide polymorphisms associated with histone modifications are not associated with transcription, suggesting that histone modification differences between individuals do not always correspond to gene expression differences (Grubert ., , 2015). Thus, it is possible that there are et al Cell chromatin state differences between HD individuals and controls in parts of the genome where there are no differentially-expressed genes. Page 11: The authors suggest that HD is not associated with the disruption of chromatin states at a large scale. To investigate the association of HD with chromatin state using existing data, the authors would need determine if genes that are up-regulated in people with HD tend to fall in repressive chromatin states and if those that are down-regulated in people with HD tend to fall in active chromatin states. Because there do not seem to be separate evaluations of up-regulated and down-regulated genes, I do not think that the results in this paper can be used to evaluate the relationship between chromatin state disruptions and HD. Page 12: I think that CPA's high ranking of chromatin-related concepts for differentially expressed genes suggests an association between chromatin reorganization and HD. If differentially expressed genes near CpG islands include genes involved in chromatin structure, that suggests that there is -regulatory cis change in the regulation of those genes, which could have a downstream effect on chromatin organization. Page 12: Although the paper shows that there are more differentially expressed genes in the active chromatin state in healthy individuals, I am not sure that there is sufficient evidence to conclude that most important changes in HD are occurring in the active chromatin state. For example, if the majority of differentially expressed genes are down-regulated in individuals with HD, then the findings in this paper would match my expectations, even if the most important differentially-expressed genes are up-regulated and are not found in the active chromatin state in healthy individuals.

Partly
If applicable, is the statistical analysis and its interpretation appropriate? I cannot comment. A qualified statistician is required.
Are all the source data underlying the results available to ensure full reproducibility? Yes Are the conclusions drawn adequately supported by the results? Yes No competing interests were disclosed. Competing Interests: Reviewer Expertise: Data science I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
The benefits of publishing with F1000Research: Your article is published within days, with no editorial bias You can publish traditional articles, null/negative results, case reports, data notes and more The peer review process is transparent and collaborative Your article is indexed in PubMed after passing peer review Dedicated customer support at every stage For pre-submission enquiries, contact research@f1000.com Page 23 of 23