Chromes from Chromatin: Sonification of the Epigenome

Epigenetic modifications are organized in patterns that determine the functional properties of the underlying genome. Such patterns, typically measured by ChIP-seq assays of histone modifications, can be combined and translated into musical scores, summarizing multiple signals into a single waveform. As music is recognized as a universal way to convey meaningful information (1), we wanted to investigate the properties of music obtained by sonification of ChIP-seq data. We show that the music produced from such quantitative signals is perceived by human listeners as more pleasant than that produced from randomized signals. Moreover, the waveform can be analyzed to predict phenotypic properties, such as differential gene expression.

Significance Statement
Music is recognized as a universal way to communicate emotions and, more generally, meaningful information. Various sources of information can be translated into music or sounds, mostly for recreational purposes. It has been shown that the human ear can classify information encoded into sounds. Quantitative genomic features, and epigenetic marks in particular, represent functional information that is exploited by cells to drive biological processes. We test a method to translate such information into music and we study some properties of the sonified chromatin marks. We show not only that the musical representation of epigenetic marks has intrinsic musicality, but also that differences in the musical representation of genomic loci reflect differences in the RNA levels of the underlying genes.


Introduction
Sonification is the process of converting data into sound. Sonification has a long, yet punctuated, history of applications in molecular biology: several algorithms to translate DNA (2) or protein sequences (3, 4) into musical scores have been proposed. The same principles have also been extended to the analysis of complex data (5), showing that, all in all, sonification can be used to describe and classify data. Indeed, the very same procedures may also be applied for recreational purposes.

One of the limitations of sonification of actual DNA and protein sequences is their intrinsically conservative nature. Assuming that two individual genomes differ, on average, by one nucleotide every kilobase (6), the corresponding musical scores would differ very little.
On the contrary, the dynamic ranges typical of transcriptomic and epigenomic data may provide a richer source for sonification.

In this work we describe an approach to convert ChIP-seq signals, and in principle any quantitative genomic feature, into a musical score. We started working on our approach mainly for amusement, and we realized that the sonified chromatin signals were surprisingly harmonious. We then tried to assess some properties of the music tracks we were able to generate. We show that the emerging sounds are not random and instead appear more melodious and tuneful than music generated from randomized notes. We also show that different ChIP-seq signals can be combined into a single musical track and that tracks representing different conditions can be compared allowing for the prediction of

In order to translate a single ChIP-seq signal track to music we bin the signal over a specified genomic interval (i.e. chrom:start-end) into fixed-size windows (e.g. 300 bp); note duration is proportional to the size of such windows. As we are dealing with the MIDI standard, we let the user specify the track resolution and the number of ticks per window (see definitions); the combination of these parameters defines the duration of a single note. The default parameters associate a bin of 300 bp with one quaver (1/8 note).

In order to define the note pitch, we take the logarithm of the average intensity of the ChIP-seq signal in a genomic bin. The sounding range of the whole signal is discretized into a predefined number of semitones. At default parameters, the range is binned into 52 semitones, covering four octaves. In order to introduce pauses, the lowest bin of the signal range represents a rest. If two consecutive notes or rests fall in the same bin, we merge them into one note, doubling its duration.

Using this approach, any ChIP-seq signal can be mapped to a chromatic scale.
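The mapping from signal to notes described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `signal_to_notes`, the pseudocount added before taking the logarithm, the starting MIDI pitch, and the handling of runs longer than two bins (durations simply accumulate) are assumptions made for the example.

```python
import numpy as np

def signal_to_notes(signal, bin_size=300, n_levels=52, base_pitch=36, pseudocount=1.0):
    """Turn a per-base signal into a list of (pitch, duration) pairs.

    pitch is a MIDI note number, or None for a rest; duration is counted
    in bins (one bin = one quaver at the defaults quoted in the text).
    base_pitch and pseudocount are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    n_bins = len(signal) // bin_size
    if n_bins == 0:
        return []
    # Average intensity in fixed-size windows (a trailing partial window is dropped).
    means = signal[: n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)
    # Log-transform; the pseudocount avoids log(0) on empty windows.
    logs = np.log(means + pseudocount)
    lo, hi = logs.min(), logs.max()
    if hi == lo:
        # A flat signal has no range to discretize: everything becomes a rest.
        levels = np.zeros(n_bins, dtype=int)
    else:
        # Split the observed range into n_levels equal-width levels (0 .. n_levels - 1).
        inner_edges = np.linspace(lo, hi, n_levels + 1)[1:-1]
        levels = np.digitize(logs, inner_edges)
    notes = []
    for level in levels:
        # The lowest level is a rest; the others are consecutive semitones.
        pitch = None if level == 0 else base_pitch + int(level) - 1
        if notes and notes[-1][0] == pitch:
            # Same note (or rest) as the previous bin: extend it instead of repeating it.
            notes[-1] = (pitch, notes[-1][1] + 1)
        else:
            notes.append((pitch, 1))
    return notes
```

For instance, `signal_to_notes([0] * 600 + [10] * 300 + [100] * 300)` yields a rest lasting two bins followed by two single-bin notes of increasing pitch. Writing the result to an actual MIDI file would additionally require choosing the track resolution and ticks per window, which are left out of this sketch.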
We implemented the possibility to map a signal to a different scale (major, minor, pentatonic…); to this end, intensity bin boundaries are merged according to the definition of a specific scale (Figure 1). MIDI tracks produced in this way can then be imported into sequencer software, where they can be further processed by setting tempo and time signature.

Music produced from chromatin marks is not perceived as a random pattern
In order to test whether sonified chromatin marks are perceived as random patterns, we selected ten genomic regions and generated corresponding tracks based on the following histone modifications: H3K27me3, H3K27ac, H3K9ac, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9me3 (Supporting Audio files S1.1 to S10.1). For the same regions, we randomized the genomic signal at base and bin level (Supporting Audio files S1.2 to S10.2). When data are randomized at base level, the average intensity is uniform across the bins, resulting in a repeated note (data not shown); this is largely expected, as ChIP-seq signals are distributed on the genome according to a Poisson law (7) or, more precisely, to a Negative Binomial law (8).
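The difference between the two randomization schemes can be illustrated on simulated data. The sketch below is a toy example under stated assumptions: the simulated Poisson signal, the seed, and the helper names are hypothetical and are not taken from the authors' scripts. It shows that shuffling individual bases flattens the per-bin averages, whereas shuffling bins keeps the same set of values and changes only their order.

```python
import numpy as np

def bin_means(signal, bin_size=300):
    """Average a per-base signal in fixed-size windows."""
    n_bins = len(signal) // bin_size
    return signal[: n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)

def shuffle_bases(signal, rng):
    """Base-level randomization: permute every position independently of its bin."""
    return rng.permutation(signal)

def shuffle_bins(signal, rng, bin_size=300):
    """Bin-level randomization: keep the per-bin averages, permute their order."""
    return rng.permutation(bin_means(signal, bin_size))

# Simulated signal (an assumption for illustration): 20 bins of 300 bp whose
# Poisson rate rises from 1 to 100, so the original track has a clear contour.
rng = np.random.default_rng(0)
rates = np.repeat(np.linspace(1, 100, 20), 300)
signal = rng.poisson(rates).astype(float)

original = bin_means(signal)
base_level = bin_means(shuffle_bases(signal, rng))   # nearly constant across bins
bin_level = shuffle_bins(signal, rng)                # same values, new order
```

Comparing `np.std(original)` with `np.std(base_level)` makes the flattening visible: after base-level shuffling every bin averages close to the global mean, which corresponds to a single repeated note.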
Randomization at bin level, instead, amounts to shuffling the notes during playback. We administered a questionnaire to a set of volunteers (n=8) who had not previously been tested for musical education. Volunteers were asked to listen to each pair of original/random tracks and choose which track they felt was more appealing. Track order was randomized when testing different volunteers. Notably, in the majority of cases (62/80) the music generated from the genomic signal without randomization was judged more appealing. Results are significant according to Fisher's exact test (p=1.95e-3), suggesting that genomic signals contain information that can be recognized by the human ear. The number of correct answers for each volunteer ranged from 5 to 10, with a median value of 8.

Differences in musical tracks reflect differences in gene expression
Once we had assessed the existence of musical patterns in genomic signals, we were keen to explore whether this kind of information could be exploited to identify biological features of samples. Since the epigenetic modifications reflected by histone marks influence gene expression (9), we tested whether differences in musical tracks generated from various ChIP-seq signals reflect differences in gene expression of the corresponding loci. To this end, we downloaded ChIP-seq marks (H3K27me3, H3K27ac, H3K9ac, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K9me3, Pol2b) and RNA-seq data for the K562 and NHEK cell lines from the ENCODE project (10). For each RefSeq locus we converted ChIP-seq signals to music with fixed parameters (see methods). RNA-seq data were used to identify genes that are differentially expressed between the two cell lines, with a p-value < 0.01 and |logFC| > 1, according to recent SEQC recommendations (11).

A common way to classify music is based on the summarization of track features after spectral analysis (12, 13). Such an approach involves summarizing a track as Mel-Frequency Cepstral Coefficients (MFCC), which are subsequently clustered using Gaussian Mixture Models (GMM). A distance between tracks can then be defined as described in (14), where it was used as a classifier for musical genres.

We tested whether a similar approach could be used to develop a predictor of differential expression based on the distance between musical tracks generated from two cell lines.
We defined a distance between songs as described in methods and optimized the parameters using, as a training set, the 250 genes with the most significant differential expression p-values and as many genes with the least significant p-values according to RNA-seq (Figure 2). We found that performance is optimal at MFCC=30 and GMM=10, with an AUC=0.609.
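As a reminder of what the reported metric measures, the AUC is the probability that a randomly chosen differentially expressed gene has a larger track distance than a randomly chosen non-differential gene, with ties counted as one half. A minimal sketch, assuming the distances and the binary labels are already available as arrays; the function name and the toy values are hypothetical, and the text does not state which software was used for the ROC analysis.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    scores: one value per gene (here, the distance between the two tracks).
    labels: truthy for differentially expressed genes, falsy otherwise.
    Memory grows as n_pos * n_neg, which is fine for a 250 + 250 training set.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    # Count pairs ranked correctly, giving half credit to ties.
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

An AUC of 0.5 corresponds to a predictor no better than chance, which gives a reference point for reading the values reported in this section.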

We summarized tracks representing all RefSeq genes using these parameters and then compared distances with differential expression by performing a ROC analysis. Our results indicate that differences in the information contained in the musical representation of chromatin signals may be linked to differential expression, although the power of prediction is limited (AUC=0.5184, p=1.4597e-03).

Similarity between musical tracks overlaps similar biological properties
As an additional question, we wanted to assess whether similarities between musical representations of chromatin status may be linked to the biology of the underlying genes. To this end, we calculated pairwise distances for all regions using the parameters identified above on the K562 cell line. Hierarchical clustering of the distance matrix identifies eight major clusters (Figure 3, left). We performed Gene Ontology Enrichment analysis on each cluster, represented here as word clouds of significant terms (Figure 3, center, supplementary Table S2); we found that different clusters are linked to genes showing different biological properties. For example, some clusters (6, 7 and 8) were linked to regulation of the cell cycle, while others were linked to metabolic processes (2 and 5) or vesicle transport (3 and 4). We also evaluated the distribution of expression (expressed as log(RPKM)) of the underlying genes (Figure 3, right); we found that regions clustered by the distance between musical tracks broadly reflect groups of genes with different levels of expression, spotting clusters of higher expression (cluster 5) or lower expression (clusters 2 and 3); an assessment of the statistical significance of differences in the distribution of gene expression values among clusters is presented in Table 1.

We show, in this work, that the information carried by multiple histone modifications can be captured in a human-friendly way by translating ChIP-seq signals into musical scores.
Although the investigation of the psychological factors that underlie the tuneful perception of sonified genomic signals is beyond the scope of this manuscript, our results suggest that human hearing is able to perceive patterns conveying the information encoded in the ChIP-seq data analyzed and to distinguish them from random noise.

We automated the analysis of differences between musical tracks using an established method based on the summarization of spectral data. With this approach, we investigated the possible link between differences in the way chromatin sounds and phenotypic features. Our results suggest that differences in transcript levels can be predicted from the differences between sonified genomic regions, although the performance of such an approach is limited. We reasoned that many factors may explain such poor results: first of all, there is a vast space of parameters that can be tuned to create a single musical track, and we still lack methods to explore it efficiently. In addition, the Mel scale used to summarize the audio signal was developed to match human capabilities to perceive sound (21); hence it may not be optimal for the comparison of the tracks generated in this work.

It has already been shown that it is possible to predict levels of gene expression starting from chromatin states, although the method used to perform chromatin segmentation has a large impact on such predictions (22). In this work we found that differences in chromatin-derived music reflect, to some extent, differences in the level of expression of the underlying genes and their related biology.

To conclude, although we cannot yet advocate the use of musical analysis as a universal tool to analyze biological data, we confirm that quantitative features on the genome are patterned and contain information, and hence can be converted into sounds that are perceived as musical. We limited our analysis to specific chromatin modifications, but in principle any quantitative genomic feature can be converted and integrated into a musical track. The choice of parameters and instruments has been standardized for the analysis presented; for illustrative purposes we show that different signals from the same region can be combined using different instruments (https://soundcloud.com/davide-cittaro/random-locus-blues) and that signals from different genomic regions can be merged (https://soundcloud.com/davide-cittaro/non-homologous-end-joining).

Raw data for various modifications were downloaded from (GSE26320). Read tags were aligned to the human genome (hg19) using the bwa aligner (23). Alignments were converted to bigwig tracks (24) after removing duplicates and keeping alignments with a quality score higher than 15.

In order to define the regions to be converted to musical scores, we selected intervals around RefSeq gene definitions, from 1 kb upstream of the TSS to 2 kb downstream of the TES. ChIP-seq signals were first converted to MIDI using custom scripts (https://bitbucket.org/dawe/enconcert) according to the parameters defined in Table 2.

RNA-seq tags were aligned to the reference genome using the STAR aligner (25). Read counts over RefSeq intervals were extracted using bedtools. Discrete counts were normalized with TMM (26); differential gene expression was evaluated using the voom function implemented in limma (27) with a simple contrast between the two cell lines. Genes were considered differentially expressed with a p-value lower than 0.01 and an absolute logarithm of fold change higher than 1.
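The differential-expression criterion stated above amounts to a simple joint filter on two columns of the test results. The following sketch assumes the p-values and log fold changes have already been exported from the statistical analysis as arrays; the function name and the example values are hypothetical, and whether the p-value is raw or adjusted is not specified in the text, so the code makes no assumption about it.

```python
import numpy as np

def differentially_expressed(p_values, log_fc, p_cutoff=0.01, lfc_cutoff=1.0):
    """Boolean mask of genes passing both the p-value and fold-change thresholds."""
    p_values = np.asarray(p_values, dtype=float)
    log_fc = np.asarray(log_fc, dtype=float)
    # A gene must be both significant and changed by more than the cutoff,
    # in either direction.
    return (p_values < p_cutoff) & (np.abs(log_fc) > lfc_cutoff)

# Made-up values for four genes: only the first and last pass both thresholds.
mask = differentially_expressed([0.001, 0.001, 0.5, 0.009], [2.0, 0.5, 3.0, -1.5])
```

The resulting mask can serve directly as the binary label when distances between tracks are evaluated as a predictor of differential expression.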

Cluster analysis
Cluster analysis was performed on replicate 1 of the K562 dataset. We calculated all pairwise Hausdorff distances among genomic loci as defined above. Data were clustered using the Ward method. Enrichment analysis was performed using Enrichr (28). Word clouds were created with the word_cloud python package (https://github.com/amueller/word_cloud) using the text descriptions of ontologies having a positive Enrichr combined score. Differential expression among clusters was evaluated using the Mann-Whitney U-test.

Acknowledgements
The authors would like to thank all collaborators and relatives who kindly sacrificed their time to listen to the music generated while this work was developed.