High-resolution 7-Tesla fMRI data on the perception of musical genres – an extension to the studyforrest dataset

Here we present an extension to the studyforrest dataset, a versatile resource for studying the behavior of the human brain in situations of real-life complexity (http://studyforrest.org). This release adds more high-resolution, ultra high-field (7 Tesla) functional magnetic resonance imaging (fMRI) data from the same individuals. The twenty participants were repeatedly stimulated with a total of 25 music clips, with and without speech content, from five different genres using a slow event-related paradigm. The data release includes raw fMRI data, as well as precomputed structural alignments for within-subject and group analysis. In addition to fMRI, simultaneously recorded cardiac and respiratory traces, as well as the complete implementation of the stimulation paradigm, including stimuli, are provided. An initial quality control analysis reveals distinguishable patterns of response to individual genres throughout a large expanse of areas known to be involved in auditory and speech processing. The present data can be used to, for example, generate encoding models for music perception that can be validated against the previously released fMRI data from stimulation with the “Forrest Gump” audio-movie and its rich musical content. In order to facilitate replicative and derived works, only free and open-source software was utilized.


Background
Previously, we have released a large, high-resolution, 7 Tesla fMRI dataset on the processing of natural auditory stimuli, a two-hour audio movie 1. Recently, we have extended this initial release with a detailed annotation of the emotional content of the stimulus 2 to broaden the range of research questions that could be addressed with these data. Here we further amend this dataset with additional high-resolution fMRI data from the same participants on the perception of musical genres. We employed a proven paradigm and stimuli that have been previously shown to enable investigation of distributed population codes of musical timbre in bilateral superior temporal cortices 3.
The present data release enables comparative studies of the representation of musical genres (spectrum, timbre, vocal content) with ultra high-field, high resolution fMRI data from a larger sample of participants. In conjunction with the previous data releases, it will also further expand the continuum of research questions that can be approached with the joint dataset. One example is the development of encoding models for cortical representations of music in complex auditory stimuli (the audio-movie contains several dozen musical excerpts from a broad range of genres). To this end, we include extracted audio features that represent the time-frequency information of each stimulus in four different views. The views are obtained by mapping to different perceptually-motivated scales (mel and decibel scales) and by applying a decorrelating linear transformation (DCT-II). It is hoped that providing these example features will catalyze discoveries of auditory stimulus codes in neural populations.
Lastly, these data can also serve as a public resource for benchmarking algorithms for functional alignment [e.g., 4], or other analyses, and thus, further the availability of resources for the investigation of real-life cognition 5 .

Materials and methods
Participants
Acquisition of the data described herein was part of a previously published study 1, and took place in close temporal proximity (no more than a few weeks apart). The participants in this data release are identical to those previously reported. They were fully instructed about the nature of the study and were paid a total of 100 EUR for their participation, which included the previously reported data acquisitions, as well as the one described herein. All data acquisitions were jointly approved by the ethics committee of the Otto-von-Guericke-University of Magdeburg, Germany (approval reference 37/13).

Stimulus
All stimuli employed in this study are identical to those used in a previous study [for details refer to 3]. They were five natural, stereo, high-quality music stimuli (6 s duration; 44.1 kHz sampling rate) for each of five different musical genres: 1) Ambient, 2) Roots Country, 3) Heavy Metal, 4) 50s Rock'n'Roll, and 5) Symphonic (see Figure 1 for details).

Procedures and stimulation setup
The setup for audio-visual presentation was as previously reported 1. Participants listened to the audio using custom-built in-ear headphones, and an LCD projector displayed visual instructions on a rear-projection screen that they saw via a mirror attached to the head coil.

Figure 1. Spectrograms for all 25 stimuli showing structural differences in the time-frequency characteristics of the five musical genres. Each stimulus was a six second excerpt from the middle of a distinct musical piece. Excerpts were normalized so that their root-mean-square power values were equal, and a 50 ms quarter-sine ramp was applied at the start and end of each excerpt to suppress transients. Most prominent are the differences between music clips with and without vocal components.
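The excerpt preparation described in the Figure 1 caption (equal root-mean-square power and a 50 ms quarter-sine ramp at both ends) can be illustrated with a minimal sketch. This is not the code used to prepare the stimuli. The function name, the target RMS value, and the order of the two steps (normalize first, then ramp) are assumptions made for illustration only.

```python
import numpy as np


def normalize_and_ramp(signal, sample_rate=44100, target_rms=0.1, ramp_ms=50.0):
    """Scale a mono excerpt to a target RMS, then apply quarter-sine on/off ramps.

    target_rms is an arbitrary illustrative value; the source only states that
    all excerpts were given equal RMS power.
    """
    signal = np.asarray(signal, dtype=float)
    rms = np.sqrt(np.mean(signal ** 2))
    out = signal * (target_rms / rms)

    # Number of samples covered by the ramp (2205 samples for 50 ms at 44.1 kHz).
    n_ramp = int(round(sample_rate * ramp_ms / 1000.0))
    # First quarter of a sine period: rises smoothly from 0 towards 1.
    ramp = np.sin(0.5 * np.pi * np.arange(n_ramp) / n_ramp)

    out[:n_ramp] *= ramp          # fade in
    out[-n_ramp:] *= ramp[::-1]   # fade out
    return out
```

Because the ramps attenuate the first and last 50 ms, the RMS of the ramped excerpt is marginally below the target; whether the original processing compensated for this is not stated.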
At the start of each recording session, during the preparatory MR scans, participants listened to a series of longer excerpts of musical pieces and songs from the five different genres. During this phase participants were instructed to request adjustments of the stimulus volume in order to guarantee optimal perception of the stimuli against the noise pedestal emitted by the scanner. There was no overlap between the songs presented in this phase and those used as stimuli in the main experiment.
Eight scanning runs followed the initial sound calibration. Each run was started by the participant with a key-press ready signal. There were 25 trials, with five different stimuli (Figure 1) for each of the five genres per run (see Figure 2 for details on the experiment design). At the end of each run participants were given the opportunity for a break of variable length until they indicated readiness for the next run. Most participants started the next run within a minute.
Stimulus presentation and response logging were implemented using PsychoPy 7 running on a computer with the (Neuro)Debian operating system 8 .

Functional MRI data acquisition
The acquisition protocol for functional MRI was largely identical with the one previously reported 1, hence only differences and key facts are listed here. Importantly, the same landmark-based procedure for automatic slice positioning that was used to align the scanner field-of-view between acquisition sessions was used again to align the field-of-view for this acquisition with the one in the previous study 1. As the exact same alignment target was used, this led to a very similar field-of-view configuration across acquisitions.
Each acquisition run consisted of 153 volumes (repetition time of 2.0 seconds with no inter-volume gaps).

Figure 2. (A) The start of each trial was synchronized with the MRI volume acquisition trigger. When the trigger was received the permanently displayed white fixation cross turned green, a 6 s music stimulus was presented, and, immediately afterwards, the fixation cross turned white again. Stimulation was followed by a variable delay (minimum delay 4 s). For the five trials of a genre, a 4 s and an 8 s delay each occurred once, while the remaining three trials included a 6 s delay period. Thereby all trials had 4-8 s of uniform stimulation (no audio, white fixation cross) after each musical stimulus. The order of delays was randomized within a run. During trials with an 8 s delay participants were presented with a yes/no question four seconds after the end of the music stimulus. The content of the question was randomized and asked for particular features of the stimulus that had just ended (e.g., "Was there a female singer?", "Did the song have a happy melody?"). Participants had to indicate their response by pressing one of two buttons with the index or middle finger of their right hand corresponding to the response alternative presented on the screen. "Yes" was always mapped to the left side (index finger), "No" always to the right side (middle finger). The question had the purpose of keeping the participants attentive to the stimuli and to counteract the effect of increasing familiarity across multiple runs. (B) Run configuration. The 25 stimuli were identical across runs and presented exactly once per run. Order of stimulus genres within each run was counter-balanced using De Bruijn cycles 6 (alphabet size = 5, counter-balancing level = 2), hence each genre was followed by any other genre equally often and exactly once. Eight unique genre order sequences were generated and used for all participants, while randomizing the order of run sequences across participants. This was done in order to enable the application of the hyperalignment algorithm 4. Data acquisition for two participants showed anomalies with respect to this procedure (see Table 2 for details).
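The counter-balancing of genre order with De Bruijn cycles (alphabet size 5, level 2) can be illustrated with the standard recursive construction of a De Bruijn sequence. This is a minimal sketch and not necessarily the generator used for the experiment. The genre label strings are hypothetical, and how the eight distinct sequences were derived is not covered here.

```python
def de_bruijn(k, n):
    """Return a De Bruijn cycle B(k, n) over the alphabet 0..k-1 as a list.

    Every possible length-n subsequence occurs exactly once when the
    returned sequence is read cyclically.
    """
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence


# Hypothetical labels for the five genres named in the text.
GENRES = ["ambient", "country", "metal", "rocknroll", "symphonic"]

# One 25-trial genre order in which each ordered genre pair occurs once (cyclically).
genre_order = [GENRES[i] for i in de_bruijn(len(GENRES), 2)]
```

With five genres and level 2 the cycle has 25 elements, which matches the 25 trials per run, and each genre occurs five times.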

Physiological recordings
The cardiac and respiratory traces were recorded for the full duration of all eight runs. The acquisition setup for physiological recordings was identical with the one previously reported 1.

Dataset content
The released data comprises raw and pre-processed fMRI data, physiological recordings, behavioral log files, and auditory stimuli (total ≈95 GB). Table 1 provides an overview of the location of individual data components. The following sections briefly describe important properties.

Behavioral log files
Log files are available as plain text files with comma-separated value markup. All enumerations are zero-based. Each line represents a trial. Columns for the following information are present: order of run in sequence (run), ID of trial sequence for this run (run_id; see Figure 2B), fMRI volume corresponding to stimulation start (total: volume, in the current run: run_volume), stimulus file name (stim), music genre label (genre), inter-stimulus interval in seconds (delay), flag whether a control question was presented (catch), measured asynchrony between MRI trigger and sound onset in seconds (sound_soa), and time stamp of the corresponding MRI trigger with respect to the start of the experiment in seconds (trigger_ts).
Information on the stimulus timing is also available in per-subject, per-run, per-condition plain-text files in FSL's EV3 format: one line per stimulation event, three columns with stimulus onset and duration (both in seconds relative to the start of a scan), as well as a third column with an arbitrary intensity weight that is always set to 1.
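As an illustration of how the two timing representations relate, the following minimal sketch reads a behavioral log and produces per-genre event lists in the three-column layout described above. It assumes that the log has a header row with the column names listed in the text, and that a stimulus onset equals the start of the triggering volume (run_volume times the 2.0 s repetition time) plus sound_soa. The example rows are made up and are not taken from the released files.

```python
import csv
import io
from collections import defaultdict

TR = 2.0             # repetition time in seconds, as stated for this acquisition
STIM_DURATION = 6.0  # duration of each music clip in seconds


def events_by_genre(log_text, run):
    """Collect (onset, duration, weight) tuples per genre for one run."""
    events = defaultdict(list)
    for row in csv.DictReader(io.StringIO(log_text)):
        if int(row["run"]) != run:
            continue
        # Assumed onset: start of the triggering volume plus measured audio lag.
        onset = int(row["run_volume"]) * TR + float(row["sound_soa"])
        events[row["genre"]].append((onset, STIM_DURATION, 1.0))
    return dict(events)


def to_ev3(events):
    """Format events as three whitespace-separated columns, one event per line."""
    return "".join(f"{onset:.3f} {duration:.1f} {weight:.0f}\n"
                   for onset, duration, weight in events)


# Made-up example rows (illustrative values only).
EXAMPLE_LOG = """run,run_id,volume,run_volume,stim,genre,delay,catch,sound_soa,trigger_ts
0,3,0,0,clip_a.wav,ambient,4,0,0.012,0.000
0,3,5,5,clip_b.wav,country,6,0,0.010,10.000
"""
```

The released EV3 files remain the authoritative timing source; a sketch like this is mainly useful for deriving alternative event groupings, for example per stimulus instead of per genre.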
fMRI data
All functional MRI data were converted from the DICOM format into the NIfTI format for publication using the same procedure as in 1.
fMRI data are available in three different flavours, each stored in an individual 4D image for each run separately. Raw BOLD data are stored in bold.nii.gz. While raw BOLD data are suitable for further analysis, they suffer from severe geometric distortions. BOLD data that have been distortion-corrected 9 at the scanner console are provided in bold_dico.nii.gz. In addition, distortion-corrected data that have been anatomically aligned to a per-subject BOLD template image are available: bold_bold7Tp1_to_subjbold7Tp1.nii.gz.

Participant motion estimates
Head movement correction was performed with respect to a dedicated reference scan at the start of the recording session within scanner online reconstruction as part of the distortion correction procedure. The associated motion estimates are provided in a whitespace-delimited 6-column text file (translation X, Y, Z in mm, rotation around X, Y, Z in deg) with one row per fMRI volume for each run separately.

Physiological recordings
Physiological data were truncated to start with the first MRI trigger pulse and to end one volume acquisition duration after the last trigger pulse. Data are provided in a four-column (MRI trigger, respiratory trace, cardiac trace and oxygen saturation), space-delimited text file for each run. A log file of the automated conversion procedure is provided in the same directory (conversion.log). Sampling rate for the majority of all participants is 200 Hz (see Table 2 for exceptions).
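A minimal sketch of reading such a file is given below. It assumes that the four columns appear in the order listed above and that trigger pulses are encoded as non-zero values in the first column; neither the encoding nor the function names come from the data release.

```python
SAMPLING_RATE_HZ = 200  # stated rate for the majority of participants
TR = 2.0                # repetition time in seconds
COLUMNS = ("trigger", "respiratory", "cardiac", "oxygen_saturation")


def parse_physio(text):
    """Parse space-delimited rows into one list of floats per channel."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    return {name: [float(v) for v in column]
            for name, column in zip(COLUMNS, zip(*rows))}


def trigger_onsets(trigger):
    """Sample indices where the trigger channel switches from zero to non-zero."""
    return [i for i, v in enumerate(trigger)
            if v != 0 and (i == 0 or trigger[i - 1] == 0)]


def expected_samples(n_volumes, tr=TR, rate=SAMPLING_RATE_HZ):
    """Samples expected when a recording spans exactly n_volumes acquisitions."""
    return int(round(n_volumes * tr * rate))
```

Given the truncation rule, a complete 153-volume run sampled at 200 Hz should contain about expected_samples(153) = 61200 rows, which offers a quick consistency check before further processing.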

Audio features
Recent experiments have shown that audio features can be predicted via regression models from fMRI signals to test stimulus coding hypotheses 3,10 . To facilitate this activity with the current data we extracted four audio features from down-mixed mono stimuli. Feature extraction used a front-end windowed short-time Fourier transform, with window size 16384 samples (371.52 ms) and hop size 4410 samples (100 ms) yielding 63 overlapping feature vectors per stimulus file. Window parameters were chosen to trade temporal for spectral acuity, yielding frequency samples spaced linearly at 2.69 Hz intervals from 0-22.05 kHz. The four features extracted from this representation are described below.
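The window and hop parameters quoted above follow directly from the 44.1 kHz sampling rate of the stimuli. The short sketch below reproduces that arithmetic; the function name is illustrative.

```python
SAMPLE_RATE = 44100  # Hz, sampling rate of the stimuli
WINDOW = 16384       # samples per analysis window
HOP = 4410           # samples between successive windows


def stft_parameters(sample_rate=SAMPLE_RATE, window=WINDOW, hop=HOP):
    """Derive durations and frequency resolution from the STFT settings."""
    return {
        "window_ms": 1000.0 * window / sample_rate,   # about 371.52 ms
        "hop_ms": 1000.0 * hop / sample_rate,         # 100 ms
        "bin_spacing_hz": sample_rate / window,       # about 2.69 Hz
        "nyquist_hz": sample_rate / 2,                # 22050 Hz
        "n_frequency_bins": window // 2 + 1,          # bins from 0 Hz to Nyquist
    }
```

A 6 s excerpt holds 264600 samples, that is 60 hop steps. The reported 63 feature vectors per file therefore depend on how the excerpt edges are padded, which is not specified in the text.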

Mel-Frequency Spectrum (mfs), 48 dimensions.
Motivated by human auditory perception the mel scale organizes frequency by equidistant pitch locations as determined by psychophysical experiments. We used the essentia open source audio processing library 11 to extract the mel-frequency spectrum, which yielded energy in mel bands by applying a frequency-domain filterbank 12 to the short-time Fourier spectrum. Frequency-domain filtering consisted of applying equal area overlapping triangular filters to the Fourier spectrum spaced according to the mel scale and normalized such that the sum of coefficients for every filter equals one.
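The filterbank construction described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the essentia code that produced the released features. The mel formula used here (2595 log10(1 + f/700)), the frequency range of 0 to 22050 Hz, and the function names are assumptions.

```python
import numpy as np


def hz_to_mel(f):
    """Common mel-scale formula (assumed; other variants exist)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(n_bands=48, n_fft=16384, sample_rate=44100,
                   f_min=0.0, f_max=22050.0):
    """Overlapping triangular filters, equally spaced on the mel scale.

    Each row is normalized so that its coefficients sum to one.
    """
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    # n_bands + 2 edge frequencies: each filter spans three consecutive edges.
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_bands + 2))
    bank = np.zeros((n_bands, bin_freqs.size))
    for b in range(n_bands):
        lo, mid, hi = edges[b], edges[b + 1], edges[b + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[b] = np.clip(np.minimum(rising, falling), 0.0, None)
        bank[b] /= bank[b].sum()
    return bank
```

Applying the filterbank to one frame of the short-time power spectrum (a vector of 8193 values for the window size used here) is a single matrix product, bank @ spectrum, which yields the 48 band energies.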

Mel-Frequency Cepstral Coefficients (mfcc), 48 dimensions.
Cepstral features have been widely reported to perform well in speech recognition and music classification systems 13, where the task is required to be sensitive to timbre. Typically, only the lower 10-20 cepstral coefficients (low quefrency) are retained; these encode the shape of the broad spectral envelope, an acoustic correlate of timbre. However, when sensitivity to timbre is not required, utilizing the upper coefficients (high quefrency), which encode fine spectral structure such as pitch, makes the feature robust to timbral changes 14. We extracted the full set of 48 cepstral coefficients from the mel-frequency spectrum, by mapping the mel spectrum to a decibel amplitude scale and multiplying by the discrete cosine transform (DCT-II) matrix. It is expected that any application would first remove the constant first column and retain either the subsequent 13-20 coefficients or the remaining upper coefficients after those, depending on whether sensitivity or robustness to timbral difference is required. The remaining two features yield such a separation into low and high quefrency spectral components.

Low-quefrency and high-quefrency mel-frequency spectra (lq_mfs, hq_mfs).
Although proven to be useful in machine classification tasks, cepstral coefficients are in a different domain than the spectrum. The last two features map selected cepstral coefficients back to the spectrum domain by reconstructing the 48 mel-frequency spectrum bands using the low-quefrency and high-quefrency mfcc coefficients respectively. In each case, the non-selected coefficients were zeroed and the resulting feature mapped back to the spectral domain using the inverse (transposed) DCT-II matrix and then inverting the decibel amplitude scale. These two sets of features represent broad-spectrum information (timbre) and fine-scale spectral structure (pitch) respectively. The product of these two spectra yields the mel-frequency spectrum.
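The relation between the four feature views can be made concrete with a small NumPy sketch. It is not the extraction code shipped with the data release. It assumes an orthonormal DCT-II (so that the transpose is the inverse), a decibel mapping of 10 log10, and a split after the first 13 coefficients, with coefficient 0 assigned to the low-quefrency part; all of these are illustrative choices.

```python
import numpy as np


def dct2_matrix(n):
    """Orthonormal DCT-II matrix of size n x n; its transpose is its inverse."""
    k = np.arange(n)[:, None]   # coefficient (quefrency) index
    m = np.arange(n)[None, :]   # mel band index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    d[0] /= np.sqrt(2.0)
    return d


def quefrency_split(mfs, n_low=13):
    """Return (mfcc, lq_mfs, hq_mfs) for a frames x bands mel spectrum.

    mfs must be strictly positive because it is mapped to a decibel scale.
    """
    mfs = np.asarray(mfs, dtype=float)
    d = dct2_matrix(mfs.shape[-1])
    mfcc = (10.0 * np.log10(mfs)) @ d.T      # decibel scale, then DCT-II

    low, high = mfcc.copy(), mfcc.copy()
    low[..., n_low:] = 0.0                   # keep low-quefrency coefficients
    high[..., :n_low] = 0.0                  # keep high-quefrency coefficients

    # Inverse (transposed) DCT-II, then invert the decibel mapping.
    lq_mfs = 10.0 ** ((low @ d) / 10.0)
    hq_mfs = 10.0 ** ((high @ d) / 10.0)
    return mfcc, lq_mfs, hq_mfs
```

Because the two coefficient subsets are complementary and the transform is linear, the decibel spectra add up to the original, so lq_mfs * hq_mfs recovers mfs up to floating-point error. This holds for any split point.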

Source code
The source code for descriptive statistics in Figure 1 and Figure 3, as well as the implementation for the analysis presented in Figure 4 is available in a Git repository at https://github.com/psychoinformatics-de/paper-f1000_pandora_data. Source code for the implementation of the stimulation paradigm and audio feature extraction are included in the data release. Additional scripts for data conversion and quality control are available at: https://github.com/hanke/gumpdata.

Dataset validation
In order to assess data quality, we investigated whether different BOLD response patterns associated with the five musical genres could be discriminated, using either univariate statistical parametric mapping or multivariate pattern (MVP) classification accuracy (searchlight-based analysis, radius of two voxels, sparse spatial sampling with sphere centers spaced by two voxels, leave-one-run-out cross-validated classification analysis with a support vector machine; the accuracy mapped on a voxel reflects the average across all sphere analyses a voxel participated in). Inspection of the participant motion estimates revealed a median translation of less than a voxel size, and a maximum rotation of about 1 deg (see Figure 3 for outliers).
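The leave-one-run-out partitioning used in the cross-validation can be sketched without reference to a particular classification toolbox. The sketch below only generates the train/test splits. It assumes one sample per trial (8 runs of 25 trials) and does not reproduce the searchlight or classifier of the reported analysis; the function name is illustrative.

```python
def leave_one_run_out(run_labels):
    """Yield (held_out_run, train_indices, test_indices) for every run."""
    for held_out in sorted(set(run_labels)):
        train = [i for i, r in enumerate(run_labels) if r != held_out]
        test = [i for i, r in enumerate(run_labels) if r == held_out]
        yield held_out, train, test


# Assumed layout: one sample per trial, 25 trials in each of 8 runs.
run_labels = [run for run in range(8) for _ in range(25)]
folds = list(leave_one_run_out(run_labels))

# With five balanced genres the chance level for classification is 1/5.
CHANCE_LEVEL = 1.0 / 5
```

Holding out whole runs keeps all trials acquired within the same run on the same side of the split, so that temporal dependencies within a run cannot inflate the accuracy estimate.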
Despite the variable magnitude of motion, no participant was excluded from the subsequent analysis.
The results of the univariate analysis (Figure 4A) and the MVP analysis (Figure 4B and Table 3) identify largely congruent areas. MVP analysis generally detects larger and more numerous areas, either due to higher sensitivity or a comparatively more liberal statistical threshold. Notably, clusters of above-chance classification accuracy not only contain auditory cortex and other cortical fields related to speech and music processing, but also the subcortical bilateral medial geniculate bodies, a neural relay station immediately prior to the primary auditory cortex in the auditory pathway 15.
Given the confirmed wide-spread availability of genre-discriminating signal we conclude that these data are suitable for studying the representation of music and auditory features. Table 2 contains a list of all known data anomalies that may help potential data consumers to select appropriate subsets of this dataset.

Usage notes
These data are part of a larger public dataset available at http://www.studyforrest.org. The website includes information on all available resources, data access options, publications that employ this dataset, as well as source code for data conversion and data processing.

Figure 4. (A) Random-effects GLM group analyses (n=20) were computed using the FEAT component of FSL 16. Individual contrasts were evaluated for each genre to identify voxels showing a BOLD response to this particular genre that is larger than the average response to all other genres. For all voxel clusters that show a significant difference at the group-level (cluster forming threshold Z=3.1, cluster probability threshold p<0.05) for any genre, the selectivity label was determined by the maximum Z statistic across all genres. No significant selective activation was found for the ambient genre. The majority of all voxels were labeled selective for one of the musical genres where stimuli contained vocals (country, rock'n'roll, heavy metal). Only a small cluster in BA44 R (Broca's area) was labeled selective for symphonic music, despite the lack of speech content in these stimuli.
(B) For comparison, the location of voxel clusters with above-chance classification accuracy for predicting the genre of a music stimulus (colors only indicate individual clusters, not association with particular genres). The associated areas are largely overlapping with the results of the GLM analysis. However, genre-discriminating signals were identified in a number of additional areas. For details on the MVP analysis and cluster statistics see Table 3. Unthresholded maps for GLM and MVP analyses are available at NeuroVault.org 17 collection 308.

Figure 3. Summary statistics for head movement estimates across runs and participants.
These estimates indicate relative motion with respect to a dedicated reference scan at the beginning of each scan session. The area shaded in light gray depicts the range across participants, while the medium gray area indicates the 50% percentile around the mean, and the dark gray area shows ± one standard error of the mean. The black line indicates the median estimate. Dashed vertical lines indicate run boundaries where participants had a brief break. The red lines indicate motion estimate time series of outlier participants. An outlier was defined as a participant whose motion estimate exceeded a distance of two standard deviations from the mean across participants for at least one fMRI volume in a run. For a breakdown of detected outliers see Table 2.
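The outlier criterion given in the Figure 3 caption can be sketched as follows. The sketch assumes that the motion estimates of one run are arranged as a participants x volumes array and that the sample standard deviation (ddof=1) is used; the caption does not specify the latter, and the function name is illustrative.

```python
import numpy as np


def find_outliers(estimates, n_std=2.0):
    """Return indices of participants flagged as outliers in one run.

    estimates: array of shape (participants, volumes) with one motion
    estimate (e.g., translation magnitude) per participant and volume.
    A participant is flagged when the estimate deviates from the
    across-participant mean by more than n_std standard deviations
    for at least one volume.
    """
    estimates = np.asarray(estimates, dtype=float)
    mean = estimates.mean(axis=0)
    std = estimates.std(axis=0, ddof=1)  # ddof=1 is an assumption
    exceeds = np.abs(estimates - mean) > n_std * std
    return np.flatnonzero(exceeds.any(axis=1))
```

Since the criterion is evaluated per volume against the group distribution, a single brief excursion is sufficient to flag a participant, which is consistent with outliers being reported without leading to exclusion.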
All data are made available under the terms of the Public Domain Dedication and License (PDDL; http://opendatacommons.org/licenses/pddl/1.0/). All source code is released under the terms of the MIT license (http://www.opensource.org/licenses/MIT). In short, this means that anybody is free to download and use this dataset for any purpose as well as to produce and re-share derived data artifacts. While not legally required, we hope that all users of the data will acknowledge the original authors by citing this publication and follow good scientific practice as laid out in the ODC Attribution/Share-Alike Community Norms (http://opendatacommons.org/norms/odc-by-sa/).

Consent
Written informed consent for publication of acquired data in a de-identified form was obtained from all participants.
Author contributions
MH conducted the study, implemented the stimulation paradigm, performed dataset validation analysis, and wrote the manuscript. RD implemented and performed the dataset validation analyses. CH converted the stimulation protocol into the OpenFMRI format. JSG contributed to the implementation of the stimulation paradigm, and contributed to the validation analysis. MC contributed the stimuli. FRK contributed to quality control analysis and to the manuscript. JS was the data acquisition lead.

Table 3. Average group results of a searchlight-based (radius ≈2.5 mm) cross-validated within-subject musical genre classification analysis (n=20; SVM classifier; C parameter scaled according to the norm of the data). The table lists statistics (size, mean/max/std accuracy) as well as localization information (coordinates in mm MNI152) for clusters with above-chance classification performance in the group (cluster-level probability p<0.05; FWE-corrected). Clusters are depicted in Figure 4B. Statistical evaluation was implemented using a bootstrapped permutation analysis, as described by Stelzer and colleagues 18 and implemented in PyMVPA 19 (using 50 permutation searchlight accuracy maps per subject, 10000 bootstrap samples, voxel-wise cluster forming threshold of p<0.001). Apart from two large clusters covering the majority of bilateral area for auditory perception and speech processing, additional clusters with genre-discriminating signals were identified. These include the bilateral medial geniculate bodies, as well as smaller regions on the ventral visual pathway, frontal orbital cortex, and the cerebellum. For these regions the NeuroSynth database 20 reports high posterior probabilities for the topics: counting, motor, naming, phonology, prosody, visual, and vocal (as determined with the Neurosynth term atlas shipped with NeuroDebian 8).

that will afterwards be available for future studies (the authors name this part 'quality control').
Secondarily, the authors provide a pre-processing of the stimuli set, too, with the aim to provide elementary descriptors which could be used to relate fMRI brain responses to delivered music snippets for encoding models estimation. Thirdly, the authors mention the goal to validate the encoding models generated by the previous dataset (studyforrest.org). Alongside the main goals, the authors mention the possibility to use the latest fMRI corpus as a resource for benchmarking algorithms for functional alignment. Could the authors make examples of such algorithms, ideally with references?
Firstly, in my opinion the stimuli and the experimental procedures are generally clearly described, but the pre-processing of fMRI analyses could be further commented in order to allow appropriate replication. In fact, some procedures are mentioned in other papers and that would require the knowledge of previous work done by the group. To ease the readers into the topic could the authors add one/two sentences for each procedure or method that is relevant for the points made in the present manuscript? For example at page 3 a landmark-based procedure is mentioned. Could the authors briefly mention what was used as a landmark and which software or metric allowed this? Additionally, one shortcoming of the dataset is the lack of a characterization of the audio presentation. The transforms were computed on the source waveforms. However, it is not so clear what waveform arrives at the ear and how the gain adjustments scaled them. This information would be important for the investigations of neural sound encoding with this dataset. In addition, this information would be of great use for users who intend to calculate other transforms on the source waveforms.
Secondarily, while the authors' premises are valid in only providing a quick and preliminary analysis of the data as a quality check, it would be helpful to briefly introduce the research context behind it. A couple of sentences in the introduction would be sufficient to explain the expected activated networks for those readers that are unfamiliar with neuroimaging studies on music.
Finally, it would be helpful to briefly outline which ones of the many tools are made available in the paper directly, and which ones are preparatory for future studies. For example the computation of cepstral coefficients is often used in speech recognition, to separate vocal tract from speech descriptors and to create features for recognition, but is not clear if and how they are used in the current manuscript. We are aware of the amount of work carried out by the authors, in documenting and making the material available, however the achieved goals should not be mixed with future ones. Please amend in order to outline the current state of the research framework to the general reader.

Detailed comments:
Abstract. What is the advantage of a slow event-related paradigm (long ISI) for this experiment? Please comment on it in the text.
p.4. Some more information about the acquisition procedures and the procedures applied to the preprocessed dataset should be provided, or at least stated where they can be found (e.g. scripts in the dataset). For example a short description of the FOV, (e.g. covering temporal and inferior frontal areas of the brain) may be added in the article itself. For example, a distortion correction technique is mentioned, along with a realignment of the fMRI volumes. A template image is also mentioned. How is the template generated? Is there a slice-time correction step? Is the distortion correction applied before, after or during the realignment?
p.5. In the description of the mel coefficients could the authors briefly explain the difference between DCT_II and the other flavors of the DCT algorithm (or cite a representative reference)?
p.7. What is the contrast of the univariate analysis? I understand that this is a standard analysis, but in our opinion its description is missing some crucial information on how it was performed. For example: a) I'd like some basic information on how the validation analyses were done: is the source code for the classification provided too? b) Given this statement "Given the confirmed wide-spread availability of genre-discriminating signal we conclude that these data are suitable for studying the representation of music and auditory features." it would be good to have a brief overview of previous studies (with references) in this field and the brain areas they usually associate with music processing + processing of different genres to facilitate understanding for readers without a background in the neurobiology of music.
p.8. Figure 4 is a bit confusing. As I understand it the color code on top only refers to Figure 4a. Maybe label a) statistical analysis and b) classification analysis and use different colors in b. The legend refers to a small symphonic music cluster in RIFG which I cannot identify on the actual brain images.

Minor comments
Abstract and p.2. The term 'quality control' might be slightly misleading. One could just state that basic GLM and MVPA analyses were conducted as a 'proof of principle' of effective classification on the data.
p.3 Correct the sentence 'continuum of research question' into '… questions'
p.3 Participants: which of the functional scans was done first?
p.4 Were the sound adjustments done only once at the beginning of the experiment or for every functional run? Is the level of the sound adjustment documented in the material?
p.5 Correct the sentence 'flag whether a control question with presented' into '…was presented'
p.5 "As the exact same alignment target was used, this led to a very similar field-of-view configuration across acquisitions." Somewhat unclear. Does this mean that the subjects' head was in the same position in both experiments (i.e. voxels match functionally) or does it mean that the FOV was in the same scanner's coordinates? It is unlikely that both conditions are met simultaneously.
p.5 "While raw BOLD data are suitable for further analysis, they suffer from severe geometric distortions. BOLD data that have been distortion-corrected 9 at the scanner console are provided in bold_dico.nii.gz." Please provide more information about the correction (e.g. k-space or voxel space correction). Is it documented? Is all information that was used available in the dataset? Users may want to apply their own correction.
p.5 "A log file of the automated conversion procedure is provided in the same directory (conversion.log)". Conversion of what?
p.5 "Audio features" Are the sound files provided with the dataset? You might want to state that here. Otherwise state how the user could obtain them to apply their own transforms. It would be important to characterize the sound presentation system. The transforms were calculated on the sound files but it is not so clear what actually reached the ear and how the genres might have differed in energy etc. This information could greatly extend the usefulness of this rich dataset.
p.5 "frequency domain filter bank" Are the parameters provided? If yes, mention that here.
p.5 The description of the audio features reports a highest signal frequency of ~22KHz but the signals in figure 1 show a clear decrease in gain for all stimuli after ~16KHz. Could the authors document what the sampling frequency of the audio waves was and if a low pass filter was applied?
Competing Interests: No competing interests were disclosed.

We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.