A studyforrest extension, an annotation of spoken language in the German dubbed movie “Forrest Gump” and its audio-description

Here we present an annotation of speech in the audio-visual movie "Forrest Gump" and its audio-description for a visually impaired audience, as an addition to a large public functional brain imaging dataset (studyforrest.org). The annotation provides information about the exact timing of each of the more than 2,500 spoken sentences, 16,000 words (including 202 non-speech vocalizations), and 66,000 phonemes, and their corresponding speaker. Additionally, for every word, we provide lemmatization, a simple part-of-speech tagging (15 grammatical categories), a detailed part-of-speech tagging (43 grammatical categories), syntactic dependencies, and a semantic analysis based on word embeddings that represent each word in a 300-dimensional semantic space. To validate the dataset's quality, we built a model of hemodynamic brain activity based on information drawn from the annotation. Results suggest that the annotation's content and quality enable independent researchers to create models of brain activity correlating with a variety of linguistic aspects under conditions of near-real-life complexity.


Introduction
Cognitive and psychiatric neuroimaging are moving towards studying brain functions under conditions of lifelike complexity 1,2. Motion pictures 3 and continuous narratives 4,5 are increasingly utilized as so-called "naturalistic stimuli". Naturalistic stimuli are usually designed for commercial purposes and to entertain their audiences. Thus, the temporal structure of their feature space is usually not explicitly known, leading to an "annotation bottleneck" 6 when they are used for neuroscientific research.
Data-driven methods like inter-subject correlation (ISC) 7 or independent component analysis (ICA) 8 are often used to analyze such fMRI data in order to circumvent this bottleneck. However, the use of data-driven methods alone falls short of associating results with particular stimulus events 9. Model-driven methods, like the general linear model (GLM), which are based on stimulus annotations, can be used to test hypotheses on specific brain functions under more ecologically valid conditions, to statistically control for confounding stimulus features, and to explain not just "how" the brain is responding to a stimulus but also "why" 10. Studies using GLMs based on annotations of a stimulus' temporal structure have elucidated, for example, how the brain responds to visual features of a movie 11 or speech-related features of a narrative 12. Furthermore, stimulus annotations can inform data-driven methods about a stimulus' temporal dynamics, or model-driven and data-driven methods can be combined to improve the interpretability of results 13.
Here we provide an annotation with the exact onset and offset of each sentence, word, and phoneme (see Table 1 for an overview) spoken in the audio-visual movie "Forrest Gump" 14 and its audio-description (i.e. the movie's soundtrack with an additional narrator) 15. fMRI data of participants watching the audio-visual movie 16 and listening to the audio-description 17 are the core data of the publicly available studyforrest dataset (studyforrest.org). The current publication enables researchers to model hemodynamic brain responses that correlate with a variety of aspects of spoken language, ranging from a speaker's identity to phonetics, grammar, syntax, and semantics. It extends already available annotations of portrayed emotions 18, perceived emotions 19, as well as cuts and locations depicted in the movie 20. All annotations can be used in any study focusing on aspects of real-life cognition by serving as additional confound measures describing the temporal structure and feature space of the stimuli.

Materials and methods

Stimulus
We annotated speech in the slightly shortened "research cut" 17 of the movie "Forrest Gump" and its temporally aligned audio-description 16 that was broadcast as an additional audio track for visually impaired listeners on Swiss public television 15. The plot of the original movie is already carried by an off-screen voice of the main character Forrest Gump. In the audio-description, an additional male narrator describes essential aspects of the visual scenery when there is no off-screen voice, dialog, or other relevant auditory content.
Annotation procedure
Preliminary, manual orthographic transcripts of dialogues, non-speech vocalizations (e.g. laughter or groaning) and the script for the audio-description's narrator were merged and converted to Praat's 21 TextGrid format. This merged transcript contained rough onset and offset timings for small groups of sentences, and was further edited in Praat for manual validation against the actual content of the audio material. The following steps were performed by a single person, already familiar with the stimulus, in several passes to iteratively improve the quality of the data: approximate temporal onsets and offsets were corrected; intervals containing several sentences were split into intervals containing only one sentence; when two or more persons were speaking simultaneously, the less dominant voice was dropped; low volume non-speech vocalizations or low volume background speech (especially during music or continuous environmental noise) which were subjectively assessed to be incomprehensible for the audience were also dropped.

Table 1. Overview of the annotation's content for the audio-description of "Forrest Gump" (i.e. the audio-only variant of the movie) that comprises the additional narrator. Counts are given for the whole stimulus (all) and its individual segments used during fMRI scanning. The category sentences comprises complete grammatical sentences which are additionally marked in the annotation with a full stop at the end ("my feet hurt."). It also comprises questions ("do you want a chocolate?"), exclamations ("run away!"), or non-speech vocalizations in quick succession ("ha, ha, ha"), or in isolation (e.g. "Forrest?", "Forrest!", "ha") at time points when speakers switch rapidly. The category words comprises each word or non-speech vocalization (N=202) in isolation.

Category     All      1      2      3      4      5      6      7      8
Sentences    2528     292    366    320    352    344    289    365    200
Words        16187    2089   2162   2115   2035   2217   2033   2322   1214
Phonemes     66611    8802   8727   8770   8557   9197   8353   9351   4854

We then used the Montreal Forced Aligner v1.0.1 22 to algorithmically identify the exact onset and offset of each word and phoneme. To enable the aligner to look up the phonemes embedded within each word, we chose the accompanying German pronunciation dictionary provided by Prosodylab 23 that uses the Prosodylab PhoneSet to describe the pronunciation of phonemes. To improve the detection rate of the automatic alignment, the dictionary was manually updated with German words that occur in the stimuli but were originally missing in the dictionary. The pronunciation of English words and phonemes occurring in the otherwise German audio track was taken from the accompanying English pronunciation dictionary (following the ARPAbet PhoneSet). The audio track of the audio-description was converted from FLAC to WAV via FFmpeg v4.1.4 24 to meet the aligner's input requirements. This WAV file, the merged transcription, and the updated dictionary were submitted to the aligner that first trained an acoustic model on the data and then performed the alignment.
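For readers who want to reproduce this step, the conversion can be scripted in a few lines. The following is a minimal sketch under stated assumptions: the file names are hypothetical, and downsampling to 16 kHz mono is an assumption (a common requirement of forced aligners, not stated above).

```python
# Minimal sketch of the FLAC-to-WAV conversion step described above.
# File names are hypothetical; 16 kHz mono is an assumption.
import subprocess

subprocess.run([
    "ffmpeg",
    "-i", "audio_description.flac",  # hypothetical input file
    "-ar", "16000",                  # target sample rate (assumption)
    "-ac", "1",                      # mono (assumption)
    "audio_description.wav",
], check=True)

# The resulting WAV file, the merged transcript, and the updated dictionary
# were then submitted to the Montreal Forced Aligner; in the v1.0 releases
# the train-and-align workflow was exposed as a dedicated command-line tool
# (mfa_train_and_align).
```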
The resulting timings of words and phonemes were corrected manually and iteratively in several passes using Praat v6.0.22 21: in a first step, onsets and offsets on which the automatic alignment had performed only moderately well were corrected. Some low volume sentences that are spoken in continuously noisy settings (e.g. during battle or hurricane scenes) were removed due to poor overall alignment performance. In a second step, the complete sentences of the orthographic transcription were copied into the annotation created by the aligner. In a third step, a speaker's identity was added to each sentence (see Table 2 for the most often occurring speakers). During every step, previous results were repeatedly checked for errors and further improved.
We employed the Python package spaCy v2.2.1 25 and its accompanying German language model (de_core_news_md), which was trained on the TIGER Treebank corpus 26, to automatically extract linguistic features of each word in its corresponding sentence. Non-speech vocalizations were dropped from the sentences before analysis to improve results. We then extracted, for each word, its part-of-speech (i.e. grammatical tagging or word-category disambiguation), syntactic dependencies, lemma, word embedding (i.e. a multi-dimensional meaning representation of a word), and whether the word is one of the most common words of the German language (i.e. whether it is part of a stop list).
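As an illustration of this extraction step, the sketch below shows how the listed features can be obtained from spaCy's Token objects; the example sentence is our own invention, and the exact tag values depend on the model version.

```python
# Minimal sketch of the linguistic feature extraction described above.
# The example sentence is invented; outputs depend on the model version.
import spacy

nlp = spacy.load("de_core_news_md")
doc = nlp("Das Leben ist wie eine Schachtel Pralinen.")

for token in doc:
    print(
        token.text,        # the word itself
        token.pos_,        # simple POS tag (Universal Dependencies tag set)
        token.tag_,        # detailed POS tag (TIGER/STTS scheme)
        token.dep_,        # syntactic dependency arc label
        token.head.text,   # the word's syntactic head
        token.lemma_,      # base form (lemma)
        token.is_stop,     # whether the word is on spaCy's German stop list
        token.vector[:3],  # first values of the 300-dimensional embedding
    )
```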

Data legend
The annotation is available in two different versions, both providing the same information: a) as a text-based Praat TextGrid file, and b) as a text-based, tab-separated value (TSV) formatted table. The following descriptions refer to the ten columns of the TSV file, namely onset, duration, person, text, pos, tag, dep, lemma, stop, and vector.

Onset (onset)
The onset of the sentence, word, or phoneme. Time stamps are provided in the format seconds.milliseconds from stimulus onset.

Duration (duration)
The duration of the sentence, word, or phoneme, provided in the format seconds.milliseconds.

Speaker identity (person)
Name of the person who speaks the sentence, word, or phoneme. See Table 2 for the ten most often occurring speakers.
Text (text)
The text of a spoken sentence or word, or the pronunciation of a phoneme. Phonemes of German words follow the Prosodylab PhoneSet; phonemes of English words follow the ARPAbet PhoneSet.
Simple part-of-speech tag (pos)
A simple part-of-speech tagging (grammatical tagging; word-category disambiguation) of words. The tag labels follow the Universal Dependencies v2 POS tag set (universaldependencies.org). See Table 3 for a description of the labels and the respective counts of all 15 labels. Nouns that spaCy mistook for proper nouns, or vice versa, were corrected via script. Additionally, in this column sentences are tagged as SENTENCE and phonemes as PHONEME to facilitate filtering in potential further processing steps.
Detailed part-of-speech tag (tag)
A detailed part-of-speech tagging of words following the TIGER Treebank annotation scheme 26, which is based on the Stuttgart-Tübingen-Tagset 27. See Table 4 for a description of the labels and the respective counts of the 15 most often occurring labels (43 labels overall). Nouns that spaCy mistook for proper nouns, or vice versa, were corrected via script.

Syntactic dependency (dep)
Information about a word's syntactic dependencies with other words within the same sentence. Information follows the TIGER Treebank annotation scheme 26 and is given in the format: "arc label;word's head;word's child1, word's child2, ...", where the "arc label" (see Table 5) describes the type of syntactic relation that connects a "child" (the current word) to its "head".
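For illustration, a cell in this column could be parsed as follows. This is a minimal sketch assuming the three-field format stated above; cells deviating from it would need extra handling.

```python
# Minimal sketch for parsing a dep cell of the assumed form
# "arc label;word's head;word's child1, word's child2, ...".
def parse_dep(cell: str) -> dict:
    arc_label, head, children = cell.split(";", 2)
    return {
        "arc": arc_label,  # type of syntactic relation
        "head": head,      # the word's governing head
        "children": [c.strip() for c in children.split(",") if c.strip()],
    }
```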

Lemmatization (lemma)
The base form (root) of a word.

Common Word (stop)
This column indicates whether the word is part of a stop list, i.e. whether it is one of the most common words of the German language (True vs. False).

Word embedding (vector)
A 300-dimensional word vector providing a multi-dimensional meaning representation of a word. For out-of-vocabulary words, whose vector would consist of 300 zeroes, the cell was set to # to save space.
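To illustrate how the ten columns can be consumed programmatically, here is a minimal pandas sketch. The treatment of the # placeholder follows the description above, while the exact serialization of the 300 vector values in the TSV is an assumption.

```python
# Minimal sketch for loading and filtering the TSV variant of the annotation.
import numpy as np
import pandas as pd

anno = pd.read_csv("annotation/fg_rscut_ad_ger_speech_tagged.tsv", sep="\t")

# Sentences and phonemes are tagged as such in the pos column (see above);
# all remaining rows are word-level entries.
words = anno[~anno["pos"].isin(["SENTENCE", "PHONEME"])]

def parse_vector(cell):
    """Return the 300-dim embedding; '#' marks out-of-vocabulary words."""
    if cell == "#":
        return np.zeros(300)
    # Assumes whitespace-separated numbers; adapt to the file's actual format.
    return np.array(str(cell).strip("[]").split(), dtype=float)

vectors = words["vector"].map(parse_vector)
```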

Dataset content
The annotation comes in two different versions. First, as a text-based TextGrid file (annotation/fg_rscut_ad_ger_speech_tagged.TextGrid) to be conveniently edited using the software Praat 21. Second, as a text-based, tab-separated-value (TSV) formatted table (annotation/fg_rscut_ad_ger_speech_tagged.tsv) in accordance with the Brain Imaging Data Structure (BIDS) 28. The dataset and validation data are available from the Open Science Framework, DataLad, and Zenodo (see Underlying data) 29,30,31. The source code for all descriptive statistics included in this paper is available in code/descriptive-statistics.py (Python script).
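The TextGrid variant can likewise be read programmatically. The sketch below uses the third-party textgrid package as one possible reader; its API, and this particular way of reading the file, are assumptions (Praat itself or any other TextGrid parser works just as well).

```python
# Minimal sketch for reading the TextGrid variant with the third-party
# "textgrid" package (an assumption; any TextGrid reader could be used).
import textgrid

tg = textgrid.TextGrid.fromFile(
    "annotation/fg_rscut_ad_ger_speech_tagged.TextGrid"
)

for tier in tg:            # tiers, e.g. for sentences, words, phonemes
    for interval in tier:
        if interval.mark:  # skip empty (silent) intervals
            print(tier.name, interval.minTime, interval.maxTime, interval.mark)
```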

Dataset validation
In order to assess the annotation's quality, we investigated whether contrasting speech-related events with events without speech leads to increased activation in areas known to be involved in language processing 32. Moreover, we tested whether two similar linguistic concepts providing high semantic information (proper nouns and nouns), contrasted with a concept providing low semantic information (coordinate conjunctions), lead to increased activation in congruent brain areas.
We used a dataset providing blood oxygenation level-dependent (BOLD) functional magnetic resonance imaging (fMRI) data of 20 subjects (age 21-38 years, mean age 26.6 years, 12 male) listening to the 2 h audio-description (7 Tesla, 2 s repetition time, 3599 volumes, 36 axial slices, thickness 1.4 mm, 1.4 × 1.4 mm in-plane resolution, 224 mm field-of-view) 17. Data were already corrected for motion at the scanner computer. Further, individual BOLD time-series were already aligned by non-linear warping to a study-specific T2*-weighted echo planar imaging (EPI) group template (see 17 for exact details).
All further steps of the current analysis were carried out using FEAT v6.00 (FMRI Expert Analysis Tool) 33 as part of FSL v5.0.9 (FMRIB's Software Library) 34. Data of one participant were dropped due to invalid distortion correction during scanning. Data were temporally high-pass filtered (cut-off 150 s), spatially smoothed (Gaussian kernel; 4.0 mm FWHM), and the brain was extracted from surrounding tissue. A grand-mean intensity normalization of the entire 4D dataset was performed by a single multiplicative factor.
We implemented a standard three-level, voxel-wise general linear model (GLM) to average parameter estimates across the eight stimulus segments, and later across 19 subjects. At the first level, analyzing each segment for each subject individually, we created 26 regressors (see Table 6) based on events drawn from the annotation. The 20 most often occurring detailed part-of-speech labels (nn with N=2620 to prf with N=157) were modeled as boxcar functions from onset to offset of each word. The remaining part-of-speech labels were pooled into a single new label (tag_other; N=1123) and modeled as a boxcar function from a word's onset to offset. The 80 most often occurring phonemes (n with N=6053 to IY1 with N=32) were pooled into a single regressor phonemes (N=65251) and modeled as boxcar functions from a phoneme's onset to offset. The end of each complete grammatical sentence was modeled as an impulse event (N=1651) to capture variance correlating with sentence comprehension. "No-speech" events (no-sp; N=264), representing moments when no speech was audible and serving as a control condition, were created such that a sufficient number of events and a minimum separation of speech and no-speech events were achieved. Events were randomly positioned in intervals without audible speech that lasted at least 3.6 s. Each event of the no-speech condition had to have a minimum distance of 1.8 s to any onset or offset of a word, and to any onset of another no-speech event. A length of 70 ms was chosen for no-speech events, matching the average length of phonemes. Lastly, we used continuous measures of low-level auditory features (left-right difference in volume, fg_ad_lrdiff, and root mean square energy, fg_ad_rms), computed for and averaged across every movie frame (40 ms) via Python script, to capture variance correlating with assumed low-level perceptual processes. Time series of events were convolved with FSL's "Double-Gamma HRF" to create the regressors; the correlation of these regressors over the time course of the whole stimulus is shown in Figure 1. Temporal derivatives were also included in the design matrix to compensate for regional differences between the modeled and actual HRF. Finally, six motion parameters were used as additional nuisance regressors, and the design matrix was subjected to the same temporal filtering as the BOLD time series. The following three t-contrasts were defined: 1) words (all 21 tag-related regressors) > no-speech (no-sp), 2) proper nouns (ne) > coordinate conjunctions (kon), and 3) nouns (nn) > coordinate conjunctions (kon).
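To make the regressor construction concrete, the following sketch builds one such regressor as described: a boxcar from onset to offset sampled on a fine time grid, convolved with a double-gamma HRF, and resampled at the scan times. The HRF parameters are SPM-style defaults and thus an assumption; FSL's "Double-Gamma HRF" is similar but not necessarily identical.

```python
# Minimal sketch of one boxcar regressor convolved with a double-gamma HRF.
# HRF parameters are SPM-style defaults (an assumption); repetition time and
# volume count are taken from the dataset description above.
import numpy as np
from scipy.stats import gamma

def double_gamma_hrf(dt=0.1, duration=32.0):
    t = np.arange(0.0, duration, dt)
    peak = gamma.pdf(t, 6)          # positive response, peaking around 5-6 s
    undershoot = gamma.pdf(t, 16)   # delayed undershoot
    hrf = peak - undershoot / 6.0
    return hrf / hrf.sum()

def boxcar_regressor(onsets, durations, tr=2.0, n_vols=3599, dt=0.1):
    """Boxcar (onset to offset) convolved with the HRF, sampled once per TR."""
    timeline = np.zeros(int(n_vols * tr / dt))
    for onset, dur in zip(onsets, durations):
        timeline[int(onset / dt):int((onset + dur) / dt)] = 1.0
    convolved = np.convolve(timeline, double_gamma_hrf(dt))[: len(timeline)]
    return convolved[:: int(tr / dt)]   # one value per volume
```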
The second-level analysis that averaged contrast estimates across the eight stimulus segments per subject was carried out using a fixed-effects model by forcing the random-effects variance to zero in FLAME (FMRIB's Local Analysis of Mixed Effects) 35,36. The third-level analysis, which averaged contrast estimates across subjects, was carried out using a mixed-effects model (FLAME stage 1) with automatic outlier deweighting 36,37. Z (Gaussianised T/F) statistic images were thresholded using clusters determined by Z>3.4 and a corrected cluster significance threshold of p<.05 37. Brain regions associated with observed clusters were labeled using the Jülich Histological Atlas 38,39 and the Harvard-Oxford Cortical Atlas 40 provided by FSL. Figure 2 depicts the results of the three contrasts (z-threshold Z>3.4; p<.05 cluster-corrected). The contrast words > no-speech yielded four significant clusters (see Table 7): one left-lateralized cluster spanning from the angular gyrus and inferior posterior supramarginal gyrus across the superior and middle temporal gyrus, including parts of Heschl's gyrus and the planum temporale; a second left cluster in (inferior) frontal regions, including the precentral gyrus, pars opercularis (Brodmann Area 44; BA44), and pars triangularis (BA45); a similar cluster in the right hemisphere, spanning from the angular gyrus across the superior and middle temporal gyrus but also including inferior frontal regions (pars opercularis and pars triangularis); and a fourth significant cluster located in the left thalamus.
The contrast proper nouns > coordinate conjunctions yielded nine significant clusters (see Table 8): one left-lateralized cluster spanning from the angular gyrus across the planum temporale and superior temporal gyrus, partially covering Heschl's gyrus, into the anterior middle temporal gyrus; a largely congruent but smaller cluster in the right hemisphere; two clusters in the posterior cingulate cortex and precuneus of both hemispheres; and three small clusters in the right occipital pole, right Heschl's gyrus, and left superior lateral occipital pole.

Table 8. Significant clusters (z-threshold Z>3.4; p<.05 cluster-corrected) for the contrast proper nouns (ne) > coordinate conjunctions (kon). Clusters sorted by voxel size. The first brain structure given contains the voxel with the maximum Z-value, followed by brain structures from posterior to anterior, and partially covered areas (l. = left; r. = right; c. = cortex; g. = gyrus). Columns report the max location (MNI) and center of gravity (MNI) of each cluster.

The contrast nouns > coordinate conjunctions yielded four significant clusters (see Table 9): two clusters that are slightly smaller than the lateral temporal clusters of the contrast proper nouns > coordinate conjunctions, in this case spanning from the angular gyrus in the left hemisphere and from the planum temporale in the right hemisphere into the anterior part of the superior temporal cortex, and, finally, two small right-lateralized clusters in the right posterior cingulate gyrus and right precuneus.
For the contrast words > no-speech, results show increased hemodynamic activity in a bilateral cortical network including temporal, parietal, and frontal regions related to processing spoken language 32,41,42. These clusters resemble results of previous studies that implemented an ISC approach to analyze fMRI data acquired with naturalistic auditory stimuli 5,43,44. We do not find significantly increased activations in midline areas (like the posterior cingulate cortex and precuneus, or anterior cingulate cortex and medial frontal cortex), which showed synchronized activity across subjects in previous studies. In this regard, our results are similar to those of 4, who implemented both an ISC and a GLM analysis. In that study, the ISC analysis showed synchronized activity in midline areas, but the GLM analysis contrasting blocks of listening to narratives with blocks of a resting condition showed significantly decreased activity in these areas.
The two contrasts comparing nouns and proper nouns, respectively, to coordinate conjunctions yielded increased activation partially located in early sensory regions (Heschl's gyrus 45) and most prominently in adjacent regions bilaterally (planum temporale; superior temporal gyrus 46,47). We chose nouns and proper nouns for these two contrasts because they represent linguistically similar concepts but are uncorrelated in the German language and in the stimulus (see Figure 1). We contrasted both with coordinate conjunctions because nouns and proper nouns are linguistically different from, as well as uncorrelated with, coordinate conjunctions. Although nouns and proper nouns are uncorrelated, both contrasts led to largely spatially congruent clusters. Results suggest that models based on our annotation of similar linguistic concepts correlate with hemodynamic activity in spatially similar areas. We confirmed the validity of this interpretation by testing whether the spatial congruency could be attributed to a negative correlation of the coordinate conjunctions with the modeled time series, which turned out not to be the case. In summary, the results of our exploratory analyses suggest that the annotation of speech meets basic quality requirements to serve as a basis for model-based analyses that investigate language perception under more ecologically valid conditions.

Data availability
Underlying data
Zenodo: A studyforrest extension, an annotation of spoken language in the German dubbed movie "Forrest Gump" and its audio-description (annotation).

Zenodo: A studyforrest extension, an annotation of spoken language in the German dubbed movie "Forrest Gump" and its audio-description (validation analysis). https://doi.org/10.5281/zenodo.4382188 30.

Open Peer Review

Introduction
1. In the introduction, the authors make a general point about model-driven GLM analyses providing a series of advantages over data-driven methods, especially in terms of interpretability. They write that "use of data-driven methods alone falls short of associating results with particular stimulus events", while GLMs "allow to test hypotheses on specific brain functions under more ecologically valid conditions, to statistically control for confounding stimulus features, and to explain not just 'how' the brain is responding to a stimulus, but also 'why'". A few objections can be raised against these claims (or against the way they are presented here). While GLMs are, on the surface, more easily interpretable, their use (especially in naturalistic contexts) poses a few challenges. For example: a) global characteristics of the stimulus make interpretation of baselines tricky (and their commensurability across datasets challenging), with consequences for the interpretation of the effects of specific features; b) the choice of covariates, while allowing to control for collinearity, can also alter the interpretation of the effects of individual features; in other words, interpretation is conditional on the model; c) causal claims (a.k.a. the "why") are notoriously hard to make on the basis of linear models alone. While I am sympathetic to the idea that a lot can be gained by the use of these simple tools, the authors could consider nuancing or further specifying some of their claims.
2. My next (related) point concerns the link between the annotations presented by the authors and GLMs. In the introduction, the authors seem to present the usability of their annotations as almost constrained to analytic contexts where GLMs are used, or as heuristic tools for data-driven methods. But nothing prevents these annotations from being used as either predictor or target features in other analysis frameworks (e.g., predictive models). The authors could consider being a bit more ambitious, and slightly reframing the first paragraphs along the lines of these considerations.

3. In the introduction, the authors claim that "stimulus annotations can inform data-driven methods about a stimulus' temporal dynamics". I think the gist of it is clear, but maybe a more concrete example or wording would help.

4. A minor point about wording: should "the current publication" be "the present publication" or "this publication"? I am not a native speaker, so it may just be an issue with my English (in which case, please ignore this comment).

Materials and methods
1. It may be beneficial to mention early in the paper that the movie is presented in German. This is also a great feature of the dataset, it being one of the very few datasets (to my knowledge) with stimuli not in English, which is quite crucial for matters of cross-linguistic generalization.

2. One aspect that is not entirely clear from the paper is which component of the studyforrest dataset the annotations refer to and can be used for. Can they only be used for data from subjects where the movie was presented only auditorily, or can the same annotations (filtering out "narrator" rows) also be used for those parts of the dataset where participants are presented with the movie both auditorily and visually? I think it would be highly beneficial to the manuscript to a) provide a little recap of what studyforrest is, its sub-components, and where to find them; and b) make it more explicit for which batches of subjects/tasks these new annotations can be used.

3. More of a clarification question than a suggestion: the onsets in the annotation's tsv file refer to the full movie file. It should be possible to cross-reference these onsets with run- and subject-specific events files from the BIDS dataset. If so, could more details on how to get from onsets in the annotation files to run-specific GLM-ready onsets be provided (at a high level, e.g., "onsets in the annotation files can be cross-referenced with time-stamps in the event files x and x to retrieve subject-specific and run-specific onsets …")? Disclaimer: I am relatively new to BIDS, and this may be a trivial point (in which case ignore this comment).
4. Once again a minor point, but the fact that the annotations are BIDS-compliant and shared as tsv could already be mentioned in the introduction (or even in the abstract).

In "Annotation procedure", the authors introduce the fact that the stimulus is also annotated at the sentence-level. However, a definition of a sentence is only given in the description for Table 1. Could a quick reference of what counts as a sentence be added to the main text?

6. The authors mention having dropped the "less dominant" voice when speakers overlap. Could a couple of words be added on how that was defined (volume? character prominence?)? And since some could argue these events are also potentially interesting bits of information: do instances of overlap represent a significant portion of the stimulus, or does this occur only in a few instances?
7. On page 4, the authors talk about feature extraction in terms of analysis (e.g., "automatically analyze linguistic features of each word in their corresponding sentence", and performing "analyses" regarding part of speech). Probably a matter of taste, but I would rather refer to these steps as "extraction" of linguistic features; to me, the use of "analysis" suggests that more has been performed on the features (e.g., visualization or some form of validation) than mere feature extraction.

8. Which type of word embeddings was extracted should be specified.
9. spaCy has very good documentation explaining the interpretation of each feature and how extraction pipelines work (e.g., https://spacy.io/usage/linguistic-features). References to that (e.g., simply links to the documentation) could be added, for example in the sentence "We then performed analyses regarding part-of-speech…", whenever each feature/pipeline is introduced. It would make it easier for non-linguists to get a feel for what the features mean, and for experts to dig into the details of the specific pipelines used.

The paragraph "Annotation procedure" may benefit from adding an example, e.g., a sample sentence from the transcript with the corresponding annotation. It would help readers (especially those who do not have deep expertise in linguistics) to visualize what features are about and visualize the different levels of annotation. I understand this may be tricky to do without disrupting the flow of the text though and will leave it to the authors to decide whether to implement it or not.

11. In the caption of Table 2, could the expression "sentences spoken by the ten most often occurring speakers" be streamlined?

12. Could a sample of the resulting annotation dataset (whose components are described in "Data legend") be displayed as an exemplification, e.g., just the head of the table or something similar?

Dataset validation
I understand the point the authors are making, namely that it usually takes much more time to annotate the data than to collect them. However, I would like to suggest that the wording ("temporal structure", "feature space", "bottleneck"), especially in the first paragraph of the article, might be a bit too challenging to immediately understand. Maybe a gentler introduction to the topic would be helpful for some readers. Also, I think that the point the authors make is not 100% valid: the study by Hasson et al. (cited in 7) actually circumvented the problem of annotating the whole stimulus by using "reverse correlation", that is, only interpreting the time points where activity is highest. This allowed Hasson and colleagues to interpret effects on a substantive level, just by showing the frames of the movie for which brain activity was highest. I think this is also a useful and economical approach, but certainly more limited than a full annotation.
○ In the description of Table 1: "The category words comprises each word or non-speech vocalization (N=202)." I completely understand that the additional coding of whether a speech episode is a legal sentence in German would go beyond the scope of the present work. Also, users will be able to code this themselves, according to their needs, using the detailed tagging provided by the authors (and the dot at the end is also useful, though not a perfect marker, e.g. "lief und lief." or "Hörfilm e. V."). However, I think that calling a variable or feature "sentences", when it really is not coding the presence of a sentence, could be a source of future errors. The same goes for "words". So maybe the authors would want to consider using different labels for those two features.

○ Related to the point above, I think that sometimes a single sentence has longer pauses and is then divided into multiple "sentences". Take the narration at the very end of the movie: "Sie wird wieder angehoben, ..." and "und schwebt auf die Kamera zu." is coded as two sentences. Researchers interested in syntax might benefit from being aware that they have some additional work to do (which is only fair!) in order to merge those speech episodes which are really a single sentence. Additionally, if they define the sentences differently, they will have to do a different tagging of the words, too. I do not suggest that the authors need to do any additional work, but I just want to point out that the label "sentences" could mislead some language researchers. Apart from sentences being divided by pauses, there is the question of punctuation. Given that the stimulus is audio only, my personal experience is that it is sometimes difficult to decide on a definitive punctuation (where commas should occur, or whether a comma or a full stop should be used). I think that in German commas play a bigger role in defining the syntactic structure of a sentence than they might do e.g. in English. Therefore, I would suggest that some brief discussion of the general limitations of deriving syntactic information from an audio stimulus might be a useful addition to the article (of course only if the authors agree with my premise).

○ "Data legend" > "Common Word (stop)": the source or the definition of the stop list would be helpful for users who would want to cite it in their publications.

○ "Data legend" > "Word embedding": this is a very rich and useful resource, and I was immediately able to do some interesting analyses using the 300-dimensional word vector. For example, correlating the values of all words with each other gave some very interesting and plausible results (e.g. color words clustering together etc.). I think this itself could merit a separate publication with a focus on semantics. As it currently stands, there is however no information about how these data were generated and how future users could cite them and find out more about them. Therefore, I think that a bit more information could significantly increase the impact of this resource.

○ Also, when playing around with it, I noticed that sometimes two different words have the exact same 300-dimensional vector. For example: strahlend - bewölkt; überqueren - durchqueren; braun - olivgrün; gepackten - Schlamm; schütteln - wischen. Is this correct behavior, and why would two different words have the exact same values? Especially for a pair like "gepackten" - "Schlamm" that might be surprising.
○ "Dataset validation": maybe some brief examples for the "coordinating conjunction" and different "nouns" could be given for illustrative purposes ○ "Dataset validation": why were non-speech events not modeled with a boxcar, spanning their whole duration, when this was otherwise possible for single words. It would seem to me that the duration of individual words would make a boxcar function much more problematic than using a boxcar for speech-free periods (which are often short, but there are some really long ones in there as well, which might not be fully utilized when modeled with a default length of 70ms -which would amount to a stick function). ○ "Dataset validation": how do the non-speech events correlate with the other features (i.e. could they be included in the heatmap in Fig 1?) ○ Tables 7-9: could it be better to round the values to full mm? I would argue that even for 7T data (smoothed with 4mm FWHM), sub-mm precision for the MNI-coordinates might imply an accuracy that is not obtainable ○ Figure 2 legend (and pages 7, 12): I recently learned that "cf." is used to point to contrasting information (for example, if there is literature making an opposing point; https://blog.apastyle.org/apastyle/2010/05/its-all-latin-to-me.html).

Is the rationale for creating the dataset(s) clearly described? Yes
Are the protocols appropriate and is the work technically sound? Yes