Lies, irony, and contradiction — an annotation of semantic conflict in the movie "Forrest Gump"

Here we extend the information on the structure of the core stimulus of the studyforrest project (http://studyforrest.org) with a description of semantic conflict in the “Forrest Gump” movie. Three observers independently annotated the movie for episodes portraying lies, irony, or sarcasm. We present frequency statistics and inter-observer reliability measures that qualify and quantify semantic conflict in the stimulus. While the number of identified events is limited, this annotation nevertheless enriches the knowledge about the complex high-level structure of this stimulus, and can help to evaluate its utility for future studies, as well as the usability of the existing brain imaging data regarding this aspect of cognition.

This article is included in the Neuroinformatics channel.
This article is included in the Real-life cognition channel.

Introduction
Detection of semantic conflict is an important cognitive skill for human social interaction. It is required to identify lies (false statements made with the intention to deceive), but also to correctly interpret stylistic devices such as sarcasm and irony (statements with a direct meaning that is the opposite 1 or contrary 2 to the implied semantic content). As the interpretation of such events is highly context-dependent, it is difficult to study how the brain processes them in the context of real-life like interactions in complex natural environments.
In this study we explored occurrences of semantic conflict in the core stimulus of the studyforrest project (http://studyforrest.org), the motion picture "Forrest Gump", in order to evaluate whether the available brain imaging data 3,4 can be readily used to study this aspect of cognition. We annotated the presence of contradictory statements, including lies and ironic statements, as well as the portrayal of cues, such as exaggeration or raised eyebrows, that are often associated with making ironic statements. Additionally, we recorded the context that allowed observers to classify an event as contradictory.
Depending on the exact criterion used for identifying events across observers, we found only between 36 and 64 occurrences of semantic conflict or portrayal of irony cues in the entire movie stimulus. These are likely insufficient numbers for an investigation based on these data alone. However, these new annotations nevertheless contribute to a more comprehensive description of this complex movie stimulus 5,6 and may be useful as confound variables in subsequent studies.

Stimulus
The annotated stimulus was a slightly shortened (≈2 h) version of the movie Forrest Gump (R. Zemeckis, Paramount Pictures, 1994), with a dubbed German soundtrack, and is identical to the audiovisual movie annotated in 5,6. Further details on this particular movie cut, and how to reproduce it from commercially available sources, are available in 4.

Observers
Three observers (all female, age 19-20) independently annotated the movie. They were also involved in the development of the concept for this annotation.

Procedure
Observers were instructed to watch the movie from beginning to end, replaying scenes as often as required, and to detect two types of events: 1) whenever a verbal statement is made that contradicts either the immediate context or the viewer's body-of-knowledge at this point in the movie, or 2) whenever one or more cues associated with irony (predefined list, see below) are portrayed. In either case, observers had to describe the event by specifying its properties via a number of variable settings in a spreadsheet. The software video player VLC (http://www.videolan.org/vlc) was used to watch and navigate through the movie.

Data legend
For each annotated event, a total of 10 properties were recorded, each of which is described in the following sections.

Start and end
The duration of each event is recorded in start and end as the number of seconds from movie start (no subsecond precision, due to limitations of the video player time display). The time-points correspond to the onset and offset of the respective evidence. Both times can be identical in the case of events with less than one second duration. For contradictory statements, the duration covers the time from the onset of evidence of a contradiction until the end of the statement.

Sender and receiver
The identity of the character making a contradictory statement or portraying an irony cue is encoded in sender, using the character labels listed in 5. If the respective statement is directed at another movie character who is present, that character's identity is encoded in receiver.

Evidence of a contradiction
The contradiction flag indicates the presence of a contradiction in an event (1: present, 0: absent). The variable proof qualifies if the current or previous events provide the viewer with information to allow the detection of this contradiction (see Table 1). If proof is empty, the movie itself does not contain such information (e.g. a common sense contradiction).

Irony cues
The variable cues contains a space-separated list of labels for all irony cues present in a particular event. See Table 1 for a description of all possible labels.

Event category
The category variable classifies events into lies, ironic statements, and other events (value empty).

Intention
Two more variables encode whether a contradiction was used deliberately and whether this was noticed by the receiver. The variable intended encodes the presence of evidence for deliberate use (1: yes, 0: no). The variable is empty if there is no evidence for either case. The second variable intention_decoded encodes, in the same way, whether a potential receiver noticed a deliberate ironic statement or lie.
The source code for all descriptive statistics included in this paper is available in code/descriptive_stats.py (Python script).
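To make the data legend concrete, here is a minimal sketch of parsing one observer's annotation table with the ten columns described above. The file layout (tab-separated), the sample rows, and the category label "lie" are illustrative assumptions; the released files and exact label values are documented in the dataset itself.

```python
import csv
import io

# Hypothetical two-row annotation table using the ten columns described
# above; real rows would come from an observer's released file.
SAMPLE = (
    "start\tend\tsender\treceiver\tcontradiction\tproof\tcues\t"
    "category\tintended\tintention_decoded\n"
    "120\t125\tFORREST\tJENNY\t1\t\t\tlie\t1\t0\n"
    "300\t300\tLTDAN\tFORREST\t0\t\texaggeration\t\t\t\n"
)

def load_events(fileobj):
    """Parse an annotation table into a list of per-event dicts."""
    return list(csv.DictReader(fileobj, delimiter="\t"))

events = load_events(io.StringIO(SAMPLE))
lies = [e for e in events if e["category"] == "lie"]
print(len(events), len(lies))  # -> 2 1
```

Note that start and end can be identical (second row), matching the whole-second time resolution of the annotation.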

Dataset validation
We used an automated procedure to check the annotation records of individual observers for errors or potential problems. Observers submitted their annotations in tabular form to a script that generated a list of error and warning messages. Using this feedback, observers double-checked their annotations as often as necessary until no objective errors were found and all warning messages were confirmed to be false positives. The tests included, for example, plausibility of timing information (no end time before the respective start time) or the presence of unknown condition labels.
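The checks described above could be sketched as follows; this is not the authors' actual validation script, just a minimal illustration of the two example tests named in the text (timing plausibility and known labels), with a hypothetical cue list standing in for Table 1.

```python
def validate(events, known_cues):
    """Return error/warning messages for an annotation table, mirroring
    the kinds of objective checks described in the text."""
    msgs = []
    for i, e in enumerate(events):
        # timing plausibility: no end time before the respective start time
        if int(e["end"]) < int(e["start"]):
            msgs.append(f"row {i}: end time before start time")
        # presence of unknown condition labels
        for cue in e.get("cues", "").split():
            if cue not in known_cues:
                msgs.append(f"row {i}: unknown cue label '{cue}'")
        if e["contradiction"] not in ("0", "1"):
            msgs.append(f"row {i}: invalid contradiction flag")
    return msgs

# Hypothetical rows; 'exaggeration' is a cue mentioned in the text,
# the full label list is given in Table 1.
rows = [
    {"start": "10", "end": "12", "cues": "exaggeration", "contradiction": "1"},
    {"start": "50", "end": "45", "cues": "smirk", "contradiction": "2"},
]
print(validate(rows, known_cues={"exaggeration"}))
```

In the described workflow, observers would iterate on their tables until such a function returns no objective errors and all remaining warnings are confirmed false positives.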
In order to assess inter-observer agreement of annotations, we used a two-step approach. First, the temporal locations of events depicting any relevant property were determined by comparing annotation timing across observers. The columns in Table 2 report agreement statistics for events defined by at least one, two, or all three observers recording an annotation for the same sender at the same time. In the case that individual observers reported events of different length, or with only partially overlapping duration, only the time-windows during which the required minimum number of observers reported an event were considered.
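The first step, finding time-windows covered by a minimum number of observers, can be sketched at the annotation's one-second resolution. This is an illustrative reimplementation under that assumption, not the released analysis code (which is in descriptive_stats.py).

```python
from collections import Counter

def overlap_seconds(per_observer_events, min_observers):
    """Return the movie seconds covered by events of at least
    `min_observers` annotators; events are inclusive (start, end)
    tuples in whole seconds."""
    coverage = Counter()
    for events in per_observer_events:
        covered = set()
        for start, end in events:
            covered.update(range(start, end + 1))
        coverage.update(covered)  # each observer counts once per second
    return sorted(t for t, n in coverage.items() if n >= min_observers)

# Three observers reporting partially overlapping events for one sender
obs = [[(100, 104)], [(102, 106)], [(103, 103)]]
print(overlap_seconds(obs, 2))  # -> [102, 103, 104]
print(overlap_seconds(obs, 3))  # -> [103]
```

Raising the required number of observers shrinks the retained time-windows, which corresponds to the increasingly strict columns of Table 2.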
In the second step, we computed Fleiss' Kappa 7 for each individual property of an annotation separately, with respect to it being consistently assigned or non-assigned to the identified events (Table 2). We observe increasing inter-observer agreement for all annotated properties with increasing agreement of annotation timing, approaching "substantial" or "almost perfect" agreement according to the conventions put forth by 8. The Python script to compute all descriptive statistics presented in the paper from the released annotations is provided. The released data, code, and manuscript sources are also available on Github (https://github.com/psychoinformatics-studyforrest-paper-ironyannotation).
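For reference, Fleiss' Kappa for a binary assigned/non-assigned property can be computed from a subjects-by-categories count matrix (one row per identified event, one column per category, each row summing to the number of raters). This is a generic textbook implementation, not the code used in the paper.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a subjects x categories count matrix, where
    each row sums to the (constant) number of raters."""
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    # observed per-subject agreement
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_subjects
    # chance agreement from overall category proportions
    total = n_subjects * n_raters
    p_j = [sum(row[j] for row in ratings) / total
           for j in range(len(ratings[0]))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Three events rated by three observers on a binary property
# (columns: assigned, non-assigned); perfect agreement gives kappa = 1.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # -> 1.0
```

Values above roughly 0.61 and 0.81 correspond to "substantial" and "almost perfect" agreement under the conventions of 8.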

Data and software availability
Author contributions
MH contributed to the design of the annotation effort, performed the dataset validation, and wrote the paper; PI contributed to the design, coordinated the annotation effort, and wrote the paper. Both authors agreed to the final content of the paper.

Table 2. Annotation inter-observer agreement statistics. Number of events and categorization agreement are presented for three levels of inter-observer agreement on the temporal location and the performing movie character. The number of events for any particular event property is determined by majority vote across observers, i.e. an event is counted when more observers indicate the presence of a property than its absence. Exhaustive technical detail on the statistical analysis can be found in the descriptive_stats.py Python script.

Referee report
In this data note, the authors analyzed and reported semantic conflicts in the movie "Forrest Gump". This data collection has been well conducted and is part of a larger project named the studyforrest project.
An explanation, in a few lines, of the goal of the studyforrest project and of the rationale behind this data note would be welcome. The authors mention that this data collection was conducted "in order to evaluate whether the available brain imaging data can be readily used to study this aspect of cognition", but it is quite hard to follow this sentence without information regarding the goal of the studyforrest project and without reading the recent publications of Hanke and Ibe. As a more minor suggestion, we think it could be useful to add some references justifying the cues used in this dataset (see . et al.

I have only minor recommendations:
Perhaps it is beyond the scope of a data note, but it may help the reader if the authors could expatiate on the various annotation categories. Specifically, the introduction could be expanded to say a few words as to why semantic conflict is interesting, why these particular dimensions were the ones chosen and what exactly each means in lay terms.
Second, the authors suggest in their introduction that there are insufficient semantic conflict events in Forrest Gump to be truly useful. Although I appreciate the candor, it's my opinion that we should first see what creative uses people can make of this annotation and the associated imaging dataset before we get too sullen!

I have read this submission. I believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.
Competing Interests: No competing interests were disclosed.