Affective rating of audio and video clips using the EmojiGrid

Background: In this study we measured the affective appraisal of sounds and video clips using a newly developed graphical self-report tool: the EmojiGrid. The EmojiGrid is a square grid, labeled with emoji that express different degrees of valence and arousal. Users rate the valence and arousal of a given stimulus by simply clicking on the grid. Methods: In Experiment I, observers (N=150, 74 males, mean age=25.2±3.5) used the EmojiGrid to rate their affective appraisal of 77 validated sound clips from nine different semantic categories, covering a large area of the affective space. In Experiment II, observers (N=60, 32 males, mean age=24.5±3.3) used the EmojiGrid to rate their affective appraisal of 50 validated film fragments varying in positive and negative affect (20 positive, 20 negative, 10 neutral). Results: For both sound and video, the agreement between the mean ratings obtained with the EmojiGrid and those obtained in previous validation studies with an alternative, validated affective rating tool is excellent for valence and good for arousal. Our results also show the typical U-shaped relation between mean valence and arousal that is commonly observed for affective sensory stimuli, both for sound and video. Conclusions: We conclude that the EmojiGrid can be used as an affective self-report tool for the assessment of sound- and video-evoked emotions.


Introduction
In daily human life, visual and auditory input from our environment significantly determines our feelings, behavior and evaluations. Assessing the affective experiences evoked by products, media and environments is therefore an essential part of their design and evaluation, and requires efficient methods to establish whether the desired experiences are indeed achieved. A wide range of physiological, behavioral and cognitive measures is currently available to measure the affective response to sensory stimuli, each with its own advantages and disadvantages (for a review see: Kaneko et al., 2018a). The most practical and widely used instruments to measure affective responses are questionnaires and rating scales. However, their application is typically time-consuming and requires a significant amount of mental effort (people typically find it difficult to name their emotions, especially mixed or complex ones), which affects the experience itself and restricts repeated application. While verbal rating scales are typically more efficient than questionnaires, they still require mental effort, since users have to relate their affective state to verbal descriptions (labels). Graphical rating tools, in contrast, allow users to intuitively project their feelings onto figural elements that correspond to their current affective state.
Arousal and pleasantness (valence) are the principal dimensions of affective responses to environmental stimuli (Mehrabian & Russell, 1974). A popular graphical affective self-report tool is the Self-Assessment Manikin (SAM) (Bradley & Lang, 1994): a set of iconic humanoid figures representing different degrees of valence, arousal, and dominance. Users respond by selecting from each of the three scales the figure that best expresses their own feeling. The SAM has previously been used for the affective rating of video fragments (e.g., Bos et al.).
Emoji-based rating tools are increasingly popular as self-report instruments (Kaye et al., 2017), for instance to measure user and consumer experience (e.g., www.emojiscore.com). Since facial expressions can communicate a wide variety of both basic and complex emotions, emoji-based self-report tools may also afford the measurement and expression of mixed (complex) emotions that are otherwise hard to verbalize (Elder, 2018). However, while facial images and emoji are processed in a largely equivalent manner, suggesting that some non-verbal aspects of emoji are processed automatically, further research is required to establish whether they are also emotionally appraised on an implicit level (Kaye et al., 2021).

Amendments from Version 1
We added a concise review of the literature about the emotional affordances of emoji to the Introduction section. In the Data Analysis section, we now explain how the EmojiGrid data were scaled. The graphs in the Results section now represent datapoints by the identifiers of the corresponding stimuli, to allow the visual assessment, comparison and verification of the emotions induced by the different affective stimuli. We also added correlation plots for the mean valence and arousal ratings obtained with both the SAM and the EmojiGrid to enable a direct comparison within both of these affective dimensions. In addition, we uploaded a new set of Excel notebooks to the Open Science Framework that include all graphs, together with a brief description of the nature and content of all stimuli, their original affective classification, and their mean valence and arousal values (1) as provided by the authors of the (sound and video) databases and (2) as measured in this study. We extended the Discussion section with some limitations of this study, such as ways to measure mixed emotions, and the fact that the comparison of the SAM and EmojiGrid ratings was based on ratings from different populations.

The EmojiGrid enables users to rate the valence and arousal of a given stimulus by simply clicking on the grid. It has been found that the use of emoji as scale anchors facilitates affective over cognitive responses. This study evaluates the EmojiGrid as a self-report tool for the affective appraisal of auditory and visual events. In two experiments, participants were presented with different sound and video clips, covering both a large part of the valence scale and a wide range of semantic categories. The video clips were stripped of their sound channel (silent) to avoid interaction effects. After perceiving each stimulus, participants reported their affective appraisal (valence and arousal) using the EmojiGrid. The sound samples (Yang et al., 2018) and video clips (Aguado et al., 2018) had been validated in previous studies using 9-point SAM affective rating scales. This enables an evaluation of the EmojiGrid by directly comparing the mean affective ratings obtained with it to those that were obtained with the SAM.
In this study we also investigate how the mean valence and arousal ratings for the different stimuli are related. Although the relation between valence and arousal for affective stimuli varies between individuals and cultures (Kuppens et al., 2017), it typically shows a quadratic (U-shaped) form across participants (i.e., at the group level): stimuli that are on average rated either high or low on valence are typically also rated as more arousing than stimuli that are on average rated near neutral on valence (Kuppens et al., 2013; Mattek et al., 2017). For the valence and arousal ratings obtained with the EmojiGrid, we therefore also investigate to what extent a quadratic form describes their relation at the group level.
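As an illustrative formulation (our notation; the original paper gives no explicit equation), this group-level relation amounts to a quadratic regression of mean arousal A on mean valence V:

\[ A \approx \beta_0 + \beta_1 V + \beta_2 V^2, \qquad \beta_2 > 0, \]

so that arousal is lowest near neutral valence (at V = -\beta_1 / (2\beta_2)) and increases toward both valence extremes.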

Participants
English-speaking participants from the UK were recruited via the Prolific database (https://www.prolific.co/). Exclusion criteria were age outside the range of 18-35 years and hearing or (color) vision deficiencies. No further attempts were made to eliminate sampling bias.
We estimated the sample size required for this study with the "ICC.Sample.Size" R package, assuming an ICC of 0.70 (generally considered 'substantial': Landis & Koch, 1977), and determined that sample sizes of 57 (Experiment I) and 23 (Experiment II) would yield a 95% confidence interval of sufficient precision (±0.07). Because the current experiment was run online and not in a well-controlled laboratory environment, we aimed to recruit about 2-3 times the minimum required number of participants.
This study was approved by the TNO Ethics Committee (Application nr: 2019-012), and was conducted in accordance with the Helsinki Declaration of 1975, as revised in 2013 (World Medical Association, 2013). Participants electronically signed an informed consent by clicking "I agree to participate in this study", affirming that they were at least 18 years old and participated voluntarily. The participants received a small financial compensation for their participation.

Measures
Demographics. The participants in this study reported their nationality, gender and age.
Valence and arousal: the EmojiGrid. The EmojiGrid is a square grid (similar to the Affect Grid: Russell et al., 1989), labeled with emoji that express various degrees of valence and arousal (Figure 1). Users rate their affective appraisal (i.e., the valence and arousal) of a given stimulus by pointing and clicking at the location on the grid that best represents their impression. The EmojiGrid was originally developed and validated for the affective appraisal of food stimuli, since the SAM appeared to be frequently misunderstood in that context (Toet et al., 2018).

Procedure
The instructions asked the participants to perform the survey on a computer or tablet (but not on a device with a small screen, such as a smartphone) and to activate the full-screen mode of their browser. This served to maximize the resolution of the questionnaire and to prevent distractions from other programs running in the background. In Experiment I (sounds), the participants were asked to turn off any potentially disturbing sound sources in their room. The participants were then informed that they would be presented with a given number of stimuli (sounds in Experiment I and video clips in Experiment II) and were asked to rate their affective appraisal of each stimulus. The instructions also mentioned that it was important to respond seriously, and that there were no correct or incorrect answers. Participants then electronically signed an informed consent: by clicking "I agree to participate in this study", they confirmed that they were at least 18 years old and that their participation was voluntary. The survey continued with an assessment of the demographic variables (nationality, gender, age).
Next, the participants were familiarized with the EmojiGrid. First, it was explained how the tool could be used to rate valence and arousal for each stimulus. The instructions were: "To respond, first place the cursor inside the grid on a position that best represents how you feel about the stimulus, and then click the mouse button." Note that the dimensions of valence and arousal were not mentioned here. The participants then performed two practice trials. In Experiment I, these practice trials allowed repeated playing of the sound stimulus, so that the participants could adjust the sound level of their computer system. The actual experiment started immediately after the practice trials. The stimuli were presented in random order. The participants rated each stimulus by clicking at the appropriate location on the EmojiGrid, after which the next stimulus appeared immediately. There were no time restrictions. On average, each experiment lasted about 15 minutes.

Experiment I: Sounds
This experiment served to validate the EmojiGrid as a rating tool for the affective appraisal of sound-evoked emotions. To this end, participants used the EmojiGrid to rate valence and arousal for a selection of sounds from a validated sound database; the results are compared with the corresponding SAM ratings provided for each sound in the database.
Stimuli. The stimuli were 77 sound clips from nine different semantic categories, selected from the validated IADS-E database (Yang et al., 2018). The selection was such that the mean affective (valence and arousal) ratings provided for stimuli in the same semantic category were maximally distributed over the two-dimensional affective space, ranging from very negative (e.g., a car horn, hurricane sounds, or sounds of vomiting), via neutral (e.g., people walking up a stairs), to very positive (e.g., music). As a result, the stimulus set is a representative cross-section of the IADS-E, covering a large area of the affective space. All sound clips had a fixed duration of 6 s. The exact composition of the stimulus set is provided in the Supplementary Material. Each participant rated all sound clips.

Participants.
A total of 150 participants (74 males, 76 females) participated in this experiment. All participants were UK nationals. Their mean age was 25.2 (SD= 3.5) years.

Experiment II: Videos
This experiment served to validate the EmojiGrid as a rating tool for the affective appraisal of video-evoked emotions. The stimuli were 50 validated film fragments varying in positive and negative affect (20 positive, 20 negative, 10 neutral; Aguado et al., 2018), presented without their sound channel. Each participant rated all video clips.

Participants.
A total of 60 participants (32 males, 28 females) participated in this experiment. All participants were UK nationals. Their mean age was 24.5 (SD= 3.3) years.

Data analysis
The response data (i.e., the horizontal or valence and vertical or arousal coordinates of the check marks on the EmojiGrid) were quantified as integers between 0 and 550 (the size of the square EmojiGrid in pixels), and then scaled between 1 and 9 for comparison with the results of Yang et al. (2018) and Aguado et al. (2018). The agreement between the mean ratings obtained with the EmojiGrid and with the SAM was quantified by intraclass correlation coefficients (ICC) with their 95% confidence intervals (Landis & Koch, 1977). For all other analyses a probability level of p < 0.05 was considered to be statistically significant.
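As a minimal sketch of this scaling (in R; the helper name is ours, not from the paper):

```r
# Map EmojiGrid click coordinates (integers in 0..550 pixels) linearly
# onto the 9-point range (1..9) used in the SAM validation studies.
to_nine_point <- function(px, grid_size = 550) {
  1 + 8 * (px / grid_size)
}

to_nine_point(c(0, 275, 550))  # -> 1, 5, 9
```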
MATLAB 2020a was used to further analyze the data. The mean valence and arousal responses were computed over all participants for each stimulus. MATLAB's Curve Fitting Toolbox (version 3.5.7) was used to compute least-squares fits to the data points, and adjusted R-squared values were calculated to quantify the agreement between the data and the curve fits. Figure 2 shows the correlation plots between the mean valence and arousal ratings for the 77 affective IADS-E sounds used in the current study, obtained with the EmojiGrid (this study) and with a 9-point SAM scale (Yang et al., 2018). This figure illustrates the overall agreement between the affective ratings obtained with both self-assessment tools for affective sound stimuli.
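The least-squares fits mentioned above were computed with MATLAB's Curve Fitting Toolbox; the sketch below shows an equivalent quadratic fit in R, using hypothetical per-stimulus mean ratings (the values are illustrative, not data from this study):

```r
# Hypothetical per-stimulus mean ratings on the 1-9 scale (illustration only).
valence <- c(1.8, 2.6, 3.4, 4.5, 5.2, 6.3, 7.1, 8.0)
arousal <- c(6.8, 6.0, 5.1, 4.4, 4.6, 5.0, 5.9, 6.7)

# Least-squares quadratic (U-shaped) fit of mean arousal on mean valence.
fit <- lm(arousal ~ poly(valence, 2, raw = TRUE))

# Adjusted R-squared quantifies the agreement between data and curve fit.
summary(fit)$adj.r.squared
```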

Experiment I
The linear (two-tailed) Pearson correlation coefficients between the valence and arousal ratings obtained with the EmojiGrid (present study) and with the SAM (Yang et al., 2018) were, respectively, 0.881 and 0.760 (p<0.001). To further quantify the agreement between both rating tools we computed intraclass correlation coefficients (ICC) with their 95% confidence intervals for the mean valence and arousal ratings between both studies. The ICC value for valence is 0.936 [0.899-0.959] and the ICC for arousal is 0.793 [0.674-0.868], indicating excellent agreement for valence and good agreement for arousal (even though the current study was performed via the internet and therefore did not provide the amount of control over experimental factors that a lab experiment would). Figure 3 shows the relation between the mean valence and arousal ratings for the 77 sound clips tested; the curves in this figure represent quadratic fits to the data points. Hence, both methods yield a relation between mean valence and arousal ratings that can indeed be described by a quadratic (U-shaped) relation at the nomothetic (group) level. Figure 4 shows the correlation plots between the mean valence and arousal ratings for the 50 affective video clips used in the current study, obtained with the EmojiGrid (this study) and with a 9-point SAM scale (Aguado et al., 2018). This figure illustrates the overall agreement between the affective ratings obtained with both self-assessment tools for affective video stimuli.

Experiment II
The linear (two-tailed) Pearson correlation coefficients between the valence and arousal ratings obtained with the EmojiGrid (present study) and with the SAM (Aguado et al., 2018) were, respectively, 0.963 and 0.624 (p<0.001). To further quantify the agreement between both rating tools we computed intraclass correlation coefficients (ICC) with their 95% confidence intervals for the mean valence and arousal ratings between both studies. The ICC value for valence is 0.981 [0.967-0.989] and the ICC for arousal is 0.721 [0.509-0.842], indicating excellent agreement for valence and good agreement for arousal. Figure 5 shows the relation between the mean valence and arousal ratings for the 50 video clips tested. The curves in this figure represent quadratic fits to the data points; the adjusted R-squared values are, respectively, 0.68 and 0.78. Hence, both methods yield a relation between mean valence and arousal ratings that can be described by a quadratic (U-shaped) relation at the nomothetic (group) level.
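The agreement statistics reported for both experiments can be computed along the following lines. This is a hedged R sketch using the irr package and hypothetical per-stimulus means; the paper does not state which ICC variant was used, so a two-way, single-measure, absolute-agreement ICC is shown for illustration:

```r
library(irr)  # provides icc(); install.packages("irr") if needed

# Hypothetical per-stimulus mean valence (or arousal) ratings (1-9 scale)
# from the two rating tools; not data from this study.
emoji <- c(2.1, 3.0, 4.2, 5.1, 6.4, 7.3, 8.1)
sam   <- c(1.9, 3.3, 4.0, 5.4, 6.1, 7.6, 7.9)

# Intraclass correlation with its 95% confidence interval.
icc(cbind(emoji, sam), model = "twoway", type = "agreement", unit = "single")

# Two-tailed Pearson correlation, as reported in the text.
cor.test(emoji, sam)
```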
Raw data from each experiment are available as Underlying data (Toet, 2020).

Conclusion
In this study we evaluated the recently developed EmojiGrid self-report tool for the affective rating of sounds and video.
In two experiments, observers rated their affective appraisal of sound and video clips using the EmojiGrid. The results show a close correspondence between the mean ratings obtained with the EmojiGrid and those obtained with the validated SAM tool in previous studies: the agreement is excellent for valence and good for arousal, both for sound and video. Also, for both sound and video, the EmojiGrid yields the typical U-shaped (quadratic) relation between mean valence and arousal that is commonly observed for affective sensory stimuli. We conclude that the EmojiGrid is an efficient affective self-report tool for the assessment of sound- and video-evoked emotions.
A limitation of the EmojiGrid is the fact that it is based on the circumplex model of affect, which posits that positive and negative feelings are mutually exclusive (Russell, 1980). Hence, in its present form, and similar to other affective self-report tools like the SAM or VAS scales, the EmojiGrid only allows the measurement of a single emotion at a time. However, emotions are not strictly bipolar, and two or more emotions of the same or opposite valence can co-occur (Larsen & McGraw, 2014; Larsen et al., 2001). Mixed emotions consisting of opposite feelings could in principle be registered with the EmojiGrid by allowing participants to enter multiple responses.
Another limitation of this study is that the comparison of the SAM and EmojiGrid ratings was based on ratings from different populations (akin to a comparison of two independent samples). Hence, our current regression estimates are optimized for the particular samples that were used. Future studies should investigate a design in which the same participants use both self-report tools to rate the same set of stimuli. Sensiks (www.sensiks.com) has adopted a simplified version of the EmojiGrid in its Sensory Reality Pod to enable the user to select and tune multisensory (visual, auditory, tactile and olfactory) affective experiences.
