Keywords
Infant facial pain classification, infant cry detection, autoencoder, audio frequency features, deep learning.
ANN: Artificial Neural Networks
ASM: Active Shape Model
BCE: binary cross-entropy
CAE: convolutional autoencoder
FACS: Facial Action Coding System
FLACC: Face, Legs, Activity, Cry, Consolability
IFPaLVD: Infant FLACC Pain Level Video Dataset
LSTM: Long Short-Term Memory
MFCCs: Mel-Frequency Cepstrum Coefficients
MSE: Mean Squared Error
NIPS: Neonatal Infant Pain Scale
ROI: region of interest
SSIM: Structural Similarity
STFT: Short-Time Fourier Transform
VAE: Variational Autoencoder
Babies cannot communicate their feelings verbally; instead, they cry to signal hunger, discomfort (e.g., a wet diaper or itchy skin), and pain. Methods such as Face, Legs, Activity, Cry, Consolability (FLACC),1 Neonatal Infant Pain Scale (NIPS),2 Wong-Baker FACES Rating Scale,3 and others4 are used to detect the pain experienced by a baby. In these methods, the main factors for detecting pain are the baby’s facial expression, the sound of crying, and body movements. Most of these pain scores are subjective and show high inter-observer variability. Therefore, a tool that measures and classifies pain objectively is needed.
Hospitals generally have more patients than medical staff, so manually monitoring every patient is time consuming and difficult. Neonates are usually kept inside incubators, which muffle their voices and make it harder for staff to notice their discomfort. An automated artificial intelligence system that can identify a baby’s condition is therefore needed. Such a system could also be used to recognize and monitor the progression of an infant’s pain scale and to guide increases or reductions in pain medication. Automation would allow a doctor to monitor infant conditions remotely with input from the nurses. This could significantly improve health care services, especially in developing countries where doctors are not always on-site.
M. Petroni et al.5 used Artificial Neural Networks (ANN) to classify infant voices as expressing pain, fear, or anger. J. O. Garcia and C. A. R. Garcia6 also used infant voices, classifying normal versus pathological cries. That research used Mel-Frequency Cepstrum Coefficients (MFCCs)7 to encode the speech signals before forwarding them to the neural network. One study8 likewise used infant voices with an MFCC representation to predict respiratory conditions such as wheezing, asthma, and crackles. Besides pain classification, MFCCs have also been used to detect cries from audio features9; that work divided each audio sequence into 10-second segments. Instead of infant voices, other studies used facial expression features. K. Sikka et al.10 classified pain from facial expressions using the Facial Action Coding System (FACS). Y. Kristian et al.11 hand-crafted facial expression features using the Active Shape Model (ASM), building on the infant facial landmark detection and cry detection results of their previous research.12 These studies were later extended with a deep learning autoencoder method to build a cry detection and pain classification system.13 Deep learning extracts facial expression features automatically, aided by facial landmark detection as demonstrated by Kristian et al.11 J. Egede et al.13 used facial expression by combining hand-crafted and deep learning feature extraction. All these studies used facial expression as the only feature for classifying pain levels and detecting crying. P. Werner et al.14 used facial expression as a feature for one of their models, but also built models using biomedical signal features, combining them with a Random Forest algorithm.
The aim of this paper was to use both facial expression and the infant’s voice as features for classifying pain levels and detecting crying. Deep learning autoencoders were used to automatically extract both the facial and the audio features. Additionally, three feature sets were compared: facial expression only, voice only, and facial expression combined with voice.
The key contributions of this paper are:
1) Automated selection of three frames from each video in the dataset. Since not every frame contains an infant’s face, it was not feasible to use all frames; automated frame selection is also a step toward a fully automated system.
2) Autoencoders were constructed to build latent-vectors with a smaller dimension than the original data, yet still capable of representing it, for both facial and audio data.
3) Introduction of classifier models using Convolutional Neural Networks (CNN) for pain level classification and cry detection.
4) Combination of the best four pain classifier models using ensemble methods to produce a higher F1 score.
In this paper, facial expression and the infant’s voice were used to build a cry detection and pain classification system. From the video frames, face detection and face landmarks were applied to obtain the face region. From the audio, the wave signal was transformed into amplitude spectrograms and dB-scaled spectrograms using the Short-Time Fourier Transform (STFT). Autoencoders were used to produce latent-vectors capturing the features of the infant’s face and voice spectrograms. These latent-vectors were fed into CNN classifiers to predict crying condition and pain level. Four pain classifier models were combined into an Ensemble model. An overview of the system is shown in Figure 1.
The face region is extracted from the selected frames using face landmarks. The amplitude and dB-scaled spectrograms are obtained from the STFT process. These data are forwarded into autoencoders, resulting in five latent-vectors that serve as the input for the cry detection and pain classification system.
This was an observational study conducted at Dr. Soetomo General Hospital (Surabaya City, East Java Province, Indonesia). The study included 23 infants below 12 months of age who were treated at Dr. Soetomo General Hospital from November 2011 until December 2012, selected by consecutive sampling. The FLACC pain scale and recordings of each baby's cries were captured on video and then analyzed in software.
This study was approved by the Clinical Research Unit at Dr. Soetomo General Hospital (248/Panke.KKE/III/2016). Because of the nature of the research, the visual data documenting the facial expressions of observed infants will not be disclosed, in line with ethical considerations and regulations. Informed written consent was obtained from all patients’ parents, confirming that the results and any accompanying images, including direct or indirect identifiers, can be published.
Hanindito15 recorded 46 videos from 23 infants, obtained before and after surgery (mostly hernia surgery) for each infant. The videos were labeled using the FLACC measurement method at Dr. Soetomo General Hospital. From this scoring, all data were grouped into three classification labels: no pain, moderate pain, and severe pain (see Underlying data).16 In 2017, Kristian et al.11 added ten more videos using the same FLACC ground-truth labels and method.
For this research, these videos were divided into smaller 10-second segments to generate more data and to allow segments with high noise in either the video frames or the audio to be avoided. Some videos yielded only one segment, while others yielded more than ten, depending on the number of frames containing an extractable frontal face of the infant. After this process, 253 videos with an average duration of 10 seconds were obtained; 64 of them were discarded because they did not contain any extractable frontal-face frames. The resulting dataset is called the Infant FLACC Pain Level Video Dataset (IFPaLVD) and is available for download.17
Data augmentation was applied to the remaining 189 videos to increase the amount of data. The augmentation covered both images and audio. For image data, horizontal flipping and rotation were used, with rotation angles of -10°, 10°, -20°, 20°, and 15°. An example of this face augmentation is shown in Figure 2.
A 0.25-second time shift is applied during spectrogram conversion, so a duration of 9.75 seconds was used to represent each 10-second audio segment. Additional audio augmentation scaled the volume 2×, 3×, and 4× higher, or 2× and 3× lower. Each video thus generated 12 unique videos, resulting in a final dataset of 2,668 videos. Visually, the augmented audio did not look different from the original.
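As a rough sketch of this augmentation step, assuming OpenCV for the image operations and a normalized NumPy waveform for the audio (function names and the clipping range are illustrative, not taken from the published code):

```python
import cv2
import numpy as np

def augment_face(image):
    """Return flipped and rotated variants of a face frame (angles as listed in the text)."""
    h, w = image.shape[:2]
    variants = [cv2.flip(image, 1)]                         # horizontal flip
    for angle in (-10, 10, -20, 20, 15):
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        variants.append(cv2.warpAffine(image, M, (w, h)))   # rotate around the image center
    return variants

def augment_audio(wave):
    """Scale the waveform volume 2x, 3x, 4x up and 2x, 3x down, clipping to the valid range."""
    gains = (2.0, 3.0, 4.0, 0.5, 1.0 / 3.0)
    return [np.clip(wave * g, -1.0, 1.0) for g in gains]
```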
Each 10-second video had approximately 259 frames, depending on the frames per second (FPS) of the video. To save resources, three selected frames were used instead of all frames. The first frame was taken at the moment the baby cried the loudest, found by selecting the highest value in the audio samples and mapping that index to the corresponding video frame index. Two more frames were then chosen one second before and one second after the first selected frame. Because a chosen frame might not contain an extractable frontal face, each frame was evaluated from the selected starting point onward until a frame with an extractable frontal face was reached.
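The frame-selection rule can be sketched as follows; `has_frontal_face` stands in for the face detector described in the next subsection, and the mapping from audio sample index to video frame index assumes a known sample rate and FPS:

```python
import numpy as np

def select_frames(audio, sr, fps, frames, has_frontal_face):
    """Pick the frame at the loudest audio sample plus the frames one second before and after,
    sliding each choice forward until a frame with an extractable frontal face is found."""
    loudest_sample = int(np.argmax(np.abs(audio)))
    center = int(loudest_sample / sr * fps)                 # map audio sample index to frame index
    selected = []
    for candidate in (center, center - int(fps), center + int(fps)):
        idx = min(max(candidate, 0), len(frames) - 1)
        while idx < len(frames) and not has_frontal_face(frames[idx]):
            idx += 1                                        # advance until a frontal face is found
        if idx < len(frames):
            selected.append(idx)
    return selected
```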
After selecting the three desired frames, the non-face regions were discarded. The frames were converted to grayscale, and the region of interest (ROI), i.e., the face region, was selected in the grayscale images using the face detection model from a previous study.18 In this model, a rectangular area containing the baby’s face was selected, and facial landmarking was applied using the DLIB library, yielding 64 landmark points. These points were used to draw a filled convex polygon, and all values outside the polygon were eliminated. Finally, each frame was resized to 200×200 to give all data uniform dimensions.
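A sketch of this ROI step, using dlib's built-in frontal face detector as a stand-in for the detector of the cited study and a trained landmark predictor whose file path is a placeholder:

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("face_landmarks.dat")      # placeholder path for the landmark model

def extract_face_roi(frame, size=200):
    """Mask everything outside the convex hull of the landmarks, crop, and resize to size x size."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    points = np.array([[p.x, p.y] for p in shape.parts()], dtype=np.int32)
    mask = np.zeros_like(gray)
    cv2.fillConvexPoly(mask, cv2.convexHull(points), 255)   # keep only the face polygon
    masked = cv2.bitwise_and(gray, gray, mask=mask)
    x, y, w, h = cv2.boundingRect(points)
    return cv2.resize(masked[y:y + h, x:x + w], (size, size))
```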
Sounds are continuous analog signals whose waveforms can take any value.19 To digitize and homogenize an audio signal, computers sample it at a specific sample rate. Just as video has a variety of FPS values, audio has a variety of sample rates, which needed to be unified across the dataset. A sample rate of 22,050 Hz was used, so 9.75 seconds of audio produced an array of approximately 215,000 values representing the audio wave. Because little information can be extracted directly from this waveform, it was converted into a time-frequency representation using the STFT18 (Equation 1), which has two parameters: the hop counter (h) and the frequency index (f).
The STFT used a 2,048-sample “Hann” window with a hop of 512. The roughly 215,000 samples produced 420 hop iterations, each with a window size (N) of 2,048, resulting in a two-dimensional array of shape 1025×420. Because this array still contains complex numbers (i), an absolute-value function was applied to remove the complex component, giving the amplitude spectrogram. This amplitude spectrogram has many near-zero values, since babies cry within a certain frequency range; therefore only rows 50 to 350 were kept, for a final shape of 300×420. The amplitude spectrogram in its linear scale was then transformed into a power-log (dB) scale20 (Equation 2); the result is called the dB-scaled spectrogram. A value of 1.0 was used for the power reference (P0), and A represents the amplitude spectrogram.
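With librosa, this spectrogram pipeline can be reproduced roughly as follows; the file name is a placeholder, and the window, hop, row range, and dB reference follow the values stated above:

```python
import librosa
import numpy as np

wave, sr = librosa.load("infant_cry.wav", sr=22050, duration=9.75)   # placeholder file name

# STFT with a 2,048-sample Hann window and a hop of 512 samples
stft = librosa.stft(wave, n_fft=2048, hop_length=512, window="hann")

amplitude = np.abs(stft)            # drop the complex phase -> amplitude spectrogram (1025 x ~420)
amplitude = amplitude[50:350, :]    # keep the frequency rows covering infant cries (300 rows)

db_scaled = librosa.amplitude_to_db(amplitude, ref=1.0)   # power-log (dB) scale with P0 = 1.0
print(amplitude.shape, db_scaled.shape)
```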
In 1986, D. E. Rumelhart et al.21 introduced the autoencoder, a neural network consisting of two parts, an encoder and a decoder. The encoder maps the input to a latent-vector of smaller dimension, while the decoder maps the latent-vector back to the original input.21 Vincent et al.22 used deep autoencoders for image representation to denoise images, which inspired S. Gao et al.23 to use a deep autoencoder to extract features from human faces, with promising results.
The Variational Autoencoder (VAE) is a variation of the autoencoder proposed in 2013.24 The VAE can control the latent-vector distribution because its structure contains two latent-vectors (Figure 3). As shown in previous research,24 a VAE can generate new data from its latent-vector, for example building a face generation model that creates completely new face images. As shown in Figure 3, the two latent-vectors are combined into a single latent-vector using the reparameterization trick (Equation 3); this combination is important because it enables backpropagation through the VAE. In this equation, μ represents the mean (first latent-vector), σ the variance (second latent-vector), and ϵ a sample from a standard normal distribution. The VAE also adds a latent loss, usually called the KL divergence loss (Equation 4), to the reconstruction loss, with N representing the number of latent neurons.
These latent-vectors were combined using the reparameterization trick.
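A minimal sketch of the reparameterization trick (Equation 3) and the KL-divergence loss (Equation 4) in TensorFlow, assuming, as is common in VAE implementations, that the second latent-vector stores the log-variance; the exact published form may differ slightly:

```python
import tensorflow as tf

def reparameterize(mu, log_var):
    """Equation 3: z = mu + sigma * epsilon, with epsilon drawn from a standard normal."""
    epsilon = tf.random.normal(shape=tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * epsilon

def kl_divergence(mu, log_var):
    """Equation 4: KL divergence between N(mu, sigma^2) and the standard normal prior,
    summed over the latent neurons and averaged over the batch."""
    kl = -0.5 * tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
    return tf.reduce_mean(kl)
```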
Apart from autoencoders for image data, a 2017 study25 applied an autoencoder to denoise a single-channel audio source. The autoencoder used was a convolutional autoencoder (CAE), which attempted to recreate the magnitude spectrograms; a spectrogram is a time-frequency representation of an audio signal. Figure 4 shows an example of dB-scaled spectrograms from the present study’s data. That research indicates that using a CAE for audio denoising produces promising results.
Autoencoders are unsupervised neural network algorithms trained with backpropagation.26 The algorithm is unsupervised because it tries to produce an output that closely resembles its input, so the output and input share the same shape, which eliminates the need for data labels.
As the data are forwarded through several hidden layers, autoencoders reduce the data dimension. At a certain point, the layers start to increase the dimension again, until the last layer has the same size as the input. The main idea of autoencoders is to reduce the original data dimension without losing important details, so that the data can be reconstructed back into the original.27
Based on this explanation, autoencoders consist of four main parts. The first is the encoder, where the model keeps reducing the data dimension. The second is the bottleneck, or latent-vector, which is the encoder’s output and has the smallest size in the model. The third is the decoder, which tries to recreate the original data from the latent-vector. The last is the reconstruction loss, a function that measures the similarity between input and output.28 Of these parts, only the encoder and the latent-vector are ultimately needed; the decoder and the reconstruction loss exist to ensure that the encoder and latent-vector are good. In the end, only the encoder is used to obtain the latent-vector, which is forwarded into the classifier.
In this paper, autoencoders were used as the feature learning process for both face and audio data. Both CAE and VAE were used and their results compared, with and without a dense latent-vector. The autoencoders mostly consist of three kinds of layers: convolutional, max pooling, and upsampling. For models with a dense latent-vector, a fully connected layer was added. The Rectified Linear Unit (ReLU)29 (Equation 5) was used as the activation function for the convolutional layers, and the sigmoid function30 (Equation 6) for the fully connected layers; other activation functions did not improve performance.
Both the amplitude and dB-scaled spectrograms used the same structure but were trained separately. Although layers with varying parameters were used, all convolutional layers used a stride of one and the same padding, so they do not reduce the dimension; dimensionality reduction is handled by 2×2 max-pooling kernels to minimize detail loss. The structure of the best CAE model without a dense latent-vector is shown in Table 1 for the face autoencoder and Table 2 for the spectrogram autoencoder. Models with a dense latent-vector used a similar architecture with a different bottleneck part (yellow color): for the face autoencoder, the bottleneck was replaced with a fully connected layer of 1,250 neurons followed by a layer of 8,250 neurons; for the spectrogram autoencoder, the layers were unchanged, but a convolutional layer with ten 3×3 filters was added before the bottleneck and a fully connected layer of 7,875 neurons after it. The VAE models used the same architectures, with and without a dense latent-vector, except that the bottleneck parts were duplicated and an extra layer was added to combine the two latent-vectors via the reparameterization trick.
No. | Layer | No. | Layer |
---|---|---|---|
1 | Conv 3×3, 50 | 7 | Conv 3×3, 30 |
2 | Max Pool 2×2 | 8 | UpSampling 2×2 |
3 | Conv 3×3, 40 | 9 | Conv 3×3, 40 |
4 | Max Pool 2×2 | 10 | UpSampling 2×2 |
5 | Conv 3×3, 30 | 11 | Conv 3×3, 50 |
6 | Conv 3×3, 1 | 12 | Conv 3×3, 1 |
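The layer pattern in the table translates into Keras roughly as follows; the input shape is left as a parameter because the face and spectrogram autoencoders share the same structure with different dimensions, and the sigmoid output activation is an assumption based on the normalized input data rather than a published detail:

```python
from tensorflow.keras import layers, models

def build_cae(input_shape):
    """Convolutional autoencoder without a dense latent-vector: stride-1 'same' convolutions,
    2x2 max pooling for reduction, and 2x2 upsampling for reconstruction."""
    inputs = layers.Input(shape=input_shape)

    # Encoder
    x = layers.Conv2D(50, 3, padding="same", activation="relu")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(40, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(30, 3, padding="same", activation="relu")(x)
    latent = layers.Conv2D(1, 3, padding="same", activation="relu")(x)   # bottleneck feature map

    # Decoder
    x = layers.Conv2D(30, 3, padding="same", activation="relu")(latent)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(40, 3, padding="same", activation="relu")(x)
    x = layers.UpSampling2D(2)(x)
    x = layers.Conv2D(50, 3, padding="same", activation="relu")(x)
    outputs = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)

    autoencoder = models.Model(inputs, outputs)
    encoder = models.Model(inputs, latent)    # used later to produce the latent-vector
    return autoencoder, encoder
```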
Based on the autoencoder structures without a dense latent-vector, the latent features have a shape of 25×25×2 for the face data and 75×105×1 for the spectrogram data. This is the smallest shape that still gave a good reconstruction result. The next section shows that autoencoders without a dense latent-vector provide better reconstruction results.
Autoencoders need a loss function to measure their performance. The most common and simplest choice is the Mean Squared Error (MSE).31 However, MSE did not produce the best results in this study, so the Structural Similarity index (SSIM)32 was used as the loss function for the face autoencoder, and binary cross-entropy (BCE) instead of MSE for the spectrogram autoencoder. The benefits of SSIM and BCE over MSE are shown in Figures 5 and 6.
(a) Original spectrogram, (b) BCE reconstruction, and (c) MSE reconstruction. MSE can’t reconstruct the original data.
(a) Original image, (b) SSIM reconstruction, and (c) MSE reconstruction.
Equations 7 and 8 define the SSIM and BCE loss functions, with h representing the model prediction and y the desired output. For the SSIM loss function, k1 = 0.01 and k2 = 0.03 were used, and L represents the maximum value (which is one, since the images have already been normalized).
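A hedged sketch of these two loss functions in TensorFlow; `tf.image.ssim` accepts the k1, k2, and maximum-value constants directly, with the maximum value set to 1.0 for normalized images:

```python
import tensorflow as tf

def ssim_loss(y_true, y_pred):
    """1 - SSIM, with k1 = 0.01, k2 = 0.03 and a maximum value of 1.0 for normalized images."""
    ssim = tf.image.ssim(y_true, y_pred, max_val=1.0, k1=0.01, k2=0.03)
    return 1.0 - tf.reduce_mean(ssim)

bce_loss = tf.keras.losses.BinaryCrossentropy()             # used for the spectrogram autoencoders

# face_autoencoder.compile(optimizer="adam", loss=ssim_loss)
# spectrogram_autoencoder.compile(optimizer="adam", loss=bce_loss)
```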
In 2018, T. Baltrušaitis et al.33 summarized methods for multimodal data representation. At present, there are two main approaches for co-learning multimodal data at the same time: joint and coordinated representations. The basic concept of joint representation is to fuse the data by concatenating the individual features, whereas coordinated representation lets each modality be learned separately while remaining coordinated. Joint representation was used in the present study, as it is the most widely used method. However, limited labeled data was the main obstacle faced here; the solution was to use pre-trained representations from unsupervised learning, such as autoencoders, a method used previously in another study.34
In 2011, M. A. Nicolaou et al.35 used multimodal data without deep learning methods; since the features were hand-crafted, autoencoders were not needed. That study used the same joint representation approach along with three fusion methods: feature-level (used in the present study), model-level, and output-level fusion. Feature-level fusion combines the features so they can be predicted with one model, while model-level and output-level fusion predict each representation individually and combine the predictions with another model.36
In both studies by Y. Kristian et al.,11,12 a cry detection and pain classification system was created using the 46 videos recorded by E. Hanindito et al.15 plus ten additional videos of varying duration. These studies introduced an ASM for landmarking the infant’s face, built from the frontal faces in the videos, to extract hand-crafted geometric features. With these features, they achieved high accuracy for infant facial cry detection and pain classification.
In 2018, when deep learning was already widely used for feature learning, Y. Kristian et al.37 removed all hand-crafted feature extraction and switched to deep learning autoencoders. That research still used the same ASM from their previous work12 to crop the face region and discard unimportant data. The present paper replaces the ASM model with the face landmark algorithm of Kazemi et al.38 (Figure 7), which is supported by the DLIB library. The extracted features were forwarded into a Long Short-Term Memory (LSTM) model to classify pain and detect crying from a sequence of video frames, and high accuracy was reported for both tasks. The same DLIB-based face landmark algorithm is used in this paper.
Apart from studies using facial expression, several studies have classified infant pain using acoustic features of infant voices. In 1995, Petroni et al.5 classified infant cries using Mel-cepstrum coefficients and filter-bank energies with an ANN, targeting three classes: anger, fear, and pain. The results showed good accuracy for predicting anger and pain, but many prediction errors for the fear label.
In 2003, J. O. Garcia and C. A. R. Garcia6 set out to classify normal and pathological infant cries, achieving high accuracy with 506 samples across two target classes. Instead of the raw audio wave signal, they used MFCCs as the audio representation, which is similar to a spectrogram but adapted to resemble the human ear. A neural network produced the final prediction, and the study also included experiments with feature selection methods such as PCA.
In the present study, classifiers were built for both cry detection and pain classification. Of all the autoencoders tested, both CAE and VAE without a dense latent-vector performed better (results shown in the next section), so their latent-vectors were used as input to the classifiers. It was also observed that autoencoder models with similar reconstruction performance may produce completely different latent features, some of which yield good classification results while others do not. For this reason, a classifier consisting only of fully connected and dropout layers was not sufficient; at least one convolutional layer was needed to boost classifier performance. Dropout layers were added to prevent overfitting, given the small amount of data in this study.
In total, five latent-vectors from the autoencoders were used, though not all at once: models were built with either a single feature or a combination. There were five feature configurations for the cry detection and pain classification systems: face, dB-scaled, amplitude, face + dB-scaled, and face + amplitude. For models using more than one feature, the convolutional layers’ outputs were concatenated; the multimodal models therefore had four convolutional branches, three for the face data and one for the spectrogram, before the concatenation step.
The cry detection model has only two target classes, so a single output neuron with the sigmoid activation function was used. The pain classifier has three target classes and therefore needs three output neurons, with the softmax activation function39 (Equation 9) turning the neurons’ outputs into prediction probabilities for each class. Binary cross-entropy was used as the cost function for the detection system and categorical cross-entropy (Equation 10) for the pain classifier; in these equations, N represents the number of output neurons.
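A sketch of such a classifier head in Keras, following the description above (one convolutional branch per latent-vector, concatenation, dropout, then a sigmoid output for cry detection or a three-way softmax for pain level); the filter counts and dense layer size are illustrative, not the published values:

```python
from tensorflow.keras import layers, models

def build_classifier(face_shapes, spec_shape, n_classes=3):
    """Feature-level fusion: one convolutional branch per latent feature map, then concatenation."""
    inputs, branches = [], []
    for shape in list(face_shapes) + [spec_shape]:
        inp = layers.Input(shape=shape)
        x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)  # illustrative filter count
        branches.append(layers.Flatten()(x))
        inputs.append(inp)

    x = layers.Concatenate()(branches)
    x = layers.Dropout(0.5)(x)                       # dropout against overfitting on the small dataset
    x = layers.Dense(64, activation="relu")(x)       # illustrative size

    if n_classes == 2:
        outputs = layers.Dense(1, activation="sigmoid")(x)          # cry detection
        loss = "binary_crossentropy"
    else:
        outputs = layers.Dense(n_classes, activation="softmax")(x)  # pain level
        loss = "categorical_crossentropy"

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model

# e.g. three 25x25x2 face latent maps plus one 75x105x1 spectrogram latent map:
# pain_model = build_classifier([(25, 25, 2)] * 3, (75, 105, 1), n_classes=3)
```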
As this is a classification problem, accuracy was used as one performance measure. However, the most widely used metric for classification problems is the F1 score (Equation 11), the harmonic mean of precision (Equation 12) and recall (Equation 13). The final measurement is the macro F1 score, which is the average of the F1 scores of each class.
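The macro F1 score is simply the unweighted mean of the per-class F1 scores; with scikit-learn, for example:

```python
from sklearn.metrics import f1_score

# illustrative labels: 0 = no pain, 1 = moderate pain, 2 = severe pain
y_true = [0, 0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 1, 1]

macro_f1 = f1_score(y_true, y_pred, average="macro")   # unweighted mean of the per-class F1 scores
print(macro_f1)
```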
One model performed best for the cry detection system. For the pain classifier, however, no single model achieved a significantly high F1 score, so the four models with the highest F1 scores were used instead, each with different input features: face + dB-scaled, face + amplitude, dB-scaled only, and amplitude only. Rather than choosing a single best model, all four were used to predict the pain class through a majority-vote ensemble: all classifier results are tallied and the majority class label is taken as the final prediction. In the case of a tied vote, the class with the higher pain label is chosen; for example, if severe pain and moderate pain receive the same number of votes, severe pain is selected as the final prediction.
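A sketch of the majority-vote ensemble with the tie broken toward the higher pain label, as described here; the list of four trained classifiers and their per-model inputs is assumed, not shown:

```python
import numpy as np

def ensemble_predict(models, inputs_per_model):
    """Majority vote over the four pain classifiers; a tied vote goes to the higher pain label."""
    votes = [int(np.argmax(m.predict(x), axis=-1)[0]) for m, x in zip(models, inputs_per_model)]
    counts = np.bincount(votes, minlength=3)        # 0 = no pain, 1 = moderate, 2 = severe
    tied = [label for label, c in enumerate(counts) if c == counts.max()]
    return max(tied)                                # tie-break: choose the higher pain level
```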
This paper presents two major experimental parts. In the first, three autoencoder models were built: one for the infant’s face, one for the amplitude spectrogram, and one for the dB-scaled spectrogram of the infant’s voice. In the second, the latent-vectors from these autoencoders were used to build the cry detection and pain classification models.
In total, 20% of the 16,575 face images were held out as testing data, resulting in 3,093 samples. The results for the facial autoencoder can be seen in Table 3, and a sample of the image output in Figure 8. These results indicate that both CAE and VAE can recreate the data with a small SSIM loss value. From the reconstruction loss and the sample output, the best model was the CAE without a dense latent-vector.
Type | E | Weight param | SSIM loss (train) | SSIM loss (valid) |
---|---|---|---|---|
CAE | 25 | 46,373 | 0.052239 | 0.050319 |
CAE-d | 20 | 7,552,851 | 0.083503 | 0.090254 |
VAE | 80 | 46,555 | 0.123958 | 0.118260 |
VAE-d | 100 | 29,493,161 | 0.175157 | 0.177079 |
A dense latent-vector reduced the model performance for both CAE and VAE. Although the reduction was only slight and the original face was still reconstructed adequately, the dense latent-vector used many more weight parameters because of its fully connected layers. Therefore, autoencoders without a dense latent-vector were the best option in this case.
Even though CAE and VAE cannot be compared directly, as they are different methods, Figure 8 shows that the CAE gave a much better reconstruction result. It is assumed that the VAE tried to build a more general model and was therefore forced to miss small details.
Two spectrogram autoencoder models were built, one for the dB-scaled spectrogram and one for the amplitude spectrogram, using the same structure, loss function, and training/testing split for both. The testing set contained 506 samples, 20% of the total of 2,530 short videos. Using a dense latent-vector in the spectrogram autoencoder produced the same pattern of results as for the face autoencoder (results not shown). The results of the spectrogram CAEs and VAEs are given in Table 4, with output samples in Figure 9 for the dB-scaled spectrogram and Figure 10 for the amplitude spectrogram.
Type | E | Weight param | BCE loss (train) | BCE loss (valid) |
---|---|---|---|---|
dB-scaled spectrogram | | | | |
CAE | 10 | 59,282 | 0.089058 | 0.095361 |
VAE | 20 | 59,553 | 0.114446 | 0.114446 |
Amplitude spectrogram | | | | |
CAE | 20 | 59,282 | 0.004835 | 0.005037 |
VAE | 20 | 59,553 | 0.007848 | 0.0075736 |
The output samples show that the VAE reconstructions were poor for both spectrograms. Based on Figure 10, the amplitude spectrogram clearly cannot be reconstructed with the VAE. For the dB-scaled spectrogram, the reconstruction looks similar to the original, but small differences in a spectrogram can correspond to large differences in the audio signal. Therefore, the classifier was not tested with VAE latent-vectors for the spectrograms. As before, it is assumed that the VAEs were unable to recreate the original data exactly because of their tendency to generalize.
Two training/testing data distributions were tested. In the first, seven original videos (three pre-surgery and four post-surgery) were held out to prevent overfitting of the system, giving 1,836 training and 432 testing samples (Table 5). In the second, the data order was shuffled and 25% of the data was held out for testing, giving 1,701 training and 567 testing samples; details are in Table 6. The second distribution was included to provide a comparison with the first.
Labels | | Training | Testing |
---|---|---|---|
Pain | Severe | 276 | 132 |
 | Moderate | 492 | 132 |
 | No pain | 1068 | 168 |
Cry | Yes | 744 | 240 |
 | No | 1092 | 192 |
Several feature combinations were tested with both data distributions (Table 7). The results show that the second distribution gave much higher scores for all parameters, possibly because the high similarity between its training and testing data makes it unable to detect an overfitted model.
Based on the results for the first data distribution, facial expression was not a relevant feature for cry prediction, although it gave much better results on the pain problem; this indicates that more data is required to avoid overfitting. Face features extracted with the CAE gave higher results, a finding also supported by the multimodal results, which improved when CAE facial feature extraction was used. In contrast, the voice features, especially the dB-scaled spectrogram, produced a much higher F1 score and accuracy than the facial features, suggesting that the voice feature is more relevant and needs less data. The dB-scaled spectrogram outperformed the amplitude spectrogram, although the amplitude spectrogram also outperformed the facial features. The best result overall came from the multimodal feature that combined the dB-scaled spectrogram as the audio representation with CAE facial feature extraction; the multimodal feature slightly increased accuracy at the cost of more resources.
According to these results, multimodal features are recommended when sufficient resources exist; with limited resources, using only the voice feature still produces an acceptable result. The confusion matrices for the multimodal and voice-only models using the dB-scaled spectrogram representation are provided in Table 8.
The results for the pain classification problem are shown in Table 9. Most configurations gave very high accuracy and F1 scores on the second distribution, even when the first distribution gave very low results, so the best method and feature set were selected based on the first distribution. The highest result came from the model using the multimodal feature with the CAE face autoencoder and the dB-scaled spectrogram as the audio representation. However, it was only slightly better than the other configurations and not strong enough to be called the best pain classification model on its own. Therefore, the top four models ranked by F1 score were combined. Several ensembling methods were tested, and the best result came from majority voting with ties broken by the sum of the four models’ class probabilities. This ensemble model gave better accuracy and F1 score; its confusion matrix is shown in Table 10, and an input/output example is provided in Figure 11.
 | No pain | Moderate | Severe |
---|---|---|---|
No pain | 153 | 11 | 4 |
Moderate | 7 | 104 | 21 |
Severe | 5 | 57 | 70 |
In this paper, several methods for building deep learning models for pain and cry detection were tested. The voice feature proved more relevant than facial expression for pain detection, but combining both features achieved a higher score. Based on this finding, multimodal representations that also include physiological vital-sign parameters should be considered in future work: severe stimuli produce changes in vital signs such as increased heart rate, altered breathing patterns, and changes in oxygenation, in addition to changes in facial expression and cry patterns.
The limited amount of data may introduce bias; nevertheless, relatively good results were achieved with the proposed method. Increasing the amount of data would significantly improve performance further.
A patient monitoring system is very important, as most pain scores are subjective. These findings support hospital monitoring systems and increase health care providers’ awareness of infant pain; if pain persists, the pain medication may need to be modified.
This study has introduced a method for cry and pain detection using the infant’s facial expression and voice, and provided an alternative for cases with limited resources. The infant’s voice proved to be a more relevant feature than facial expression, but combining the two features gives a better result.
For feature extraction, CAE was shown to perform better than VAE for both facial expression and voice spectrograms, consistent with VAE being geared more toward generating new data than extracting features. Moreover, using a dense latent-vector in the autoencoder increased the reconstruction loss.
Figshare: Ensemble of multimodal deep learning autoencoder for infant cry and pain detection. https://doi.org/10.6084/m9.figshare.16910299.v1.16
This project contains the following underlying data:
The data file contains the labels for the video dataset for multimodal infant pain and cry classification.
Data are available under the terms of the Creative Commons Zero “No rights reserved” data waiver (CC0 4.0 Public domain dedication).
Conceptualization, Y.K. and M.T.A.S.; methodology, Y.K. and M.T.A.S.; software, Y.K. and N.S.; validation, N.S., Y.K. and M.T.A.S.; formal analysis, N.S. and Y.K.; investigation, N.S., Y.K., M.T.A.S. and E.H.; resources, N.S., E.H. and Y.K.; writing—original draft preparation, N.S., Y.K. and M.T.A.S.; writing—review and editing, N.S., Y.K. and M.T.A.S.; visualization, N.S. and Y.K.; supervision, Y.K., M.T.A.S. and E.H.; project administration, N.S., Y.K. and M.T.A.S.; funding acquisition, M.T.A.S. All authors have read and agreed to the published version of the manuscript.