Ensemble of multimodal deep learning autoencoder for infant cry and pain detection [version 2; peer review: 2 approved with reservations]

Background: Babies cannot communicate their pain properly. Several pain scores have been developed, but they are subjective and show high inter-observer variability. The aim of this study was to construct models that use both facial expression and infant voice to classify pain levels and detect cries. Methods: The study included a total of 23 infants below 12 months of age who were treated at Dr Soetomo General Hospital. The Face, Legs, Activity, Cry and Consolability (FLACC) pain scale was scored and the babies' cries were recorded in video format. A machine-learning-based system was created to detect infant cries and pain levels. Spectrograms computed with the Short-Time Fourier Transform were used to convert the audio data into a time-frequency representation. Facial features combined with voice features, both extracted using deep learning autoencoders, were used for the classification of infant pain levels. Two types of autoencoders: Convolutional Autoencoder and Variational


Background
Babies cannot communicate their feelings properly; as such, they cry to show hunger, discomfort (wet diaper and itchy skin), and pain. Methods such as Face, Legs, Activity, Cry, Consolability (FLACC), 1 Neonatal Infant Pain Scale (NIPS), 2 Wong-Baker FACES Rating Scale, 3 and others 4 are used to detect pain experienced by a baby. Based on these methods, the main factors for detecting pain are the baby's facial expression, sound of crying, and body movements. Most of the pain scores are subjective and show high inter-observer variability. Therefore, there is a need for a tool that measures and classifies pain objectively. Hospitals usually have more patients than medical staff, so manually monitoring all patients is time consuming and difficult. Neonates are normally kept inside incubators, which can reduce the volume of their voice and make it difficult for the medical staff to notice their discomfort. Therefore, an automated artificial intelligence system that can identify the baby's condition is needed. It can also be used to monitor the progress of the infant's pain score and to guide increases or reductions in pain medication. Automation will allow a doctor to monitor infant conditions remotely with input from the nurses. Such a system could significantly improve health care services, especially in developing countries where doctors are not always on-site.
M. Petroni et al. 5 used Artificial Neural Networks (ANN) to classify infant voices and assess whether the infants were in pain, fear, or anger. J. O. Garcia and C. A. R. Garcia 6 also used infant voices to classify a cry as normal or pathological. That research used Mel-Frequency Cepstrum Coefficients (MFCCs) 7 to encode speech signals before the signals were forwarded into the neural network. One study 8 also used infant voices with MFCC representation to predict respiratory diseases such as wheeze, asthma, and crackles. Besides pain classification, MFCC has also been used to detect cries using audio features. 9 That research divided an audio sequence into segments of 10 seconds. Instead of using infant voices, some studies used facial expression features. K. Sikka et al. 10 used facial expression features to classify pain by using a Facial Action Coding System (FACS). Y. Kristian et al. 11 tried hand-crafted extraction of facial expression features by using the Active Shape Model (ASM), based on the infant facial landmark detection and cry detection results from their previous research. 12 These studies were further developed by using the deep learning autoencoder method to build a cry detection and pain classifier system. 13
In 1986, D. E. Rumelhart et al. 14 introduced a method called the autoencoder. An autoencoder is a neural network consisting of two parts, an encoder and a decoder. The encoder maps the input to a latent-vector with a smaller dimension, while the decoder maps the latent-vector back into the original input. 14 An autoencoder for image representation with deep networks was used by Vincent et al. 15 for denoising images. That research inspired S. Gao et al. 16 to use a deep autoencoder for extracting features from human faces, which showed promising results.

REVISED Amendments from Version 1
We have rearranged our manuscript structure to strengthen the paper's contribution and organization. We have added a new table entitled "comparison with other related works". We highlight the novelty and contributions of this paper more clearly.
Any further responses from the reviewers can be found at the end of the article
Deep learning extracts facial expression features automatically with the help of facial landmark detection, as demonstrated by Kristian et al. 11 J. Egede et al. 13 utilized facial expression by combining hand-crafted feature extraction with deep learning feature extraction. All these studies used facial expression as the only feature for classifying pain levels and detecting cries. P. Werner et al. 17 used facial expression as a feature for one of their models. However, that research built other models from biomedical signal features and applied the Random Forest algorithm to combine these features.
The aim of this paper was to use both facial expression and infant voice as features for classifying pain levels and detecting cries. To this end, deep learning autoencoders were used to extract features automatically from both facial and audio data. Additionally, the results for three feature sets (facial expression only, voice only, and both facial expression and voice) were compared in this study.
The key contributions of this paper are:
1) Automated selection of three frames from each video in the dataset. Although many frames contain an infant's face, it was not feasible to use all of them. Automated frame selection is also a step towards building an automated system.
2) Autoencoders were constructed to build a latent-vector with a smaller dimension than the original dimension but capable of representing the original data for both facial and audio data.
3) Introduction of the classifier models using the Convolutional Neural Network (CNN) for the classification of pain level and cry detection.
4) Combination of the best four pain classifier models using an ensemble method to produce a higher F1 score.

Study setting and participants
This observational study was conducted at Dr. Soetomo General Hospital (Surabaya City, East Java Province, Indonesia). The study included 23 infants below 12 months of age who were treated for hernia at Dr Soetomo General Hospital from November 2011 until December 2012. These infants were selected using consecutive sampling. The FLACC pain scale was scored, and audio-visual recordings of the babies' cries were taken in video format and then analyzed using the software.

Ethics approval and consent
This study was approved by the Clinical Research Unit at Dr. Soetomo General Hospital (248/Panke.KKE/III/2016). Due to the nature of the research, the visual data documenting facial expressions of the observed infants will not be disclosed, in line with ethical considerations and regulations. Informed written consent was obtained from the parents of all patients, confirming that the results and any accompanying images, including direct or indirect identifiers, can be published.

Dataset
Hanindito 18 took 46 videos of 23 infants. The videos were obtained before and after surgery (mostly hernia surgery) for each infant. The videos were labeled using the FLACC measurement method at Dr Soetomo General Hospital. From this scoring, all the data were grouped into three classification labels: no pain, moderate pain, and severe pain (see Underlying data). 19 In 2017, Kristian et al. 11 added ten more videos using the same FLACC ground truth labels and method.
For this research, these videos were divided into smaller segments of 10 seconds to generate more data. This also made it possible to avoid segments with high noise in either the video frames or the audio wave. Some videos could only be divided into one segment, while others yielded more than ten segments, depending on the number of frames that contained an extractable infant frontal face. After this process, 253 videos with an average duration of 10 seconds were obtained. Unfortunately, 64 videos were discarded as they did not contain any extractable frontal face frames. This dataset is called the Infant FLACC Pain Level Video Dataset (IFPaLVD), which can be downloaded. 20 Data augmentation was applied to the remaining 189 videos to increase the amount of data.
The augmentation process included both image and audio augmentation. For image data, horizontal flip and image rotation were used. The angles used for image rotation were -10°, 10°, -20°, 20°, and 15°. An example of the output of this face augmentation can be seen in Figure 1.
Time shifting (0.25 seconds) was applied during spectrogram conversion. Therefore, a duration of 9.75 seconds was used for the audio data from each 10-second segment. Further audio augmentation consisted of tuning the volume 2×, 3×, and 4× higher or 2× and 3× lower. Each video was therefore expanded into 12 unique videos, so that the final data consisted of 2,268 videos (189 × 12). Visually, the augmentation result for the audio data did not look any different.

Experimental setup
In this paper, facial expression and the infant's voice were used to build a cry detection and pain classifier system. From the video frames, face detection and face landmarks were applied to obtain the face region. From the audio frames, the wave signal was transformed into amplitude spectrograms and dB-scaled spectrograms by using the Short-Time Fourier Transform (STFT). To obtain features from the infant's face and voice spectrograms, autoencoders were used to produce the latent-vectors. These latent-vectors were passed to CNN classifiers to predict the crying condition and the pain level. Four pain classifier models were combined to form an ensemble model. An overview of the system is shown in Figure 2.
The face region is extracted from the selected frames using face landmarks. The amplitude and dB-scaled spectrograms are obtained from the STFT process. These data are forwarded into autoencoders, resulting in five latent-vectors that are the input for the cry detection and pain classifier system.

Frame selection and preprocessing
Every 10-second video had approximately 259 frames, depending on the Frames Per Second (FPS) of each video. To save resources, three selected frames were used instead of all the frames. The first frame was taken at the time at which the baby cried the loudest. It was found by locating the highest value in the audio samples and mapping that audio index to the corresponding video frame index. Two more frames were then chosen, one second before and one second after the first selected frame. However, a chosen frame might not contain an extractable frontal face. Therefore, with the selected frame as the starting point, each frame was evaluated until a frame with an extractable frontal face was reached. After the three frames were selected, the remaining frames, which included the non-face regions, were discarded. Since grayscale images were used in this process, the frames were first converted from color to grayscale. The region of interest (ROI), which is the face region, was then selected in the grayscale images. This process used the face detection model from a previous study. 21 In this model, a rectangular area containing the baby's face was selected, and facial mapping (landmarks) was applied with the DLIB library. This technique resulted in 64 points for the face landmarks. These points were used to draw a filled convex polygon and to eliminate all the values outside the polygon. Finally, the frame was resized to 200×200 to make all the data dimensions uniform.
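The frame selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `has_face` callback, and the search order (forward from the starting frame, then backward) are assumptions, since the text does not state in which direction frames were evaluated.

```python
def next_face_frame(start, n_frames, has_face, taken):
    """Return the first unused frame with a detectable face, starting at `start`.

    Assumption: search forward first, then backward (the paper does not say).
    """
    forward = list(range(start, n_frames))
    backward = list(range(start - 1, -1, -1))
    for idx in forward + backward:
        if idx not in taken and has_face(idx):
            return idx
    return None


def select_frames(samples, sample_rate, fps, n_frames, has_face):
    """Pick up to three frames: loudest moment, one second before, one second after."""
    # Audio sample with the largest absolute amplitude (the loudest cry).
    loudest = max(range(len(samples)), key=lambda i: abs(samples[i]))
    # Map the audio index to a video frame index through time in seconds.
    centre = min(int(loudest / sample_rate * fps), n_frames - 1)
    one_second = int(round(fps))
    chosen = []
    for start in (centre, centre - one_second, centre + one_second):
        start = max(0, min(start, n_frames - 1))  # clamp to the clip
        idx = next_face_frame(start, n_frames, has_face, chosen)
        if idx is not None:
            chosen.append(idx)
    return chosen
```

In practice `has_face` would wrap a face detector; here it is only a placeholder predicate on the frame index.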

Audio preprocessing
Sounds are continuous analog signals, and their waveforms can take any value. 22 Therefore, to digitize and homogenize an audio signal, computers sample it at a specific sample rate. Similar to FPS for video, audio has a variety of sample rates, and the sample rate needed to be uniform across the dataset. The sample rate used was 22,050 Hz, resulting in an array of approximately 215,000 samples for 9.75 seconds of audio, which represented the audio wave. However, it is difficult to extract useful information directly from this audio wave. Therefore, the wave representation was changed into a time-frequency representation by using the STFT 21 (Equation 1). This equation contains two parameters, the hop counter (h) and the frequency index (f).
In the STFT process, a "Hann" window of length 2,048 with a hop of 512 samples was used. The total of 215,000 samples produced 420 hop iterations, with each iteration having a window size (N) of 2,048. The result was a two-dimensional array with the shape of 1025×420. This array still contained complex values, so the absolute value was taken to remove the imaginary part (amplitude spectrogram). The amplitude spectrogram has many near-zero values because babies cry within a certain frequency range. Therefore, rows 50 to 350 of the spectrogram were used and the rest were discarded, resulting in the final shape of 300×420. The amplitude spectrogram in its linear scale was transformed into its power log scale, or dB scale 23 (Equation 2), and the result was called the dB-scaled spectrogram. The value of 1.0 was used for the power reference (P 0 ), and A represented the amplitude spectrogram.
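The preprocessing above can be reproduced approximately with NumPy alone. The sketch below assumes centred frames with reflect padding (which is what yields 420 frames from 215,000 samples), a dB conversion of the form 10·log10(A²/P0), and a small floor of 1e-10 to avoid log(0); none of these details are stated in the text, so treat them as assumptions.

```python
import numpy as np

SAMPLE_RATE = 22050
N_FFT = 2048   # window size N
HOP = 512      # hop length


def stft_spectrograms(wave, n_fft=N_FFT, hop=HOP, row_lo=50, row_hi=350, p0=1.0):
    """Return (amplitude, dB-scaled) spectrograms cropped to rows row_lo..row_hi."""
    # Assumption: centre each frame by reflect-padding n_fft // 2 samples per side.
    padded = np.pad(wave, n_fft // 2, mode="reflect")
    n_frames = 1 + (len(padded) - n_fft) // hop
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    frames = np.stack(
        [padded[h * hop: h * hop + n_fft] * window for h in range(n_frames)],
        axis=1,
    )
    spectrum = np.fft.rfft(frames, axis=0)        # complex, shape (1025, n_frames)
    amplitude = np.abs(spectrum)[row_lo:row_hi]   # drop the imaginary part, crop rows
    power = np.maximum(amplitude ** 2, 1e-10)     # assumed floor to avoid log(0)
    db = 10.0 * np.log10(power / p0)
    return amplitude, db
```

For a 215,000-sample input, both returned arrays have the shape 300×420 quoted in the text.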

Autoencoder
The Variational Autoencoder (VAE) is a variation of the autoencoder that was proposed in 2013. 24 The VAE can control the latent-vector distribution because its structure contains two latent-vectors (Figure 3). As shown in previous research, 24 a VAE can generate new data from its latent-vector, for example to build a face generation model that creates a completely new face image. As shown in Figure 3, the two latent-vectors are combined into a single latent-vector using a reparameterization trick (Equation 3). This combination is important as it enables the backpropagation process for the VAE. In this equation, μ represents the mean (first latent-vector), σ represents the variance (second latent-vector), and ϵ represents a random normal value. Moreover, the VAE adds a latent loss to its reconstruction loss; this latent loss is usually called the KL Divergence Loss (Equation 4), with N representing the number of latent neurons.
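As an illustration of the reparameterization trick and the KL divergence term, the following sketch uses the common formulation in which the second latent-vector stores the log-variance; this parameterization is an assumption and may differ from the exact form of Equations 3 and 4.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """z = mu + sigma * eps with eps ~ N(0, 1), one draw per latent neuron.

    Assumption: the second latent-vector holds log(sigma^2).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)) summed over the N latent neurons."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))
```

Because the randomness enters only through ϵ, the sampled vector stays differentiable with respect to μ and σ, which is what allows backpropagation through the sampling step.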
Apart from autoencoders for image data, a study in 2017 25 applied an autoencoder to denoise a single-channel audio source. The autoencoder type used in that study was a convolutional autoencoder (CAE), which attempted to recreate the magnitude spectrograms. A spectrogram is a time-frequency representation of an audio signal. Figure 4 shows an example of dB-scaled spectrograms from the data of the present study. That research indicates that using a CAE for denoising audio produced promising results.
Autoencoders are unsupervised neural network algorithms trained with a backpropagation process. 26 The learning is unsupervised because the network tries to produce an output that is as similar as possible to its input, so the output and the input have the same shape. This eliminates the need for data labels.
After the data pass through several hidden layers, the autoencoder reduces the data dimension. At a certain point, the layers start to increase the data dimension again. Finally, the last layer has the same size as the input. The main idea of autoencoders is to reduce the original data dimension without losing important details, so that the original data can be reconstructed. 27 Based on this explanation, autoencoders consist of four main parts. The first part is the encoder, where the model keeps reducing the data dimension. The second part is called the bottle-neck, or latent-vector. The latent-vector is the result of the encoder and has the smallest size in the model. The third part is the decoder, which tries to recreate the original data from the latent-vector. The last part is the reconstruction loss, which is a function that measures the similarity between input and output. 28 Of these parts, only the encoder and the latent-vector are needed afterwards. The purpose of the decoder and the reconstruction loss is to ensure that the encoder and latent-vector are of high quality. Finally, only the encoder is used to obtain the latent-vector, and this latent-vector is forwarded into the classifier.
In this paper, the autoencoders were used as the feature learning process. As mentioned, autoencoders were used for both face and audio data. Both CAE and VAE were used, and the results were compared. Additionally, the results were analyzed with and without a dense latent-vector. The autoencoders mostly consisted of three kinds of layers: convolutional, max pooling, and upsampling layers. For the models that used a dense latent-vector, a fully connected layer was added. The Rectified Linear Unit (ReLU) 29 (Equation 5) was used as the activation function of the convolutional layers and the sigmoid function 30 (Equation 6) for the fully connected layers. Other activation functions did not improve performance.
Both amplitude and dB-scaled spectrograms used the same structure, but they were trained separately. Although the layers used varying parameters, all convolutional layers used a stride of one and "same" padding so that the convolutional layers do not reduce the dimension. The max-pooling layers used for dimensionality reduction had 2×2 kernels to minimize the loss of detail. The structure of the best CAE model without a dense latent-vector can be seen in Table 1 for the face autoencoder and Table 2 for the spectrogram autoencoder. For the models that used a dense latent-vector, a similar architecture was used with a different bottle-neck part (yellow color). For the face autoencoder, the bottle-neck part was changed into a fully connected layer with 1,250 neurons, followed by a layer with 8,250 neurons. For the spectrogram autoencoder, the existing layers were not changed; instead, a convolutional layer with ten filters and a 3×3 kernel was added before the bottle-neck, and a fully connected layer with 7,875 neurons was added after the bottle-neck. For the VAE models, the same architectures were used, both with and without a dense latent-vector, with the difference that the bottle-neck parts were duplicated and an extra layer was added to combine the two latent-vectors using the reparameterization trick.
Based on the autoencoder structure without a dense latent-vector, the latent features have the shape of 25×25×2 for the face data and 75×105×1 for the spectrogram data. These are the smallest shapes that still gave a good reconstruction result. The Results section shows that autoencoders without a dense latent-vector provide a better reconstruction result.
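The latent shapes can be checked with simple arithmetic. The sketch below assumes three 2×2 max-pooling steps for the 200×200 face images and two for the 300×420 spectrograms; these counts are inferred from the stated input and latent shapes, not read from Tables 1 and 2.

```python
def pooled_shape(height, width, n_pools, pool=2):
    """Spatial size after n_pools max-pooling layers with a pool x pool kernel."""
    for _ in range(n_pools):
        if height % pool or width % pool:
            raise ValueError("dimension not divisible by the pooling size")
        height //= pool
        width //= pool
    return height, width


face = pooled_shape(200, 200, n_pools=3)   # inferred: three pooling steps
spec = pooled_shape(300, 420, n_pools=2)   # inferred: two pooling steps
face_latent_size = face[0] * face[1] * 2   # two channels in the face latent
spec_latent_size = spec[0] * spec[1] * 1   # one channel in the spectrogram latent
```

The resulting sizes, 1,250 and 7,875 values, coincide with the widths of the fully connected layers mentioned for the dense variants.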
Autoencoders need a loss function to measure their performance. The most common and simple function is the Mean Squared Error (MSE). 31 However, MSE did not produce the best result in this study; therefore, the Structural Similarity index (SSIM) 32 was used as the loss function for the face autoencoder. In addition, binary cross-entropy (BCE) was used instead of MSE in the spectrogram autoencoder. The benefits of using SSIM and BCE compared to MSE are shown in Figures 5 and 6.
Equations 7 and 8 define the SSIM and BCE loss functions, with h representing the model prediction and y representing the desired output. For the SSIM loss function, k 1 was set to 0.01 and k 2 to 0.03, while L represents the maximum value (this value is one since the image has already been normalized).
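To make the two loss functions concrete, here is a simplified sketch. It computes SSIM from global image statistics rather than over sliding local windows, and it takes the loss as 1 − SSIM; both simplifications are assumptions, as is the clipping constant `eps` in the BCE.

```python
import math


def ssim_global(y, h, k1=0.01, k2=0.03, L=1.0):
    """SSIM between two flattened images in [0, L], using whole-image statistics."""
    n = len(y)
    mu_y, mu_h = sum(y) / n, sum(h) / n
    var_y = sum((a - mu_y) ** 2 for a in y) / n
    var_h = sum((b - mu_h) ** 2 for b in h) / n
    cov = sum((a - mu_y) * (b - mu_h) for a, b in zip(y, h)) / n
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    return ((2 * mu_y * mu_h + c1) * (2 * cov + c2)) / (
        (mu_y ** 2 + mu_h ** 2 + c1) * (var_y + var_h + c2))


def ssim_loss(y, h):
    """Assumed loss form: 1 - SSIM, so a perfect reconstruction gives 0."""
    return 1.0 - ssim_global(y, h)


def bce_loss(y, h, eps=1e-7):
    """Mean binary cross-entropy between targets y and predictions h in [0, 1]."""
    total = 0.0
    for a, b in zip(y, h):
        b = min(max(b, eps), 1.0 - eps)  # clip to keep the logarithms finite
        total -= a * math.log(b) + (1.0 - a) * math.log(1.0 - b)
    return total / len(y)
```

Library implementations of SSIM typically average the index over local windows, so their values will differ from this global version.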

Multimodal pain and cry detection
In the present study, we used joint representation to learn from the multimodal data together. We also used a pre-trained representation from unsupervised learning, namely autoencoders, to overcome the limited labeled data in this study. This research used the ASM model from previous research 12 to crop the face region and discard unimportant data.
We replaced the ASM model with the face landmark algorithm of Kazemi et al. 33 (Figure 7), which is supported by the DLIB library. The extracted features were forwarded into the Long Short-Term Memory (LSTM) model to classify pain and to detect cries from a sequence of video frames.
In the present study, classifiers were built for both cry detection and pain classification. Of all the autoencoders tested, both CAE and VAE without a dense latent-vector performed better (results shown in the next section). Therefore, the latent-vectors from autoencoders without a dense latent-vector were used as the input for the classifiers. It was also noticed that autoencoder models with similar performances may have completely different latent features from each other. Some latent features may provide good classification results while others may not. For this reason, a neural network classifier consisting only of fully connected layers and dropout layers could not be used in this study, and at least one convolutional layer was needed to boost classifier performance. In addition to the convolutional layer, dropout layers were used to prevent overfitting, considering the small amount of data in this study.
In total, five latent-vectors from the autoencoders were used. These features were not all used at once; instead, models were built using a single feature or a combination of features. There were five feature parameters for the cry detection and pain classifier system: faces, dB-scaled, amplitude, faces + dB-scaled, and faces + amplitude. For a model that used more than one feature, the outputs of the convolutional layers were concatenated. Therefore, in the multimodal models there were four convolutional layers, three for face data and one for the spectrogram, before the concatenation process.
For the cry detection model, there were only two target classes. Therefore, one output neuron with the sigmoid activation function was used. However, as there were three target classes for the pain classifier, three output neurons were needed. Additionally, the softmax activation function 34 (Equation 9) was used to turn the neurons' outputs into the prediction probability for each class. For the cost function, binary cross-entropy was used for the detection system and categorical cross-entropy (Equation 10) for the pain classifier. In these equations, N represents the number of output neurons.
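The softmax output and categorical cross-entropy for the three pain classes can be illustrated with a few lines of plain Python. This is a generic sketch of the standard definitions; the subtraction of the maximum logit and the `eps` floor are numerical-stability choices added here, not details from the paper.

```python
import math


def softmax(logits):
    """Convert N raw outputs into probabilities that sum to one."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one sample: -sum over the N outputs of y_i * log(p_i)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(one_hot, probs))


# Example: three outputs for (no pain, moderate pain, severe pain).
probs = softmax([0.2, 1.5, 3.0])
loss = categorical_cross_entropy([0, 0, 1], probs)
```

The loss shrinks towards zero as the probability assigned to the true class approaches one.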
As this study addresses a classification problem, accuracy was used as a performance measurement. In addition, the F1 score (Equation 11), one of the most widely used metrics for classification problems, was reported. The F1 score is the harmonic mean of precision (Equation 12) and recall (Equation 13). The final measurement is the Macro F1 Score, which is the average of the F1 scores of all classes.
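The Macro F1 Score can be computed directly from per-class counts, as in the sketch below. The convention of returning 0 when a class has no predictions or no true samples is an assumption made to avoid division by zero.

```python
def macro_f1(y_true, y_pred, labels):
    """Average of the per-class F1 scores (harmonic mean of precision and recall)."""
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # assumed convention for undefined F1
    return sum(scores) / len(scores)
```

Unlike accuracy, this average gives each class the same weight regardless of how many samples it has, which matters when the pain classes are imbalanced.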

Ensemble model
This study presents a single best model for the cry detection system. However, as no single pain classifier model achieved a sufficiently high F1 score, the four models with the highest F1 scores were used instead. Each model had different feature parameters: face + dB-scaled, face + amplitude, dB-scaled only, and amplitude only. Based on these results, it was decided to use all four models rather than choosing a single best model. The pain class was predicted by the four models, and majority voting was used as the ensembling method. This method collected the results of all classifiers and used the majority class label as the final predicted class. In the case of tied votes, the class with the higher pain label was chosen. For example, if the severe pain and moderate pain classes received the same number of votes, severe pain was selected as the final prediction.
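The voting rule described above, including the tie-break towards the higher pain label, can be written as a short function. The label strings and the function name are illustrative choices.

```python
from collections import Counter

# Classes ordered from lowest to highest pain level.
PAIN_ORDER = ["no pain", "moderate pain", "severe pain"]


def majority_vote(predictions, order=PAIN_ORDER):
    """Return the most voted class; ties go to the class with the higher pain level."""
    counts = Counter(predictions)
    top = max(counts.values())
    tied = [label for label in counts if counts[label] == top]
    return max(tied, key=order.index)


# Example with four classifier outputs and a 2-2 tie.
final = majority_vote(["moderate pain", "severe pain", "severe pain", "moderate pain"])
```

Resolving ties upwards errs on the side of flagging pain, which is the safer failure mode in a monitoring setting.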

Results
The experiment in this paper had two major parts. In the first part, three autoencoder models were built: one for the infant's face, one for the amplitude spectrogram, and one for the dB-scaled spectrogram of the infant's voice. In the second part, the latent-vectors from the autoencoders were used to build the cry detection and pain classification models.

Face autoencoder
In total, 20% of the 16,575 face dataset was used as the testing data, which resulted in 3,093 data. The results for the facial autoencoder can be seen in Table 3, and an image output sample can be seen in Figure 8. These results indicate that both CAE and VAE can recreate the data with a small SSIM loss value. From the reconstruction loss and the sample output, it can be seen that the best model was the CAE without a dense latent-vector.
A dense latent-vector reduced the model performance for both CAE and VAE. Although the performance was only slightly reduced and the original face was still reconstructed adequately, the dense latent-vector used many more weight parameters because of its fully connected layers. Therefore, it can be concluded that autoencoders without a dense latent-vector were the best option in this case.
Even though CAE and VAE are different methods and cannot be compared directly, it can be seen in Figure 8 that CAE gave a much better reconstruction result. A likely explanation is that the VAE tried to build a more general model and, as a result, was forced to miss small details.

Spectrogram autoencoders
Two spectrogram autoencoder models were built, one for the dB-scaled spectrogram and one for the amplitude spectrogram, and the same structure and loss function were used for both. Moreover, the same training and testing data distribution was used. The testing data comprised 506 samples, representing 20% of the total data (2,530 short videos). The use of the dense latent-vector for the spectrogram autoencoder produced the same outcome as for the face autoencoder (results not shown). The results of the spectrogram CAEs and VAEs can be seen in Table 4, and the output samples are shown in Figure 9 for the dB-scaled spectrogram and Figure 10 for the amplitude spectrogram.
From the output samples, it can be seen that the VAE reconstructions were poor for both spectrograms. Figure 10 shows that the VAE failed to reconstruct the amplitude spectrogram. For the dB-scaled spectrogram, the reconstruction may look similar to the original data. However, small differences in the spectrogram can produce a large difference in the audio signal. Therefore, the present study did not test the classifier with the VAE latent-vector for the spectrograms. It was also assumed that the VAEs were unable to recreate the original data perfectly because of their tendency to generalize.

Data label distribution
Two training and testing data distributions were tested. In the first distribution, seven original videos (three pre-surgery and four post-surgery videos) were held out to prevent overfitting of the system. This resulted in 1,836 training data and 432 testing data (Table 5). For the second distribution, the data order was shuffled and 25% of the data was held out for testing. Therefore, 1,701 training data and 567 testing data were obtained. Further details about this distribution can be seen in Table 6. This distribution was provided for comparison with the first distribution.

Cry detection
In this study, several feature combinations were tested using both data distributions (Table 7). The results show that the second distribution gave much higher results on all parameters. It is possible that the second distribution could not reveal an overfitted model because of the high similarity between its training and testing data.
Based on the results of the first data distribution, facial expression was not a relevant feature for cry prediction. However, this feature can give a much better result on the pain problem. This result indicated that more data is required to avoid overfitting of the system. Face features extracted with CAE gave a higher result; this was also supported by the multimodal feature results, which were better when CAE was used for facial feature extraction. In contrast, using voice features, especially the dB-scaled spectrogram, resulted in a much higher F1 score and accuracy than the facial features. Therefore, it can be concluded that the voice feature was more relevant and needed less data than the facial feature. The dB-scaled spectrogram had a better outcome than the amplitude spectrogram, although the amplitude spectrogram also had a better result than the facial features. However, the best result was obtained by the multimodal feature, which used the dB-scaled spectrogram as the audio representation and CAE for facial feature extraction. The multimodal feature slightly increased the accuracy even though it used more resources.
According to these results, multimodal features are recommended when enough resources are available. With limited resources, it is best to use only the voice feature, as it can still produce an acceptable result. The confusion matrices for the multimodal feature and the voice feature, both using the dB-scaled spectrogram audio representation, are provided in Table 8.

Pain classification
The results for the pain classification problem are shown in Table 9. The second distribution provided extremely high accuracy and F1 score in most cases, even where the first distribution gave a very low result. Therefore, the best method and feature were selected based on the first distribution. The highest result came from the model that used the multimodal feature with CAE as the face autoencoder and the dB-scaled spectrogram as its audio representation. However, this parameter only produced a slightly better result than the other parameters, and the result was not good enough for the model to be considered the best model for pain classification. Therefore, the top four models were ranked based on their F1 scores. Ensembles of these four models were tested with several methods, and the best result was produced using the majority votes technique with the sum of each model's probabilities across the four models as the tiebreaker. This ensemble model had a better result in terms of accuracy and F1 score. The confusion matrix for this ensemble model can be seen in Table 10. An input and output example is provided in Figure 11.
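A majority vote with a summed-probability tiebreaker, as mentioned above, could look like the following sketch. Each model is assumed to output one probability per class; the function name and the data layout are hypothetical.

```python
def vote_with_probability_tiebreak(prob_rows):
    """Majority vote over models; ties are broken by the summed class probabilities.

    prob_rows: one list of class probabilities per model (same class order).
    Returns the index of the winning class.
    """
    n_classes = len(prob_rows[0])
    votes = [0] * n_classes
    for probs in prob_rows:
        # Each model votes for its most probable class.
        votes[max(range(n_classes), key=lambda c: probs[c])] += 1
    top = max(votes)
    tied = [c for c in range(n_classes) if votes[c] == top]
    sums = [sum(probs[c] for probs in prob_rows) for c in range(n_classes)]
    return max(tied, key=lambda c: sums[c])


# Four models, classes ordered (no pain, moderate pain, severe pain).
winner = vote_with_probability_tiebreak([
    [0.10, 0.60, 0.30],
    [0.20, 0.50, 0.30],
    [0.10, 0.20, 0.70],
    [0.05, 0.15, 0.80],
])
```

In this example the vote is tied 2-2 between moderate and severe pain, and the larger summed probability decides the outcome.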

Discussion
In this paper, several methods were tested for building a deep learning model for pain and cry detection systems. The results indicate that the voice feature was more relevant than facial expression for pain detection. However, the combination of both features can achieve a higher score. The multimodal deep learning autoencoder can assist the detection of infant cry and pain.
Table 11 shows a comparison of this work with other related works. Since 2010, researchers have tried to use images as a modality to classify pain in infants, and they reported excellent accuracy for image-based pain detection.
Our research, which proposed a multimodal deep learning autoencoder for infant cry and pain detection, achieved an even higher result, with an F1 score of 0.991.
Apart from studies using facial expression, several studies tried to classify infant pain using acoustic features from the infant voices data.In 1995, Petroni et al. 5 tried to classify the infant's cry by using the Mel-cepstrum coefficients representation and filter-band energies with the help of ANN.This research used three classification targets: anger, fear, and pain.The results provided good accuracy for predicting anger and pain.However, this research had many prediction errors for the fear label.
In 2003, J. O. Garcia and C. A. R. Garcia 6 set out to classify normal and pathological infant cries. Their research achieved high accuracy using 506 samples and two target classes. Instead of the raw audio waveform, that study used MFCC as the audio representation, which is derived from the spectrogram but uses a frequency scale adapted to approximate human hearing. A neural network was also used to produce the final prediction.
In 2011, M. A. Nicolaou et al. 35 used multimodal data without deep learning methods. Since the features were handcrafted, autoencoders were not used. That study used the same joint representation method, together with three fusion methods: feature-level (used in this study), model-level, and output-level fusion. Feature-level fusion combines the features so that they are predicted using one model, whereas model-level and output-level fusion predict from each representation individually and then combine the predictions using another model. 36
In 2018, T. Baltrušaitis et al. 37 summarized the methods that have been used for multimodal data representation. At present, there are only two methods for co-learning multimodal data at the same time: joint and coordinated representations. The basic concept of joint representation is to fuse the data by concatenating the individual features of each modality. Coordinated representation, on the other hand, lets each representation be learned separately while keeping them coordinated. Joint representation was used in the present study as it is the most widely used method. However, limited labeled data was the main obstacle in this case. The solution was to use a representation pre-trained by unsupervised learning, such as an autoencoder. Autoencoders have been used for unsupervised learning previously by another study.
In our previous studies, 11,12 a cry detection and pain classification system was created. These studies used 46 videos taken by E. Hanindito et al. 18 and added ten more videos of various durations. They introduced an ASM for landmarking the infant's face. The ASM was built from the frontal faces in the videos and was used to extract handcrafted geometric features. With these features, the studies achieved high accuracy for infant facial cry detection and pain classification.
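The difference between feature-level fusion (the joint representation used in this study) and output-level fusion, as described above, can be sketched as follows. This is an illustration under stated assumptions only: the function names and vector sizes are hypothetical, and the weighted average in `output_level_fusion` is a simple stand-in for the second-stage model mentioned in the text, not the method of any cited study.

```python
def feature_level_fusion(face_latent, audio_latent):
    """Joint representation: concatenate the per-modality latent vectors.

    The fused vector is then passed to a single classifier.
    """
    return list(face_latent) + list(audio_latent)


def output_level_fusion(per_modality_probs, weights=None):
    """Combine class probabilities predicted separately for each modality.

    Assumption: a weighted average replaces the second-stage combining model.
    """
    n_modalities = len(per_modality_probs)
    if weights is None:
        # Equal weight for every modality unless stated otherwise.
        weights = [1.0 / n_modalities] * n_modalities
    n_classes = len(per_modality_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, per_modality_probs))
        for c in range(n_classes)
    ]


# Hypothetical latent vectors from a face and an audio autoencoder.
joint = feature_level_fusion([0.1, 0.2], [0.3])
# Hypothetical per-modality predictions for two classes.
fused = output_level_fusion([[0.8, 0.2], [0.4, 0.6]])
```

In the feature-level case the classifier sees both modalities at once and can learn interactions between them; in the output-level case each modality is judged on its own before the results are merged.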
In 2018, when deep learning had already been widely adopted as a feature learning method, Y. Kristian et al. 39 removed all the handcrafted feature extraction and switched to deep learning using autoencoders. This research still used the same ASM from their previous research 12 to crop the face region and discard unimportant data. This paper transforms the ASM model into the face landmark algorithm of Kazemi et al., 33 which is supported by the DLIB library. The extracted features were forwarded into a Long Short-Term Memory (LSTM) model for pain classification and cry detection from a sequence of video frames. The face landmark algorithm from this research was used in this paper through the DLIB library.
This research reported high accuracy for both cry detection and pain classification. Based on this finding, multimodal representations that also include physiological parameters, such as vital signs, should be considered in future work. Severe stimuli produce changes in vital signs, such as an increased heart rate and changes in breathing pattern and oxygenation, and these should be explored rather than only changes in facial expression and cry pattern.
However, our study has several limitations. The limited amount of data may introduce bias, although the proposed method still achieved a relatively good result. Increasing the amount of data would be expected to improve performance further.
A patient monitoring system is very important, as most pain scores are subjective. These findings support hospital monitoring systems and can increase health care providers' awareness of infant pain. If the pain persists, the pain medication may need to be modified.

Conclusion
This study has introduced a method for cry and pain detection using the infant's facial expression and voice. In addition, an alternative method for cases with limited resources has been provided. It has been shown that the infant's voice is a more relevant feature than the facial expression, but that combining the two features gives a better result.
For feature extraction, it has been demonstrated that CAE performs better than VAE for both facial expressions and voice spectrograms. This is because VAE is used more for generating new data than for extracting features. Moreover, it has been shown that using a dense latent vector in the autoencoder increased the reconstruction loss.

Abdulaziz Saleh Ba Wazir
Faculty of Engineering, Multimedia University, Cyberjaya, Malaysia
The paper presents a bright idea of deep learning usage for infant cry and pain detection. The paper used multimodal data and utilized both visual and audio features of the dataset for training and testing purposes. The use of both features added more value in the field of infant cry research using deep learning. Here are a few comments to boost the paper's contribution and organization as well.

1. Much of the literature on autoencoders was highlighted in detail in the methods section. However, it would be more proper to discuss it as part of the literature in the introduction section or as a separate literature section, and only refer to it in the methods section.

2. Much of the literature on infant cry and pain detection, such as references 11, 13, and 15, was highlighted in detail in the methods section. However, it would be more proper to discuss it further as part of the literature in the introduction section or as a separate literature section, and only refer to it in the methods section.

3. It is highly recommended to separate methods and experimental settings into sections. Evaluation and metrics can be part of the experimental settings.

4. The discussion can either be extended and detailed, or added to the results section to form one section (results and discussion), with the discussion detailed within the results if combined.

5. Although several works were cited on the same dataset, the paper lacks a comparison with that research. As part of the results and discussion, a comparison should be tabulated and discussed, even if the size and type of the dataset are different. The table should highlight the dataset, approach, deep learning models, and achieved metrics. At least a few works can be compared, such as 11, 12, 13, 14, and 15.

6. For the comparison table as in comment number 5, the table should highlight at least one or two common metrics. If common metrics are missing, adding one or two metrics is still more acceptable than not having a comparison at all.

7. The comparison table should be discussed to highlight the novelty and contributions of this paper over other methods, such as the usage of audio and images, or the outperformance of the current work over others in terms of metrics or weights. One or two differences are enough, but the more the better.

8. The novelty and contributions based on the comparison table can be used to further detail the contributions in the background and conclusion sections.

Reviewer Expertise: Deep Learning, Image processing, Computer Vision, Acoustics using deep learning, Audio and Signal Processing
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Ghazal Bargshady
Faculty of Science and Technology, University of Canberra, Bruce, Australia
The paper is good, and the model and novelty are explained well; however, there is some feedback which the authors need to address before indexing.

1. The discussion should go into more detail, explain well, and talk more about future work. Please discuss the achievement and compare it with other work. This section needs more work.

2. The comparison with other work is not discussed well. It would be better if the authors used a table for comparison with other works.

3. In the method section, it is recommended to separate the experimental setup, such as the database and experimental details, into a separate section.

4. It is better to highlight more contributions in the introduction and discussion.

5. In the reference list, I can see that the authors have not mentioned other related works. There is much important research related to deep learning and pain detection that has been done before by others and published in high-quality journals, which should be brought into the reference list and its results compared with this work.

Are sufficient details of methods and analysis provided to allow replication by others? Partly
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Affective computing, deep learning, computer vision
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Figure 2. System architecture. The face region is extracted from the selected frames using face landmarks. The amplitude and dB-scaled spectrograms are obtained from the STFT process. These data are forwarded into the autoencoders, resulting in five latent vectors, which are the input for the cry detection and pain classification systems.
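As an illustration of the audio branch in Figure 2, the following dependency-free sketch computes an amplitude spectrogram with a Short-Time Fourier Transform and converts it to a dB scale. The Hann window, the frame length of 64 samples, the hop of 32 samples, the floor value, and the choice of the peak amplitude as the dB reference are all assumptions made for this example; the settings used in the study are not stated in this section. The function names are hypothetical.

```python
import cmath
import math


def stft_amplitude(signal, frame_len=64, hop=32):
    """Amplitude spectrogram: one list of bin magnitudes per frame."""
    # Periodic Hann window (assumed; other windows are equally possible).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(n_bins):
            # Naive DFT of the windowed frame at frequency bin k.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def amplitude_to_db(spectrogram, floor=1e-10):
    """dB-scaled spectrogram relative to the peak amplitude (peak maps to 0 dB)."""
    peak = max(max(row) for row in spectrogram)
    ref = max(peak, floor)
    return [[20.0 * math.log10(max(v, floor) / ref) for v in row]
            for row in spectrogram]


# Hypothetical test tone: a sine that falls exactly on frequency bin 8.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
amp = stft_amplitude(tone)
db = amplitude_to_db(amp)
```

The dB scale compresses the large dynamic range of the amplitudes, so weaker harmonics remain visible next to the dominant ones. In practice a library FFT would replace the naive DFT loop.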

Figure 3. Architecture of the Variational Autoencoder (VAE), consisting of an encoder and a decoder with two latent vectors, usually called the Mean and the Standard Deviation. These latent vectors are combined using the reparameterization trick.
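The reparameterization trick named in the caption of Figure 3 can be sketched in a few lines. The sketch follows the caption in treating the two latent vectors as a mean and a standard deviation; many implementations have the encoder output a log-variance instead, which is not assumed here. The function name `reparameterize` and the example values are hypothetical.

```python
import random


def reparameterize(mean, std, rng=random):
    """Sample z = mean + std * eps element-wise, with eps drawn from N(0, 1).

    Drawing the noise separately keeps z a deterministic function of mean and
    std, which is what allows gradients to flow through the sampling step.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


# Hypothetical two-dimensional latent vectors produced by an encoder.
z = reparameterize([0.5, -1.0], [0.1, 0.2], random.Random(0))
```

When the standard deviation is zero the sample collapses to the mean, so the VAE encoder behaves like a deterministic autoencoder in that limit.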

Figure 4. Spectrograms of the infant's voice from a crying baby in severe pain (a), and from a baby that is neither crying nor in pain (b).

Figure 8. Sample face autoencoder results using the SSIM loss function: original image (a), output from CAE (b), output from CAE with dense latent vector (c), output from VAE (d), and output from VAE with dense latent vector (e).

Figure 9. Sample dB-scaled spectrogram autoencoder results using the BCE loss function: original spectrogram (a), output from CAE (b), and output from VAE (c).

Figure 10. Sample amplitude spectrogram autoencoder results using the BCE loss function: original spectrogram (a), output from CAE (b), and output from VAE (c).

Figure 11. Examples of input data with their results.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Yes
Are sufficient details of methods and analysis provided to allow replication by others? Partly
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Partly
Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.

Table 3. Face autoencoder of various types.

Table 7. Cry detection using various features.

Table 8. Confusion matrix of cry detection.

Table 9. Pain classification using various features.

Table 10. Confusion matrix of ensemble pain classification. 38

Table 11. Comparison with other related works.