Motion and Geometric Feature Analysis for Real-time Automatic Micro-expression Recognition Systems [version 1; peer review: awaiting peer review]

Interest in real-time micro-expression recognition systems has grown with recent advances in human-computer interaction (HCI) for security and healthcare. Many studies in this field have focused on recognition accuracy, while few have addressed computational cost. In this paper, two feature extraction approaches are analyzed for real-time automatic micro-expression recognition. The first is a motion-based approach, which computes the motion of subtle changes across an image sequence and presents it as features. The second is a low-cost geometric-based feature extraction technique, a popular method for real-time facial expression recognition. These approaches were integrated into a developed system together with a facial landmark detection algorithm and a classifier for real-time analysis. Recognition performance was evaluated on the SMIC, CASMEII, CAS(ME)2, and SAMM datasets. The results suggest that the optimized Bi-WOOF (leveraging motion-based features) yields its highest accuracy, 68.5%, on the SAMM dataset, while the full-face graph (leveraging geometric-based features) yields 75.53% on the same dataset. In terms of speed, the optimized Bi-WOOF processes a sample in 0.36 seconds and the full-face graph in 0.10 seconds at a 640x480 image size. All experiments were performed on an Intel i5-3470 machine.


Introduction
A micro-expression is a brief, spontaneous facial expression that appears on a person's face in response to an experienced emotion. Micro-expressions carry a significant amount of information and have attracted the interest of computer vision researchers because of their potential uses in security, interrogation, and healthcare. [1][2][3] However, because facial muscle movement is so fast, this information is difficult to extract, and the features must be correspondingly detailed. A typical micro-expression lasts 200 milliseconds or less. 4 Real-time micro-expression analysis and emotion recognition involve pre-processing, feature extraction, and recognition. This paper examines two popular feature extraction approaches: motion-based features and geometric-based features. Both are reported to yield reliable detail from uncontrolled image data, which makes them feasible for real-time analysis.
A motion-based feature is constructed from the non-rigid motion of subtle expressions, where motion changes are extracted for spotting purposes. Facial motion analysis was first presented in 5 using optical flow to spot micro-expressions. Since then, several studies have explored this approach for facial landmark detection and micro-expression recognition. The authors of 6 proposed optical flow features from an apex-frame network (OFF-ApexNet), which combines optical-flow-guided context with a convolutional neural network (CNN) to compute features. The authors of 7 presented a novel algorithm that combines a deep multi-task convolutional network for detecting facial landmarks with a fused deep convolutional network for micro-expression features. Another study 8 suggested the Riesz pyramid and a multi-scale steerable Hilbert transform, while Merghani and Yap 9 proposed a new region-based method with an adaptive mask. Among these motion-based methods, the reported recognition accuracy peaks at 74.06% on CASMEII using leave-one-subject-out cross-validation (LOSOCV). 6

Geometric facial analysis, on the other hand, deals with the locations and shapes of facial components. As highlighted by Liu et al., 10 the performance of landmark detection algorithms was once limited, so only a few early studies used landmarks in facial graph representations. With recent advances in face analysis, however, improved facial landmark detection algorithms have been presented in several studies. [11][12][13][14] For facial landmark graph features, Lei et al. 15 presented a method that employs only 28 brow and lip landmarks, which contribute significantly to micro-expressions, while other studies [16][17][18][19] presented graph-based methods that use action units (AUs) to define landmarks of interest. The recognition accuracies reported for these methods show that micro-expression features can be extracted using facial graph approaches.
However, the general problem with graph-based micro-expression recognition is the lack of large-scale in-the-wild datasets. To date, the recognition accuracy peaks at 87.33% on the SAMM dataset with LOSOCV, as reported in Buhari et al. 18

Methods
An automatic micro-expression recognition system is implemented for real-time facial analysis by integrating facial landmark detection, feature extraction, and classification. In the developed system, a trained model is generated from publicly available spontaneous micro-expression datasets. For micro-expression feature analysis, two methods were implemented. The first is the Bi-Weighted Oriented Optical Flow (Bi-WOOF) feature descriptor by Liong et al. 20 This motion-based approach uses optical flow to compute features and requires apex-frame spotting before the feature computation. Bi-WOOF is considered in this study because of its performance improvement over textural feature extraction methods such as local binary patterns on three orthogonal planes (LBP-TOP), as reported in Liong et al. 20 However, its computational cost poses challenges for real-time recognition because it requires apex-frame spotting. The second feature descriptor is the full-face graph by Buhari et al. 18 This geometric-based method requires only facial landmarks to compute features, and is considered here because its computational time is significantly lower than that of motion-based methods. However, it is reported that earlier geometric-based methods could not detect hidden changes in facial components because of their subtleness and briefness.

Figure 1 illustrates the implemented real-time micro-expression recognition system developed using Bi-WOOF. This feature extractor requires at least two frames (i.e., a neutral frame and an apex frame). Firstly, the system captures face images using dlib-19.4. 21 Next, apex-frame spotting is applied using the automatic apex-frame spotting method by Liong et al. 22 to identify the frame with the highest facial expression within the captured image sequence (i.e., the processing sample). As reported in Liong et al., 22 their method's performance improved over the annotated apex frames provided in the micro-expression databases. Here, image sequences from spontaneous micro-expression datasets were utilized. Upon identifying the onset and apex frames, optical flow vectors are computed to define the facial motion patterns: (i) magnitude, the pixel movement intensity; (ii) orientation, the flow motion direction; and (iii) optical strain, the intensity of small deformations. The Bi-WOOF features are then formed from these computed optical flow vectors (i.e., the magnitude, orientation, and optical strain).

Motion-based framework
Step-by-step details of this method can be found in Liong et al. 20 Figure 1 shows the framework of the real-time micro-expression recognition system using apex-frame spotting and the Bi-WOOF feature extraction method.

Geometric-based framework
Figure 2 illustrates the implemented real-time micro-expression recognition system developed using the full-face graph. First, facial landmark detection is applied to detect the coordinates of the facial components using dlib-19.4. Then, line segments are generated from the detected coordinates by connecting each landmark point (denoted as p) with every preceding landmark point (denoted as q), for p ∈ {1, 2, …, N} and q ∈ {1, 2, …, p − 1}, where N = 68. This construction is described as a full-face graph over the landmark points.

The indexes (i.e., p, q) of every such landmark pair are determined and stored as segments in ℑ for the feature computations. After the graph is generated, features are computed by calculating the Euclidean distance and gradient of every segment, an idea presented in Buhari et al. 18 The total number of features computed with this technique is K = N × (N − 1), which translates to 4,556 features at N = 68.
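The full-face graph feature computation can be sketched as below. This is an illustrative Python sketch of the idea from Buhari et al.: the handling of vertical segments (zero horizontal difference) is an assumption here, and the exact normalisation in the original method may differ. Each of the N(N − 1)/2 = 2,278 segments contributes two features (distance and gradient), giving 4,556 in total.

```python
import numpy as np

N = 68  # dlib's 68-point facial landmark model

def full_face_graph_features(landmarks):
    """Geometric features from a full-face graph: connect every pair of
    the N landmarks (q < p, so each segment is counted once) and take
    each segment's Euclidean distance and gradient (slope)."""
    feats = []
    for p in range(N):
        for q in range(p):
            dx = landmarks[p, 0] - landmarks[q, 0]
            dy = landmarks[p, 1] - landmarks[q, 1]
            dist = np.hypot(dx, dy)              # Euclidean distance
            grad = dy / dx if dx != 0 else 0.0   # gradient; 0 for vertical
            feats.extend([dist, grad])
    return np.asarray(feats)                     # K = N * (N - 1) values
```

Because only landmark coordinates are needed, the cost per sample is dominated by the landmark detector rather than the feature computation itself.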
To further analyse the potential performance improvement of the geometric-based features, Eulerian motion magnification (EMM) is applied to the images to amplify the micro-expressions before the landmark detection process. Eulerian-inspired approaches 23,24 do not require explicit motion vectors but emulate motion magnification by magnifying property changes, such as amplitude (denoted A-EMM) or phase (denoted P-EMM). According to Le et al., 24 A-EMM outperforms P-EMM in recognition rate over a broad range of magnification factors. This paper therefore applies A-EMM to the images before the feature computations; details of the A-EMM method are given in Le et al. 24 Figure 3 illustrates the magnification sub-process integrated into the implemented single-frame, geometric-based features system.
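The amplitude-magnification principle can be illustrated with a much-simplified sketch: band-pass each pixel's intensity over time (here, as the difference of two first-order IIR low-pass filters), amplify the band-passed signal by a factor alpha, and add it back. The filter rates and alpha below are illustrative assumptions; the actual A-EMM of Le et al. filters per spatial frequency band on an image pyramid.

```python
import numpy as np

def amplitude_emm(frames, alpha=10.0, r_fast=0.4, r_slow=0.05):
    """Simplified amplitude-based Eulerian motion magnification: a per-pixel
    temporal band-pass (difference of two exponential low-pass filters),
    amplified by `alpha` and added back onto each frame."""
    frames = np.asarray(frames, dtype=np.float64)
    low_fast = frames[0].copy()
    low_slow = frames[0].copy()
    out = [frames[0].copy()]
    for f in frames[1:]:
        low_fast = r_fast * f + (1 - r_fast) * low_fast
        low_slow = r_slow * f + (1 - r_slow) * low_slow
        band = low_fast - low_slow       # temporal band-pass signal
        out.append(f + alpha * band)     # amplified change added back
    return np.stack(out)
```

For a static sequence the band-pass signal is zero and the frames pass through unchanged, which is the desired behaviour: only temporal intensity changes (i.e., motion) are amplified.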

Experiment settings
The experiments were performed using four spontaneous micro-expression datasets: (i) the spontaneous micro-expression (SMIC) dataset, 25 (ii) the Chinese Academy of Sciences Micro-expression (CASMEII) dataset, 26 (iii) the spontaneous macro-expressions and micro-expressions (CAS(ME)2) dataset, 27 and (iv) the spontaneous actions and micro-movements (SAMM) dataset. 28 Full details of these datasets are given in Li et al., 25 Yan et al., 26 Qu et al., 27 and Davison et al. 28 The datasets can be acquired at www.oulu.fi/cmvs/node/41319 for SMIC, 25 fu.psych.ac.cn/CASME/casme2-en.php for CASMEII, 26 fu.psych.ac.cn/CASME/cas(me)2-en.php for CAS(ME)2, 27 and personalpages.manchester.ac.uk/staff/adrian.davison/SAMM.html for SAMM. 28 Moreover, to evaluate performance on a larger dataset, this paper merges the four datasets into a COMBINED dataset built from their raw images. The steps for generating the COMBINED dataset are face detection, face cropping, colour-space conversion to grayscale, and image re-scaling to 140 × 170. The colour-space conversion is applied because the SAMM samples are provided in grayscale, while the re-scaling to 140 × 170 adopts the SMIC cropped image size (the smallest cropped size considered to provide a reliable feature description and high-speed performance for real-time micro-expression recognition). The re-scaling uses the down-sampling technique of Buhari et al. 29 to produce high-quality down-scaled samples. In addition, the COMBINED dataset adopts the SMIC labelling by re-grouping the seven emotion classes (happiness, sadness, anger, surprise, fear, contempt, and disgust) into three classes (positive, negative, and surprise).
Here, positive ∈ {happiness}, negative ∈ {sadness, anger, fear, contempt, disgust}, and surprise ∈ {surprise}. Figure 4 illustrates the formation of the COMBINED dataset from the four spontaneous datasets. Note that the participant images used in Figure 4 are publishable with the consent of the participants, as stated in the documentation of each study. Table 1 summarises the selected spontaneous micro-expression datasets used in this study; in this table and in Table 2, the COMBINED dataset is denoted as δ.

Results
Table 2 presents the recognition accuracies of the baseline Bi-WOOF 20 (denoted BBW), the optimized Bi-WOOF (denoted OBW), the full-face graph (denoted FFG), and the full-face graph with A-EMM (denoted FFG+M). The baseline Bi-WOOF refers to the original method by Liong et al., 20 implemented in MATLAB, while the optimized Bi-WOOF refers to the implemented C++ version that accelerates the computation for real-time analysis. All four experimental setups use a Support Vector Machine (SVM) classifier with a radial basis function (RBF) kernel. The SVM hyper-parameters are selected following the recommendation of Bergstra and Bengio, 30 who describe random search as an optimised hyper-parameter selection technique that outperforms sequential tuning for models with many hyper-parameters. All reported accuracies are based on LOSOCV. The BBW and OBW yield their highest recognition accuracies, 66.01% and 69.15% respectively, on the COMBINED dataset. Similarly, FFG and FFG+M yield their highest accuracies, 77.05% and 77.85% respectively, on the COMBINED dataset.
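The evaluation protocol (RBF-kernel SVM, random hyper-parameter search, leave-one-subject-out cross-validation) can be sketched with scikit-learn as below. The search ranges and trial count are illustrative assumptions, not the values used in the paper.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

def losocv_accuracy(X, y, subjects, n_trials=5, seed=0):
    """Mean LOSOCV accuracy of an RBF-kernel SVM, with C and gamma chosen
    by random search (Bergstra & Bengio). Each subject forms one held-out
    fold, so no subject appears in both training and test sets."""
    rng = np.random.RandomState(seed)
    logo = LeaveOneGroupOut()
    best = 0.0
    for _ in range(n_trials):
        # Sample hyper-parameters log-uniformly over illustrative ranges
        C = 10.0 ** rng.uniform(-1, 3)
        gamma = 10.0 ** rng.uniform(-4, 0)
        clf = SVC(kernel="rbf", C=C, gamma=gamma)
        scores = cross_val_score(clf, X, y, groups=subjects, cv=logo)
        best = max(best, scores.mean())
    return best
```

Grouping the folds by subject rather than by sample is what makes LOSOCV a fair protocol for micro-expression datasets, where each subject contributes many correlated samples.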
From these results, it is observed that the OBW improves on the BBW by up to 3.28%, over the SAMM dataset, while the full-face graph with A-EMM improves performance by up to 1.20%, also over the SAMM dataset. Comparing the optimized Bi-WOOF with the full-face graph with A-EMM, the latter improves performance by 9.72%, 11.23%, 18.32%, 7.03%, and 8.7% over the SMIC, CASMEII, CAS(ME)2, SAMM, and COMBINED datasets, respectively. Moreover, Table 3 compares the accuracy of the optimized Bi-WOOF with other motion-based methods, and Table 4 compares the accuracy of the full-face graph with other geometric-based methods.

Discussion
Table 3 lists the performance of benchmark motion-based methods 6-9,20,31,32 alongside the optimized Bi-WOOF. The best reported accuracy is 74.06%, on the CASMEII dataset. 6 For Bi-WOOF+Phase, 32 the highest reported performance is 68.29% on the SMIC dataset, which outperforms both the baseline and the optimized Bi-WOOF. The optimized Bi-WOOF, however, outperforms the accuracies reported in the other studies, 7-9,31 as shown in Table 3.

On the other hand, Table 4 lists the benchmark geometric-based methods [15][16][17][18][19] alongside the full-face graph and the full-face graph + A-EMM, denoted experiment I and experiment II, respectively. Among these, Buhari et al. 18 report the highest accuracies: 76.67%, 75.04%, 81.41%, and 87.33% on the SMIC, CASMEII, CAS(ME)2, and SAMM datasets, respectively. In Buhari et al., 18 the full-face graph uses 68 landmarks from the raw images (denoted ℝ). The full-face graph in experiments I and II yields 74.62%, 74.41%, 75.11%, and 74.33%, and 75.01%, 74.55%, 76.21%, and 75.53%, over the SMIC, CASMEII, CAS(ME)2, and SAMM datasets, respectively. From these results, the full-face graph in experiment II outperforms the full-face graph presented in 18 by 8.11%, 1.1%, and 3.38% on the SMIC, CASMEII, and CAS(ME)2 datasets, respectively, while Buhari et al. 18 outperform experiment II by 4.7% on the SAMM dataset. Compared with the accuracies reported in Table 3, the full-face graph in experiment II achieves the highest performance.
Looking at the performance presented in Tables 3 and 4, Buhari et al. 18 register the highest overall accuracy, 87.33%, on the SAMM dataset, while the full-face graph with A-EMM outperforms the full-face graph performance presented in Buhari et al. 18 on the remaining datasets. From these results, it can be concluded that the geometric-based features compete closely with the motion-based features. In terms of computational time, the optimized Bi-WOOF runs at 0.36 seconds per sample (i.e., 2.7 fps), while the full-face graph runs at 0.10 seconds per sample (i.e., 10 fps), at a 640 × 480 image resolution on an Intel i5-3470 machine. The reported running times include facial landmark detection and classification.

Conclusions
This paper analyzed the performance of motion-based features (i.e., Bi-WOOF) and geometric-based features (i.e., the full-face graph) for real-time micro-expression recognition systems. The results indicate that the optimized Bi-WOOF improves the recognition accuracy of the baseline Bi-WOOF by up to 3.28%, over the SAMM dataset, while the full-face graph performance improves by up to 1.20% with A-EMM, also over the SAMM dataset. Moreover, the full-face graph and the full-face graph with A-EMM exhibit a significant performance improvement over the baseline and optimized Bi-WOOF, by up to 18.32%. Although the full-face graph improves recognition accuracy, its processing time could still limit the readiness of the full-face graph features for real-time systems using high-speed cameras.

Data availability

Underlying data
The experiments were performed using four spontaneous micro-expression datasets: (i) the spontaneous micro-expression (SMIC) dataset, 25 (ii) the Chinese Academy of Sciences Micro-expression (CASMEII) dataset, 26 (iii) the spontaneous macro-expressions and micro-expressions (CAS(ME)2) dataset, 27 and (iv) the spontaneous actions and micro-movements (SAMM) dataset. 28 Full details of these datasets are given in Li et al., 25 Yan et al., 26 Qu et al., 27 and Davison et al. 28 The datasets can be acquired at www.oulu.fi/cmvs/node/41319 for SMIC, 25 fu.psych.ac.cn/CASME/casme2-en.php for CASMEII, 26 fu.psych.ac.cn/CASME/cas(me)2-en.php for CAS(ME)2, 27 and personalpages.manchester.ac.uk/staff/adrian.davison/SAMM.html for SAMM. 28 Moreover, to evaluate performance on a larger dataset, this paper merged the four datasets to form a COMBINED dataset, created from the raw images of all four datasets, with the source code available under extended data.

Extended data
Zenodo: Implementation of COMBINED micro-expression dataset and Setup files for real-time micro-expression recognition using motion and geometric features. https://doi.org/10.5281/zenodo.5524141. 33 The project contains the following extended data:
• Real-time micro-expression recognition using Bi-WOOF features (executable setup for micro-expression recognition using motion-based features).
• Real-time micro-expression recognition using full-face graph features (executable setup for micro-expression recognition using geometric-based features).
• Image re-scaler for COMBINED micro-expression dataset formation (Visual Studio 2010 source code written in C++).
Data are available under the terms of the Creative Commons Zero (CC0 v1.0 Universal).
Zenodo: Performance analysis of micro-expression recognition over different sample image sizes. https://doi.org/10.5281/zenodo.5379773. 34 This project contains the following extended data:
• Performance improvement over 140 × 170 sample size.
• Performance improvement over 240 × 340 sample size.
• Performance improvement over 560 × 680 sample size.
• Performance improvement over 1120 × 1360 sample size.
Data are available under the terms of the Creative Commons Zero (CC0 v1.0 Universal).
