Automatic Classification and Shift Detection of Facial Expressions in Event-Aware Smart Environments

Affective application developers often face a challenge when integrating the output of facial expression recognition (FER) software into interactive systems: although many algorithms have been proposed for FER, integrating their results into applications remains difficult, and inter- and within-subject variations require further post-processing. Our work addresses this problem by introducing and comparing three post-processing classification algorithms for FER output, applied to an event-based interaction scheme to pinpoint the affective context within a time window. Our comparison is based on earlier published experiments with an interactive cycling simulation in which participants were provoked with game elements and their facial expression responses were analysed by all three algorithms with a human observer as reference. The three post-processing algorithms we investigate are mean fixed-window, matched filter, and Bayesian changepoint detection. In addition, we introduce a novel method for detecting fast transitions of facial expressions, which we call emotional shift. The proposed detection pattern is suitable for affective applications, especially in smart environments, wherever users' reactions can be tied to events.


INTRODUCTION
In recent years, affective computing has developed into a vibrant, multi-disciplinary field of research with exciting opportunities for new applications. Its foundations are in emotion models, sensing technologies, affective and social signal processing, affective data sets and reference applications. Despite much progress over two decades, many challenges to building working systems remain [28,26,6]. During our research into affective solutions for exergames, we encountered numerous difficulties building generalised emotion-enriched applications due to the complex nature of emotions and their rational and contextual processing, which occupies a significant portion of the human brain.
The common approach of mimicking human processes by collecting vast amounts of emotion data for situational and cultural contexts, experimental settings and subject groups, parametrised with plausibility rules and tuning parameters, has resulted in substantial challenges for AI research.
Over the course of our research comparing facial expression and emotion recognition systems, we identified the issue of application-specific mapping and its automatic interpretation. While facial-expression-derived emotions are a valuable source of information for event and reaction detection in affective-aware applications, they are also difficult and complex to interpret and correlate with user actions, profiles and other data. As a result, FER analysis is quite complex and often not reliable due to response variations (see Fig. 2). For our work we found it useful to generalise both user actions and internal plus external signals to the application as events. Utilizing FER with these events provided us with additional context. The concept of this study was to extend single-frame and average-signal approaches to analyse timing and response characteristics more closely. Our general aim is to build robust, functional interactive applications for a variety of users and individual response dynamics, where a one-to-one mapping from expression to reaction is not fixed and post-processing is needed to provide a working system. Our goal is to determine whether the user perceives an event, based on the reaction and emotional expressions. These events may be triggered by system-internal provocations, such as audio-visual-haptic stimuli, or caused by system-external triggers, such as an event from the real world. In the latter case, the system must detect the external event in order to determine the timeframe for the user's reaction, as depicted in Fig. 1.

Figure 1: Our pipeline for an event-aware smart affective system. The user's face is recorded by a camera and afterwards analysed with a FER algorithm to extract facial expressions. The algorithm's output is post-processed with filtering, classification and interpretation based on external or internal events. This post-processing step is the focus of this publication.
Thus far, we have relied on semi-automatic methods for detecting user reactions [24], as automatic solutions for direct emotion and expression mapping did not work for us.
In this paper we present a comparison of approaches for automatically recognising responses from the output of FER algorithms and provide a benchmark for this post-processing step. We demonstrate our post-processing methods with the state-of-the-art FER system Emotient.
Although this work is based on an emotion-provoking exergame, our findings can be applied to any affective scenario in which the application setting provides internal or external events to fix search windows, as occur in smart and assistive environments.

Previous Work
In order to understand this work in its context, we find it beneficial to be aware of our previous work, which describes the EmotionBike system [24], presents the experimental setup and showcases the provocation of human reactions with game events, analysed after the experiment with the facial expression system CERT [17]. The EmotionBike provides an exergame scenario enabling users to cycle through an interactive game on a stationary bike trainer with steering capabilities. It is a variant of a cockpit scenario and is also suitable for research on games for physical therapy and orthopaedic rehabilitation. Our follow-up study [23] enhanced the experimental setup with the event-based analysis of galvanic skin response combined with facial expressions. For practical guidance, we presented a benchmark [3] of four automatic facial expression analysis systems with three emotion-labelled reference databases and a systematic method for performance analysis and improvement that allows tailoring to specific application needs.
It should be noted that all related work, including our own: 1. observes only single-frame or short-sequence input and 2. uses exclusively semi-automatic or manual emotion classification in practical applications that produce high inter- and within-subject variations. One open research challenge is the fully automatic classification of user reactions, for which we present three alternative algorithms as feasible solutions. Our classification methods can be effectively combined with the application-specific clustering approach [3] to increase its robustness for a wide spectrum of user reactions.
An interesting observation from our previous studies is the way inter- and within-subject responses vary, appearing as positive/negative inverse reactions depending on predisposition.

Applications: Smart and Assistive Environments
Smart environments often provide a reasonable application context for recognising emotions and expressions. In this section we describe environments that could benefit from our methodology. As an example, the STHENOS project [18] already focussed on the development of a methodology and an affective computing system for the recognition of physiological states and biological activities in assistive environments. Kanjo et al. [12] provide a good introduction to, and review of, the different approaches and modalities for emotion recognition in pervasive environments.
Another scenario in the area of cyber-physical systems [15] that is event aware and presupposes a user's reaction is a car driver assistance system: after a potentially harmful external event occurs, the choice between waiting for an appropriate reaction from the driver or initiating an automatic response is crucial. A shift in the driver's facial expressions is one indication that the system can wait.
Cockpit-based scenarios like the NAVIEYES system provide a lightweight architecture for a driver assistance system [22] that could benefit from facial-expression-shift detection as an additional input source to improve detection of the driver's intentions. Another example is McCall et al.'s "Driver Behavior and Situation Aware Brake Assistance for Intelligent Vehicles", which adapted the system's reaction based on situational severity and driver attentiveness and intent by using a camera pointed at the driver's head [19]. Doshi et al. provided an overview of systems for driver behavior prediction and intent inference [8].
Korn et al. [14] published their work regarding gamification in work environments, which applies facial expression analysis with the FER algorithm SHORE from Fraunhofer IIS and a semi-automatic (Wizard of Oz) approach. This method is similar to the one in our previous setup.

Emotional Models and Expressions
Calvo et al. [7] list six main perspectives for understanding emotions: emotions as expressions, emotions as embodiments, cognitive approaches to emotions, emotions as social constructs, neuroscience approaches with core affect and psychological constructions of emotion.
In this study we focus on the theory of emotions as expressions, which is primarily based on the theory of six basic emotions [9], although the number of expressions detected varies between algorithms.
A common approach for detecting emotional expressions involves generating a feature set of facial landmarks or muscle activity [34]. One approach for discrete quantification makes use of action units (AUs), which are part of the facial action coding system (FACS) by Ekman and Friesen [9] and describe a set of activities based on facial muscles. Coding facial expressions of emotions based on the presence of AUs has often been used in FER algorithms [20,2,34].

Facial Expressions
Facial expressions consist of three different phases: onset, apex and offset (see Fig. 3). All three phases have different durations: while onset and offset are typically short, the apex is typically the longest phase. Spontaneous expressions often mix these phases, resulting in multiple apexes [13]. Facial expressions can be divided into normal and micro-expressions, the latter sometimes called leaking expressions [9,32]. Although discussion continues about duration as a criterion of differentiation [32], micro-expressions appear to last less than 0.5 s, while normal expressions typically last longer (often exceeding 1 s) [32].

FER Algorithms
An overview of the many approaches for detecting facial expressions through image and video processing can be found in the surveys of Zeng et al. [34] and Sariyanidi et al. [27]. State-of-the-art FER algorithms use a pipeline beginning with the crucial first step of finding the face, followed by reducing the data size with filtering [34]. Features are then extracted from the reduced data and machine learning or statistical classification generates the result. Our previous work [3] contains further insights into the nature of FER algorithms.
Algorithms may be trained to detect AUs [20] or facial landmarks [21] as an intermediate step or they may be directly trained for facial expression detection on raw input [29]. Their output is typically an independent probability value between 0.0 and 1.0 for every possible expression (see [3] for details).
Some research has been conducted on the (spatio-) temporal modelling of low-level AUs to exploit their chronological sequence [13] while others [16] have focused on the dynamics of higher level expressions, namely the six basic emotions and a neutral emotion.
While much work has been published on converting video data to FER output, we found very few studies [1,5] addressing the automatic classification of that output, although it is a necessary step for application integration. We have found no general solution for this post-processing step other than semi-automatic or manual processing.

CLASSIFICATION ALGORITHMS
We developed three different algorithms for classifying the FER output data, starting with a primarily mean-based algorithm using a fixed window size. The second algorithm is based on a common approach in signal processing, using a matched filter with a fixed scan size and correlation with the data points. The third approach uses Bayesian changepoint detection. As our goal was to automate the analysis of our data for the event-based setup, all three approaches were developed for an analysis window around the event. The classification of our data is binary and coded with two symbols (e.g. "01", see Table 1). The data was subdivided by window size (see Section 4.1) and classified using four different methods: three algorithm-based approaches and manual classification by a single expert (data analyst) acting as a human observer for comparison.

Categorisation of Data
The observer had the same options for classification as those used by the algorithms. The two-symbol binary result was used for detecting the emotional shift in the facial expression channels (see Fig. 4).

Table 1: Types of data classification: "00" denotes that no expression was found, "01" that a rising edge was found, "10" that a falling edge was found, "11" that a stable signal near 1.0 was found and "??" that the signal was inconclusive.

Peak Detection
All three algorithms use the same method for peak detection, originally developed by Eli Billauer [4]. We used his standard peak-detection version with a look-ahead value of 1 and a delta value of 0.25. After peak detection, we applied an additional threshold of 0.5 between the observed minima and maxima to verify each detection; this improved the overall detection rate more than increasing the peak-detection delta itself would have. In our previous work [24] we applied the peakutils algorithm by Bergman [25], but preliminary testing revealed that the Billauer algorithm provided slightly better results with this data set.
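For illustration, the following is a minimal sketch of a Billauer-style peak detector with the delta of 0.25 mentioned above; the look-ahead parameter and the subsequent 0.5 verification threshold are omitted, and the function name is ours.

```python
import numpy as np

def peakdet(signal, delta=0.25):
    """Billauer-style detector: a sample is confirmed as a maximum once the
    signal has dropped by at least `delta` below it, and as a minimum once
    the signal has risen by at least `delta` above it."""
    maxima, minima = [], []
    mn, mx = np.inf, -np.inf
    mn_pos = mx_pos = None
    look_for_max = True
    for i, v in enumerate(np.asarray(signal, dtype=float)):
        if v > mx:
            mx, mx_pos = v, i
        if v < mn:
            mn, mn_pos = v, i
        if look_for_max and v < mx - delta:        # maximum confirmed
            maxima.append((mx_pos, mx))
            mn, mn_pos = v, i
            look_for_max = False
        elif not look_for_max and v > mn + delta:  # minimum confirmed
            minima.append((mn_pos, mn))
            mx, mx_pos = v, i
            look_for_max = True
    return maxima, minima
```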

Edge-Detection-Based Algorithms
Our proposed edge-detection-based algorithms (CP, PMP) share a common multi-step design, presented in Figure 5. The main differences between the two approaches are the method for processing the data and the thresholds for detecting peaks and edges. These differences are marked in green in Fig. 5. Both methods are explained below.

Smoothing of Data
For the smoothing of the data, we used a modified single-pass moving average filter over each block of n = 4 data points: if the mean is above a threshold of t = 0.5, the block value is set to the maximum value in the block; if the mean is below or equal to the threshold, the block value is set to the minimum instead. This maximises the spread within the block data to improve change detection by the algorithms.
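A minimal sketch of this block-wise smoothing step, assuming non-overlapping blocks of n = 4 samples (the function name is ours):

```python
import numpy as np

def block_smooth(signal, n=4, t=0.5):
    """Replace each block of n samples by its maximum if the block mean
    exceeds t, otherwise by its minimum, widening the spread between low
    and high regions of the signal."""
    signal = np.asarray(signal, dtype=float)
    out = []
    for start in range(0, len(signal), n):
        block = signal[start:start + n]
        out.append(block.max() if block.mean() > t else block.min())
    return np.array(out)
```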

Processing Data with Changepoint Peak (CP)
Our changepoint-based design uses Bayesian changepoint detection for identifying the positions of rising and falling edges. We used the changepoint detection method described by Xuan et al. [31], which is based on the work on Bayesian inference for multiple changepoint problems by Fearnhead [10]. We used a constant prior of 1/len(data) and a truncate value of -20 as it produced the best results in preliminary testing on our data set.
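As an illustration only, the sketch below shows how such an offline Bayesian changepoint detection could be invoked with the parameters named above, assuming the bayesian_changepoint_detection Python package (whose module layout has changed across versions; this follows the older single-module layout). The synthetic input signal and the 0.5 decision threshold at the end are our assumptions.

```python
from functools import partial
import numpy as np
import bayesian_changepoint_detection.offline_changepoint_detection as offcd

# synthetic stand-in for one smoothed FER channel: low level, then a rising edge
data = np.concatenate([np.zeros(30), np.ones(30)]) + 0.05 * np.random.randn(60)

Q, P, Pcp = offcd.offline_changepoint_detection(
    data,
    partial(offcd.const_prior, l=len(data)),   # constant prior of 1/len(data)
    offcd.gaussian_obs_log_likelihood,
    truncate=-20)                              # truncation value from the text

cp_prob = np.exp(Pcp).sum(0)                   # changepoint probability per position
edges = np.where(cp_prob > 0.5)[0]             # assumed decision threshold
```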

Processing Data with Pattern Matching Peak (PMP)
Using a simple binary threshold filter is insufficient to process the data, as it still produces a signal requiring additional pattern filtering. We therefore used a reversed approach with pattern filtering followed by a binary threshold instead.
This algorithm utilises a matched filter [11] based on 1D cross-correlation. We initially used a filter length of l = 16, resulting in a complete length of cl = 2l = 32. This initial filter length was chosen because, at a frame rate of 30 fps, it is close to the common minimal length of normal facial expressions (1 s) [32]. We then compared it with smaller (l = 8) and larger lengths (l = 24), denoting the algorithm's variants: PMP8, PMP16 and PMP24.
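A sketch of one plausible matched-filter formulation with l = 16: the signal is cross-correlated with a step-shaped rising template and its mirrored falling counterpart. The exact template shape and the mean-centring are our assumptions.

```python
import numpy as np

def edge_correlations(signal, l=16):
    """Cross-correlate the signal with step templates of total length 2*l.
    Peaks in the returned curves mark candidate rising and falling edges."""
    signal = np.asarray(signal, dtype=float)
    rising = np.concatenate([np.zeros(l), np.ones(l)])   # low-to-high step template
    falling = rising[::-1]                               # high-to-low step template
    # centring signal and templates reduces bias from a constant offset (our choice)
    rise_curve = np.correlate(signal - signal.mean(), rising - rising.mean(), mode='same')
    fall_curve = np.correlate(signal - signal.mean(), falling - falling.mean(), mode='same')
    return rise_curve, fall_curve
```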

Edge Detection
The peak detection process uses a delta value (threshold) that is lower than the actual threshold as described in section 3.2. The edge detection for PMP relies on cross-correlation, which generates separate curves for rising and falling edges.
For CP detection, falling and rising edges are distinguished by rating the two data points before and after the actual edge.
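A minimal sketch of this direction rating, comparing the mean of the two samples before and after a detected changepoint (the function name and the tie handling are ours):

```python
import numpy as np

def edge_direction(signal, pos, k=2):
    """Label a detected changepoint as a rising ('01') or falling ('10')
    edge by comparing the k samples before and after it."""
    before = np.mean(signal[max(0, pos - k):pos])
    after = np.mean(signal[pos:pos + k])
    return '01' if after > before else '10'
```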

Edge Classification
For edge-based methods, edge classification is calculated using Table 2. This table also includes the case of smaller (double) falling or rising edges, if they meet the corresponding condition.

Fixed-Window Mean Bisection (FWMB)
Our binary-search-based algorithm halves the window around the event position. If no results are found in this iteration, three further subdivision steps are performed. In order to identify possible edges at each depth level, the means of both sides of the window are compared. If the difference between the mean of the left and the right section is greater than a threshold of 0.5, a direct classification is returned. On each subdivision window, a single peak detection is applied. Fig. 6 depicts example output for all three algorithms; all three classify this data as "01". Fig. 6 also illustrates the main difference between the algorithms: the dependency on window size for detecting edges. FWMB always uses the middle of the window, while PMP and CP are more flexible, as they rely on edge detection rather than a fixed window size.
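A simplified sketch of the mean-bisection idea, assuming one initial halving plus three further subdivision steps and a 0.5 threshold; the per-window peak detection and the exact handling of inconclusive windows are omitted, and the final fallback to "11"/"00" is our assumption.

```python
import numpy as np

def fwmb_classify(window, threshold=0.5, max_depth=4):
    """Compare the means of the left and right halves of the window; if they
    differ by more than the threshold, classify directly as a rising ('01')
    or falling ('10') edge, otherwise recurse up to max_depth levels."""
    def recurse(seg, depth):
        if depth > max_depth or len(seg) < 2:
            return None
        mid = len(seg) // 2
        left, right = np.mean(seg[:mid]), np.mean(seg[mid:])
        if right - left > threshold:
            return '01'
        if left - right > threshold:
            return '10'
        # no edge at this level: try both halves at the next depth
        return recurse(seg[:mid], depth + 1) or recurse(seg[mid:], depth + 1)

    result = recurse(np.asarray(window, dtype=float), 1)
    if result is not None:
        return result
    # assumed fallback: uniformly high or low windows
    return '11' if np.mean(window) > threshold else '00'
```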

EVALUATION OF ALGORITHMS
In order to evaluate the three algorithms, we subdivided the data into windows of 2, 4 and 8 s length for a specific facial expression at the event position: half the window size before and half after the event. Every slice of the data was then classified with each of the three developed algorithms, and the classification was compared to that of the human observer.
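A small sketch of this subdivision, assuming 30 fps data and an event given as a frame index (names are ours):

```python
def event_window(series, event_index, window_s, fps=30):
    """Cut a window of window_s seconds centred on the event frame:
    half before and half after, as used for the 2, 4 and 8 s evaluations."""
    half = int(window_s * fps / 2)
    return series[max(0, event_index - half):event_index + half]
```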

Experimental Data for Evaluation
Data was collected during experiments in different game levels of the EmotionBike project [24]. In this work, we focussed on two game events: the falling event (fa) from the challenge level (see Fig. 7) and the jump-scare event (js) in the night level (see Fig. 8), resulting in a data set of 92 events and 3,312 subdivided sequences (see Table 3). In general, our exergame involved three types of events: 1. Sudden events: Users are given no warning when this event will occur, resulting in a small window for detecting facial reactions. The jump-scare event is an example of this type.
2. Fuzzy events: Users can estimate the occurrence of this predictable event, making the actual window size larger. The falling event is in this category.

3. Continuous events: As the event is constantly present in time, no event time can be calculated. We therefore ignored this type in this study.

We used the Emotient FER algorithm, provided as part of the iMotions platform, for classifying the video data recorded at 30 fps, as the algorithm generated good results in our previous benchmarking [3]. We used all 12 provided expression channels: seven basic emotions (joy, anger, surprise, fear, disgust, sadness and contempt) plus confusion, frustration, neutral emotion, positive emotion and negative emotion.

Table 4 presents the results of the classification for every event, window size and algorithm. Each result is also compared with the corresponding observer classification and has an associated success rate. With the exception of inconclusive data ("??"), a mean is calculated over the other four classifications. An overall mean (all3) is then calculated from the mean values for every window size, indicating the best overall success rate for the algorithm and event.

Classification Results
The CP-based algorithm produced the best overall results, especially in cases where the total processing frame was 4 s or less. In longer windows, the classification of "11" often failed due to small negative spikes that the human observer tended to ignore.

Reliability
Krippendorff's alpha is a common method for testing inter-rater reliability [33]. Normally, Krippendorff's alpha is used to estimate the reliability of a complete group of observers, but it can also be used to compare subgroups [33]. For our purpose, we compared the output of each algorithm with the observer's result as depicted in Fig. 9.
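For reference, a minimal sketch of such a two-rater comparison using the krippendorff Python package with nominal data; the integer coding of the classification symbols and the example values are hypothetical.

```python
import numpy as np
import krippendorff

# one row per rater (algorithm vs. observer), one column per analysed window;
# classification symbols mapped to integers, np.nan for inconclusive ("??") cases
ratings = np.array([
    [0, 1, 1, 3, 2, 0],        # hypothetical algorithm classifications
    [0, 1, np.nan, 3, 2, 0],   # hypothetical observer classifications
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement='nominal')
print(alpha)
```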
In all cases, the CP-based algorithms had the highest values. We used Krippendorff's alpha as an additional criterion for assessing the agreement with the human observer. Results for participant videos were calculated separately for each event (jump-scare = js and falling = fa) and window (2 s, 4 s, 8 s), resulting in an overall mean and standard deviation (SD). The CP-based algorithm produced the best results for appropriate window sizes (2 s, 4 s); too large window sizes (8 s) degraded the performance. Table 5 displays examples of confusion matrices for good (blue), best (green) and worst (red) case classification results of the CP-based algorithm.

Shift Detection Results
The overall outcome of the shift detection is contained in Table 6. In nearly 92% of cases, the CP-based algorithm matched the observer classification, although the CP-based results were lower than the PMP results in one scenario. We also calculated Krippendorff's alpha values for the comparison between the observer and the different algorithms (see Fig. 10), which further supported the conclusion that the CP-based algorithm generated the best overall results.

Figure 10: Results of Krippendorff's alpha for all shift detections by algorithms compared to a human observer. No SD was calculated due to the limited number of data points (11 for jump-scare events).

Performance and Limitations
All three algorithms were capable of processing event windows in soft real time (processing time < 1 s), although the CP-based method is normally used for offline detection and had the longest processing time of the three. For maximum performance, the complete window of data needs to be present.
All three approaches are based on the assumption that changes in facial expressions occur rapidly, and they were therefore unable to detect gentle transitions. This was not a problem in our case, since reactions to our externally provoked events generally occur rapidly.
The integrated CP algorithm has a complexity of O(n²), which must be considered when increasing the window size or frame rate of the data.

CONCLUSION
In this paper, we provide automatic solutions for the classification of facial expression recognition outputs for practical applications. We developed post-processing methods that observe both single-channel and multi-channel shifts as candidate indicators, which can be utilized for more robust event-response detection. This work addresses an open research challenge: the fully automatic and unsupervised classification of facial expression reactions tailored for specific applications. While our automatic classification is an important step, we still find it challenging to handle inter- and within-subject variations of responses in a generic way.
Of our three approaches, the changepoint-based classification performed best: it was closest to our human observer results and to human perception of the curves, as demonstrated by the best overall classification performance and the highest values for Krippendorff's alpha. It is, however, important to avoid overly wide analysis windows by pre-testing, as these degrade performance.
The window-size effect is illustrated in Table 4. The effect of large windows may be intuitively explained by our study, where windows with an overall length of 8 s were challenging to score even for the human observer. In this case, the number of inconclusive categorizations increased significantly (especially for the sudden event; see Table 5), suggesting that this window is too wide. Table 7 summarizes our overall findings for the three different algorithms in terms of complexity, real-time capability and accuracy. Our results further suggest that automatic processing of shift events shows considerable promise as an additional tool to cope with subject variations. In particular, the changepoint-based algorithm produced the best results for the detection of emotional shift, with 92% agreement with the human observer-based classification.
Our application of CP and PMP classification provides a starting point for further investigations into short micro-expressions and event-based segmentation techniques without fixed window sizes.