Analysis and Optimization of an Unsupervised Learning Approach for Emotion Recognition
This project has already been completed.

Motivation and Goals
Emotions play an important role in our everyday lives. Various studies show a relationship between emotions and cognition [1]. It therefore seems obvious that driving, as a complex process, is also affected by emotions. In fact, many studies provide evidence for this assumption: a high stress level was shown to lead to narrowed attention and reduced concentration [2], anger caused more driving errors and degraded situation awareness [3, 4], and matching the emotion of the car's voice to the driver's emotion resulted in better driving performance [5].
This evidence leads to the conclusion that a system able to identify the driver's emotions and take appropriate action would improve not only the user experience but also the safety of the passengers.
Since such a service fundamentally depends on the system's ability to recognize emotions, this master's thesis will focus on the emotion recognition part.
There are many approaches to emotion recognition, most of them based on supervised machine learning. Supervised approaches are problematic, however: they depend on a large data infrastructure that brings unlabeled data to human experts for annotation, on a common agreement about the number and criteria of the emotions themselves, and on the assumption that the symptoms of different emotions are similar enough across individuals that a model trained on data from user A transfers directly to user B without additional labeled data from user B. To address these concerns, the goal of this master's thesis is to approach the problem without a prior definition of emotions, thus framing emotion recognition as an unsupervised task. This requires the analysis and optimization of such an unsupervised emotion recognition approach in comparison to a state-of-the-art supervised approach, with the aim of arriving at a more objective emotion classification.
Background and Related Work
Emotion Classification
Since emotions are not directly measurable and a proper categorization of emotions remains controversial, there is extensive literature on adequate categorizations, such as Ekman's basic emotions [6], Russell's circumplex model [7], and newer ideas like the more complex "Hourglass of Emotions" model [8]. This disagreement, together with the difficulty of validating these theories, complicates both the implementation of concrete applications and the question of how specific emotions can be induced.
Modalities for Emotion Recognition
Probably the most common modality used for emotion recognition is facial expressions, for which researchers often rely on Ekman's Facial Action Coding System (FACS) [9] for encoding or as features. Some studies even describe the visual modality as superior to audio and other modalities [10]. Other modalities, especially in the driving context, include voice tone and text analysis as well as physiological data [11]. It is widely accepted that multimodal systems significantly outperform unimodal systems [11, 12].
Current Emotion Recognition Methods
Most current approaches are based on supervised learning methods and thus require labeled data. Examples of supervised algorithms used for this task include Deep Convolutional Neural Networks (DCNNs) [13], DCNNs combined with Recurrent Neural Networks (RNNs) [14], and methods such as Support Vector Machines (SVMs) [15]. However, there is also prior work showing that unsupervised learning methods can learn about emotions latently, e.g. with a Deep Convolutional Generative Adversarial Network (DCGAN) [16].
Planned Methods
We propose to evaluate an unsupervised approach to emotion recognition, which allows us to avoid committing to one specific emotion classification theory.
To that end, we will evaluate the performance of several unsupervised algorithms on the chosen datasets, establishing less complex methods as baselines, starting with Kernel Principal Component Analysis (KPCA) [17] as a matrix decomposition algorithm.
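To make the baseline concrete, the following is a minimal sketch of how such a KPCA baseline could look using scikit-learn. The random data stands in for flattened face images; the image size, latent dimensionality, and cluster count are illustrative assumptions, not decisions of this project.

```python
# Minimal KPCA baseline sketch (assumption: face images are available as
# flattened grayscale vectors; random data stands in for them here).
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((500, 48 * 48))  # stand-in for 500 flattened 48x48 face crops

# Project the images onto a low-dimensional, nonlinear latent space.
kpca = KernelPCA(n_components=10, kernel="rbf", gamma=1e-3)
Z = kpca.fit_transform(X)

# Cluster the latent codes; the clusters would later be inspected for
# correspondence with emotional expressions (no labels used for training).
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(Z)
print(Z.shape, np.bincount(clusters))
```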
If the baseline results point toward more expressive latent variable models, we will continue with models based on the Gaussian Process Latent Variable Model (GPLVM) or an improved version of a DCGAN such as the Wasserstein GAN (WGAN) [18]. GPLVMs have already been used as models for emotion recognition [19], and their strength in uncertainty quantification could be an advantage. With the first DCGAN it was shown that simple linear algebra in the model's latent space suffices to manipulate facial expressions, which implies that some kind of latent representation of emotions was learned [16].
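The latent-space manipulation described in [16] amounts to plain vector arithmetic, which the sketch below illustrates. The generator `G` and the mean latent codes are hypothetical placeholders, not a trained model.

```python
# Sketch of the latent-space arithmetic from Radford et al. [16]
# (hypothetical generator G and latent codes; this illustrates the
# operation itself, not our model).
import numpy as np

def G(z):
    """Placeholder for a trained DCGAN/WGAN generator mapping z -> image."""
    return z  # a real generator would return a synthesized face image

rng = np.random.default_rng(0)
# Mean latent codes of samples that the generator renders as smiling /
# neutral faces (found by inspecting generated samples, as in [16]).
z_smiling_mean = rng.standard_normal(100)
z_neutral_mean = rng.standard_normal(100)

smile_direction = z_smiling_mean - z_neutral_mean

z = rng.standard_normal(100)            # latent code of an arbitrary face
image_smiling = G(z + smile_direction)  # same face, pushed toward "smiling"
```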
Since facial expressions are the modality with the largest pre-existing datasets and the one that is easiest to record, we will focus on this modality.
To evaluate the proposed method, its results will be compared with those of a widely accepted supervised learning method such as a DCNN, in order to identify strengths and weaknesses of the unsupervised algorithm.
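As an illustration of such a supervised reference point, the following is a minimal PyTorch sketch of a small DCNN classifier. The architecture, the 48x48 grayscale input, and the six emotion classes are illustrative assumptions, not the final evaluation setup.

```python
# Minimal supervised DCNN baseline sketch (illustrative architecture and
# input size; random tensors stand in for labeled face crops).
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 12 * 12, n_classes)

    def forward(self, x):
        h = self.features(x)               # (N, 64, 12, 12) for 48x48 input
        return self.classifier(h.flatten(1))

model = EmotionCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One dummy training step on random data.
x = torch.randn(8, 1, 48, 48)
y = torch.randint(0, 6, (8,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```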
Work Plan
| Activity | April | May | June | July | August | September | October | November | December | January |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Concept | x | x | | | | | | | | |
| Implementation | x | x | x | x | x | x | x | | | |
| Project report | x | x | x | x | | | | | | |
| Evaluation | x | x | x | x | x | x | | | | |
| Network improvement | x | x | x | x | x | x | | | | |
| Master Thesis | x | x | x | x | x | x | | | | |
References
1. Lisetti, C. L., Nasoz, F. (2004). Using Noninvasive Wearable Computers to Recognize Human Emotions from Physiological Signals. EURASIP Journal on Applied Signal Processing, 2004, 1672-1687. doi:10.1155/S1110865704406192
2. Gao, H., Yuce, A., Thiran, J.-P. (2014). Detecting emotional stress from facial expressions for driving safety. In 2014 IEEE International Conference on Image Processing (ICIP). doi:10.1109/ICIP.2014.7026203
3. Jeon, M., Walker, B., Gable, T. (2014). Anger Effects on Driver Situation Awareness and Driving Performance. Presence, 23(1), 71-89. doi:10.1162/PRES_a_00169
4. Deffenbacher, J. L., Deffenbacher, D. M., Lynch, R. S., Richards, T. L. (2003). Anger, aggression, and risky behavior: a comparison of high and low anger drivers. Behaviour Research and Therapy, 41, 701-718. doi:10.1016/S0005-7967(02)00046-3
5. Nass, C., Jonsson, I.-M., Harris, H., Reaves, B., Endo, J., Brave, S., Takayama, L. (2005). Improving automotive safety by pairing driver emotion and car voice emotion. In Proc. CHI. doi:10.1145/1056808.1057070
6. Ekman, P. (1984). Expression and the Nature of Emotion. In Scherer, K., Ekman, P. (Eds.), Approaches to Emotion (pp. 319-343). Hillsdale, NJ: Lawrence Erlbaum.
7. Russell, J. A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39, 1161-1178.
8. Cambria, E., Livingstone, A., Hussain, A. (2012). The Hourglass of Emotions. In Esposito, A., Esposito, A. M., Vinciarelli, A., Hoffmann, R., Müller, V. C. (Eds.), Cognitive Behavioural Systems. Lecture Notes in Computer Science, vol. 7403. Springer, Berlin, Heidelberg.
9. Ekman, P., Friesen, W. V. (2002). Investigator's Guide to the Facial Action Coding System (FACS).
10. Poria, S., Cambria, E., Bajpai, R., Hussain, A. (2017). A Review of Affective Computing: From Unimodal Analysis to Multimodal Fusion. Information Fusion, 37. doi:10.1016/j.inffus.2017.02.003
11. Jeon, M. (2017). Emotions in Driving. In Emotions and Affect in Human Factors and Human-Computer Interaction, pp. 437-474. doi:10.1016/B978-0-12-801851-4.00017-3
12. Ringeval, F., Schuller, B., Valstar, M., Jaiswal, S., Marchi, E., Lalanne, D., Cowie, R., Pantic, M. (2015). AV+EC 2015: The First Affect Recognition Challenge Bridging Across Audio, Video, and Physiological Data.
13. Zheng, W., Yu, J., Zou, Y. (2015). An experimental study of speech emotion recognition based on deep convolutional neural networks. In Proc. ACII 2015, pp. 827-831. doi:10.1109/ACII.2015.7344669
14. Khorrami, P., Paine, T., Brady, K., Dagli, C., Huang, T. S. (2016). How Deep Neural Networks Can Improve Emotion Recognition on Video Data. arXiv:1602.07377v5
15. Sravan Kumar, S., RangaBabu, T. (2015). Emotion and Gender Recognition of Speech Signals Using SVM. International Journal of Engineering Science and Innovative Technology, 4(3), 128-137.
16. Radford, A., Metz, L., Chintala, S. (2016). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434v2
17. Schölkopf, B., Smola, A., Müller, K.-R. (1998). Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation, 10(5), 1299-1319. doi:10.1162/089976698300017467
18. Arjovsky, M., Chintala, S., Bottou, L. (2017). Wasserstein GAN. arXiv:1701.07875v3
19. García, H. F., Álvarez, M. A., Orozco, Á. (2014). Gaussian Process Dynamical Models for Emotion Recognition. In Bebis, G., et al. (Eds.), Advances in Visual Computing (ISVC 2014). Lecture Notes in Computer Science, vol. 8888. Springer, Cham.
Contact Persons at the University of Würzburg
Prof. Dr. Marc Erich Latoschik (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg
marc.latoschik@uni-wuerzburg.de