Human-Computer Interaction

Gesture Recognition and Feature Selection


This project has already been completed.

Motivation

In the field of human-computer interaction (HCI), one of the core aspects is the design of the interface between the user and the system. Often, the whole system is judged by the design of its front-end interface or its interaction paradigm. Using mouse and keyboard in combination with WIMP (windows, icons, menus, pointer) interfaces is suitable for certain tasks; in an open-plan office, for instance, it is more convenient to silently type text than to use speech recognition. In contrast, there are situations in which these conventional approaches can only be applied in a limited way or not at all. In virtual, mixed, and augmented reality (VR, MR, AR), for example, the user is fully or partially immersed in a virtual environment, wearing a head-mounted display or 3D glasses while often standing up or even moving around. It is apparent that traditional input devices such as mouse and keyboard are not suitable for interacting with these systems. In such situations a more natural way of communication that requires fewer additional tools is desirable. Multimodal interaction (MMI) is a promising approach in this regard (Latoschik 2005).

MMI encompasses forms of interaction such as speech, haptic feedback, gaze, or gestures. Especially “‘open-hand’ gestures, that is gestures of the upper limbs without tool usage” (Latoschik 1998), are a means of intercultural communication. In this context, “gesture recognition has been attracting more and more attention from academia and industry” (Sohn 2014). Camera-based approaches in particular provide a convenient way of communication, as the users’ hands remain free and they are not restricted to the input capabilities of an additional device.

Tracking systems usually extract the positions of body joints and pass them on to a final decision unit, which classifies the data as different gestures. These units can be manually specified templates: for a swiping movement, thresholds for values such as the speed and direction of the hand must be exceeded in order to recognize the gesture. Problems arise when users execute the gesture differently than intended or have a divergent body shape. Machine learning offers a remedy for dealing with a more heterogeneous group of users. Given an adequate database to learn from, modern machine learning methods are able to recognize variations from the original data set and still derive the correct semantic output. For a similar purpose, machine learning is commonly used in the related field of computer vision, e.g. Google’s Cloud Vision API. Other use cases of machine learning are data security, marketing personalization, natural language processing (NLP), database mining, and image recognition, such as face, text, and object recognition (Ng). Within the field of machine learning, artificial neural networks are a popular technique for automatic gesture recognition (Trigueiros 2012, Maung 2009, Braffort 1996).
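
To make the template idea concrete, the following minimal sketch (Python with NumPy) detects a right swipe by thresholding the hand’s speed and direction over a short window of tracked positions. The function name and the concrete threshold values are illustrative assumptions, not parameters used in this project.

import numpy as np

# Illustrative thresholds; the concrete values are assumptions, not taken from this project.
MIN_SPEED = 1.2   # minimum average hand speed in m/s
MIN_DX = 0.30     # minimum rightward travel in m
MAX_DY = 0.10     # maximum tolerated vertical drift in m

def is_right_swipe(hand_positions, timestamps):
    """Template-style detector: decide whether a short window of tracked
    hand positions (N x 3 array, metres) constitutes a right swipe."""
    positions = np.asarray(hand_positions, dtype=float)
    duration = timestamps[-1] - timestamps[0]
    if duration <= 0:
        return False
    displacement = positions[-1] - positions[0]
    speed = np.linalg.norm(displacement) / duration
    # All thresholds must be exceeded for the gesture to be recognized.
    return bool(speed > MIN_SPEED
                and displacement[0] > MIN_DX
                and abs(displacement[1]) < MAX_DY)

A detector like this works well for users who perform the gesture exactly as the template author intended, but fails for the divergent executions described above, which is where learned classifiers come in.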

For illustration, consider the following example: data such as the current positions of various body joints are detected by tracking systems. More complex features, such as speed and acceleration, can be derived from the position of a single body joint or by combining several different inputs, for instance as angles. By relaying the raw or processed data to the artificial neural network, the latter learns predefined gestures from training samples. Problems that often emerge throughout this process are the selection of meaningful input features and, depending on them, the configuration of the network. A smaller set of carefully selected input features improves the performance of most machine learning classifiers (Sohn 2012, Trigueiros 2012). Relevant features have a higher or lower likelihood of appearing when the gesture or pattern is present, while irrelevant features appear randomly in both cases (Helmbold 2012). Features can either be selected from all possible combinations of the raw data or found through careful reasoning about the problem, preferably by an expert in the field.
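
As a minimal sketch of the feature derivation described above, assuming joint positions sampled at a fixed frame rate and hypothetical joint names, speed, acceleration, and an elbow angle could be computed as follows:

import numpy as np

def derive_features(joints, dt):
    """Derive higher-level features from raw joint positions.

    joints: dict mapping a joint name to a (T, 3) array of positions over T frames
    dt:     time between two frames in seconds
    """
    hand = joints["hand_right"]
    velocity = np.gradient(hand, dt, axis=0)        # per-frame velocity via finite differences
    speed = np.linalg.norm(velocity, axis=1)        # scalar speed per frame
    acceleration = np.gradient(speed, dt)           # change of speed over time

    # Elbow angle from three joints: shoulder - elbow - wrist
    u = joints["shoulder_right"] - joints["elbow_right"]
    v = joints["wrist_right"] - joints["elbow_right"]
    cos_angle = np.sum(u * v, axis=1) / (
        np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    elbow_angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))

    return np.column_stack([speed, acceleration, elbow_angle])

Each row of the returned matrix is one candidate feature vector per frame; which of these (and of the many other conceivable combinations) are actually worth feeding into the network is exactly the feature selection problem.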

In the past, automatic feature selection was mainly used in areas of application with an enormous number of available input features, such as text processing, gene expression array analysis, or combinatorial chemistry (Guyon 2003). Provided that no overfitting due to high dimensionality occurs, the application of feature selection methods improves predictor performance (Guyon 2003). More recently, in 2014, Microsoft presented the Visual Gesture Builder (VGB). The VGB allows the construction and use of custom gestures by autonomously extracting crucial variables, combined with a machine learning predictor that learns gestures from previously annotated videos. As the performance of a machine learning classifier depends on the given features, it is worth investigating whether feature selection algorithms are suitable for machine learning in the context of HCI.
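
One family of such methods are filter approaches, which rank each feature by a relevance score before any classifier is trained (Guyon 2003). The sketch below uses scikit-learn’s mutual-information ranking purely for illustration; the thesis does not commit to this particular algorithm, and the function name is hypothetical.

import numpy as np
from sklearn.feature_selection import mutual_info_classif

def rank_features(X, y, top_k=5):
    """Filter-style feature selection: score each input feature by its
    mutual information with the gesture label and keep the top_k features.

    X: (n_samples, n_features) feature matrix
    y: (n_samples,) gesture labels
    """
    scores = mutual_info_classif(X, y)
    ranking = np.argsort(scores)[::-1]    # feature indices, best first
    return ranking[:top_k], scores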

Research Question

The question to be answered within the scope of this thesis is to what extent the automatic selection of input features for artificial neural networks can improve their performance in gesture recognition. In this context, multiple aspects shall be considered:

A tool that allows the different approaches to be compared needs to be developed in order to examine the extent of the benefit that gesture recognition gains from feature selection algorithms. The next step is to investigate whether the data used to train the VGB can be employed to compare the VGB’s results to these findings.

Furthermore, it would be interesting to know whether the algorithms regarded in this thesis can also be applied to other techniques in the area of multimodal interaction.

Approach

To start, literature regarding feature selection for machine learning and the use of machine learning for gesture recognition has to be reviewed. The feature selection algorithm best suited for the task has to be identified and implemented; if multiple algorithms are suitable, the most frequently used one will be selected. For comparison, an adequate use case has to be found and a set of gestures to be learned has to be chosen. Sufficient training data has to be acquired, as well as experts who are able to pick adequate features for the chosen gestures. After carefully preparing and implementing the experiment, the performance of the feature selection method has to be compared to the all-input and expert settings, as well as to the VGB. Eventually, a simple example application has to be developed, embedding the most effective algorithm into Simulator X (Latoschik 2011).
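
As a rough illustration of this comparison, assuming a scikit-learn multilayer perceptron as the neural network and index sets produced by an automatic selector and by an expert (both hypothetical here), the three settings could be evaluated side by side; the actual experimental setup of the thesis may differ.

from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def compare_feature_sets(X, y, selected_idx, expert_idx):
    """Compare cross-validated accuracy of the same network trained on
    all features, automatically selected features, and an expert-chosen subset."""
    results = {}
    for name, idx in (("all", slice(None)),
                      ("selected", selected_idx),
                      ("expert", expert_idx)):
        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000)
        scores = cross_val_score(clf, X[:, idx], y, cv=5)
        results[name] = scores.mean()
    return results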

[Figure: Gantt chart showing the project’s schedule]

Bibliography

Braffort, A. (1996). A gesture recognition architecture for sign language. In Proceedings of the second annual ACM conference on Assistive technologies (pp. 102–109). ACM.

Guyon, I. & Elisseeff, A. (2003). An introduction to variable and feature selection. Journal of machine learning research, 3(Mar), 1157–1182.

Latoschik, M. E. (2005). A user interface framework for multimodal VR interactions. In Proceedings of the 7th international conference on Multimodal interfaces (pp. 76–83). ACM.

Latoschik, M. E. & Tramberend, H. (2011). Simulator X: A scalable and concurrent architecture for intelligent realtime interactive systems. In Virtual Reality Conference (VR), 2011 IEEE (pp. 171–174). IEEE.

Maung, T. H. H. (2009). Real-time hand tracking and gesture recognition system using neural networks. World Academy of Science, Engineering and Technology, 50, 466–470.

Sohn, M.-K., Lee, S.-H., Kim, D.-J., Kim, B. & Kim, H. (2012). A comparison of 3D hand gesture recognition using dynamic time warping. In Proceedings of the 27th Conference on Image and Vision Computing New Zealand (pp. 418–422). ACM.

Trigueiros, P., Ribeiro, F. & Reis, L. P. (2012). A comparison of machine learning algorithms applied to hand gesture recognition. In 7th Iberian Conference on Information Systems and Technologies (CISTI 2012) (pp. 1–6). IEEE.

Ng, A. (n.d.). Coursera - Machine Learning - Welcome [Video File]. Retrieved from https://www.coursera.org/learn/machine-learning/home/week/1


Contact Persons at the University of Würzburg

Chris Zimmerer (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
chris.zimmerer@uni-wuerzburg.de

Prof. Dr. Marc Erich Latoschik
Human-Computer Interaction, Universität Würzburg
marc.latoschik@uni-wuerzburg.de
