Developing Custom Gesture Recognition for HoloLens 2
This project is already assigned.
Motivation and Goals
Imagine standing in an operating room: as a medical student, you are preparing a patient for the next steps of surgery. Next to you, several acute care devices for tracking the patient's vitals are waiting to be used. Taking a closer look, you see multiple screens, buttons and other displays that each serve different functions. Is there no easier way to interact? All of these interactions rely on single, device-bound inputs, which makes them inefficient. In a medical context, further problems arise: during surgery, hygiene and safety have the highest priority, and turning your back on the patient to make a manual input puts both at risk. Furthermore, anesthetists at the university clinic Würzburg and outside observers have described these devices as counter-intuitive, which can lead to the expensive and crucial machines not being used at all. So, what can we do to support this human-machine interaction?
A current project at the University of Würzburg addresses exactly these issues in an acute care simulation. The simulation environment is a real room equipped for medical training. This HCI project contributes to the goals of that project by enabling pervasive interaction during the simulation, namely through gestural interaction. But why gestures in particular? After Schlosser et al. (2018) demonstrated the benefits of head-mounted displays (HMDs) in operating rooms, medics in this project now wear an augmented reality headset during the simulation. HMDs offer multiple input modalities, but in a medical context gestures are the modality with the fewest known problems: speech interaction suffers from recognition errors, and button presses endanger hygiene standards. I therefore decided to use gestural interaction with the Microsoft HoloLens 2 in an acute care simulation room. However, HoloLens 2 only provides a very small set of gestures that cannot be customized with the available tools, so individual mechanisms must be implemented to enable custom gesture recognition. In this project, two suitable approaches are selected, implemented and tested.
Related Work
Hand Gesture Recognition in a Medical Context
First of all, a study on gesture recognition in an anesthesia context has to be mentioned (Jurewicz et al. 2018). Its main objective was to compare gesture-function mappings of experts and novices using a 3D, vision-based gestural input system in the context of the same anesthesia tasks in an operating room. Participants proposed gestures for ten anesthetic functions in order to determine intuitive gesture-function mappings. The results yielded two gesture-function sets, one per group, with both similarities and differences.
Hand Gesture Recognition Approaches
Hand gesture recognition can be done using multiple techniques and types of data; this paragraph focuses on skeleton-based approaches and the studies behind them. Here, the study "Skeleton-based dynamic hand gesture recognition" is of particular importance (Smedt et al. 2016). As the title suggests, Quentin De Smedt and his colleagues present a new skeleton-based approach to 3D hand gesture recognition using a linear Support Vector Machine. Their experimental results showed a consistently superior performance of the skeleton-based approach over a depth-based one. Secondly, Ionescu, Coquin, Lambert and Buzuloiu also discussed dynamic hand gesture recognition using a hand skeleton (Ionescu et al. 2005). Their approach is based on a 2D skeleton representation of the hand and produces, for each posture, a single image containing a dynamic signature of the gesture. Classification was implemented as conventional template matching, with recognition based on Baddeley's distance as a measure of dissimilarity between model parameters. Another paper used a similar approach by computing the hand skeleton with distance transformation techniques (Reddy et al. 2011). The skeleton is computed for every hand posture in the entire hand motion and superimposed onto a single image; recognition is then done using the Image Euclidean Distance measure.
Gesture Customization for Microsoft HoloLens
Almost no research has been done on gesture customization for HoloLens. However, one article addresses this topic using deep learning (Reply 2018). The article "Extending HoloLens Gestures with Deep Learning AI" by Valorem Reply describes a gesture system that runs images on Azure through a custom Convolutional Neural Network (CNN), trained by sending images from the application along with their correctly identified gestures. Even though it differs from the approach taken in this project, it demonstrates that gesture customization for HoloLens is possible in general.
Concept
Skeletal Hand Gesture Set
First, a hand gesture set has to be defined. It will be created by considering pre-existing gesture sets, such as the one resulting from Jurewicz et al. (2018) and the "EgoGesture" data set by Zhang et al. (2018). Because of the performance loss associated with depth data, the data set will be based on skeletal joints (see cover picture).
Unity-HoloLens Interface
To record joints and implement live pose prediction, an interface between HoloLens and Unity has to be created. This can be done using the Mixed Reality Toolkit (MRTK), which allows a hand tracking profile to be created that generates joint prefabs. This way, the joints relevant for gesture detection can be selected and used in further processing. The finished Unity application can be deployed to the HoloLens and executed. The selected joints are then recorded via speech input and afterwards saved to files together with their assigned gesture labels.
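The following minimal sketch illustrates how such a recording component could look, assuming MRTK v2's HandJointUtils API; the joint selection, file format and class names are illustrative and not the project's final implementation.

```csharp
using System.Collections.Generic;
using System.IO;
using Microsoft.MixedReality.Toolkit.Input;
using Microsoft.MixedReality.Toolkit.Utilities;
using UnityEngine;

// Hypothetical recorder component: samples selected hand joints each frame
// while recording is active (e.g. toggled via a speech command) and writes
// them to a CSV file labeled with the gesture name.
public class JointRecorder : MonoBehaviour
{
    // Illustrative selection of joints assumed to be relevant for gesture detection.
    private static readonly TrackedHandJoint[] Joints =
    {
        TrackedHandJoint.Wrist,
        TrackedHandJoint.ThumbTip,
        TrackedHandJoint.IndexTip,
        TrackedHandJoint.MiddleTip,
        TrackedHandJoint.RingTip,
        TrackedHandJoint.PinkyTip
    };

    private readonly List<string> samples = new List<string>();

    // Toggled externally, e.g. by an MRTK speech input handler.
    public bool IsRecording { get; set; }

    private void Update()
    {
        if (!IsRecording) return;

        var row = new List<string>();
        foreach (var joint in Joints)
        {
            // Query the current pose of each selected joint of the right hand.
            if (HandJointUtils.TryGetJointPose(joint, Handedness.Right, out MixedRealityPose pose))
            {
                row.Add($"{pose.Position.x};{pose.Position.y};{pose.Position.z}");
            }
        }

        // Only keep frames in which all selected joints were tracked.
        if (row.Count == Joints.Length)
        {
            samples.Add(string.Join(";", row));
        }
    }

    // Called after recording to persist the samples with their gesture label.
    public void Save(string gestureLabel)
    {
        var path = Path.Combine(Application.persistentDataPath, gestureLabel + ".csv");
        File.WriteAllLines(path, samples);
        samples.Clear();
    }
}
```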
Gesture Recognition
Literature research in the early stage of the project revealed two suitable gesture recognition approaches: besides various deep learning mechanisms, one machine learning approach and one template-based approach were selected. Both will be implemented in C# to guarantee compatibility with Unity and therefore with HoloLens 2. Afterwards, the two approaches will be tested and evaluated using objective criteria to determine their adequacy.
Machine Learning Approach
The selected machine learning approach is a Support Vector Machine (SVM). An SVM is a supervised learning model for classification and regression that can handle both linear and, via kernel functions, non-linear problems. Compared to other machine learning methods, SVMs are very powerful at recognizing patterns in complex data sets, and they have already been used for skeletal hand gesture detection. Other application areas include, among others, detecting fraudulent credit card use, identifying speakers and detecting faces. For the implementation in C#, the existing library "libSVMsharp" will be used to guarantee simple usage in connection with HoloLens and the Unity engine.
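As a rough sketch of how training could look, the snippet below assumes libSVMsharp exposes the usual LIBSVM-style types (SVMProblem, SVMNode, SVMParameter, SVM.Train/SVM.Predict); exact names may differ in the version used, and the feature extraction and parameter values are purely illustrative.

```csharp
using LibSVMsharp;

// Hypothetical trainer: featureVectors are flattened, wrist-relative joint
// positions (one vector per recorded sample), labels are numeric gesture classes.
public static class GestureSvmTrainer
{
    public static SVMModel Train(double[][] featureVectors, double[] labels)
    {
        var problem = new SVMProblem();
        for (int i = 0; i < featureVectors.Length; i++)
        {
            problem.Add(ToNodes(featureVectors[i]), labels[i]);
        }

        var parameter = new SVMParameter
        {
            Type = SVMType.C_SVC,        // multi-class classification
            Kernel = SVMKernelType.RBF,  // non-linear kernel for pose data
            C = 1.0,                     // illustrative hyperparameters
            Gamma = 0.1
        };

        return SVM.Train(problem, parameter);
    }

    public static double Predict(SVMModel model, double[] featureVector)
    {
        return SVM.Predict(model, ToNodes(featureVector));
    }

    // LIBSVM expects sparse nodes with 1-based feature indices.
    private static SVMNode[] ToNodes(double[] features)
    {
        var nodes = new SVMNode[features.Length];
        for (int i = 0; i < features.Length; i++)
        {
            nodes[i] = new SVMNode(i + 1, features[i]);
        }
        return nodes;
    }
}
```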
Template-Based Approach
The second approach is a geometric template matcher called "Jackknife", a general-purpose gesture recognizer designed to work with a variety of input devices, including the Kinect and the Leap Motion. It is explicitly designed for gesture customization and is implemented in multiple programming languages, C# among them. Jackknife also has the advantage of needing only one or two samples per gesture instead of an entire data set. Since a C# implementation exists, compatibility with the Unity engine is already given; the recognizer will then be adapted to the HoloLens context.
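To illustrate the underlying template-matching idea, the sketch below shows a simplified nearest-neighbor matcher over joint trajectories using dynamic time warping (DTW). This is not Jackknife's actual API, only a self-contained illustration of the principle it builds on; all class and method names are hypothetical.

```csharp
using System.Collections.Generic;
using UnityEngine;

// Simplified template matcher: each stored template is a labeled sequence of
// 3D joint positions, and an incoming sequence is assigned the label of the
// template with the smallest DTW cost.
public class TemplateMatcher
{
    private readonly List<(string Label, Vector3[] Trajectory)> templates =
        new List<(string, Vector3[])>();

    public void AddTemplate(string label, Vector3[] trajectory)
        => templates.Add((label, trajectory));

    public string Recognize(Vector3[] candidate)
    {
        string best = null;
        float bestCost = float.MaxValue;
        foreach (var (label, trajectory) in templates)
        {
            float cost = DtwCost(candidate, trajectory);
            if (cost < bestCost) { bestCost = cost; best = label; }
        }
        return best;
    }

    // Standard DTW over two 3D point sequences.
    private static float DtwCost(Vector3[] a, Vector3[] b)
    {
        var dtw = new float[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++)
            for (int j = 0; j <= b.Length; j++)
                dtw[i, j] = float.PositiveInfinity;
        dtw[0, 0] = 0f;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                float d = Vector3.Distance(a[i - 1], b[j - 1]);
                dtw[i, j] = d + Mathf.Min(dtw[i - 1, j],
                                Mathf.Min(dtw[i, j - 1], dtw[i - 1, j - 1]));
            }
        }
        return dtw[a.Length, b.Length];
    }
}
```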
Methodology
The main focus of this project is to learn and recognize mid-air gestures performed from an ego-perspective using HoloLens 2. Before the actual training process, a hand gesture data set has to be prepared: the gesture set will be recorded for both approaches using the MRTK and further C# scripts and prepared for the recognition process. Next, the template matcher will be implemented. Since its training process should work almost out of the box with some adjustments, the main task is to establish a new live pose prediction for HoloLens. After that, the machine learning approach will be implemented: an independent C# program will be created using an SVM library, and the resulting model will be saved and integrated into Unity for live pose prediction. During the implementation process, objective measurements such as accuracy, precision, recall and confusion matrices will be used to evaluate the methods. At the end of the project, these criteria, together with the individual advantages and disadvantages of each approach, will be used to discuss their appropriateness. However, the goal is not to rule out one specific approach.
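To make the evaluation measures concrete, the sketch below shows how accuracy and per-class precision and recall could be derived from a confusion matrix; the helper class is hypothetical and only meant to illustrate the computation.

```csharp
// Sketch of deriving evaluation measures from a confusion matrix,
// where matrix[actual, predicted] holds sample counts.
public static class Metrics
{
    public static double Accuracy(int[,] matrix)
    {
        int correct = 0, total = 0;
        for (int i = 0; i < matrix.GetLength(0); i++)
            for (int j = 0; j < matrix.GetLength(1); j++)
            {
                total += matrix[i, j];
                if (i == j) correct += matrix[i, j];
            }
        return (double)correct / total;
    }

    // Precision for class c: true positives / all samples predicted as c.
    public static double Precision(int[,] matrix, int c)
    {
        int predictedAsC = 0;
        for (int i = 0; i < matrix.GetLength(0); i++) predictedAsC += matrix[i, c];
        return predictedAsC == 0 ? 0 : (double)matrix[c, c] / predictedAsC;
    }

    // Recall for class c: true positives / all samples actually belonging to c.
    public static double Recall(int[,] matrix, int c)
    {
        int actualC = 0;
        for (int j = 0; j < matrix.GetLength(1); j++) actualC += matrix[c, j];
        return actualC == 0 ? 0 : (double)matrix[c, c] / actualC;
    }
}
```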
References
Ionescu, B., Coquin, D., Lambert, P., & Buzuloiu, V. (2005). Dynamic hand gesture recognition using the skeleton of the hand. EURASIP Journal on Advances in Signal Processing, 2005(13), 1–9.
Jurewicz, K. A., Neyens, D. M., Catchpole, K., & Reeves, S. T. (2018). Developing a 3D gestural interface for anesthesia-related human-computer interaction tasks using both experts and novices.
Reddy, K. S., Latha, P. S., & Babu, M. R. (2011). Hand gesture recognition using skeleton of hand and distance based metric. International Conference on Advances in Computing and Information Technology, 346–354.
Reply, V. (2018). Extending HoloLens gestures with deep learning AI. https://www.valoremreply.com/post/extendinghololensgestures/ (accessed: 25.02.2021)
Schlosser, P., Grundgeiger, T., & Happel, O. (2018). Multiple patient monitoring in the operating room using a head-mounted display. Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems, 1–6.
Smedt, Q. D., Wannous, H., & Vandeborre, J.-P. (2016). Skeleton-based dynamic hand gesture recognition.
Zhang, Y., Cao, C., Cheng, J., & Lu, H. (2018). EgoGesture: A new dataset and benchmark for egocentric hand gesture recognition. IEEE Transactions on Multimedia, 20(5), 1038–1050.
Contact Persons at the University Würzburg
Dr. Florian Niebling (Primary Contact Person), Mensch-Computer-Interaktion, Universität Würzburg
florian.niebling@uni-wuerzburg.de