Show Me Your Moves

This project is already assigned.

Background

The analysis of human performance in virtual- or augmented environments is a fundamental process for virtual- or augmented reality (VR/AR) applications. On the one hand, it makes them possible in the first place - for example, live tracking of the user’s head for rendering purposes or live recognition of performed gestures for interaction. On the other hand, it allows to gain insights into human physiology by means of offline analysis - how do environment and interaction as well as their technical implementation affect users and what are the inter personal differences of this effect. In addition, the latter category of analyses can lead to the development and validation of models, which in turn form the basis for novel live analysis techniques that enhance the basic functionality of VR and AR applications mentioned initially. For instance, predicting 3D postures on RGB-Images for enhancing markerless tracking and recognizing non-verbal human object references for enhancing Multimodal Interfaces (MMIs).

For all these analysis scenarios, it must be possible to record and play back data. This data can be properties of a human movement, biometric measurements, information about the context of the action - i.e. the (virtual) environment - or any other properties that the respective VR or AR application manages in its application state.

This project thus targets a general solution for recording and playback of human performance and its virtual environment context with a special focus on temporal synchronization, customizability, adaptability, and interoperability. The quality of the temporal synchronism of the recorded values is a decisive factor for the achievable quality of the analyses carried out on them and is therefore of relevance to the project. The latter software quality aspects are important for the intended use of the project results within the chair and beyond. In this context, it will be important to take standards and established formats into account, e.g. with regard to the compatibility of existing tools.

A prototypical use of the recording and playback solution for conducting analysis in the two following application fields is to be aimed at, if the project progress allows it:

Prediction of 3D postures on RGB-Images - The targeted data recording and playback solution enables the generation of large synthetic datasets which are a fundamental requirement to train convolutional neural networks (CNNs) to predict human pose. It is possible for CNNs to predict the 3D posture of a person on RGB-Images if it has been trained with a sufficiently large amount of data [6]. Recent research dealt with the generation of synthetic datasets as manual labeling of 3D poses on images is almost impossible [7,8,9,10]. A vision based tracking by CNNs manages without user instrumentation, such as marker suits. Thus, it creates the most natural interaction scenario possible [11].

Recognition of non-verbal human object references - MMIs implement human-computer interaction paradigms that center around users’ natural behavior and communication capabilities [1]. In order to build natural MMIs, a multimodal system has to be capable of tracking and interpreting non-verbal human actions [2]. For instance, humans perform gestures with coordinated finger, hand, arm, torso, and head movements. These gestures together with gaze are vital modes of communication that elaborate upon and enhances orally expressed information [3]. Further, emotions influence our decision making, perception, and overall rational thinking and as such also play a critical role in human communication. Gestures and emotions can be detected by analyzing human body movement [4,5]. Therefor, the recording and playback of movement is a fundamental requirement for multimodal systems to analysis human behavior and to build natural MMIs.

Tasks

Develop a general solution for recording entity property changes within the Unity engine
- Optimize the temporal synchronization of the recorded data
- Consider existing standards and file formats for the recorded data types
- Coordination with the projects ongoing at the chair representing either potential data sources or potential users (via primary contact person)
Apply your solution to record the entire motion of an avatar (if the project progress allows it)
- Use a setup similar to one of our research projects
- Integrate your approach in one of our research projects
- Provide a playback to allow generation of training data for image-based recognition approaches
- Potential poster submission of results
Apply your solution to first record user motion data and then train a machine learning-based recognizer that resolves object references in a virtual environment (if the project progress allows it)
- Use the Valve Index and Vive Eye to track and record head, eye, hand, finger, and torso movements
- Build upon the outcome of a finished HCI project that already implemented a machine learning pipeline for Unity
- Use the implementation to train an object references resolver
- Validate the precision of your implementation
- Participate in a potential scientific paper or poster submission

Contact Persons at the University Würzburg

Marvin Thäns (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg
marvin.thaens@uni-wuerzburg.de

Chris Zimmerer (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg
chris.zimmerer@uni-wuerzburg.de

Dr. Martin Fischbach
Mensch-Computer-Interaktion, Universität Würzburg
martin.fischbach@uni-wuerzburg.de

References

[1] Oviatt, S.; Cohen, P. Perceptual User Interfaces: Multimodal Interfaces That Process What Comes Naturally. Commun. ACM 2000, 43, 45–53.

[2] Oviatt, S., Schuller, B., Cohen, P., Sonntag, D., & Potamianos, G. (2017). The Handbook Of Multimodal-Multisensor Interfaces, Volume 1: Foundations, User Modeling, and Common Modality Combinations. Morgan & Claypool.

[3] Hostetter, A. B. (2011). When do gestures communicate? A meta-analysis. Psychological bulletin, 137(2), 297.

[4] Rautaray, S. S., & Agrawal, A. (2015). Vision based hand gesture recognition for human computer interaction: a survey. Artificial intelligence review, 43(1), 1-54.

[5] Zacharatos, H., Gatzoulis, C., & Chrysanthou, Y. L. (2014). Automatic emotion recognition based on body movement analysis: a survey. IEEE computer graphics and applications, 34(6), 35-45.

[6] Mehta, D. [Dushyant], Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W. & Theobalt, C. (2017). Monocular 3d human pose estimation in the wild using improved cnn supervision. In 2017 International Conference on 3D Vision (3DV) (S. 506–516). IEEE.

[7] Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I. & Schmid, C. (2017). Learning from synthetic humans. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017) (S. 4627–4635). IEEE.

[8] Chen, W., Wang, H., Li, Y., Su, H., Wang, Z., Tu, C., Chen, B. (2016). Synthesizing training images for boosting human 3d pose estimation. In 3D Vision (3DV), 2016 Fourth International Conference on (S. 479–488). IEEE.

[9] Pishchulin, L., Jain, A., Wojek, C., Andriluka, M., Thormählen, T. & Schiele, B. (2011). Learning people detection models from few training samples. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on (S. 1473–1480). IEEE.

[10] Lassner, C., Pons-Moll, G. & Gehler, P. V. (2017). A generative model of people in clothing. In Proceedings of the IEEE International Conference on Computer Vision (Bd. 6).

[11] Wigdor, D. & Wixon, D. (2011). Brave NUI world: designing natural user interfaces for touch and gesture. Elsevier.