Recognition of VR players through motion data
This project is already completed.

Introduction
In recent years, communication and interaction over the internet have shifted increasingly towards virtual reality (VR). More and more meetings are held in virtual spaces. In such virtual experiences, where participants see and interact only with virtual avatars, it must be possible to prove identities. This kind of virtual experience is also called social VR.
Until now, authentication has typically been done with a password. However, this is hardly suitable for a VR application, in which there is no easy and intuitive way to enter text. A far more practical option is to recognize participants by their movements. This would create no obstacles in the VR experience, and users could start directly in their virtual space without any hindrance.
The recognition of VR users based on their movements within a single VR session has already been studied extensively. However, a realistic recognition application would not work with data from a single session, because such applications typically need to recognize people at several points in the future. A recognition model must therefore be trained to recognize people across several sessions.
For this reason, it is necessary to address the recognition of people across several sessions. To determine whether people can be reliably recognized over several sessions, a multi-session data set consisting of movement data from test persons must be created and a neural network trained on it.
Related Work

Table 1: Summary of relevant related work targeting identification or authentication based on machine learning of movement data; $N$ denotes the number of individual users for whom the respective data sets contain movement sequences.
As Table 1 shows, there is already a large body of work on recognizing people based on their movement data. In general, a distinction is made between the identification and the authentication of people. Identification describes the ability to recognize who a person is, while authentication is the ability to verify that a user is who they claim to be. For example, Rogers et al. [1], Pfeuffer et al. [2], and M. Miller et al. [3] focus on identifying their users, while Li et al. [4], Mustafa et al. [5], and all remaining authors ([6], [7], [8], [9], [10], [11], [12]) from Table 1 place their focus on authenticating people. As this work builds on the paper of Schell et al. [13], ours, like theirs, is dedicated to user identification.
Regardless of whether the task is authentication or identification, there is an overarching problem in how the data sets are created. Most of the data sets mentioned above consist only of data collected per person in a single session. In that case, a neural network can latch onto small session-dependent movement patterns to recognize persons. This could be, for example, a person's lack of energy on that day, or a day-dependent limp. This is unrealistic for practical applications, where users need to be recognized across a varying number of sessions. Only a few, such as Liebers et al. [12], have worked with a data set that spans multiple sessions per subject.
The data set of Liebers et al. [12] is also one of the few that have been published and are publicly accessible. However, this data set, like all others in Table 1, consists of certain predefined movements. For a person to be recognized in a virtual space by their motion, they would have to perform a predefined movement. This is more convenient than entering a password, but still not ideal.
The data set used by Schell et al. [13] consists of single-session movement data. People were asked to hold a conversation, and the subjects' movements were recorded during that conversation. This data set is already somewhat better suited for a real-world application, as the movements arise naturally rather than being prescribed; the subjects do not have to be told to perform them. For this reason, this type of movement could also be used to recognize users in virtual space without interrupting the VR experience for identification.
As this paper builds on the work of Schell et al. [13] and their already promising results in recognizing people from gestures, their methodology will be tested on a new data set consisting of the movements of subjects who receive no instructions on how to move.
Motivation
All of the previously mentioned publications work with data sets whose personal data were recorded within one session. This is likely due mainly to the increasing difficulty of recruiting study participants as the number of sessions grows. In general, most data sets containing movement data were recorded in a single session and with clearly defined movements, such as throwing a ball [8] or shooting with a bow [12]. In our case, this is different.
By asking the subjects to play Half-Life: Alyx, a VR action-adventure game, a wide variety of movements and actions are possible, which in extreme cases can be entirely disjoint between subjects.
Single-session recordings of subjects who are asked to perform a recurring task give neural networks opportunities to memorize subtleties. Perhaps a sensor was mounted so that its orientation differed from all other sensors mounted so far, or someone had a slight shiver that day because they were cold. In any case, the same movements are compared with each other. Such things make it easier for the neural network to recognize people, but they do not correspond to the real world. If an application is required to recognize people based on their movements, it should be able to do so across a variety of sessions, a variety of user moods, a variety of movements, and despite the possibility of slightly different sensor positions. And that is exactly where my work comes in.
Research Issue
This research addresses the question of whether people can be recognized from motion data collected over multiple sessions rather than a single session. To address this question, the machine learning approach of Schell et al. [13], which was originally designed for a single-session data set, is applied to a two-session data set. Possible differences in performance are assessed and discussed.
Methodology
Data Collection
The first part of the task is to create a data set. Participants will be asked to play the VR game Half-Life: Alyx for 30 minutes in each of two sessions. To create equal conditions, the same two game scenes are selected for all subjects, one of which is played in each session. The scenes are selected for their suitability for collecting movement data: they should not be too difficult and should allow each subject to progress through the game without too many obstacles. Before the first session, subjects are allowed to familiarise themselves with the game mechanics by completing a tutorial.
The data set itself will consist of position and rotation data of the hands and head, recorded at constant intervals. Eye-movement data are also collected. In addition, information about each person is recorded and integrated into the data set: age, gender, height, and previous VR experience are collected through a questionnaire. To further broaden the future use of the data set, the user's field of view is also recorded while playing and made available.
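To make the structure of the recordings concrete, the following is a minimal sketch of how a single recorded frame and the accompanying questionnaire data could be represented. All field names, units, and comments are illustrative assumptions, not the final schema.

```python
from dataclasses import dataclass

@dataclass
class FrameSample:
    """One motion sample, recorded at a constant interval (illustrative fields)."""
    timestamp: float                                  # seconds since session start
    head_pos: tuple[float, float, float]              # head position (x, y, z) in metres
    head_rot: tuple[float, float, float, float]       # head orientation as a quaternion
    left_hand_pos: tuple[float, float, float]
    left_hand_rot: tuple[float, float, float, float]
    right_hand_pos: tuple[float, float, float]
    right_hand_rot: tuple[float, float, float, float]
    gaze_dir: tuple[float, float, float]              # eye-tracking gaze direction

@dataclass
class ParticipantInfo:
    """Questionnaire data stored alongside the motion recordings (illustrative fields)."""
    subject_id: str
    session: int          # 1 or 2
    age: int
    gender: str
    height_cm: float
    vr_experience: str    # e.g. "none", "occasional", "frequent"
```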
Since I want to train neural networks with the data afterwards, a minimum number of test persons is necessary. I aim for at least 30 subjects, intending to integrate as many people as possible into the data set. If subjects are unable to complete both sessions, their data will be stored separately and marked as incomplete, as the focus is on a multi-session data set. No guarantee can be given for an equal gender distribution in the data set, since the priority is the number of test subjects rather than an equal distribution.
Application of Machine Learning
After the data set is created, a Gated Recurrent Unit (GRU) network [14] will be trained to recognize the test subjects based on their movements. I limit myself to the GRU, as Schell et al. [13] showed that this kind of neural network performs best with the given type of data.
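As a rough illustration, a GRU-based identification network in PyTorch could look like the sketch below. The layer sizes, the use of the last hidden state for classification, and the feature dimensionality are assumptions, not the configuration used by Schell et al. [13].

```python
import torch
import torch.nn as nn

class GRUIdentifier(nn.Module):
    """Maps a window of motion features to one of n_subjects identities (sketch)."""

    def __init__(self, n_features: int, n_subjects: int,
                 hidden_size: int = 128, num_layers: int = 2, dropout: float = 0.2):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden_size, num_layers,
                          batch_first=True, dropout=dropout)
        self.classifier = nn.Linear(hidden_size, n_subjects)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, time_steps, n_features)
        out, _ = self.gru(x)                 # (batch, time_steps, hidden_size)
        return self.classifier(out[:, -1])   # classify from the last time step

# Example usage with assumed dimensions: 30 subjects, windows of 100 time steps,
# and 21 per-frame motion features.
model = GRUIdentifier(n_features=21, n_subjects=30)
logits = model(torch.randn(8, 100, 21))      # -> shape (8, 30)
```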
Two GRU networks are implemented in Python using the PyTorch package. The first receives the positions and orientations of the hands relative to the head per time step, while the second instead receives the velocities of these movements.
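The two input representations could, for instance, be derived from the raw recordings as sketched below. The quaternion convention (x, y, z, w), the use of SciPy for rotations, and the sampling interval are assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def relative_pose_features(head_pos, head_rot, hand_pos, hand_rot):
    """Hand position and orientation expressed in the head's coordinate frame.

    head_pos, hand_pos: (T, 3) arrays; head_rot, hand_rot: (T, 4) quaternions (x, y, z, w).
    Returns a (T, 7) feature matrix per hand: relative position + relative orientation.
    """
    head = R.from_quat(head_rot)
    rel_pos = head.inv().apply(hand_pos - head_pos)            # hand position in head frame
    rel_rot = (head.inv() * R.from_quat(hand_rot)).as_quat()   # hand orientation in head frame
    return np.concatenate([rel_pos, rel_rot], axis=1)

def velocity_features(features, dt=1.0 / 72.0):
    """Finite-difference velocities of a (T, F) per-frame feature matrix; dt is assumed."""
    return np.diff(features, axis=0) / dt                       # shape (T - 1, F)
```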
To find the best parameters for the two GRUs, a two-stage hyperparameter search is performed. The first stage covers the size of the GRUs, the number of layers, and the dropout rate, while the second stage covers the number of time steps fed into the GRUs at once. Both GRUs are then compared against each other, with the focus on subject-recognition accuracy.
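A minimal sketch of such a two-stage search is given below. The candidate values and the `train_and_evaluate` function (assumed to train a model for a given configuration and return its validation accuracy) are placeholders, not the actual search space.

```python
from itertools import product

def two_stage_search(train_and_evaluate):
    """Stage 1 tunes the architecture, stage 2 the window length (sketch)."""
    # Stage 1: GRU hidden size, number of layers, and dropout (window length fixed).
    stage1 = [
        dict(hidden_size=h, num_layers=l, dropout=d, window=100)
        for h, l, d in product([64, 128, 256], [1, 2, 3], [0.0, 0.2, 0.5])
    ]
    best = max(stage1, key=train_and_evaluate)

    # Stage 2: number of time steps fed into the GRU at once, using the best architecture.
    stage2 = [dict(best, window=w) for w in [50, 100, 200, 400]]
    return max(stage2, key=train_and_evaluate)
```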
Work Scheduling

Rough time estimate for each of the tasks:
- Record Data Set: Recording of the movement data of the controllers and the head-mounted display. In addition, eye movements are recorded.
- Implement GRU: Implementation of a network consisting of GRUs. The implementation will be done in the Python programming language and will use the PyTorch package.
- Train/Evaluate GRU: Training and evaluation of the training results. This also includes a hyperparameter search and adjustments to the neural net to increase performance.
- Write Thesis: Writing of the thesis as soon as the first results of the neural network are available.
- Cynthia E. Rogers, Alexander W. Witt, Alexander D. Solomon, and Krishna K. Venkatasubramanian (2015)
- Ken Pfeuffer, Matthias J. Geiger, Sarah Prange, Lukas Mecke, Daniel Buschek, and Florian Alt (2019)
- M. R. Miller, F. Herrera, and H. Jun (2020)
- Sugang Li, Ashwin Ashok, Yanyong Zhang, Chenren Xu, Janne Lindqvist, and Marco Gruteser (2016)
- Tahrima Mustafa, Richard Matovu, Abdul Serwadda, and Nicholas Muirhead (2018)
- Alexander Kupin, Benjamin Moeller, Yijun Jiang, Natasha Kholgade Banerjee, and Sean Banerjee (2019)
- Yiran Shen, Hongkai Wen, Chengwen Luo, Weitao Xu, Tao Zhang, Wen Hu, and Daniela Rus (2019)
- Florian Mathis, Hassan Ismail Fawaz, and Mohamed Khamis (2020)
- Robert Miller, Natasha Kholgade Banerjee, and Sean Banerjee (2020)
- Ilesanmi Olade, Charles Fleming, and Hai-Ning Liang (2020)
- Jonathan Liebers, Mark Abdelaziz, Lukas Mecke, Alia Saad, Jonas Auda, Uwe Gruenefeld, Florian Alt, and Stefan Schneegass (2021)
- Rahul Dey and Fathi M. Salem (2017)
Contact Persons at the University of Würzburg
Christian Schell (Primary Contact Person), Universität Würzburg
christian.schell@uni-wuerzburg.de