How precise can it get - Detection of drawing gestures in virtual reality using a machine learning approach
This project is already completed.

Motivation and Related Work
To successfully create immersive virtual realities, users need to be able to interact with their virtual environment intuitively and naturally. When it comes to writing or sketching (both of which I will refer to as “drawing”), the most intuitive and natural interaction is to do so manually with pen and paper, since this is how we have learned it from an early age. The contexts of Virtual Classrooms and Virtual Creative Collaboration are just two of many use cases in which manual drawing could have an immense impact on immersion. But how can we bring manual drawing into Virtual Reality?
The most common approach to creating a drawing interaction in Virtual Reality (VR) is through an additional pen-like hardware device, a so-called stylus. Recently, scientific pilot versions of styluses have been presented (e.g., Romat et al. (2021, March); Jackson (2020, March); Elsayed et al. (2020, November)), and the first commercial styluses are on the market (e.g., the Logitech VR Ink or FlyingShapes). Since these styluses have been developed specifically for drawing interactions, their hardware is designed to draw as precisely as possible, e.g., through a pressure sensor built into the tip of the stylus. However, they have the disadvantage that they are still rare and expensive or require additional tracking systems.
Yet there is another approach to drawing in VR that aims to avoid the extra hardware of a stylus and thus be more accessible, cheaper, and independent of an additional tracking system: Facebook (Zuckerberg & King, 2021) as well as Kern et al. (2021) developed a method to draw manually in VR by holding a VR controller upside down as a pen, enabling the user to draw on physical surfaces such as tables or walls. The technique works by mapping a virtual surface onto a physical surface, like a table. This allows the physical surface to be tracked indirectly via the virtual surface aligned to it. Both the generated virtual surface and the bottom side of the controller are then tracked, and as soon as the controller bottom collides with the virtual surface, in other words touches the physical table, a writing event is triggered (a minimal code sketch of this trigger follows the list below). This surface tracking method overcomes the aforementioned disadvantages of existing styluses by using the VR controller already at hand as a stylus itself. However, as VR controllers were not specifically designed for drawing, there are disadvantages in terms of precision and usability that need to be overcome:
- The method needs calibration before any drawing is possible. In Facebook’s approach, this is done by moving the controller in a circular motion on the physical surface. Kern et al. (2021), on the other hand, align the virtual surface to the haptic one by drawing a vertical and a horizontal line on the haptic surface. In both approaches, this additional step before drawing disrupts the natural and intuitive flow of interaction.
- For this method to work, a precise mapping of the virtual onto the physical surface is required, but this cannot be assumed under real-world conditions. Variations in the tracking system or irregularities in the table can cause the virtual surface to shift relative to the physical one. This leads to two possible failure cases: either a writing event occurs although it should not, because the virtual surface lies above the physical surface, or an intended writing event does not happen, because the virtual surface lies below the physical surface. Both scenarios limit the reliability of the method.
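To make the surface tracking principle concrete, the following is a minimal sketch of how such a writing-event trigger could look once a virtual plane has been aligned with a physical table. It is not the actual implementation of Kern et al. (2021) or Facebook; the plane representation and the tolerance value are assumptions for illustration.

```python
import numpy as np

def writing_event(tip_position, plane_point, plane_normal, tolerance=0.002):
    """Return True when the controller tip is at or below the calibrated
    virtual plane (within a small tolerance in metres).

    plane_point and plane_normal come from the calibration step; the
    tolerance value is illustrative only.
    """
    signed_distance = float(np.dot(tip_position - plane_point, plane_normal))
    return signed_distance <= tolerance

# Example: virtual plane at table height y = 0.75 m, normal pointing up.
plane_point = np.array([0.0, 0.75, 0.0])
plane_normal = np.array([0.0, 1.0, 0.0])
print(writing_event(np.array([0.1, 0.749, 0.2]), plane_point, plane_normal))  # True
print(writing_event(np.array([0.1, 0.80, 0.2]), plane_point, plane_normal))   # False
```

This simple threshold test also illustrates the failure cases above: if the calibrated plane sits above or below the real table, the signed distance crosses the tolerance at the wrong moments.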
These calibration and mapping problems need to be overcome for the surface tracking approach to become an alternative to common stylus pens. Machine learning could bypass them by differentiating gestures that are meant as drawing from other gestures, based on the given controller data. This would render the creation of a virtual surface obsolete and could potentially increase the precision with which virtual ink is created when intended. In the next section, I will show that machine learning research so far contains only a few attempts to distinguish drawing from non-drawing gestures, all of which deal with mid-air drawing outside of VR and are not transferable to the specific requirements of a VR setup, where controller tracking data is less accurate and writing is supposed to happen on top of a physical surface.
For example, Amma et al. (2014) tried to identify handwriting segments in a continuous stream of hand-movement data. To do this, the handwriting segments had to be distinguished from segments in which everyday tasks such as cooking or cleaning were performed. Participants were asked to pursue their usual daily activities as well as to write simple sentences in capital letters in mid-air. The data were gathered from subjects wearing gloves with attached motion sensors (accelerometers and gyroscopes). Amma et al. managed to classify almost all segments containing handwriting as such; however, 65 percent of the segments that did not contain handwriting were incorrectly classified as handwriting segments as well. Moreover, they only distinguished between segments of handwriting and segments of daily activities, so a whole sentence was counted as one handwriting segment regardless of the fact that the pen was lifted repeatedly during the writing process. To enable unrestricted handwriting in VR, handwriting segments must be distinguished on a far more detailed level and much more precisely.
Another work by Bohari and Chen (2018, March) investigated the distinction between hover and stroke movements when writing or painting in mid-air with a pen. Similar to the approach of Amma et al. (2014), they took positional and rotational acceleration data into account. The data were gathered using the GeoMagic Touch 3D, a motorized stylus-like input device which obtains its position and rotation data from its mechanical construction rather than from tracking and is therefore very precise. They argued that they chose this device to ensure accurate data, as data gathered through tracking lacked robustness. They were able to differentiate between hover and strokes in 82 percent of the cases. However, it is likely that people make more expansive movements when writing in mid-air than when writing on physical surfaces, as the former is not a common task. Also, in a VR setup, tracking cannot be assumed to be as precise as with a mechanical input device. Therefore, their work cannot be transferred to the use case of writing on planar surfaces in a VR setting.
Looking at this existing machine learning research, it becomes apparent that it has not yet been answered whether a machine learning approach can distinguish drawing from non-drawing gestures on flat surfaces based on the tracking data of a VR controller. If this were possible, such an approach would have the potential to provide better accuracy and usability than the currently most promising surface tracking approach of Kern et al. (2021) and Facebook.
My thesis will address the question of whether the recognition of drawing gestures on flat surfaces in VR is possible using machine learning. Because I assume that, due to the inaccuracy of VR tracking, the size of a drawing gesture affects the accuracy of its recognition, I will use an exploratory process to investigate the recognition accuracy for different gesture sizes. In addition, I will evaluate whether the machine learning approach is a better alternative to the current surface tracking approach of Kern et al. (2021) by comparing the two approaches in terms of their accuracy for different gesture sizes.
Goal
Research Question 1: At which gesture size can a machine learning algorithm detect if a drawing gesture is being performed in Virtual Reality when using the controller as a pen and only considering data from the controller tracking (position and rotation data)?
1a) In theory, all data points labelled as drawing events lie on one plane. Can a simple linear regression procedure fit a matching plane through the data points and thereby distinguish drawing from non-drawing events?
1b) How does a recurrent neural network approach compare to the linear regression approach from 1a at different gesture size levels, and how do their latencies compare?
Research Question 2: How does the machine learning approach compare to the surface tracking approach developed by Kern et al. (2021) on different gesture size levels?
Methodology
Prerequisites:
A prerequisite for answering the posed research questions is a dataset consisting of labelled drawing and non-drawing data of different gesture sizes. Consequently, the first part of my master thesis consists of the creation of a suitable dataset.
Since the pressure sensor in the tip of the Logitech VR Ink is particularly well suited for creating precisely labelled drawing and non-drawing data, this tool will be used to generate the needed dataset. Given that the position and rotation data captured by the Logitech VR Ink can also be provided by a controller, the results will be transferable to the application with a controller. The dataset will contain drawing data from horizontal, vertical, and oblique planar surfaces.
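As an illustration of how the pressure sensor could be used for labelling, the following is a minimal sketch. The sample fields and the pressure threshold are my own assumptions for this sketch, not part of the Logitech VR Ink API.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    timestamp: float        # seconds
    position: tuple         # (x, y, z) position in metres
    rotation: tuple         # orientation quaternion (x, y, z, w)
    pressure: float         # normalised tip pressure in [0, 1]

# Hypothetical threshold above which a sample counts as a drawing event.
PRESSURE_THRESHOLD = 0.05

def label_samples(samples):
    """Attach a binary drawing/non-drawing label to each recorded sample."""
    return [(s, s.pressure > PRESSURE_THRESHOLD) for s in samples]
```

Only the position and rotation fields would later be fed to the classifiers; the pressure field is used exclusively for labelling.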
Research Question 1a:
To answer Research Question 1a, I will apply a rudimentary linear regression approach to recognizing drawing gestures. For this rudimentary approach, no training or feature engineering is needed. Instead, I will repeatedly fit a plane through a defined number of y-position data points such that the sum of the distances between the data points and the plane (the error) is as small as possible. Based on this error, I will then predict whether a drawing event is occurring. The focus will lie on investigating whether such a basic approach can distinguish drawing from non-drawing gestures at all and, if so, how its accuracy relates to its latency. Therefore, I will test the linear regression on different numbers of y-position data points.
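A minimal sketch of this plane-fitting baseline is shown below, assuming a roughly horizontal drawing surface so that the plane can be written as y = a*x + b*z + c. The window size and the error threshold are placeholders to be determined experimentally.

```python
import numpy as np

def plane_fit_error(positions):
    """Fit a plane y = a*x + b*z + c to a window of controller positions via
    ordinary least squares and return the root-mean-square residual."""
    positions = np.asarray(positions)              # shape (n, 3): x, y, z
    X = np.column_stack([positions[:, 0], positions[:, 2], np.ones(len(positions))])
    y = positions[:, 1]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coeffs
    return np.sqrt(np.mean(residuals ** 2))

def is_drawing(window, threshold=0.001):
    """Classify a window as part of a drawing gesture when the positions lie
    close to a common plane; threshold is in metres and illustrative only."""
    return plane_fit_error(window) < threshold
```

Varying the window length then trades latency (more points take longer to collect) against the stability of the fit.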
Research Question 1b:
The main part of my master thesis lies in answering Research Question 1b. I will therefore focus on developing a Recurrent Neural Network (RNN), since these networks are well suited to time-series data. Madsen (2019) showed that Gated Recurrent Units (GRUs), a specific form of RNN, handle long-term memory better and more efficiently than Long Short-Term Memory networks (LSTMs), another form of RNN. For this reason, I will only work with GRUs.
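A minimal sketch of such a GRU classifier follows, assuming PyTorch; the number of input features, the hidden size, and the window length are placeholder hyperparameters.

```python
import torch
import torch.nn as nn

class DrawingGRU(nn.Module):
    """GRU that maps a window of per-frame features (e.g. velocity and
    acceleration of position and rotation) to a drawing/non-drawing logit."""

    def __init__(self, n_features=12, hidden_size=64, n_layers=1):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden_size, n_layers, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                     # x: (batch, time, n_features)
        _, h = self.gru(x)                    # h: (n_layers, batch, hidden_size)
        return self.head(h[-1]).squeeze(-1)   # one logit per window

# Example forward pass: 8 windows of 30 frames with 12 features each.
model = DrawingGRU()
logits = model(torch.randn(8, 30, 12))
probabilities = torch.sigmoid(logits)
```

Training would use a binary cross-entropy loss against the pressure-based labels described above.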
In line with the approaches of Bohari and Chen (2018, March) and Amma et al. (2014), I will work with relative data points only and focus on velocity and acceleration data. For the comparison of the linear regression approach and the GRU, I will take their latency and accuracy into account at different gesture size levels.
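Velocity and acceleration could, for instance, be derived from the absolute position samples by finite differences, as sketched below; the sampling rate and the restriction to positional features are assumptions of this sketch.

```python
import numpy as np

def relative_features(positions, dt=1.0 / 90.0):
    """Derive per-frame velocity and acceleration from absolute positions.

    positions: array of shape (n, 3); dt is the assumed tracking interval
    (90 Hz here). Rotational velocity and acceleration could be derived
    analogously from successive orientation differences.
    """
    velocity = np.diff(positions, axis=0) / dt       # shape (n-1, 3)
    acceleration = np.diff(velocity, axis=0) / dt    # shape (n-2, 3)
    # Trim velocity so both feature blocks are aligned frame by frame.
    return np.hstack([velocity[1:], acceleration])   # shape (n-2, 6)
```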
Research Question 2:
The last part of my thesis consists of answering Research Question 2. To do so, I will compare the developed deep learning method with the surface tracking method of Kern et al. (2021). The comparison will take place at different drawing gesture size levels. At each level, the accuracy and latency of both Kern et al.’s (2021) approach and my own will be measured and compared against the ground truth provided by the Logitech VR Ink.
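In its simplest form, this per-level comparison amounts to computing frame-wise accuracy against the pressure-based ground truth and measuring the delay until a stroke is detected. The sketch below illustrates that idea; the metric definitions and the assumed 90 Hz tracking rate are my own choices, not a prescribed evaluation protocol.

```python
import numpy as np

def accuracy(predicted, ground_truth):
    """Fraction of frames on which a method agrees with the pressure-based
    ground-truth labels from the Logitech VR Ink."""
    predicted = np.asarray(predicted, dtype=bool)
    ground_truth = np.asarray(ground_truth, dtype=bool)
    return float(np.mean(predicted == ground_truth))

def mean_onset_latency(predicted, ground_truth, dt=1.0 / 90.0):
    """Average delay (seconds) between a ground-truth stroke onset and the
    first subsequent frame on which the method reports a drawing event."""
    pred = np.asarray(predicted, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    onsets = np.flatnonzero(gt[1:] & ~gt[:-1]) + 1   # indices where strokes start
    delays = []
    for onset in onsets:
        hits = np.flatnonzero(pred[onset:])
        if hits.size:
            delays.append(hits[0] * dt)
    return float(np.mean(delays)) if delays else float("nan")
```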
Challenges:
A preliminary examination shows that both the tracking of the Logitech VR Ink and the tracking of the controller are comparably inaccurate, which leads to noisy position and rotation data. If this noise exceeds the actual signal, i.e., if the inaccuracy is larger than the difference between the two classes (drawing and non-drawing), even a machine learning approach will not be able to distinguish drawing from non-drawing events.
Another obstacle arises from the Inertial Measurement Units (IMUs) built into the controller. It would theoretically be easy to distinguish between drawing and non-drawing gestures if the controller’s acceleration value were zero at the moment of collision. However, this is not the case, because the IMUs react sluggishly to the abrupt change in acceleration. This makes the problem much harder in practice than it appears in theory.

Figure 1 shows how these challenges affect the accuracy of the tracking system. With the naked eye, no distinction between drawing and non-drawing events can be made by looking at the y-position data resulting from writing on a horizontal surface.
Milestones
- Data Gathering
- Generating training and test data consisting of drawing gestures of different sizes on horizontal, vertical, and oblique planar surfaces.
- Development of Machine Learning Approach (Research Question 1)
- Application of a linear regression approach. The focus will lie on investigating the consequences for accuracy and latency of the number of data points considered.
- Searching for suitable features for the detection of drawing gestures, with a focus on relative data such as velocity and acceleration.
- Training of a GRU using the generated features.
- Hyperparameter search: iterative adaptation of data preprocessing parameters, model parameters, and gesture sizes to investigate how precise the detection becomes depending on the gesture size level.
- Evaluation (Research Question 2)
- Comparison of the deep learning approach with the surface tracking approach developed by Kern et al. (2021) with respect to latency and accuracy at different gesture size levels. The values of the pressure sensor in the tip of the Logitech VR Ink serve as ground truth.
Literature
Amma, C., Georgi, M., & Schultz, T. (2014). Airwriting: a wearable handwriting recognition system. Personal and ubiquitous computing, 18(1), 191-203.
Bohari, U., & Chen, T. J. (2018, March). To draw or not to draw: recognizing stroke-hover intent in non-instrumented gesture-free mid-air sketching. In 23rd International Conference on Intelligent User Interfaces (pp. 177-188).
Elsayed, H., Barrera Machuca, M. D., Schaarschmidt, C., Marky, K., Müller, F., Riemann, J., … & Mühlhäuser, M. (2020, November). VRSketchPen: Unconstrained Haptic Assistance for Sketching in Virtual 3D Environments. In 26th ACM Symposium on Virtual Reality Software and Technology (pp. 1-11).
Jackson, B. (2020, March). OVR Stylus: Designing Pen-Based 3D Input Devices for Virtual Reality. In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW) (pp. 13-18). IEEE.
Kern, F., Kullmann, P., Ganal, E., Korwisi, K., Stingl, R., Niebling, F., & Latoschik, M. E. (2021). Off-The-Shelf Stylus: Using XR Devices for Handwriting and Sketching on Physically Aligned Virtual Surfaces. Frontiers in Virtual Reality, 2, 69.
Madsen, A. (2019). Visualizing memorization in RNNs. Distill, 4(3), e16.
Romat, H., Fender, A., Meier, M., & Holz, C. (2021, March). Flashpen: A High-Fidelity and High-Precision Multi-Surface Pen for Virtual Reality. In 2021 IEEE Virtual Reality and 3D User Interfaces (VR) (pp. 306-315). IEEE.
Zuckerberg, M., & King, G. (2021). Facebook launches “Horizon Workrooms.” Here’s how it works.
Contact Persons at the University Würzburg
Christian Schell (Primary Contact Person), Universität Würzburg
christian.schell@uni-wuerzburg.de