Human-Computer Interaction

Do you mean me?


Introduction

Multimodal Interfaces (MMIs) are a prevalent candidate when choosing a suitable way to interact with a Virtual Environment (VE). They orchestrate multiple input sources while considering diverse kinds of natural human behavior, enabling the user to utilize different modalities when interacting with the interface [4,11]. The increased number of input channels makes MMIs more robust: the individual information sources can disambiguate and complement each other, even if they are erroneous or fragmentary, which helps to clarify ambiguities and reduce uncertainty [11,13]. The additional input channels also grant users flexibility and accessibility, since they open novel interaction possibilities for people with physical or cognitive impairments [4,11]. Users often perceive these interfaces as more natural due to their focus on natural human behavior: users can employ their own senses and expressions (e.g., speech and gestures) instead of being forced to use mice or keyboards as in traditional WIMP interfaces [4,13]. MMIs also have the potential to increase the feeling of flow, since the absence of additional menus keeps the user from shifting attention, which might reduce mental workload and increase usability [16].

When it comes to interaction with virtual objects that are out of arm's reach, MMIs often combine speech and deictic gestures. Deictic gestures (e.g., pointing) complement speech by illustrating the spoken words and are thus strongly connected to them [5]. Pointing is particularly suitable as a deictic gesture because it is a habitual interaction that can already be observed in infants; here, both the index finger and the corresponding arm are extended towards the object of interest [1].

MMIs often choose ray-casting [10] to implement the pointing gesture [16,18]. While ray-casting is simple to implement and has virtually no learning curve [3], it has weaknesses that affect the interaction with the system. The user has to hit the target exactly and keep this intersection stable until the target is determined [3]. This contradicts actual pointing behavior: users exhibit a natural offset and do not hit targets accurately when pointing, as shown by Mayer et al. [8]. The interaction might be suitable for large targets [15], but is impractical for small, distant [3,12,15], or occluded targets [3,12], which demand higher accuracy [9]. This demanded accuracy is further compromised by hand and tracker jitter as well as fatigue [3,6,9,15], which distort the ray's direction and prevent a precise selection. The additional effort needed to compensate for these drawbacks can lead to user frustration [3]. Advanced selection techniques like SQUAD [7], Expand, or Zoom [2] use multiple aids to lower the required accuracy and were designed for selection in occluded environments. However, these aids often consist of view manipulations or additional menus that require multiple execution steps. Such menus and extra steps might reduce immersion, and the manipulation of the user's view might cause discomfort, as pointed out by Mendes et al. [9]. Furthermore, with respect to the aforementioned advantages of MMIs, these additional steps might be counter-intuitive and could contradict the naturalness of MMIs, since the user has to learn how to interact with them, is distracted from the actual goal, and needs more time to reach it [16]. It should also be noted that selection with ray-casting is dichotomous in nature: either the object is hit and selected, or it is not. The information content of a single ray-cast is therefore very low, even though the quality of the input information determines how well the system can discriminate the user's intent [11]; this makes ray-casting unsuitable in terms of information content. Finally, since ray-casting requires a high degree of accuracy, the cast ray always has to be visualized in the VE; otherwise, the user could not determine its start and end and could not achieve the required accuracy. This visualization, however, might be considered unnatural, since it does not correspond to reality.
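The following minimal sketch illustrates this dichotomous behavior under simplifying assumptions (objects approximated by bounding spheres, hypothetical object names); it is not the implementation discussed here. The first object intersected by the pointing ray is selected, and a near miss yields no information at all.

```python
# Minimal ray-casting selection sketch (illustrative only): objects are
# approximated by bounding spheres and the first object intersected by the
# pointing ray is selected. Note the dichotomous outcome: hit or no hit.
import numpy as np

def ray_cast_select(origin, direction, objects):
    """origin, direction: 3D vectors; objects: list of (name, center, radius)."""
    direction = direction / np.linalg.norm(direction)
    best, best_t = None, np.inf
    for name, center, radius in objects:
        oc = center - origin
        t = np.dot(oc, direction)        # distance of closest approach along the ray
        if t < 0:
            continue                     # object lies behind the user
        d2 = np.dot(oc, oc) - t * t      # squared distance from ray to sphere center
        if d2 <= radius ** 2 and t < best_t:
            best, best_t = name, t
    return best                          # None if the ray misses every object

objects = [("lamp", np.array([0.1, 1.5, 3.0]), 0.15),
           ("vase", np.array([0.6, 1.2, 4.0]), 0.10)]
print(ray_cast_select(np.array([0.0, 1.6, 0.0]), np.array([0.05, -0.02, 1.0]), objects))
```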

In a previous project [14] I tried to tackle the aforementioned disadvantages of plain ray-casting. The idea was to let users express their deictic gestures freely instead of limiting them to rigid pointing with a ray. Users could employ multiple body parts such as their head, eyes, hands, fingers, and torso in a familiar way while executing deictic gestures. The resulting information is used to predict the probability that a given object is the intended target (later called the object's awareness). This awareness brings the following advantages: 1) The selection itself does not have to be exact; even if the object is not hit, unlike with the conventional method there remains a residual probability that can be used to determine the target object. 2) The user does not have to adapt to the selection method; the selection method adapts to the user. In other words, the probability that an object is referenced depends on the information obtained from previous object references by users. This property also accounts for the previously mentioned offset that occurs when pointing at objects.
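The sketch below illustrates the general idea under simplified assumptions and is not the actual project code: for each candidate object, the angular offsets of several cues (head, eye, hand) are combined into a cost, and a softmax over all candidates turns these costs into probabilities, so a near miss still leaves a residual probability. The cue weights and the sharpness parameter are hypothetical and would, in the real system, be derived from recorded user data.

```python
# Hedged sketch of the "awareness" idea: smaller combined angular offsets
# between the user's cues and an object yield higher selection probabilities.
import numpy as np

def angle(v, w):
    v, w = v / np.linalg.norm(v), w / np.linalg.norm(w)
    return np.arccos(np.clip(np.dot(v, w), -1.0, 1.0))

def awareness(cues, objects, weights=None, sharpness=8.0):
    """cues: dict name -> (origin, direction); objects: dict name -> position.
    weights: per-cue importance (hypothetical, would be learned from data)."""
    weights = weights or {c: 1.0 for c in cues}
    scores = []
    for obj_pos in objects.values():
        cost = sum(weights[c] * angle(d, obj_pos - o) for c, (o, d) in cues.items())
        scores.append(-sharpness * cost)   # lower angular cost -> higher score
    scores = np.array(scores)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                   # softmax over all candidate objects
    return dict(zip(objects.keys(), probs))

cues = {"head": (np.array([0.0, 1.7, 0.0]), np.array([0.0, -0.05, 1.0])),
        "hand": (np.array([0.3, 1.3, 0.2]), np.array([0.05, 0.0, 1.0]))}
objects = {"lamp": np.array([0.1, 1.5, 3.0]), "vase": np.array([0.6, 1.2, 4.0])}
print(awareness(cues, objects))
```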

For this purpose, I collected and annotated data of users while they referenced given target objects and trained a neural network with this data. The trained network then serves as the object's awareness and provides the desired likelihood.
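The sketch below only indicates how such a network could be trained in principle; the features, labels, and architecture are placeholders and do not reflect the actual data set or model. Each sample pairs a feature vector describing one deictic gesture relative to one candidate object with a binary label stating whether that object was the annotated target, and the classifier's predicted probability serves as the awareness value.

```python
# Hedged training sketch with placeholder data (not the real data set or model).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))               # placeholder features per gesture/object pair
y = (X[:, :3].sum(axis=1) < 0).astype(int)   # placeholder labels (annotated target or not)

net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
net.fit(X, y)
awareness = net.predict_proba(X[:5])[:, 1]   # likelihood that each candidate is the target
print(awareness)
```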

Goal

While the first impressions of the developed approach were promising, I do not yet have further information about its performance and user acceptance compared to traditional ray-casting. Furthermore, I have no insights into the performance of my approach when it is integrated into an MMI. The goal of this master's thesis is therefore to compare the previously developed awareness approach to traditional ray-casting. To this end, both approaches will be integrated into Concurrent Augmented Transition Networks (cATNs) [17] and compared in terms of subjective as well as objective quality measures.

Study

The study will compare the two implemented multimodal interfaces. For this purpose, a within-subjects study will be conducted in which every user operates both interfaces. The users' task is to select objects located in the scene; both the distance to the objects and their density are varied across the individual conditions.
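The following sketch outlines such a design with placeholder condition levels and a simple counterbalancing of the interface order; the actual levels, trial counts, and measures are still to be defined.

```python
# Hedged sketch of the planned within-subjects design (all values are placeholders):
# every participant uses both interfaces, distance and density are fully crossed,
# and the interface order is counterbalanced across participants.
from itertools import product
from random import Random

INTERFACES = ["ray_casting", "awareness"]
DISTANCES = ["near", "far"]        # placeholder levels
DENSITIES = ["sparse", "dense"]    # placeholder levels

def conditions_for(participant_id):
    order = INTERFACES if participant_id % 2 == 0 else INTERFACES[::-1]
    trials = []
    for interface in order:        # counterbalanced interface order
        cells = list(product(DISTANCES, DENSITIES))
        Random(participant_id).shuffle(cells)
        trials += [(interface, d, rho) for d, rho in cells]
    return trials

print(conditions_for(3))
```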

Time Schedule

References

[1] Butterworth, G. (2003). Pointing is the royal road to language. Pointing: Where language, culture, and cognition meet (pp. 9–33).

[2] Cashion, J., Wingrave, C., & LaViola, J. J. (2012). Dense and Dynamic 3D Selection for Game-Based Virtual Environments. IEEE Transactions on Visualization and Computer Graphics, 18(4), 634–642. https://doi.org/10.1109/TVCG.2012.40

[3] De Haan, G., Koutek, M., & Post, F. H. (2005). IntenSelect: Using Dynamic Object Rating for Assisting 3D Object Selection. Eurographics Workshop on Virtual Environments (EGVE).

[4] Dumas, B., Lalanne, D., & Oviatt, S. (2009). Multimodal Interfaces: A Survey of Principles, Models and Frameworks. In D. Lalanne & J. Kohlas (Eds.), Human Machine Interaction (Vol. 5440, pp. 3–26). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-00437-7_1

[5] Ekman, P., & Friesen, W. V. (1969). The Repertoire of Nonverbal Behavior: Categories, Origins, Usage, and Coding. Semiotica, 1(1), 49–98. https://doi.org/10.1515/semi.1969.1.1.49

[6] Forsberg, A., Herndon, K., & Zeleznik, R. (1996). Aperture based selection for immersive virtual environments. Proceedings of the 9th Annual ACM Symposium on User Interface Software and Technology (UIST '96), 95–96. https://doi.org/10.1145/237091.237105

[7] Kopper, R., Bacim, F., & Bowman, D. A. (2011). Rapid and accurate 3D selection by progressive refinement. 2011 IEEE Symposium on 3D User Interfaces (3DUI), 67–74. https://doi.org/10.1109/3DUI.2011.5759219

[8] Mayer, S., Schwind, V., Schweigert, R., & Henze, N. (2018). The Effect of Offset Correction and Cursor on Mid-Air Pointing in Real and Virtual Environments. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–13. https://doi.org/10.1145/3173574.3174227

[9] Mendes, D., Medeiros, D., Sousa, M., Cordeiro, E., Ferreira, A., & Jorge, J. A. (2017). Design and evaluation of a novel out-of-reach selection technique for VR using iterative refinement. Computers & Graphics, 67, 95–102. https://doi.org/10.1016/j.cag.2017.06.003

[10] Mine, M. R. (1995). Virtual Environment Interaction Techniques. UNC Chapel Hill Computer Science Technical Report TR95-018.

[11] Oviatt, S. (2003). Advances in robust multimodal interface design. IEEE Computer Graphics and Applications, 23(5), 62–68. https://doi.org/10.1109/MCG.2003.1231179

[12] Schmidt, G., Baillot, Y., Brown, D. G., Tomlin, E. B., & Swan, J. E. (2006). Toward Disambiguating Multiple Selections for Frustum-Based Pointing. IEEE Symposium on 3D User Interfaces (3DUI '06), 87–94.

[13] Sharma, R., Pavlovic, V. I., & Huang, T. S. (1998). Toward multimodal human-computer interface. Proceedings of the IEEE, 86(5), 853–869. https://doi.org/10.1109/5.664275

[14] Stingl, R. (2021). Natural Pointing [unpublished].

[15] Tse, E., Hancock, M., & Greenberg, S. (2007). Speech-filtered bubble ray: Improving target acquisition on display walls. Proceedings of the Ninth International Conference on Multimodal Interfaces (ICMI '07), 307. https://doi.org/10.1145/1322192.1322245

[16] Wolf, E., Klüber, S., Zimmerer, C., Lugrin, J.-L., & Latoschik, M. E. (2019). "Paint that object yellow": Multimodal Interaction to Enhance Creativity During Design Tasks in VR. 2019 International Conference on Multimodal Interaction, 195–204. https://doi.org/10.1145/3340555.3353724

[17] Zimmerer, C., Fischbach, M., & Latoschik, M. (2018). Semantic Fusion for Natural Multimodal Interfaces using Concurrent Augmented Transition Networks. Multimodal Technologies and Interaction, 2(4), 81. https://doi.org/10.3390/mti2040081

[18] Zudilova, E. V., Sloot, P. M. A., & Belleman, R. G. (2002). A multi-modal interface for an interactive simulated vascular reconstruction system. Proceedings. Fourth IEEE International Conference on Multimodal Interfaces, 313–318. https://doi.org/10.1109/ICMI.2002.1167013


Contact Persons at the University of Würzburg

Chris Zimmerer (Primary Contact Person)
Mensch-Computer-Interaktion, Universität Würzburg
chris.zimmerer@uni-wuerzburg.de

Martin Fischbach
Mensch-Computer-Interaktion, Universität Würzburg
martin.fischbach@uni-wuerzburg.de
