Human-Computer Interaction

Data Collection and Preparation for Multimodal Fusion in VR Interaction


This project is already assigned.

Motivation

With Extended Reality (XR) systems becoming increasingly affordable and widespread, there is a growing demand for intuitive and effective interaction techniques. Multimodal interaction in Virtual Reality (VR), particularly the combination of speech and gestures, is considered a natural and promising interaction paradigm, allowing users to communicate more intuitively by combining complementary modalities (Oviatt & Cohen, 2000; McNeill & Duncan, 2000).

Despite this potential, integrating these modalities, i.e. multimodal fusion, remains a central challenge in HCI research (Baltrušaitis et al., 2019; Heinrich et al., 2025). User input is often context-dependent, dynamic and asynchronous, which makes it difficult to align and interpret the modalities.

At the same time, research on multimodal fusion is held back by a lack of datasets that capture multimodal interaction in VR settings with sufficient structure and annotation detail (Müller et al., 2025). For example, recent large-scale datasets such as Ego4D provide rich multimodal recordings and annotated interactions, but they focus primarily on perception tasks and lack structured representations of multimodal interaction, such as aligned speech–gesture intents (Grauman et al., 2022). A further challenge lies in the design of existing multimodal systems and studies. Many approaches target narrowly defined classification tasks, such as sentiment or emotion recognition (e.g. CMU-MOSEI). These tasks do not require structured representations of interaction, so the resulting approaches transfer poorly to more complex multimodal settings (Zadeh et al., 2018; Baltrušaitis et al., 2019).

To address these challenges, this project aims to design and conduct a multimodal VR study using a Wizard-of-Oz setup and to record and construct a structured dataset capturing natural interaction behavior. This dataset will provide a foundation for future research on multimodal fusion and evaluation.


HCI Project: Data Collection and Preparation

This work is guided by the following research question:

How can multimodal interaction data in VR be captured, structured and annotated to support the systematic study and evaluation of multimodal fusion methods?

To answer this research question, a multimodal VR study using a Wizard-of-Oz setup will be conducted, focusing on how multimodal interaction data can be captured, structured and represented in immersive environments.
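
As a rough illustration of what "structured" could mean here, the following Python sketch groups time-stamped speech and gesture events into records by temporal overlap. It is only a sketch under stated assumptions: the names Event, MultimodalRecord, overlaps and align, the field layout, the 0.5 s tolerance and the sample events are all hypothetical and are not part of the project's actual design.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    """A time-stamped unimodal event (hypothetical schema)."""
    modality: str   # "speech" or "gesture"
    start: float    # seconds since session start
    end: float      # seconds since session start
    label: str      # e.g. transcribed words or a gesture class


@dataclass
class MultimodalRecord:
    """One speech event plus the gestures that co-occur with it."""
    speech: Event
    gestures: List[Event] = field(default_factory=list)
    intent: Optional[str] = None  # left empty here; assigned during annotation


def overlaps(a: Event, b: Event, tolerance: float = 0.5) -> bool:
    """True if the two intervals overlap once widened by `tolerance` seconds."""
    return a.start <= b.end + tolerance and b.start <= a.end + tolerance


def align(events: List[Event], tolerance: float = 0.5) -> List[MultimodalRecord]:
    """Attach to each speech event every gesture that overlaps it in time."""
    speech = sorted(
        (e for e in events if e.modality == "speech"), key=lambda e: e.start
    )
    gestures = [e for e in events if e.modality == "gesture"]
    return [
        MultimodalRecord(s, [g for g in gestures if overlaps(s, g, tolerance)])
        for s in speech
    ]


# Made-up sample events, for illustration only.
events = [
    Event("speech", 1.0, 1.8, "put that"),
    Event("gesture", 1.2, 1.6, "point"),
    Event("speech", 2.4, 2.9, "there"),
    Event("gesture", 2.7, 3.1, "point"),
    Event("gesture", 9.0, 9.5, "wave"),  # no nearby speech, stays unattached
]
records = align(events)
```

A fixed overlap window is the simplest possible alignment rule. Real recordings would likely need clock synchronization across devices and a more careful treatment of gestures that precede or follow the accompanying speech.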

Current State


Planned Work

1. System Integration and Study Design


2. Pilot testing and refinement of study procedure


3. Data Collection


4. Data preprocessing, synchronization and cleaning


5. Annotation and structuring of the dataset

6. Benchmark definition (optional)

Outcome

Schedule

References

Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.

Grauman, K., et al. (2022). Ego4D: Around the world in 3,000 hours of egocentric video. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18995–19012.

Heinrich, R., Zimmerer, C., Fischbach, M., & Latoschik, M. E. (2025). A systematic review of fusion methods for the user-centered design of multimodal interfaces. Proceedings of the 27th International Conference on Multimodal Interaction, 485–495.

McNeill, D., & Duncan, S. D. (2000). Growth points in thinking-for-speaking. In D. McNeill (Ed.), Language and gesture (pp. 141–161). Cambridge University Press.

Müller, F., et al. (2025). AMIS: An audiovisual dataset for multimodal XR research. Bauhaus-Universität Weimar.

Oviatt, S., & Cohen, P. (2000). Perceptual user interfaces: Multimodal interfaces that process what comes naturally. Communications of the ACM, 43(3), 45–53.

Zadeh, A., Liang, P. P., Poria, S., Cambria, E., & Morency, L.-P. (2018). Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 2236–2246).


Contact Persons at the University of Würzburg

Ronja Heinrich (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
ronja.heinrich@uni-wuerzburg.de

Dr. Martin Fischbach
Human-Computer Interaction, Universität Würzburg
martin.fischbach@uni-wuerzburg.de

Prof. Dr. Marc Erich Latoschik
Human-Computer Interaction, Universität Würzburg
marc.latoschik@uni-wuerzburg.de
