Data Collection and Preparation for Multimodal Fusion in VR Interaction

This project is already assigned.

Motivation

With Extended Reality (XR) systems becoming increasingly affordable and widespread, there is a growing demand for intuitive and effective interaction techniques. Multimodal interaction in Virtual Reality (VR), particularly the combination of speech and gestures, is considered a natural and promising interaction paradigm, allowing users to communicate more intuitively by combining complementary modalities (Oviatt & Cohen, 2000; McNeill & Duncan, 2000).

Despite this potential, integrating these modalities, i.e. multimodal fusion, remains a central challenge in HCI research (Baltrušaitis et al., 2019; Heinrich et al., 2025). User input is often context-dependent, dynamic and asynchronous, which makes it difficult to align and interpret the modalities.

At the same time, research on multimodal fusion is limited by the lack of datasets that capture multimodal interaction in VR settings with sufficient structure and annotation detail (Müller et al., 2025). For example, recent large-scale datasets such as Ego4D provide rich multimodal recordings and annotated interactions, but they primarily target perception tasks and lack structured representations of multimodal interaction, such as aligned speech–gesture intents (Grauman et al., 2022). A further challenge lies in the design of existing multimodal systems and studies. Many approaches focus on narrowly defined classification tasks, such as sentiment or emotion recognition (e.g. CMU-MOSEI), which also lack structured representations of interaction and are therefore of limited use in more complex multimodal settings (Zadeh et al., 2018; Baltrušaitis et al., 2019).

To address these challenges, this project aims to design and conduct a multimodal VR study using a Wizard-of-Oz setup and to record and construct a structured dataset capturing natural interaction behavior. This dataset will provide a foundation for future research on multimodal fusion and evaluation.

HCI Project: Data Collection and Preparation

This work is guided by the following research question:

How can multimodal interaction data in VR be captured, structured and annotated to support the systematic study and evaluation of multimodal fusion methods?

To answer this research question, a multimodal VR study using a Wizard-of-Oz setup will be conducted, focusing on how multimodal interaction data can be captured, structured and represented in immersive environments.

Current State

Recording environment for multimodal VR interaction that captures the Unity scene state, the user's HMD view and a third-person view, including audio
Wizard-of-Oz Unity environment with minigames for controlled interaction scenarios
Initial integration of the recording pipeline into the system is completed

Planned Work

1. System Integration and Study Design

Finalize integration of the recording system into the Wizard-of-Oz setup
Ensure reliable capture of multimodal data (speech, gestures, scene state)
Validate synchronization and completeness of recorded data
Design experimental protocol (instructions, flow, conditions)
Prepare questionnaires (e.g. LimeSurvey)
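
The completeness validation listed above could, for instance, include a check for dropouts in a recorded stream. The following is a minimal sketch, not the project's actual pipeline: it assumes each stream is available as a list of timestamps in seconds, and the function name `find_gaps`, the threshold and the sample values are hypothetical.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[float], max_gap: float) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where consecutive samples lie more than max_gap seconds apart."""
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur < prev:
            # A recording clock that runs backwards indicates a logging error.
            raise ValueError(f"timestamps not monotonic: {cur} follows {prev}")
        if cur - prev > max_gap:
            gaps.append((prev, cur))
    return gaps


# Hypothetical scene-state log sampled at roughly 10 Hz with one dropout.
scene_ts = [0.0, 0.1, 0.2, 0.3, 0.9, 1.0]
print(find_gaps(scene_ts, max_gap=0.25))  # [(0.3, 0.9)]
```

Running such a check per stream after each pilot session would flag recordings that need to be repeated before the main study starts.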

2. Pilot Testing and Refinement of Study Procedure

Conduct pilot session(s) to evaluate the usability and stability of the system
Identify technical issues in recording, synchronization and interaction handling
Assess clarity and effectiveness of task instructions and study flow
Adjust experimental setup and protocol to ensure smooth study execution
Validate overall data quality before starting the main study

3. Data Collection

Conduct user study in VR using the Wizard-of-Oz setup
Record multimodal interaction sessions
Document study execution and ensure data quality and consistency

4. Data Preprocessing, Synchronization and Cleaning

Preprocess raw data to remove artifacts and inconsistencies
Align and synchronize multimodal data streams, including speech, gestures and system events
Standardize data formats and structures for further processing
Handle missing or corrupted data segments where necessary
Prepare cleaned datasets for annotation
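
One simple way to approach the alignment step above is nearest-timestamp matching with a tolerance. The sketch below illustrates this under stated assumptions: all streams share a common clock in seconds, the second stream is sorted, and the names `nearest_index` and `align_events`, the tolerance and the sample values are hypothetical. The proposal does not prescribe a particular alignment method.

```python
from bisect import bisect_left
from typing import List, Optional


def nearest_index(sorted_ts: List[float], t: float) -> int:
    """Index of the timestamp in sorted_ts that is closest to t (sorted_ts must be non-empty)."""
    i = bisect_left(sorted_ts, t)
    if i == 0:
        return 0
    if i == len(sorted_ts):
        return len(sorted_ts) - 1
    # Pick whichever neighbour is closer to t.
    return i if sorted_ts[i] - t < t - sorted_ts[i - 1] else i - 1


def align_events(
    reference_ts: List[float], other_ts: List[float], tolerance: float
) -> List[Optional[int]]:
    """For each reference timestamp, the index of the nearest other timestamp, or None if too far away."""
    if not other_ts:
        return [None] * len(reference_ts)
    result: List[Optional[int]] = []
    for t in reference_ts:
        j = nearest_index(other_ts, t)
        result.append(j if abs(other_ts[j] - t) <= tolerance else None)
    return result


# Hypothetical onsets of speech segments and gesture events, in seconds.
speech_onsets = [1.00, 2.50, 4.00]
gesture_onsets = [0.95, 2.90, 4.02]
print(align_events(speech_onsets, gesture_onsets, tolerance=0.2))  # [0, None, 2]
```

Unmatched entries (`None`) are kept explicit so that missing or asynchronous input can be handled in a later step rather than silently dropped.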

5. Annotation and Structuring of the Dataset

Develop a consistent annotation and labeling scheme for multimodal interaction data
Define annotation guidelines to ensure reproducibility and clarity
Structure the dataset into a standardized and reusable format
Apply annotations to the collected dataset according to the defined scheme
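
To make the idea of a structured, reusable annotation format more concrete, the following sketch shows one possible record layout. It is purely illustrative: the annotation scheme is yet to be developed in this project, and the class names `Span` and `InteractionAnnotation`, their fields and the example labels are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Span:
    """A labelled time interval in one modality (times in seconds on the session clock)."""
    modality: str
    start: float
    end: float
    label: str


@dataclass
class InteractionAnnotation:
    """One annotated interaction: an intent plus the modality spans that express it."""
    session_id: str
    intent: str
    spans: List[Span] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict converts nested dataclasses, including those inside lists.
        return json.dumps(asdict(self), sort_keys=True)


example = InteractionAnnotation(
    session_id="S01",
    intent="move_object",
    spans=[
        Span("speech", 1.0, 1.8, "put that there"),
        Span("gesture", 1.1, 1.6, "pointing"),
    ],
)
restored = json.loads(example.to_json())
print(restored["intent"], len(restored["spans"]))  # move_object 2
```

Serializing to plain JSON keeps the records independent of any particular annotation tool, which is one way to support reuse.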

6. Benchmark Definition (optional)

Define evaluation tasks based on the collected multimodal data
Specify input and output formats for benchmarking multimodal fusion approaches
Construct a structured benchmark dataset based on annotated interaction data
Document the benchmark setup to ensure reproducibility
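
As an illustration of what a specified input and output format might look like, the sketch below scores a fusion method by exact-match intent accuracy. This is an assumed example task, not one defined by the project: the function name `intent_accuracy`, the item identifiers and the intent labels are hypothetical, and the actual evaluation tasks remain to be defined.

```python
from typing import Dict


def intent_accuracy(gold: Dict[str, str], predicted: Dict[str, str]) -> float:
    """Fraction of gold items whose predicted intent matches exactly; missing predictions count as wrong."""
    if not gold:
        raise ValueError("gold annotations must not be empty")
    correct = sum(1 for item_id, intent in gold.items() if predicted.get(item_id) == intent)
    return correct / len(gold)


# Hypothetical gold annotations and system output, keyed by interaction id.
gold = {
    "S01-001": "move_object",
    "S01-002": "select_object",
    "S01-003": "delete_object",
    "S01-004": "move_object",
}
predicted = {
    "S01-001": "move_object",
    "S01-002": "move_object",
    "S01-003": "delete_object",
}
print(intent_accuracy(gold, predicted))  # 0.5
```

Keying both inputs by interaction id makes the expected output format explicit and lets missing predictions be penalized consistently.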

Outcome

Structured and synchronized multimodal dataset capturing speech, gestures and system interactions in realistic task scenarios
Validated data collection and recording pipeline
Annotation and labeling scheme for multimodal interaction data
Foundation for future research on multimodal fusion and evaluation

Schedule

References

Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.

Grauman, K., et al. (2022). Ego4D: Around the world in 3,000 hours of egocentric video. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18995–19012.

Heinrich, R., Zimmerer, C., Fischbach, M., & Latoschik, M. E. (2025). A systematic review of fusion methods for the user-centered design of multimodal interfaces. Proceedings of the 27th International Conference on Multimodal Interaction, 485–495.

McNeill, D., & Duncan, S. D. (2000). Growth points in thinking-for-speaking. In D. McNeill (Ed.), Language and gesture (pp. 141–161). Cambridge University Press.

Müller, F., et al. (2025). AMIS: An audiovisual dataset for multimodal XR research. Bauhaus-Universität Weimar.

Oviatt, S., & Cohen, P. (2000). Perceptual user interfaces: Multimodal interfaces that process what comes naturally. Communications of the ACM, 43(3), 45–53.

Zadeh, A., Liang, P. P., Poria, S., Cambria, E., & Morency, L.-P. (2018). Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2236–2246.

Contact Persons at the University of Würzburg

Ronja Heinrich (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
ronja.heinrich@uni-wuerzburg.de

Dr. Martin Fischbach
Human-Computer Interaction, Universität Würzburg
martin.fischbach@uni-wuerzburg.de

Prof. Dr. Marc Erich Latoschik
Human-Computer Interaction, Universität Würzburg
marc.latoschik@uni-wuerzburg.de
