Data Collection and Preparation for Multimodal Fusion in VR Interaction
This project is already assigned.
Motivation
With Extended Reality (XR) systems becoming increasingly affordable and widespread, there is a growing demand for intuitive and effective interaction techniques. Multimodal interaction in Virtual Reality (VR), particularly the combination of speech and gestures, is considered a natural and promising interaction paradigm, allowing users to communicate more intuitively by combining complementary modalities (Oviatt & Cohen, 2000; McNeill & Duncan, 2000).
Despite its potential, the integration of these modalities, i.e. multimodal fusion, remains a central challenge in human-computer interaction (HCI) research (Baltrušaitis et al., 2019; Heinrich et al., 2025). User input is often context-dependent, dynamic and asynchronous, which makes the modalities difficult to align and interpret.
At the same time, research on multimodal fusion is constrained by a lack of datasets that capture multimodal interaction in VR settings with sufficient structure and annotation detail (Müller et al., 2025). For example, recent large-scale datasets such as Ego4D provide rich multimodal recordings and annotated interactions, but they focus primarily on perception tasks and lack structured representations of multimodal interaction, such as aligned speech–gesture intents (Grauman et al., 2022). A further challenge lies in the design of existing multimodal systems and studies. Many approaches focus on narrowly defined classification tasks, such as sentiment or emotion recognition (e.g. as in CMU-MOSEI), which lack structured representations of interaction and are therefore limited in their applicability to more complex multimodal settings (Zadeh et al., 2018; Baltrušaitis et al., 2019).
To address these challenges, this project aims to design and conduct a multimodal VR study using a Wizard-of-Oz setup and to record and construct a structured dataset capturing natural interaction behavior. This dataset will provide a foundation for future research on multimodal fusion and evaluation.
HCI Project: Data Collection and Preparation
This work is guided by the following research question:
How can multimodal interaction data in VR be captured, structured and annotated to support the systematic study and evaluation of multimodal fusion methods?
To answer this research question, a multimodal VR study using a Wizard-of-Oz setup will be conducted, focusing on how multimodal interaction data can be captured, structured and represented in immersive environments.
Current State
- Recording environment for multimodal VR interaction that captures the Unity scene state, the user's head-mounted display (HMD) view and a third-person view, including audio
- Wizard-of-Oz Unity environment with minigames for controlled interaction scenarios
- Initial integration of the recording pipeline into the system is completed
Planned Work
1. System Integration and Study Design
- Finalize integration of the recording system into the Wizard-of-Oz setup
- Ensure reliable capture of multimodal data (speech, gestures, scene state)
- Validate synchronization and completeness of recorded data
- Design experimental protocol (instructions, flow, conditions)
- Prepare questionnaires (e.g. LimeSurvey)
2. Pilot Testing and Refinement of the Study Procedure
- Conduct pilot session(s) to evaluate the usability and stability of the system
- Identify technical issues in recording, synchronization and interaction handling
- Assess clarity and effectiveness of task instructions and study flow
- Adjust experimental setup and protocol to ensure smooth study execution
- Validate overall data quality before starting the main study
3. Data Collection
- Conduct user study in VR using the Wizard-of-Oz setup
- Record multimodal interaction sessions
- Document study execution and ensure data quality and consistency
4. Data Preprocessing, Synchronization and Cleaning
- Preprocess raw data to remove artifacts and inconsistencies
- Align and synchronize multimodal data streams, including speech, gestures and system events
- Standardize data formats and structures for further processing
- Handle missing or corrupted data segments where necessary
- Prepare cleaned datasets for annotation
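The alignment step in this work package can be illustrated with a minimal sketch. It assumes that all streams carry timestamps on a shared session clock, which the recording pipeline would have to provide; the `Event` type, its fields, the `align_nearest` function and the tolerance value are hypothetical and only serve as an example, not as the method chosen for the project.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One timestamped item from a recorded stream (hypothetical structure)."""
    t: float       # seconds on a shared session clock (assumption)
    stream: str    # e.g. "speech", "gesture", "system"
    payload: str   # free-form content, e.g. a transcript or gesture label


def align_nearest(
    reference: List[Event], other: List[Event], tolerance: float
) -> List[Tuple[Event, Optional[Event]]]:
    """Pair each reference event with the nearest event in `other`.

    An event only counts as a match if it lies within `tolerance` seconds;
    otherwise the reference event is paired with None.
    """
    other_sorted = sorted(other, key=lambda e: e.t)
    times = [e.t for e in other_sorted]
    pairs = []
    for ref in reference:
        i = bisect_left(times, ref.t)
        # Only the neighbours directly before and after can be the nearest.
        candidates = [other_sorted[j] for j in (i - 1, i) if 0 <= j < len(other_sorted)]
        best = min(candidates, key=lambda e: abs(e.t - ref.t), default=None)
        if best is not None and abs(best.t - ref.t) > tolerance:
            best = None
        pairs.append((ref, best))
    return pairs


if __name__ == "__main__":
    speech = [Event(1.0, "speech", "put that"), Event(5.0, "speech", "there")]
    gestures = [Event(0.9, "gesture", "point"), Event(1.4, "gesture", "grab")]
    for ref, match in align_nearest(speech, gestures, tolerance=0.5):
        print(ref.payload, "->", match.payload if match else "no gesture in window")
```

Nearest-neighbour matching within a fixed window is only one possible strategy. Speech and gesture are often asynchronous, so interval overlap or a modality-specific offset may turn out to be more suitable once pilot data is available.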
5. Annotation and Structuring of the Dataset
- Develop a consistent annotation and labeling scheme for multimodal interaction data
- Define annotation guidelines to ensure reproducibility and clarity
- Structure the dataset into a standardized and reusable format
- Apply annotations to the collected dataset according to the defined scheme
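To make the idea of a standardized, reusable annotation format more concrete, the following sketch shows what a single annotated interaction could look like as a typed record with JSON export. The actual scheme is a deliverable of this project and is not yet defined, so every name here (`Span`, `InteractionAnnotation`, the fields and the example labels) is a placeholder assumption.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Span:
    """A labeled time interval within one modality (hypothetical)."""
    start: float  # seconds from session start
    end: float
    label: str


@dataclass
class InteractionAnnotation:
    """One annotated multimodal interaction (hypothetical layout)."""
    session_id: str
    intent: str                                        # e.g. "move_object"
    speech: List[Span] = field(default_factory=list)
    gestures: List[Span] = field(default_factory=list)
    referents: List[str] = field(default_factory=list)  # scene object ids

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; empty if none were found."""
        problems = []
        if not self.intent:
            problems.append("missing intent")
        for name in ("speech", "gestures"):
            for span in getattr(self, name):
                if span.end < span.start:
                    problems.append(f"{name} span '{span.label}' ends before it starts")
        return problems


def to_json(annotation: InteractionAnnotation) -> str:
    """Serialize an annotation to a JSON string with stable key order."""
    return json.dumps(asdict(annotation), ensure_ascii=False, sort_keys=True)


if __name__ == "__main__":
    record = InteractionAnnotation(
        session_id="pilot-01",
        intent="move_object",
        speech=[Span(1.0, 1.6, "put that there")],
        gestures=[Span(0.9, 1.2, "point"), Span(1.4, 1.8, "point")],
        referents=["cube_3", "table_1"],
    )
    print(record.validate())
    print(to_json(record))
```

A simple validation method of this kind could support the annotation guidelines by catching inconsistent entries early, before they reach the structured dataset.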
6. Benchmark Definition (Optional)
- Define evaluation tasks based on the collected multimodal data
- Specify input and output formats for benchmarking multimodal fusion approaches
- Construct a structured benchmark dataset based on annotated interaction data
- Document the benchmark setup to ensure reproducibility
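As an illustration of what specifying input and output formats for a benchmark could involve, the sketch below defines one hypothetical evaluation task: predicting the intent of an interaction from its aligned speech and gesture input, scored by exact-match accuracy. The example layout (`input`, `target`, `intent`), the metric and the baseline are assumptions made for this sketch and are not part of the project plan.

```python
from typing import Callable, Iterable, Mapping


def intent_accuracy(
    examples: Iterable[Mapping],
    predict: Callable[[Mapping], str],
) -> float:
    """Fraction of examples whose predicted intent equals the annotated one.

    Each example is assumed to look like
    {"input": {...modalities...}, "target": {"intent": "<label>"}}.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("benchmark contains no examples")
    correct = sum(
        1 for ex in examples if predict(ex["input"]) == ex["target"]["intent"]
    )
    return correct / len(examples)


def always_select(_inputs: Mapping) -> str:
    """Trivial baseline that ignores its input (for illustration only)."""
    return "select"


if __name__ == "__main__":
    benchmark = [
        {"input": {"speech": "take this", "gesture": "point"},
         "target": {"intent": "select"}},
        {"input": {"speech": "put it there", "gesture": "point"},
         "target": {"intent": "move_object"}},
    ]
    print(intent_accuracy(benchmark, always_select))
```

Fixing the interface of `predict` in this way would allow different fusion approaches to be compared on the same examples, which is one route to the reproducibility the benchmark documentation is meant to ensure.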
Outcome
- Structured and synchronized multimodal dataset capturing speech, gestures and system interactions in realistic task scenarios
- Validated data collection and recording pipeline
- Annotation and labeling scheme for multimodal interaction data
- Foundation for future research on multimodal fusion and evaluation
Schedule
References
Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.
Grauman, K., et al. (2022). Ego4D: Around the world in 3,000 hours of egocentric video. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18995–19012.
Heinrich, R., Zimmerer, C., Fischbach, M., & Latoschik, M. E. (2025). A systematic review of fusion methods for the user-centered design of multimodal interfaces. Proceedings of the 27th International Conference on Multimodal Interaction, 485–495.
McNeill, D., & Duncan, S. D. (2000). Growth points in thinking-for-speaking. In D. McNeill (Ed.), Language and gesture (pp. 141–161). Cambridge University Press.
Müller, F., et al. (2025). AMIS: An audiovisual dataset for multimodal XR research. Bauhaus-Universität Weimar.
Oviatt, S., & Cohen, P. (2000). Perceptual user interfaces: Multimodal interfaces that process what comes naturally. Communications of the ACM, 43(3), 45–53.
Zadeh, A., Liang, P. P., Poria, S., Cambria, E., & Morency, L.-P. (2018). Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 2236–2246).
Contact Persons at the University of Würzburg
Ronja Heinrich (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
ronja.heinrich@uni-wuerzburg.de
Dr. Martin Fischbach
Human-Computer Interaction, Universität Würzburg
martin.fischbach@uni-wuerzburg.de
Prof. Dr. Marc Erich Latoschik
Human-Computer Interaction, Universität Würzburg
marc.latoschik@uni-wuerzburg.de