Human-Computer Interaction

Development of an Evaluation Pipeline for Multimodal Fusion


This project is already assigned.

Motivation

Multimodal fusion has been widely studied in the field of Human-Computer Interaction (HCI), where the integration of heterogeneous data sources such as language, vision and gesture presents significant challenges (Baltrušaitis et al., 2019). Although a variety of fusion methods have been proposed, ranging from declarative fusion approaches to more recent deep learning-based techniques, evaluating them systematically and comparably remains difficult. This is partly due to the lack of unified evaluation standards and consistent experimental settings across models, which limits the reproducibility and comparability of results (Xue et al., 2026). In other domains such as natural language processing, by contrast, standardized benchmarks and evaluation frameworks enable consistent comparison across models and have accelerated research progress (Wang et al., 2018).

Motivated by these limitations, this project proposes a model-agnostic evaluation pipeline that enables structured representation and metric-based comparison of multimodal model outputs, such as predicted intents, referenced objects and associated parameters, providing a foundation for systematic evaluation of multimodal fusion methods.

The pipeline defines clear representations, metrics and evaluation procedures for multimodal outputs. The goal is to establish a structured and reusable framework for assessing multimodal model predictions, independent of specific model architectures. The pipeline is developed in parallel with a comprehensive dataset and benchmarking suite; together, they form the foundation for implementing and evaluating different multimodal fusion approaches. Developing the evaluation framework before the fusion approaches themselves is intended to keep the evaluation unbiased and methodologically sound, because it reduces the risk of tailoring the setup to a specific approach.

This project is therefore based on the following research question:

How can multimodal model outputs be evaluated in a structured and comparable way across different fusion approaches?
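
As a purely illustrative sketch of what a structured, model-agnostic comparison could look like, the following Python example represents each model output as an intent, a set of referenced objects and a set of parameters, and scores predictions against references slot by slot. The class name FusionOutput, the helper functions and the four metrics are assumptions made for this example; the project's actual representations and metrics are still to be defined in the planned work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionOutput:
    """Hypothetical structured output of a fusion model for one input."""
    intent: str
    objects: frozenset = frozenset()  # IDs of the referenced objects
    parameters: tuple = ()            # sorted (name, value) pairs


def make_output(intent, objects=(), parameters=None):
    """Normalise raw predictions into a comparable FusionOutput.

    Assumes parameter names and values are hashable.
    """
    params = tuple(sorted((parameters or {}).items()))
    return FusionOutput(intent, frozenset(objects), params)


def set_f1(predicted, reference):
    """F1 overlap between two sets; defined as 1.0 when both are empty."""
    if not predicted and not reference:
        return 1.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def evaluate(predictions, references):
    """Average per-slot scores over aligned predictions and references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one example is required")
    pairs = list(zip(predictions, references))
    n = len(pairs)
    return {
        "intent_accuracy": sum(p.intent == r.intent for p, r in pairs) / n,
        "object_f1": sum(set_f1(p.objects, r.objects) for p, r in pairs) / n,
        "parameter_f1": sum(
            set_f1(set(p.parameters), set(r.parameters)) for p, r in pairs
        ) / n,
        "exact_match": sum(p == r for p, r in pairs) / n,
    }


# Made-up example data: the first prediction references one extra object.
references = [
    make_output("move", ["cube_1"], {"target": "table"}),
    make_output("delete", ["lamp_2"]),
]
predictions = [
    make_output("move", ["cube_1", "cube_2"], {"target": "table"}),
    make_output("delete", ["lamp_2"]),
]

scores = evaluate(predictions, references)
print(scores)
```

Because the scoring functions only see the normalised FusionOutput objects, any fusion approach that can emit an intent, objects and parameters could be compared in the same way, regardless of its architecture.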

Planned Work

1. Framework Design

2. Representation and Structuring

3. Metrics Implementation

4. Evaluation Logic

5. Baseline Implementation and Validation

6. Refinement and Demonstration

Outcome

Schedule

References

Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 353–355.

Xue, L., Zhang, C., Xue, K., Liu, X., Wang, G., & Han, Z. (2026). MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 27450–27458.


Contact Persons at the University of Würzburg

Ronja Heinrich (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
ronja.heinrich@uni-wuerzburg.de

Dr. Martin Fischbach
Human-Computer Interaction, Universität Würzburg
martin.fischbach@uni-wuerzburg.de

Prof. Dr. Marc Erich Latoschik
Human-Computer Interaction, Universität Würzburg
marc.latoschik@uni-wuerzburg.de
