Development of an Evaluation Pipeline for Multimodal Fusion
This project is already assigned.
Motivation
Multimodal fusion has been widely studied in the field of Human-Computer Interaction (HCI), where the integration of heterogeneous data sources such as language, vision and gesture presents significant challenges (Baltrušaitis et al., 2019). While a variety of fusion methods have been proposed, ranging from declarative fusion approaches to more recent deep learning-based techniques, systematic and comparable evaluation remains difficult. This is partly due to the lack of unified evaluation standards and consistent experimental settings across models, which limits the reproducibility and comparability of results (Xue et al., 2026). By contrast, in domains such as natural language processing, standardized benchmarks and evaluation frameworks enable consistent comparison across models and have accelerated research progress (Wang et al., 2018).
Motivated by these limitations, this project proposes a model-agnostic evaluation pipeline that enables structured representation and metric-based comparison of multimodal model outputs, such as predicted intents, referenced objects and associated parameters, providing a foundation for systematic evaluation of multimodal fusion methods.
The pipeline supports systematic analysis of multimodal outputs by defining clear representations, metrics and evaluation procedures. The goal is to establish a structured and reusable framework for assessing multimodal model predictions, independent of specific model architectures. The pipeline is developed in parallel with a comprehensive dataset and benchmarking suite; together they form the foundation for implementing and evaluating different multimodal fusion approaches. Developing the evaluation framework at an early stage is intended to keep the evaluation process unbiased and methodologically sound, reducing the risk of tailoring the setup to specific approaches.
This project is therefore based on the following research question:
How can multimodal model outputs be evaluated in a structured and comparable way across different fusion approaches?
Planned Work
1. Framework Design
- Define the overall structure of the evaluation pipeline
- Define the representation format for model predictions and ground truth
- Design and implement a clear API for interacting with the evaluation pipeline
2. Representation and Structuring
- Develop parsing mechanisms to extract structured information from model outputs
- Define schemas for representing predicted intents, objects and parameters
- Handle variability and ambiguity in generated outputs
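To illustrate what such a representation could look like, the following is a minimal sketch in Python. It assumes that model outputs arrive as JSON strings; the class name StructuredOutput, the function parse_output and the field names intent, objects and parameters are hypothetical placeholders, not a format fixed by this project.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class StructuredOutput:
    """Hypothetical schema: one intent, referenced objects, and parameters."""
    intent: Optional[str] = None
    objects: list[str] = field(default_factory=list)
    parameters: dict[str, Any] = field(default_factory=dict)


def parse_output(raw: str) -> StructuredOutput:
    """Parse a raw model output (assumed to be JSON) into the schema.

    Malformed or unexpected input yields an empty StructuredOutput instead
    of raising, so that a parsing failure can be scored as an error later.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return StructuredOutput()
    if not isinstance(data, dict):
        return StructuredOutput()

    # Normalize the intent: tolerate surrounding whitespace and casing.
    intent = data.get("intent")
    intent = intent.strip().lower() if isinstance(intent, str) else None

    # Accept a single object given as a string as well as a list of objects.
    objects = data.get("objects")
    if isinstance(objects, str):
        objects = [objects]
    elif not isinstance(objects, list):
        objects = []
    objects = [str(o).strip().lower() for o in objects]

    # Parameters are expected as a mapping; anything else is dropped.
    parameters = data.get("parameters")
    if not isinstance(parameters, dict):
        parameters = {}
    parameters = {str(k).strip().lower(): v for k, v in parameters.items()}

    return StructuredOutput(intent, objects, parameters)
```

Returning an empty structure for unparseable output is one possible design choice; an alternative would be a dedicated parse-failure flag that the error analysis can count separately.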
3. Metrics Implementation
- Survey evaluation metrics, both in the MMI domain and in broader fields such as machine learning
- Define matching strategies for comparing predictions and ground truth (e.g. exact match, soft / fuzzy match)
- Define evaluation targets at the component level (e.g. intent, object, parameter)
- Implement core evaluation metrics (e.g. accuracy, precision, recall)
- Measure processing and generation time as additional performance indicators
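As a rough illustration of the matching strategies and metrics listed above, the sketch below compares predicted and ground-truth object labels. It is written under several assumptions: labels are plain strings, fuzzy matching uses the similarity ratio from Python's difflib with an arbitrary threshold of 0.8, and precision and recall are defined as 0.0 when the respective list is empty. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def exact_match(pred: str, gold: str) -> bool:
    """Exact match after trimming whitespace and lowercasing."""
    return pred.strip().lower() == gold.strip().lower()


def fuzzy_match(pred: str, gold: str, threshold: float = 0.8) -> bool:
    """Soft match: string similarity ratio at or above an assumed threshold."""
    ratio = SequenceMatcher(None, pred.strip().lower(), gold.strip().lower()).ratio()
    return ratio >= threshold


def precision_recall(predicted, gold, match=exact_match):
    """Precision and recall using greedy one-to-one matching.

    Each ground-truth item can be matched by at most one prediction.
    """
    unmatched_gold = list(gold)
    true_positives = 0
    for p in predicted:
        for g in unmatched_gold:
            if match(p, g):
                unmatched_gold.remove(g)
                true_positives += 1
                break  # move on to the next prediction
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

For example, precision_recall(["chairs"], ["chair"]) counts no match under exact matching, whereas passing match=fuzzy_match accepts the pair. Greedy matching is only one option; an optimal assignment could be substituted if partial matches compete for the same ground-truth item.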
4. Evaluation Logic
- Define and specify an error taxonomy for multimodal predictions
- Implement matching strategies between predictions and ground truth
- Classify and analyze different types of prediction errors (e.g. intent, object and parameter mismatches)
- Investigate the behavior and sensitivity of evaluation metrics
- Identify limitations of the evaluation setup and derive improvement strategies
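One way an error taxonomy could be made operational is sketched below. The four categories and the dictionary layout (keys intent, objects and parameters) are assumptions chosen for illustration; the actual taxonomy is to be defined as part of the project.

```python
from enum import Enum


class ErrorType(Enum):
    """Hypothetical component-level error categories."""
    INTENT_MISMATCH = "intent_mismatch"
    MISSING_OBJECT = "missing_object"
    SPURIOUS_OBJECT = "spurious_object"
    PARAMETER_MISMATCH = "parameter_mismatch"


def classify_errors(pred: dict, gold: dict) -> set[ErrorType]:
    """Return the set of error types found when comparing pred with gold."""
    errors = set()

    if pred.get("intent") != gold.get("intent"):
        errors.add(ErrorType.INTENT_MISMATCH)

    pred_objects = set(pred.get("objects", []))
    gold_objects = set(gold.get("objects", []))
    if gold_objects - pred_objects:  # expected objects that were not predicted
        errors.add(ErrorType.MISSING_OBJECT)
    if pred_objects - gold_objects:  # predicted objects that were not expected
        errors.add(ErrorType.SPURIOUS_OBJECT)

    if pred.get("parameters", {}) != gold.get("parameters", {}):
        errors.add(ErrorType.PARAMETER_MISMATCH)

    return errors
```

Because a single prediction can fall into several categories at once, the sketch returns a set rather than a single label; aggregating these sets over a dataset gives a simple error profile per model.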
5. Baseline Implementation and Validation
- Implement a simple baseline or dummy predictor (fusion method) compatible with the defined API
- Use the baseline to validate the evaluation pipeline and workflow
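A dummy predictor of this kind might look like the following sketch. It assumes a hypothetical interface in which every fusion method exposes a predict method that maps one input sample to a dictionary with an intent key; the names FusionModel, MajorityIntentBaseline and intent_accuracy are illustrative and not part of any existing API.

```python
from collections import Counter
from typing import Protocol


class FusionModel(Protocol):
    """Assumed minimal interface a fusion method has to implement."""

    def predict(self, sample: dict) -> dict: ...


class MajorityIntentBaseline:
    """Dummy predictor: always returns the most frequent training intent."""

    def __init__(self) -> None:
        self.intent = None

    def fit(self, gold_outputs: list[dict]) -> None:
        # Requires at least one ground-truth item.
        counts = Counter(g["intent"] for g in gold_outputs)
        self.intent = counts.most_common(1)[0][0]

    def predict(self, sample: dict) -> dict:
        # The input sample is ignored on purpose.
        return {"intent": self.intent, "objects": [], "parameters": {}}


def intent_accuracy(model: FusionModel, samples: list[dict], gold_outputs: list[dict]) -> float:
    """Fraction of samples whose predicted intent equals the ground truth."""
    if not samples:
        return 0.0
    correct = sum(
        model.predict(s)["intent"] == g["intent"]
        for s, g in zip(samples, gold_outputs)
    )
    return correct / len(samples)
```

A baseline that ignores its input is useful for validation precisely because its expected score is known in advance: the intent accuracy should equal the share of the majority intent in the data, so any deviation points to a defect in the pipeline rather than in the model.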
6. Refinement and Demonstration
- Refine the system based on validation results
- Demonstrate the functionality and interpretability of the evaluation process
Outcome
- Functional and extensible evaluation pipeline for multimodal fusion tasks
- Reusable API for integration into other systems and teaching contexts
- Set of core evaluation metrics for systematic analysis
- Initial insights into model behavior and evaluation challenges
- Baseline implementation for validation and comparison
- Comparability across different multimodal fusion paradigms through a unified evaluation framework
Schedule
References
Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.
Xue, L., Zhang, C., Xue, K., Liu, X., Wang, G., & Han, Z. (2026). MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 27450–27458.
Contact Persons at the University of Würzburg
Ronja Heinrich (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
ronja.heinrich@uni-wuerzburg.de
Dr. Martin Fischbach
Human-Computer Interaction, Universität Würzburg
martin.fischbach@uni-wuerzburg.de
Prof. Dr. Marc Erich Latoschik
Human-Computer Interaction, Universität Würzburg
marc.latoschik@uni-wuerzburg.de