Development of an Evaluation Pipeline for Multimodal Fusion

This project is already assigned.

Motivation

Multimodal fusion has been widely studied in the field of Human-Computer Interaction (HCI), where the integration of heterogeneous data sources such as language, vision and gesture presents significant challenges (Baltrušaitis et al., 2019). While a variety of fusion methods have been proposed, ranging from declarative fusion approaches to more recent deep learning-based techniques, systematic and comparable evaluation remains difficult. This is partly due to the lack of unified evaluation standards and consistent experimental settings across models, which limits the reproducibility and comparability of results (Xue et al., 2026). In contrast, in other domains such as natural language processing, standardized benchmarks and evaluation frameworks enable consistent comparison across models and have accelerated research progress (Wang et al., 2018).

Motivated by these limitations, this project proposes a model-agnostic evaluation pipeline that enables structured representation and metric-based comparison of multimodal model outputs, such as predicted intents, referenced objects and associated parameters, providing a foundation for systematic evaluation of multimodal fusion methods.

The pipeline defines clear representations, metrics and evaluation procedures for multimodal outputs. The goal is to establish a structured and reusable framework for assessing multimodal model predictions, independent of specific model architectures. The pipeline is developed in parallel with a comprehensive dataset and benchmarking suite; together they form the foundation for implementing and evaluating different multimodal fusion approaches. Developing the evaluation framework before the fusion approaches themselves is intended to keep the evaluation process unbiased and methodologically sound, reducing the risk of tailoring the setup to specific approaches.

This project is therefore based on the following research question:

How can multimodal model outputs be evaluated in a structured and comparable way across different fusion approaches?

Planned Work

1. Framework Design

Define the overall structure of the evaluation pipeline
Define the representation format for model predictions and ground truth
Design and implement a clear API for interacting with the evaluation pipeline
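
One possible shape for such an API is sketched below, purely as an illustration. The names Interpretation, FusionModel, evaluate and AlwaysSelect, as well as the field layout, are assumptions made for this sketch and are not part of the project specification. The sketch uses one shared representation for predictions and ground truth and requires a fusion method to provide only a predict() method, which keeps the pipeline model-agnostic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Protocol, Tuple


@dataclass(frozen=True)
class Interpretation:
    """Hypothetical shared format for predictions and ground truth."""
    intent: str
    objects: frozenset = frozenset()   # identifiers of referenced objects
    parameters: tuple = ()             # (name, value) pairs


class FusionModel(Protocol):
    """Any object with a matching predict() method can be evaluated."""
    def predict(self, inputs: dict) -> Interpretation: ...


# A metric maps (prediction, ground truth) to a score for one sample.
Metric = Callable[[Interpretation, Interpretation], float]


def evaluate(
    model: FusionModel,
    samples: Iterable[Tuple[dict, Interpretation]],
    metrics: Dict[str, Metric],
) -> Dict[str, float]:
    """Average every metric over all (inputs, ground truth) samples."""
    samples = list(samples)
    totals = {name: 0.0 for name in metrics}
    for inputs, truth in samples:
        prediction = model.predict(inputs)
        for name, metric in metrics.items():
            totals[name] += metric(prediction, truth)
    n = len(samples)
    return {name: (total / n if n else 0.0) for name, total in totals.items()}


class AlwaysSelect:
    """Toy model used only to exercise the interface."""
    def predict(self, inputs: dict) -> Interpretation:
        return Interpretation(intent="select")


samples = [
    ({"speech": "take that"}, Interpretation("select")),
    ({"speech": "put it there"}, Interpretation("move")),
]
report = evaluate(
    AlwaysSelect(),
    samples,
    {"intent_accuracy": lambda p, t: float(p.intent == t.intent)},
)
print(report)
```

Passing metrics as a dictionary of callables is one design option among several; it lets new metrics be added without changing the evaluation loop.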

2. Representation and Structuring

Develop parsing mechanisms to extract structured information from model outputs
Define schemas for representing predicted intents, objects and parameters
Handle variability and ambiguity in generated outputs
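
As a rough illustration of the parsing step, the sketch below assumes that a model emits a JSON object, possibly wrapped in free text or a code fence, with the keys "intent", "objects" and "parameters". This output format, the ParsedOutput class and the parse_output function are assumptions for the example only. Variability is handled by normalizing case and whitespace and by recording a parse error instead of raising an exception.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ParsedOutput:
    """Hypothetical structured view of one raw model output."""
    intent: Optional[str] = None
    objects: List[str] = field(default_factory=list)
    parameters: Dict[str, str] = field(default_factory=dict)
    parse_error: Optional[str] = None


def _norm(value) -> str:
    """Normalize a value to a lower-case string without outer whitespace."""
    return str(value).strip().lower()


def parse_output(raw: str) -> ParsedOutput:
    # Take the span from the first "{" to the last "}" so that
    # surrounding prose or code fences are ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return ParsedOutput(parse_error="no JSON object found")
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        return ParsedOutput(parse_error=f"invalid JSON: {exc.msg}")

    intent = data.get("intent")
    objects = data.get("objects") or []
    if not isinstance(objects, list):   # tolerate a single object value
        objects = [objects]
    parameters = data.get("parameters") or {}
    if not isinstance(parameters, dict):  # ignore malformed parameter blocks
        parameters = {}

    return ParsedOutput(
        intent=_norm(intent) if intent is not None else None,
        objects=[_norm(o) for o in objects],
        parameters={_norm(k): _norm(v) for k, v in parameters.items()},
    )


raw = 'Sure! {"intent": "Move", "objects": "Chair", "parameters": {"Target": "Window"}}'
parsed = parse_output(raw)
print(parsed)
```

Keeping the parse error as data rather than as an exception means unparseable outputs can later be counted as their own error category.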

3. Metrics Implementation

Survey evaluation metrics from the MMI domain and from broader fields (e.g. machine learning)
Define matching strategies for comparing predictions and ground truth (e.g. exact match, soft / fuzzy match)
Define evaluation targets at the component level (e.g. intent, object, parameter)
Implement core evaluation metrics (e.g. accuracy, precision, recall)
Measure processing and generation time as additional performance indicators
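
The matching strategies and core metrics listed above could look roughly as follows. This is a minimal sketch under stated assumptions: the function names, the similarity threshold of 0.8 and the choice of difflib.SequenceMatcher for fuzzy matching are illustrative and not prescribed by the project. Precision and recall are computed over sets of referenced objects, and processing time is measured with a wall-clock timer.

```python
import time
from difflib import SequenceMatcher
from typing import Callable, Iterable, Set, Tuple


def exact_match(pred: str, truth: str) -> bool:
    """Case- and whitespace-insensitive string equality."""
    return pred.strip().lower() == truth.strip().lower()


def fuzzy_match(pred: str, truth: str, threshold: float = 0.8) -> bool:
    """Soft match: similarity ratio in [0, 1] must reach the threshold."""
    ratio = SequenceMatcher(None, pred.strip().lower(), truth.strip().lower()).ratio()
    return ratio >= threshold


def accuracy(pairs: Iterable[Tuple[str, str]],
             match: Callable[[str, str], bool] = exact_match) -> float:
    """Fraction of (prediction, truth) pairs accepted by the matcher."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(match(p, t) for p, t in pairs) / len(pairs)


def precision_recall(predicted: Set[str], expected: Set[str]) -> Tuple[float, float]:
    """Set-based precision and recall, e.g. for referenced objects."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


def timed(fn: Callable, *args, **kwargs):
    """Run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


intent_pairs = [("move", "Move"), ("rotate", "move")]
print(accuracy(intent_pairs))
print(precision_recall({"chair", "lamp"}, {"chair", "table", "sofa"}))
print(fuzzy_match("red chair", "red chairs"))
```

Returning 0.0 for empty sets is a convention chosen for this sketch; the pipeline may instead treat such cases as undefined and exclude them from averages.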

4. Evaluation Logic

Define and specify an error taxonomy for multimodal predictions
Implement matching strategies between predictions and ground truth
Classify and analyze different types of prediction errors (e.g. intent, object and parameter mismatches)
Investigate the behavior and sensitivity of evaluation metrics
Identify limitations of the evaluation setup and derive improvement strategies
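
To make the idea of an error taxonomy more concrete, the following sketch classifies the mismatches between one prediction and its ground truth and aggregates them into an error profile. The four error categories, the dictionary layout with the keys "intent", "objects" and "parameters", and the function names are hypothetical; the actual taxonomy is part of the planned work.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, List, Tuple


class ErrorType(Enum):
    """Illustrative error categories, not the final taxonomy."""
    INTENT_MISMATCH = "intent_mismatch"
    OBJECT_MISSING = "object_missing"        # expected object not predicted
    OBJECT_SPURIOUS = "object_spurious"      # predicted object not expected
    PARAMETER_MISMATCH = "parameter_mismatch"


def classify_errors(pred: dict, truth: dict) -> List[ErrorType]:
    """Return every error category that applies to one sample."""
    errors = []
    if pred.get("intent") != truth.get("intent"):
        errors.append(ErrorType.INTENT_MISMATCH)

    pred_objects = set(pred.get("objects", ()))
    true_objects = set(truth.get("objects", ()))
    if true_objects - pred_objects:
        errors.append(ErrorType.OBJECT_MISSING)
    if pred_objects - true_objects:
        errors.append(ErrorType.OBJECT_SPURIOUS)

    if pred.get("parameters", {}) != truth.get("parameters", {}):
        errors.append(ErrorType.PARAMETER_MISMATCH)
    return errors


def error_profile(pairs: Iterable[Tuple[dict, dict]]) -> Counter:
    """Count how often each error category occurs across all samples."""
    counts = Counter()
    for pred, truth in pairs:
        counts.update(classify_errors(pred, truth))
    return counts


truth = {"intent": "move", "objects": {"chair", "table"}, "parameters": {"target": "door"}}
pred = {"intent": "move", "objects": {"chair", "lamp"}, "parameters": {"target": "window"}}
profile = error_profile([(pred, truth), (truth, truth)])
print(profile)
```

Because one sample can exhibit several error types at once, the counts in the profile may sum to more than the number of samples.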

5. Baseline Implementation and Validation

Implement a simple baseline or dummy predictor (fusion method) compatible with the defined API
Use the baseline to validate the evaluation pipeline and workflow
Refine the system based on validation results
Demonstrate the functionality and interpretability of the evaluation process
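
A dummy predictor for validating the workflow could be as simple as a majority-class baseline, sketched below. The class MajorityIntentBaseline, its fit() and predict() methods, the dictionary output format and the intent_accuracy helper are assumptions for this example; the real baseline has to follow whatever API the project defines. Such a baseline ignores its inputs entirely, so its score marks the level that any genuine fusion method should exceed.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


class MajorityIntentBaseline:
    """Always predicts the intent seen most often during fit()."""

    def __init__(self) -> None:
        self._intent: Optional[str] = None

    def fit(self, intents: Iterable[str]) -> "MajorityIntentBaseline":
        counts = Counter(intents)
        self._intent = counts.most_common(1)[0][0] if counts else None
        return self

    def predict(self, inputs: dict) -> dict:
        # The multimodal inputs are deliberately ignored.
        return {"intent": self._intent, "objects": [], "parameters": {}}


def intent_accuracy(model, samples: Iterable[Tuple[dict, dict]]) -> float:
    """Share of samples whose predicted intent equals the ground truth."""
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(
        model.predict(inputs)["intent"] == truth["intent"]
        for inputs, truth in samples
    )
    return hits / len(samples)


baseline = MajorityIntentBaseline().fit(["move", "move", "delete"])
test_samples = [
    ({"speech": "put that there"}, {"intent": "move"}),
    ({"speech": "remove this"}, {"intent": "delete"}),
]
score = intent_accuracy(baseline, test_samples)
print(score)
```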

Outcome

Functional and extensible evaluation pipeline for multimodal fusion tasks
Reusable API for integration into other systems and teaching contexts
Set of core evaluation metrics for systematic analysis
Initial insights into model behavior and evaluation challenges
Baseline implementation for validation and comparison
Comparability across different multimodal fusion paradigms through a unified evaluation framework

Schedule

References

Baltrušaitis, T., Ahuja, C., & Morency, L.-P. (2019). Multimodal Machine Learning: A Survey and Taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423–443.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.

Xue, L., Zhang, C., Xue, K., Liu, X., Wang, G., & Han, Z. (2026). MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains. Proceedings of the AAAI Conference on Artificial Intelligence, 40(32), 27450–27458.

Contact Persons at the University of Würzburg

Ronja Heinrich (Primary Contact Person)
Human-Computer Interaction, Universität Würzburg
ronja.heinrich@uni-wuerzburg.de

Dr. Martin Fischbach
Human-Computer Interaction, Universität Würzburg
martin.fischbach@uni-wuerzburg.de

Prof. Dr. Marc Erich Latoschik
Human-Computer Interaction, Universität Würzburg
marc.latoschik@uni-wuerzburg.de