---
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
---
|
# Model Card: Reflective LLaVA (ReflectiVA)
|
Multimodal LLMs (MLLMs) are the natural extension of large language models to handle multimodal inputs, combining text and image data. They have recently garnered attention due to their capability to address complex tasks involving both modalities. However, their effectiveness is limited to the knowledge acquired during training, which restricts their practical utility. In this work, we introduce a novel method to enhance the adaptability of MLLMs by integrating external knowledge sources. Our proposed model, Reflective LLaVA (`ReflectiVA`), uses reflective tokens to dynamically determine the need for external knowledge and to predict the relevance of information retrieved from an external database. The tokens are trained following a two-stage, two-model training recipe. This ultimately enables the MLLM to manage external knowledge while preserving fluency and performance on tasks where external knowledge is not needed.

We demonstrate the efficacy of `ReflectiVA` for knowledge-based visual question answering, highlighting its superior performance compared to existing methods.
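
To make the mechanism above concrete, the following is a minimal conceptual sketch of the inference-time control flow. The token names (`<NORET>`, `<REL>`) and the `generate`/`retrieve` interfaces are hypothetical placeholders introduced only for illustration; they are not the actual ReflectiVA vocabulary or API, so please refer to the paper and repository for the real procedure.

```python
# Conceptual sketch only: token names and interfaces below are hypothetical
# placeholders, not the actual ReflectiVA vocabulary or API.
from typing import List, Optional, Protocol


class MultimodalLM(Protocol):
    def generate(self, image, question: str, context: Optional[List[str]] = None) -> str: ...


class KnowledgeBase(Protocol):
    def retrieve(self, image, question: str) -> List[str]: ...


NO_RETRIEVE = "<NORET>"  # hypothetical token: answer from parametric knowledge
RELEVANT = "<REL>"       # hypothetical token: a retrieved passage is relevant


def answer(image, question: str, mllm: MultimodalLM, kb: KnowledgeBase) -> str:
    # First pass: the model emits a reflective token that decides whether
    # external knowledge is needed for this image-question pair.
    draft = mllm.generate(image, question)
    if NO_RETRIEVE in draft:
        return draft  # answer directly, without querying the external database

    # Retrieval pass: keep only the passages the model marks as relevant.
    passages = kb.retrieve(image, question)
    relevant = [p for p in passages
                if RELEVANT in mllm.generate(image, question, context=[p])]

    # Final pass: generate the answer conditioned on the retained passages.
    return mllm.generate(image, question, context=relevant)
```
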
|
|
In this model repository, you will find the overall model (stage-two) weights of `ReflectiVA`.

For more information, visit our [ReflectiVA repository](https://github.com/aimagelab/ReflectiVA).
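
Below is a minimal usage sketch. It assumes the checkpoint is compatible with the standard LLaVA classes in `transformers`; the repository ID, prompt template, and example image URL are placeholders, so consult the [ReflectiVA repository](https://github.com/aimagelab/ReflectiVA) for the exact inference code used in the paper.

```python
# Minimal usage sketch. Assumptions (not guaranteed by this card): the weights
# load through the standard LLaVA classes in transformers, and the repository
# ID and prompt template below are placeholders to adapt to the actual release.
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "aimagelab/ReflectiVA"  # placeholder repository ID
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Any RGB image works; this COCO URL is only an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"  # placeholder template
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

# The generated text may include the model's reflective tokens indicating
# whether retrieval is needed and whether retrieved content is relevant.
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```

If loading through the standard LLaVA classes fails for this checkpoint, the inference scripts in the GitHub repository are the reference implementation.
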
|
|
## Citation

If you make use of our work, please cite our paper:
|
```bibtex
@article{cocchi2024augmenting,
  title={{Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering}},
  author={Cocchi, Federico and Moratelli, Nicholas and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  journal={arXiv preprint arXiv:2411.16863},
  year={2024}
}
```
|
|
## Paper page

The paper can be found at [https://huggingface.co/papers/2411.16863](https://huggingface.co/papers/2411.16863).