---
license: cc-by-4.0
---
|
|
|
|
# Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions
|
|
|
|
[**Paper**](https://arxiv.org/abs/2406.10638) | [**Code**](https://github.com/BAAI-DCAI/Multimodal-Robustness-Benchmark) | [**Data**](https://huggingface.co/datasets/BAAI/Multimodal-Robustness-Benchmark)
|
|
|
|
|
|
## Overview
|
|
|
|
MMR provides a comprehensive suite for evaluating the understanding capability of Multimodal Large Language Models (MLLMs) and their robustness to negative (leading) questions about visual content they have already interpreted correctly. The MMR benchmark includes:
|
|
|
|
1. **Multimodal Robustness (MMR) Benchmark and Targeted Evaluation Metrics:**
|
|
- Comprises 12 categories of paired positive and negative questions (an illustrative loading sketch for this paired format follows the list).
|
|
- Each question is meticulously annotated by experts to ensure scientific validity and accuracy.
|
|
|
|
2. **Specially Designed Training Set:**
|
|
- Contains paired positive and negative visual question-answer samples to enhance robustness.
|
|
|
|
3. **Combined Dataset and Models:**
|
|
- The combined dataset merges the proposed training set with existing datasets.
|
|
- Trained models include [Bunny-MMR-3B](https://huggingface.co/AI4VR/Bunny-MMR-3B), [Bunny-MMR-4B](https://huggingface.co/AI4VR/Bunny-MMR-4B), and [Bunny-MMR-8B](https://huggingface.co/AI4VR/Bunny-MMR-8B).
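For a concrete sense of how the paired data might be consumed, below is a minimal sketch that loads the benchmark with the Hugging Face `datasets` library and groups questions into positive/negative pairs. The split name and the field names (`pair_id`, `polarity`, `question`) are assumptions made for illustration only; please consult the dataset card for the actual schema.

```python
from collections import defaultdict

from datasets import load_dataset  # pip install datasets

# Load the MMR benchmark from the Hugging Face Hub.
# NOTE: the split name and field names below are illustrative assumptions,
# not the documented schema of the dataset.
mmr = load_dataset('BAAI/Multimodal-Robustness-Benchmark', split='test')

# Group paired positive and negative questions by a hypothetical pair identifier.
pairs = defaultdict(dict)
for sample in mmr:
    pairs[sample['pair_id']][sample['polarity']] = sample  # 'positive' / 'negative'

for pair_id, pair in pairs.items():
    print(pair_id, pair['positive']['question'], '<->', pair['negative']['question'])
```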
|
|
|
|
In this repository, we provide Bunny-MMR-3B, which is built upon [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) and [Phi-2](https://huggingface.co/microsoft/phi-2). More details about this model can be found on [GitHub](https://github.com/BAAI-DCAI/Multimodal-Robustness-Benchmark).
|
|
|
|
|
|
## Key Features
|
|
|
|
- **Rigorous Testing:**
|
|
- Extensive testing on leading MLLMs shows that while these models can correctly interpret visual content, they exhibit significant vulnerabilities when faced with leading questions.
|
|
|
|
- **Enhanced Robustness:**
|
|
- Targeted training on the paired samples significantly improves MLLMs' ability to handle negative questions (a simple pair-wise robustness score is sketched below).
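The paper defines its own targeted evaluation metrics; purely as an illustration of the idea, the sketch below computes a simple pair-wise robustness score that credits a model only when it answers both the positive question and its negative (leading) counterpart correctly. This is not the exact metric from the paper.

```python
from typing import List, Tuple

def pairwise_robustness(results: List[Tuple[bool, bool]]) -> float:
    """Fraction of question pairs where both the positive and the negative
    (leading) question were answered correctly.

    `results` holds one (positive_correct, negative_correct) flag pair per item.
    """
    if not results:
        return 0.0
    both_correct = sum(1 for pos_ok, neg_ok in results if pos_ok and neg_ok)
    return both_correct / len(results)

# Example: three pairs; the model is misled on the negative question of the last pair.
print(pairwise_robustness([(True, True), (True, True), (True, False)]))  # ~0.667
```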
|
|
|
|
|
|
## Quickstart
|
|
|
|
Below is a code snippet showing how to use the model with the `transformers` library.
|
|
|
|
Before running the snippet, you need to install the following dependencies:
|
|
|
|
```shell
|
|
pip install torch transformers accelerate pillow
|
|
```
|
|
|
|
```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# set default device
torch.set_default_device('cpu')  # or 'cuda'

# create model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    'AI4VR/Bunny-MMR-3B',
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'AI4VR/Bunny-MMR-3B',
    trust_remote_code=True)

# text prompt: replace with your own question about the image
prompt = 'text prompt'
text = f"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\n{prompt} ASSISTANT:"

# tokenize the prompt, inserting the special image token id (-200) where <image> appears,
# and move the inputs to the device the model was placed on
text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0).to(model.device)

# image; sample images can be found in the images folder of the repository
image = Image.open('path/to/image')
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=model.device)

# generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=100,
    use_cache=True)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
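To probe the behaviour the benchmark targets, the same image can be queried with a matched positive prompt and a negative (leading) prompt, and the two answers compared. The small wrapper below simply reuses `model`, `tokenizer`, and `image_tensor` from the snippet above; the helper name and the example prompts are illustrative, not items taken from the benchmark.

```python
def ask(question, image_tensor):
    # Build the same chat-style prompt as above and replace <image> with the
    # special image token id (-200) expected by the model.
    text = ("A chat between a curious user and an artificial intelligence assistant. "
            "The assistant gives helpful, detailed, and polite answers to the user's questions. "
            f"USER: <image>\n{question} ASSISTANT:")
    chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
    ids = torch.tensor(chunks[0] + [-200] + chunks[1], dtype=torch.long).unsqueeze(0).to(model.device)
    out = model.generate(ids, images=image_tensor, max_new_tokens=100, use_cache=True)[0]
    return tokenizer.decode(out[ids.shape[1]:], skip_special_tokens=True).strip()

# Illustrative paired prompts (not actual MMR benchmark items): compare the two answers.
print(ask('What color is the umbrella in the image?', image_tensor))
print(ask("The umbrella in the image is green, isn't it?", image_tensor))
```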
|
|
|
|
## Citation
|
|
If you find this repository helpful, please cite the paper below.
|
|
|
|
```bibtex
|
|
@misc{liu2024seeing,
|
|
title={Seeing Clearly, Answering Incorrectly: A Multimodal Robustness Benchmark for Evaluating MLLMs on Leading Questions},
|
|
author={Yexin Liu and Zhengyang Liang and Yueze Wang and Muyang He and Jian Li and Bo Zhao},
|
|
year={2024},
|
|
eprint={2406.10638},
|
|
archivePrefix={arXiv},
|
|
}
|
|
```
|
|
|
|
## License
|
|
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
|
|
The content of this project itself is licensed under [CC BY 4.0](./LICENSE).