---
language:
- en
---
# Adapting Multimodal Large Language Models to Domains via Post-Training

This project adapts general Multimodal Large Language Models (MLLMs) to specific domains such as science and industry to improve their usefulness in real-world applications. It focuses on three main areas:

### 1. Data Synthesis
- We build a **generate-then-filter pipeline** that uses open-source models to synthesize diverse visual tasks from domain-specific image-caption pairs (a minimal sketch follows this list).
- The resulting data yields better performance than data produced by manual rules or by closed-source models (e.g., GPT-4V).
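
The following is only a minimal sketch of the generate-then-filter idea, not the exact pipeline from the paper: the `synthesizer`, `filter_model`, `consistency` score, and threshold are all hypothetical placeholders for whatever open-source models are used.

```python
# Illustrative generate-then-filter loop (all names below are hypothetical placeholders).
def synthesize_visual_tasks(image_caption_pairs, synthesizer, filter_model, threshold=0.5):
    """Turn domain image-caption pairs into (image, instruction, response) training triples."""
    synthetic_tasks = []
    for image, caption in image_caption_pairs:
        # 1) Generate: ask an open-source synthesizer to propose task candidates
        #    (instruction-response pairs) grounded in the image-caption pair.
        candidates = synthesizer.generate(image=image, caption=caption)
        for instruction, response in candidates:
            # 2) Filter: keep only candidates judged consistent with the source pair.
            score = filter_model.consistency(image, caption, instruction, response)
            if score >= threshold:
                synthetic_tasks.append(
                    {"image": image, "instruction": instruction, "response": response}
                )
    return synthetic_tasks
```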

### 2. Training Pipeline
- Instead of the usual two-stage training (image-caption pairs first, then visual instruction tasks), we use **single-stage training** to enhance task diversity for domain-specific post-training (see the sketch below).
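
In data terms, single-stage training simply mixes the caption data and the synthesized tasks into one supervised fine-tuning set and trains once; the sketch below assumes that framing, and every name in it is illustrative.

```python
# Illustrative: single-stage post-training mixes caption data and synthetic tasks
# into one supervised fine-tuning pool instead of two sequential training stages.
import random

def build_single_stage_dataset(image_caption_pairs, synthetic_tasks, seed=0):
    caption_tasks = [
        {"image": image, "instruction": "Describe this image.", "response": caption}
        for image, caption in image_caption_pairs
    ]
    dataset = caption_tasks + synthetic_tasks  # one mixed pool, one training run
    random.Random(seed).shuffle(dataset)
    return dataset
```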

### 3. Task Evaluation
- We validate our method in key application domains: **biomedicine, food, and remote sensing**.
- We train and evaluate MLLMs on domain-specific tasks to directly measure the effectiveness of domain-specific post-training.


## Resources
**🤗 We share our data and models with example usage; feel free to open issues or discussions! 🤗**

| Model                                                                       | Repo ID in HF 🤗                           | Domain       | Base Model              | Training Data                                                                                  | Evaluation Benchmark |
|:----------------------------------------------------------------------------|:--------------------------------------------|:--------------|:-------------------------|:------------------------------------------------------------------------------------------------|-----------------------|
| [Visual Instruction Synthesizer](https://huggingface.co/AdaptLLM/visual-instruction-synthesizer) | AdaptLLM/visual-instruction-synthesizer     | -  | open-llava-next-llama3-8b    | VisionFLAN and ALLaVA | -                   |
| [AdaMLLM-med-2B](https://huggingface.co/AdaptLLM/biomed-Qwen2-VL-2B-Instruct) | AdaptLLM/biomed-Qwen2-VL-2B-Instruct     | Biomedicine  | Qwen2-VL-2B-Instruct    | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark)                   |
| [AdaMLLM-food-2B](https://huggingface.co/AdaptLLM/food-Qwen2-VL-2B-Instruct) | AdaptLLM/food-Qwen2-VL-2B-Instruct     | Food  | Qwen2-VL-2B-Instruct    | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) | [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark)                   |
| [AdaMLLM-remote-sensing-2B](https://huggingface.co/AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct) | AdaptLLM/remote-sensing-Qwen2-VL-2B-Instruct     | Remote Sensing  | Qwen2-VL-2B-Instruct    | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) | [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark)                   |
| [AdaMLLM-med-8B](https://huggingface.co/AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B) | AdaptLLM/biomed-LLaVA-NeXT-Llama3-8B     | Biomedicine  | open-llava-next-llama3-8b    | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark)                   |
| [AdaMLLM-food-8B](https://huggingface.co/AdaptLLM/food-LLaVA-NeXT-Llama3-8B) |AdaptLLM/food-LLaVA-NeXT-Llama3-8B     | Food  | open-llava-next-llama3-8b    | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) |  [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark)                   |
| [AdaMLLM-remote-sensing-8B](https://huggingface.co/AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B) |AdaptLLM/remote-sensing-LLaVA-NeXT-Llama3-8B     | Remote Sensing  | open-llava-next-llama3-8b    | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) |  [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark)                   |
| [AdaMLLM-med-11B](https://huggingface.co/AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/biomed-Llama-3.2-11B-Vision-Instruct     | Biomedicine  | Llama-3.2-11B-Vision-Instruct    | [biomed-visual-instructions](https://huggingface.co/datasets/AdaptLLM/biomed-visual-instructions) | [biomed-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/biomed-VQA-benchmark)                   |
| [AdaMLLM-food-11B](https://huggingface.co/AdaptLLM/food-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/food-Llama-3.2-11B-Vision-Instruct     | Food | Llama-3.2-11B-Vision-Instruct    | [food-visual-instructions](https://huggingface.co/datasets/AdaptLLM/food-visual-instructions) |  [food-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/food-VQA-benchmark)                   |
| [AdaMLLM-remote-sensing-11B](https://huggingface.co/AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct) | AdaptLLM/remote-sensing-Llama-3.2-11B-Vision-Instruct     | Remote Sensing | Llama-3.2-11B-Vision-Instruct    | [remote-sensing-visual-instructions](https://huggingface.co/datasets/AdaptLLM/remote-sensing-visual-instructions) |  [remote-sensing-VQA-benchmark](https://huggingface.co/datasets/AdaptLLM/remote-sensing-VQA-benchmark)                   |

**Code**: [https://github.com/bigai-ai/QA-Synthesizer](https://github.com/bigai-ai/QA-Synthesizer)
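
As a quick-start illustration, here is a hedged sketch of loading one of the 2B checkpoints with the standard Qwen2-VL classes in 🤗 Transformers; the image path and prompt are placeholders, and the model card of each repo remains the authoritative usage example.

```python
# Sketch: load a domain-adapted 2B checkpoint with the standard Qwen2-VL API.
# The image path and prompt below are illustrative placeholders.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

model_id = "AdaptLLM/biomed-Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example_xray.png")  # any domain image
messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe the key findings in this image."}]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```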

## Citation
If you find our work helpful, please cite us.

[Adapt MLLM to Domains](https://huggingface.co/papers/2411.19930)
```bibtex
@article{adamllm,
  title={On Domain-Specific Post-Training for Multimodal Large Language Models},
  author={Cheng, Daixuan and Huang, Shaohan and Zhu, Ziyu and Zhang, Xintong and Zhao, Wayne Xin and Luan, Zhongzhi and Dai, Bo and Zhang, Zhenliang},
  journal={arXiv preprint arXiv:2411.19930},
  year={2024}
}
```

[Adapt LLM to Domains](https://huggingface.co/papers/2309.09530) (ICLR 2024)
```bibtex
@inproceedings{adaptllm,
  title={Adapting Large Language Models via Reading Comprehension},
  author={Daixuan Cheng and Shaohan Huang and Furu Wei},
  booktitle={The Twelfth International Conference on Learning Representations},
  year={2024},
  url={https://openreview.net/forum?id=y886UXPEZ0}
}
```