---
license: apache-2.0
language:
- en
---

# SPHINX-V Model Card

## Model type:
**SPHINX-V** is a multimodal large language model designed for visual prompting, equipped with a novel visual prompt encoder and a two-stage training strategy. SPHINX-V supports multiple visual prompts of various types simultaneously, significantly enhancing user flexibility and achieving a fine-grained, open-world understanding of visual prompts.

## Paper or resources for more information:
Project Page: [Draw-and-Understand](https://draw-and-understand.github.io/) \
Paper: [https://arxiv.org/abs/2403.20271](https://arxiv.org/abs/2403.20271) \
Code: [https://github.com/AFeng-x/Draw-and-Understand](https://github.com/AFeng-x/Draw-and-Understand) \
Dataset: [MDVP-Data & MDVP-Bench](https://huggingface.co/datasets/Afeng-x/Draw-and-Understand)

## Intended use
**Primary intended uses:**
The principal application of SPHINX-V is research on visual-prompting large multimodal models and chatbots.

**Primary intended users:**
The model is primarily intended for researchers and enthusiasts in computer vision, natural language processing, and interactive artificial intelligence.

## License
Llama 2 is licensed under the LLAMA 2 Community License, Copyright (c) Meta Platforms, Inc. All Rights Reserved.

## Citations
```
@misc{lin2024drawandunderstand,
      title={Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want},
      author={Weifeng Lin and Xinyu Wei and Ruichuan An and Peng Gao and Bocheng Zou and Yulin Luo and Siyuan Huang and Shanghang Zhang and Hongsheng Li},
      year={2024},
      eprint={2403.20271},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```