FG-CLIP: Fine-Grained Visual and Textual Alignment
Abstract
FG-CLIP enhances fine-grained understanding in multimodal tasks by leveraging large multimodal models, a high-quality dataset with detailed captions, and hard fine-grained negative samples.
Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs that capture global-level semantic details. Second, we construct a high-quality dataset of 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, we incorporate 10 million hard fine-grained negative samples to improve the model's ability to distinguish subtle semantic differences. Corresponding training methods are carefully designed for each of these data sources. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across a range of downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The related data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.
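The hard fine-grained negatives described above act at the level of the contrastive objective: each image is contrasted not only against the other captions in the batch but also against perturbed captions that differ in subtle attributes. The snippet below is a minimal PyTorch sketch of such an objective, not the authors' implementation; the tensor shapes, the temperature value, and the way hard negatives are packed per image are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def clip_loss_with_hard_negatives(img_emb, txt_emb, hard_txt_emb, temperature=0.07):
    """Illustrative CLIP-style contrastive loss with extra hard text negatives.

    img_emb:      (B, D) image embeddings
    txt_emb:      (B, D) matching caption embeddings
    hard_txt_emb: (B, K, D) K hard-negative captions per image
    (Shapes and temperature are assumptions, not FG-CLIP's actual settings.)
    """
    # L2-normalize so that dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    hard_txt_emb = F.normalize(hard_txt_emb, dim=-1)

    # Standard in-batch image-to-text similarities: (B, B).
    logits_i2t = img_emb @ txt_emb.t() / temperature

    # Similarities between each image and its own hard negatives: (B, K).
    logits_hard = torch.einsum("bd,bkd->bk", img_emb, hard_txt_emb) / temperature

    # For each image, the positive is its own caption (column i of row i);
    # hard negatives are appended as extra columns of the softmax.
    logits = torch.cat([logits_i2t, logits_hard], dim=1)  # (B, B + K)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)

    # Symmetric text-to-image term uses only in-batch negatives here.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits_i2t.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```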
Community
FG-CLIP is a new-generation cross-modal model capable of fine-grained discrimination for text-image alignment and retrieval.
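For readers who want to try fine-grained text-image matching, the sketch below uses the standard Hugging Face CLIPModel interface as a stand-in. The checkpoint name and image path are placeholders, and FG-CLIP's actual loading code (see the GitHub repository linked in the abstract) may differ from this API.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Stand-in checkpoint; see https://github.com/360CVGroup/FG-CLIP for the
# real FG-CLIP model IDs and loading instructions, which may use a
# different interface than shown here.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")  # placeholder image path
captions = [
    "a black cat with a red collar",
    "a black cat with a blue collar",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher probability = better text-image match; a fine-grained model should
# separate captions that differ only in a subtle attribute (collar color).
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```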
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs (2025)
- GOAL: Global-local Object Alignment Learning (2025)
- Decoupled Global-Local Alignment for Improving Compositional Understanding (2025)
- Compositional Image-Text Matching and Retrieval by Grounding Entities (2025)
- Refining CLIP's Spatial Awareness: A Visual-Centric Perspective (2025)
- VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models (2025)
- LRSCLIP: A Vision-Language Foundation Model for Aligning Remote Sensing Image with Longer Text (2025)