---
base_model: mistral-community/pixtral-12b
library_name: peft
license: cc-by-4.0
datasets:
- daniel3303/GroundCap
language:
- en
metrics:
- bleu
- meteor
- cider
- spice
- f1
- recall
- precision
- gmeteor
- rouge
model-index:
- name: PixtralGroundCap
  results:
  - task:
      type: image-captioning
      subtype: grounded-image-captioning
    dataset:
      name: daniel3303/GroundCap
      type: grounded-image-captioning
      split: test
    metrics:
    - name: Precision
      type: grounding-precision
      value: 0.58
    - name: Recall
      type: grounding-recall
      value: 0.96
    - name: F1
      type: grounding-f1
      value: 0.69
    - name: BLEU-4
      type: bleu-4
      value: 0.19
    - name: METEOR
      type: meteor
      value: 0.23
    - name: CIDEr
      type: cider
      value: 0.51
    - name: SPICE
      type: spice
      value: 0.30
    - name: gMETEOR
      type: gmeteor
      value: 0.35
---
# Model Card for PixtralGroundCap
This model is a fine-tuned version of Pixtral-12B for grounded image captioning, trained on the GroundCap dataset. It generates detailed image descriptions with explicit grounding tags that link spans of the caption to specific regions of the image. A novel tag system grounds objects (`<gdo>`), actions (`<gda>`), and locations (`<gdl>`).
## Model Details
### Model Description
- **Developed by:** Daniel A. P. Oliveira, Lourenço Teodoro, and David Martins de Matos (INESC-ID Lisboa and Instituto Superior Técnico, Universidade de Lisboa)
- **Model type:** Fine-tuned Pixtral-12B model for grounded image captioning
- **Language(s):** English
- **License:** Creative Commons Attribution 4.0
- **Finetuned from model:** mistral-community/pixtral-12b
### Model Sources
- **Paper:** https://arxiv.org/abs/2502.13898
- **Dataset:** https://huggingface.co/datasets/daniel3303/GroundCap
## Uses
### Direct Use
The model is designed for generating grounded image captions that explicitly link textual descriptions to visual elements using three types of grounding tags:
- `<gdo>` for objects
- `<gda>` for actions
- `<gdl>` for locations
Each tag maintains object identity through unique IDs, enabling consistent reference tracking throughout the caption.
### Downstream Use
The model can be integrated into:
- Accessibility applications requiring detailed image descriptions
- Content management systems needing verifiable image captions
- Visual question answering systems
- Image retrieval systems
### Out-of-Scope Use
The model is not designed for:
- General image classification
- Object detection (the model consumes detections from a separate object detection pipeline and does not produce them itself)
- Video captioning
- Non-English language captioning
## How to Get Started with the Model
### Input Format
The model expects input in the following format:
```
You are an AI assistant that can see and understand images. I will provide you with an image and the detected objects in it along with their positions and dimensions in the format [id, x,y,width,height].
[DETECTIONS]
[sky-0: 0.41,0.00,0.20,0.15]
[sky-1: 0.62,0.00,0.26,0.10]
[wall-0: 0.01,0.02,0.35,0.86]
[person-0: 0.38,0.35,0.12,0.40]
[person-1: 0.45,0.35,0.08,0.39]
[wall-1: 0.39,0.10,0.35,0.48]
[person-2: 0.71,0.29,0.20,0.51]
[wall-2: 0.75,0.03,0.24,0.88]
[person-3: 0.00,0.57,0.22,0.42]
[handbag-0: 0.21,0.75,0.11,0.23]
[person-4: 0.26,0.48,0.20,0.52]
[floor-wood-0: 0.40,0.59,0.60,0.41]
[/DETECTIONS]
[IMG]
```
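As a minimal sketch of how a prompt in the format above could be assembled, the following helper turns a list of detections into the `[DETECTIONS]` block. The function name `build_prompt` and the tuple layout are hypothetical, and the sketch assumes coordinates are already normalized to the [0, 1] range, as the values in the example suggest. How the `[IMG]` placeholder is paired with the actual image depends on the processor you use and is not covered here.

```python
from collections import defaultdict

# Instruction text copied from the input format above.
INSTRUCTION = (
    "You are an AI assistant that can see and understand images. "
    "I will provide you with an image and the detected objects in it along "
    "with their positions and dimensions in the format [id, x,y,width,height]."
)


def build_prompt(detections):
    """Build the text prompt from (class_name, x, y, width, height) tuples.

    Assumption: coordinates are already normalized to the [0, 1] range.
    """
    counters = defaultdict(int)  # per-class running index: sky-0, sky-1, ...
    lines = [INSTRUCTION, "[DETECTIONS]"]
    for class_name, x, y, w, h in detections:
        object_id = f"{class_name}-{counters[class_name]}"
        counters[class_name] += 1
        lines.append(f"[{object_id}: {x:.2f},{y:.2f},{w:.2f},{h:.2f}]")
    lines += ["[/DETECTIONS]", "[IMG]"]
    return "\n".join(lines)


example = [
    ("sky", 0.41, 0.00, 0.20, 0.15),
    ("sky", 0.62, 0.00, 0.26, 0.10),
    ("wall", 0.01, 0.02, 0.35, 0.86),
]
print(build_prompt(example))
```

IDs are assigned per class in input order, which reproduces the `sky-0`, `sky-1`, `wall-0` naming shown in the example.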
### Example Output
The model will generate a grounded caption using three types of tags:
- `<gdo>` for objects
- `<gda>` for actions
- `<gdl>` for locations
Example output:
```
In this scene, a group of individuals is gathered in what appears to be a <gdl class="wall" wall-0 wall-1 wall-2>rugged, makeshift shelter</gdl>. The <gdl class="wall" wall-0 wall-1 wall-2>walls</gdl> are constructed from <gdl class="wall" wall-0 wall-1 wall-2>rough materials</gdl>, giving the space a temporary and utilitarian feel. The <gdl class="sky" sky-0 sky-1>sky</gdl> is visible in the background, suggesting that the shelter is partially open or lacks a complete roof.
Several <gdo class="person" person-0 person-1 person-2 person-3 person-4>people</gdo> are present in the scene, each engaged in their own activities. <gdo class="person" person-0>One individual</gdo> <gda class="sit" person-0>sits</gda> on the <gdl class="floor-wood" floor-wood-0>ground</gdl>, while <gdo class="person" person-1>another person</gdo> <gda class="sit" person-1>is seated</gda> nearby. <gdo class="person" person-2>Another person</gdo> <gda class="sit" person-2>is also sitting</gda> on the <gdl class="floor-wood" floor-wood-0>ground</gdl>, and <gdo class="person" person-3>a fourth individual</gdo> <gda class="sit" person-3>is seated</gda> as well. <gdo class="person" person-4>An additional person</gdo> <gda class="sit" person-4>is sitting</gda> close by.
The <gdo class="handbag" handbag-0>handbag</gdo> is placed on the <gdl class="floor-wood" floor-wood-0>ground</gdl> near one of the individuals, suggesting they might have brought some personal belongings with them. The overall atmosphere of the scene is one of simplicity and resilience, with the individuals making the best of their surroundings in this temporary shelter.
```
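Downstream code usually needs the grounding tags in structured form. The sketch below parses a generated caption with a regular expression, assuming tags follow the shape seen in the example output (`class="..."` followed by space-separated IDs) and are not nested. The helper names are hypothetical and not part of any released tooling.

```python
import re

# Matches <gdo|gda|gdl class="..." id-0 id-1 ...>text</same tag>.
# Assumption: tags are never nested inside each other.
TAG_PATTERN = re.compile(
    r'<(gdo|gda|gdl) class="([^"]*)"([^>]*)>(.*?)</\1>', re.DOTALL
)


def parse_grounded_caption(caption):
    """Return one dict per grounding tag found in the caption."""
    spans = []
    for match in TAG_PATTERN.finditer(caption):
        tag, class_name, ids, text = match.groups()
        spans.append(
            {"tag": tag, "class": class_name, "ids": ids.split(), "text": text}
        )
    return spans


def strip_tags(caption):
    """Remove the grounding markup and keep only the plain caption text."""
    return TAG_PATTERN.sub(lambda m: m.group(4), caption)


def unknown_ids(spans, detection_ids):
    """List referenced IDs that were not part of the input detections."""
    referenced = {i for span in spans for i in span["ids"]}
    return sorted(referenced - set(detection_ids))


caption = (
    'The <gdo class="handbag" handbag-0>handbag</gdo> is placed on the '
    '<gdl class="floor-wood" floor-wood-0>ground</gdl>.'
)
spans = parse_grounded_caption(caption)
print(spans[0]["ids"])       # ['handbag-0']
print(strip_tags(caption))   # The handbag is placed on the ground.
```

Checking the parsed IDs against the detections passed in the prompt (`unknown_ids`) is one simple way to flag references the model may have invented.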
## Bias, Risks, and Limitations
- The model was trained on movie scenes from MovieNet, which may introduce biases in terms of scene composition, lighting, and camera angles
- Performance may vary for real-world images that differ significantly from movie scenes
- The model relies on pre-detected objects and their bounding boxes; the original paper used Mask2Former for object detection
### Recommendations
- Use in conjunction with a robust object detection system
- Verify grounding accuracy for critical applications
- Consider the movie-centric nature of the training data when applying to other domains
## Training Details
### Training Data
The model was trained on the GroundCap dataset, which contains:
- 52,016 images from 77 movies
- 344 human-annotated captions
- 52,016 automatically generated captions
### Training Procedure
The training followed a two-stage approach:
#### Stage 1:
- Training on 52,016 automatically generated captions
- Learning rate: 2×10^-4
- Epochs: 2
- Batch size: 64 (with gradient accumulation)
#### Stage 2:
- Fine-tuning on 344 human-refined captions
- Learning rate: 2×10^-6
- Epochs: 2
- Batch size: 32 (with gradient accumulation)
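The two stages can be written down as a small configuration, which also makes the gradient accumulation arithmetic explicit. This is an illustrative sketch only: the `Stage` class is hypothetical, and the per-device batch size of 4 in the usage lines is an assumed value that this card does not state. The GPU count of 2 comes from the hardware listed under the training hyperparameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    num_captions: int
    learning_rate: float
    epochs: int
    effective_batch_size: int

    def accumulation_steps(self, num_gpus, per_device_batch_size):
        """Gradient accumulation steps needed to reach the effective batch size."""
        per_step = num_gpus * per_device_batch_size
        if self.effective_batch_size % per_step:
            raise ValueError(
                "effective batch size must be a multiple of "
                "num_gpus * per_device_batch_size"
            )
        return self.effective_batch_size // per_step


# Values taken from the two stages described above.
STAGES = [
    Stage("stage-1", 52_016, 2e-4, 2, 64),
    Stage("stage-2", 344, 2e-6, 2, 32),
]

for stage in STAGES:
    # Assumption: 2 GPUs with a per-device batch size of 4 (hypothetical).
    print(stage.name, stage.accumulation_steps(num_gpus=2, per_device_batch_size=4))
```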
#### Training Hyperparameters
- **LoRA Configuration:**
  - Rank: 16
  - Alpha: 32
  - Targeted layers: Self-attention (query, key, value, output) and MLP (gate, up, down)
- **Optimizer:** AdamW
- **Weight decay:** 0.01
- **Precision:** bfloat16
- **Hardware:** 2x NVIDIA A100 (80GB)
- **Training time:** 1 day
## Evaluation
### Testing Data, Factors & Metrics
The model was evaluated on 10,000 test images from GroundCap, of which 70 are human-annotated test cases.
### Metrics
- **Grounding metrics:**
  - Precision (P): Correctly grounded objects / Total objects mentioned in caption
  - Recall (R): Correctly grounded objects / Total detected objects
  - F1 score: Harmonic mean of precision and recall
- **Caption quality metrics:**
  - BLEU-4: N-gram overlap with reference captions
  - METEOR: Semantic similarity with reference captions
  - CIDEr: Consensus-based image description evaluation
  - SPICE: Semantic propositional image caption evaluation
  - ROUGE-L: Longest common subsequence based evaluation
- **Combined metric:**
  - gMETEOR: Harmonic mean of METEOR and grounding F1 score, combining language quality with grounding accuracy
- **Human evaluation (5-point Likert scale):**
  - Object precision: Accuracy of object grounding and tag classification
  - Grounding recall: Coverage of detected objects in captions
  - Description accuracy: Correctness of described actions and relationships
  - Language quality: Grammar, readability, and coherence
  - Overall quality: Assessment of caption effectiveness
- **ChatGPT-4o evaluation (5-point Likert scale):**
  - Uses same criteria as human evaluation
  - Correlations with human judgments:
    - Object Precision: 0.81 (Pearson), 0.73 (Spearman)
    - Grounding Recall: 0.76 (Pearson), 0.67 (Spearman)
    - Description Accuracy: 0.79 (Pearson), 0.77 (Spearman)
    - Language Quality: 0.59 (Pearson), 0.44 (Spearman)
    - Overall Quality: 0.78 (Pearson), 0.68 (Spearman)
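The grounding and combined metrics defined above reduce to a few lines of arithmetic. The sketch below works on a single set of counts; whether the reported scores are computed per caption and then averaged, or from corpus-level counts, is not specified in this card, so treat the function names and the aggregation as assumptions.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative scores (0.0 if both are zero)."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


def grounding_scores(num_correct, num_mentioned, num_detected):
    """Return (precision, recall, F1) from raw counts.

    precision = correctly grounded / objects mentioned in the caption
    recall    = correctly grounded / objects detected in the image
    """
    precision = num_correct / num_mentioned if num_mentioned else 0.0
    recall = num_correct / num_detected if num_detected else 0.0
    return precision, recall, harmonic_mean(precision, recall)


def gmeteor(meteor, grounding_f1):
    """gMETEOR: harmonic mean of METEOR and the grounding F1 score."""
    return harmonic_mean(meteor, grounding_f1)


# Hypothetical counts for one caption: 6 correct, 10 mentioned, 8 detected.
print(grounding_scores(6, 10, 8))
# METEOR and grounding F1 as reported for this model.
print(gmeteor(0.23, 0.69))  # about 0.345
```

With the reported METEOR of 0.23 and grounding F1 of 0.69, the harmonic mean is about 0.345, which matches the reported gMETEOR of 0.35 after rounding.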
### Results
Automatic metrics on test set for PixtralGroundCap:
- Precision: 0.58
- Recall: 0.96
- F1 Score: 0.69
- BLEU-4: 0.19
- METEOR: 0.23
- CIDEr: 0.51
- SPICE: 0.30
- ROUGE-L: 0.37
- gMETEOR: 0.35
Human evaluation results (scale 1-5):
- Object Precision: 4.22
- Grounding Recall: 4.19
- Description Accuracy: 4.08
- Language Quality: 4.91
- Overall Quality: 4.22
ChatGPT-4o evaluation results (scale 1-5):
- Object Precision: 4.21
- Grounding Recall: 4.13
- Description Accuracy: 4.01
- Language Quality: 4.90
- Overall Quality: 4.19
## Environmental Impact
- **Hardware Type:** 2x NVIDIA A100 GPUs
- **Hours used:** 24 hours
- **Cloud Provider:** INESC-ID
- **Compute Region:** Lisbon, Portugal
## Paper
[arXiv link](https://arxiv.org/abs/2502.13898).
## Citation
**BibTeX:**
```bibtex
@article{Oliveira2025GroundCapAV,
title={GroundCap: A Visually Grounded Image Captioning Dataset},
author={Daniel A. P. Oliveira and Louren{ç}o Teodoro and David Martins de Matos},
year={2025},
url={https://api.semanticscholar.org/CorpusID:276450057}
}
```
## Model Card Authors
Daniel A. P. Oliveira, Lourenço Teodoro, and David Martins de Matos
## Model Card Contact
[email protected]
### Framework versions
- PEFT 0.13.2