---
base_model: mistral-community/pixtral-12b
library_name: peft
license: cc-by-4.0
datasets:
- daniel3303/GroundCap
language:
- en
metrics:
- bleu
- meteor
- cider
- spice
- f1
- recall
- precision
- gmeteor
- rouge

model-index:
  - name: PixtralGroundCap
    results:
      - task:
          type: image-captioning
          subtype: grounded-image-captioning
        dataset:
          name: daniel3303/GroundCap
          type: grounded-image-captioning
          split: test
        metrics:
          - name: Precision
            type: grounding-precision
            value: 0.58
          - name: Recall
            type: grounding-recall
            value: 0.96
          - name: F1
            type: grounding-f1
            value: 0.69
          - name: BLEU-4
            type: bleu-4
            value: 0.19
          - name: METEOR
            type: meteor
            value: 0.23
          - name: CIDEr
            type: cider
            value: 0.51
          - name: SPICE
            type: spice
            value: 0.30
          - name: gMETEOR
            type: gmeteor
            value: 0.35
       

---

# Model Card for PixtralGroundCap

This model is a fine-tuned version of Pixtral-12B for grounded image captioning, trained on the GroundCap dataset. It generates detailed image descriptions with explicit grounding tags that link spans of the caption to specific visual elements in the image. The tag system grounds objects (`<gdo>`), actions (`<gda>`), and locations (`<gdl>`) to specific image regions.

## Model Details

### Model Description

- **Developed by:** Daniel A. P. Oliveira, Lourenço Teodoro, and David Martins de Matos (INESC-ID Lisboa and Instituto Superior Técnico, Universidade de Lisboa)
- **Model type:** Fine-tuned Pixtral-12B model for grounded image captioning
- **Language(s):** English
- **License:** Creative Commons Attribution 4.0
- **Finetuned from model:** mistral-community/pixtral-12b




### Model Sources

- **Paper:** https://arxiv.org/abs/2502.13898
- **Dataset:** https://huggingface.co/datasets/daniel3303/GroundCap



## Uses

### Direct Use

The model is designed for generating grounded image captions that explicitly link textual descriptions to visual elements using three types of grounding tags:
- `<gdo>` for objects
- `<gda>` for actions
- `<gdl>` for locations

Each tag maintains object identity through unique IDs, enabling consistent reference tracking throughout the caption.

### Downstream Use

The model can be integrated into:
- Accessibility applications requiring detailed image descriptions
- Content management systems needing verifiable image captions
- Visual question answering systems
- Image retrieval systems

### Out-of-Scope Use

The model is not designed for:
- General image classification
- Object detection (detections must be supplied by a separate object detection pipeline)
- Video captioning
- Non-English language captioning

## How to Get Started with the Model

### Input Format

The model expects input in the following format:

```
You are an AI assistant that can see and understand images. I will provide you with an image and the detected objects in it along with their positions and dimensions in the format [id, x,y,width,height].
[DETECTIONS]
[sky-0: 0.41,0.00,0.20,0.15]
[sky-1: 0.62,0.00,0.26,0.10]
[wall-0: 0.01,0.02,0.35,0.86]
[person-0: 0.38,0.35,0.12,0.40]
[person-1: 0.45,0.35,0.08,0.39]
[wall-1: 0.39,0.10,0.35,0.48]
[person-2: 0.71,0.29,0.20,0.51]
[wall-2: 0.75,0.03,0.24,0.88]
[person-3: 0.00,0.57,0.22,0.42]
[handbag-0: 0.21,0.75,0.11,0.23]
[person-4: 0.26,0.48,0.20,0.52]
[floor-wood-0: 0.40,0.59,0.60,0.41]
[/DETECTIONS]
[IMG]
```
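The input format above can be assembled programmatically. The following is a minimal sketch, not the authors' preprocessing code: the function name `build_prompt` and the tuple layout are hypothetical, coordinates are assumed to be normalized to the range 0 to 1 as in the example, and two-decimal formatting is chosen to match the example. How the `[IMG]` placeholder is paired with the actual image depends on the processor you use and is not covered here.

```python
from typing import Iterable, Tuple

# Instruction text copied from the input format shown above.
SYSTEM_PROMPT = (
    "You are an AI assistant that can see and understand images. "
    "I will provide you with an image and the detected objects in it along "
    "with their positions and dimensions in the format [id, x,y,width,height]."
)

# Hypothetical layout for one detection: (id, x, y, width, height).
Detection = Tuple[str, float, float, float, float]


def build_prompt(detections: Iterable[Detection]) -> str:
    """Format detections into the text layout shown in the example above.

    Assumes coordinates are already normalized to [0, 1].
    """
    lines = [SYSTEM_PROMPT, "[DETECTIONS]"]
    for det_id, x, y, w, h in detections:
        # Two decimals, comma-separated, matching the example entries.
        lines.append(f"[{det_id}: {x:.2f},{y:.2f},{w:.2f},{h:.2f}]")
    lines.append("[/DETECTIONS]")
    lines.append("[IMG]")
    return "\n".join(lines)


prompt = build_prompt([
    ("sky-0", 0.41, 0.00, 0.20, 0.15),
    ("person-0", 0.38, 0.35, 0.12, 0.40),
])
print(prompt)
```

The detection IDs (for example `person-0`) are the same identifiers the model reuses inside its grounding tags, so they should be unique within one image.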

### Example Output

The model will generate a grounded caption using three types of tags:
- `<gdo>` for objects
- `<gda>` for actions
- `<gdl>` for locations

Example output:

```
In this scene, a group of individuals is gathered in what appears to be a <gdl class="wall" wall-0 wall-1 wall-2>rugged, makeshift shelter</gdl>. The <gdl class="wall" wall-0 wall-1 wall-2>walls</gdl> are constructed from <gdl class="wall" wall-0 wall-1 wall-2>rough materials</gdl>, giving the space a temporary and utilitarian feel. The <gdl class="sky" sky-0 sky-1>sky</gdl> is visible in the background, suggesting that the shelter is partially open or lacks a complete roof.
Several <gdo class="person" person-0 person-1 person-2 person-3 person-4>people</gdo> are present in the scene, each engaged in their own activities. <gdo class="person" person-0>One individual</gdo> <gda class="sit" person-0>sits</gda> on the <gdl class="floor-wood" floor-wood-0>ground</gdl>, while <gdo class="person" person-1>another person</gdo> <gda class="sit" person-1>is seated</gda> nearby. <gdo class="person" person-2>Another person</gdo> <gda class="sit" person-2>is also sitting</gda> on the <gdl class="floor-wood" floor-wood-0>ground</gdl>, and <gdo class="person" person-3>a fourth individual</gdo> <gda class="sit" person-3>is seated</gda> as well. <gdo class="person" person-4>An additional person</gdo> <gda class="sit" person-4>is sitting</gda> close by.
The <gdo class="handbag" handbag-0>handbag</gdo> is placed on the <gdl class="floor-wood" floor-wood-0>ground</gdl> near one of the individuals, suggesting they might have brought some personal belongings with them. The overall atmosphere of the scene is one of simplicity and resilience, with the individuals making the best of their surroundings in this temporary shelter.
```
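Downstream code usually needs the grounded spans as structured data. The sketch below parses the tag syntax seen in the example output with a regular expression; it is an illustration only, and the names `GroundedSpan`, `parse_grounded_caption`, and `strip_tags` are hypothetical. It assumes tags are not nested and always follow the `<tag class="..." id-0 id-1>text</tag>` pattern shown above.

```python
import re
from typing import List, NamedTuple


class GroundedSpan(NamedTuple):
    tag: str            # "gdo" (object), "gda" (action), or "gdl" (location)
    class_name: str     # value of the class attribute, e.g. "person"
    ids: List[str]      # detection IDs referenced by the tag
    text: str           # caption text wrapped by the tag


# Matches <gdo|gda|gdl class="..." id-0 id-1 ...>text</same tag>.
TAG_RE = re.compile(
    r'<(gdo|gda|gdl)\s+class="([^"]+)"([^>]*)>(.*?)</\1>',
    re.DOTALL,
)


def parse_grounded_caption(caption: str) -> List[GroundedSpan]:
    """Return every grounded span in the caption, in order of appearance."""
    return [
        GroundedSpan(tag, class_name, ids.split(), text)
        for tag, class_name, ids, text in TAG_RE.findall(caption)
    ]


def strip_tags(caption: str) -> str:
    """Remove grounding tags, leaving the plain caption text."""
    return re.sub(r"</?gd[oal][^>]*>", "", caption)


caption = (
    '<gdo class="person" person-0>One individual</gdo> '
    '<gda class="sit" person-0>sits</gda> on the '
    '<gdl class="floor-wood" floor-wood-0>ground</gdl>.'
)
spans = parse_grounded_caption(caption)
for span in spans:
    print(span.tag, span.class_name, span.ids, repr(span.text))
print(strip_tags(caption))
```

Comparing the parsed IDs against the IDs supplied in the `[DETECTIONS]` block is one simple way to flag references to objects that were never detected.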


## Bias, Risks, and Limitations

- The model was trained on movie scenes from MovieNet, which may introduce biases in terms of scene composition, lighting, and camera angles
- Performance may vary for real-world images that differ significantly from movie scenes
- The model relies on pre-detected objects and their bounding boxes; the original paper used Mask2Former for object detection


### Recommendations

- Use in conjunction with a robust object detection system
- Verify grounding accuracy for critical applications
- Consider the movie-centric nature of the training data when applying to other domains

## Training Details

### Training Data

The model was trained on the GroundCap dataset, which contains:
- 52,016 images from 77 movies
- 344 human-annotated captions
- 52,016 automatically generated captions

### Training Procedure

The training followed a two-stage approach:

#### Stage 1:
- Training on 52,016 automatically generated captions
- Learning rate: 2×10^-4
- Epochs: 2
- Batch size: 64 (with gradient accumulation)

#### Stage 2:
- Fine-tuning on 344 human-refined captions
- Learning rate: 2×10^-6
- Epochs: 2
- Batch size: 32 (with gradient accumulation)

#### Training Hyperparameters

- **LoRA Configuration:**
  - Rank: 16
  - Alpha: 32
  - Targeted layers: Self-attention (query, key, value, output) and MLP (gate, up, down)
- **Optimizer:** AdamW
- **Weight decay:** 0.01
- **Precision:** bfloat16
- **Hardware:** 2x NVIDIA A100 (80GB)
- **Training time:** 1 day
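The batch sizes listed for each stage are effective batch sizes reached through gradient accumulation. The sketch below shows how the number of accumulation steps follows from an effective batch size; it is illustrative arithmetic, not the authors' training script. The per-device micro-batch size is not stated in this card, so the value used here is hypothetical, and data-parallel training across both GPUs is an assumption. The last line computes the LoRA scaling factor under the standard formulation (alpha divided by rank).

```python
def accumulation_steps(effective_batch: int, per_device_batch: int, num_gpus: int) -> int:
    """Gradient accumulation steps needed to reach the effective batch size."""
    samples_per_step = per_device_batch * num_gpus
    if effective_batch % samples_per_step != 0:
        raise ValueError("effective batch must be a multiple of per_device_batch * num_gpus")
    return effective_batch // samples_per_step


NUM_GPUS = 2            # from the card: 2x NVIDIA A100
PER_DEVICE_BATCH = 4    # hypothetical: not stated in the card

stage1_steps = accumulation_steps(64, PER_DEVICE_BATCH, NUM_GPUS)  # stage 1, batch size 64
stage2_steps = accumulation_steps(32, PER_DEVICE_BATCH, NUM_GPUS)  # stage 2, batch size 32

# Standard LoRA scales the low-rank update by alpha / rank.
LORA_RANK = 16
LORA_ALPHA = 32
lora_scaling = LORA_ALPHA / LORA_RANK

print(stage1_steps, stage2_steps, lora_scaling)
```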

## Evaluation

### Testing Data, Factors & Metrics

The model was evaluated on:
- 10,000 test images from GroundCap, of which 70 are human-annotated test cases

### Metrics

- **Grounding metrics:** 
  - Precision (P): Correctly grounded objects / Total objects mentioned in caption
  - Recall (R): Correctly grounded objects / Total detected objects
  - F1 score: Harmonic mean of precision and recall

- **Caption quality metrics:** 
  - BLEU-4: N-gram overlap with reference captions
  - METEOR: Semantic similarity with reference captions
  - CIDEr: Consensus-based image description evaluation
  - SPICE: Semantic propositional image caption evaluation
  - ROUGE-L: Longest common subsequence based evaluation

- **Combined metric:** 
  - gMETEOR: Harmonic mean of METEOR and grounding F1 score, combining language quality with grounding accuracy

- **Human evaluation:** (5-point Likert scale)
  - Object precision: Accuracy of object grounding and tag classification
  - Grounding recall: Coverage of detected objects in captions
  - Description accuracy: Correctness of described actions and relationships
  - Language quality: Grammar, readability, and coherence
  - Overall quality: Assessment of caption effectiveness

- **ChatGPT-4o evaluation:** (5-point Likert scale)
  - Uses same criteria as human evaluation
  - Correlations with human judgments:
    - Object Precision: 0.81 (Pearson), 0.73 (Spearman)
    - Grounding Recall: 0.76 (Pearson), 0.67 (Spearman)
    - Description Accuracy: 0.79 (Pearson), 0.77 (Spearman)
    - Language Quality: 0.59 (Pearson), 0.44 (Spearman)
    - Overall Quality: 0.78 (Pearson), 0.68 (Spearman)
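The grounding metrics defined above reduce to three counts per caption. The following sketch computes precision, recall, and F1 from those counts; the function name and the example counts are hypothetical, and how a grounding is judged "correct" is left to the evaluation protocol in the paper rather than implemented here.

```python
from typing import Tuple


def grounding_scores(n_correct: int, n_mentioned: int, n_detected: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 following the definitions in the list above.

    n_correct:   correctly grounded objects
    n_mentioned: total objects mentioned in the caption
    n_detected:  total detected objects
    """
    precision = n_correct / n_mentioned if n_mentioned else 0.0
    recall = n_correct / n_detected if n_detected else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for a single caption.
p, r, f1 = grounding_scores(n_correct=6, n_mentioned=8, n_detected=12)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

If scores are computed per caption and then averaged over a test set, the averaged F1 generally differs from the harmonic mean of the averaged precision and recall.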

### Results

Automatic metrics on test set for PixtralGroundCap:
- Precision: 0.58
- Recall: 0.96
- F1 Score: 0.69
- BLEU-4: 0.19
- METEOR: 0.23
- CIDEr: 0.51
- SPICE: 0.30
- ROUGE-L: 0.37
- gMETEOR: 0.35
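As a sanity check, the reported gMETEOR can be reproduced from the reported METEOR and F1 values using the harmonic-mean definition given in the Metrics section. This is a minimal sketch working from the rounded figures in this card, so it only approximates the value computed from unrounded scores.

```python
def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative scores (0.0 if both are zero)."""
    total = a + b
    return 2 * a * b / total if total else 0.0


METEOR = 0.23   # reported above
F1 = 0.69       # reported grounding F1 above

gmeteor = harmonic_mean(METEOR, F1)
print(f"gMETEOR ~ {gmeteor:.3f}")
```

The result is approximately 0.345, in line with the reported 0.35.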

Human evaluation results (scale 1-5):
- Object Precision: 4.22
- Grounding Recall: 4.19
- Description Accuracy: 4.08
- Language Quality: 4.91
- Overall Quality: 4.22

ChatGPT-4o evaluation results (scale 1-5):
- Object Precision: 4.21
- Grounding Recall: 4.13
- Description Accuracy: 4.01
- Language Quality: 4.90
- Overall Quality: 4.19


## Environmental Impact

- **Hardware Type:** 2x NVIDIA A100 GPUs
- **Hours used:** 24 hours
- **Cloud Provider:** INESC-ID
- **Compute Region:** Lisbon, Portugal

## Paper

[ArXiv link](https://arxiv.org/abs/2502.13898).

## Citation

**BibTeX:**
```bibtex
@article{Oliveira2025GroundCapAV,
  title={GroundCap: A Visually Grounded Image Captioning Dataset},
  author={Daniel A. P. Oliveira and Louren{ç}o Teodoro and David Martins de Matos},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:276450057}
}
```

## Model Card Authors

Daniel A. P. Oliveira, Lourenço Teodoro, and David Martins de Matos

## Model Card Contact

[email protected]


### Framework versions

- PEFT 0.13.2