File size: 4,338 Bytes
ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 c2f9d4b ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c d051991 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c 9576a99 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c ed6e5c1 7497a4c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 |
---
library_name: transformers
tags:
- vision
license: apache-2.0
pipeline_tag: zero-shot-object-detection
---
# LLMDet (tiny variant)
[LLMDet](https://arxiv.org/abs/2501.18954) model was proposed in [LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
](https://arxiv.org/abs/2501.18954) by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng.
LLMDet improves upon the [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino) and [Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino) by co-training the model with a large language model.
You can find all the LLMDet checkpoints under the [LLMDet](https://huggingface.co/collections/rziga/llmdet-68398b294d9866c16046dcdd) collection. Note that these checkpoints are inference only -- they do not include LLM which was used for training. The inference is identical to that of [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino).
## Intended uses
You can use the raw model for zero-shot object detection.
Here's how to use the model for zero-shot object detection:
```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image
# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)
# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)
# Run inference
with torch.no_grad():
outputs = model(**inputs)
# Postprocess outputs
results = processor.post_process_grounded_object_detection(
outputs,
threshold=0.4,
target_sizes=[(image.height, image.width)]
)
# Retrieve the first image result
result = results[0]
for box, score, labels in zip(result["boxes"], result["scores"], result["labels"]):
box = [round(x, 2) for x in box.tolist()]
print(f"Detected {labels} with confidence {round(score.item(), 3)} at location {box}")
```
## Training Data
This model was trained on:
- [Objects365v1](https://www.objects365.org/overview.html)
- [GOLD-G](https://arxiv.org/abs/2104.12763)
- [GRIT](https://huggingface.co/datasets/zzliang/GRIT)
- [V3Det](https://github.com/V3Det/V3Det)
- [GroundingCap-1M](https://arxiv.org/abs/2501.18954)
## Evaluation results
- Here's a table of LLMDet models and their performance on LVIS (results from [official repo](https://github.com/iSEE-Laboratory/LLMDet)):
| Model | Pre-Train Data | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP |
| --------------------------------------------------------- | -------------------------------------------- | ------------ | ----------- | ----------- | ----------- | ---------- | ---------- | ---------- | ----------- |
| [llmdet_tiny](https://huggingface.co/rziga/llmdet_tiny) | (O365,GoldG,GRIT,V3Det) + GroundingCap-1M | 44.7 | 37.3 | 39.5 | 50.7 | 34.9 | 26.0 | 30.1 | 44.3 |
| [llmdet_base](https://huggingface.co/rziga/llmdet_base) | (O365,GoldG,V3Det) + GroundingCap-1M | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 |
| [llmdet_large](https://huggingface.co/rziga/llmdet_large) | (O365V2,OpenImageV6,GoldG) + GroundingCap-1M | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 |
## BibTeX entry and citation info
```bib
@article{fu2025llmdet,
title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
journal={arXiv preprint arXiv:2501.18954},
year={2025}
}
```
|