File size: 4,338 Bytes

ed6e5c1
 
7497a4c
 
 
 
ed6e5c1
 
 
7497a4c
ed6e5c1
7497a4c
 
ed6e5c1
c2f9d4b
ed6e5c1
7497a4c
ed6e5c1
 
7497a4c
ed6e5c1
7497a4c
ed6e5c1
7497a4c
ed6e5c1
7497a4c
 
 
 
ed6e5c1
 
7497a4c
d051991
7497a4c
 
 
ed6e5c1
7497a4c
 
 
 
 
ed6e5c1
7497a4c
 
 
ed6e5c1
7497a4c
 
 
 
 
 
ed6e5c1
7497a4c
 
 
 
 
 
ed6e5c1
7497a4c
ed6e5c1
7497a4c
9576a99
 
 
 
7497a4c
ed6e5c1
 
7497a4c
ed6e5c1
7497a4c
ed6e5c1
7497a4c
 
 
 
 
ed6e5c1
 
 
7497a4c
ed6e5c1
7497a4c

---
library_name: transformers
tags:
- vision
license: apache-2.0
pipeline_tag: zero-shot-object-detection
---


# LLMDet (tiny variant)

[LLMDet](https://arxiv.org/abs/2501.18954) model was proposed in [LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models
](https://arxiv.org/abs/2501.18954) by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng.

LLMDet improves upon the [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino) and [Grounding DINO](https://huggingface.co/docs/transformers/model_doc/grounding-dino) by co-training the model with a large language model.

You can find all the LLMDet checkpoints under the [LLMDet](https://huggingface.co/collections/rziga/llmdet-68398b294d9866c16046dcdd) collection. Note that these checkpoints are inference only -- they do not include LLM which was used for training. The inference is identical to that of [MM Grounding DINO](https://huggingface.co/docs/transformers/model_doc/mm-grounding-dino).


## Intended uses

You can use the raw model for zero-shot object detection.

Here's how to use the model for zero-shot object detection:

```py
import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image


# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_tiny"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Postprocess outputs
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)]
)

# Retrieve the first image result
result = results[0]
for box, score, labels in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {labels} with confidence {round(score.item(), 3)} at location {box}")
```

## Training Data

This model was trained on:
 - [Objects365v1](https://www.objects365.org/overview.html)
 - [GOLD-G](https://arxiv.org/abs/2104.12763)
 - [GRIT](https://huggingface.co/datasets/zzliang/GRIT)
 - [V3Det](https://github.com/V3Det/V3Det)
 - [GroundingCap-1M](https://arxiv.org/abs/2501.18954)


## Evaluation results

- Here's a table of LLMDet models and their performance on LVIS (results from [official repo](https://github.com/iSEE-Laboratory/LLMDet)):

    |                             Model                         | Pre-Train Data            |  MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP  | Val1.0 APr | Val1.0 APc | Val1.0 APf |  Val1.0 AP  |
    | --------------------------------------------------------- | -------------------------------------------- | ------------ | ----------- | ----------- | ----------- | ---------- | ---------- | ---------- | ----------- |
    | [llmdet_tiny](https://huggingface.co/rziga/llmdet_tiny)   | (O365,GoldG,GRIT,V3Det) + GroundingCap-1M    | 44.7         | 37.3        | 39.5        | 50.7        | 34.9       | 26.0       | 30.1       | 44.3        |
    | [llmdet_base](https://huggingface.co/rziga/llmdet_base)   | (O365,GoldG,V3Det) + GroundingCap-1M         | 48.3         | 40.8        | 43.1        | 54.3        | 38.5       | 28.2       | 34.3       | 47.8        |
    | [llmdet_large](https://huggingface.co/rziga/llmdet_large) | (O365V2,OpenImageV6,GoldG) + GroundingCap-1M | 51.1         | 45.1        | 46.1        | 56.6        | 42.0       | 31.6       | 38.8       | 50.2        |



## BibTeX entry and citation info

```bib
@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}
```