---
base_model: microsoft/conditional-detr-resnet-50
datasets:
- Voxel51/fisheye8k
library_name: transformers
license: mit
tags:
- generated_from_trainer
pipeline_tag: object-detection
model-index:
- name: fisheye8k_microsoft_conditional-detr-resnet-50
  results: []
---
|
|
|
|
|
# fisheye8k_microsoft_conditional-detr-resnet-50 |
|
|
|
|
|
This model is a fine-tuned version of [microsoft/conditional-detr-resnet-50](https://huggingface.co/microsoft/conditional-detr-resnet-50) on the [Voxel51/fisheye8k](https://huggingface.co/datasets/Voxel51/fisheye8k) dataset. It is a key artifact of the **Mcity Data Engine** project. |
|
|
|
|
|
* **Paper**: [Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection](https://huggingface.co/papers/2504.21614) |
|
|
* **Project Page**: [Mcity Data Engine Docs](https://mcity.github.io/mcity_data_engine/) |
|
|
* **Code**: [GitHub Repository for Mcity Data Engine](https://github.com/mcity/mcity_data_engine) |
|
|
|
|
|
It achieves the following results on the evaluation set: |
|
|
- Loss: 1.4466 |
|
|
|
|
|
## Model description |
|
|
|
|
|
This is an object detection model fine-tuned from the `microsoft/conditional-detr-resnet-50` checkpoint on the Fisheye8K dataset. It was developed as part of the **Mcity Data Engine**, an open-source framework for iterative model improvement through open-vocabulary data selection. The Mcity Data Engine addresses the challenge of selecting and labeling suitable training samples, particularly long-tail and rare classes of interest hidden in large amounts of unlabeled data, within Intelligent Transportation Systems (ITS). This fine-tuned model demonstrates how the Data Engine can be applied to improve roadside perception for autonomous driving and smart city applications.
|
|
|
|
|
The model detects the five categories defined in its configuration: `Bus`, `Bike`, `Car`, `Pedestrian`, and `Truck`.
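To confirm the label set programmatically, you can read the mapping from the model config. A minimal sketch (the index order shown in the comment is illustrative, not guaranteed):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "mcity-data-engine/fisheye8k_microsoft_conditional-detr-resnet-50"
)
# Prints the id-to-label mapping, e.g. {0: 'Bus', 1: 'Bike', 2: 'Car', ...}
print(config.id2label)
```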
|
|
|
|
|
## Intended uses & limitations |
|
|
|
|
|
This model is intended for object detection tasks within Intelligent Transportation Systems, specifically for identifying vehicles and vulnerable road users in visual data, such as that collected from fisheye cameras. Its primary use case is within the Mcity Data Engine framework for research and development related to improving perception models with rare and novel data. |
|
|
|
|
|
**Limitations**: |
|
|
* The model's performance may vary on data significantly different from the Fisheye8K dataset (e.g., different camera types, environments, or lighting conditions). |
|
|
* Like all deep learning models, it may exhibit biases present in the training data and may not generalize perfectly to all real-world scenarios. |
|
|
* Further evaluation on diverse real-world ITS data is recommended for specific deployment scenarios. |
|
|
|
|
|
## Usage |
|
|
|
|
|
You can use this model directly with the Hugging Face `transformers` library for object detection tasks. |
|
|
|
|
|
```python
from transformers import AutoImageProcessor, AutoModelForObjectDetection
import torch
from PIL import Image
import requests

# Load image processor and model
model_name = "mcity-data-engine/fisheye8k_microsoft_conditional-detr-resnet-50"
processor = AutoImageProcessor.from_pretrained(model_name)
model = AutoModelForObjectDetection.from_pretrained(model_name)
model.eval()

# Example image (replace with your own image URL or local path)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # A standard COCO image
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")  # Ensure RGB format

# Perform inference
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Post-process and print results
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, target_sizes=target_sizes, threshold=0.9
)[0]  # Keep only detections above the confidence threshold

print("Detected objects:")
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(
        f" - {model.config.id2label[label.item()]} with confidence "
        f"{round(score.item(), 3)} at location {box}"
    )
```
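The returned boxes are in `(x_min, y_min, x_max, y_max)` pixel coordinates. To inspect results visually, you can draw them onto the image, for example with Pillow. A minimal sketch that continues from the snippet above:

```python
from PIL import ImageDraw

draw = ImageDraw.Draw(image)
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    x_min, y_min, x_max, y_max = box.tolist()
    # Draw the bounding box and annotate it with the predicted class
    draw.rectangle([x_min, y_min, x_max, y_max], outline="red", width=2)
    draw.text((x_min, max(y_min - 12, 0)), model.config.id2label[label.item()], fill="red")
image.save("detections.jpg")
```

Note that `threshold=0.9` is conservative; lowering it (e.g. to 0.5) surfaces more, lower-confidence detections.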
|
|
|
|
|
## Training and evaluation data |
|
|
|
|
|
This model was fine-tuned on the [Voxel51/fisheye8k](https://huggingface.co/datasets/Voxel51/fisheye8k) dataset. Fisheye8K contains roughly 8,000 images captured by fisheye traffic cameras, annotated with bounding boxes for the five road-user classes listed above, and focuses on intelligent transportation system scenarios.
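The dataset is published in FiftyOne format on the Hugging Face Hub, so one way to explore it locally is via FiftyOne's Hub integration. A minimal sketch, assuming a recent `fiftyone` release with `fiftyone.utils.huggingface` available:

```python
import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Download (or load a cached copy of) the dataset from the Hub
dataset = fouh.load_from_hub("Voxel51/fisheye8k")
print(dataset)

# Browse images and ground-truth boxes in the FiftyOne App
session = fo.launch_app(dataset)
```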
|
|
|
|
|
## Training procedure |
|
|
|
|
|
### Training hyperparameters |
|
|
|
|
|
The following hyperparameters were used during training (a `TrainingArguments` sketch reproducing them follows the list):
|
|
- learning_rate: 5e-05 |
|
|
- train_batch_size: 1 |
|
|
- eval_batch_size: 8 |
|
|
- seed: 0 |
|
|
- optimizer: AdamW (`adamw_torch`) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
|
|
- lr_scheduler_type: cosine |
|
|
- num_epochs: 36 |
|
|
- mixed_precision_training: Native AMP |
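These settings map onto the `transformers` `TrainingArguments` API roughly as follows. This is a hedged sketch, not the exact training script: `output_dir` is a placeholder, and the data collation and any early-stopping setup are omitted.

```python
from transformers import TrainingArguments

# Approximate reconstruction of the hyperparameters listed above.
training_args = TrainingArguments(
    output_dir="fisheye8k_microsoft_conditional-detr-resnet-50",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    seed=0,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    num_train_epochs=36,
    fp16=True,  # "Native AMP" mixed-precision training
)
```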
|
|
|
|
|
### Training results |
|
|
|
|
|
| Training Loss | Epoch | Step | Validation Loss | |
|
|
|:-------------:|:-----:|:-----:|:---------------:| |
|
|
| 1.0211 | 1.0 | 5288 | 1.5012 | |
|
|
| 0.9117 | 2.0 | 10576 | 1.4713 | |
|
|
| 0.8595 | 3.0 | 15864 | 1.4364 | |
|
|
| 0.7922 | 4.0 | 21152 | 1.5227 | |
|
|
| 0.7764 | 5.0 | 26440 | 1.6631 | |
|
|
| 0.7419 | 6.0 | 31728 | 1.4320 | |
|
|
| 0.7132 | 7.0 | 37016 | 1.4661 | |
|
|
| 0.6991 | 8.0 | 42304 | 1.4318 | |
|
|
| 0.6585 | 9.0 | 47592 | 1.4069 | |
|
|
| 0.6527 | 10.0 | 52880 | 1.4213 | |
|
|
| 0.6191 | 11.0 | 58168 | 1.4144 | |
|
|
| 0.6248 | 12.0 | 63456 | 1.3887 | |
|
|
| 0.6085 | 13.0 | 68744 | 1.4053 | |
|
|
| 0.582 | 14.0 | 74032 | 1.4418 | |
|
|
| 0.5592 | 15.0 | 79320 | 1.5815 | |
|
|
| 0.552 | 16.0 | 84608 | 1.4832 | |
|
|
| 0.5233 | 17.0 | 89896 | 1.4466 | |
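Although training was configured for 36 epochs, the logged results end at epoch 17, and the epoch-17 validation loss matches the reported evaluation loss of 1.4466; this suggests training stopped early (e.g., via early stopping on the validation loss).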
|
|
|
|
|
### Framework versions |
|
|
|
|
|
- Transformers 4.48.3 |
|
|
- Pytorch 2.5.1+cu124 |
|
|
- Datasets 3.2.0 |
|
|
- Tokenizers 0.21.0 |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
Mcity would like to thank Amazon Web Services (AWS) for their pivotal role in providing the cloud infrastructure on which the Data Engine depends. We couldn’t have done it without their tremendous support! |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use the Mcity Data Engine in your research, feel free to cite the project: |
|
|
|
|
|
```bibtex |
|
|
@misc{bogdoll2025mcitydataengine,
  title        = {Mcity Data Engine},
  author       = {Bogdoll, Daniel and Anata, Rajanikant Patnaik and Stevens, Gregory},
  year         = {2025},
  howpublished = {\url{https://github.com/mcity/mcity_data_engine}},
}
|
|
``` |