---
library_name: transformers
license: other
base_model: nvidia/mit-b0
tags:
- generated_from_trainer
datasets:
- scene_parse_150
model-index:
- name: segformer-b0-scene-parse-150
  results: []
metrics:
- mean_iou
pipeline_tag: image-segmentation
---

# Segformer-b0-scene-parse-150

This model is a fine-tuned version of [nvidia/mit-b0](https://huggingface.co/nvidia/mit-b0), trained on the `scene_parse_150` dataset. It performs semantic segmentation for scene parsing: each pixel of an input image is assigned one of the dataset's scene categories.

## Evaluation Results

The model achieved the following results on the evaluation set:

- **Loss**: 1.8435
- **Mean IoU**: 0.0881
- **Mean Accuracy**: 0.1619
- **Overall Accuracy**: 0.6663

**Per-Category IoU** and **Per-Category Accuracy** values are available but sparse, and performance varies widely across categories. The gap between overall accuracy (0.6663) and mean accuracy (0.1619) is consistent with this: overall accuracy is weighted by pixel count, so a few frequent categories dominate it, while mean accuracy and mean IoU weight every category equally.

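To make the relationship between the three metrics concrete, here is a minimal sketch that derives them from a confusion matrix. It follows the standard definitions and is not necessarily the exact code used for this evaluation. The function name, the skipping of categories with no pixels (similar in spirit to a NaN-aware mean), and the toy 3-class matrix are all illustrative assumptions.

```python
def segmentation_metrics(confusion):
    """Compute overall accuracy, mean accuracy, and mean IoU.

    confusion[i][j] is the number of pixels whose true class is i
    and whose predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    per_class_acc = []
    per_class_iou = []
    for i in range(n):
        tp = confusion[i][i]
        gt = sum(confusion[i])                          # pixels truly of class i
        pred = sum(confusion[r][i] for r in range(n))   # pixels predicted as class i
        union = gt + pred - tp
        # Assumption: classes with no pixels are skipped rather than counted as 0.
        if gt > 0:
            per_class_acc.append(tp / gt)
        if union > 0:
            per_class_iou.append(tp / union)

    return {
        "overall_accuracy": correct / total,
        "mean_accuracy": sum(per_class_acc) / len(per_class_acc),
        "mean_iou": sum(per_class_iou) / len(per_class_iou),
    }


# Toy example (made-up numbers): one frequent class, two rare ones.
toy_confusion = [
    [90, 5, 5],
    [8, 2, 0],
    [9, 0, 1],
]
toy_metrics = segmentation_metrics(toy_confusion)
```

In the toy matrix the frequent class is predicted well and the two rare classes poorly, so overall accuracy is much higher than mean accuracy and mean IoU. The reported numbers show the same pattern.
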
## Model Description

SegFormer-B0 pairs a hierarchical Transformer encoder, the Mix Transformer (MiT-B0), with a lightweight all-MLP decode head. Unlike the plain Vision Transformer (ViT), which works at a single resolution, the encoder produces feature maps at several scales, and the decode head fuses them into the segmentation map.

More information is needed on architectural adjustments and preprocessing requirements.

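The hierarchical features can be illustrated by computing the shape of each encoder stage's output. This is a sketch only: the strides and channel widths below are the commonly documented MiT-B0 values and are assumed rather than read from this checkpoint's config. The helper names are hypothetical, and the arithmetic assumes the input height and width are divisible by 32.

```python
# Assumed MiT-B0 settings: downsampling stride and channel width per stage.
MIT_B0_STRIDES = (4, 8, 16, 32)
MIT_B0_CHANNELS = (32, 64, 160, 256)


def encoder_feature_shapes(height, width):
    """Return (channels, height, width) for each of the four encoder stages."""
    return [
        (channels, height // stride, width // stride)
        for stride, channels in zip(MIT_B0_STRIDES, MIT_B0_CHANNELS)
    ]


def logits_shape(height, width, num_labels=150):
    """The decode head predicts at 1/4 of the input resolution."""
    return (num_labels, height // 4, width // 4)


shapes = encoder_feature_shapes(512, 512)
# shapes == [(32, 128, 128), (64, 64, 64), (160, 32, 32), (256, 16, 16)]
```

Because the logits come out at a quarter of the input resolution, they are usually upsampled to the original image size before taking the per-pixel argmax.
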
## Intended Uses & Limitations

- **Use Cases**: Scene parsing and semantic segmentation of images that contain diverse visual categories.
- **Limitations**: Performance varies significantly between categories, as the sparse per-category accuracy and IoU values and the low mean IoU (0.0881) show. The model may struggle with underrepresented classes and with categories that are visually hard to tell apart.
- More information is needed on intended domains and further limitations.

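For the use cases above, inference might look like the following sketch. The repository id in `MODEL_ID` is a hypothetical placeholder and must be replaced with this checkpoint's actual Hub path. Loading the image processor from `nvidia/mit-b0` is an assumption about how the inputs were preprocessed. The heavy imports sit inside the function only so that the snippet can be loaded without them installed.

```python
# Hypothetical repository id: replace with the actual Hub path of this checkpoint.
MODEL_ID = "your-username/segformer-b0-scene-parse-150"
# Assumption: preprocessing matches the base model's image processor.
PROCESSOR_ID = "nvidia/mit-b0"


def segment_image(image_path, model_id=MODEL_ID, processor_id=PROCESSOR_ID):
    """Return an (H, W) tensor of predicted class ids for one image."""
    import torch
    from PIL import Image
    from transformers import (
        SegformerForSemanticSegmentation,
        SegformerImageProcessor,
    )

    processor = SegformerImageProcessor.from_pretrained(processor_id)
    model = SegformerForSemanticSegmentation.from_pretrained(model_id)
    model.eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # PIL reports (width, height); the post-processor expects (height, width).
    target_size = image.size[::-1]
    # Upsamples the logits to the original image size and takes the argmax.
    segmentation = processor.post_process_semantic_segmentation(
        outputs, target_sizes=[target_size]
    )[0]
    return segmentation
```

Calling `segment_image("scene.jpg")` (the file name is illustrative) would return one class id per pixel. Given the limitations listed above, predictions for rare categories should be checked before they are relied on.
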
## Training and Evaluation Data

The model was trained on the `scene_parse_150` dataset, which contains diverse visual scenes annotated with 150 semantic categories. More information is needed on dataset specifics and preprocessing steps.

## Training Procedure

### Hyperparameters

- **Learning Rate**: 6e-05
- **Training Batch Size**: 2
- **Evaluation Batch Size**: 2
- **Seed**: 42
- **Optimizer**: Adam (betas=(0.9, 0.999), epsilon=1e-08)
- **Learning Rate Scheduler**: Linear
- **Number of Epochs**: 50

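As a rough illustration of the linear scheduler listed above, the sketch below decays the learning rate from 6e-05 to zero over the course of training. The total step count used in the example is hypothetical because the card does not record it. A warmup of zero steps is assumed, and the function name is illustrative.

```python
def linear_lr(step, total_steps, base_lr=6e-5, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup and decay.

    Assumption: no warmup by default, and decay reaches exactly zero
    at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run length; the real value depends on dataset size,
# batch size (2), and number of epochs (50).
example_total_steps = 1000
lr_start = linear_lr(0, example_total_steps)
lr_mid = linear_lr(500, example_total_steps)
lr_end = linear_lr(1000, example_total_steps)
```

Under these assumptions the rate starts at the configured 6e-05, is halved at the midpoint of training, and reaches zero at the final step.
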
### Training Results

The model was trained for 50 epochs. Its convergence behavior, training duration, and hardware environment are not recorded here. More information is needed.

## Framework Versions

- Transformers 4.44.2
- PyTorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1