---
language:
- multilingual
tags:
- spec-vision
- vision-language-model
- transformers
license: mit
pipeline_tag: image-text-to-text
---
# Model Summary
Spec-Vision-V1 is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available sources, with a focus on high-quality, reasoning-dense data in both text and vision. The model belongs to the SpecVision family and supports a 128K context length (in tokens). It has undergone a rigorous enhancement process, incorporating supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.
# Model Overview
**Spec-Vision-V1** is built for **deep integration of visual and textual data**, enabling it to understand and process images in combination with natural language. The model has been trained on a diverse dataset containing images with associated captions, descriptions, and contextual information.
### Key Features
- **Multimodal Processing**: Seamlessly combines image and text inputs.
- **Transformer-Based Architecture**: High efficiency in vision-language understanding.
- **Optimized for VQA & Captioning**: Excels in answering visual questions and generating descriptions.
- **Pre-trained Model**: Available for inference and fine-tuning.
---
## Installation
To use Spec-Vision-V1, install the required dependencies:
```bash
pip install transformers torch torchvision pillow
```
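If you want to confirm that the environment is ready before loading the model, a quick check like the sketch below prints the installed package versions and whether a GPU is visible (nothing here is specific to Spec-Vision-V1):
```python
# Environment sanity check: prints installed versions and GPU visibility.
import PIL
import torch
import torchvision
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("pillow:", PIL.__version__)
```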
---
## Usage
### Load the Model
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

# Load the model and processor
model_name = "Spec-Vision-V1"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Load an example image
image = Image.open("example.jpg")

# Input text prompt
text = "Describe the image in detail."

# Process inputs
inputs = processor(images=image, text=text, return_tensors="pt")

# Generate output tokens
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode and print the generated text
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
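The example above keeps everything in full precision on the CPU. On a GPU machine you can usually reduce memory use by loading the weights in half precision; the sketch below relies only on standard `from_pretrained` arguments (`torch_dtype` and `device_map`, the latter requiring the `accelerate` package) and is an assumed convenience setup, not a configuration prescribed by this model card:
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "Spec-Vision-V1"

# Load half-precision weights and place them on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

image = Image.open("example.jpg")
inputs = processor(images=image, text="Describe the image in detail.", return_tensors="pt")
inputs = inputs.to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```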
---
## Model Specifications
| Attribute        | Description                              |
|------------------|------------------------------------------|
| **Model Name**   | Spec-Vision-V1                           |
| **Architecture** | Transformer-based Vision-Language Model  |
| **Pretrained**   | Yes                                      |
| **Dataset**      | Trained on diverse image-text pairs      |
| **Framework**    | PyTorch & Hugging Face Transformers      |
---
## Applications
| Task | Description |
|-------------------------------|--------------------------------------------------------|
| **Image Captioning**          | Generates detailed descriptions for input images.      |
| **Visual Question Answering** | Answers questions about images (see the example below). |
| **Image-Text Matching**       | Determines the relevance of an image to a given text.  |
| **Scene Understanding**       | Extracts insights from complex visual data.            |
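Visual question answering, for instance, follows the same call pattern as the captioning example above; only the text prompt changes. The image file name and question below are purely illustrative:
```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_name = "Spec-Vision-V1"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# VQA: the question is passed as the text prompt alongside the image.
image = Image.open("street_scene.jpg")  # illustrative file name
question = "How many people are crossing the street?"

inputs = processor(images=image, text=question, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```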
---
## BLINK Benchmark
A benchmark with 14 visual tasks that humans can solve very quickly but are still hard for current multimodal LLMs.
| Benchmark | Spec-Vision-V1 | LlaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
|--------------------------|--------------|--------------------------|---------------|---------------|------------------|-------------|-------------------|----------------|--------|
| Art Style | 87.2 | 62.4 | 55.6 | 52.1 | 64.1 | 70.1 | 59.8 | 70.9 | 73.3 |
| Counting | 54.2 | 56.7 | 54.2 | 66.7 | 51.7 | 55.0 | 59.2 | 65.0 | 65.0 |
| Forensic Detection | 92.4 | 31.1 | 40.9 | 34.1 | 54.5 | 38.6 | 67.4 | 60.6 | 75.8 |
| Functional Correspondence | 29.2 | 34.6 | 24.6 | 24.6 | 33.1 | 26.9 | 33.8 | 31.5 | 43.8 |
| IQ Test | 25.3 | 26.7 | 26.0 | 30.7 | 25.3 | 29.3 | 26.0 | 34.0 | 19.3 |
| Jigsaw | 68.0 | 86.0 | 55.3 | 52.7 | 71.3 | 72.7 | 57.3 | 68.0 | 67.3 |
| Multi-View Reasoning | 54.1 | 44.4 | 48.9 | 42.9 | 48.9 | 48.1 | 55.6 | 49.6 | 46.6 |
| Object Localization | 49.2 | 54.9 | 53.3 | 54.1 | 44.3 | 57.4 | 62.3 | 65.6 | 68.0 |
| Relative Depth | 69.4 | 77.4 | 63.7 | 67.7 | 57.3 | 58.1 | 71.8 | 76.6 | 71.0 |
| Relative Reflectance | 37.3 | 34.3 | 32.8 | 38.8 | 32.8 | 27.6 | 36.6 | 38.8 | 40.3 |
| Semantic Correspondence | 36.7 | 31.7 | 31.7 | 22.3 | 32.4 | 31.7 | 45.3 | 48.9 | 54.0 |
| Spatial Relation | 65.7 | 75.5 | 78.3 | 78.3 | 55.9 | 81.1 | 60.1 | 79.0 | 84.6 |
| Visual Correspondence | 53.5 | 40.7 | 34.9 | 33.1 | 29.7 | 52.9 | 72.1 | 81.4 | 86.0 |
| Visual Similarity | 83.0 | 91.9 | 48.1 | 45.2 | 47.4 | 77.8 | 84.4 | 81.5 | 88.1 |
| **Overall** | **57.0** | **53.1** | **45.9** | **45.4** | **45.8** | **51.9** | **56.5** | **61.0** | **63.2** |
---
## Video-MME Benchmark
A benchmark that comprehensively assesses the capabilities of multimodal LLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities.
| Benchmark | Spec-Vision-V1 | LlaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
|-------------------------|--------------|--------------------------|---------------|---------------|------------------|-------------|-------------------|----------------|--------|
| Short (<2min) | 60.8 | 62.3 | 60.7 | 61.7 | 72.2 | 70.1 | 66.3 | 73.3 | 77.7 |
| Medium (4-15min) | 47.7 | 47.1 | 46.4 | 49.6 | 62.7 | 59.6 | 54.7 | 61.2 | 68.0 |
| Long (30-60min) | 43.8 | 41.2 | 42.6 | 46.6 | 52.1 | 53.9 | 46.6 | 53.2 | 59.6 |
| **Overall** | **50.8** | **50.2** | **49.9** | **52.6** | **62.3** | **61.2** | **55.9** | **62.6** | **68.4** |
---
## Model Training Details
| Parameter | Value |
|----------------------|--------------------------------|
| **Batch Size** | 16 |
| **Optimizer** | AdamW |
| **Learning Rate** | 5e-5 |
| **Training Steps** | 100k |
| **Loss Function** | CrossEntropyLoss |
| **Framework** | PyTorch & Transformers |
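To illustrate how these hyperparameters map onto a standard Hugging Face training setup, the sketch below fills them into `TrainingArguments`. The output directory, logging and saving cadence, and mixed-precision flag are assumptions for the example, not a record of the actual training run:
```python
from transformers import TrainingArguments

# Hyperparameters from the table above; everything else is illustrative.
training_args = TrainingArguments(
    output_dir="spec-vision-v1-finetune",  # hypothetical output path
    per_device_train_batch_size=16,        # Batch Size
    learning_rate=5e-5,                    # Learning Rate
    max_steps=100_000,                     # Training Steps
    optim="adamw_torch",                   # AdamW optimizer
    logging_steps=100,                     # assumption
    save_steps=5_000,                      # assumption
    fp16=True,                             # assumption: mixed-precision training
)

# A causal language model trained with transformers' Trainer uses
# CrossEntropyLoss by default, matching the loss function listed above:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., data_collator=...)
# trainer.train()
```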
---
## License
**Spec-Vision-V1** is released under the **MIT License**.
---
## Citation
If you use **Spec-Vision-V1** in your research or application, please cite:
```bibtex
@article{SpecVision2025,
title={Spec-Vision-V1: A Vision-Language Transformer Model},
author={SVECTOR},
year={2025},
journal={SVECTOR Research}
}
```
---
## Contact
For support or inquiries, reach out to **SVECTOR**:
- **Website**: [svector.co.in](https://www.svector.co.in)
- **Email**: [email protected]
- **GitHub**: [SVECTOR GitHub](https://github.com/SVECTOR-CORPORATION)