---
language:
- multilingual
tags:
- spec-vision
- vision-language-model
- transformers
license: mit
pipeline_tag: image-text-to-text
---

# Model Summary

Spec-Vision-V1 is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available sources, with a focus on high-quality, reasoning-dense data in both text and vision. The model belongs to the SpecVision family and supports a 128K context length (in tokens). It has undergone a rigorous enhancement process, incorporating supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.

# πŸš€ Model Overview  

**Spec-Vision-V1** is built for **deep integration of visual and textual data**, enabling it to understand and process images in combination with natural language. The model has been trained on a diverse dataset containing images with associated captions, descriptions, and contextual information.  

### ✨ Key Features

- **πŸ–ΌοΈ Multimodal Processing**: Seamlessly combines image and text inputs.
- **⚑ Transformer-Based Architecture**: High efficiency in vision-language understanding.
- **πŸ“ Optimized for VQA & Captioning**: Excels in answering visual questions and generating descriptions.
- **πŸ“₯ Pre-trained Model**: Available for inference and fine-tuning.

---

## πŸ“Œ Installation

To use Spec-Vision-V1, install the required dependencies:

```bash
pip install transformers torch torchvision pillow
```
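
To sanity-check the environment, you can print the installed versions (this card does not pin minimum versions; recent releases of each package should work):

```python
# Quick environment check: all four packages should import cleanly
import PIL, torch, torchvision, transformers
print(transformers.__version__, torch.__version__, torchvision.__version__, PIL.__version__)
```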

---

## πŸ”₯ Usage

### πŸ“₯ Load the Model

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

# Load the model and processor
model_name = "Spec-Vision-V1"
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Load an example image
image = Image.open("example.jpg")

# Input text prompt
text = "Describe the image in detail."

# Process inputs
inputs = processor(images=image, text=text, return_tensors="pt")

# Generate output tokens
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode and print the generated text
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
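
For visual question answering, the same pipeline applies with a question as the prompt. A minimal sketch reusing the `model`, `processor`, and `image` loaded above; the prompt wording and `max_new_tokens` value are illustrative, not prescribed by this card:

```python
# Ask a question about the already-loaded image
question = "How many people are in the picture?"
inputs = processor(images=image, text=question, return_tensors="pt")

# Generate and decode the answer
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

answer = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(answer)
```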

---

## πŸ“Š Model Specifications

| Attribute        | Description                                  |
|-----------------|----------------------------------------------|
| **Model Name**  | Spec-Vision-V1                               |
| **Architecture** | Transformer-based Vision-Language Model    |
| **Pretrained**  | βœ… Yes                                      |
| **Dataset**     | Trained on diverse image-text pairs        |
| **Framework**   | PyTorch & Hugging Face Transformers        |

---

## 🎯 Applications

| Task                     | Description                                                   |
|--------------------------|--------------------------------------------------------------|
| **πŸ–ΌοΈ Image Captioning**    | Generates detailed descriptions for input images.         |
| **🧐 Visual Question Answering** | Answers questions about images.                  |
| **πŸ”Ž Image-Text Matching**  | Determines the relevance of an image to a given text (see the sketch below the table). |
| **🌍 Scene Understanding**  | Extracts insights from complex visual data.              |
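
Of these, image-text matching is the least direct fit for a generative model. One lightweight way to approximate it is to ask the model whether a caption fits the image; this is a hedged sketch built on the generation interface from the Usage section, not a dedicated matching head:

```python
# Frame image-text matching as a yes/no question.
# The prompt wording is illustrative; the card does not define a matching API.
caption = "A dog playing fetch in a park."
prompt = f'Does this caption accurately describe the image? Caption: "{caption}" Answer yes or no.'

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```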

---

## BLINK Benchmark

A benchmark of 14 visual tasks that humans solve very quickly but that remain hard for current multimodal LLMs.

| Benchmark                | Spec-Vision-V1 | LLaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
|--------------------------|--------------|--------------------------|---------------|---------------|------------------|-------------|-------------------|----------------|--------|
| Art Style               | 87.2         | 62.4                     | 55.6          | 52.1          | 64.1             | 70.1        | 59.8              | 70.9           | 73.3   |
| Counting                | 54.2         | 56.7                     | 54.2          | 66.7          | 51.7             | 55.0        | 59.2              | 65.0           | 65.0   |
| Forensic Detection      | 92.4         | 31.1                     | 40.9          | 34.1          | 54.5             | 38.6        | 67.4              | 60.6           | 75.8   |
| Functional Correspondence | 29.2       | 34.6                     | 24.6          | 24.6          | 33.1             | 26.9        | 33.8              | 31.5           | 43.8   |
| IQ Test                 | 25.3         | 26.7                     | 26.0          | 30.7          | 25.3             | 29.3        | 26.0              | 34.0           | 19.3   |
| Jigsaw                  | 68.0         | 86.0                     | 55.3          | 52.7          | 71.3             | 72.7        | 57.3              | 68.0           | 67.3   |
| Multi-View Reasoning    | 54.1         | 44.4                     | 48.9          | 42.9          | 48.9             | 48.1        | 55.6              | 49.6           | 46.6   |
| Object Localization     | 49.2         | 54.9                     | 53.3          | 54.1          | 44.3             | 57.4        | 62.3              | 65.6           | 68.0   |
| Relative Depth          | 69.4         | 77.4                     | 63.7          | 67.7          | 57.3             | 58.1        | 71.8              | 76.6           | 71.0   |
| Relative Reflectance    | 37.3         | 34.3                     | 32.8          | 38.8          | 32.8             | 27.6        | 36.6              | 38.8           | 40.3   |
| Semantic Correspondence | 36.7         | 31.7                     | 31.7          | 22.3          | 32.4             | 31.7        | 45.3              | 48.9           | 54.0   |
| Spatial Relation       | 65.7         | 75.5                     | 78.3          | 78.3          | 55.9             | 81.1        | 60.1              | 79.0           | 84.6   |
| Visual Correspondence  | 53.5         | 40.7                     | 34.9          | 33.1          | 29.7             | 52.9        | 72.1              | 81.4           | 86.0   |
| Visual Similarity      | 83.0         | 91.9                     | 48.1          | 45.2          | 47.4             | 77.8        | 84.4              | 81.5           | 88.1   |
| **Overall**            | **57.0**     | **53.1**                 | **45.9**      | **45.4**      | **45.8**         | **51.9**    | **56.5**          | **61.0**       | **63.2** |

---

## Video-MME Benchmark

A benchmark that comprehensively assesses the capabilities of multimodal LLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities.

| Benchmark               | Spec-Vision-V1 | LLaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
|-------------------------|--------------|--------------------------|---------------|---------------|------------------|-------------|-------------------|----------------|--------|
| Short (<2min)          | 60.8         | 62.3                     | 60.7          | 61.7          | 72.2             | 70.1        | 66.3              | 73.3           | 77.7   |
| Medium (4-15min)       | 47.7         | 47.1                     | 46.4          | 49.6          | 62.7             | 59.6        | 54.7              | 61.2           | 68.0   |
| Long (30-60min)        | 43.8         | 41.2                     | 42.6          | 46.6          | 52.1             | 53.9        | 46.6              | 53.2           | 59.6   |
| **Overall**            | **50.8**     | **50.2**                 | **49.9**      | **52.6**      | **62.3**         | **61.2**    | **55.9**          | **62.6**       | **68.4** |
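
This card does not document a video input API. A common pattern for generative VLMs on benchmarks like Video-MME is to sample a handful of frames and pass them as multiple images; the sketch below makes that assumption (unverified for Spec-Vision-V1) and requires `opencv-python`, which is not in the install list above:

```python
import cv2
from PIL import Image

def sample_frames(path, num_frames=8):
    """Uniformly sample frames from a video file as PIL images."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for idx in range(0, total, max(total // num_frames, 1)):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames[:num_frames]

# Assumes the processor accepts a list of frames per sample; this interface
# is an assumption, not confirmed by the model card.
frames = sample_frames("clip.mp4")
inputs = processor(images=frames, text="Summarize what happens in this video.", return_tensors="pt")
```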


---

## πŸ—οΈ Model Training Details

| Parameter             | Value                          |
|----------------------|--------------------------------|
| **Batch Size**      | 16                             |
| **Optimizer**       | AdamW                          |
| **Learning Rate**   | 5e-5                           |
| **Training Steps**  | 100k                           |
| **Loss Function**   | CrossEntropyLoss               |
| **Framework**       | PyTorch & Transformers         |
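
These hyperparameters map onto a standard PyTorch fine-tuning loop. A minimal sketch, assuming a hypothetical `train_dataset` whose items are processor-produced dicts (`pixel_values`, `input_ids`, `attention_mask`, `labels`); none of these names are defined by the card:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

# `train_dataset` is a hypothetical placeholder yielding processor-ready batches.
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
optimizer = AdamW(model.parameters(), lr=5e-5)  # matches the table above

model.train()
for step, batch in enumerate(train_loader):
    outputs = model(**batch)   # HF models compute cross-entropy loss internally
    loss = outputs.loss        # when `labels` are provided in the batch
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step >= 100_000:        # 100k training steps per the table
        break
```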

---

## πŸ“œ License

**Spec-Vision-V1** is released under the **MIT License**.

---

## πŸ“– Citation

If you use **Spec-Vision-V1** in your research or application, please cite:

```bibtex
@article{SpecVision2025,
  title={Spec-Vision-V1: A Vision-Language Transformer Model},
  author={SVECTOR},
  year={2025},
  journal={SVECTOR Research}
}
```

---

## πŸ“¬ Contact

For support or inquiries, reach out to **SVECTOR**:

- **🌐 Website**: [svector.co.in](https://www.svector.co.in)
- **πŸ“§ Email**: [email protected]
- **✨ GitHub**: [SVECTOR GitHub](https://github.com/SVECTOR-CORPORATION)