---
language: en
tags:
- spec-vision
- vision-language-model
- transformers
license: mit
---
# Model Summary
Spec-Vision-V1 is a lightweight, state-of-the-art open multimodal model built on datasets that include synthetic data and filtered publicly available sources, with a focus on high-quality, reasoning-dense data in both text and vision. The model belongs to the SpecVision family and supports a 128K context length (in tokens). It has undergone a rigorous enhancement process, incorporating supervised fine-tuning and direct preference optimization to ensure precise instruction adherence and robust safety measures.
# πŸš€ Model Overview
**Spec-Vision-V1** is built for **deep integration of visual and textual data**, enabling it to understand and process images in combination with natural language. The model has been trained on a diverse dataset containing images with associated captions, descriptions, and contextual information.
### ✨ Key Features
- **πŸ–ΌοΈ Multimodal Processing**: Seamlessly combines image and text inputs.
- **⚑ Transformer-Based Architecture**: High efficiency in vision-language understanding.
- **πŸ“ Optimized for VQA & Captioning**: Excels in answering visual questions and generating descriptions.
- **πŸ“₯ Pre-trained Model**: Available for inference and fine-tuning.
---
## πŸ“Œ Installation
To use Spec-Vision-V1, install the required dependencies:
```bash
pip install transformers torch torchvision pillow
```
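If the installation succeeded, the core libraries should import cleanly. A quick sanity check (no model download required; the printed versions will vary with your environment):

```python
# Verify that the required libraries are importable and report their versions.
import torch
import torchvision
import transformers
import PIL

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("transformers:", transformers.__version__)
print("pillow:", PIL.__version__)
```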
---
## πŸ”₯ Usage
### πŸ“₯ Load the Model
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch
# Load the model and processor
model_name = "Spec-Vision-V1"  # replace with the full Hub repo id (org/name) if loading from the Hub
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Load an example image
image = Image.open("example.jpg")

# Input text prompt
text = "Describe the image in detail."

# Process the image and prompt into model inputs
inputs = processor(images=image, text=text, return_tensors="pt")

# Generate output tokens
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

# Decode and print the generated text
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
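For faster inference on a GPU, the same pattern can be run in half precision. This is a minimal sketch rather than part of the official usage above: the `torch_dtype` setting, device placement, and `max_new_tokens` value are illustrative assumptions, and the model identifier should be replaced with the full Hub repo id if loading remotely.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_name = "Spec-Vision-V1"  # replace with the full Hub repo id (org/name) if needed

# Load in float16 and place on the GPU (assumes a CUDA-capable device is available)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

image = Image.open("example.jpg")
inputs = processor(images=image, text="Describe the image in detail.", return_tensors="pt").to("cuda")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```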
---
## πŸ“Š Model Specifications
| Attribute | Description |
|-----------------|----------------------------------------------|
| **Model Name** | Spec-Vision-V1 |
| **Architecture** | Transformer-based Vision-Language Model |
| **Pretrained** | βœ… Yes |
| **Dataset** | Trained on diverse image-text pairs |
| **Framework** | PyTorch & Hugging Face Transformers |
---
## 🎯 Applications
| Task | Description |
|--------------------------|--------------------------------------------------------------|
| **πŸ–ΌοΈ Image Captioning** | Generates detailed descriptions for input images. |
| **🧐 Visual Question Answering** | Answers questions about images. |
| **πŸ”Ž Image-Text Matching** | Determines the relevance of an image to a given text. |
| **🌍 Scene Understanding** | Extracts insights from complex visual data. |
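
Visual question answering, for example, follows the same processor/model pattern as captioning: the question is simply passed as the text prompt. The sketch below assumes the default prompt format accepts plain questions; the question text and `max_new_tokens` value are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_name = "Spec-Vision-V1"  # replace with the full Hub repo id (org/name) if needed
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Ask a question about the image instead of requesting a caption
image = Image.open("example.jpg")
question = "How many people are in the picture?"
inputs = processor(images=image, text=question, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```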
---
## BLINK Benchmark
A benchmark of 14 visual tasks that humans can solve very quickly but that remain difficult for current multimodal LLMs.
| Benchmark | Spec-Vision-V1 | LLaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
|--------------------------|--------------|--------------------------|---------------|---------------|------------------|-------------|-------------------|----------------|--------|
| Art Style | 87.2 | 62.4 | 55.6 | 52.1 | 64.1 | 70.1 | 59.8 | 70.9 | 73.3 |
| Counting | 54.2 | 56.7 | 54.2 | 66.7 | 51.7 | 55.0 | 59.2 | 65.0 | 65.0 |
| Forensic Detection | 92.4 | 31.1 | 40.9 | 34.1 | 54.5 | 38.6 | 67.4 | 60.6 | 75.8 |
| Functional Correspondence | 29.2 | 34.6 | 24.6 | 24.6 | 33.1 | 26.9 | 33.8 | 31.5 | 43.8 |
| IQ Test | 25.3 | 26.7 | 26.0 | 30.7 | 25.3 | 29.3 | 26.0 | 34.0 | 19.3 |
| Jigsaw | 68.0 | 86.0 | 55.3 | 52.7 | 71.3 | 72.7 | 57.3 | 68.0 | 67.3 |
| Multi-View Reasoning | 54.1 | 44.4 | 48.9 | 42.9 | 48.9 | 48.1 | 55.6 | 49.6 | 46.6 |
| Object Localization | 49.2 | 54.9 | 53.3 | 54.1 | 44.3 | 57.4 | 62.3 | 65.6 | 68.0 |
| Relative Depth | 69.4 | 77.4 | 63.7 | 67.7 | 57.3 | 58.1 | 71.8 | 76.6 | 71.0 |
| Relative Reflectance | 37.3 | 34.3 | 32.8 | 38.8 | 32.8 | 27.6 | 36.6 | 38.8 | 40.3 |
| Semantic Correspondence | 36.7 | 31.7 | 31.7 | 22.3 | 32.4 | 31.7 | 45.3 | 48.9 | 54.0 |
| Spatial Relation | 65.7 | 75.5 | 78.3 | 78.3 | 55.9 | 81.1 | 60.1 | 79.0 | 84.6 |
| Visual Correspondence | 53.5 | 40.7 | 34.9 | 33.1 | 29.7 | 52.9 | 72.1 | 81.4 | 86.0 |
| Visual Similarity | 83.0 | 91.9 | 48.1 | 45.2 | 47.4 | 77.8 | 84.4 | 81.5 | 88.1 |
| **Overall** | **57.0** | **53.1** | **45.9** | **45.4** | **45.8** | **51.9** | **56.5** | **61.0** | **63.2** |
---
## Video-MME Benchmark
A benchmark that comprehensively assesses the capabilities of multimodal LLMs in processing video data, covering a wide range of visual domains, temporal durations, and data modalities.
| Benchmark | Spec-Vision-V1 | LLaVA-Interleave-Qwen-7B | InternVL-2-4B | InternVL-2-8B | Gemini-1.5-Flash | GPT-4o-mini | Claude-3.5-Sonnet | Gemini-1.5-Pro | GPT-4o |
|-------------------------|--------------|--------------------------|---------------|---------------|------------------|-------------|-------------------|----------------|--------|
| Short (<2min) | 60.8 | 62.3 | 60.7 | 61.7 | 72.2 | 70.1 | 66.3 | 73.3 | 77.7 |
| Medium (4-15min) | 47.7 | 47.1 | 46.4 | 49.6 | 62.7 | 59.6 | 54.7 | 61.2 | 68.0 |
| Long (30-60min) | 43.8 | 41.2 | 42.6 | 46.6 | 52.1 | 53.9 | 46.6 | 53.2 | 59.6 |
| **Overall** | **50.8** | **50.2** | **49.9** | **52.6** | **62.3** | **61.2** | **55.9** | **62.6** | **68.4** |
---
## πŸ—οΈ Model Training Details
| Parameter | Value |
|----------------------|--------------------------------|
| **Batch Size** | 16 |
| **Optimizer** | AdamW |
| **Learning Rate** | 5e-5 |
| **Training Steps** | 100k |
| **Loss Function** | CrossEntropyLoss |
| **Framework** | PyTorch & Transformers |
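
For reference, the hyperparameters above map directly onto a standard Hugging Face `TrainingArguments` configuration. The snippet below is only an illustrative sketch of that mapping, not the original training script; dataset preparation, collation, and the `Trainer` setup are omitted and depend on the target task, and the output directory name is hypothetical.

```python
from transformers import TrainingArguments

# Illustrative mapping of the training table onto TrainingArguments (not the original script).
training_args = TrainingArguments(
    output_dir="spec-vision-v1-finetune",  # hypothetical output directory
    per_device_train_batch_size=16,        # Batch Size
    learning_rate=5e-5,                    # Learning Rate
    max_steps=100_000,                     # Training Steps
    optim="adamw_torch",                   # Optimizer: AdamW
)
# Causal LM fine-tuning with Trainer uses cross-entropy loss by default,
# matching the CrossEntropyLoss entry in the table.
```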
---
## πŸ“œ License
**Spec-Vision-V1** is released under the **MIT License**.
---
## πŸ“– Citation
If you use **Spec-Vision-V1** in your research or application, please cite:
```bibtex
@article{SpecVision2025,
  title   = {Spec-Vision-V1: A Vision-Language Transformer Model},
  author  = {SVECTOR},
  journal = {SVECTOR Research},
  year    = {2025}
}
```
---
## πŸ“¬ Contact
For support or inquiries, reach out to **SVECTOR**:
- **🌐 Website**: [svector.co.in](https://www.svector.co.in)
- **πŸ“§ Email**: [email protected]
- **✨ GitHub**: [SVECTOR GitHub](https://github.com/SVECTOR-CORPORATION)