
Blinkin VL 32B Distill Monolith


Introduction

[Figure: landscape of vision-language model approaches: scene graph-based, domain-specific modifications, and no domain-specific modifications]
  • Our strategy prioritizes generality and scalability: by placing Blinkin VLM in the "No Domain-specific Modifications" category, we are investing in a single, powerful model designed for broad applicability. The core advantage is flexibility: the model can handle a wide array of unforeseen tasks without being re-engineered. The trade-off is that it may be less optimized for any single niche task than a specialized model would be.

  • Comparison against "Scene Graph-based" approaches: We deliberately moved away from legacy methods like Scene Graphs.

    • Pro: Our method scales much better with modern, large-scale datasets and is more adaptable to abstract or open-world concepts that are difficult to define in a rigid graph structure.
    • Con: We sacrifice the explicit interpretability that scene graphs provide. A graph-based model can clearly show why it reached a conclusion (e.g., "the man is next to the car"), whereas our model's reasoning is less transparent, making it harder to debug for high-precision tasks.
  • Comparison against "Domain-specific Modifications": This represents a different strategic bet. These models are engineered for excellence in one area, such as document analysis (Donut, Pix2Struct) or by using specific fusion strategies (UNITER).

    • Pro: We avoid the high development overhead of creating and maintaining separate, specialized architectures for each new problem domain. Our single, foundational model serves as a unified platform.
    • Con: A model specifically modified for a narrow task will likely outperform our general-purpose model on that specific task. Our strength is strong performance across many domains, not necessarily being the absolute best in one.

Model Architecture

[Figure: Blinkin VLM distillation architecture]

The above architecture diagram illustrates the distillation process used to train Blinkin VLM 32B.

  • An experienced "Teacher" model (bottom) guides a new "Student" model (top) on how to understand the world.

  • The system has two specialized parts: one for analyzing images (the red boxes) and another for reading text (the yellow boxes). The goal is to get these two specialists to work together.

  • The "Student" learns by trying to imitate the Teacher. It compares its own understanding of an image and text with the Teacher's more experienced understanding and adjusts itself to match.

  • A key training technique is a "fill-in-the-blanks" game. The AI is shown a sentence with a missing word (like [MASK]) or an image with a missing piece and must guess what should go there. This helps it learn the connection between objects and their names.

  • The final goal is for the "Student," after all its training, to be able to look at an image and generate a detailed, accurate description on its own, just as shown in the example with the piston engine.
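To make the two training signals concrete, the following is a minimal PyTorch sketch, not Blinkin's actual training code: the student matches the teacher's softened output distribution (imitation), and separately predicts the tokens hidden behind [MASK] (the fill-in-the-blanks game). The module interfaces, tensor shapes, and temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_step(student, teacher, image, input_ids,
                      mask_positions, labels, temperature=2.0):
    # Both models are assumed to map (image, input_ids) to token logits
    # of shape (batch, seq_len, vocab_size); these interfaces are
    # illustrative, not the actual Blinkin code.
    with torch.no_grad():                      # the teacher stays frozen
        teacher_logits = teacher(image, input_ids)
    student_logits = student(image, input_ids)

    # Imitation loss: KL divergence between the student's and teacher's
    # temperature-softened distributions, scaled by T^2 as is standard
    # in knowledge distillation.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # "Fill-in-the-blanks" loss: cross-entropy on the [MASK]ed positions
    # only, so the student must recover the hidden tokens.
    mlm_loss = F.cross_entropy(student_logits[mask_positions],
                               labels[mask_positions])

    return kd_loss + mlm_loss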

Evaluation

MMMU Benchmark

[Figure: MMMU benchmark scores for Blinkin VLM 34B, competing models, and human experts]

  • The substantial gap between the "Human Expert - Best" (~88) and "Human Expert - Worst" (~76) scores indicates a high degree of ambiguity or required domain expertise within the MMMU benchmark. The top-performing models are only just beginning to approach the level of a lower-bound human expert.
  • Blinkin VLM 34B is positioned strongly within a highly competitive cluster of models. It outperforms several notable proprietary models, such as Gemini 1.0 Pro, and is on par with other leading open-source models in its size class, like LLaVA-NeXT-34B. This places it in the upper quartile of the models benchmarked.
  • The chart demonstrates that raw parameter count is not the sole determinant of performance. For instance, the smaller InternVL-Chat-V1.5 outperforms LLaVA-NeXT-70B. Similarly, Blinkin VLM 34B surpasses several models with presumably larger parameter counts, highlighting the increasing importance of model architecture, data quality, and training methodology.
  • The model distribution suggests distinct performance tiers. A top echelon is clearly dominated by proprietary models like Gemini Ultra and GPT-4V. Below this, there is a large, densely packed "chasing pack" in the 40-55 score range, where Blinkin VLM 34B resides. This dense clustering implies that many current-generation architectures and training approaches are converging on a similar performance ceiling.
  • While the top two spots are held by closed-source models, the strong performance of models like Yi-VL-34B, Blinkin VLM 34B, and LLaVA-NeXT variants demonstrates that open-source efforts are highly competitive and are successfully narrowing the performance gap with their proprietary counterparts.

Vision

Dataset                      Qwen2.5-VL-72B   LLaVA-1.6-34B   Blinkin-VLM-34B
MMStar                       70.8             68.3            69.5
OCRBenchV2 (en/zh)           61.5/63.7        47.8/46.1       57.2/59.1
CC-OCR                       79.8             68.7            77.1
DocVQA                       96.4             96.5            94.8
InfoVQA                      87.3             84.5            83.4
LVBench                      47.3             -               49.0
VideoMME (w/o subs/w subs)   73.3/79.1        71.2/77.8       70.5/77.9
MMBench-Video                2.02             1.7             1.93

Text

Model                   MMLU   MMLU-Pro   GPQA-Diamond   MBPP
Qwen2.5-VL-32B          78.4   68.8       46.0           84.0
Blinkin-VLM-34B         77.0   62.6       42.5           83.9
Mistral-Small-3.1-24B   80.6   69.3       46.0           74.7
Gemma3-27B-IT           76.9   67.5       42.4           74.4
GPT-4o                  82.0   61.7       39.4           84.8
Claude-3.5-Haiku        77.6   65.0       41.6           85.6

Applications

  1. Advanced Driver Monitoring: In-car cameras with Blinkin VLM can go beyond simple eye-tracking. They can understand context and generate specific alerts like, "Driver is distracted, looking down at a phone in their lap," or identify objects to provide interactive help, such as allowing a driver to point at a button and ask, "What does this control?"

  2. Automotive Quality Control: During vehicle assembly, our VLM can perform end-of-line inspections with high detail. Instead of a simple pass/fail, it can scan the car's paint job and generate a specific report like, "Minor 'orange peel' texture detected on the rear passenger-side door," providing actionable feedback to the manufacturing process.

  3. Construction Site Progress Monitoring: Blinkin VLM can analyze daily drone footage of a construction site and automatically generate a human-readable progress log. For example: "Foundation for Sector C is complete. Framework for the second story has been erected. A delivery of steel beams is awaiting unloading at the north gate."

  4. Remote Expert Assistance for Heavy Machinery: A field technician can stream video of a malfunctioning engine to a senior expert. Our VLM then analyzes the feed in real-time and annotates it, stating, "This is the primary fuel injector; pressure gauge reads below optimal range." This allows the expert to diagnose and resolve the issue far more quickly.

  5. Industrial Safety and Compliance: VLM can continuously monitor a factory floor or construction site to enforce safety protocols. It can automatically detect and report violations such as, "A worker has entered the robotic arm's safety perimeter while it is active," or "Personnel in the welding area are not wearing required safety goggles."

  6. Insurance Claims Processing: After a car accident, a user can upload photos of the damage. The model can then analyze these images, identify the affected parts (e.g., "dented front bumper," "cracked headlight"), and automatically fill out the initial damage report, significantly speeding up the claims process (the usage sketch after this list shows this call pattern).

  7. Enhanced Retail & E-commerce: Blinkin VLM can automatically generate detailed product descriptions and tags just from an image. For an online clothing store, this means instantly creating descriptions like "Women's blue short-sleeve V-neck t-shirt," which improves searchability and saves countless hours of manual data entry.
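Most of the applications above reduce to the same call pattern: send an image plus a task prompt, receive text back. The following is a hypothetical quickstart assuming the model loads through the standard transformers vision-to-sequence interface; the repository id, prompt, and generation settings are assumptions, not confirmed usage for this model.

# Hypothetical usage sketch for the insurance-claims example above.
# The model id and interface are assumptions about this repository.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "akkshay/blinkin-vlm"  # assumed repository id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("claim_photo.jpg")
prompt = "Describe the visible vehicle damage and list the affected parts."

# Encode the image-prompt pair, generate a description, and decode it.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])

The same pattern covers the quality-control, progress-monitoring, and retail examples by swapping the image and the prompt.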

Citation

@article{Blinkin-VL,
  title={Blinkin-VL Technical Report},
  author={Akshay Joshi and Bartosz Pampuch and Josef Suess and Intel Research},
  journal={arXiv preprint arXiv:2502},
  year={2025}
}