
SmolDocling-256M-preview
SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.
Features:
- DocTags for Efficient Tokenization: Introduces DocTags, an efficient and minimal representation for documents that is fully compatible with DoclingDocuments.
- OCR (Optical Character Recognition): Extracts text accurately from images.
- Layout and Localization: Preserves document structure and document element bounding boxes.
- Code Recognition: Detects and formats code blocks, including indentation.
- Formula Recognition: Identifies and processes mathematical expressions.
- Chart Recognition: Extracts and interprets chart data.
- Table Recognition: Supports column and row headers for structured table extraction.
- Figure Classification: Differentiates figures and graphical elements.
- Caption Correspondence: Links captions to relevant images and figures.
- List Grouping: Organizes and structures list elements correctly.
- Full-Page Conversion: Processes entire pages for comprehensive document conversion, including all page elements (code, equations, tables, charts, etc.).
- OCR with Bounding Boxes: Runs OCR on a region specified by a bounding box.
- General Document Processing: Trained for both scientific and non-scientific documents.
- Seamless Docling Integration: Import into Docling and export in multiple formats.
- Fast inference using vLLM: Averages 0.35 seconds per page on an A100 GPU.
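As a rough planning aid, the per-page average quoted in the last feature can be scaled into a batch-time estimate. This is a back-of-the-envelope sketch only: `estimate_batch_seconds` is a hypothetical helper, and the 0.35 s default is the A100 average from the list above, so other hardware, page content, or batching settings will likely give different numbers.

```python
# Hypothetical helper: scales the quoted average, it is not a measured benchmark.
A100_SECONDS_PER_PAGE = 0.35  # average quoted in the feature list (vLLM, A100 GPU)


def estimate_batch_seconds(num_pages: int, seconds_per_page: float = A100_SECONDS_PER_PAGE) -> float:
    """Return the estimated wall-clock time to convert `num_pages` pages."""
    if num_pages < 0:
        raise ValueError("num_pages must be non-negative")
    return num_pages * seconds_per_page


if __name__ == "__main__":
    pages = 1000
    seconds = estimate_batch_seconds(pages)
    print(f"{pages} pages: about {seconds / 60:.1f} minutes")
```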
Coming soon!
- Better chart recognition
- One-shot multi-page inference
- Chemical Recognition
- Datasets
Get started (code examples)
You can use Transformers or vLLM to perform inference, and Docling to convert the results to a variety of output formats (md, html, etc.):
Single page image inference using Transformers
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained(
    "ds4sd/SmolDocling-256M-preview",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Drop the prompt tokens so only the newly generated tokens are decoded
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)

# Create a Docling document
doc = DoclingDocument(name="Document")
doc.load_from_doctags(doctags_doc)

# Export as any format
# HTML
# doc.save_as_html(output_file)
# MD
print(doc.export_to_markdown())
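Because the example decodes with `skip_special_tokens=False`, the decoded string may still carry an end-of-turn marker. A minimal sketch of stripping it before building the `DocTagsDocument`, assuming the marker is the `<end_of_utterance>` token that appears in the vLLM chat template; `clean_doctags` is a hypothetical helper, not part of Transformers or docling_core:

```python
END_OF_UTTERANCE = "<end_of_utterance>"  # assumed end-of-turn marker


def clean_doctags(decoded: str, end_token: str = END_OF_UTTERANCE) -> str:
    """Drop end-of-turn markers and surrounding whitespace from decoded output."""
    return decoded.replace(end_token, "").strip()


if __name__ == "__main__":
    # Illustrative placeholder string, not real model output.
    raw = " <doctag>example page content</doctag><end_of_utterance>\n"
    print(clean_doctags(raw))
```

In the example above this would be applied to `doctags` right after `batch_decode`.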
Fast Batch Inference Using vLLM
# Prerequisites:
# pip install vllm
# pip install docling_core
# place page images you want to convert into "img/" dir

import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument

# Configuration
MODEL_PATH = "ds4sd/SmolDocling-256M-preview"
IMAGE_DIR = "img/"  # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to Docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192)

chat_template = f"<|im_start|>User:<image>{PROMPT_TEXT}<end_of_utterance>\nAssistant:"

image_files = sorted([f for f in os.listdir(IMAGE_DIR) if f.lower().endswith((".png", ".jpg", ".jpeg"))])

start_time = time.time()
total_tokens = 0

for idx, img_file in enumerate(image_files, 1):
    img_path = os.path.join(IMAGE_DIR, img_file)
    image = Image.open(img_path).convert("RGB")

    llm_input = {"prompt": chat_template, "multi_modal_data": {"image": image}}
    output = llm.generate([llm_input], sampling_params=sampling_params)[0]

    # Save the raw DocTags output next to the converted files
    doctags = output.outputs[0].text
    img_fn = os.path.splitext(img_file)[0]
    output_filename = img_fn + ".dt"
    output_path = os.path.join(OUTPUT_DIR, output_filename)

    with open(output_path, "w", encoding="utf-8") as f:
        f.write(doctags)

    # To convert to Docling Document, MD, HTML, etc.:
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
    doc = DoclingDocument(name="Document")
    doc.load_from_doctags(doctags_doc)

    # Export as any format
    # HTML
    # doc.save_as_html(output_file)
    # MD
    output_filename_md = img_fn + ".md"
    output_path_md = os.path.join(OUTPUT_DIR, output_filename_md)
    doc.save_as_markdown(output_path_md)

print(f"Total time: {time.time() - start_time:.2f} sec")
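The loop in this example sends one page per `llm.generate` call. Since `generate` takes a list of inputs, the requests could also be assembled up front and submitted together. The sketch below only builds that list; `build_prompt` and `build_llm_inputs` are hypothetical helpers, and the prompt layout is copied from the `chat_template` string in the example.

```python
from typing import Any


def build_prompt(instruction: str) -> str:
    """Wrap an instruction in the chat layout used by the vLLM example."""
    return f"<|im_start|>User:<image>{instruction}<end_of_utterance>\nAssistant:"


def build_llm_inputs(images: list[Any], instruction: str) -> list[dict[str, Any]]:
    """Create one vLLM input dict per page image, all sharing the same prompt."""
    prompt = build_prompt(instruction)
    return [{"prompt": prompt, "multi_modal_data": {"image": image}} for image in images]


if __name__ == "__main__":
    # Strings stand in for PIL images so the sketch runs without any files.
    inputs = build_llm_inputs(["page-1", "page-2"], "Convert page to Docling.")
    print(len(inputs))
    print(inputs[0]["prompt"])
    # With a loaded model, the whole list could be passed in one call:
    # outputs = llm.generate(inputs, sampling_params=sampling_params)
```

Whether a single large call or several smaller chunks works better depends on available GPU memory, so treat the batch size as something to tune.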
DocTags

Supported Instructions
Description | Instruction | Comment |
Full conversion | Convert this page to docling. | DocTags representation |
Chart | Convert chart to table. | (e.g., <chart>) |
Formula | Convert formula to LaTeX. | (e.g., <formula>) |
Code | Convert code to text. | (e.g., <code>) |
Table | Convert table to OTSL. | (e.g., <otsl>) OTSL: Lysak et al., 2023 |
Actions and Pipelines | OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237> | |
Actions and Pipelines | Identify element at: <loc_247><loc_482><loc_252><loc_486> | |
Actions and Pipelines | Find all 'text' elements on the page, retrieve all section headers. | |
Actions and Pipelines | Detect footer elements on the page. | |
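The location-based instructions take four `<loc_N>` tokens. A minimal sketch of building one from a pixel bounding box follows. It rests on two assumptions that this card does not state: that the tokens are ordered left, top, right, bottom, and that coordinates are normalized to a 0 to 500 grid. `bbox_to_loc_tokens` and `ocr_instruction` are hypothetical helpers, so check the technical report before relying on the scaling.

```python
def bbox_to_loc_tokens(
    bbox: tuple[float, float, float, float],
    page_width: float,
    page_height: float,
    grid: int = 500,  # ASSUMPTION: normalized coordinate range, not confirmed by this card
) -> str:
    """Convert a pixel box (left, top, right, bottom) into four <loc_N> tokens."""
    left, top, right, bottom = bbox
    scaled = [
        left * grid / page_width,
        top * grid / page_height,
        right * grid / page_width,
        bottom * grid / page_height,
    ]
    # Clamp to the grid so boxes touching the page edge stay in range.
    values = [min(grid, max(0, round(v))) for v in scaled]
    return "".join(f"<loc_{v}>" for v in values)


def ocr_instruction(bbox: tuple[float, float, float, float], page_width: float, page_height: float) -> str:
    """Build the 'OCR the text in a specific location' instruction from the table."""
    return "OCR the text in a specific location: " + bbox_to_loc_tokens(bbox, page_width, page_height)


if __name__ == "__main__":
    print(ocr_instruction((100, 200, 300, 400), page_width=1000, page_height=1000))
```

The resulting string would replace the prompt text in either inference example.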
Model Summary
- Developed by: Docling Team, IBM Research
- Model type: Multi-modal model (image+text)
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: Based on Idefics3 (see technical summary)
Repository: Docling
Paper: arXiv
Citation:
@misc{nassar2025smoldoclingultracompactvisionlanguagemodel,
  title={SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion},
  author={Ahmed Nassar and Andres Marafioti and Matteo Omenetti and Maksym Lysak and Nikolaos Livathinos and Christoph Auer and Lucas Morin and Rafael Teixeira de Lima and Yusik Kim and A. Said Gurbuz and Michele Dolfi and Miquel Farré and Peter W. J. Staar},
  year={2025},
  eprint={2503.11576},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.11576},
}
Demo: [Coming soon]