ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval
Why a new benchmark?
Since the release of the original ViDoRe Benchmark, which evaluates visual models on document retrieval tasks, visual retrieval models have advanced significantly! While the original ColPali model reported an average score of 81.3 nDCG@5, current SOTA models on the leaderboard surpass an nDCG@5 of 90, with some tasks becoming “too easy” to yield a meaningful signal!
With the benchmark approaching saturation for SOTA models, there is limited room to truly measure improvements and understand model capabilities in realistic scenarios. To continue pushing the boundaries of visual retrieval, it became essential to introduce a new benchmark designed specifically to challenge these advanced models: ViDoRe Benchmark V2.
Motivating the Creation of ViDoRe Benchmark V2
In developing ViDoRe Benchmark V2, our main goal was to create a benchmark reflective of real-world retrieval challenges—difficult, diverse, and meaningful. Current benchmarks exhibit limitations that prevent them from accurately reflecting real user behavior and complex retrieval scenarios. We identified three critical issues in existing benchmarks:
- Extractive Nature of Queries: Current benchmarks typically rely on extractive queries, which create unrealistic retrieval contexts since real users rarely formulate queries from exact phrases in documents.
- Single-Page Query Bias: Many benchmarks overly emphasize retrieval from single-page contexts, neglecting the complex multi-document or cross-document queries common in real-world applications.
- Challenges in Synthetic Query Generation: Purely synthetic benchmarks, while appealing in theory, are difficult to implement effectively without extensive manual oversight. They often produce outliers, such as irrelevant or trivial queries, making human filtering essential yet costly.
Design Decisions and Techniques Used
To address these challenges and create a robust, realistic benchmark, ViDoRe Benchmark V2 includes several innovative features:
- Blind Contextual Querying: In practice, users rarely know the exact content of the corpus they are querying. To reduce the extractive bias widespread in most synthetic query-document datasets (which are often created with full knowledge of the document content), we provided the query annotator models with only limited information about each document (summaries, metadata, etc.) and filtered out the many irrelevant queries that resulted, better reproducing real-world user interactions with the corpus.
- Long and Cross-Document Queries: Unlike traditional benchmarks, ViDoRe Benchmark V2 emphasizes long-form and cross-document queries, closely mirroring real-world retrieval situations. Multiple datasets specifically focus on scenarios involving comprehensive documents or multi-document retrieval tasks.
- Hybrid Synthetic and Human-in-the-Loop Creation: Recognizing the limitations of synthetic query generation alone, we adopted a hybrid approach: generating queries synthetically and extensively refining them through human review. This process, though intensive, ensured significantly higher query quality and dataset reliability. A conceptual sketch of this generation-and-filtering pipeline is shown after this list.
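To make the idea concrete, here is a minimal conceptual sketch of blind query generation followed by filtering. It is not the actual ViDoRe generation code: the `generate` and `is_relevant` callables are hypothetical stand-ins for whichever LLM client and review step are used.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DocumentContext:
    # Only limited, "blind" context is exposed to the query generator:
    # a short summary and coarse metadata, never the full page content.
    title: str
    summary: str
    metadata: dict = field(default_factory=dict)


def generate_blind_queries(ctx: DocumentContext, generate: Callable[[str], str]) -> List[str]:
    # Ask a generator model for user-like questions based on limited context only.
    prompt = (
        "You are a user searching a document collection. Using only the information "
        "below, write 3 natural questions you might ask.\n"
        f"Title: {ctx.title}\nSummary: {ctx.summary}\nMetadata: {ctx.metadata}"
    )
    return [line.strip("-• ").strip() for line in generate(prompt).splitlines() if line.strip()]


def filter_queries(queries: List[str], is_relevant: Callable[[str], bool]) -> List[str]:
    # Human-in-the-loop / automatic filtering step: drop trivial or irrelevant queries.
    return [q for q in queries if is_relevant(q)]
```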
Dataset Selection for ViDoRe Benchmark V2
The selected datasets for ViDoRe Benchmark V2 are diverse, publicly available, and challenging. Each dataset presents distinct visual complexity and is suitable for realistic retrieval tasks, including multilingual versions with queries translated into French, English, Spanish, and German. This multilingual approach further extends the applicability and challenge level of the benchmark.
Dataset Name | Original Version | Multilingual Version | Original Doc Lang | Query Lang | # Docs | # Queries | # Pages | # Qrels | Avg. Pages/Query | Comments |
---|---|---|---|---|---|---|---|---|---|---|
Axa Terms of Service | vidore/synthetic_axa_filtered_v1.0 | vidore/synthetic_axa_filtered_v1.0_multilingual | French | French | 4 | 18 | 260 | 86 | 4.7 | Small but challenging, multi-document |
MIT Tissue Interaction | vidore/synthetic_mit_biomedical_tissue_interactions_unfiltered | vidore/synthetic_mit_biomedical_tissue_interactions_unfiltered_multilingual | English | English | 27 | 160 | 1016 | 515 | 3.2 | Largest dataset, most extractive |
World Economic Reports | vidore/synthetic_economics_macro_economy_2024_filtered_v1.0 | vidore/synthetic_economics_macro_economy_2024_filtered_v1.0_multilingual | English | English | 4 | 18 | 260 | 86 | 4.7 | Cross-document queries, high complexity |
ESG Reports | vidore/synthetic_rse_restaurant_filtered_v1.0 | vidore/synthetic_rse_restaurant_filtered_v1.0_multilingual | English | French | 30 | 57 | 1538 | 222 | 3.9 | Natively cross-lingual, industry-specific |
Evaluating Models
To evaluate models on ViDoRe Benchmark V2, you can use one of the following options:
Option 1: Using the CLI
Here is a CLI example for evaluating a ColPali-type retriever on ViDoRe Benchmark V2. For other retrievers, please refer to this repo.
```bash
vidore-benchmark evaluate-retriever \
    --model-class colpali \
    --model-name vidore/colpali-v1.3 \
    --collection-name vidore/vidore-benchmark-v2-dev-67ae03e3924e85b36e7f53b0 \
    --dataset-format beir \
    --split test
```
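Note that the `vidore-benchmark` CLI comes from the vidore-benchmark Python package; install it first (e.g. with `pip install vidore-benchmark`), along with whatever dependencies your chosen retriever requires, before running the command above.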
Option 2: Creating a custom retriever
Detailed instructions on how to create and evaluate a custom retriever are available here.
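As a rough illustration only (the authoritative interface is defined in the vidore-benchmark repository; the class and method names below simply mirror the usual embed-queries / embed-passages / score pattern and are not guaranteed to match the library's exact API), a dense single-vector retriever might look like this:

```python
from typing import Callable, List

import torch
from PIL import Image


class MyCustomRetriever:
    """Hypothetical custom retriever wrapping a dense bi-encoder."""

    def __init__(
        self,
        embed_texts: Callable[[List[str]], torch.Tensor],           # -> (n_queries, dim)
        embed_images: Callable[[List[Image.Image]], torch.Tensor],  # -> (n_pages, dim)
    ):
        self.embed_texts = embed_texts
        self.embed_images = embed_images

    def forward_queries(self, queries: List[str], batch_size: int = 8) -> torch.Tensor:
        # Embed text queries in batches to keep memory bounded.
        batches = [
            self.embed_texts(queries[i : i + batch_size])
            for i in range(0, len(queries), batch_size)
        ]
        return torch.cat(batches)

    def forward_passages(self, pages: List[Image.Image], batch_size: int = 8) -> torch.Tensor:
        # Embed document page images in batches.
        batches = [
            self.embed_images(pages[i : i + batch_size])
            for i in range(0, len(pages), batch_size)
        ]
        return torch.cat(batches)

    def get_scores(self, query_embeddings: torch.Tensor, passage_embeddings: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between every query and every page: (n_queries, n_pages).
        q = torch.nn.functional.normalize(query_embeddings, dim=-1)
        p = torch.nn.functional.normalize(passage_embeddings, dim=-1)
        return q @ p.T
```

Late-interaction models such as ColPali would instead keep one embedding per token/patch and replace the cosine similarity in `get_scores` with a MaxSim-style aggregation.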
Results
As an example, here are some nDCG@5 results of visual retrieval models on ViDoRe Benchmark V2:
Dataset | voyageai | metrics-colqwen2.5-3B | colsmolvlm-v0.1 | colqwen2-v1.0 | colpali-v1.2 | dse-qwen2-2b-mrl-v1 | colSmol-256M | colpali-v1.3 | colqwen2.5-v0.2 | dse-llamaindex | tsystems-colqwen2.5-3b-multilingual-v1.0 | gme-qwen2-VL-7B | visrag-ret | colSmol-500M | colpali-v1.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
restaurant_esg_reports_beir | 0.561 | 0.645 | 0.624 | 0.622 | 0.321 | 0.614 | 0.460 | 0.511 | 0.684 | 0.631 | 0.721 | 0.658 | 0.537 | 0.522 | 0.465 |
synthetic_axa | 0.641 | 0.579 | 0.555 | 0.651 | 0.560 | 0.655 | 0.504 | 0.598 | 0.603 | 0.688 | 0.693 | 0.607 | 0.505 | 0.587 | 0.547 |
synthetic_axa_multilingual | 0.595 | 0.557 | 0.432 | 0.572 | 0.458 | 0.563 | 0.341 | 0.501 | 0.532 | 0.610 | 0.600 | 0.554 | 0.452 | 0.377 | 0.484 |
synthetic_economics_macro_economy_2024 | 0.588 | 0.566 | 0.609 | 0.615 | 0.531 | 0.615 | 0.534 | 0.516 | 0.598 | 0.612 | 0.548 | 0.629 | 0.596 | 0.503 | 0.567 |
synthetic_mit_biomedical_tissue_interactions | 0.564 | 0.639 | 0.581 | 0.618 | 0.585 | 0.592 | 0.532 | 0.597 | 0.636 | 0.606 | 0.653 | 0.640 | 0.548 | 0.543 | 0.564 |
synthetic_mit_biomedical_tissue_interactions_multilingual | 0.515 | 0.569 | 0.505 | 0.565 | 0.557 | 0.551 | 0.340 | 0.565 | 0.611 | 0.569 | 0.617 | 0.551 | 0.477 | 0.421 | 0.507 |
synthetic_rse_restaurant | 0.472 | 0.496 | 0.511 | 0.534 | 0.519 | 0.549 | 0.272 | 0.570 | 0.574 | 0.503 | 0.517 | 0.543 | 0.459 | 0.392 | 0.461 |
synthetic_rse_restaurant_multilingual | 0.462 | 0.492 | 0.476 | 0.542 | 0.540 | 0.557 | 0.313 | 0.557 | 0.574 | 0.512 | 0.533 | 0.567 | 0.464 | 0.391 | 0.481 |
synthetics_economics_macro_economy_2024_multilingual | 0.550 | 0.535 | 0.474 | 0.532 | 0.479 | 0.528 | 0.273 | 0.499 | 0.565 | 0.528 | 0.512 | 0.562 | 0.487 | 0.361 | 0.438 |
Average | 0.550 | 0.564 | 0.530 | 0.583 | 0.505 | 0.580 | 0.397 | 0.546 | 0.597 | 0.584 | 0.599 | 0.590 | 0.503 | 0.455 | 0.502 |
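For reference, nDCG@5 rewards rankings that place relevant pages near the top of the first five results. The benchmark relies on its own evaluation tooling; the snippet below is only the textbook formulation, with binary relevance judgments used as an illustrative assumption.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int = 5) -> float:
    # Discounted cumulative gain: each gain is discounted by log2 of its (1-indexed) rank.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], k: int = 5) -> float:
    # Normalise by the DCG of the ideal (perfectly sorted) ranking.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance of the ranked pages for one query (1 = relevant, 0 = not; truncated to 5 here).
print(ndcg_at_k([1, 0, 1, 0, 0], k=5))  # ~0.92
```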
Notes on the benchmark:
We adapted the evaluation procedure for the voyageAI API, resulting in slightly lower performance on the ViDoRe benchmark v1 compared to the values reported by voyageAI. This discrepancy likely arises from our resizing of input images to a maximum image height of 1200 pixels to facilitate efficient benchmarking, a preprocessing step presumably not applied in voyageAI's original benchmarking setup.
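As an indication of what this preprocessing looks like, here is a minimal sketch assuming Pillow; it is not necessarily the exact code used in the benchmark runs.

```python
from PIL import Image


def resize_max_height(page: Image.Image, max_height: int = 1200) -> Image.Image:
    # Cap page images at a maximum height while preserving the aspect ratio.
    if page.height <= max_height:
        return page
    scale = max_height / page.height
    return page.resize((round(page.width * scale), max_height), Image.Resampling.LANCZOS)
```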
The best models so far seem to be based on Qwen2.5. Be careful, however: these models do not fall under an open license.
Insights on the Results
Insights from the ViDoRe v2 Benchmark:
- The ViDoRe v2 benchmark maintains a strong correlation with the original ViDoRe benchmark, as evidenced by consistent model rankings across both versions.
- ViDoRe v2 leaves substantial room for future improvements, contrasting with ViDoRe v1, which was approaching performance saturation (scores exceeding 90%).
- Certain models exhibit signs of slight overfitting to the training distribution, resulting in reduced generalization to novel data (e.g., vidore/colSmol-256M, vidore/colSmol-500M, Metric-AI/ColQwen2.5-3b-multilingual-v1.0). These models perform worse on V2 than their V1 scores would lead one to believe.
- The multilingual splits in ViDoRe v2 provide a more accurate assessment of the non-English capabilities of visual retriever models. We observe a significant performance gap between models trained exclusively on English data with an English-only VLM and those that are not.
- Larger model scale is beneficial; notably, gme-qwen2-VL-7B achieves strong overall performance but incurs significant computational cost and inference latency. Conversely, while impressive for their size, models under 1B parameters tend to lag behind, especially on previously unseen data distributions.
- We tend to see better separation between model performances on the human-labeled dataset (esg_human), indicating that it is of slightly higher quality than the synthetic datasets and provides a more discriminating signal.
Our goal is for ViDoRe V2 to become a dynamic, "living benchmark" that regularly grows with new tasks and datasets. To achieve this, we welcome and encourage the community to contribute datasets and evaluation tasks. This collaborative approach helps ensure that the benchmark stays relevant, useful, and reflective of real-world challenges.
Acknowledgements
For professionals interested in deeper discussions and projects around Visual RAG, ColPali, or agentic systems, don't hesitate to reach out to [email protected] and connect with our team of experts at Illuin Technology, who can help accelerate your AI efforts!
We look forward to your feedback and contributions! If you have any sets of documents and associated queries that you would find interesting or challenging for a retrieval task, feel free to shoot us an email!