ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval
Why a new benchmark?
Since the release of the original ViDoRe Benchmark, which evaluates visual models on document retrieval tasks, visual retrieval models have advanced significantly! While the original ColPali model reported an average score of 81.3 nDCG@5, current SOTA models on the leaderboard surpass an nDCG@5 of 90, with some tasks becoming “too easy” to yield a meaningful signal!
With the benchmark approaching saturation for SOTA models, there is limited room to truly measure improvements and understand model capabilities in realistic scenarios. To continue pushing the boundaries of visual retrieval, it became essential to introduce a new benchmark designed specifically to challenge these advanced models: ViDoRe Benchmark V2.
Motivating the Creation of ViDoRe Benchmark V2
In developing ViDoRe Benchmark V2, our main goal was to create a benchmark reflective of real-world retrieval challenges—difficult, diverse, and meaningful. Current benchmarks exhibit limitations that prevent them from accurately reflecting real user behavior and complex retrieval scenarios. We identified three critical issues in existing benchmarks:
- Extractive Nature of Queries: Current benchmarks typically rely on extractive queries, which create unrealistic retrieval contexts since real users rarely formulate queries from exact phrases in documents.
- Single-Page Query Bias: Many benchmarks overly emphasize retrieval from single-page contexts, neglecting the complex multi-document or cross-document queries common in real-world applications.
- Challenges in Synthetic Query Generation: Purely synthetic benchmarks, while appealing in theory, are difficult to implement effectively without extensive manual oversight. They often produce outliers, such as irrelevant or trivial queries, making human filtering essential yet costly.
Design Decisions and Techniques Used
To address these challenges and create a robust, realistic benchmark, ViDoRe Benchmark V2 includes several innovative features:
- Blind Contextual Querying: In practice, users rarely know the exact content of the corpus they are querying. To reduce the extractive bias widespread in most synthetic query-document datasets (which are often created with full knowledge of the document content), we provided the query annotator models with only limited information about each document (summaries, metadata, etc.) and filtered out the many irrelevant queries that resulted, better reproducing real-world user interactions with the corpus.
- Long and Cross-Document Queries: Unlike traditional benchmarks, ViDoRe Benchmark V2 emphasizes long-form and cross-document queries, closely mirroring real-world retrieval situations. Multiple datasets specifically focus on scenarios involving comprehensive documents or multi-document retrieval tasks.
- Hybrid Synthetic and Human-in-the-Loop Creation: Recognizing the limitations of synthetic query generation alone, we adopted a hybrid approach: generating queries synthetically and extensively refining them through human review. This process, though intensive, ensured significantly higher query quality and dataset reliability. A conceptual sketch of this generation-and-filtering pipeline is shown after this list.
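To make the idea concrete, here is a minimal conceptual sketch of blind query generation followed by filtering. It is not the actual ViDoRe generation code: the `generate` and `is_relevant` callables are hypothetical stand-ins for whichever LLM client and review step are used.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class DocumentContext:
    # Only limited, "blind" context is exposed to the query generator:
    # a short summary and coarse metadata, never the full page content.
    title: str
    summary: str
    metadata: dict = field(default_factory=dict)


def generate_blind_queries(ctx: DocumentContext, generate: Callable[[str], str]) -> List[str]:
    # Ask a generator model for user-like questions based on limited context only.
    prompt = (
        "You are a user searching a document collection. Using only the information "
        "below, write 3 natural questions you might ask.\n"
        f"Title: {ctx.title}\nSummary: {ctx.summary}\nMetadata: {ctx.metadata}"
    )
    return [line.strip("-• ").strip() for line in generate(prompt).splitlines() if line.strip()]


def filter_queries(queries: List[str], is_relevant: Callable[[str], bool]) -> List[str]:
    # Human-in-the-loop / automatic filtering step: drop trivial or irrelevant queries.
    return [q for q in queries if is_relevant(q)]
```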
Dataset Selection for ViDoRe Benchmark V2
The selected datasets for ViDoRe Benchmark V2 are diverse, publicly available, and challenging. Each dataset presents distinct visual complexity and is suitable for realistic retrieval tasks, including multilingual versions with queries translated into French, English, Spanish, and German. This multilingual approach further extends the applicability and challenge level of the benchmark.
Dataset Name | Original Version | Multilingual Version | Original Doc Lang | Query Lang | # Docs | # Queries | # Pages | # Qrels | Avg. Pages/Query | Comments |
---|---|---|---|---|---|---|---|---|---|---|
Axa Terms of Service | vidore/synthetic_axa_filtered_v1.0 | vidore/synthetic_axa_filtered_v1.0_multilingual | French | French | 4 | 18 | 260 | 86 | 4.7 | Small but challenging, multi-document |
MIT Tissue Interaction | vidore/synthetic_mit_biomedical_tissue_interactions_unfiltered | vidore/synthetic_mit_biomedical_tissue_interactions_unfiltered_multilingual | English | English | 27 | 160 | 1016 | 515 | 3.2 | Largest dataset, most extractive |
World Economic Reports | vidore/synthetic_economics_macro_economy_2024_filtered_v1.0 | vidore/synthetic_economics_macro_economy_2024_filtered_v1.0_multilingual | English | English | 4 | 18 | 260 | 86 | 4.7 | Cross-document queries, high complexity |
ESG Reports | vidore/synthetic_rse_restaurant_filtered_v1.0 | vidore/synthetic_rse_restaurant_filtered_v1.0_multilingual | English | French | 30 | 57 | 1538 | 222 | 3.9 | Natively cross-lingual, industry-specific |
Evaluating Models
To evaluate models on ViDoRe Benchmark V2, you can use one of the following options:
Option 1: Using the CLI
Here is a CLI example for evaluating a ColPali-type retriever on ViDoRe Benchmark V2. For other retrievers, please refer to this repo.
```bash
vidore-benchmark evaluate-retriever \
    --model-class colpali \
    --model-name vidore/colpali-v1.3 \
    --collection-name vidore/vidore-benchmark-v2-dev-67ae03e3924e85b36e7f53b0 \
    --dataset-format beir \
    --split test
```
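Note that the `vidore-benchmark` CLI comes from the vidore-benchmark Python package; install it first (e.g. with `pip install vidore-benchmark`), along with whatever dependencies your chosen retriever requires, before running the command above.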
Option 2: Creating a custom retriever
Detailed instructions on how to create and evaluate a custom retriever are available here.
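As a rough illustration only (the authoritative interface is defined in the vidore-benchmark repository; the class and method names below simply mirror the usual embed-queries / embed-passages / score pattern and are not guaranteed to match the library's exact API), a dense single-vector retriever might look like this:

```python
from typing import Callable, List

import torch
from PIL import Image


class MyCustomRetriever:
    """Hypothetical custom retriever wrapping a dense bi-encoder."""

    def __init__(
        self,
        embed_texts: Callable[[List[str]], torch.Tensor],           # -> (n_queries, dim)
        embed_images: Callable[[List[Image.Image]], torch.Tensor],  # -> (n_pages, dim)
    ):
        self.embed_texts = embed_texts
        self.embed_images = embed_images

    def forward_queries(self, queries: List[str], batch_size: int = 8) -> torch.Tensor:
        # Embed text queries in batches to keep memory bounded.
        batches = [
            self.embed_texts(queries[i : i + batch_size])
            for i in range(0, len(queries), batch_size)
        ]
        return torch.cat(batches)

    def forward_passages(self, pages: List[Image.Image], batch_size: int = 8) -> torch.Tensor:
        # Embed document page images in batches.
        batches = [
            self.embed_images(pages[i : i + batch_size])
            for i in range(0, len(pages), batch_size)
        ]
        return torch.cat(batches)

    def get_scores(self, query_embeddings: torch.Tensor, passage_embeddings: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between every query and every page: (n_queries, n_pages).
        q = torch.nn.functional.normalize(query_embeddings, dim=-1)
        p = torch.nn.functional.normalize(passage_embeddings, dim=-1)
        return q @ p.T
```

Late-interaction models such as ColPali would instead keep one embedding per token/patch and replace the cosine similarity in `get_scores` with a MaxSim-style aggregation.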
Results
As an example, here are some nDCG@5 results of visual retrieval models on ViDoRe Benchmark V2:
Dataset | voyageai | metrics-colqwen2.5-3B | colsmolvlm-v0.1 | colqwen2-v1.0 | colpali-v1.2 | dse-qwen2-2b-mrl-v1 | colSmol-256M | colpali-v1.3 | colqwen2.5-v0.2 | dse-llamaindex | tsystems-colqwen2.5-3b-multilingual-v1.0 | gme-qwen2-VL-7B | visrag-ret | colSmol-500M | colpali-v1.1 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
restaurant_esg_reports_beir | 0.561 | 0.645 | 0.624 | 0.622 | 0.321 | 0.614 | 0.460 | 0.511 | 0.684 | 0.631 | 0.721 | 0.658 | 0.537 | 0.522 | 0.465 |
synthetic_axa | 0.641 | 0.579 | 0.555 | 0.651 | 0.560 | 0.655 | 0.504 | 0.598 | 0.603 | 0.688 | 0.693 | 0.607 | 0.505 | 0.587 | 0.547 |
synthetic_axa_multilingual | 0.595 | 0.557 | 0.432 | 0.572 | 0.458 | 0.563 | 0.341 | 0.501 | 0.532 | 0.610 | 0.600 | 0.554 | 0.452 | 0.377 | 0.484 |
synthetic_economics_macro_economy_2024 | 0.588 | 0.566 | 0.609 | 0.615 | 0.531 | 0.615 | 0.534 | 0.516 | 0.598 | 0.612 | 0.548 | 0.629 | 0.596 | 0.503 | 0.567 |
synthetic_mit_biomedical_tissue_interactions | 0.564 | 0.639 | 0.581 | 0.618 | 0.585 | 0.592 | 0.532 | 0.597 | 0.636 | 0.606 | 0.653 | 0.640 | 0.548 | 0.543 | 0.564 |
synthetic_mit_biomedical_tissue_interactions_multilingual | 0.515 | 0.569 | 0.505 | 0.565 | 0.557 | 0.551 | 0.340 | 0.565 | 0.611 | 0.569 | 0.617 | 0.551 | 0.477 | 0.421 | 0.507 |
synthetic_rse_restaurant | 0.472 | 0.496 | 0.511 | 0.534 | 0.519 | 0.549 | 0.272 | 0.570 | 0.574 | 0.503 | 0.517 | 0.543 | 0.459 | 0.392 | 0.461 |
synthetic_rse_restaurant_multilingual | 0.462 | 0.492 | 0.476 | 0.542 | 0.540 | 0.557 | 0.313 | 0.557 | 0.574 | 0.512 | 0.533 | 0.567 | 0.464 | 0.391 | 0.481 |
synthetics_economics_macro_economy_2024_multilingual | 0.550 | 0.535 | 0.474 | 0.532 | 0.479 | 0.528 | 0.273 | 0.499 | 0.565 | 0.528 | 0.512 | 0.562 | 0.487 | 0.361 | 0.438 |
Average | 0.550 | 0.564 | 0.530 | 0.583 | 0.505 | 0.580 | 0.397 | 0.546 | 0.597 | 0.584 | 0.599 | 0.590 | 0.503 | 0.455 | 0.502 |
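For reference, nDCG@5 rewards rankings that place relevant pages near the top of the first five results. The benchmark relies on its own evaluation tooling; the snippet below is only the textbook formulation, with binary relevance judgments used as an illustrative assumption.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int = 5) -> float:
    # Discounted cumulative gain: each gain is discounted by log2 of its (1-indexed) rank.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], k: int = 5) -> float:
    # Normalise by the DCG of the ideal (perfectly sorted) ranking.
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Relevance of the ranked pages for one query (1 = relevant, 0 = not; truncated to 5 here).
print(ndcg_at_k([1, 0, 1, 0, 0], k=5))  # ~0.92
```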
Notes on the benchmark:
We adapted the evaluation procedure for the voyageAI API, resulting in slightly lower performance on the ViDoRe benchmark v1 compared to the values reported by voyageAI. This discrepancy likely arises from our resizing of input images to a maximum image height of 1200 pixels to facilitate efficient benchmarking, a preprocessing step presumably not applied in voyageAI's original benchmarking setup.
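As an indication of what this preprocessing looks like, here is a minimal sketch assuming Pillow; it is not necessarily the exact code used in the benchmark runs.

```python
from PIL import Image


def resize_max_height(page: Image.Image, max_height: int = 1200) -> Image.Image:
    # Cap page images at a maximum height while preserving the aspect ratio.
    if page.height <= max_height:
        return page
    scale = max_height / page.height
    return page.resize((round(page.width * scale), max_height), Image.Resampling.LANCZOS)
```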
The best models so far seem to be based on Qwen2.5. Be careful, however: these models do not fall under an open license.
Insights on the Results
Insights from the ViDoRe v2 Benchmark:
- The ViDoRe v2 benchmark maintains a strong correlation with the original ViDoRe benchmark, as evidenced by consistent model rankings across both versions.
- ViDoRe v2 leaves substantial room for future improvements, contrasting with ViDoRe v1, which was approaching performance saturation (scores exceeding 90%).
- Certain models exhibit signs of slight overfitting to the training distribution, resulting in reduced generalization to novel data (e.g., vidore/colSmol-256M, vidore/colSmol-500M, Metric-AI/ColQwen2.5-3b-multilingual-v1.0). These models perform worse on V2 than their V1 scores would lead one to believe.
- The multilingual splits in ViDoRe v2 provide a more accurate assessment of the non-English capabilities of visual retriever models. We observe a significant performance gap between models trained exclusively on English data with an English-only VLM and those that are not.
- Larger model scale is beneficial; notably, gme-qwen2-VL-7B achieves strong overall performance but incurs significant computational cost and inference latency. Conversely, while impressive for their size, models under 1B parameters tend to lag behind, especially on previously unseen data distributions.
- We tend to see better separation between model performances on the human-labeled dataset (esg_human), indicating that it is of slightly higher quality than the synthetic datasets and provides a more discriminating signal.
Our goal is for ViDoRe V2 to become a dynamic, "living benchmark" that regularly grows with new tasks and datasets. To achieve this, we welcome and encourage the community to contribute datasets and evaluation tasks. This collaborative approach helps ensure that the benchmark stays relevant, useful, and reflective of real-world challenges.
Acknowledgements
For professionals interested in deeper discussions and projects around Visual RAG, ColPali, or agentic systems, don't hesitate to reach out to [email protected] and connect with our team of experts at Illuin Technology, who can help accelerate your AI efforts!
We look forward to your feedback and contributions! If you have any sets of documents and associated queries that you would find interesting or challenging for a retrieval task, feel free to shoot us an email!