Sharing MERIT Dataset: Support for Key Info Retrieval & Table Extraction

#50
by de-Rodrigo - opened

Congratulations to the DS4SD team on this great work! Really exciting to see!

I noticed in the discussion section some WIP on key information retrieval, as well as a few challenges with tabular data extraction.

I just wanted to share the MERIT Dataset, which might be helpful for both cases. I’d recommend trying the "en-digital-seq" and "en-render-seq" versions, as the other options were designed for very specific research questions.

πŸ“‚ https://huggingface.co/datasets/de-Rodrigo/merit
πŸ“„ https://arxiv.org/pdf/2409.00447

Docling org

@de-Rodrigo Thank you for the dataset link! A few questions:

  1. Do you have layout + table + reading-order information? (I see not all subsets have clean images.)
  2. Do you have specific prompts for data extraction according to schemas?

happy to set up a quick call!

@PeterWJStaar, great! (Sorry, the subset names might not be clear.) Let me clarify their structure 🤗: language-imageCreationProcess-task (a small sketch follows the list below).

  • languages:
    English (en)
    Spanish (es)

  • imageCreationProcesses:
    digital: we create a Word file and convert it to PDF and JPEG/PNG
    digital-xxx-degradation: we take the digital version (PNG) and degrade it in a certain way
    render: we take the digital version (PNG), create a 3D model of the paper (plus lights, textures, etc.), and render it

  • tasks:
    seq: sequence generation (key information extraction)
    token-class: token classification
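So, putting the scheme together (just an illustration of the pattern; not every combination is published as a config):

```python
from itertools import product

# Sketch of the naming scheme: language-imageCreationProcess-task.
# This only enumerates the pattern; some combinations may not exist.
languages = ["en", "es"]
processes = ["digital", "render"]  # plus the digital-xxx-degradation variants
tasks = ["seq", "token-class"]

for lang, process, task in product(languages, processes, tasks):
    print(f"{lang}-{process}-{task}")  # e.g. en-digital-seq, es-render-token-class
```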

Let me answer your questions:

  1. If you are interested in layout + table + reading-order information, I suggest using "en-digital-token-class" (https://huggingface.co/datasets/de-Rodrigo/merit/viewer/en-digital-token-class):
    English, digital samples with layout info (clean images with no noise; if you notice otherwise, please let me know :) ).
    The token classification label contains bboxes for every segment and every word in the segment (reading order is implicit here). Every segment is labelled as one of the following:
  • subject-name. For instance: History of Philosophy -> label: philosophy_year_12
  • subject-grade. For instance: 89 -> label: philosophy_year_12_answer
  • other: not relevant to the subject/grade pair. For instance: Palo Alto, 11th of November of 2023 -> label: other
  • academic year. For instance: Year 12 -> label: year_12 (this might not be relevant for your case)

With this information you can get the layout (bboxes), the table content (by filtering out "other" and "academic-year"), and the reading order (again, from the bboxes); see the sketch below. Rendered subsets might pose more challenging scenarios, including occlusions.
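Roughly like this (just a sketch; I use placeholder field names "segments", "bbox", "label" here rather than the exact feature names, so adapt them to what you see in the viewer):

```python
from datasets import load_dataset

# Sketch: derive layout, table content, and reading order from the
# token-classification labels. Field names ("segments", "bbox", "label")
# are placeholders; adapt them to the actual dataset features.
ds = load_dataset("de-Rodrigo/merit", name="en-digital-token-class", split="train")

sample = ds[0]
segments = sample["segments"]  # placeholder: list of {"bbox", "label", "text"}

# Layout: one bbox per segment.
layout = [seg["bbox"] for seg in segments]

# Table content: keep only the subject/grade segments.
table = [seg for seg in segments if seg["label"] not in ("other", "academic-year")]

# Reading order: implicit in the geometry; sort top-to-bottom, left-to-right,
# assuming bboxes are [x0, y0, x1, y1].
reading_order = sorted(segments, key=lambda s: (s["bbox"][1], s["bbox"][0]))
```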

  2. No; since our approach is synthetic dataset generation using digital tools (Office suite, Blender, etc.), we create the samples and labels at the same time, so we do not have prompts linked to the samples.

PS: @PeterWJStaar, let me know if I can make your life easier by better structuring the sample labels. I just followed the standard used by the FUNSD Dataset 🤗 (also useful to get a grasp of what we offer: https://guillaumejaume.github.io/FUNSD/).
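For reference, a FUNSD-style annotation looks roughly like this (illustrative values only, reusing the subject-name example from above; see the FUNSD page for the exact schema):

```python
# Illustrative FUNSD-style annotation (values made up for the example above).
# Each "form" entry is a segment with its label, bbox, and word-level boxes.
annotation = {
    "form": [
        {
            "id": 0,
            "label": "philosophy_year_12",
            "text": "History of Philosophy",
            "box": [92, 310, 318, 334],  # segment bbox: [x0, y0, x1, y1]
            "words": [
                {"text": "History", "box": [92, 310, 160, 334]},
                {"text": "of", "box": [166, 310, 186, 334]},
                {"text": "Philosophy", "box": [192, 310, 318, 334]},
            ],
            "linking": [],  # segment-to-segment links (e.g. question/answer pairs)
        }
    ]
}
```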
