Sharing MERIT Dataset: Support for Key Info Retrieval & Table Extraction

#50
by de-Rodrigo - opened

Congratulations to the DS4SD team on this great work! Really exciting to see!

I noticed in the discussion section some WIP on key information retrieval, as well as a few challenges with tabular data extraction.

I just wanted to share the MERIT Dataset, which might be helpful for both cases. I’d recommend trying the "en-digital-seq" and "en-render-seq" versions, as the other options were designed for very specific research questions.

πŸ“‚ https://huggingface.co/datasets/de-Rodrigo/merit
πŸ“„ https://arxiv.org/pdf/2409.00447

Docling org

@de-Rodrigo Thank you for the dataset link! A few questions:

  1. Do you have layout + table + reading-order information? (I see not all subsets have clean images.)
  2. Do you have specific prompts for data extraction according to schemas?

happy to set up a quick call!

@PeterWJStaar, great! (Sorry, the subset names might not be clear.) Let me clarify their structure 🤗: language-imageCreationProcess-task (a small sketch follows the list below).

  • languages:
    English (en)
    Spanish (es)

  • imageCreationProcesses:
    digital: we create a Word file and convert it to PDF and JPEG/PNG
    digital-xxx-degradation: we take the digital version (PNG) and degrade it in a certain way
    render: we take the digital version (PNG), create a 3D model of the paper (plus lights, textures, etc.), and render it

  • tasks:
    seq: sequence generation (key information extraction)
    token-class: token classification
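So, putting the scheme together (just an illustration of the pattern; not every combination is published as a config):

```python
from itertools import product

# Sketch of the naming scheme: language-imageCreationProcess-task.
# This only enumerates the pattern; some combinations may not exist.
languages = ["en", "es"]
processes = ["digital", "render"]  # plus the digital-xxx-degradation variants
tasks = ["seq", "token-class"]

for lang, process, task in product(languages, processes, tasks):
    print(f"{lang}-{process}-{task}")  # e.g. en-digital-seq, es-render-token-class
```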

Let me answer your questions:

  1. If you are interested in layout + table + reading-order information, I suggest using "en-digital-token-class" (https://huggingface.co/datasets/de-Rodrigo/merit/viewer/en-digital-token-class):
    English, digital samples with layout info (clean images with no noise; if you notice otherwise, please let me know :) ).
    The token classification label contains bboxes for every segment and every word in the segment (reading order is implicit here). Every segment is labelled as one of the following:
  • subject-name. For instance: History of Philosophy -> label: philosophy_year_12
  • subject-grade. For instance: 89 -> label: philosophy_year_12_answer
  • other: not relevant to the subject/grade pair. For instance: Palo Alto, 11th of November of 2023 -> label: other
  • academic year. For instance: Year 12 -> label: year_12 (this might not be relevant for your case)

With this information you can get the layout (bboxes), the table content (by filtering out "other" and "academic-year"), and the reading order (again, from the bboxes); see the sketch below. Rendered subsets might pose more challenging scenarios, including occlusions.
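Roughly like this (just a sketch; I use placeholder field names "segments", "bbox", "label" here rather than the exact feature names, so adapt them to what you see in the viewer):

```python
from datasets import load_dataset

# Sketch: derive layout, table content, and reading order from the
# token-classification labels. Field names ("segments", "bbox", "label")
# are placeholders; adapt them to the actual dataset features.
ds = load_dataset("de-Rodrigo/merit", name="en-digital-token-class", split="train")

sample = ds[0]
segments = sample["segments"]  # placeholder: list of {"bbox", "label", "text"}

# Layout: one bbox per segment.
layout = [seg["bbox"] for seg in segments]

# Table content: keep only the subject/grade segments.
table = [seg for seg in segments if seg["label"] not in ("other", "academic-year")]

# Reading order: implicit in the geometry; sort top-to-bottom, left-to-right,
# assuming bboxes are [x0, y0, x1, y1].
reading_order = sorted(segments, key=lambda s: (s["bbox"][1], s["bbox"][0]))
```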

  2. No; since our approach is synthetic dataset generation using digital tools (Office suite, Blender, etc.), we create the samples and labels at the same time, so we do not have prompts linked to the samples.

PS: @PeterWJStaar, let me know if I can make your life easier by better structuring the sample labels. I just followed the standard used by the FUNSD Dataset 🤗 (also useful to get a grasp of what we offer: https://guillaumejaume.github.io/FUNSD/).
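For reference, a FUNSD-style annotation looks roughly like this (illustrative values only, reusing the subject-name example from above; see the FUNSD page for the exact schema):

```python
# Illustrative FUNSD-style annotation (values made up for the example above).
# Each "form" entry is a segment with its label, bbox, and word-level boxes.
annotation = {
    "form": [
        {
            "id": 0,
            "label": "philosophy_year_12",
            "text": "History of Philosophy",
            "box": [92, 310, 318, 334],  # segment bbox: [x0, y0, x1, y1]
            "words": [
                {"text": "History", "box": [92, 310, 160, 334]},
                {"text": "of", "box": [166, 310, 186, 334]},
                {"text": "Philosophy", "box": [192, 310, 318, 334]},
            ],
            "linking": [],  # segment-to-segment links (e.g. question/answer pairs)
        }
    ]
}
```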
