Sharing MERIT Dataset: Support for Key Info Retrieval & Table Extraction
Congratulations to the DS4SD team on this great work! Really exciting to see!
I noticed in the discussion section some WIP on key information retrieval and also a few challenges with tabular data extraction.
I just wanted to share the MERIT Dataset, which might be helpful for both cases. I'd recommend trying the "en-digital-seq" and "en-render-seq" versions, as the other options were designed for very specific research questions.
https://huggingface.co/datasets/de-Rodrigo/merit
https://arxiv.org/pdf/2409.00447
@de-Rodrigo Thank you for the dataset link! A few questions:
- Do you have layout + table + reading-order information? (I see that not all subsets have clean images.)
- Do you have specific prompts for data extraction according to schemas?
Happy to set up a quick call!
@PeterWJStaar, great! (Sorry, the subset names might not be clear.) Let me try to clarify their structure: language-imageCreationProcess-task.
languages:
- English (en)
- Spanish (es)
imageCreationProcesses:
- digital: we create a Word file and convert it to PDF and JPEG/PNG
- digital-xxx-degradation: we take the digital version (PNG) and degrade it in a certain way
- render: we take the digital version (PNG), create a 3D model of the paper (+ lights, textures, etc.) and render it
tasks:
- seq: sequence generation (key information extraction)
- token-class: token classification
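As a rough illustration of the naming scheme above, here is a minimal Python sketch that splits a subset name into its three parts. The names `Subset` and `parse_subset` are hypothetical helpers, not part of the dataset or of any library, and the language and task lists only cover the values mentioned in this thread.

```python
from typing import NamedTuple

# Values taken from the naming scheme described in this thread.
LANGUAGES = {"en": "English", "es": "Spanish"}
# "token-class" is checked first because it contains a hyphen itself.
TASKS = ("token-class", "seq")


class Subset(NamedTuple):
    language: str
    process: str
    task: str


def parse_subset(name: str) -> Subset:
    """Split 'language-imageCreationProcess-task' into its three parts."""
    language, _, rest = name.partition("-")
    if language not in LANGUAGES:
        raise ValueError(f"unknown language code in {name!r}")
    for task in TASKS:
        suffix = "-" + task
        if rest.endswith(suffix):
            process = rest[: -len(suffix)]
            if not process:
                raise ValueError(f"missing image creation process in {name!r}")
            return Subset(language, process, task)
    raise ValueError(f"unknown task in {name!r}")


print(parse_subset("en-digital-seq"))
print(parse_subset("en-digital-token-class"))
```

A subset name that parses cleanly could then be passed as the configuration name when loading the dataset, for example through the `name` argument of `datasets.load_dataset`.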
I'll try to answer your questions:
- If you are interested in layout + table + reading-order information, I suggest using "en-digital-token-class" (https://huggingface.co/datasets/de-Rodrigo/merit/viewer/en-digital-token-class):
English, digital samples and layout info (clean images with no noise; if you notice otherwise, please let me know :) ).
The token classification label contains bboxes for every segment and every word in the segment (reading order is implicit here). Every segment is labelled as one of the following:
- subject-name. For instance: History of Philosophy -> label: philosophy_year_12
- subject-grade. For instance: 89 -> label: philosophy_year_12_answer
- other: not relevant to the subject/grade pair. For instance: Paloalto, 11th of November of 2023 -> label: other
- academic-year. For instance: Year 12 -> label: year_12 (this might not be relevant for your case)
With this information you can get layout (bboxes), table content (by filtering out "other" and "academic-year") and reading order (again from the bboxes). Rendered subsets might pose more challenging scenarios, including occlusions.
- No. Since our approach is synthetic dataset generation using digital tools (Office suite, Blender, etc.), we create the samples and labels at the same time, so we do not have prompts linked to the samples.
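To make the first answer more concrete, here is a minimal Python sketch of how table content and reading order might be recovered from the token classification labels. It assumes a FUNSD-like list of segments with `text`, `label` and `box` ([x0, y0, x1, y1]) fields, that academic-year labels look like `year_12`, and that a grade label is the subject label plus `_answer`, as in the examples above. The helper names and the box coordinates are made up for illustration; the real field names in the subset may differ.

```python
import re

# Assumption: academic-year labels look like "year_12".
ACADEMIC_YEAR = re.compile(r"year_\d+")


def is_table_segment(segment):
    """Keep only subject/grade segments (drop 'other' and academic year)."""
    label = segment["label"]
    return label != "other" and ACADEMIC_YEAR.fullmatch(label) is None


def reading_order(segments):
    """Naive reading order: top-to-bottom, then left-to-right, by box corner."""
    return sorted(segments, key=lambda s: (s["box"][1], s["box"][0]))


def subject_grade_pairs(segments):
    """Pair each subject with its grade via the '<subject label>_answer' label."""
    by_label = {s["label"]: s["text"] for s in segments if is_table_segment(s)}
    return {
        by_label[label]: by_label[label + "_answer"]
        for label in by_label
        if not label.endswith("_answer") and label + "_answer" in by_label
    }


# Hypothetical sample; texts and labels follow the examples in this thread,
# the boxes are invented.
sample = [
    {"text": "89", "label": "philosophy_year_12_answer", "box": [400, 200, 430, 220]},
    {"text": "Paloalto, 11th of November of 2023", "label": "other", "box": [50, 40, 380, 60]},
    {"text": "History of Philosophy", "label": "philosophy_year_12", "box": [50, 200, 250, 220]},
    {"text": "Year 12", "label": "year_12", "box": [50, 120, 120, 140]},
]

print([s["text"] for s in reading_order(sample)])
print(subject_grade_pairs(sample))
```

Sorting on exact coordinates is only a starting point: segments on the same visual line with slightly different y values would need some tolerance, and the rendered subsets would likely need something more robust.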
PS: @PeterWJStaar let me know if I can make your life easier by structuring the sample labels better. I just followed the standard used by the FUNSD dataset (also useful to get a grasp of what we offer: https://guillaumejaume.github.io/FUNSD/)