3 Arabic-Nougat: Fine-Tuning Vision Transformers for Arabic OCR and Markdown Extraction We present Arabic-Nougat, a suite of OCR models for converting Arabic book pages into structured Markdown text. Based on Meta's Nougat architecture, Arabic-Nougat includes three specialized models: arabic-small-nougat, arabic-base-nougat, and arabic-large-nougat. These models are fine-tuned on a synthetic dataset, arabic-img2md, comprising 13.7k pairs of Arabic book pages and their Markdown representations. Key contributions include the Aranizer-PBE-86k tokenizer, designed for efficient tokenization, and the use of torch.bfloat16 precision with Flash Attention 2 for optimized training and inference. Our models achieve state-of-the-art performance, with arabic-large-nougat delivering the highest Markdown Structure Accuracy and the lowest Character Error Rate. Additionally, we release a large-scale dataset containing 1.1 billion Arabic tokens extracted from over 8,500 books using our best-performing model, providing a valuable resource for Arabic OCR research. All models, datasets, and code are open-sourced and available at https://github.com/MohamedAliRashad/arabic-nougat. 1 authors · Nov 19, 2024 2
1 ReaderLM-v2: Small Language Model for HTML to Markdown and JSON We present ReaderLM-v2, a compact 1.5 billion parameter language model designed for efficient web content extraction. Our model processes documents up to 512K tokens, transforming messy HTML into clean Markdown or JSON formats with high accuracy -- making it an ideal tool for grounding large language models. The model's effectiveness results from two key innovations: (1) a three-stage data synthesis pipeline that generates high quality, diverse training data by iteratively drafting, refining, and critiquing web content extraction; and (2) a unified training framework combining continuous pre-training with multi-objective optimization. Intensive evaluation demonstrates that ReaderLM-v2 outperforms GPT-4o-2024-08-06 and other larger models by 15-20\% on carefully curated benchmarks, particularly excelling at documents exceeding 100K tokens, while maintaining significantly lower computational requirements. 5 authors · Mar 2