DocOwl 1.5 is a state-of-the-art document understanding model by Alibaba, released under the Apache 2.0 license 😍📝 Time to dive in and learn more 🧶
data:image/s3,"s3://crabby-images/0ab3c/0ab3cde6824784cbd78b775e077d73f19883bb5f" alt="image_1"
This model consists of a ViT-based visual encoder that takes in crops of the image along with the original image itself. The encoder outputs then go through a convolution-based module, after which they are merged with the text and fed to the LLM.
data:image/s3,"s3://crabby-images/3dce9/3dce966829872a204001df7537a9ac4f13a0ca5f" alt="image_2"
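To make the data flow a bit more concrete, here is a minimal sketch of the two ideas in that description: cutting the page into a grid of crops (plus the full image as one extra view), and shortening each row of visual features by merging horizontally adjacent ones. It is plain Python with toy sizes, not the authors' code. The function names, the grid shape, and the group size of 4 are assumptions for illustration, and the real convolution-based module uses learned weights rather than simple concatenation.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def crop_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split an image of the given size into a rows x cols grid of crop boxes."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


def h_reduce(grid: List[List[List[float]]], group: int = 4) -> List[List[List[float]]]:
    """Concatenate every `group` horizontally adjacent feature vectors.

    Only the horizontal grouping is shown here. A learned module would
    apply trained weights to each group instead of plain concatenation.
    """
    reduced = []
    for row in grid:
        if len(row) % group != 0:
            raise ValueError("row length must be divisible by group")
        merged_row = []
        for start in range(0, len(row), group):
            merged = [x for vec in row[start:start + group] for x in vec]
            merged_row.append(merged)
        reduced.append(merged_row)
    return reduced


# Toy example (all sizes are illustrative assumptions): a 2x3 grid of crops
# plus the full image, each encoded into an 8x8 grid of 2-dim features.
boxes = crop_boxes(width=1200, height=800, rows=2, cols=3)
views = len(boxes) + 1  # crops + the original image
features = [[[0.0, 1.0] for _ in range(8)] for _ in range(8)]
reduced = h_reduce(features, group=4)

tokens_before = views * 8 * 8
tokens_after = views * len(reduced) * len(reduced[0])
print(tokens_before, tokens_after)
```

The point of the horizontal merge is that text in documents mostly runs left to right, so neighbouring features in a row can be combined to cut the number of visual tokens the LLM has to read.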
Initially, the authors train only the convolution-based part (called H-Reducer) and the vision encoder while keeping the LLM frozen. Then, for fine-tuning (on image captioning, VQA, etc.), they freeze the vision encoder and train the H-Reducer and the LLM.
data:image/s3,"s3://crabby-images/71a9f/71a9f67e78dac9bacbb4a22b6ac5e71acd45ba7f" alt="image_3"
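The two-stage schedule above can be written down as a small lookup table. This is only a sketch of the freezing logic as described in this thread. The stage names (`initial`, `fine_tuning`) and module names are hypothetical labels, not identifiers from the DocOwl codebase.

```python
from typing import Dict, FrozenSet, Tuple

# Hypothetical module names, following the description in the thread.
MODULES: Tuple[str, ...] = ("vision_encoder", "h_reducer", "llm")

# Which modules are updated in each stage; everything else stays frozen.
TRAINABLE_BY_STAGE: Dict[str, FrozenSet[str]] = {
    "initial": frozenset({"vision_encoder", "h_reducer"}),  # LLM frozen
    "fine_tuning": frozenset({"h_reducer", "llm"}),         # vision encoder frozen
}


def trainable_flags(stage: str) -> Dict[str, bool]:
    """Return a module -> is_trainable mapping for the given stage."""
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in MODULES}


for stage in TRAINABLE_BY_STAGE:
    print(stage, trainable_flags(stage))
```

In a PyTorch-style training loop, flags like these would typically be applied by toggling `requires_grad` on each module's parameters before building the optimizer. Note that the H-Reducer is the only part trained in both stages.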
They also use a simple linear projection on text and documents. You can see below how they model the text prompts and outputs 🤓
data:image/s3,"s3://crabby-images/ae2e0/ae2e0d6158738abc0a18d3ff0dfc18c2a4258c9e" alt="image_4"
They train the model on various downstream tasks, including:
- document understanding (DUE benchmark and more)
- table parsing (TURL, PubTabNet)
- chart parsing (PlotQA and more)
- image parsing (OCR-CC)
- text localization (DocVQA and more)
data:image/s3,"s3://crabby-images/d33b8/d33b8c6bb12be0fe9ef729e92d8fbe6dffcfa204" alt="image_5"
They contribute a new model called DocOwl 1.5-Chat by:
1. creating a new document-chat dataset with questions from document VQA datasets
2. feeding them to ChatGPT to get long answers
3. fine-tuning the base model on it (which IMO works very well!)
data:image/s3,"s3://crabby-images/07505/07505257c08b8066590d274b0abdae96d0af8dcd" alt="image_6"
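The dataset recipe in the three steps above boils down to a simple loop, sketched here under some assumptions. The record fields (`image`, `question`, `answer`) and the function names are hypothetical, and passing the short answer to the generator is a guess at how the prompt might be built. The call to ChatGPT is replaced by a pluggable function so the sketch runs offline.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]


def build_chat_dataset(
    vqa_samples: List[Sample],
    generate_long_answer: Callable[[str, str], str],
) -> List[Sample]:
    """Turn short-answer VQA samples into long-answer chat records.

    `generate_long_answer` stands in for the call to a chat LLM; it gets
    the question and the short answer and returns a longer explanation.
    """
    records = []
    for sample in vqa_samples:
        long_answer = generate_long_answer(sample["question"], sample["answer"])
        records.append(
            {
                "image": sample["image"],
                "question": sample["question"],
                "answer": long_answer,
            }
        )
    return records


def fake_generator(question: str, short_answer: str) -> str:
    """Offline stand-in for the LLM call, used only for this example."""
    return f"The answer is {short_answer}."


samples = [{"image": "invoice_001.png", "question": "What is the total?", "answer": "$42"}]
chat_records = build_chat_dataset(samples, fake_generator)
print(chat_records[0]["answer"])
```

The resulting records would then serve as the fine-tuning data for the chat variant in step 3.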
The resulting generalist model and the chat model are pretty much state-of-the-art 😍 Below you can see how they compare to fine-tuned models
data:image/s3,"s3://crabby-images/9b350/9b350710e18357388a68c72749641799e0c8a514" alt="image_7"
Very good paper, read it [here](https://t.co/T23JOAPkv1).
All the models and the datasets (plus some eval datasets for the above tasks!) are in this [organization](https://t.co/sJdTw1jWTR).
The [Space](https://t.co/57E9DbNZXf).
Thanks a lot for reading!
data:image/s3,"s3://crabby-images/f457f/f457f94fd5a49f1d265dcbfa72090bc02326ba35" alt="image_8"
> [!TIP] | |
Resources:
[mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding](https://arxiv.org/abs/2403.12895) | |
by Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Chen Li, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou (2024) | |
[GitHub](https://github.com/X-PLUG/mPLUG-DocOwl) | |
> [!NOTE] | |
[Original tweet](https://twitter.com/mervenoyann/status/1782421257591357824) (April 22, 2024) |