---
pipeline_tag: text-generation
language:
- multilingual
inference: false
license: cc-by-nc-4.0
library_name: transformers
---
<br><br>
<p align="center">
<img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
</p>
<p align="center">
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>
[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)
# ReaderLM-v2
`ReaderLM-v2` is the successor to [ReaderLM-v1](https://huggingface.co/jinaai/reader-lm-1.5b): a **1.5B**-parameter language model that converts raw HTML into well-formatted Markdown or structured JSON, with improved accuracy and better support for longer contexts.
Supporting 29 languages, `ReaderLM-v2` is specialized for HTML parsing, transformation, and text extraction.
## Model Overview
- **Model Type**: Autoregressive, decoder-only transformer
- **Parameter Count**: ~1.5B
- **Context Window**: Up to 512K tokens (combined input and output)
- **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 total)
## What's New in `ReaderLM-v2`
`ReaderLM-v2` features several improvements over [ReaderLM-v1](https://huggingface.co/jinaai/reader-lm-1.5b):
- **Better Markdown Generation**: Generates cleaner, more readable Markdown output.
- **JSON Output**: Produces structured JSON output, enabling schema-based extraction for downstream processing.
- **Longer Context Handling**: Can handle up to 512K tokens, which is beneficial for large HTML documents.
- **Multilingual Support**: Covers 29 languages for broader applications across international web data.
---
# Usage
Below are instructions and examples for running `ReaderLM-v2` locally with the Hugging Face Transformers library.
For a more hands-on experience in a hosted environment, see the [Google Colab Notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing).
## On Google Colab
The easiest way to try `ReaderLM-v2` is to run our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example.
The notebook is optimized for Colab's free T4 GPU tier and uses `vllm` and `triton` for accelerated inference.
Feel free to test it with any website.
For HTML-to-markdown tasks, simply input the raw HTML without any prefix instructions.
However, JSON output and instruction-based extraction require specific prompt formatting as shown in the examples.
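If you prefer to script against `vllm` directly rather than going through the notebook, here is a minimal sketch; the sampling settings mirror the Transformers examples below and are illustrative, not the notebook's exact configuration:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Load the tokenizer (for the chat template) and the model with vLLM.
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
llm = LLM(model="jinaai/ReaderLM-v2")

# Greedy decoding with a mild repetition penalty.
sampling_params = SamplingParams(temperature=0, max_tokens=1024, repetition_penalty=1.08)

html = "<html><body><h1>Hello, world!</h1></body></html>"
messages = [{"role": "user", "content": f"Extract the main content from the given HTML and convert it to Markdown format.\n```html\n{html}\n```"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```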
## Local Usage
To use `ReaderLM-v2` locally:
1. Install the necessary dependencies:
```bash
pip install transformers
```
2. Load and run the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import re
device = "cuda" # or "cpu"
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
```
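On a memory-constrained GPU, you can optionally load the weights in half precision instead (a sketch assuming PyTorch with a CUDA device):
```python
import torch
from transformers import AutoModelForCausalLM

# bfloat16 roughly halves VRAM usage; outputs may differ slightly from full precision.
model = AutoModelForCausalLM.from_pretrained(
    "jinaai/ReaderLM-v2", torch_dtype=torch.bfloat16
).to("cuda")
```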
3. (Optional) Pre-clean your HTML by stripping scripts, styles, meta tags, comments, and other noise. This shortens the input and makes it friendlier for GPU VRAM:
```python
# Patterns for stripping noisy or irrelevant HTML elements
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'
META_PATTERN = r'<[ ]*meta.*?>'
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'
LINK_PATTERN = r'<[ ]*link.*?>'
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'


def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
    """Replace the body of every <svg> element with a short placeholder."""
    return re.sub(
        SVG_PATTERN,
        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
        html,
        flags=re.DOTALL,
    )


def replace_base64_images(html: str, new_image_src: str = "#") -> str:
    """Replace inline base64-encoded images with a lightweight <img> tag."""
    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)


def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
    """Remove scripts, styles, meta tags, comments, and link tags from HTML."""
    html = re.sub(SCRIPT_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(STYLE_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(META_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(COMMENT_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(LINK_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)

    if clean_svg:
        html = replace_svg(html)
    if clean_base64:
        html = replace_base64_images(html)
    return html
```
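As a quick illustration on a made-up snippet:
```python
raw = "<html><head><style>body { color: red; }</style></head><body><!-- nav --><p>Hello</p></body></html>"
print(clean_html(raw))
# <html><head></head><body><p>Hello</p></body></html>
```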
4. Create a prompt for the model:
```python
def create_prompt(
    text: str, tokenizer=None, instruction: str = None, schema: str = None
) -> str:
    """
    Create a prompt for the model with optional instruction and JSON schema.
    """
    if tokenizer is None:
        # Fall back to the ReaderLM-v2 tokenizer loaded in step 2.
        tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")

    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."

    if schema:
        # This is an example instruction for JSON output
        instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    else:
        prompt = f"{instruction}\n```html\n{text}\n```"

    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```
### HTML to Markdown Example
```python
# Example HTML
html = "<html><body><h1>Hello, world!</h1></body></html>"
html = clean_html(html)
input_prompt = create_prompt(html)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
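Note that `tokenizer.decode(outputs[0])` returns the prompt followed by the generated text. To keep only the model's answer, slice off the prompt tokens first (a small sketch reusing the variables above):
```python
# Drop the prompt tokens and the chat-template special tokens.
generated = outputs[0][inputs.shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```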
### Instruction-Focused Extraction
```python
instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
input_prompt = create_prompt(html, instruction=instruction)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
### HTML to JSON Example
```python
schema = """
{
"type": "object",
"properties": {
"title": {
"type": "string"
},
"author": {
"type": "string"
},
"date": {
"type": "string"
},
"content": {
"type": "string"
}
},
"required": ["title", "author", "date", "content"]
}
"""
html = clean_html(html)
input_prompt = create_prompt(html, schema=schema)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
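The model emits the JSON as text, usually wrapped in a fenced code block. A small, hypothetical helper for turning that output into a Python object (adjust the pattern if your output is formatted differently):
```python
import json

def extract_json(model_output: str) -> dict:
    # Strip an optional ```json ... ``` fence before parsing.
    match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", model_output, re.DOTALL)
    return json.loads(match.group(1) if match else model_output)

data = extract_json(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
print(data)
```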
## AWS SageMaker & Azure Marketplace & Google Cloud Platform
Coming soon.