---
pipeline_tag: text-generation
language:
- multilingual
inference: false
license: cc-by-nc-4.0
library_name: transformers
---
<br><br>
<p align="center">
<img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
</p>
<p align="center">
<b>Trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>
[Blog](https://jina.ai/news/readerlm-v2-frontier-small-language-model-for-markdown-and-json) | [Colab](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing)
# ReaderLM-v2
`ReaderLM-v2` is the successor to [ReaderLM-v1](https://huggingface.co/jinaai/reader-lm-1.5b): a **1.5B**-parameter language model that converts raw HTML into well-formatted Markdown or structured JSON, with improved accuracy and better support for longer contexts.
Supporting 29 languages, `ReaderLM-v2` is specialized for HTML parsing, transformation, and text extraction.
## Model Overview
- **Model Type**: Autoregressive, decoder-only transformer
- **Parameter Count**: ~1.5B
- **Context Window**: Up to 512K tokens (combined input and output)
- **Supported Languages**: English, Chinese, Japanese, Korean, French, Spanish, Portuguese, German, Italian, Russian, Vietnamese, Thai, Arabic, and more (29 total)
## What's New in `ReaderLM-v2`
`ReaderLM-v2` features several improvements over [ReaderLM-v1](https://huggingface.co/jinaai/reader-lm-1.5b):
- **Better Markdown Generation**: Generates cleaner, more readable Markdown output.
- **JSON Output**: Produces structured JSON output, enabling schema-based extraction for downstream processing.
- **Longer Context Handling**: Can handle up to 512K tokens, which is beneficial for large HTML documents.
- **Multilingual Support**: Covers 29 languages for broader applications across international web data.
---
# Usage
Below are instructions and examples for running `ReaderLM-v2` locally with the Hugging Face Transformers library.
For a more hands-on experience in a hosted environment, see the [Google Colab Notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing).
## On Google Colab
The easiest way to try `ReaderLM-v2` is to run our [Colab notebook](https://colab.research.google.com/drive/1FfPjZwkMSocOLsEYH45B3B4NxDryKLGI?usp=sharing), which demonstrates HTML-to-Markdown conversion, JSON extraction, and instruction-following using the HackerNews frontpage as an example.
The notebook is optimized for Colab's free T4 GPU tier and uses `vllm` and `triton` for accelerated inference.
Feel free to test it with any website.
For HTML-to-markdown tasks, simply input the raw HTML without any prefix instructions.
However, JSON output and instruction-based extraction require specific prompt formatting as shown in the examples.
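If you prefer to script against `vllm` directly rather than going through the notebook, here is a minimal sketch; the sampling settings mirror the Transformers examples below and are illustrative, not the notebook's exact configuration:
```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Load the tokenizer (for the chat template) and the model with vLLM.
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
llm = LLM(model="jinaai/ReaderLM-v2")

# Greedy decoding with a mild repetition penalty.
sampling_params = SamplingParams(temperature=0, max_tokens=1024, repetition_penalty=1.08)

html = "<html><body><h1>Hello, world!</h1></body></html>"
messages = [{"role": "user", "content": f"Extract the main content from the given HTML and convert it to Markdown format.\n```html\n{html}\n```"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```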
## Local Usage
To use `ReaderLM-v2` locally:
1. Install the necessary dependencies:
```bash
pip install transformers
```
2. Load and run the model:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import re
device = "cuda" # or "cpu"
tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")
model = AutoModelForCausalLM.from_pretrained("jinaai/ReaderLM-v2").to(device)
```
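On a memory-constrained GPU, you can optionally load the weights in half precision instead (a sketch assuming PyTorch with a CUDA device):
```python
import torch
from transformers import AutoModelForCausalLM

# bfloat16 roughly halves VRAM usage; outputs may differ slightly from full precision.
model = AutoModelForCausalLM.from_pretrained(
    "jinaai/ReaderLM-v2", torch_dtype=torch.bfloat16
).to("cuda")
```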
3. (Optional) Pre-clean your HTML by stripping scripts, styles, meta tags, comments, and other noise. This shortens the input and makes it friendlier for GPU VRAM:
```python
# Patterns for stripping noisy or irrelevant HTML elements
SCRIPT_PATTERN = r'<[ ]*script.*?\/[ ]*script[ ]*>'
STYLE_PATTERN = r'<[ ]*style.*?\/[ ]*style[ ]*>'
META_PATTERN = r'<[ ]*meta.*?>'
COMMENT_PATTERN = r'<[ ]*!--.*?--[ ]*>'
LINK_PATTERN = r'<[ ]*link.*?>'
BASE64_IMG_PATTERN = r'<img[^>]+src="data:image/[^;]+;base64,[^"]+"[^>]*>'
SVG_PATTERN = r'(<svg[^>]*>)(.*?)(<\/svg>)'


def replace_svg(html: str, new_content: str = "this is a placeholder") -> str:
    """Replace the body of every <svg> element with a short placeholder."""
    return re.sub(
        SVG_PATTERN,
        lambda match: f"{match.group(1)}{new_content}{match.group(3)}",
        html,
        flags=re.DOTALL,
    )


def replace_base64_images(html: str, new_image_src: str = "#") -> str:
    """Replace inline base64-encoded images with a lightweight <img> tag."""
    return re.sub(BASE64_IMG_PATTERN, f'<img src="{new_image_src}"/>', html)


def clean_html(html: str, clean_svg: bool = False, clean_base64: bool = False):
    """Remove scripts, styles, meta tags, comments, and link tags from HTML."""
    html = re.sub(SCRIPT_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(STYLE_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(META_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(COMMENT_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)
    html = re.sub(LINK_PATTERN, '', html, flags=re.IGNORECASE | re.MULTILINE | re.DOTALL)

    if clean_svg:
        html = replace_svg(html)
    if clean_base64:
        html = replace_base64_images(html)
    return html
```
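As a quick illustration on a made-up snippet:
```python
raw = "<html><head><style>body { color: red; }</style></head><body><!-- nav --><p>Hello</p></body></html>"
print(clean_html(raw))
# <html><head></head><body><p>Hello</p></body></html>
```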
4. Create a prompt for the model:
```python
def create_prompt(
    text: str, tokenizer=None, instruction: str = None, schema: str = None
) -> str:
    """
    Create a prompt for the model with optional instruction and JSON schema.
    """
    if tokenizer is None:
        # Fall back to the ReaderLM-v2 tokenizer loaded in step 2.
        tokenizer = AutoTokenizer.from_pretrained("jinaai/ReaderLM-v2")

    if not instruction:
        instruction = "Extract the main content from the given HTML and convert it to Markdown format."

    if schema:
        # This is an example instruction for JSON output
        instruction = "Extract the specified information from a list of news threads and present it in a structured JSON format."
        prompt = f"{instruction}\n```html\n{text}\n```\nThe JSON schema is as follows:```json{schema}```"
    else:
        prompt = f"{instruction}\n```html\n{text}\n```"

    messages = [
        {
            "role": "user",
            "content": prompt,
        }
    ]

    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```
### HTML to Markdown Example
```python
# Example HTML
html = "<html><body><h1>Hello, world!</h1></body></html>"
html = clean_html(html)
input_prompt = create_prompt(html)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
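Note that `tokenizer.decode(outputs[0])` returns the prompt followed by the generated text. To keep only the model's answer, slice off the prompt tokens first (a small sketch reusing the variables above):
```python
# Drop the prompt tokens and the chat-template special tokens.
generated = outputs[0][inputs.shape[-1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```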
### Instruction-Focused Extraction
```python
instruction = "Extract the menu items from the given HTML and convert it to Markdown format."
input_prompt = create_prompt(html, instruction=instruction)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
### HTML to JSON Example
```python
schema = """
{
"type": "object",
"properties": {
"title": {
"type": "string"
},
"author": {
"type": "string"
},
"date": {
"type": "string"
},
"content": {
"type": "string"
}
},
"required": ["title", "author", "date", "content"]
}
"""
html = clean_html(html)
input_prompt = create_prompt(html, schema=schema)
inputs = tokenizer.encode(input_prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
print(tokenizer.decode(outputs[0]))
```
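The model emits the JSON as text, usually wrapped in a fenced code block. A small, hypothetical helper for turning that output into a Python object (adjust the pattern if your output is formatted differently):
```python
import json

def extract_json(model_output: str) -> dict:
    # Strip an optional ```json ... ``` fence before parsing.
    match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", model_output, re.DOTALL)
    return json.loads(match.group(1) if match else model_output)

data = extract_json(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
print(data)
```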
## AWS SageMaker & Azure Marketplace & Google Cloud Platform
Coming soon.