# WebOrganizer/TopicClassifier
The TopicClassifier organizes web content into 24 categories based on the URL and text contents of web pages. The model is a gte-base-en-v1.5 with 140M parameters, fine-tuned on the following training data:
- WebOrganizer/TopicAnnotations-Llama-3.1-8B: 1M documents annotated by Llama-3.1-8B (first-stage training)
- WebOrganizer/TopicAnnotations-Llama-3.1-405B-FP8: 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
## All Domain Classifiers
- WebOrganizer/FormatClassifier
- WebOrganizer/FormatClassifier-NoURL
- WebOrganizer/TopicClassifier ← you are here!
- WebOrganizer/TopicClassifier-NoURL
## Usage
This classifier expects input in the following format, with the page URL followed by the page text:
```
{url}
{text}
```
Example:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/TopicClassifier")
model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/TopicClassifier",
    trust_remote_code=True,
    use_memory_efficient_attention=False,
)

web_page = """http://www.example.com
How to build a computer from scratch? Here are the components you need..."""

inputs = tokenizer([web_page], return_tensors="pt")
outputs = model(**inputs)

probs = outputs.logits.softmax(dim=-1)
print(probs.argmax(dim=-1))
# -> 5 ("Hardware" topic)
```
You can convert the logits of the model with a softmax to obtain a probability distribution over the following 24 categories (in order of labels; also see `id2label` and `label2id` in the model config):
- Adult
- Art & Design
- Software Dev.
- Crime & Law
- Education & Jobs
- Hardware
- Entertainment
- Social Life
- Fashion & Beauty
- Finance & Business
- Food & Dining
- Games
- Health
- History
- Home & Hobbies
- Industrial
- Literature
- Politics
- Religion
- Science & Tech.
- Software
- Sports & Fitness
- Transportation
- Travel
The full definitions of the categories can be found in the taxonomy config.
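Because the logits follow this label order, you can also inspect the full distribution rather than just the argmax. A small sketch (assuming the `probs` tensor from the usage example above) that prints the top-3 topics:
```python
import torch

# Print the three highest-probability topics for the example page.
top = torch.topk(probs[0], k=3)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```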
## Efficient Inference
We recommend using the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory-efficient attention. This requires installing `xformers` (see more here) and loading the model like:
```python
import torch

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/TopicClassifier",
    trust_remote_code=True,
    unpad_inputs=True,
    use_memory_efficient_attention=True,
    torch_dtype=torch.bfloat16,
)
```
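For higher throughput you will typically want to classify pages in batches on a GPU. The following is a minimal batched-inference sketch, assuming `xformers` is installed, a CUDA device is available, and the `tokenizer` from the usage example above; the page list is illustrative:
```python
import torch

model = model.to("cuda").eval()  # assumes a CUDA device is available

# Illustrative inputs in the "{url}\n{text}" format described above.
web_pages = [
    "http://www.example.com\nHow to build a computer from scratch? ...",
    "http://www.example.org\nThe ten best pasta recipes for a quick dinner...",
]

with torch.inference_mode():
    inputs = tokenizer(
        web_pages, padding=True, truncation=True, return_tensors="pt"
    ).to("cuda")
    probs = model(**inputs).logits.softmax(dim=-1)

print(probs.argmax(dim=-1))  # one predicted topic index per page
```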
## Citation
```bibtex
@article{wettig2025organize,
  title={Organize the Web: Constructing Domains Enhances Pre-Training Data Curation},
  author={Alexander Wettig and Kyle Lo and Sewon Min and Hannaneh Hajishirzi and Danqi Chen and Luca Soldaini},
  year={2025}
}
```