KL3M Tokenizers: A Family of Domain-Specific and Character-Level Tokenizers for Legal, Financial, and Preprocessing Applications
Abstract
We present the KL3M tokenizers, a family of specialized tokenizers for legal, financial, and governmental text. Despite established work on tokenization, specialized tokenizers for professional domains remain understudied. Our paper offers two main contributions to this area. First, we introduce domain-specific BPE tokenizers for legal, financial, and governmental text. Our kl3m-004-128k-cased tokenizer uses 9-17% fewer tokens than GPT-4o and Llama3 for domain-specific documents, despite having a smaller vocabulary. For specialized terminology, our cased tokenizer is even more efficient, using up to 83% fewer tokens for legal terms and 39% fewer tokens for financial terms. Second, we develop character-level BPE tokenizers (4K, 8K, and 16K vocabulary sizes) for text correction tasks like OCR post-processing. These tokenizers keep consistent token boundaries between error-containing and correct text, making it easier for models to learn correction patterns. These tokenizers help professional applications by fitting more text in context windows, reducing computational needs, and preserving the meaning of domain-specific terms. Our analysis shows these efficiency gains directly benefit the processing of long legal and financial documents. We release all tokenizers and code through GitHub and Hugging Face to support further research in specialized tokenization.
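The tokenizers are distributed through Hugging Face, so the efficiency claims are straightforward to spot-check. The snippet below is a minimal sketch of such a comparison, assuming the kl3m-004-128k-cased tokenizer is published under the alea-institute organization and using a GPT-2 tokenizer as a generic baseline; the sample sentence is illustrative, not drawn from the paper's evaluation corpus.

```python
# Minimal sketch: comparing token counts on a legal sentence. The repository id
# for the KL3M tokenizer and the GPT-2 baseline are assumptions for illustration.
from transformers import AutoTokenizer

kl3m = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-128k-cased")
baseline = AutoTokenizer.from_pretrained("gpt2")

text = "The debtor filed a voluntary petition under chapter 11 of the Bankruptcy Code."

for name, tok in [("kl3m-004-128k-cased", kl3m), ("gpt2", baseline)]:
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{name}: {len(ids)} tokens")
```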
Community
Hi @mjbommar ,
I genuinely think the resources are great. However, I strongly dislike the current structure of the datasets.
First: the release consists of over 400 individual datasets, with little to zero documentation of what is actually inside them. Why not put them into one dataset repo with good documentation and subsets?
Second: it is a bad idea to release a dataset that is just a link to an S3 bucket. If the S3 bucket is deleted for whatever reason, the data is completely gone. Why not release the content of these files directly here on Hugging Face?
Third: how can I download a single file from the dataset without an AWS account? And why do we need an extra client library to access the data?
Then I found this paper (https://arxiv.org/pdf/2504.07854), and sorry, but this is not democratization of AI:
As a non-profit, we unfortunately cannot afford to pay unrestricted egress fees for the raw data, but users within us-east-1 can access this data for free and interested parties may contact us for assistance obtaining the data or coordinating alternative data transfer arrangements.
You could simply host the content here on HF.
The overall user experience is so frustrating that I am very close to scraping the S3 bucket and putting the content here on HF myself. This will cost me some money, but at least I would be doing something for the democratization of AI.
Hi @stefan-it,
- Did you see in the paper that we do actually have a point-in-time snapshot on HF? This is the aggregate of most of the datasets.
https://huggingface.co/datasets/alea-institute/kl3m-data-snapshot-20250324
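For readers who want to inspect the snapshot without pulling the whole thing, a streaming read with the datasets library should work. This is a minimal sketch that assumes the repository exposes a standard "train" split; the record fields are whatever the snapshot actually contains.

```python
# Minimal sketch: streaming a few records from the point-in-time snapshot,
# assuming the repository exposes a standard "train" split of parquet files.
from datasets import load_dataset

snapshot = load_dataset(
    "alea-institute/kl3m-data-snapshot-20250324",
    split="train",
    streaming=True,  # avoid downloading the full snapshot up front
)

for i, record in enumerate(snapshot):
    print(record.keys())  # inspect the schema of the first few records
    if i >= 2:
        break
```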
You are welcome to create an HF version. We did try, but the HF endpoints simply failed mid-upload so many times that we gave up after a few weeks of trying (even when uploading from AWS, where HF is hosted). Maybe things have improved with XET, but uploading the parquet splits over HTTP was not working.
As to the dataset modularity, we strongly believe that the modularity makes curriculum training much easier. It is so much easier to recover from download errors, set the relative weight of each dataset in your data mixture, manage update frequency, etc. when the datasets are separated.
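As a concrete illustration of that workflow, separately hosted repositories can be weighted directly in a training mixture with the datasets library. The sketch below uses placeholder repository ids and illustrative 80/20 weights, not any mixture actually used for KL3M training.

```python
# Minimal sketch: building a weighted training mixture from two separately
# hosted datasets. Repository ids are placeholders; weights are illustrative.
from datasets import load_dataset, interleave_datasets

ds_a = load_dataset("alea-institute/<dataset-a>", split="train", streaming=True)
ds_b = load_dataset("alea-institute/<dataset-b>", split="train", streaming=True)

mixture = interleave_datasets(
    [ds_a, ds_b],
    probabilities=[0.8, 0.2],  # relative weight of each dataset in the mixture
    seed=42,
)
```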
You can effectively access the data for free inside AWS. Just launch an EC2 instance in us-east-1 and your cost will be ~$0.
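In practice, in-region access might look like the boto3 sketch below. The bucket name and key prefix are placeholders, since they are not given in this thread, and the real bucket may require credentials or the project's own client library.

```python
# Minimal sketch: listing objects from an EC2 instance in us-east-1 so the
# transfer stays in-region. Bucket name and prefix are placeholders, not the
# project's actual values; credentials/permissions are assumed to be in place.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

BUCKET = "example-kl3m-raw-data"  # placeholder: substitute the real bucket
PREFIX = "documents/"             # placeholder key prefix

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```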