# RoBERTa-base for 19th-Century English
This is a `roberta-base` model that has been domain-adapted to understand and generate text in the style of 19th-century formal and literary English. It was trained on a specialized corpus derived from historical dictionaries and augmented with narrative sentences to build a deep contextual understanding of the vocabulary and syntax of the period.
This model is a Masked Language Model (MLM). Its primary function is to predict a masked (hidden) word in a sentence. It is not a chatbot or a generative model for long-form text.
## Model Description
The model was developed to address the challenge of applying modern NLP techniques to historical texts. Standard models, while powerful, often lack the specific vocabulary and stylistic nuance required to accurately process language from the 1800-1900 period.
The training process followed a four-step data enrichment and model adaptation workflow:
- **Data Sourcing:** The initial dataset was built from 19th-century dictionary entries, providing a rich vocabulary but lacking natural, contextual sentences.
- **Data Augmentation:** A modern instruction-following LLM (`open-mistral-nemo`, via a batch API) was used to rewrite the formulaic dictionary definitions into high-quality, narrative sentences that mimicked a 19th-century authorial voice (a rough sketch of this step appears after this list).
- **Tokenizer Augmentation:** Instead of training a new tokenizer from scratch (which can lead to "vocabulary shock"), the official `roberta-base` tokenizer was augmented: new, important 19th-century words that were poorly represented were added to the existing vocabulary. This preserved the robust tokenization of the original while expanding its domain-specific knowledge (see the tokenizer sketch below).
- **Continued Pre-training:** The `roberta-base` model was then trained on the new, cleaned, and narrative-rich dataset. This process, also known as domain adaptation, allowed the model to fine-tune its weights and learn the semantic relationships of its new vocabulary (see the training sketch below).
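As a rough illustration of the augmentation step, the sketch below rewrites a single dictionary entry with the `mistralai` Python client. The prompt wording, the example entry, and the use of the synchronous chat endpoint (rather than the batch API) are assumptions made for brevity:

```python
import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Hypothetical dictionary entry to be rewritten as a narrative sentence
entry = "PHAETON, n. A light, four-wheeled open carriage."

response = client.chat.complete(
    model="open-mistral-nemo",
    messages=[
        {
            "role": "system",
            "content": (
                "Rewrite the following dictionary definition as a single "
                "narrative sentence in the voice of a 19th-century author."
            ),
        },
        {"role": "user", "content": entry},
    ],
)
print(response.choices[0].message.content)
```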
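A minimal sketch of the tokenizer-augmentation step, using the standard `transformers` API; the word list here is a hypothetical example, not the vocabulary actually added:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Hypothetical 19th-century words the base vocabulary splits into many pieces
new_words = ["apothecary", "phaeton", "poultice", "countenance"]
num_added = tokenizer.add_tokens(new_words)

# Grow the embedding matrix so each new token receives a trainable vector;
# the original embeddings are untouched, which avoids "vocabulary shock"
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```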
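And a sketch of the continued pre-training step, continuing from the objects above. It assumes a pre-tokenized `datasets` corpus named `tokenized_dataset`, and the hyperparameters are illustrative, not the values used to train this model:

```python
from transformers import (
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Dynamically mask 15% of tokens per batch: the standard RoBERTa MLM objective
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Illustrative hyperparameters, not the ones actually used for this model
args = TrainingArguments(
    output_dir="roberta-base-19th-century",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,                      # the resized model from the sketch above
    args=args,
    data_collator=collator,
    train_dataset=tokenized_dataset,  # assumed: the tokenized narrative corpus
)
trainer.train()
```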
The final model excels at predicting masked words in sentences that use the formal, literary, or scientific language of the 19th century.
## How to Use

The primary way to use this model is with the `fill-mask` pipeline:
```python
# Install the transformers library
# !pip install transformers
from transformers import pipeline

# Load the model from the Hugging Face Hub
mask_filler = pipeline(
    "fill-mask",
    model="your-username/roberta-base-19th-century",  # <-- REPLACE WITH YOUR REPO ID
)

# --- Example 1: Literary Context ---
text1 = f"Her countenance, once so bright, betrayed a deep {mask_filler.tokenizer.mask_token}."
predictions1 = mask_filler(text1, top_k=5)

print(f"Sentence: {text1}")
for pred in predictions1:
    print(f"- {pred['token_str'].strip():<15} | Score: {pred['score']:.4f}")

# --- Example 2: Scientific/Technical Context ---
text2 = f"The apothecary mixed a poultice for his ailing {mask_filler.tokenizer.mask_token}."
predictions2 = mask_filler(text2, top_k=5)

print(f"\nSentence: {text2}")
for pred in predictions2:
    print(f"- {pred['token_str'].strip():<15} | Score: {pred['score']:.4f}")

# --- Example 3: Common Phrase ---
text3 = f"He fought with great courage and {mask_filler.tokenizer.mask_token}."
predictions3 = mask_filler(text3, top_k=5)

print(f"\nSentence: {text3}")
for pred in predictions3:
    print(f"- {pred['token_str'].strip():<15} | Score: {pred['score']:.4f}")
```