flan-t5-small-capitalizer
A specialized fine-tuned version of google/flan-t5-small trained on the agentlans/c4-en-lowercased dataset to restore proper capitalization.
Key Features
- Restores proper noun and sentence capitalization
- Builds on FLAN-T5 small's robust NLP capabilities
- Designed for text normalization tasks
Intended Uses
- Capitalizing lowercased text
- Sentence and proper noun capitalization
- Text normalization
Usage Example
import torch
from transformers import pipeline
device = 0 if torch.cuda.is_available() else -1
model_name = "agentlans/flan-t5-small-capitalizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)
input_text = "buzzfeed's 360-degree look at the aftermath of california's valley fire has been viewed more than 6 million times. plenty of viewers have been asking how we made it."
output = flan_t5_pipeline(input_text, max_length=1024)
print(output[0]["generated_text"])
# Expected output: Buzzfeed's 360-degree look at the aftermath of California's Valley Fire has been viewed more than 6 million times. Plenty of viewers have been asking how we made it.
Limitations
- Language: English only
- Text Type: Primarily modern prose found on the Internet
- Capitalization Issues:
  - May not capitalize titles correctly
  - Inconsistent capitalization style across texts
  - Difficulty with special terms and abbreviations requiring capitalization
- Input/Output Constraint: Maximum length of 1024 tokens for both input and output
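Because of the 1024-token limit, longer documents need to be split before being passed to the pipeline. The sketch below is one possible approach, not part of the original card: it splits the input on sentence boundaries, groups sentences into chunks of roughly 400 words (an arbitrary safety margin well under the token limit), capitalizes each chunk, and rejoins the results.
import re
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
capitalizer = pipeline("text2text-generation",
                       model="agentlans/flan-t5-small-capitalizer",
                       device=device)

def capitalize_long_text(text, max_words=400):
    # Naive sentence split on ., ?, or ! followed by whitespace.
    sentences = re.split(r"(?<=[.?!])\s+", text)
    chunks, current = [], []
    for sentence in sentences:
        current.append(sentence)
        # Flush once the chunk reaches the (rough) word budget.
        if sum(len(s.split()) for s in current) >= max_words:
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    # Capitalize each chunk independently and rejoin the outputs.
    outputs = [capitalizer(chunk, max_length=1024)[0]["generated_text"]
               for chunk in chunks]
    return " ".join(outputs)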
Training and evaluation data
The model was trained on a subset of the C4 dataset's English configuration. This subset contains 125,000 rows, split into 100,000 for training and 25,000 for validation; each row pairs the original text with its lowercased version. The model reaches a final validation loss of 0.1338 after processing 56,941,616 input tokens.
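For reference, a minimal sketch of loading the data with the datasets library is shown below. It assumes the dataset is published as a single split and reproduces the 100,000/25,000 split described above with an arbitrary seed; the actual split layout and column names should be checked against the dataset card.
from datasets import load_dataset

# Assumes the 125,000 rows ship as one "train" split on the Hub.
dataset = load_dataset("agentlans/c4-en-lowercased")["train"]
# Reproduce the 100,000 / 25,000 train-validation split (seed is arbitrary).
split = dataset.train_test_split(test_size=25_000, seed=42)
train_data, eval_data = split["train"], split["test"]
print(train_data[0])  # one row: original text paired with its lowercased version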
Training hyperparameters
The model was trained using the following key hyperparameters:
- Learning rate: 5e-05
- Batch size: 8
- Number of epochs: 1
- Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- Learning rate scheduler: Linear
- Maximum source and target length: 1024 tokens
Additional training arguments included bf16 precision, automatic batch size finding, and the use of a sortish sampler.
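As a rough illustration, these settings map onto transformers' Seq2SeqTrainingArguments as in the sketch below. This is a reconstruction for readability, not the original training script; the output directory is a placeholder and the evaluation cadence (every 2,500 steps) is inferred from the table below rather than stated in this section.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-capitalizer",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    bf16=True,
    auto_find_batch_size=True,
    sortish_sampler=True,
    eval_strategy="steps",
    eval_steps=2500,  # inferred from the evaluation table below
)
# The 1024-token source/target limit is applied during tokenization
# (max_length=1024), not through the arguments above.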
Training results
Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
---|---|---|---|---|
0.2532 | 0.05 | 2500 | 0.1739 | 2824810 |
0.231 | 0.1 | 5000 | 0.1653 | 5702148 |
0.2163 | 0.15 | 7500 | 0.1571 | 8531178 |
0.1966 | 0.2 | 10000 | 0.1529 | 11350902 |
0.2013 | 0.25 | 12500 | 0.1491 | 14191502 |
0.1971 | 0.3 | 15000 | 0.1464 | 17050704 |
0.1791 | 0.35 | 17500 | 0.1447 | 19857804 |
0.193 | 0.4 | 20000 | 0.1424 | 22687180 |
0.1821 | 0.45 | 22500 | 0.1416 | 25532518 |
0.19 | 0.5 | 25000 | 0.1397 | 28423408 |
0.1753 | 0.55 | 27500 | 0.1388 | 31248170 |
0.184 | 0.6 | 30000 | 0.1378 | 34048604 |
0.1717 | 0.65 | 32500 | 0.1371 | 36903282 |
0.1693 | 0.7 | 35000 | 0.1359 | 39709784 |
0.1729 | 0.75 | 37500 | 0.1345 | 42614112 |
0.1711 | 0.8 | 40000 | 0.1344 | 45471178 |
0.1735 | 0.85 | 42500 | 0.1340 | 48355942 |
0.1797 | 0.9 | 45000 | 0.1340 | 51187066 |
0.1659 | 0.95 | 47500 | 0.1338 | 54074434 |
0.1658 | 1.0 | 50000 | 0.1338 | 56941616 |
Framework versions
- Transformers 4.43.3
- PyTorch 2.3.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1