flan-t5-small-capitalizer

A fine-tuned version of google/flan-t5-small, trained on the agentlans/c4-en-lowercased dataset to restore proper capitalization in lowercased English text.

Key Features

  • Restores proper noun and sentence capitalization
  • Builds on FLAN-T5 small's robust NLP capabilities
  • Designed for text normalization tasks

Intended Uses

  • Capitalizing lowercased text
  • Sentence and proper noun capitalization
  • Text normalization

Usage Example

import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
model_name = "agentlans/flan-t5-small-capitalizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

input_text = "buzzfeed's 360-degree look at the aftermath of california's valley fire has been viewed more than 6 million times. plenty of viewers have been asking how we made it."

output = flan_t5_pipeline(input_text, max_length=1024)
print(output[0]["generated_text"])
# Expected output: Buzzfeed's 360-degree look at the aftermath of California's Valley Fire has been viewed more than 6 million times. Plenty of viewers have been asking how we made it.

Limitations

  • Language: English only
  • Text Type: Primarily modern prose found on the Internet
  • Capitalization Issues:
    • May not capitalize titles correctly
    • Inconsistent capitalization style across texts
    • Difficulty with special terms and abbreviations requiring capitalization
  • Input/Output Constraint: Maximum length of 1024 tokens for both input and output (see the chunking sketch below)
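
Because input and output are both capped at 1024 tokens, longer documents need to be split before capitalization. The sketch below is one possible workaround and is not part of the original model card: it breaks text on sentence boundaries using a rough character budget (the 2000-character limit and the helper name capitalize_long_text are assumptions standing in for true token counting) and stitches the capitalized chunks back together.

import re
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
capitalizer = pipeline(
    "text2text-generation",
    model="agentlans/flan-t5-small-capitalizer",
    device=device,
)

def capitalize_long_text(text, max_chars=2000):
    # Rough character-based chunking on sentence boundaries; max_chars is an
    # assumed proxy for the 1024-token limit, not a documented value.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    # Capitalize each chunk separately, then rejoin the results.
    pieces = [capitalizer(chunk, max_length=1024)[0]["generated_text"] for chunk in chunks]
    return " ".join(pieces)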

Training and evaluation data

The model was trained on a subset of the English configuration of the C4 dataset. The subset contains 125,000 rows, split into 100,000 for training and 25,000 for validation; each row pairs the original text with its lowercased version. The model reaches a final validation loss of 0.1338 after processing 56,941,616 input tokens.
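
For reference, the following is a minimal sketch of how such (lowercased input, original target) pairs can be assembled with the datasets library; the "text" column name and the split name are assumptions and may differ from the actual schema of agentlans/c4-en-lowercased.

from datasets import load_dataset

# Minimal sketch: build (lowercased input, original target) pairs.
dataset = load_dataset("agentlans/c4-en-lowercased", split="train")

def make_pair(example):
    # The lowercased text is the model input; the original casing is the target.
    original = example["text"]
    return {"input_text": original.lower(), "target_text": original}

pairs = dataset.map(make_pair)
print(pairs[0]["input_text"][:80])
print(pairs[0]["target_text"][:80])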

Training hyperparameters

The model was trained using the following key hyperparameters:

  • Learning rate: 5e-05
  • Batch size: 8
  • Number of epochs: 1
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • Learning rate scheduler: Linear
  • Maximum source and target length: 1024 tokens

Additional training arguments included bf16 precision, automatic batch size finding, and the use of a sortish sampler.
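
The exact training script is not published; the following Seq2SeqTrainingArguments configuration is an illustrative sketch that mirrors the hyperparameters listed above, with the output directory and any unlisted settings as assumptions.

from transformers import Seq2SeqTrainingArguments

# Illustrative configuration mirroring the listed hyperparameters.
# Adam betas/epsilon match the library defaults (0.9, 0.999, 1e-8);
# the output directory is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-capitalizer",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    bf16=True,
    auto_find_batch_size=True,
    sortish_sampler=True,
)
# The 1024-token source/target limit is applied at tokenization time
# rather than through these arguments.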

Training results

| Training Loss | Epoch | Step  | Validation Loss | Input Tokens Seen |
|---------------|-------|-------|-----------------|-------------------|
| 0.2532        | 0.05  | 2500  | 0.1739          | 2824810           |
| 0.231         | 0.1   | 5000  | 0.1653          | 5702148           |
| 0.2163        | 0.15  | 7500  | 0.1571          | 8531178           |
| 0.1966        | 0.2   | 10000 | 0.1529          | 11350902          |
| 0.2013        | 0.25  | 12500 | 0.1491          | 14191502          |
| 0.1971        | 0.3   | 15000 | 0.1464          | 17050704          |
| 0.1791        | 0.35  | 17500 | 0.1447          | 19857804          |
| 0.193         | 0.4   | 20000 | 0.1424          | 22687180          |
| 0.1821        | 0.45  | 22500 | 0.1416          | 25532518          |
| 0.19          | 0.5   | 25000 | 0.1397          | 28423408          |
| 0.1753        | 0.55  | 27500 | 0.1388          | 31248170          |
| 0.184         | 0.6   | 30000 | 0.1378          | 34048604          |
| 0.1717        | 0.65  | 32500 | 0.1371          | 36903282          |
| 0.1693        | 0.7   | 35000 | 0.1359          | 39709784          |
| 0.1729        | 0.75  | 37500 | 0.1345          | 42614112          |
| 0.1711        | 0.8   | 40000 | 0.1344          | 45471178          |
| 0.1735        | 0.85  | 42500 | 0.1340          | 48355942          |
| 0.1797        | 0.9   | 45000 | 0.1340          | 51187066          |
| 0.1659        | 0.95  | 47500 | 0.1338          | 54074434          |
| 0.1658        | 1.0   | 50000 | 0.1338          | 56941616          |

Framework versions

  • Transformers 4.43.3
  • PyTorch 2.3.0+cu121
  • Datasets 3.2.0
  • Tokenizers 0.19.1