flan-t5-small-capitalizer

A fine-tuned version of google/flan-t5-small, trained on the agentlans/c4-en-lowercased dataset to restore proper capitalization in lowercased English text.

Key Features

  • Restores proper noun and sentence capitalization
  • Builds on FLAN-T5 small's robust NLP capabilities
  • Designed for text normalization tasks

Intended Uses

  • Capitalizing lowercased text
  • Sentence and proper noun capitalization
  • Text normalization

Usage Example

import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
model_name = "agentlans/flan-t5-small-capitalizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

input_text = "buzzfeed's 360-degree look at the aftermath of california's valley fire has been viewed more than 6 million times. plenty of viewers have been asking how we made it."

output = flan_t5_pipeline(input_text, max_length=1024)
print(output[0]["generated_text"])
# Expected output: Buzzfeed's 360-degree look at the aftermath of California's Valley Fire has been viewed more than 6 million times. Plenty of viewers have been asking how we made it.

Limitations

  • Language: English only
  • Text Type: Primarily modern prose found on the Internet
  • Capitalization Issues:
    • May not capitalize titles correctly
    • Inconsistent capitalization style across texts
    • Difficulty with special terms and abbreviations requiring capitalization
  • Input/Output Constraint: Maximum length of 1024 tokens for both input and output (see the chunking sketch below)
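
Because input and output are both capped at 1024 tokens, longer documents need to be split before capitalization. The sketch below is one possible workaround and is not part of the original model card: it breaks text on sentence boundaries using a rough character budget (the 2000-character limit and the helper name capitalize_long_text are assumptions standing in for true token counting) and stitches the capitalized chunks back together.

import re
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1
capitalizer = pipeline(
    "text2text-generation",
    model="agentlans/flan-t5-small-capitalizer",
    device=device,
)

def capitalize_long_text(text, max_chars=2000):
    # Rough character-based chunking on sentence boundaries; max_chars is an
    # assumed proxy for the 1024-token limit, not a documented value.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    # Capitalize each chunk separately, then rejoin the results.
    pieces = [capitalizer(chunk, max_length=1024)[0]["generated_text"] for chunk in chunks]
    return " ".join(pieces)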

Training and evaluation data

The model was trained on a subset of the English configuration of the C4 dataset. The subset contains 125,000 rows, split into 100,000 for training and 25,000 for validation; each row pairs the original text with its lowercased version. The model reaches a final validation loss of 0.1338 after processing 56,941,616 input tokens.
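
For reference, the following is a minimal sketch of how such (lowercased input, original target) pairs can be assembled with the datasets library; the "text" column name and the split name are assumptions and may differ from the actual schema of agentlans/c4-en-lowercased.

from datasets import load_dataset

# Minimal sketch: build (lowercased input, original target) pairs.
dataset = load_dataset("agentlans/c4-en-lowercased", split="train")

def make_pair(example):
    # The lowercased text is the model input; the original casing is the target.
    original = example["text"]
    return {"input_text": original.lower(), "target_text": original}

pairs = dataset.map(make_pair)
print(pairs[0]["input_text"][:80])
print(pairs[0]["target_text"][:80])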

Training hyperparameters

The model was trained using the following key hyperparameters:

  • Learning rate: 5e-05
  • Batch size: 8
  • Number of epochs: 1
  • Optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • Learning rate scheduler: Linear
  • Maximum source and target length: 1024 tokens

Additional training arguments included bf16 precision, automatic batch size finding, and the use of a sortish sampler.
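
The exact training script is not published; the following Seq2SeqTrainingArguments configuration is an illustrative sketch that mirrors the hyperparameters listed above, with the output directory and any unlisted settings as assumptions.

from transformers import Seq2SeqTrainingArguments

# Illustrative configuration mirroring the listed hyperparameters.
# Adam betas/epsilon match the library defaults (0.9, 0.999, 1e-8);
# the output directory is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-capitalizer",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    num_train_epochs=1,
    lr_scheduler_type="linear",
    bf16=True,
    auto_find_batch_size=True,
    sortish_sampler=True,
)
# The 1024-token source/target limit is applied at tokenization time
# rather than through these arguments.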

Training results

| Training Loss | Epoch | Step  | Validation Loss | Input Tokens Seen |
|---------------|-------|-------|-----------------|-------------------|
| 0.2532        | 0.05  | 2500  | 0.1739          | 2824810           |
| 0.231         | 0.1   | 5000  | 0.1653          | 5702148           |
| 0.2163        | 0.15  | 7500  | 0.1571          | 8531178           |
| 0.1966        | 0.2   | 10000 | 0.1529          | 11350902          |
| 0.2013        | 0.25  | 12500 | 0.1491          | 14191502          |
| 0.1971        | 0.3   | 15000 | 0.1464          | 17050704          |
| 0.1791        | 0.35  | 17500 | 0.1447          | 19857804          |
| 0.193         | 0.4   | 20000 | 0.1424          | 22687180          |
| 0.1821        | 0.45  | 22500 | 0.1416          | 25532518          |
| 0.19          | 0.5   | 25000 | 0.1397          | 28423408          |
| 0.1753        | 0.55  | 27500 | 0.1388          | 31248170          |
| 0.184         | 0.6   | 30000 | 0.1378          | 34048604          |
| 0.1717        | 0.65  | 32500 | 0.1371          | 36903282          |
| 0.1693        | 0.7   | 35000 | 0.1359          | 39709784          |
| 0.1729        | 0.75  | 37500 | 0.1345          | 42614112          |
| 0.1711        | 0.8   | 40000 | 0.1344          | 45471178          |
| 0.1735        | 0.85  | 42500 | 0.1340          | 48355942          |
| 0.1797        | 0.9   | 45000 | 0.1340          | 51187066          |
| 0.1659        | 0.95  | 47500 | 0.1338          | 54074434          |
| 0.1658        | 1.0   | 50000 | 0.1338          | 56941616          |

Framework versions

  • Transformers 4.43.3
  • PyTorch 2.3.0+cu121
  • Datasets 3.2.0
  • Tokenizers 0.19.1