flan-t5-small-detokenizer
This model is a fine-tuned version of google/flan-t5-small on the agentlans/c4-en-tokenized dataset. It achieves the following results on the evaluation set:
- Loss: 0.1093
- Number of Input Tokens Seen: 77,750,728
Model Description
The flan-t5-small-detokenizer model corrects the spacing in tokenized text, making it more readable. This is particularly useful for processing older natural language processing (NLP) datasets that were distributed in pre-tokenized form, with spaces inserted around punctuation.
Example:

```python
import torch
from transformers import pipeline

# Use the GPU if one is available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else -1

# Initialize the text2text-generation pipeline
model_name = "agentlans/flan-t5-small-detokenizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

# Example input: tokenized text with spaces around punctuation
input_text = "A full time vegan , poet , artist and woman . Frontal and thorough when inspired . Our memories : childhood memories , the language we dearly speak and nature 's colorful and tasteful palette are a constant and renewable source of wonder and inspiration ."

# Generate and print the detokenized text
output = flan_t5_pipeline(input_text, max_length=1024)
print(output[0]["generated_text"])
# Expected output: A full time vegan, poet, artist and woman. Frontal and thorough when inspired. Our memories: childhood memories, the language we dearly speak and nature's colorful and tasteful palette are a constant and renewable source of wonder and inspiration.
```
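The same pipeline also accepts a list of texts, which is convenient for cleaning a whole corpus. A minimal sketch: the example sentences and `batch_size=8` below are illustrative assumptions, not tuned values.

```python
# Detokenize several texts in one call; batch_size=8 is an illustrative choice
texts = [
    "Hello , world !",
    "It 's a fine day , is n't it ?",
]
for result in flan_t5_pipeline(texts, max_length=1024, batch_size=8):
    print(result["generated_text"])
```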
Training and Evaluation Data
The model was trained on the agentlans/c4-en-tokenized dataset, which consists of:
- 100,000 training examples
- 25,000 validation examples (of which 2,500 were used to evaluate this model)
The maximum source and target length was 1024 tokens.
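As a rough illustration of that length limit, input/target pairs could be prepared along these lines. This is a hypothetical sketch: the field names `tokenized` and `detokenized` are assumptions, not the dataset's documented schema.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def preprocess(example):
    # Field names are hypothetical; the actual dataset schema may differ
    model_inputs = tokenizer(example["tokenized"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=example["detokenized"], max_length=1024, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```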
Training Procedure
Training Hyperparameters
The following hyperparameters were used during training (see the sketch after this list):
- Learning rate: 5e-05
- Train batch size: 8
- Eval batch size: 8
- Seed: 42
- Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- Learning rate scheduler: Linear
- Number of epochs: 1.0
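For reference, these settings map onto `Seq2SeqTrainingArguments` roughly as follows. This is a sketch, not the author's actual training script; `output_dir` is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-detokenizer",  # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=1.0,
)
```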
Training Results
| Training Loss | Epoch | Step | Validation Loss | Input Tokens Seen |
|---|---|---|---|---|
| 0.1477 | 0.1 | 2500 | 0.1197 | 7,758,904 |
| 0.1331 | 0.2 | 5000 | 0.1160 | 15,535,000 |
| 0.1295 | 0.3 | 7500 | 0.1142 | 23,379,996 |
| 0.1257 | 0.4 | 10000 | 0.1128 | 31,164,076 |
| 0.1148 | 0.5 | 12500 | 0.1115 | 38,943,032 |
| 0.1219 | 0.6 | 15000 | 0.1107 | 46,747,616 |
| 0.1112 | 0.7 | 17500 | 0.1103 | 54,513,880 |
| 0.1161 | 0.8 | 20000 | 0.1100 | 62,302,092 |
| 0.1215 | 0.9 | 22500 | 0.1093 | 70,000,044 |
| 0.1207 | 1.0 | 25000 | 0.1094 | 77,750,728 |
Framework Versions
- Transformers: 4.43.3
- PyTorch: 2.3.0+cu121
- Datasets: 3.2.0
- Tokenizers: 0.19.1