flan-t5-small-detokenizer

This model is a fine-tuned version of google/flan-t5-small on the agentlans/c4-en-tokenized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1093
  • Number of Input Tokens Seen: 77,750,728

Model Description

The flan-t5-small-detokenizer is designed to correct the spacing artifacts of tokenized text, such as spaces inserted before punctuation and around contractions, making it more readable. This is particularly useful for processing older natural language processing (NLP) datasets that are distributed in tokenized form.

Example:

import torch
from transformers import pipeline

# Check if GPU is available
device = 0 if torch.cuda.is_available() else -1

# Initialize the pipeline
model_name = "agentlans/flan-t5-small-detokenizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

# Example input
input_text = "A full time vegan , poet , artist and woman . Frontal and thorough when inspired . Our memories : childhood memories , the language we dearly speak and nature 's colorful and tasteful palette are a constant and renewable source of wonder and inspiration ."

# Generate output
output = flan_t5_pipeline(input_text, max_length=1024)

# Print the result
print(output[0]["generated_text"])
# Expected output: A full time vegan, poet, artist and woman. Frontal and thorough when inspired. Our memories: childhood memories, the language we dearly speak and nature's colorful and tasteful palette are a constant and renewable source of wonder and inspiration.
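
Because the model targets whole datasets, the pipeline also accepts a list of texts. A minimal batch-processing sketch, reusing the pipeline above (the example texts and batch_size are illustrative, not from this card):

# Batch processing: pass a list of tokenized strings to the pipeline.
# The texts and batch_size here are illustrative assumptions.
texts = [
    "Hello , world !",
    "It 's a fine day , is n't it ?",
]
outputs = flan_t5_pipeline(texts, batch_size=8, max_length=1024)
for result in outputs:
    print(result["generated_text"])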

Training and Evaluation Data

The model was trained on the c4-en-tokenized dataset, which consists of:

  • 100,000 training examples
  • 25,000 validation examples (2,500 of which were used to evaluate this model)

The maximum source and target lengths were 1024 tokens.
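
The dataset can be loaded from the Hugging Face Hub for inspection. A minimal sketch, assuming the conventional "train" split name (the actual splits and column names are whatever the dataset defines):

from datasets import load_dataset

# Load and inspect the training data. The "train" split name is an assumption
# based on the usual Hub convention; printing the DatasetDict shows the actual
# splits and column names.
dataset = load_dataset("agentlans/c4-en-tokenized")
print(dataset)
print(dataset["train"][0])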

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • Learning rate: 5e-05
  • Train batch size: 8
  • Eval batch size: 8
  • Seed: 42
  • Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • Learning rate scheduler: Linear
  • Number of epochs: 1.0

Training Results

Training Loss | Epoch | Step  | Validation Loss | Input Tokens Seen
------------- | ----- | ----- | --------------- | -----------------
0.1477        | 0.1   |  2500 | 0.1197          |  7,758,904
0.1331        | 0.2   |  5000 | 0.1160          | 15,535,000
0.1295        | 0.3   |  7500 | 0.1142          | 23,379,996
0.1257        | 0.4   | 10000 | 0.1128          | 31,164,076
0.1148        | 0.5   | 12500 | 0.1115          | 38,943,032
0.1219        | 0.6   | 15000 | 0.1107          | 46,747,616
0.1112        | 0.7   | 17500 | 0.1103          | 54,513,880
0.1161        | 0.8   | 20000 | 0.1100          | 62,302,092
0.1215        | 0.9   | 22500 | 0.1093          | 70,000,044
0.1207        | 1.0   | 25000 | 0.1094          | 77,750,728

Framework Versions

  • Transformers: 4.43.3
  • PyTorch: 2.3.0+cu121
  • Datasets: 3.2.0
  • Tokenizers: 0.19.1