flan-t5-small-detokenizer

This model is a fine-tuned version of google/flan-t5-small on the agentlans/c4-en-tokenized dataset. It achieves the following results on the evaluation set:

  • Loss: 0.1093
  • Number of Input Tokens Seen: 77,750,728

Model Description

The flan-t5-small-detokenizer is designed to correct the spacing artifacts of tokenized text, such as spaces inserted before punctuation and around contractions, making it more readable. This is particularly useful for processing older natural language processing (NLP) datasets that are distributed in tokenized form.

Example:

import torch
from transformers import pipeline

# Check if GPU is available
device = 0 if torch.cuda.is_available() else -1

# Initialize the pipeline
model_name = "agentlans/flan-t5-small-detokenizer"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

# Example input
input_text = "A full time vegan , poet , artist and woman . Frontal and thorough when inspired . Our memories : childhood memories , the language we dearly speak and nature 's colorful and tasteful palette are a constant and renewable source of wonder and inspiration ."

# Generate output
output = flan_t5_pipeline(input_text, max_length=1024)

# Print the result
print(output[0]["generated_text"])
# Expected output: A full time vegan, poet, artist and woman. Frontal and thorough when inspired. Our memories: childhood memories, the language we dearly speak and nature's colorful and tasteful palette are a constant and renewable source of wonder and inspiration.
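
Because the model targets whole datasets, the pipeline also accepts a list of texts. A minimal batch-processing sketch, reusing the pipeline above (the example texts and batch_size are illustrative, not from this card):

# Batch processing: pass a list of tokenized strings to the pipeline.
# The texts and batch_size here are illustrative assumptions.
texts = [
    "Hello , world !",
    "It 's a fine day , is n't it ?",
]
outputs = flan_t5_pipeline(texts, batch_size=8, max_length=1024)
for result in outputs:
    print(result["generated_text"])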

Training and Evaluation Data

The model was trained on the c4-en-tokenized dataset, which consists of:

  • 100,000 training examples
  • 25,000 validation examples (2,500 of which were used to evaluate this model)

The maximum source and target lengths were 1024 tokens.
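
The dataset can be loaded from the Hugging Face Hub for inspection. A minimal sketch, assuming the conventional "train" split name (the actual splits and column names are whatever the dataset defines):

from datasets import load_dataset

# Load and inspect the training data. The "train" split name is an assumption
# based on the usual Hub convention; printing the DatasetDict shows the actual
# splits and column names.
dataset = load_dataset("agentlans/c4-en-tokenized")
print(dataset)
print(dataset["train"][0])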

Training Procedure

Training Hyperparameters

The following hyperparameters were used during training (a configuration sketch follows the list):

  • Learning rate: 5e-05
  • Train batch size: 8
  • Eval batch size: 8
  • Seed: 42
  • Optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
  • Learning rate scheduler: Linear
  • Number of epochs: 1.0

Training Results

Training Loss | Epoch | Step  | Validation Loss | Input Tokens Seen
------------- | ----- | ----- | --------------- | -----------------
0.1477        | 0.1   |  2500 | 0.1197          |  7,758,904
0.1331        | 0.2   |  5000 | 0.1160          | 15,535,000
0.1295        | 0.3   |  7500 | 0.1142          | 23,379,996
0.1257        | 0.4   | 10000 | 0.1128          | 31,164,076
0.1148        | 0.5   | 12500 | 0.1115          | 38,943,032
0.1219        | 0.6   | 15000 | 0.1107          | 46,747,616
0.1112        | 0.7   | 17500 | 0.1103          | 54,513,880
0.1161        | 0.8   | 20000 | 0.1100          | 62,302,092
0.1215        | 0.9   | 22500 | 0.1093          | 70,000,044
0.1207        | 1.0   | 25000 | 0.1094          | 77,750,728

Framework Versions

  • Transformers: 4.43.3
  • PyTorch: 2.3.0+cu121
  • Datasets: 3.2.0
  • Tokenizers: 0.19.1