---
language:
  - en
license: apache-2.0
base_model: google/flan-t5-small
tags:
  - generated_from_trainer
datasets:
  - sentence-paraphrases
model-index:
  - name: flan-t5-small-simplifier
    results: []
---

# flan-t5-small-simplifier

For paraphrasing and simplifying English text.

This model is a fine-tuned version of google/flan-t5-small on the agentlans/sentence-paraphrases dataset. It achieves the following results on the evaluation set:

- Loss: 1.1518
- Num input tokens seen: 32939232

## Intended uses & limitations

Works best on sentence-length texts. Example usage:

```python
import torch
from transformers import pipeline

# Check if a GPU is available
device = 0 if torch.cuda.is_available() else -1

# Initialize the pipeline
model_name = "agentlans/flan-t5-small-simplifier"
flan_t5_pipeline = pipeline("text2text-generation", model=model_name, device=device)

# Example input
input_text = "While navigating the labyrinthine corridors of epistemological uncertainty, the precocious philosopher—whose seminal work on phenomenological interpretation had already garnered significant academic acclaim—paused momentarily to contemplate the intricate interplay between subjective perception and objective reality, ultimately recognizing that the boundaries of human understanding are perpetually fluid and dynamically reconstructed through continuous intellectual discourse and empirical investigation."

# Generate output
output = flan_t5_pipeline(input_text, max_length=1024)

# Print the result
print(output[0]["generated_text"])
# The precocious philosopher, who had already been a major academic acclaim for his seminal work on phenomenological interpretation, paused momentarily to contemplate the intricate interplay between subjective perception and objective reality, recognizing that the boundaries of human understanding are perpetually fluid and dynamically reconstructed through continuous intellectual discourse and empirical investigation.
```
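
Because the model targets individual sentences, longer texts are best split into sentences and passed as a list, which the pipeline processes in one call. A minimal sketch (the example sentences and the batch_size value are illustrative assumptions, not from the model card):

```python
# Simplify several sentences at once; the pipeline accepts a list of inputs.
sentences = [
    "The committee reached a consensus subsequent to protracted deliberations.",
    "Utilize the aforementioned apparatus with circumspection.",
]
outputs = flan_t5_pipeline(sentences, max_length=1024, batch_size=8)
for result in outputs:
    print(result["generated_text"])
```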

Limitations:

- English only
- Doesn't handle mixed-language texts well (for example, English with Greek-letter words)
- Might not be able to simplify some texts

## Training and evaluation data

agentlans/sentence-paraphrases

This dataset is a curated collection of sentence-length paraphrases derived from two primary sources:

- humarin/chatgpt-paraphrases
- xwjzds/paraphrase_collections

### Dataset description

The dataset provides pairs of sentences drawn from an original text and its paraphrase(s). For each entry:

- The "text" field contains the least readable paraphrase.
- The "paraphrase" field contains the most readable paraphrase.

Readability was assessed using the agentlans/deberta-v3-xsmall-zyda-2-readability model.
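
A minimal sketch of loading the dataset and inspecting one entry (the "train" split name is an assumption; the field names come from the description above):

```python
from datasets import load_dataset

# Load the paraphrase pairs; the split name "train" is assumed.
dataset = load_dataset("agentlans/sentence-paraphrases", split="train")

entry = dataset[0]
print(entry["text"])        # least readable version (model input)
print(entry["paraphrase"])  # most readable version (training target)
```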

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2.0
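
These settings map directly onto Hugging Face training arguments. A hypothetical reconstruction (the output_dir value and the use of Seq2SeqTrainingArguments are assumptions; this is not the exact training script):

```python
from transformers import Seq2SeqTrainingArguments

# Assumed sketch of the listed hyperparameters; Adam with betas=(0.9, 0.999)
# and epsilon=1e-08 matches the Trainer's default optimizer settings.
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-simplifier",  # hypothetical
    learning_rate=5e-05,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=2.0,
)
```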

### Training results

| Training Loss | Epoch  | Step  | Validation Loss | Input Tokens Seen |
|:-------------:|:------:|:-----:|:---------------:|:-----------------:|
| 1.4423        | 0.2224 | 10000 | 1.2431          | 3655312           |
| 1.3884        | 0.4448 | 20000 | 1.2093          | 7331520           |
| 1.3782        | 0.6673 | 30000 | 1.1859          | 10990432          |
| 1.3595        | 0.8897 | 40000 | 1.1787          | 14653328          |
| 1.3059        | 1.1121 | 50000 | 1.1665          | 18326104          |
| 1.3298        | 1.3345 | 60000 | 1.1589          | 21991016          |
| 1.2994        | 1.5569 | 70000 | 1.1562          | 25656600          |
| 1.2952        | 1.7794 | 80000 | 1.1518          | 29314808          |

### Framework versions

- Transformers 4.43.3
- Pytorch 2.3.0+cu121
- Datasets 3.2.0
- Tokenizers 0.19.1