Model Card for T5-GenQ-T-v1

🤖 ✨ 🔍 Generate precise, realistic user-focused search queries from product text 🛒 🚀 📊

Model Description

Model variations

Model                 | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum
T5-GenQ-T-v1          | 75.2151 | 54.8735 | 74.5142 | 74.5262
T5-GenQ-TD-v1         | 78.2570 | 58.9586 | 77.5308 | 77.5466
T5-GenQ-TDE-v1        | 76.9075 | 57.0980 | 76.1464 | 76.1502
T5-GenQ-TDC-v1 (best) | 80.0754 | 61.5974 | 79.3557 | 79.3427

Uses

This model is designed to improve e-commerce search functionality by generating user-friendly search queries based on product descriptions. It is particularly suited for applications where product descriptions are the primary input, and the goal is to create concise, descriptive queries that align with user search intent.

Examples of Use:

  • Generating search queries for product indexing.
  • Enhancing product discoverability in e-commerce search engines.
  • Automating query generation for catalog management.

Comparison of ROUGE scores:

    Model                        | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum
    T5-GenQ-T-v1                 | 73.11   | 52.27   | 72.51   | 72.51
    query-gen-msmarco-t5-base-v1 | 40.34   | 19.52   | 39.21   | 39.21

    Note: These scores were computed after training, on the test split of the smartcat/Amazon-2023-GenQ dataset.

    Examples

    Input Text | Target Query | Before Fine-tuning | After Fine-tuning
    PANDORA Jewelry Crossover Pave Triple Band Ring for Women - Sterling Silver with Cubic Zirconia | PANDORA Crossover Triple Band Ring | what is pandora jewelry | Pandora crossover ring
    SAYOYO Baby Sneakers Leather Baby Shoes Crib Shoes Toddler Soft Sole Sneakers | SAYOYO Baby Sneakers | what kind of shoes are baby sneakers | baby leather sneakers
    5 PCS Strap Replacement Compatible with Xiaomi Mi Band 3/4, Bands Xiaomi Mi Band 4 Smart Watch Wristbands Replacement Accessories Strap Bracelets for Mi Fit 3 Straps | Replacement Straps for Xiaomi Mi Band 3/4p | what is the strap on a xiaomi smartwatch | Xiaomi Mi Fit 3 replacement bands
    Backpacker Ladies' Solid Flannel Shirt | ladies flannel shirt | what kind of shirt is a backpacker | women's flannel shirt

    How to Get Started with the Model

    Use the code below to get started with the model.

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Load the fine-tuned model and its tokenizer from the Hugging Face Hub
    model = AutoModelForSeq2SeqLM.from_pretrained("smartcat/T5-GenQ-T-v1")
    tokenizer = AutoTokenizer.from_pretrained("smartcat/T5-GenQ-T-v1")

    description = "Silver-colored cuff with embossed braid pattern. Made of brass, flexible to fit wrist."

    # Tokenize the input; text longer than the tokenizer's limit is truncated
    inputs = tokenizer(description, return_tensors="pt", padding=True, truncation=True)

    # Beam search with 4 beams; **inputs passes the attention mask along with the input IDs
    generated_ids = model.generate(**inputs, max_length=30, num_beams=4, early_stopping=True)

    generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(generated_text)
    

    Training Details

    Training Data

    The model was trained on the smartcat/Amazon-2023-GenQ dataset, which consists of user-like queries generated from product descriptions. The dataset was created using Claude Haiku 3, incorporating key product attributes such as the title, description, and images to ensure relevant and realistic queries. For more information, read the Dataset Card. 😊

    Preprocessing

    • Trained on product titles only.
    • Tokenized using T5's default tokenizer, with truncation to handle long text.
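A minimal sketch of the preprocessing steps above, assuming the standard Hugging Face tokenizer call. The helper name `preprocess` and the column names `title` and `short_query` are hypothetical placeholders, not confirmed by this card; the default length limits are the `max_input_length` and `max_target_length` values listed under Training Hyperparameters.

```python
def preprocess(batch, tokenizer, max_input_length=512, max_target_length=30):
    """Tokenize product titles as inputs and reference queries as labels.

    `batch` maps column names to lists of strings. The column names
    "title" and "short_query" are hypothetical placeholders.
    """
    # Inputs: product titles only, truncated to the maximum input length
    model_inputs = tokenizer(
        batch["title"], max_length=max_input_length, truncation=True
    )
    # Targets: reference queries, truncated to the maximum target length
    labels = tokenizer(
        text_target=batch["short_query"],
        max_length=max_target_length,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

With the `datasets` library, a function like this would typically be applied with `dataset.map(lambda b: preprocess(b, tokenizer), batched=True)`, where `tokenizer` is loaded with `AutoTokenizer.from_pretrained`.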

    Training Hyperparameters

    • max_input_length: 512
    • max_target_length: 30
    • batch_size: 48
    • num_train_epochs: 8
    • evaluation_strategy: epoch
    • save_strategy: epoch
    • learning_rate: 5.6e-05
    • weight_decay: 0.01
    • predict_with_generate: true
    • load_best_model_at_end: true
    • metric_for_best_model: eval_rougeL
    • greater_is_better: true
    • logging_strategy: epoch
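The list above could map onto the Hugging Face `Seq2SeqTrainingArguments` roughly as sketched here. This is an assumption about how the values were passed, not the authors' training script: the card does not say whether `batch_size` is per device, and `max_input_length` and `max_target_length` are left out because they are tokenization settings rather than trainer arguments. The helper names `to_trainer_kwargs` and `build_training_args` are hypothetical.

```python
import dataclasses

# Values copied from the hyperparameter list above.
CARD_HYPERPARAMETERS = {
    "batch_size": 48,
    "num_train_epochs": 8,
    "evaluation_strategy": "epoch",
    "save_strategy": "epoch",
    "learning_rate": 5.6e-05,
    "weight_decay": 0.01,
    "predict_with_generate": True,
    "load_best_model_at_end": True,
    "metric_for_best_model": "eval_rougeL",
    "greater_is_better": True,
    "logging_strategy": "epoch",
}


def to_trainer_kwargs(card, field_names):
    """Translate card names into keyword arguments for the trainer config."""
    kwargs = dict(card)
    batch_size = kwargs.pop("batch_size")
    # Assumption: the same per-device batch size for training and evaluation.
    kwargs["per_device_train_batch_size"] = batch_size
    kwargs["per_device_eval_batch_size"] = batch_size
    # Recent transformers releases renamed `evaluation_strategy` to `eval_strategy`.
    if "eval_strategy" in field_names:
        kwargs["eval_strategy"] = kwargs.pop("evaluation_strategy")
    return kwargs


def build_training_args(output_dir):
    """Build Seq2SeqTrainingArguments; transformers is imported only when called."""
    from transformers import Seq2SeqTrainingArguments

    field_names = {f.name for f in dataclasses.fields(Seq2SeqTrainingArguments)}
    kwargs = to_trainer_kwargs(CARD_HYPERPARAMETERS, field_names)
    return Seq2SeqTrainingArguments(output_dir=output_dir, **kwargs)
```

Checking the dataclass fields keeps the sketch usable on both older and newer `transformers` releases, since the evaluation-strategy argument changed name between them.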

    Train time: 2.43 hrs

    Hardware

    A6000 GPU:

    • Memory Size: 48 GB
    • Memory Type: GDDR6
    • CUDA compute capability: 8.6

    Metrics

    ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation in NLP. The metrics compare an automatically produced summary or translation against one or more human-produced references. ROUGE scores range from 0 to 1, with higher scores indicating greater similarity between the generated text and the reference.

    In our evaluation, ROUGE scores are multiplied by 100 so that they read as percentages. ROUGE-L (eval_rougeL) was the metric used to select the best checkpoint during training.
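To make the metric concrete, here is a simplified sketch of ROUGE-N and ROUGE-L as F-measures on lowercased, whitespace-split tokens, scaled to 0-100. The card does not say which ROUGE implementation was used; standard packages apply their own tokenization and optional stemming, so their numbers can differ from this sketch. The example pair is the PANDORA target query and the fine-tuned output from the examples table.

```python
from collections import Counter


def _f1(overlap, n_pred, n_ref):
    """F-measure from an overlap count and the two sequence lengths."""
    if overlap == 0 or n_pred == 0 or n_ref == 0:
        return 0.0
    precision = overlap / n_pred
    recall = overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(prediction, reference, n=1):
    """ROUGE-N F-measure (clipped n-gram overlap), scaled to 0-100."""
    pred = _ngrams(prediction.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    return 100 * _f1(overlap, len(pred), len(ref))


def rouge_l(prediction, reference):
    """ROUGE-L F-measure from the longest common subsequence, scaled to 0-100."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    # Dynamic-programming table for the LCS length
    table = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, r in enumerate(ref, 1):
            if p == r:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return 100 * _f1(table[-1][-1], len(pred), len(ref))


target = "PANDORA Crossover Triple Band Ring"
generated = "Pandora crossover ring"
print(round(rouge_n(generated, target, 1), 2))  # 75.0
print(round(rouge_n(generated, target, 2), 2))  # 33.33
print(round(rouge_l(generated, target), 2))     # 75.0
```

All three generated words appear in the target in the same order, so ROUGE-1 and ROUGE-L agree; only one of the two generated bigrams appears in the target, which is why ROUGE-2 is much lower.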

    Epoch | Step  | Loss   | Grad Norm | Learning Rate | Eval Loss | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-Lsum
    1.0   | 4285  | 0.9465 | 6.7834    | 4.9e-05       | 0.7644    | 73.1872 | 52.2019 | 72.5199 | 72.5183
    2.0   | 8570  | 0.8076 | 4.9071    | 4.2e-05       | 0.7268    | 73.9182 | 53.1365 | 73.2551 | 73.2570
    3.0   | 12855 | 0.7485 | 4.4814    | 3.5e-05       | 0.7160    | 74.4752 | 53.8076 | 73.7712 | 73.7792
    4.0   | 17140 | 0.7082 | 5.3145    | 2.8e-05       | 0.7023    | 74.7628 | 54.3316 | 74.0811 | 74.0790
    5.0   | 21425 | 0.6788 | 4.4266    | 2.1e-05       | 0.7013    | 74.9437 | 54.5630 | 74.2637 | 74.2668
    6.0   | 25710 | 0.6561 | 5.2897    | 1.4e-05       | 0.6998    | 75.0834 | 54.7163 | 74.3907 | 74.3977
    7.0   | 29995 | 0.6396 | 3.5197    | 7.0e-06       | 0.7005    | 75.2151 | 54.8735 | 74.5142 | 74.5262
    8.0   | 34280 | 0.6278 | 4.4625    | 0.0           | 0.7016    | 75.1899 | 54.8423 | 74.4695 | 74.4801
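The logged learning rates are consistent with a linear decay from the initial 5.6e-05 to zero with no warmup, which is the default scheduler of the Hugging Face `Trainer`. The card does not state the scheduler, so this is an inference from the numbers; the function name `linear_decay_lr` is hypothetical.

```python
def linear_decay_lr(step, total_steps, initial_lr=5.6e-05):
    """Learning rate under linear decay to zero, with no warmup (assumed)."""
    return initial_lr * max(0.0, 1.0 - step / total_steps)


STEPS_PER_EPOCH = 4285  # step count logged at the end of epoch 1
TOTAL_STEPS = 34280     # step count logged at the end of epoch 8

for epoch in range(1, 9):
    step = epoch * STEPS_PER_EPOCH
    print(epoch, step, f"{linear_decay_lr(step, TOTAL_STEPS):.1e}")
```

Rounded to two significant digits, the printed values match the Learning Rate column at every epoch, from 4.9e-05 after epoch 1 down to zero after epoch 8.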

    Model Analysis

    Average scores by model
    [Figure: average ROUGE scores by model]

    The checkpoint-29995 (T5-GenQ-T-v1) model outperforms query-gen-msmarco-t5-base-v1 on all ROUGE metrics.

    The largest relative gap is in ROUGE-2, where checkpoint-29995 scores 52.27, roughly 2.7 times the 19.52 scored by query-gen-msmarco-t5-base-v1. In absolute terms, the gap is about 33 points on every metric.

    ROUGE-1, ROUGE-L, and ROUGE-Lsum follow the same pattern: checkpoint-29995 scores above 72 on each, while query-gen-msmarco-t5-base-v1 stays below 41.

    Density comparison
    [Figure: density comparison of ROUGE scores]

    T5-GenQ-T-v1: higher concentration of high ROUGE scores, especially near 100%, indicating strong text overlap with references.

    query-gen-msmarco-t5-base-v1: more spread-out distribution, with multiple peaks at 10-40%, suggesting more variable output and lower overlap with the references.

    ROUGE-1 & ROUGE-L: T5-GenQ-T-v1 peaks at 100%, while query-gen-msmarco-t5-base-v1 has lower, broader peaks.

    ROUGE-2: query-gen-msmarco-t5-base-v1 has a high density at 0%, indicating many low-overlap outputs.

    Histogram comparison
    [Figure: histogram comparison of ROUGE scores]

    T5-GenQ-T-v1: higher concentration of high ROUGE scores, especially near 100%, indicating strong text overlap with references.

    query-gen-msmarco-t5-base-v1: more spread-out distribution, with peaks in the 10-40% range, suggesting more variable output and lower overlap with the references.

    ROUGE-1 & ROUGE-L: T5-GenQ-T-v1 shows a rising trend towards higher scores, while query-gen-msmarco-t5-base-v1 has multiple peaks at lower scores.

    ROUGE-2: query-gen-msmarco-t5-base-v1 has a high concentration of low-score outputs, whereas T5-GenQ-T-v1 achieves more high-scoring outputs.

    Scores by generated query length
    [Figure: ROUGE scores by generated query length]

    This visualization shows average ROUGE scores and score differences across generated query lengths, measured in words.

    High ROUGE scores for most lengths (3-9 words):

    ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-Lsum scores remain consistently high across most query lengths.

    Sharp spike at length 2:

    There is a large positive score difference at 2 words, suggesting strong alignment for very short queries.

    Stable score differences (lengths 3-9):

    After the initial spike at length 2, score differences stay close to zero, indicating consistent performance across query lengths.

    Semantic similarity distribution
    [Figure: distribution of semantic similarity scores]

    This histogram shows the distribution of cosine similarity scores, which measure the semantic similarity between paired texts.

    The majority of similarity scores cluster near 1.0, indicating that most text pairs are highly similar.

    A gradual increase in frequency is observed as similarity scores rise, with a sharp peak at 1.0.

    Lower similarity scores (0.0–0.4) are rare, suggesting fewer instances of dissimilar text pairs.
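For reference, cosine similarity between two embedding vectors can be computed as sketched below. The card does not name the embedding model behind these scores, so the vectors here are toy values chosen only to show the two extremes, not real sentence embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy 3-dimensional vectors standing in for sentence embeddings
generated_vec = [0.2, 0.7, 0.1]
target_vec = [0.4, 1.4, 0.2]      # same direction as generated_vec
unrelated_vec = [0.7, -0.2, 0.0]  # orthogonal to generated_vec

print(round(cosine_similarity(generated_vec, target_vec), 4))
print(round(cosine_similarity(generated_vec, unrelated_vec), 4))
```

Vectors pointing in the same direction score close to 1.0 regardless of their length, which is why a generated query that matches its target produces the peak at 1.0 in the histogram.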

    Semantic similarity score against ROUGE scores
    [Figure: scatter plot matrix of semantic similarity against ROUGE scores]

    This scatter plot matrix compares semantic similarity (cosine similarity) with ROUGE scores, showing their correlation.

    Higher similarity corresponds to higher ROUGE scores, indicating strong n-gram overlap in semantically similar texts. ROUGE-1 and ROUGE-L show the strongest correlation, while ROUGE-2 has more variability. Low-similarity outliers exist, where texts share words but differ semantically.

    More Information

    Authors

    Model Card Contact

    For questions, please open an issue on the GitHub repository.

    Model size: 223M params (Safetensors, tensor type F32)