	---
	library_name: transformers
	tags:
	- e-commerce
	- query-generation
	license: mit
	datasets:
	- smartcat/Amazon-2023-GenQ
	language:
	- en
	metrics:
	- rouge
	base_model:
	- BeIR/query-gen-msmarco-t5-base-v1
	pipeline_tag: text2text-generation
	---

	# Model Card for T5-GenQ-TDE-v1

	🤖 ✨ 🔍 Generate precise, realistic user-focused search queries from product text 🛒 🚀 📊


	### Model Description

	- Model Name: Fine-Tuned Query-Generation Model
	- Model type: Text-to-Text Transformer
	- Finetuned from model: [BeIR/query-gen-msmarco-t5-base-v1](https://huggingface.co/BeIR/query-gen-msmarco-t5-base-v1)
	- Dataset: [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ)
	- Primary Use Case: Generating accurate and relevant search queries from item descriptions
	- Repository: [smartcat-labs/product2query](https://github.com/smartcat-labs/product2query)

	### Model variations

	<table border="1" class="dataframe">
	<tr style="text-align: center;">
	<th>Model</th>
	<th>ROUGE-1</th>
	<th>ROUGE-2</th>
	<th>ROUGE-L</th>
	<th>ROUGE-Lsum</th>
	</tr>
	<tr>
	<td><b><a href="https://huggingface.co/smartcat/T5-GenQ-T-v1">T5-GenQ-T-v1</a></b></td>
	<td>75.2151</td>
	<td>54.8735</td>
	<td><b>74.5142</b></td>
	<td>74.5262</td>
	</tr>
	<tr>
	<td><b><a href="https://huggingface.co/smartcat/T5-GenQ-TD-v1">T5-GenQ-TD-v1</a></b></td>
	<td>78.2570</td>
	<td>58.9586</td>
	<td><b>77.5308</b></td>
	<td>77.5466</td>
	</tr>
	<tr>
	<td><b><a href="https://huggingface.co/smartcat/T5-GenQ-TDE-v1">T5-GenQ-TDE-v1</a></b></td>
	<td>76.9075</td>
	<td>57.0980</td>
	<td><b>76.1464</b></td>
	<td>76.1502</td>
	</tr>
	<tr>
	<td><b><a href="https://huggingface.co/smartcat/T5-GenQ-TDC-v1">T5-GenQ-TDC-v1</a> (best)</b></td>
	<td>80.0754</td>
	<td>61.5974</td>
	<td><b>79.3557</b></td>
	<td>79.3427</td>
	</tr>
	</table>

	### Uses

	This model is designed to improve e-commerce search functionality by generating user-friendly search queries based on product descriptions. It is particularly suited for applications where product descriptions are the primary input, and the goal is to create concise, descriptive queries that align with user search intent.

	### Examples of Use:

	<ul>
	<li>Generating search queries for product indexing.</li>
	<li>Enhancing product discoverability in e-commerce search engines.</li>
	<li>Automating query generation for catalog management.</li>
	</ul>

	### Comparison of ROUGE scores:

	<table border="1">
	<thead>
	<tr>
	<th>Model</th>
	<th>ROUGE-1</th>
	<th>ROUGE-2</th>
	<th>ROUGE-L</th>
	<th>ROUGE-Lsum</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>T5-GenQ-TDE-v1</td>
	<td>74.71</td>
	<td>54.31</td>
	<td>74.06</td>
	<td>74.06</td>
	</tr>
	<tr>
	<td>query-gen-msmarco-t5-base-v1</td>
	<td>37.63</td>
	<td>17.40</td>
	<td>36.69</td>
	<td>36.69</td>
	</tr>
	</tbody>
	</table>

	Note: This evaluation was run after training, on the test split of the [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ/viewer/default/test?views%5B%5D=test) dataset.

	### Examples
	<details><summary>Expand to see the table with examples</summary>
	<table border="1" style="text-align: center;">
	<thead>
	<tr>
	<th style="width: 25%;" >Input Text</th>
	<th style="width: 25%;">Target Query</th>
	<th>Before Fine-tuning</th>
	<th>After Fine-tuning</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td><strong>KIDSCOOL SPACE Baby Denim Overall,Hooded Little Kid Jean Jumper</strong></td>
	<td>KIDSCOOL SPACE Baby Denim Overall</td>
	<td>what is kidscool space denim</td>
	<td>baby denim overalls</td>
	</tr>
	<tr>
	<td><strong>NCAA Mens Long Sleeve Shirt Arm Team</strong>

	Show your Mountaineers pride with this West Virginia long sleeve shirt. Its soft cotton material and unique graphics make this a great addition to any West Virginia apparel collection. Features: -100% cotton -Ribbed and double stitched collar and sleeves -Officially licensed West Virginia University long sleeve shirt</td>
	<td>West Virginia long sleeve shirt</td>
	<td>wvu long sleeve shirt</td>
	<td>West Virginia long sleeve shirt</td>
	</tr>
	<tr>
	<td><strong>The Body Shop Mattifying Lotion (Vegan), Tea Tree, 1.69 Fl Oz</strong>

	Product Description
	Made with community trade tea tree oil, The Body Shop's Tea Tree Mattifying Lotion provides lightweight hydration, helps tackles excess oil and visibly reduces the appearance of blemishes, revealing a clearer looking, mattifed finish. 100 percent vegan, suitable for blemish prone skin.
	From the Manufacturer
	Made with Community Trade tea tree oil, The Body Shop's Tea Tree Mattifying Lotion provides lightweight hydration, helps tackles excess oil and visibly reduces the appearance of blemishes, revealing a clearer-looking, mattifed finish. 100% vegan, suitable for blemish-prone skin. Paraben-free. Gluten-free. 100% Vegan.</td>
	<td>Tea Tree Mattifying Lotion</td>
	<td>what is body shop tea tree lotion</td>
	<td>vegan matte lotion</td>
	</tr>
	</tbody>
	</table>
	</details>

	## How to Get Started with the Model

	Use the code below to get started with the model.

	```python
	from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

	# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
	model = AutoModelForSeq2SeqLM.from_pretrained("smartcat/T5-GenQ-TDE-v1")
	tokenizer = AutoTokenizer.from_pretrained("smartcat/T5-GenQ-TDE-v1")

	description = "Silver-colored cuff with embossed braid pattern. Made of brass, flexible to fit wrist."

	# Tokenize the product text; truncation keeps long descriptions within the input limit
	inputs = tokenizer(description, return_tensors="pt", padding=True, truncation=True)

	# Generate a query with beam search; **inputs also forwards the attention mask
	generated_ids = model.generate(**inputs, max_length=30, num_beams=4, early_stopping=True)

	# Decode the best beam into plain text and show it
	generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
	print(generated_text)
	```

	## Training Details

	### Training Data

	The model was trained on the [smartcat/Amazon-2023-GenQ](https://huggingface.co/datasets/smartcat/Amazon-2023-GenQ) dataset, which consists of user-like queries generated from product descriptions. The queries were created with Claude 3 Haiku from key product attributes such as the title, description, and images, so that they are relevant and realistic. For more information, read the Dataset Card. 😊


	### Preprocessing
	- Trained on product titles plus descriptions, together with a duplicate set of the same products that uses titles only.
	- Tokenized using T5’s default tokenizer, with truncation to handle long text.
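The input construction described above can be sketched in plain Python. This is only an illustration: the function name `build_inputs`, the dictionary keys, and the choice to join title and description with a single space are assumptions, not the project's actual preprocessing code. Tokenization and truncation would follow as a separate step.

```python
from typing import Dict, List


def build_inputs(products: List[Dict[str, str]]) -> List[str]:
    """Return a title+description input (when a description exists)
    and a title-only input for every product."""
    inputs: List[str] = []
    for product in products:
        title = product["title"].strip()
        description = product.get("description", "").strip()
        # Assumption: title and description are joined with a single space.
        if description:
            inputs.append(f"{title} {description}")
        # The duplicate, title-only copy of the same product.
        inputs.append(title)
    return inputs


products = [
    {
        "title": "NCAA Mens Long Sleeve Shirt Arm Team",
        "description": "Show your Mountaineers pride with this West Virginia long sleeve shirt.",
    },
    {
        "title": "KIDSCOOL SPACE Baby Denim Overall,Hooded Little Kid Jean Jumper",
        "description": "",
    },
]

training_inputs = build_inputs(products)
for text in training_inputs:
    print(text)
```

Each string in `training_inputs` would then be passed to the tokenizer with truncation enabled.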


	### Training Hyperparameters

	<ul>
	<li><strong>max_input_length:</strong> 512</li>
	<li><strong>max_target_length:</strong> 30</li>
	<li><strong>batch_size:</strong> 48</li>
	<li><strong>num_train_epochs:</strong> 8</li>
	<li><strong>evaluation_strategy:</strong> epoch</li>
	<li><strong>save_strategy:</strong> epoch</li>
	<li><strong>learning_rate:</strong> 5.6e-05</li>
	<li><strong>weight_decay:</strong> 0.01 </li>
	<li><strong>predict_with_generate:</strong> true</li>
	<li><strong>load_best_model_at_end:</strong> true</li>
	<li><strong>metric_for_best_model:</strong> eval_rougeL</li>
	<li><strong>greater_is_better:</strong> true</li>
	<li><strong>logging_strategy:</strong> epoch</li>
	</ul>


	### Train time: 25.62 hrs

	### Hardware

	A6000 GPU:
	- Memory Size: 48 GB
	- Memory Type: GDDR6
	- CUDA compute capability: 8.6

	### Metrics

	[ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)), or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics for evaluating automatic summarization and machine translation in NLP. The metrics compare an automatically produced summary or translation against one or more reference summaries or translations, which are typically human-produced. ROUGE scores range from 0 to 1, with higher scores indicating greater similarity between the generated text and the reference.

	In our evaluation, ROUGE scores are multiplied by 100 so that they read as percentages. The metric used to select the best checkpoint during training was ROUGE-L (`metric_for_best_model: eval_rougeL`).
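To make the metric concrete, here is a minimal pure-Python sketch of ROUGE-N and ROUGE-L on the 0 to 100 scale. It assumes the F1 variant, lowercased whitespace tokenization, and no stemming, so it is a simplification and not the evaluation code behind the numbers in this card. The helper names are illustrative.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap: int, cand_total: int, ref_total: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int) -> float:
    """ROUGE-N F1 scaled to 0-100, using clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())
    return 100 * f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 scaled to 0-100, based on the longest common subsequence."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return 100 * f1(lcs_length(cand, ref), len(cand), len(ref))


# Queries taken from the examples table in this card.
target = "West Virginia long sleeve shirt"
before = "wvu long sleeve shirt"
after = "West Virginia long sleeve shirt"

for name, query in [("before", before), ("after", after)]:
    print(name, round(rouge_n(query, target, 1), 2),
          round(rouge_n(query, target, 2), 2),
          round(rouge_l(query, target), 2))
```

Under these assumptions the exact-match query scores 100 on every metric, while the pre-fine-tuning query loses points mainly because `wvu` shares no tokens with `West Virginia`.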

	<table border="1">
	<thead>
	<tr>
	<th>Epoch</th>
	<th>Step</th>
	<th>Loss</th>
	<th>Grad Norm</th>
	<th>Learning Rate</th>
	<th>Eval Loss</th>
	<th>ROUGE-1</th>
	<th>ROUGE-2</th>
	<th>ROUGE-L</th>
	<th>ROUGE-Lsum</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td>1.0</td>
	<td>8569</td>
	<td>0.7955</td>
	<td>2.9784</td>
	<td>4.9e-05</td>
	<td>0.6501</td>
	<td>75.3001</td>
	<td>55.0195</td>
	<td>74.6632</td>
	<td>74.6678</td>
	</tr>
	<tr>
	<td>2.0</td>
	<td>17138</td>
	<td>0.6595</td>
	<td>3.2943</td>
	<td>4.2e-05</td>
	<td>0.6293</td>
	<td>76.2210</td>
	<td>56.2050</td>
	<td>75.5728</td>
	<td>75.5670</td>
	</tr>
	<tr>
	<td>3.0</td>
	<td>25707</td>
	<td>0.5982</td>
	<td>4.0392</td>
	<td>3.5e-05</td>
	<td>0.6207</td>
	<td>76.5493</td>
	<td>56.7006</td>
	<td>75.8775</td>
	<td>75.8796</td>
	</tr>
	<tr>
	<td>4.0</td>
	<td>34276</td>
	<td>0.5552</td>
	<td>2.8237</td>
	<td>2.8e-05</td>
	<td>0.6267</td>
	<td>76.5433</td>
	<td>56.7025</td>
	<td>75.8319</td>
	<td>75.8343</td>
	</tr>
	<tr>
	<td>5.0</td>
	<td>42845</td>
	<td>0.5225</td>
	<td>2.7701</td>
	<td>2.1e-05</td>
	<td>0.6303</td>
	<td>76.7192</td>
	<td>56.9090</td>
	<td>75.9884</td>
	<td>75.9972</td>
	</tr>
	<tr>
	<td>6.0</td>
	<td>51414</td>
	<td>0.4974</td>
	<td>3.1344</td>
	<td>1.4e-05</td>
	<td>0.6316</td>
	<td>76.8851</td>
	<td>57.1349</td>
	<td>76.1420</td>
	<td>76.1484</td>
	</tr>
	<tr>
	<td>7.0</td>
	<td>59983</td>
	<td>0.4798</td>
	<td>3.5027</td>
	<td>7e-06</td>
	<td>0.6355</td>
	<td>76.8884</td>
	<td>57.1055</td>
	<td>76.1433</td>
	<td>76.1501</td>
	</tr>
	<tr>
	<td>8.0</td>
	<td>68552</td>
	<td>0.4674</td>
	<td>4.5172</td>
	<td>0.0</td>
	<td>0.6408</td>
	<td>76.9075</td>
	<td>57.0980</td>
	<td>76.1464</td>
	<td>76.1502</td>
	</tr>
	</tbody>
	</table>
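
The learning rates logged in the table are consistent with a linear decay from the configured `learning_rate` of 5.6e-05 down to zero over all training steps, with no warmup. The scheduler is not stated in this card, so the sketch below is an assumption that happens to reproduce the logged values; the constant and function names are illustrative.

```python
BASE_LR = 5.6e-05          # learning_rate from the hyperparameter list
STEPS_PER_EPOCH = 8569     # step count logged at the end of epoch 1
NUM_EPOCHS = 8             # num_train_epochs
TOTAL_STEPS = STEPS_PER_EPOCH * NUM_EPOCHS  # 68552, the final logged step


def linear_decay_lr(step: int) -> float:
    """Learning rate after `step` optimizer steps, assuming linear decay to zero and no warmup."""
    remaining = max(0.0, 1.0 - step / TOTAL_STEPS)
    return BASE_LR * remaining


for epoch in range(1, NUM_EPOCHS + 1):
    step = epoch * STEPS_PER_EPOCH
    print(f"epoch {epoch}: step {step}, lr {linear_decay_lr(step):.1e}")
```

Comparing the printed values with the Learning Rate column is a quick sanity check that the run followed the expected schedule.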

	<style>
	.model-analysis table {
	width: 100%;
	border-collapse: collapse;
	}
	.model-analysis td {
	padding: 10px;
	vertical-align: middle;
	}
	.model-analysis img {
	width: auto; /* Maintain aspect ratio */
	display: block;
	margin: 0 auto;
	max-height: 750px; /* Default height for most images */

	}
	</style>

	<div class="model-analysis">

	### Model Analysis
	<details><summary>Average scores by model </summary>
	<table style="width:100%"><tr>
	<td style="width:65%"><img src="average_scores_by_model.png" alt="image"></td>
	<td>

	```checkpoint-68552``` (T5-GenQ-TDE-v1) outperforms ```query-gen-msmarco-t5-base-v1``` across all ROUGE metrics.

	The most significant difference is in ROUGE-2, where ```checkpoint-68552``` scores 54.32% vs. 17.40% for the baseline model.</td></tr>
	</table>
	</details>

	<details><summary>Density comparison </summary>
	<table style="width:100%"><tr>
	<td style="width:65%"><img src="density_comparison.png" alt="image"></td>
	<td>

	```checkpoint-68552``` (T5-GenQ-TDE-v1) peaks near 100%, showing strong text overlap.

	```query-gen-msmarco-t5-base-v1``` has a wider distribution, with peaks in the low to mid-score range (10-40%), indicating greater variability but lower precision.

	ROUGE-2 has a high density at 0% for the baseline model, meaning many outputs lack bigram overlap.</td></tr>
	</table>
	</details>

	<details><summary>Histogram comparison </summary>
	<table style="width:100%"><tr>
	<td style="width:65%"><img src="histogram_comparison.png" alt="image"></td>
	<td>

	```checkpoint-68552``` (T5-GenQ-TDE-v1, blue) trends toward higher ROUGE scores, with a peak at 100%.

	```query-gen-msmarco-t5-base-v1``` (orange) has more low-score peaks, especially in ROUGE-2, reinforcing its lower precision.

	These histograms confirm ```checkpoint-68552``` consistently generates more accurate text.</td></tr>
	</table>
	</details>

	<details><summary>Scores by generated query length </summary>
	<table style="width:100%"><tr>
	<td style="width:65%"><img src="group_sizes.png" alt="image"></td>
	<td>
	Stable ROUGE scores (Sizes 3-9): All metrics remain consistently high.

	Score spike at 2 words: Indicates better alignment for short phrases, followed by stability.

	Score differences remain near zero for most sizes, meaning consistent model performance across phrase lengths.</td></tr>
	</table>
	</details>

	<details><summary>Semantic similarity distribution </summary>
	<table style="width:100%"><tr>
	<td style="width:65%"><img src="semantic_similarity_distribution.png" alt="image"></td>
	<td>
	This histogram visualizes the distribution of cosine similarity scores, which measure the semantic similarity between paired texts (generated query and target query).

	A strong peak near 1.0 suggests most pairs are highly semantically similar.

	Low similarity scores (0.0–0.4) are rare, meaning that few generated queries diverge strongly in meaning from their target queries.</td></tr>
	</table>
	</details>

	<details><summary>Semantic similarity score against ROUGE scores </summary>
	<table style="width:100%"><tr>
	<td style="width:65%"><img src="similarity_vs_rouge.png" alt="image"></td>
	<td>Higher similarity → Higher ROUGE scores, indicating strong correlation.

	ROUGE-1 & ROUGE-L show the strongest alignment, while ROUGE-2 has more variation.

	Some low-similarity outliers still achieve moderate ROUGE scores, suggesting surface-level overlap without deep semantic alignment.
	</td></tr>
	</table>
	</details>
	</div>
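
The semantic similarity scores in the analysis above are cosine similarities between the embeddings of a generated query and its target query. The embedding model is not named in this card, so the sketch below uses small hand-written vectors as stand-ins for real sentence embeddings; only the cosine formula itself is the point.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, for illustration only.
generated_embedding = [0.20, 0.70, 0.10]
target_embedding = [0.25, 0.65, 0.10]

similarity = cosine_similarity(generated_embedding, target_embedding)
print(f"cosine similarity: {similarity:.3f}")
```

Vectors that point in nearly the same direction score close to 1.0, which corresponds to the peak seen in the similarity histogram.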

	## More Information

	- Please visit the [GitHub Repository](https://github.com/smartcat-labs/product2query)

	## Authors

	- Mentor: [Milutin Studen](https://www.linkedin.com/in/milutin-studen/)
	- Engineers: [Petar Surla](https://www.linkedin.com/in/petar-surla-6448b6269/), [Andjela Radojevic](https://www.linkedin.com/in/an%C4%91ela-radojevi%C4%87-936197196/)

	## Model Card Contact

	For questions, please open an issue on the [GitHub Repository](https://github.com/smartcat-labs/product2query)