T5 Command Description Generator
This project fine-tunes a T5 model (t5-small) to generate descriptions of terminal commands based on prompts in the format "Describe the command: {name} in {source}". The model is trained on a dataset (all_commands.csv) containing command names, descriptions, and sources (e.g., cmd, linux, macos, vbscript). After fine-tuning, the model can generate descriptions for commands, such as "List information about file(s)" for ls in linux.
Table of Contents
- Overview
- Dataset
- Requirements
- Setup
- Fine-Tuning the Model
- Using the Model
- Example Output
- Troubleshooting
- Future Improvements
Overview
The T5 (Text-to-Text Transfer Transformer) model is fine-tuned to map prompts like "Describe the command: ls in linux" to descriptions like "List information about file(s)". The dataset used for training is all_commands.csv, which includes commands from various environments (cmd, linux, macos, vbscript). The fine-tuned model is saved to ./new_cmd_model and can be used to generate command descriptions interactively or programmatically.
Dataset
The dataset (all_commands.csv) contains the following columns:
- name: The command name (e.g., ls, dir, chmod, MsgBox).
- description: A brief description of what the command does (e.g., "List information about file(s)").
- source: The environment the command belongs to (cmd, linux, macos, vbscript).
Example entries:
name,description,source
ls,List information about file(s),linux
dir,Display a list of files and folders,cmd
chmod,Change access permissions,macos
MsgBox,Display a dialogue box message,vbscript
The dataset is split into 80% training and 20% validation sets for fine-tuning.
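As a minimal sketch of how rows in this format map to training pairs, the snippet below parses CSV text with the three columns above, skips rows with a missing field or an unknown source, and builds the prompt string. The helper names (load_rows, build_prompt) and the inline sample are illustrative assumptions, not part of the project's scripts.

```python
import csv
import io

# Sources listed in this README; anything else is treated as invalid here.
VALID_SOURCES = {"cmd", "linux", "macos", "vbscript"}


def load_rows(csv_text):
    """Parse CSV text and keep rows with a name, a description, and a known source."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get("name") or "").strip()
        description = (row.get("description") or "").strip()
        source = (row.get("source") or "").strip().lower()
        if name and description and source in VALID_SOURCES:
            rows.append({"name": name, "description": description, "source": source})
    return rows


def build_prompt(name, source):
    """Format the input prompt in the shape the model is trained on."""
    return f"Describe the command: {name} in {source}"


# Inline sample (hypothetical): the chmod row has an empty description on purpose.
SAMPLE = """name,description,source
ls,List information about file(s),linux
dir,Display a list of files and folders,cmd
chmod,,macos
MsgBox,Display a dialogue box message,vbscript
"""

rows = load_rows(SAMPLE)
pairs = [(build_prompt(r["name"], r["source"]), r["description"]) for r in rows]
for prompt, target in pairs:
    print(prompt, "->", target)
```

Running a check like this before training is one way to catch rows with missing descriptions early.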
Requirements
- Python 3.8+
- Libraries:
  - transformers
  - torch
  - sentencepiece
  - datasets
- CUDA-enabled GPU (optional, for faster training; the script sets fp16=torch.cuda.is_available(), which enables mixed precision when a GPU is available)
- Dataset file: all_commands.csv (place in the project directory)
Install dependencies:
pip install transformers torch sentencepiece datasets
Setup
Activate the Environment: Ensure you're in a Python environment with the required libraries. For example, using Conda:
conda activate safetensor_new
Prepare the Dataset: Place all_commands.csv in the project directory (e.g., C:\app\dataset).
Directory Structure:
C:\app\dataset\
├── all_commands.csv
├── new_cmd_model\ (created after fine-tuning)
└── fine_tune_script.py
Fine-Tuning the Model
The fine-tuning script (fine_tune_script.py) trains a t5-small model on the all_commands.csv dataset to generate command descriptions.
Script Overview
- Model: t5-small (can be upgraded to t5-base for better performance).
- Input Prompt: "Describe the command: {name} in {source}" (e.g., "Describe the command: ls in linux").
- Output: The command's description (e.g., "List information about file(s)").
- Training Parameters:
  - Epochs: 3
  - Learning rate: 5e-5
  - Batch size: 8
  - Output directory: ./new_cmd_model
  - Mixed precision training: Enabled if CUDA is available
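To get a rough feel for how long a run takes, the parameters above can be turned into an optimizer step count. This is a back-of-the-envelope sketch that assumes a single device, no gradient accumulation, and that the last partial batch is kept; the training-set size of 800 is a hypothetical figure, not the size of all_commands.csv.

```python
import math


def total_training_steps(n_train, batch_size=8, epochs=3):
    """Estimate optimizer steps for a run (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(n_train / batch_size)  # last partial batch still counts
    return steps_per_epoch * epochs


# Hypothetical example: 800 training rows after the 80/20 split.
print(total_training_steps(800))
```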
Running the Script
Save the following script as fine_tune_script.py and run it:
from transformers import T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
from datasets import load_dataset
import torch
# Load model and tokenizer
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)
# Load dataset
dataset = load_dataset("csv", data_files={"train": "all_commands.csv"})
dataset = dataset["train"].train_test_split(test_size=0.2)
dataset["validation"] = dataset["test"]
# Preprocess function
def preprocess_function(examples):
    inputs = [f"Describe the command: {name} in {source}" for name, source in zip(examples["name"], examples["source"])]
    targets = examples["description"]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=256, truncation=True, padding="max_length")
    # Replace padding token ids with -100 so the loss ignores padded label positions
    model_inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels["input_ids"]
    ]
    return model_inputs
# Apply preprocessing
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=dataset["train"].column_names)
# Training arguments
training_args = TrainingArguments(
    output_dir="./new_cmd_model",
    evaluation_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    fp16=torch.cuda.is_available(),
)
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)
# Train the model
trainer.train()
# Save the model and tokenizer
model.save_pretrained("./new_cmd_model")
tokenizer.save_pretrained("./new_cmd_model")
print("Fine-tuning complete. Model saved to './new_cmd_model'.")
Run the script:
python fine_tune_script.py
This will train the model and save it to ./new_cmd_model.
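Because the labels are padded to a fixed length, the padded positions are usually excluded from the loss by setting them to -100, the value that PyTorch's cross-entropy loss ignores by default. The sketch below illustrates that masking on plain Python lists. The token ids are made up for illustration, and the pad id of 0 matches T5's tokenizer; in a real script it would be read from tokenizer.pad_token_id.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss
PAD_TOKEN_ID = 0     # T5's pad token id; use tokenizer.pad_token_id in real code


def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace pad ids with IGNORE_INDEX so padded positions do not contribute to the loss."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]


# Illustrative (made-up) label ids, padded to length 5.
padded = [[8, 15, 1, 0, 0], [3, 1, 0, 0, 0]]
masked = mask_padding(padded)
print(masked)
```

Without this masking, most of the 256 label positions for a short description are padding, so the model is largely rewarded for predicting pad tokens.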
Using the Model
After fine-tuning, you can use the model to generate command descriptions with prompts like "Describe the command: {name} in {source}". Below is a script to load and use the model interactively or programmatically.
Usage Script
Save the following as use_t5_command_description.py:
import os
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch
from datetime import datetime
# Define model path
model_path = "./new_cmd_model"
# Check if model directory exists
if not os.path.exists(model_path):
    raise FileNotFoundError(f"Model directory '{model_path}' not found.")
# Load the fine-tuned model and tokenizer
try:
    model = T5ForConditionalGeneration.from_pretrained(model_path)
    tokenizer = T5Tokenizer.from_pretrained(model_path, legacy=False)
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] Model and tokenizer loaded successfully.")
except Exception as e:
    # Chain the original exception so the underlying cause stays in the traceback
    raise RuntimeError(f"Error loading model or tokenizer: {e}") from e
# Function to generate a command description
def generate_description(command, source, max_length=100):
    prompt = f"Describe the command: {command} in {source}"
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] Input prompt: {prompt}")
    inputs = tokenizer(prompt, return_tensors="pt", max_length=128, truncation=True)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    inputs = {key: value.to(device) for key, value in inputs.items()}
    print(f"[{datetime.now().strftime('%Y-%m-%d %H:%M:%S')}] Using device: {device}")
    try:
        outputs = model.generate(
            inputs["input_ids"],
            max_length=max_length,
            num_beams=4,
            length_penalty=1.0,
            early_stopping=True
        )
        description = tokenizer.decode(outputs[0], skip_special_tokens=True).strip()
        if not description:
            return "Warning: No description generated. Check if the command and source are valid."
        return description
    except Exception as e:
        return f"Error generating description: {str(e)}"
# Example usage
test_commands = [
    ("ls", "linux"),
    ("dir", "cmd"),
    ("chmod", "macos"),
    ("MsgBox", "vbscript")
]
print("\nGenerated Descriptions:")
print("-" * 50)
for command, source in test_commands:
    description = generate_description(command, source)
    print(f"Command: {command} ({source})")
    print(f"Description: {description}")
    print("-" * 50)
# Interactive mode
print("\nInteractive Mode: Enter a command and source to get its description.")
print("Valid sources: cmd, linux, macos, vbscript")
print("Type 'exit' to quit.\n")
while True:
    command = input("Enter command name (or 'exit' to quit): ").strip()
    if command.lower() == "exit":
        break
    source = input("Enter source (e.g., cmd, linux, macos, vbscript): ").strip().lower()
    valid_sources = ["cmd", "linux", "macos", "vbscript"]
    if source not in valid_sources:
        print(f"Invalid source. Please use one of: {', '.join(valid_sources)}")
        continue
    description = generate_description(command, source)
    print(f"\nCommand: {command} ({source})")
    print(f"Description: {description}")
    print("-" * 50)
print("Exiting interactive mode.")
Run the script:
python use_t5_command_description.py
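For programmatic use without the interactive loop, one option is a small wrapper that validates sources and collects results into a dictionary. The sketch below is an assumption about how such a wrapper could look: describe_all is a hypothetical helper, and fake_generate stands in for generate_description so the example runs without a trained model.

```python
VALID_SOURCES = ("cmd", "linux", "macos", "vbscript")


def describe_all(pairs, generate_fn):
    """Map (command, source) pairs to descriptions; unknown sources map to None."""
    results = {}
    for command, source in pairs:
        source = source.strip().lower()
        if source not in VALID_SOURCES:
            results[(command, source)] = None  # skipped: source not in the training data
            continue
        results[(command, source)] = generate_fn(command, source)
    return results


def fake_generate(command, source):
    """Stand-in for generate_description so this sketch runs without a model."""
    return f"<description of {command} on {source}>"


out = describe_all([("ls", "Linux"), ("foo", "beos")], fake_generate)
for key, value in out.items():
    print(key, "->", value)
```

In the usage script, passing generate_description as generate_fn would route each valid pair through the fine-tuned model.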
Example Output
After fine-tuning and running the usage script, you should see output like:
[2025-09-04 11:50:00] Model and tokenizer loaded successfully.
Generated Descriptions:
--------------------------------------------------
[2025-09-04 11:50:01] Input prompt: Describe the command: ls in linux
[2025-09-04 11:50:01] Using device: cuda
Command: ls (linux)
Description: List information about file(s)
--------------------------------------------------
[2025-09-04 11:50:02] Input prompt: Describe the command: dir in cmd
[2025-09-04 11:50:02] Using device: cuda
Command: dir (cmd)
Description: Display a list of files and folders
--------------------------------------------------
[2025-09-04 11:50:03] Input prompt: Describe the command: chmod in macos
[2025-09-04 11:50:03] Using device: cuda
Command: chmod (macos)
Description: Change access permissions
--------------------------------------------------
[2025-09-04 11:50:04] Input prompt: Describe the command: MsgBox in vbscript
[2025-09-04 11:50:04] Using device: cuda
Command: MsgBox (vbscript)
Description: Display a dialogue box message
--------------------------------------------------
Interactive Mode: Enter a command and source to get its description.
Valid sources: cmd, linux, macos, vbscript
Type 'exit' to quit.
Enter command name (or 'exit' to quit): ping
Enter source (e.g., cmd, linux, macos, vbscript): linux
[2025-09-04 11:50:05] Input prompt: Describe the command: ping in linux
[2025-09-04 11:50:05] Using device: cuda
Command: ping (linux)
Description: Test a network connection
--------------------------------------------------
Enter command name (or 'exit' to quit): exit
Exiting interactive mode.
Troubleshooting
- Empty Descriptions:
  - Ensure all_commands.csv has valid entries with no missing descriptions.
  - Increase num_train_epochs to 5–10 or use t5-base for better performance.
  - Check training logs in ./new_cmd_model for high loss values.
- Model Loading Issues:
  - Verify the model saved correctly in ./new_cmd_model.
  - Try loading a checkpoint (e.g., ./new_cmd_model/checkpoint-XXX) if issues persist.
- Environment Errors:
  - Ensure dependencies are installed: pip install transformers torch sentencepiece datasets.
  - For CUDA errors, ensure your GPU drivers are up-to-date or set fp16=False in the training script.
- Deprecation Warning:
  - The script uses evaluation_strategy, which is deprecated. Update to eval_strategy in newer transformers versions:
    training_args = TrainingArguments(
        output_dir="./new_cmd_model",
        eval_strategy="epoch",
        ...
    )
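If the same script has to run against both older and newer transformers releases, one possible approach to the evaluation_strategy rename is to inspect the constructor signature and pick whichever argument name exists. This is a sketch under that assumption; eval_strategy_kwarg is a hypothetical helper, and the two small classes below only mimic the old and new signatures so the example runs without transformers installed.

```python
import inspect


def eval_strategy_kwarg(args_cls):
    """Return the evaluation-strategy argument name accepted by args_cls.__init__."""
    params = inspect.signature(args_cls.__init__).parameters
    return "eval_strategy" if "eval_strategy" in params else "evaluation_strategy"


# Stand-ins that mimic the old and new argument names (not the real class).
class OldArgs:
    def __init__(self, output_dir, evaluation_strategy="no"):
        self.output_dir = output_dir


class NewArgs:
    def __init__(self, output_dir, eval_strategy="no"):
        self.output_dir = output_dir


print(eval_strategy_kwarg(OldArgs))
print(eval_strategy_kwarg(NewArgs))

# With transformers installed, usage might look like:
#   kwargs = {eval_strategy_kwarg(TrainingArguments): "epoch"}
#   training_args = TrainingArguments(output_dir="./new_cmd_model", **kwargs)
```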
Future Improvements
- Augment Dataset: Add more command descriptions or variations to improve generalization.
- Use Larger Model: Switch to t5-base for better accuracy (update model_name and retrain).
- Extend Task: Modify to generate commands from task descriptions (e.g., "List files in linux" → ls) by retraining with swapped inputs/outputs.
- Command Execution: Add functionality to execute generated commands (requires careful validation for security).
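The "Extend Task" idea amounts to swapping which column is the input and which is the target. A minimal sketch of building such reversed pairs is shown below; the prompt wording "Suggest a command: ..." and the helper name make_reverse_pairs are assumptions for illustration, since the project does not define a reverse prompt format.

```python
def make_reverse_pairs(rows):
    """Build (prompt, target) pairs where the description is the input and the name is the target."""
    pairs = []
    for row in rows:
        # Hypothetical reverse prompt format, mirroring the forward one.
        prompt = f"Suggest a command: {row['description']} in {row['source']}"
        pairs.append((prompt, row["name"]))
    return pairs


rows = [
    {"name": "ls", "description": "List information about file(s)", "source": "linux"},
    {"name": "dir", "description": "Display a list of files and folders", "source": "cmd"},
]
reverse_pairs = make_reverse_pairs(rows)
for prompt, target in reverse_pairs:
    print(prompt, "->", target)
```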