|
--- |
|
license: mit |
|
language: |
|
- ar |
|
base_model: |
|
- Qwen/Qwen2.5-1.5B-Instruct |
|
pipeline_tag: text2text-generation |
|
library_name: transformers |
|
tags: |
|
- Text-To-SQL |
|
- Arabic |
|
- Spider |
|
- SQL |
|
--- |
|
|
|
# Model Card for Arabic Text-To-SQL (OsamaMo) |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and trained using LoRA on Kaggle for 15 hours on a **P100 8GB GPU**. |
|
|
|
- **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo)) |
|
- **Funded by:** Self-funded |
|
- **Shared by:** Osama Mohamed |
|
- **Model type:** Text-to-SQL fine-tuned model |
|
- **Language(s):** Arabic (ar) |
|
- **License:** MIT |
|
- **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL) |
|
- **Dataset:** Spider (translated to Arabic) |
|
- **Training Script:** [LLaMA-Factory](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama_factory) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
This model is intended for converting **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications. |
|
|
|
### Downstream Use |
|
|
|
Can be fine-tuned further for specific databases or Arabic dialect adaptations. |
|
|
|
### Out-of-Scope Use |
|
|
|
- The model is **not** intended for direct execution of SQL queries. |
|
- Not recommended for non-database-related NLP tasks. |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
- The model might generate incorrect or non-optimized SQL queries. |
|
- Bias may exist due to dataset translations and model pretraining data. |
|
|
|
### Recommendations |
|
|
|
- Validate generated SQL queries before execution. |
|
- Ensure compatibility with specific database schemas. |
|
|
|
## How to Get Started with the Model |
|
### Load Model |
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
import torch |
|
import re |
|
|
|
device = "cuda" if torch.cuda.is_available() else "cpu" |
|
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct" |
|
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B" |
|
|
|
# Load the base model and adapter for fine-tuning |
|
model = AutoModelForCausalLM.from_pretrained( |
|
base_model_id, |
|
device_map="auto", |
|
torch_dtype=torch.bfloat16 |
|
) |
|
model.load_adapter(finetuned_model_id) |
|
|
|
tokenizer = AutoTokenizer.from_pretrained(base_model_id) |
|
|
|
def generate_resp(messages): |
|
text = tokenizer.apply_chat_template( |
|
messages, |
|
tokenize=False, |
|
add_generation_prompt=True |
|
) |
|
model_inputs = tokenizer([text], return_tensors="pt").to(device) |
|
generated_ids = model.generate( |
|
model_inputs.input_ids, |
|
max_new_tokens=1024, |
|
do_sample=False, temperature= False, |
|
) |
|
generated_ids = [ |
|
output_ids[len(input_ids):] |
|
for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) |
|
] |
|
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] |
|
return response |
|
``` |
|
|
|
|
|
|
|
### Example Usage |
|
```python |
|
|
|
# Production-ready system message for SQL generation |
|
system_message = ( |
|
"You are a highly advanced Arabic text-to-SQL converter. Your mission is to Understand first the db schema and reltions between it and then accurately transform Arabic " |
|
"natural language queries into SQL queries with precision and clarity.\n" |
|
) |
|
|
|
def get_sql_query(db_schema, arabic_query): |
|
# Construct the instruction message including the DB schema and the Arabic query |
|
instruction_message = "\n".join([ |
|
"## DB-Schema:", |
|
db_schema, |
|
"", |
|
"## User-Prompt:", |
|
arabic_query, |
|
"# Output SQL:", |
|
"```SQL" |
|
]) |
|
|
|
messages = [ |
|
{"role": "system", "content": system_message}, |
|
{"role": "user", "content": instruction_message} |
|
] |
|
|
|
response = generate_resp(messages) |
|
|
|
# Extract the SQL query from the response using a regex to capture text within the ```sql markdown block |
|
match = re.search(r"```sql\s*(.*?)\s*```", response, re.DOTALL | re.IGNORECASE) |
|
if match: |
|
sql_query = match.group(1).strip() |
|
return sql_query |
|
else: |
|
return response.strip() |
|
|
|
# Example usage: |
|
example_db_schema = r'''{ |
|
'Pharmcy': |
|
CREATE TABLE `purchase` ( |
|
`BARCODE` varchar(20) NOT NULL, |
|
`NAME` varchar(50) NOT NULL, |
|
`TYPE` varchar(20) NOT NULL, |
|
`COMPANY_NAME` varchar(20) NOT NULL, |
|
`QUANTITY` int NOT NULL, |
|
`PRICE` double NOT NULL, |
|
`AMOUNT` double NOT NULL, |
|
PRIMARY KEY (`BARCODE`), |
|
KEY `fkr3` (`COMPANY_NAME`), |
|
CONSTRAINT `fkr3` FOREIGN KEY (`COMPANY_NAME`) REFERENCES `company` (`NAME`) ON DELETE CASCADE ON UPDATE CASCADE |
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
|
|
|
CREATE TABLE `sales` ( |
|
`BARCODE` varchar(20) NOT NULL, |
|
`NAME` varchar(50) NOT NULL, |
|
`TYPE` varchar(10) NOT NULL, |
|
`DOSE` varchar(10) NOT NULL, |
|
`QUANTITY` int NOT NULL, |
|
`PRICE` double NOT NULL, |
|
`AMOUNT` double NOT NULL, |
|
`DATE` varchar(15) NOT NULL |
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
|
|
|
CREATE TABLE `users` ( |
|
`ID` int NOT NULL, |
|
`NAME` varchar(50) NOT NULL, |
|
`DOB` varchar(20) NOT NULL, |
|
`ADDRESS` varchar(100) NOT NULL, |
|
`PHONE` varchar(20) NOT NULL, |
|
`SALARY` double NOT NULL, |
|
`PASSWORD` varchar(20) NOT NULL, |
|
PRIMARY KEY (`ID`) |
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
|
|
|
CREATE TABLE `history_sales` ( |
|
`USER_NAME` varchar(20) NOT NULL, |
|
`BARCODE` varchar(20) NOT NULL, |
|
`NAME` varchar(50) NOT NULL, |
|
`TYPE` varchar(10) NOT NULL, |
|
`DOSE` varchar(10) NOT NULL, |
|
`QUANTITY` int NOT NULL, |
|
`PRICE` double NOT NULL, |
|
`AMOUNT` double NOT NULL, |
|
`DATE` varchar(15) NOT NULL, |
|
`TIME` varchar(20) NOT NULL |
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
|
|
|
CREATE TABLE `expiry` ( |
|
`PRODUCT_NAME` varchar(50) NOT NULL, |
|
`PRODUCT_CODE` varchar(20) NOT NULL, |
|
`DATE_OF_EXPIRY` varchar(10) NOT NULL, |
|
`QUANTITY_REMAIN` int NOT NULL |
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
|
|
|
CREATE TABLE `drugs` ( |
|
`NAME` varchar(50) NOT NULL, |
|
`TYPE` varchar(20) NOT NULL, |
|
`BARCODE` varchar(20) NOT NULL, |
|
`DOSE` varchar(10) NOT NULL, |
|
`CODE` varchar(10) NOT NULL, |
|
`COST_PRICE` double NOT NULL, |
|
`SELLING_PRICE` double NOT NULL, |
|
`EXPIRY` varchar(20) NOT NULL, |
|
`COMPANY_NAME` varchar(50) NOT NULL, |
|
`PRODUCTION_DATE` date NOT NULL, |
|
`EXPIRATION_DATE` date NOT NULL, |
|
`PLACE` varchar(20) NOT NULL, |
|
`QUANTITY` int NOT NULL, |
|
PRIMARY KEY (`BARCODE`) |
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
|
|
|
CREATE TABLE `company` ( |
|
`NAME` varchar(50) NOT NULL, |
|
`ADDRESS` varchar(50) NOT NULL, |
|
`PHONE` varchar(20) NOT NULL, |
|
PRIMARY KEY (`NAME`) |
|
) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
|
|
|
Answer the following questions about this schema: |
|
}''' |
|
|
|
example_arabic_query = "اريد الباركود الخاص بدواء يبداء اسمه بحرف 's'" |
|
|
|
sql_result = get_sql_query(example_db_schema, example_arabic_query) |
|
print("استعلام SQL الناتج:") |
|
print(sql_result) |
|
``` |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
- Dataset: **Spider (translated into Arabic)** |
|
- Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged. |
|
- Training format: |
|
- System instruction guiding Arabic-to-SQL conversion. |
|
- Database schema provided for context. |
|
- Arabic user queries mapped to correct SQL output. |
|
- Output is strictly formatted SQL queries enclosed in markdown code blocks. |
|
|
|
### Training Procedure |
|
|
|
#### Training Hyperparameters |
|
|
|
- **Batch size:** 1 (per device) |
|
- **Gradient accumulation:** 4 steps |
|
- **Learning rate:** 1.0e-4 |
|
- **Epochs:** 3 |
|
- **Scheduler:** Cosine |
|
- **Warmup ratio:** 0.1 |
|
- **Precision:** bf16 |
|
|
|
#### Speeds, Sizes, Times |
|
|
|
- **Training time:** 15 hours on **NVIDIA P100 8GB** |
|
- **Checkpointing every:** 500 steps |
|
|
|
## Evaluation |
|
|
|
### Testing Data |
|
|
|
- Validation dataset: Spider validation set (translated to Arabic) |
|
|
|
### Metrics |
|
|
|
- Exact Match (EM) for SQL correctness |
|
- Execution Accuracy (EX) on databases |
|
|
|
### Results |
|
|
|
- Model achieved **competitive SQL generation accuracy** for Arabic queries. |
|
- Further testing required for robustness. |
|
|
|
## Environmental Impact |
|
|
|
- **Hardware Type:** NVIDIA Tesla P100 8GB |
|
- **Hours used:** 15 |
|
- **Cloud Provider:** Kaggle |
|
- **Carbon Emitted:** Estimated using [ML Impact Calculator](https://mlco2.github.io/impact#compute) |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
- Transformer-based **Qwen2.5-1.5B** architecture. |
|
- Fine-tuned for Text-to-SQL task using LoRA. |
|
|
|
### Compute Infrastructure |
|
|
|
- **Hardware:** Kaggle P100 GPU (8GB VRAM) |
|
- **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub |
|
|
|
## Citation |
|
|
|
If you use this model, please cite: |
|
|
|
```bibtex |
|
@misc{OsamaMo_ArabicSQL, |
|
author = {Osama Mohamed}, |
|
title = {Arabic Text-To-SQL Model}, |
|
year = {2024}, |
|
howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}} |
|
} |
|
``` |
|
|
|
## Model Card Contact |
|
|
|
For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)). |
|
|
|
|