File size: 9,650 Bytes

---
license: mit
language:
- ar
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text2text-generation
library_name: transformers
tags:
- Text-To-SQL
- Arabic
- Spider
- SQL
---

# Model Card for Arabic Text-To-SQL (OsamaMo)

## Model Details

### Model Description

This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and trained using LoRA on Kaggle for 15 hours on a **P100 8GB GPU**.

- **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo))
- **Funded by:** Self-funded
- **Shared by:** Osama Mohamed
- **Model type:** Text-to-SQL fine-tuned model
- **Language(s):** Arabic (ar)
- **License:** MIT
- **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct

### Model Sources

- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL)
- **Dataset:** Spider (translated to Arabic)
- **Training Script:** [LLaMA-Factory](https://github.com/huggingface/transformers/tree/main/src/transformers/models/llama_factory)

## Uses

### Direct Use

This model is intended for converting **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications.

### Downstream Use

Can be fine-tuned further for specific databases or Arabic dialect adaptations.

### Out-of-Scope Use

- The model is **not** intended for direct execution of SQL queries.
- Not recommended for non-database-related NLP tasks.

## Bias, Risks, and Limitations

- The model might generate incorrect or non-optimized SQL queries.
- Bias may exist due to dataset translations and model pretraining data.

### Recommendations

- Validate generated SQL queries before execution.
- Ensure compatibility with specific database schemas.

## How to Get Started with the Model
### Load Model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import re

device = "cuda" if torch.cuda.is_available() else "cpu"
base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"

# Load the base model and adapter for fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
model.load_adapter(finetuned_model_id)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)

def generate_resp(messages):
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)
    generated_ids = model.generate(
        model_inputs.input_ids,
        max_new_tokens=1024,
        do_sample=False,  temperature= False,
    )
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response
```



### Example Usage 
```python

# Production-ready system message for SQL generation
system_message = (
    "You are a highly advanced Arabic text-to-SQL converter. Your mission is to Understand first the db schema and reltions between it and then accurately transform Arabic "
    "natural language queries into SQL queries with precision and clarity.\n"
)

def get_sql_query(db_schema, arabic_query):
    # Construct the instruction message including the DB schema and the Arabic query
    instruction_message = "\n".join([
        "## DB-Schema:",
        db_schema,
        "",
        "## User-Prompt:",
        arabic_query,
        "# Output SQL:",
        "```SQL"
    ])
    
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": instruction_message}
    ]
    
    response = generate_resp(messages)
    
    # Extract the SQL query from the response using a regex to capture text within the ```sql markdown block
    match = re.search(r"```sql\s*(.*?)\s*```", response, re.DOTALL | re.IGNORECASE)
    if match:
        sql_query = match.group(1).strip()
        return sql_query
    else:
        return response.strip()

# Example usage:
example_db_schema = r'''{
    'Pharmcy': 
          CREATE TABLE `purchase` (
            `BARCODE` varchar(20) NOT NULL,
            `NAME` varchar(50) NOT NULL,
            `TYPE` varchar(20) NOT NULL,
            `COMPANY_NAME` varchar(20) NOT NULL,
            `QUANTITY` int NOT NULL,
            `PRICE` double NOT NULL,
            `AMOUNT` double NOT NULL,
            PRIMARY KEY (`BARCODE`),
            KEY `fkr3` (`COMPANY_NAME`),
            CONSTRAINT `fkr3` FOREIGN KEY (`COMPANY_NAME`) REFERENCES `company` (`NAME`) ON DELETE CASCADE ON UPDATE CASCADE
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `sales` (
            `BARCODE` varchar(20) NOT NULL,
            `NAME` varchar(50) NOT NULL,
            `TYPE` varchar(10) NOT NULL,
            `DOSE` varchar(10) NOT NULL,
            `QUANTITY` int NOT NULL,
            `PRICE` double NOT NULL,
            `AMOUNT` double NOT NULL,
            `DATE` varchar(15) NOT NULL
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `users` (
            `ID` int NOT NULL,
            `NAME` varchar(50) NOT NULL,
            `DOB` varchar(20) NOT NULL,
            `ADDRESS` varchar(100) NOT NULL,
            `PHONE` varchar(20) NOT NULL,
            `SALARY` double NOT NULL,
            `PASSWORD` varchar(20) NOT NULL,
            PRIMARY KEY (`ID`)
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `history_sales` (
            `USER_NAME` varchar(20) NOT NULL,
            `BARCODE` varchar(20) NOT NULL,
            `NAME` varchar(50) NOT NULL,
            `TYPE` varchar(10) NOT NULL,
            `DOSE` varchar(10) NOT NULL,
            `QUANTITY` int NOT NULL,
            `PRICE` double NOT NULL,
            `AMOUNT` double NOT NULL,
            `DATE` varchar(15) NOT NULL,
            `TIME` varchar(20) NOT NULL
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `expiry` (
            `PRODUCT_NAME` varchar(50) NOT NULL,
            `PRODUCT_CODE` varchar(20) NOT NULL,
            `DATE_OF_EXPIRY` varchar(10) NOT NULL,
            `QUANTITY_REMAIN` int NOT NULL
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `drugs` (
            `NAME` varchar(50) NOT NULL,
            `TYPE` varchar(20) NOT NULL,
            `BARCODE` varchar(20) NOT NULL,
            `DOSE` varchar(10) NOT NULL,
            `CODE` varchar(10) NOT NULL,
            `COST_PRICE` double NOT NULL,
            `SELLING_PRICE` double NOT NULL,
            `EXPIRY` varchar(20) NOT NULL,
            `COMPANY_NAME` varchar(50) NOT NULL,
            `PRODUCTION_DATE` date NOT NULL,
            `EXPIRATION_DATE` date NOT NULL,
            `PLACE` varchar(20) NOT NULL,
            `QUANTITY` int NOT NULL,
            PRIMARY KEY (`BARCODE`)
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          CREATE TABLE `company` (
            `NAME` varchar(50) NOT NULL,
            `ADDRESS` varchar(50) NOT NULL,
            `PHONE` varchar(20) NOT NULL,
            PRIMARY KEY (`NAME`)
          ) ENGINE=InnoDB DEFAULT CHARSET=latin1

          Answer the following questions about this schema:
}'''

example_arabic_query = "اريد الباركود الخاص بدواء يبداء اسمه بحرف 's'"

sql_result = get_sql_query(example_db_schema, example_arabic_query)
print("استعلام SQL الناتج:")
print(sql_result)
```

## Training Details

### Training Data

- Dataset: **Spider (translated into Arabic)**
- Preprocessing: Questions converted to Arabic while keeping SQL queries unchanged.
- Training format:
  - System instruction guiding Arabic-to-SQL conversion.
  - Database schema provided for context.
  - Arabic user queries mapped to correct SQL output.
  - Output is strictly formatted SQL queries enclosed in markdown code blocks.

### Training Procedure

#### Training Hyperparameters

- **Batch size:** 1 (per device)
- **Gradient accumulation:** 4 steps
- **Learning rate:** 1.0e-4
- **Epochs:** 3
- **Scheduler:** Cosine
- **Warmup ratio:** 0.1
- **Precision:** bf16

#### Speeds, Sizes, Times

- **Training time:** 15 hours on **NVIDIA P100 8GB**
- **Checkpointing every:** 500 steps

## Evaluation

### Testing Data

- Validation dataset: Spider validation set (translated to Arabic)

### Metrics

- Exact Match (EM) for SQL correctness
- Execution Accuracy (EX) on databases

### Results

- Model achieved **competitive SQL generation accuracy** for Arabic queries.
- Further testing required for robustness.

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla P100 8GB
- **Hours used:** 15
- **Cloud Provider:** Kaggle
- **Carbon Emitted:** Estimated using [ML Impact Calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

- Transformer-based **Qwen2.5-1.5B** architecture.
- Fine-tuned for Text-to-SQL task using LoRA.

### Compute Infrastructure

- **Hardware:** Kaggle P100 GPU (8GB VRAM)
- **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub

## Citation

If you use this model, please cite:

```bibtex
@misc{OsamaMo_ArabicSQL,
  author = {Osama Mohamed},
  title = {Arabic Text-To-SQL Model},
  year = {2024},
  howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}}
}
```

## Model Card Contact

For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)).