---
license: mit
language:
- ar
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text2text-generation
library_name: transformers
tags:
- Text-To-SQL
- Arabic
- Spider
- SQL
---

# Model Card for Arabic Text-To-SQL (OsamaMo)

## Model Details

### Model Description

This model is fine-tuned on the Spider dataset with Arabic-translated questions for the Text-To-SQL task. It is based on **Qwen/Qwen2.5-1.5B-Instruct** and was trained with LoRA on Kaggle for 15 hours on a **P100 8GB GPU**.

- **Developed by:** Osama Mohamed ([OsamaMo](https://huggingface.co/OsamaMo))
- **Funded by:** Self-funded
- **Shared by:** Osama Mohamed
- **Model type:** Text-to-SQL fine-tuned model
- **Language(s):** Arabic (ar)
- **License:** MIT
- **Finetuned from:** Qwen/Qwen2.5-1.5B-Instruct

### Model Sources

- **Repository:** [Hugging Face Model Hub](https://huggingface.co/OsamaMo/Arabic_Text-To-SQL)
- **Dataset:** Spider (questions translated into Arabic)
- **Training framework:** [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory)

## Uses

### Direct Use

The model converts **Arabic natural language questions** into SQL queries. It can be used for database querying in Arabic-speaking applications.

### Downstream Use

It can be fine-tuned further for specific databases or adapted to particular Arabic dialects.

### Out-of-Scope Use

- The model does **not** execute SQL queries; it only generates them.
- It is not recommended for NLP tasks unrelated to databases.

## Bias, Risks, and Limitations

- The model may generate incorrect or non-optimized SQL queries.
- Bias may be present due to the translated dataset and the base model's pretraining data.

### Recommendations

- Validate generated SQL queries before executing them.
- Make sure the schema you pass in matches the target database.

## How to Get Started with the Model

### Load Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import re

device = "cuda" if torch.cuda.is_available() else "cpu"

base_model_id = "Qwen/Qwen2.5-1.5B-Instruct"
finetuned_model_id = "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B"

# Load the base model and attach the fine-tuned LoRA adapter
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
model.load_adapter(finetuned_model_id)

tokenizer = AutoTokenizer.from_pretrained(base_model_id)


def generate_resp(messages):
    # Build the chat prompt and generate a deterministic (greedy) completion
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(device)

    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1024,
        do_sample=False,
    )
    # Keep only the newly generated tokens (strip the prompt)
    generated_ids = [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
    ]
    response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return response
```
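Because the released checkpoint is a LoRA adapter, you can optionally merge it into the base weights, for example to export a standalone model or to skip the adapter indirection at inference time. Below is a minimal sketch assuming the checkpoint is a standard PEFT LoRA adapter; the output directory name is only an example.

```python
# Optional: merge the LoRA adapter into the base model (sketch; assumes a standard PEFT checkpoint)
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
peft_model = PeftModel.from_pretrained(base, "OsamaMo/Arabic_Text-To-SQL_using_Qwen2.5-1.5B")
merged = peft_model.merge_and_unload()            # fold the LoRA deltas into the base weights
merged.save_pretrained("arabic-text2sql-merged")  # example output directory
```

The merged directory can then be loaded with `AutoModelForCausalLM.from_pretrained` like any ordinary checkpoint.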
### Example Usage

```python
# System message guiding Arabic-to-SQL conversion (kept close to the prompt used during fine-tuning)
system_message = (
    "You are a highly advanced Arabic text-to-SQL converter. "
    "Your mission is to first understand the DB schema and the relations within it, and then accurately transform Arabic "
    "natural language queries into SQL queries with precision and clarity.\n"
)


def get_sql_query(db_schema, arabic_query):
    # Build the instruction message containing the DB schema and the Arabic question
    instruction_message = "\n".join([
        "## DB-Schema:",
        db_schema,
        "",
        "## User-Prompt:",
        arabic_query,
        "# Output SQL:",
        "```SQL"
    ])

    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": instruction_message}
    ]

    response = generate_resp(messages)

    # Extract the SQL query from the ```sql markdown block in the response
    match = re.search(r"```sql\s*(.*?)\s*```", response, re.DOTALL | re.IGNORECASE)
    if match:
        return match.group(1).strip()
    return response.strip()


# Example usage:
example_db_schema = r'''{
'Pharmcy':
CREATE TABLE `purchase` (
  `BARCODE` varchar(20) NOT NULL,
  `NAME` varchar(50) NOT NULL,
  `TYPE` varchar(20) NOT NULL,
  `COMPANY_NAME` varchar(20) NOT NULL,
  `QUANTITY` int NOT NULL,
  `PRICE` double NOT NULL,
  `AMOUNT` double NOT NULL,
  PRIMARY KEY (`BARCODE`),
  KEY `fkr3` (`COMPANY_NAME`),
  CONSTRAINT `fkr3` FOREIGN KEY (`COMPANY_NAME`) REFERENCES `company` (`NAME`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `sales` (
  `BARCODE` varchar(20) NOT NULL,
  `NAME` varchar(50) NOT NULL,
  `TYPE` varchar(10) NOT NULL,
  `DOSE` varchar(10) NOT NULL,
  `QUANTITY` int NOT NULL,
  `PRICE` double NOT NULL,
  `AMOUNT` double NOT NULL,
  `DATE` varchar(15) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `users` (
  `ID` int NOT NULL,
  `NAME` varchar(50) NOT NULL,
  `DOB` varchar(20) NOT NULL,
  `ADDRESS` varchar(100) NOT NULL,
  `PHONE` varchar(20) NOT NULL,
  `SALARY` double NOT NULL,
  `PASSWORD` varchar(20) NOT NULL,
  PRIMARY KEY (`ID`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `history_sales` (
  `USER_NAME` varchar(20) NOT NULL,
  `BARCODE` varchar(20) NOT NULL,
  `NAME` varchar(50) NOT NULL,
  `TYPE` varchar(10) NOT NULL,
  `DOSE` varchar(10) NOT NULL,
  `QUANTITY` int NOT NULL,
  `PRICE` double NOT NULL,
  `AMOUNT` double NOT NULL,
  `DATE` varchar(15) NOT NULL,
  `TIME` varchar(20) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `expiry` (
  `PRODUCT_NAME` varchar(50) NOT NULL,
  `PRODUCT_CODE` varchar(20) NOT NULL,
  `DATE_OF_EXPIRY` varchar(10) NOT NULL,
  `QUANTITY_REMAIN` int NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `drugs` (
  `NAME` varchar(50) NOT NULL,
  `TYPE` varchar(20) NOT NULL,
  `BARCODE` varchar(20) NOT NULL,
  `DOSE` varchar(10) NOT NULL,
  `CODE` varchar(10) NOT NULL,
  `COST_PRICE` double NOT NULL,
  `SELLING_PRICE` double NOT NULL,
  `EXPIRY` varchar(20) NOT NULL,
  `COMPANY_NAME` varchar(50) NOT NULL,
  `PRODUCTION_DATE` date NOT NULL,
  `EXPIRATION_DATE` date NOT NULL,
  `PLACE` varchar(20) NOT NULL,
  `QUANTITY` int NOT NULL,
  PRIMARY KEY (`BARCODE`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `company` (
  `NAME` varchar(50) NOT NULL,
  `ADDRESS` varchar(50) NOT NULL,
  `PHONE` varchar(20) NOT NULL,
  PRIMARY KEY (`NAME`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Answer the following questions about this schema:
}'''

# "I want the barcode of the drug whose name starts with the letter 's'"
example_arabic_query = "اريد الباركود الخاص بدواء يبدأ اسمه بحرف 's'"

sql_result = get_sql_query(example_db_schema, example_arabic_query)
print("Generated SQL query:")
print(sql_result)
```

## Training Details

### Training Data

- **Dataset:** Spider, with the questions translated into Arabic.
- **Preprocessing:** Questions were translated into Arabic while the SQL queries were left unchanged.
- **Training format** (a sketch of one record follows this list):
  - A system instruction guiding Arabic-to-SQL conversion.
  - The database schema provided as context.
  - The Arabic user question mapped to its gold SQL query.
  - The target output is the SQL query enclosed in a markdown code block.
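For illustration only, here is a minimal sketch of how one translated Spider example could be serialized as an Alpaca-style record for LLaMA-Factory. The field names, file name, and sample question are assumptions; the actual dataset configuration is not published in this card.

```python
import json

# Hypothetical layout of a single training record (field names assumed)
record = {
    "system": "You are a highly advanced Arabic text-to-SQL converter. ...",  # system prompt as shown above
    "instruction": "\n".join([
        "## DB-Schema:",
        "CREATE TABLE singer (singer_id INT, name TEXT, age INT)",  # illustrative schema excerpt
        "",
        "## User-Prompt:",
        "كم عدد المغنين؟",  # Arabic for "How many singers are there?"
        "# Output SQL:",
        "```SQL",
    ]),
    "output": "SELECT count(*) FROM singer\n```",  # gold SQL, closing the markdown block
}

with open("spider_arabic_train.json", "w", encoding="utf-8") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```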
### Training Procedure

#### Training Hyperparameters

- **Batch size:** 1 (per device)
- **Gradient accumulation:** 4 steps
- **Learning rate:** 1.0e-4
- **Epochs:** 3
- **Scheduler:** Cosine
- **Warmup ratio:** 0.1
- **Precision:** bf16

#### Speeds, Sizes, Times

- **Training time:** 15 hours on an **NVIDIA P100 8GB** GPU
- **Checkpointing:** every 500 steps

## Evaluation

### Testing Data

- Validation set: Spider development set (questions translated into Arabic)

### Metrics

- Exact Match (EM) of the generated SQL against the gold query (a simplified, normalization-based sketch appears at the end of this card)
- Execution Accuracy (EX) against the target databases

### Results

- The model achieves competitive SQL-generation accuracy on Arabic questions, though no quantitative scores are reported in this card.
- Further testing is needed to assess robustness.

## Environmental Impact

- **Hardware Type:** NVIDIA Tesla P100 8GB
- **Hours used:** 15
- **Cloud Provider:** Kaggle
- **Carbon Emitted:** Not reported; it can be estimated with the [ML Impact Calculator](https://mlco2.github.io/impact#compute)

## Technical Specifications

### Model Architecture and Objective

- Transformer-based **Qwen2.5-1.5B** architecture.
- Fine-tuned with LoRA for the Text-to-SQL objective.

### Compute Infrastructure

- **Hardware:** Kaggle P100 GPU (8GB VRAM)
- **Software:** Python, Transformers, LLaMA-Factory, Hugging Face Hub

## Citation

If you use this model, please cite:

```bibtex
@misc{OsamaMo_ArabicSQL,
  author       = {Osama Mohamed},
  title        = {Arabic Text-To-SQL Model},
  year         = {2024},
  howpublished = {\url{https://huggingface.co/OsamaMo/Arabic_Text-To-SQL}}
}
```

## Model Card Contact

For questions, contact **Osama Mohamed** via Hugging Face ([OsamaMo](https://huggingface.co/OsamaMo)).
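As referenced in the Evaluation section, here is a minimal, simplified sketch of a normalization-based exact-match check. The official Spider EM metric compares parsed query components rather than raw strings, so this is only a rough approximation.

```python
import re

def normalize_sql(query: str) -> str:
    """Lower-case and collapse whitespace so superficial formatting differences are ignored."""
    return re.sub(r"\s+", " ", query.strip().lower())

def exact_match(predicted: str, gold: str) -> bool:
    # Simplified exact match: True when the normalized strings are identical
    return normalize_sql(predicted) == normalize_sql(gold)

print(exact_match("SELECT  count(*) FROM singer", "select count(*) from singer"))  # True
```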