SURESHBEEKHANI
/

Llama_3_2_3B_SFT_GGUF

@@ -1,166 +1,67 @@
-# **Fine-Tuning Meta-Llama-3.2-3B with Unsloth for CPU and GPU Inference - GGML**
-## **Overview**
-On **September 25, 2024**, Meta released the **Llama 3.2** series, featuring highly optimized multilingual language models in 1B and 3B parameter configurations. These models excel in multilingual dialogue tasks, summarization, and agentic retrieval, supporting extensive text processing with a **128K token context length**.
-This repository demonstrates fine-tuning the **Meta-Llama-3.2-3B** model using **Unsloth** for efficient training and inference. It also includes steps to convert the model into **GGML format**, enabling memory-efficient deployment on CPUs and GPUs.
 ---
-## **Table of Contents**
-1. [Key Features](#key-features)
-2. [Setup and Installation](#setup-and-installation)
-3. [Fine-Tuning Workflow](#fine-tuning-workflow)
-4. [Data Preparation](#data-preparation)
-5. [Training the Model](#training-the-model)
-6. [Model Conversion to GGML](#model-conversion-to-ggml)
 ---
-## **Key Features**
-- **Low-Rank Adaptation (LoRA):** Enables efficient parameter fine-tuning, reducing training costs.
-- **Memory Optimization:** Supports **4-bit quantization** for memory-constrained environments.
-- **Fast Processing:** Includes gradient checkpointing and optimized data handling for faster inference.
-- **Extended Context Length:** Handles input sequences up to **128K tokens** for large document processing.
-- **Versatile Applications:** Ideal for dialogue systems, summarization, and knowledge retrieval tasks.
----
-## **Setup and Installation**
-### **Step 1: Install Dependencies**
-Install the necessary packages, including the latest version of **Unsloth** for enhanced fine-tuning efficiency.
-```bash
-%%capture
-!pip install unsloth
-# Install the latest nightly version of Unsloth
-!pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
-```
----
-### **Step 2: Load the Model and Tokenizer**
-The following code initializes the Llama-3.2 model and tokenizer:
-```python
-from unsloth import FastLanguageModel
-import torch
-# Configuration settings
-max_seq_length = 2048  # Maximum sequence length
-dtype = None  # Automatically detects dtype; Float16 for T4, Bfloat16 for Ampere+
-load_in_4bit = True  # Use 4-bit quantization for memory efficiency
-# Load the model and tokenizer
-model, tokenizer = FastLanguageModel.from_pretrained(
-    model_name="unsloth/Llama-3.2-3B-Instruct",
-    max_seq_length=max_seq_length,
-    dtype=dtype,
-    load_in_4bit=load_in_4bit,
-)
-```
----
-## **Fine-Tuning Workflow**
-### **LoRA Fine-Tuning with Unsloth**
-Use LoRA adapters to fine-tune only a small subset of model parameters:
-```python
-model = FastLanguageModel.get_peft_model(
-    model,
-    r=16,  # Rank for LoRA; options: 8, 16, 32, etc.
-    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
-    lora_alpha=16,
-    lora_dropout=0,
-    bias="none",
-    use_gradient_checkpointing="unsloth",  # Enable optimized checkpointing
-    random_state=3407,
-)
-```
----
-## **Data Preparation**
-Prepare your dataset in **ShareGPT-style** conversation format using the `unsloth.chat_templates` module:
-```python
-from unsloth.chat_templates import get_chat_template
-from datasets import load_dataset
-# Apply the chat template
-tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
-def formatting_prompts_func(examples):
-    convos = examples["conversations"]
-    texts = [
-        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
-        for convo in convos
-    ]
-    return {"text": texts}
-# Load and prepare the dataset
-dataset = load_dataset("mlabonne/FineTome-100k", split="train")
-dataset = dataset.select(range(500))  # Use a subset for quick testing
-from unsloth.chat_templates import standardize_sharegpt
-dataset = standardize_sharegpt(dataset)
-dataset = dataset.map(formatting_prompts_func, batched=True)
-```
----
-## **Training the Model**
-### **SFT Training with TRL**
-Fine-tune the model using Hugging Face's TRL library:
-```python
-from trl import SFTTrainer
-from transformers import TrainingArguments, DataCollatorForSeq2Seq
-from unsloth import is_bfloat16_supported
-trainer = SFTTrainer(
-    model=model,
-    tokenizer=tokenizer,
-    train_dataset=dataset,
-    dataset_text_field="text",
-    max_seq_length=max_seq_length,
-    data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
-    dataset_num_proc=2,
-    packing=False,
-    args=TrainingArguments(
-        per_device_train_batch_size=2,
-        gradient_accumulation_steps=4,
-        warmup_steps=5,
-        max_steps=60,
-        learning_rate=2e-4,
-        fp16=not is_bfloat16_supported(),
-        bf16=is_bfloat16_supported(),
-        logging_steps=1,
-        optim="adamw_8bit",
-        weight_decay=0.01,
-        lr_scheduler_type="linear",
-        seed=3407,
-        output_dir="outputs",
-        report_to="none",
-    ),
-)
-# Train on assistant responses only
-from unsloth.chat_templates import train_on_responses_only
-trainer = train_on_responses_only(
-    trainer,
-    instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
-    response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
-)
-```
----
-## **Model Conversion to GGML**
-Convert the fine-tuned model into GGML format for memory-efficient inference:
-```bash
-python -m unsloth.export_ggml --model outputs --output llama3.2-3b.ggml
-```
----
-## **License**
-This project is distributed under the Apache License 2.0. See [LICENSE](LICENSE) for more details.

 ---
+license: mit
+datasets:
+- mlabonne/FineTome-100k
+language:
+- en
+base_model:
+- unsloth/Llama-3.2-3B-Instruct
+pipeline_tag: question-answering
 ---
+# Llama-3.2-3B-Instruct Fine-Tuning on Custom Dataset
+## Overview
+This repository demonstrates fine-tuning the **Llama-3.2-3B-Instruct** model with the **Unsloth** library. The model is trained on the first 500 records of the **FineTome-100k** dataset for **60 steps**. Key optimizations include:
+- **4-bit quantization** to reduce memory usage
+- **LoRA (Low-Rank Adaptation)** for efficient fine-tuning
+- Techniques for improving inference speed and generating text with the model
+## Model Details
+- **Model Name**: Llama-3.2-3B-Instruct
+- **Pretrained Weights**: `unsloth/Llama-3.2-3B-Instruct`, Unsloth’s copy of the Llama-3.2-3B-Instruct weights
+- **Quantization**: 4-bit quantization (set via `load_in_4bit=True`) for reduced memory usage
+### LoRA Configuration:
+- **Rank**: 16
+- **Target Modules**:
+  - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
+- **LoRA Alpha**: 16
+- **LoRA Dropout**: 0
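To make the LoRA settings above concrete, here is a minimal sketch that counts the parameters the adapters add at rank 16. The layer dimensions (hidden size 3072, key/value width 1024, MLP width 8192, 28 layers) are assumptions taken from the published Llama-3.2-3B configuration, not from this model card, and the helper names are hypothetical.

```python
def lora_params_per_module(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two matrices next to a d_in -> d_out linear layer:
    A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed Llama-3.2-3B dimensions (not stated in this model card).
HIDDEN = 3072
KV_DIM = 1024         # 8 key/value heads x head size 128
INTERMEDIATE = 8192
NUM_LAYERS = 28

# (input dim, output dim) for each targeted projection in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def count_lora_params(rank: int) -> int:
    """Total adapter parameters across all layers and target modules."""
    per_layer = sum(
        lora_params_per_module(d_in, d_out, rank)
        for d_in, d_out in TARGET_SHAPES.values()
    )
    return per_layer * NUM_LAYERS


trainable = count_lora_params(rank=16)
scaling = 16 / 16  # lora_alpha / rank: the factor applied to the adapter output

print(f"Trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumptions the adapters hold roughly 24 million parameters, under 1% of the 3B base model, which is why only a small fraction of the weights needs gradients and optimizer state.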
+### Gradient Checkpointing:
+- **Use Gradient Checkpointing**: `use_gradient_checkpointing="unsloth"`, which reduces memory use during long-context training
+## Training
+- **Dataset**: FineTome-100k (first 500 records selected)
+- **Loss Function**: Cross-entropy over next-token predictions, the standard loss for causal language model fine-tuning
+- **Training Steps**: 60 steps with a per-device batch size of 2 and 4 gradient accumulation steps (effective batch size of 8)
+- **Optimizer**: AdamW 8-bit
+### Training Parameters:
+- **Max Sequence Length**: 2048 tokens
+- **Learning Rate**: 2e-4
+- **Gradient Accumulation Steps**: 4
+- **Total Steps**: 60
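+- **Epochs**: Just under 1 (`max_steps=60` at an effective batch size of 8 covers 480 of the 500 selected records)
+- **Training Precision**: FP16 or BF16, chosen by GPU support (the Tesla T4 does not support BF16, so this run uses FP16)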
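The relationship between batch size, gradient accumulation, and `max_steps` listed above can be checked with a few lines of arithmetic. This is a small sketch assuming a single GPU; the function name `training_coverage` is hypothetical.

```python
def training_coverage(per_device_batch: int, grad_accum: int, max_steps: int,
                      num_records: int, num_devices: int = 1):
    """Return (effective batch size, samples seen, fraction of an epoch)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    samples_seen = effective_batch * max_steps
    return effective_batch, samples_seen, samples_seen / num_records


# Values from the training configuration: batch size 2, accumulation 4,
# 60 steps, 500 selected records. A single GPU is assumed.
effective_batch, samples_seen, epochs = training_coverage(2, 4, 60, 500)

print(f"Effective batch size: {effective_batch}")
print(f"Samples seen: {samples_seen}")
print(f"Epochs completed: {epochs:.2f}")
```

With these values the run stops slightly before a full pass over the 500-record subset, because `max_steps` ends training regardless of the epoch count.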
+## Performance
+- **GPU Used**: Tesla T4 (14.7 GB max memory)
+### Peak Memory Usage:
+- **Total Reserved Memory**: 3.855 GB
+- **Memory Used for LoRA**: 1.312 GB
+- **Memory Utilization**: 26.1% (peak) of available memory
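The utilization figures follow from dividing each reserved amount by the GPU's total memory. The sketch below redoes that arithmetic with the rounded values listed above; the helper name `percent_of` is hypothetical. The reported 26.1% was presumably computed from an unrounded total, so the rounded 14.7 GB figure gives a slightly different last digit.

```python
def percent_of(part_gb: float, total_gb: float) -> float:
    """Share of total GPU memory, as a percentage rounded to one decimal."""
    return round(part_gb / total_gb * 100, 1)


TOTAL_GB = 14.7          # Tesla T4 max memory, as rounded in this card
PEAK_RESERVED_GB = 3.855
LORA_GB = 1.312

peak_pct = percent_of(PEAK_RESERVED_GB, TOTAL_GB)
lora_pct = percent_of(LORA_GB, TOTAL_GB)

print(f"Peak reserved: {peak_pct}% of total memory")
print(f"LoRA training: {lora_pct}% of total memory")
```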
+## Conclusion
+This notebook shows a memory-efficient approach to fine-tuning a 3B-parameter language model using **LoRA** and **4-bit quantization**. With the **Unsloth** library, training peaked at under 4 GB of reserved GPU memory, which makes this setup practical on a single Tesla T4 and similarly resource-limited GPUs.
+## Notebook
+Access the implementation notebook for this model [here](https://github.com/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/Llama_3_2_3B_SFT_GGUF.ipynb). This notebook provides detailed steps for fine-tuning and deploying the model.