SURESHBEEKHANI committed on
Commit 0840525 (verified)
1 Parent(s): 689d87e

Update README.md

Files changed (1)
  1. README.md +49 -148
README.md CHANGED
@@ -1,166 +1,67 @@
- # **Fine-Tuning Meta-Llama-3.2-3B with Unsloth for CPU and GPU Inference - GGML**
-
- ## **Overview**
- On **September 25, 2024**, Meta released the **Llama 3.2** series, featuring highly optimized multilingual language models in 1B and 3B parameter configurations. These models excel in multilingual dialogue tasks, summarization, and agentic retrieval, supporting extensive text processing with a **128K token context length**.
-
- This repository demonstrates fine-tuning the **Meta-Llama-3.2-3B** model using **Unsloth** for efficient training and inference. It also includes steps to convert the model into **GGML format**, enabling memory-efficient deployment on CPUs and GPUs.
-
  ---
-
- ## **Table of Contents**
- 1. [Key Features](#key-features)
- 2. [Setup and Installation](#setup-and-installation)
- 3. [Fine-Tuning Workflow](#fine-tuning-workflow)
- 4. [Data Preparation](#data-preparation)
- 5. [Training the Model](#training-the-model)
- 6. [Model Conversion to GGML](#model-conversion-to-ggml)
-
  ---
 

- ## **Key Features**
- - **Low-Rank Adaptation (LoRA):** Enables efficient parameter fine-tuning, reducing training costs.
- - **Memory Optimization:** Supports **4-bit quantization** for memory-constrained environments.
- - **Fast Processing:** Includes gradient checkpointing and optimized data handling for faster inference.
- - **Extended Context Length:** Handles input sequences up to **128K tokens** for large document processing.
- - **Versatile Applications:** Ideal for dialogue systems, summarization, and knowledge retrieval tasks.

- ---

- ## **Setup and Installation**

- ### **Step 1: Install Dependencies**
- Install the necessary packages, including the latest version of **Unsloth** for enhanced fine-tuning efficiency.
- ```bash
- %%capture
- !pip install unsloth
- # Install the latest nightly version of Unsloth
- !pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
- ```

- ---

- ### **Step 2: Load the Model and Tokenizer**
- The following code initializes the Llama-3.2 model and tokenizer:
- ```python
- from unsloth import FastLanguageModel
- import torch
-
- # Configuration settings
- max_seq_length = 2048 # Maximum sequence length
- dtype = None # Automatically detects dtype; Float16 for T4, Bfloat16 for Ampere+
- load_in_4bit = True # Use 4-bit quantization for memory efficiency
-
- # Load the model and tokenizer
- model, tokenizer = FastLanguageModel.from_pretrained(
-     model_name="unsloth/Llama-3.2-3B-Instruct",
-     max_seq_length=max_seq_length,
-     dtype=dtype,
-     load_in_4bit=load_in_4bit,
- )
- ```

- ---

- ## **Fine-Tuning Workflow**
-
- ### **LoRA Fine-Tuning with Unsloth**
- Use LoRA adapters to fine-tune only a small subset of model parameters:
- ```python
- model = FastLanguageModel.get_peft_model(
-     model,
-     r=16, # Rank for LoRA; options: 8, 16, 32, etc.
-     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
-     lora_alpha=16,
-     lora_dropout=0,
-     bias="none",
-     use_gradient_checkpointing="unsloth", # Enable optimized checkpointing
-     random_state=3407,
- )
- ```

- ---

- ## **Data Preparation**
- Prepare your dataset in **ShareGPT-style** conversation format using the `unsloth.chat_templates` module:
- ```python
- from unsloth.chat_templates import get_chat_template
- from datasets import load_dataset
-
- # Apply the chat template
- tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
-
- def formatting_prompts_func(examples):
-     convos = examples["conversations"]
-     texts = [
-         tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
-         for convo in convos
-     ]
-     return {"text": texts}
-
- # Load and prepare the dataset
- dataset = load_dataset("mlabonne/FineTome-100k", split="train")
- dataset = dataset.select(range(500)) # Use a subset for quick testing
- from unsloth.chat_templates import standardize_sharegpt
- dataset = standardize_sharegpt(dataset)
- dataset = dataset.map(formatting_prompts_func, batched=True)
- ```

- ---

- ## **Training the Model**
-
- ### **SFT Training with TRL**
- Fine-tune the model using Hugging Face's TRL library:
- ```python
- from trl import SFTTrainer
- from transformers import TrainingArguments, DataCollatorForSeq2Seq
- from unsloth import is_bfloat16_supported
-
- trainer = SFTTrainer(
-     model=model,
-     tokenizer=tokenizer,
-     train_dataset=dataset,
-     dataset_text_field="text",
-     max_seq_length=max_seq_length,
-     data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
-     dataset_num_proc=2,
-     packing=False,
-     args=TrainingArguments(
-         per_device_train_batch_size=2,
-         gradient_accumulation_steps=4,
-         warmup_steps=5,
-         max_steps=60,
-         learning_rate=2e-4,
-         fp16=not is_bfloat16_supported(),
-         bf16=is_bfloat16_supported(),
-         logging_steps=1,
-         optim="adamw_8bit",
-         weight_decay=0.01,
-         lr_scheduler_type="linear",
-         seed=3407,
-         output_dir="outputs",
-         report_to="none",
-     ),
- )
-
- # Train on assistant responses only
- from unsloth.chat_templates import train_on_responses_only
- trainer = train_on_responses_only(
-     trainer,
-     instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
-     response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
- )
- ```

- ---

- ## **Model Conversion to GGML**
- Convert the fine-tuned model into GGML format for memory-efficient inference:
- ```bash
- python -m unsloth.export_ggml --model outputs --output llama3.2-3b.ggml
- ```

- ---

- ## **License**
- This project is distributed under the Apache License 2.0. See [LICENSE](LICENSE) for more details.

  ---
+ license: mit
+ datasets:
+ - mlabonne/FineTome-100k
+ language:
+ - en
+ base_model:
+ - unsloth/Llama-3.2-3B-Instruct
+ pipeline_tag: question-answering

  ---
+ # Llama-3.2-3B-Instruct Fine-Tuning on Custom Dataset

+ ## Overview

+ This repository demonstrates fine-tuning the **Llama-3.2-3B-Instruct** model with the **Unsloth** library. The model is trained for **60 steps** on the **FineTome-100k** dataset. Key optimizations include:

+ - **4-bit quantization** to reduce memory usage
+ - **LoRA (Low-Rank Adaptation)** for efficient fine-tuning
+ - Techniques for faster inference and text generation with the fine-tuned model (see the sketch below)
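
A minimal sketch of the faster-inference path mentioned above, assuming the `model` and `tokenizer` objects returned by `FastLanguageModel.from_pretrained`; the prompt and generation settings are only illustrative:

```python
from unsloth import FastLanguageModel

# Assumes `model` and `tokenizer` are already loaded (see Model Details below)
FastLanguageModel.for_inference(model)  # switch Unsloth into its faster inference mode

messages = [{"role": "user", "content": "Summarize LoRA in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # append the assistant header so the model replies
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids=input_ids, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```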

+ ## Model Details

+ - **Model Name**: Llama-3.2-3B-Instruct
+ - **Pretrained Weights**: Unsloth's pretrained version of Llama-3.2-3B
+ - **Quantization**: 4-bit quantization (set via `load_in_4bit=True`) for reduced memory usage (see the loading sketch below)
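
A minimal sketch of how these settings map onto Unsloth's loader; the values mirror the configuration listed above:

```python
from unsloth import FastLanguageModel

# Load the instruct checkpoint in 4-bit so it fits comfortably on a small GPU
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=2048,  # matches the training sequence length below
    dtype=None,           # auto-select float16 / bfloat16 based on the GPU
    load_in_4bit=True,    # 4-bit quantization for reduced memory usage
)
```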

+ ### LoRA Configuration:
+ - **Rank**: 16
+ - **Target Modules**:
+   - q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
+ - **LoRA Alpha**: 16
+ - **LoRA Dropout**: 0

+ ### Gradient Checkpointing:
+ - **Use Gradient Checkpointing**: `"unsloth"` for improved long-context training (see the sketch below)
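
Taken together, the LoRA and checkpointing settings above correspond roughly to the following call; this is a sketch, and `bias` and `random_state` are illustrative defaults rather than values stated in this README:

```python
from unsloth import FastLanguageModel

# Attach LoRA adapters so only a small fraction of parameters is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                  # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",                           # illustrative default
    use_gradient_checkpointing="unsloth",  # Unsloth's long-context checkpointing
    random_state=3407,                     # illustrative seed
)
```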

+ ## Training

+ - **Dataset**: FineTome-100k (first 500 records selected; see the preparation sketch below)
+ - **Loss Function**: Standard cross-entropy (causal language modeling) loss
+ - **Training Steps**: 60 steps with a per-device batch size of 2 (gradient accumulation steps set to 4)
+ - **Optimizer**: AdamW (8-bit)
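
A sketch of how the 500-record subset can be prepared with Unsloth's chat-template helpers; the `format_conversations` helper name is illustrative, and the column name follows FineTome-100k's ShareGPT-style layout:

```python
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template, standardize_sharegpt

# Render conversations with the Llama 3.1 chat template
tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")

dataset = load_dataset("mlabonne/FineTome-100k", split="train")
dataset = dataset.select(range(500))     # first 500 records
dataset = standardize_sharegpt(dataset)  # normalize turns to role/content pairs

def format_conversations(examples):
    # Flatten each conversation into a single training string
    texts = [
        tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
        for convo in examples["conversations"]
    ]
    return {"text": texts}

dataset = dataset.map(format_conversations, batched=True)
```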

+ ### Training Parameters:
+ - **Max Sequence Length**: 2048 tokens
+ - **Learning Rate**: 2e-4
+ - **Gradient Accumulation Steps**: 4
+ - **Total Steps**: 60
+ - **Epochs**: ~1 (training is capped by `max_steps=60`)
+ - **Training Precision**: FP16 or BF16, selected according to GPU support (see the trainer sketch below)
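
These parameters map onto a TRL `SFTTrainer` roughly as follows; this is a sketch, and `warmup_steps`, `seed`, and `output_dir` are illustrative rather than values stated above:

```python
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,     # effective batch size of 8
        warmup_steps=5,                    # illustrative
        max_steps=60,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),  # FP16 on T4-class GPUs
        bf16=is_bfloat16_supported(),      # BF16 on Ampere and newer
        optim="adamw_8bit",
        logging_steps=1,
        seed=3407,                         # illustrative
        output_dir="outputs",              # illustrative
        report_to="none",
    ),
)
trainer.train()
```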

+ ## Performance

+ - **GPU Used**: Tesla T4 (14.7 GB max memory)

+ ### Peak Memory Usage:
+ - **Total Reserved Memory**: 3.855 GB
+ - **Memory Used for LoRA**: 1.312 GB
+ - **Peak Memory Utilization**: 26.1% of available memory (see the measurement sketch below)
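
A sketch of how peak-memory figures like these are typically collected with PyTorch's CUDA statistics; the variable names are illustrative:

```python
import torch

gpu = torch.cuda.get_device_properties(0)
total_gb = gpu.total_memory / 1024**3

# Peak memory reserved by the CUDA caching allocator over the run;
# subtracting a snapshot taken right after model loading isolates the LoRA training cost.
peak_reserved_gb = torch.cuda.max_memory_reserved() / 1024**3

print(f"GPU: {gpu.name} ({total_gb:.1f} GB total)")
print(f"Peak reserved memory: {peak_reserved_gb:.3f} GB")
print(f"Peak utilization: {100 * peak_reserved_gb / total_gb:.1f} %")
```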

+ ## Conclusion

+ This notebook demonstrates a memory-efficient approach to fine-tuning large language models using **LoRA** and **4-bit quantization**. The **Unsloth** library enables fast training and inference, making this setup practical even with limited GPU resources.
+
+ ## Notebook

+ The implementation notebook for this model is available [here](https://github.com/SURESHBEEKHANI/Advanced-LLM-Fine-Tuning/blob/main/Llama_3_2_3B_SFT_GGUF.ipynb). It walks through the fine-tuning and deployment steps in detail.