SURESHBEEKHANI committed on
Commit
689d87e
verified
1 Parent(s): e785b62

Update README.md

  1. README.md +159 -15
README.md CHANGED
@@ -1,22 +1,166 @@
  ---
- base_model: unsloth/llama-3.2-3b-instruct-bnb-4bit
- tags:
- - text-generation-inference
- - transformers
- - unsloth
- - llama
- - gguf
- license: apache-2.0
- language:
- - en
  ---
 
- # Uploaded model
 
- - **Developed by:** SURESHBEEKHANI
- - **License:** apache-2.0
- - **Finetuned from model :** unsloth/llama-3.2-3b-instruct-bnb-4bit
 
- This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
 
 
+ # **Fine-Tuning Meta-Llama-3.2-3B with Unsloth for CPU and GPU Inference - GGML**
+
+ ## **Overview**
+ On **September 25, 2024**, Meta released the **Llama 3.2** series, which includes lightweight multilingual text models with 1B and 3B parameters. These models target multilingual dialogue, summarization, and agentic retrieval, and support a **128K-token context length**.
+
+ This repository demonstrates fine-tuning the **Meta-Llama-3.2-3B** model (the Instruct variant, loaded as `unsloth/Llama-3.2-3B-Instruct`) with **Unsloth** for efficient training and inference. It also covers converting the fine-tuned model to a **GGML-family format** (GGUF, the current llama.cpp file format) for memory-efficient deployment on CPUs and GPUs.
+
+ ---
+
+ ## **Table of Contents**
+ 1. [Key Features](#key-features)
+ 2. [Setup and Installation](#setup-and-installation)
+ 3. [Fine-Tuning Workflow](#fine-tuning-workflow)
+ 4. [Data Preparation](#data-preparation)
+ 5. [Training the Model](#training-the-model)
+ 6. [Model Conversion to GGML](#model-conversion-to-ggml)
+ 7. [License](#license)
+
+ ---
+
+ ## **Key Features**
+ - **Low-Rank Adaptation (LoRA):** Trains small low-rank adapter matrices while the base weights stay frozen, which reduces training cost.
+ - **Memory Optimization:** Loads the base model with **4-bit quantization** for memory-constrained environments.
+ - **Efficient Training:** Uses Unsloth's gradient checkpointing, which lowers memory use during training by recomputing activations in the backward pass.
+ - **Extended Context Length:** The base model supports input sequences up to **128K tokens**; the example in this README trains with `max_seq_length = 2048`.
+ - **Versatile Applications:** Suited to dialogue systems, summarization, and knowledge retrieval tasks.
+
+ ---
+
+ ## **Setup and Installation**
+
+ ### **Step 1: Install Dependencies**
+ Install **Unsloth**, then optionally reinstall it from the GitHub repository to pick up the latest development version. The commands are written for a shell; in a Jupyter or Colab notebook, prefix each `pip` command with `!`.
+ ```bash
+ pip install unsloth
+ # Optional: reinstall the latest development version of Unsloth from GitHub
+ pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git
+ ```
+
+ ---
+
+ ### **Step 2: Load the Model and Tokenizer**
+ The following code loads the Llama 3.2 3B Instruct model and its tokenizer with 4-bit quantization:
+ ```python
+ from unsloth import FastLanguageModel
+ import torch
+
+ # Configuration settings
+ max_seq_length = 2048  # Maximum sequence length used for training
+ dtype = None  # None auto-detects the dtype: float16 on a T4, bfloat16 on Ampere or newer
+ load_in_4bit = True  # Load weights with 4-bit quantization to reduce memory use
+
+ # Load the model and tokenizer
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="unsloth/Llama-3.2-3B-Instruct",
+     max_seq_length=max_seq_length,
+     dtype=dtype,
+     load_in_4bit=load_in_4bit,
+ )
+ ```
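
To see why `load_in_4bit=True` matters on small GPUs, the memory needed for the weights can be estimated from the parameter count. This is a rough sketch, assuming about 3.2 billion parameters (an approximate figure that does not come from this README) and ignoring quantization metadata, activations, and the KV cache:

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30

NUM_PARAMS = 3.2e9  # assumption: approximate size of a "3B" Llama 3.2 model

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under that assumption the 4-bit weights need a quarter of the 16-bit footprint; real usage is somewhat higher because quantization scales and runtime buffers also take memory.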
+
+ ---
+
+ ## **Fine-Tuning Workflow**
+
+ ### **LoRA Fine-Tuning with Unsloth**
+ Attach LoRA adapters so that only a small set of added low-rank weights is trained while the base model stays frozen:
+ ```python
+ model = FastLanguageModel.get_peft_model(
+     model,
+     r=16,  # LoRA rank; common choices are 8, 16, 32
+     target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
+     lora_alpha=16,  # LoRA scaling factor
+     lora_dropout=0,  # No dropout on the adapter inputs
+     bias="none",  # Do not train bias terms
+     use_gradient_checkpointing="unsloth",  # Unsloth's memory-saving gradient checkpointing
+     random_state=3407,
+ )
+ ```
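
The number of weights that LoRA trains follows directly from the rank and the shapes of the targeted projections. The sketch below counts them for `r=16`; the layer shapes are assumptions about the Llama 3.2 3B architecture and are not stated in this README, so read the real values from `model.config` before relying on the result:

```python
# Assumed Llama 3.2 3B shapes (illustrative constants, not taken from this README).
HIDDEN = 3072        # hidden_size
INTERMEDIATE = 8192  # intermediate_size (MLP width)
KV_DIM = 1024        # num_key_value_heads * head_dim
NUM_LAYERS = 28      # num_hidden_layers

# (in_features, out_features) of each targeted projection in one layer
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in) and B (out x r), so r * (in + out) weights."""
    return r * (in_features + out_features)

def total_lora_params(r: int) -> int:
    """Sum the adapter weights over all targeted projections and layers."""
    per_layer = sum(lora_params(i, o, r) for i, o in PROJECTIONS.values())
    return per_layer * NUM_LAYERS

print(total_lora_params(16))  # trainable adapter weights for r=16
```

Under these assumed shapes, `r=16` gives roughly 24 million trainable weights, which is under 1% of a 3B-parameter model. The count scales linearly with `r`.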
+
  ---
+
+ ## **Data Preparation**
+ Load a dataset in **ShareGPT-style** conversation format, standardize it, and render each conversation to text with the `unsloth.chat_templates` module:
+ ```python
+ from unsloth.chat_templates import get_chat_template, standardize_sharegpt
+ from datasets import load_dataset
+
+ # Attach Unsloth's "llama-3.1" chat template to the tokenizer
+ tokenizer = get_chat_template(tokenizer, chat_template="llama-3.1")
+
+ def formatting_prompts_func(examples):
+     # Render each conversation to a single training string
+     convos = examples["conversations"]
+     texts = [
+         tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False)
+         for convo in convos
+     ]
+     return {"text": texts}
+
+ # Load and prepare the dataset
+ dataset = load_dataset("mlabonne/FineTome-100k", split="train")
+ dataset = dataset.select(range(500))  # Use a subset for quick testing
+ dataset = standardize_sharegpt(dataset)  # Convert ShareGPT turns to role/content messages
+ dataset = dataset.map(formatting_prompts_func, batched=True)
+ ```
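
The two data steps, standardizing ShareGPT turns and rendering them with a chat template, can be illustrated without any libraries. This is a simplified sketch, not Unsloth's implementation: `ROLE_MAP`, `to_role_content`, and `render_llama3` are hypothetical names, and the real template may emit extra content such as a default system block.

```python
# Hypothetical mapping from ShareGPT speaker names to chat roles.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def to_role_content(conversation):
    """Convert [{"from": ..., "value": ...}] turns to [{"role": ..., "content": ...}]."""
    return [
        {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        for turn in conversation
    ]

def render_llama3(messages):
    """Simplified Llama 3 style rendering of a list of role/content messages."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += f"{message['content']}<|eot_id|>"
    return text

sharegpt = [
    {"from": "human", "value": "What is 2 + 2?"},
    {"from": "gpt", "value": "2 + 2 equals 4."},
]
messages = to_role_content(sharegpt)
print(render_llama3(messages))
```

The header strings in the rendered text are the same markers that the training step uses to tell user turns from assistant turns.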
+
  ---
 
+ ## **Training the Model**
+
+ ### **SFT Training with TRL**
+ Fine-tune the model with the `SFTTrainer` from Hugging Face's TRL library:
+ ```python
+ from trl import SFTTrainer
+ from transformers import TrainingArguments, DataCollatorForSeq2Seq
+ from unsloth import is_bfloat16_supported
 
+ trainer = SFTTrainer(
+     model=model,
+     tokenizer=tokenizer,
+     train_dataset=dataset,
+     dataset_text_field="text",
+     max_seq_length=max_seq_length,
+     data_collator=DataCollatorForSeq2Seq(tokenizer=tokenizer),
+     dataset_num_proc=2,
+     packing=False,  # Do not pack multiple short examples into one sequence
+     args=TrainingArguments(
+         per_device_train_batch_size=2,
+         gradient_accumulation_steps=4,  # Effective batch size: 2 * 4 = 8 per device
+         warmup_steps=5,
+         max_steps=60,
+         learning_rate=2e-4,
+         fp16=not is_bfloat16_supported(),
+         bf16=is_bfloat16_supported(),
+         logging_steps=1,
+         optim="adamw_8bit",
+         weight_decay=0.01,
+         lr_scheduler_type="linear",
+         seed=3407,
+         output_dir="outputs",
+         report_to="none",  # Disable external logging integrations
+     ),
+ )
 
+ # Train on assistant responses only
+ from unsloth.chat_templates import train_on_responses_only
+ trainer = train_on_responses_only(
+     trainer,
+     instruction_part="<|start_header_id|>user<|end_header_id|>\n\n",
+     response_part="<|start_header_id|>assistant<|end_header_id|>\n\n",
+ )
+
+ # Run the fine-tuning loop
+ trainer_stats = trainer.train()
+ ```
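
Training on responses only means that the loss ignores every token outside an assistant turn. A minimal sketch of that masking idea on toy token ids follows; it is not Unsloth's implementation, and `find_sublist` and `mask_non_responses` are hypothetical helpers written for illustration:

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def find_sublist(seq, pattern, start=0):
    """Return the index of the first occurrence of pattern at or after start, or -1."""
    for i in range(start, len(seq) - len(pattern) + 1):
        if seq[i:i + len(pattern)] == pattern:
            return i
    return -1

def mask_non_responses(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens inside assistant responses."""
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sublist(input_ids, response_ids, pos)
        if start == -1:
            break
        start += len(response_ids)  # first token after the assistant header
        end = find_sublist(input_ids, instruction_ids, start)
        if end == -1:
            end = len(input_ids)  # the last response runs to the end
        labels[start:end] = input_ids[start:end]
        pos = end
    return labels

# Toy token ids: 1 = user header, 2 = assistant header, others = content
ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
print(mask_non_responses(ids, instruction_ids=[1], response_ids=[2]))
# → [-100, -100, -100, -100, 20, 21, -100, -100, -100, 22]
```

In the real pipeline the header patterns are the tokenized forms of the `instruction_part` and `response_part` strings rather than single ids.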
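 
+ ---
+
+ ## **Model Conversion to GGML**
+ Export the fine-tuned model to a quantized GGML-family file for memory-efficient inference. Unsloth writes **GGUF**, the current llama.cpp format that superseded the original GGML files. The output directory name and quantization method shown here are example values:
+ ```python
+ # Merge the LoRA adapters into the base model and export a 4-bit GGUF file
+ model.save_pretrained_gguf("llama3.2-3b-gguf", tokenizer, quantization_method="q4_k_m")
+ ```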
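
The memory savings of a quantized GGML-style file come from storing weights in small blocks that share one scale. A simplified sketch of symmetric 4-bit block quantization follows; it is not llama.cpp's exact on-disk layout, and the block size and scale width are assumptions chosen for illustration:

```python
BLOCK_SIZE = 32  # assumption: weights are quantized in blocks of 32 values

def quantize_block(values):
    """Map floats to integers in [-7, 7] plus one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quants = [max(-7, min(7, round(v / scale))) for v in values]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quants]

def bits_per_weight(block_size=BLOCK_SIZE, quant_bits=4, scale_bits=16):
    """Storage cost per weight: 4-bit values plus one 16-bit scale per block."""
    return (block_size * quant_bits + scale_bits) / block_size

weights = [0.02 * (i - 16) for i in range(BLOCK_SIZE)]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"bits per weight: {bits_per_weight()}, max error: {max_error:.4f}")
```

With these assumptions each weight costs 4.5 bits instead of 16, and the round-trip error per weight is bounded by half the block's scale.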
+
+ ---
 
+ ## **License**
+ This project is distributed under the Apache License 2.0. See [LICENSE](LICENSE) for details.