---
library_name: transformers
datasets:
- teknium/openhermes
pipeline_tag: text-generation
license: apache-2.0
base_model: Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-2.0
---

# Model Card for Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0:

## Model Details:

### Model Description:

- **Finetuned from model: Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-2.0 on teknium/openhermes.**
- We pruned the 4 layers of meta-llama/Meta-Llama-3.1-8B that had the least impact on model performance, following the paper [The Unreasonable Ineffectiveness of the Deeper Layers](https://arxiv.org/pdf/2403.17887) (a hedged sketch of this step is given in the appendix at the end of this card).
- The pruned model therefore has 1.09B fewer parameters than the foundation model, which means a smaller memory footprint, faster training, and lower latency at inference time.
- We then recovered the performance lost to pruning by fine-tuning (MMLU-Pro 0-shot went from 0.2642 to 0.3120); this step is called healing the pruned model.

### Upcoming Work:

- More healing through SFT/DPO/TPO to see if we can get closer to the performance of meta-llama/Meta-Llama-3.1-8B (MMLU-Pro 0-shot of 0.3659, vs 0.3120 for our model). **(In Progress)**
- Apply the exact same process to meta-llama/Llama-3.1-70B and compare.

### Training Details:

Fine-tuning was done with Unsloth and TRL's `SFTTrainer`. The snippet below reproduces the configuration; the model/dataset loading lines are added for completeness, and the `max_seq_length` value is an assumption (the card does not state the one used).

```python
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import FastLanguageModel, is_bfloat16_supported

max_seq_length = 2048  # placeholder; not stated in the original card

# Load the previously healed checkpoint as the starting point.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-2.0",
    max_seq_length = max_seq_length,
)

# Attach LoRA adapters to the attention and MLP projections.
model = FastLanguageModel.get_peft_model(
    model,
    r = 4,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 4,
    lora_dropout = 0.05,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

# The preprocessing is not shown in the card; the trainer below expects
# the dataset to expose a "completion" text column.
dataset = load_dataset("teknium/openhermes", split = "train")

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "completion",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 10,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 5000,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "cosine",
        seed = 3407,
        output_dir = "outputs_4",
        push_to_hub = True,
        hub_always_push = True,
    ),
)
trainer.train()
```

### Training Data:

[teknium/openhermes](https://huggingface.co/datasets/teknium/openhermes)

### Memory and Latency Gains (Using [**Optimum-Benchmark**](https://github.com/huggingface/optimum-benchmark)):

**Load Mode Memory Metrics**

| **Model** | **Max Global VRAM (MB)** | **Max Process VRAM (MB)** | **Max Reserved VRAM (MB)** | **Max Allocated VRAM (MB)** |
|:---:|:---:|:---:|:---:|:---:|
| Llama-3.1-8B | 18521.98 | 16630.42 | 16196.30 | 16060.54 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 16319.97 | 14428.41 | 13994.30 | 13879.42 |

**Inference Mode Latency Metrics**

| **Model** | **Latency Mean (s)** | **Throughput (tokens/s)** |
|:---:|:---:|:---:|
| Llama-3.1-8B | 0.8104 | 38.2536 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.5530 | 56.0570 |
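A comparable run can be scripted in a few lines. The sketch below is a minimal example assuming Optimum-Benchmark's Python API at the time of writing (`PyTorchConfig`, `ProcessConfig`, `InferenceConfig`, `Benchmark.launch`); class names and fields vary between versions, and the exact configuration behind the tables above is not recorded here, so treat this as a starting point rather than the script we used.

```python
from optimum_benchmark import (
    Benchmark,
    BenchmarkConfig,
    InferenceConfig,
    ProcessConfig,
    PyTorchConfig,
)

if __name__ == "__main__":
    # Backend: load the model with plain PyTorch on one GPU.
    backend = PyTorchConfig(
        model="Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0",
        device="cuda",
        device_ids="0",
    )
    # Launcher: isolate the run in a separate process for clean memory readings.
    launcher = ProcessConfig()
    # Scenario: measure latency and memory during inference.
    scenario = InferenceConfig(latency=True, memory=True)

    config = BenchmarkConfig(
        name="pruned-llama-inference",
        backend=backend,
        launcher=launcher,
        scenario=scenario,
    )
    report = Benchmark.launch(config)
    print(report)  # collected latency, throughput, and VRAM metrics
```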
### Evaluation:

- (Foundation model) MMLU-Pro 0-shot of meta-llama/Meta-Llama-3.1-8B: 0.3659
- (Pruned model) MMLU-Pro 0-shot of Na0s/Llama-3.1-8B-Pruned-4-Layers: 0.2642
- (Healed model) MMLU-Pro 0-shot of Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0: 0.3120

### Evaluation Data and Process:

- [TIGER-AI-Lab/MMLU-Pro](https://github.com/TIGER-AI-Lab/MMLU-Pro).
- The Hugging Face [Lighteval](https://github.com/huggingface/lighteval) benchmarking repo (a toy illustration of the 0-shot format is sketched below).
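To make the 0-shot setting concrete, and to show a minimal way of loading the healed model, the sketch below formats a single MMLU-Pro-style multiple-choice item and greedily decodes an answer letter. The question and options are invented for illustration; the actual prompt template and answer extraction follow the harnesses linked above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A made-up multiple-choice item; MMLU-Pro items carry up to ten options.
question = "Which data structure offers O(1) average-case lookup by key?"
options = ["Linked list", "Hash table", "Binary search tree", "Stack"]
prompt = (
    question
    + "\n"
    + "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))
    + "\nAnswer:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=4, do_sample=False)
print(tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```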
## Additional Benchmark Results

Bold marks the best score in each column.

### BoolQ 0-shot Benchmark Results

| Model | Average Score | boolq (0 shots) | boolq contrastset (0 shots) |
|-------|---------------|-----------------|-----------------------------|
| meta-llama/Meta-Llama-3.1-8B | 0.569 | 0.569 | 0.568 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.240 | 0.240 | 0.240 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | **0.833** | **0.834** | **0.831** |

### BigBench 0-shot Benchmark Results

| Model | Average Score | bigbench:causal_judgment (0 shots) | bigbench:date_understanding (0 shots) | bigbench:disambiguation_qa (0 shots) | bigbench:geometric_shapes (0 shots) | bigbench:logical_deduction (0 shots) | ... |
|-------|---------------|------|------|------|------|------|-----|
| meta-llama/Meta-Llama-3.1-8B | 0.351 | 0.574 | 0.499 | 0.302 | 0.164 | 0.208 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.299 | 0.537 | 0.341 | 0.314 | 0.200 | **0.212** | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | **0.364** | **0.579** | **0.610** | **0.407** | **0.264** | 0.208 | ... |

### Few-Shot Benchmark Results

| Model | Average Score | arc:challenge (25 shots) | hellaswag (10 shots) | mmlu:abstract_algebra (5 shots) | mmlu:college_chemistry (5 shots) | mmlu:college_computer_science (5 shots) | mmlu:college_mathematics (5 shots) | ... |
|-------|---------------|------|------|------|------|------|------|-----|
| meta-llama/Meta-Llama-3.1-8B | **0.552** | **0.541** | **0.620** | 0.290 | 0.450 | 0.480 | **0.350** | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.516 | 0.462 | 0.549 | 0.290 | 0.440 | 0.460 | 0.280 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | 0.544 | 0.479 | 0.554 | **0.340** | **0.480** | **0.520** | **0.350** | ... |

### BigBench 3-shot Benchmark Results

| Model | Average Score | bigbench:causal_judgment (3 shots) | bigbench:date_understanding (3 shots) | bigbench:disambiguation_qa (3 shots) | bigbench:geometric_shapes (3 shots) | bigbench:logical_deduction (3 shots) | ... |
|-------|---------------|------|------|------|------|------|-----|
| meta-llama/Meta-Llama-3.1-8B | 0.442 | 0.563 | 0.596 | 0.593 | 0.181 | 0.298 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.420 | 0.563 | 0.642 | 0.574 | 0.217 | 0.258 | ... |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | **0.450** | **0.621** | **0.686** | **0.663** | **0.225** | **0.332** | ... |

### Overall Average Score

| Model | Overall Average Score |
|-------|-----------------------|
| meta-llama/Meta-Llama-3.1-8B | 0.472 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers | 0.364 |
| Na0s/Llama-3.1-8B-Pruned-4-Layers_LoRA-PEFT-3.0 | **0.513** |

### Environmental Impact:

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
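### Appendix: Layer-Pruning Sketch

For reference, the sketch below shows one way to drop decoder layers from a Llama-style checkpoint, as described in the Model Description. It is a minimal illustration, not the exact script used: the paper selects the block to remove by measuring the angular distance between each candidate block's input and output hidden states, and the layer indices below are placeholders, not the four layers removed from this model.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the foundation model.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16
)

# Placeholder indices: the paper picks the contiguous block of layers whose
# inputs and outputs are most similar (i.e., least impact when removed).
layers_to_drop = {24, 25, 26, 27}

# Keep every decoder layer except the dropped ones.
model.model.layers = torch.nn.ModuleList(
    layer for i, layer in enumerate(model.model.layers) if i not in layers_to_drop
)

# Re-index the remaining layers so the KV cache stays consistent,
# and record the new depth in the config.
for i, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = i
model.config.num_hidden_layers = len(model.model.layers)

model.save_pretrained("Llama-3.1-8B-Pruned-4-Layers")
```

Healing (the LoRA fine-tuning shown under Training Details) is then applied to this pruned checkpoint to recover most of the lost accuracy.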