metadata

language: en
license: apache-2.0

Shears Model Card: shears-llama-7b-50-math-super-adapter

The super-adapter fine-tuned on sparsified LLaMA-7B with some math reasoning datasets using Shears.

The release of the super-network is to facilitate users to apply their own search algorithms and evaluation indicators to extract subnetworks suitable for their specific needs.

Model Details

Information

Model name: shears-llama-7b-50-math-super-adapter
Base model: Sparsified LLaMA-7B
Sparsity: 50%
Domain: Math
Subnetwork version: Super
NNCF Configuration: nncf_shears_llama.json

Sparsified Base Model

Shears employs a simple but effective pruning approach Wanda to sparsify the language model, serving as the base model. Clone the Wanda repo:

git clone https://github.com/locuslab/wanda.git && cd wanda && git checkout 8e8fc87 && cd ..

The command for unstructured sparsifying LLaMA-7B with Wanda, to achieve unstructured 50% sparsity:

python wanda/main.py \
    --model yahma/llama-7b-hf \
    --prune_method wanda \
    --sparsity_ratio 0.5 \
    --sparsity_type unstructured \
    --save wanda_out \
    --save_model shears-llama-7b-50-base

--model: The identifier for the model on the Hugging Face model hub or local path.
--sparsity_ratio: Specifies the percentage of weights to be pruned.
--save_model: Specifies the directory where the pruned language model will be stored.

Refer to our repo for the environment information to run this command.

Adapter Configuration

LoRA rank: 32
LoRA alpha: 64
LoRA target modules: q_proj, k_proj, v_proj, up_proj, down_proj
LoRA rank search space: [32, 24, 16] (for each LoRA module)

Training Hyperparameters

Batch size: 16
Learning rate: 3e-4
Epoch: 3

Training Data

Unified math reasoning dataset: math_10k.json (collected with the training sets of GSM8K, MAWPS, and AQuA).

Evaluation Data

GSM8K, AQuA, MAWPS, SVAMP

How to use

Refer to the illustrative example provided in load_and_explore_supernet.ipynb for a comprehensive understanding. This notebook shows the direct loading of a Shears super-network and the extraction of diverse subnetworks from it. This feature empowers users to employ their own search algorithms and evaluation metrics for the extraction of subnetworks customized to their specific requirements.

Moreover, the super-network is essentially the maximal subnetwork, and it can also be directly loaded:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

def generate_prompt(instruction):
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request. 

                    ### Instruction:
                    {instruction}

                    ### Response:
                    """

base_model = AutoModelForCausalLM.from_pretrained("shears-llama-7b-50-base")
model = PeftModel.from_pretrained(base_model, "IntelLabs/shears-llama-7b-50-math-super-adapter")
model.eval()

non_zero_params = sum([(param.data != 0).sum().item() for _, param in model.named_parameters()])
print(f"Number of all non-zero parameters: {non_zero_params}")

tokenizer = AutoTokenizer.from_pretrained("shears-llama-7b-50-base")

instruction = "Edgar eats 18 pretzels a day. If his brother eats 1/2 as many, how many does his brother eat in a week?"
prompt = generate_prompt(instruction)
inputs = tokenizer(prompt, return_tensors="pt")
input_ids = inputs["input_ids"].to(model.device)
with torch.no_grad():
    generation_output = model.generate(
        input_ids=input_ids,
        return_dict_in_generate=True,
        output_scores=True,
        max_new_tokens=256,
        use_cache=True,
        num_beams=4,
    )
    s = generation_output.sequences[0]
    output = tokenizer.decode(s)
print(output)

Evaluation Results

Results of the heuristic sub-network discoverd from the super-network:

Model	Sparsity	GSM8K	AQuA	MAWPS	SVAMP	Average
LLaMA-7B-LoRA	-	37.5	18.9	79.0	52.1	46.9
LLaMA-7B-Shears	50%	36.1	22.0	78.6	44.5	45.3
LLaMA-13B-LoRA	-	47.5	18.5	83.6	54.6	51.1
LLaMA-13B-Shears	50%	45.1	22.0	83.2	53.3	50.9

Model Sources

Repository: https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears
Paper: Shears: Unstructured Sparsity with Neural Low-rank Adapter Search

Citation

@article{munoz2024shears,
  title = {Shears: Unstructured Sparsity with Neural Low-rank Adapter Search},
  author={J. Pablo Munoz and Jinjie Yuan and Nilesh Jain},
  journal={The 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-2024)},
  year={2024}
}

License

Apache-2.0