---
license: apache-2.0
datasets:
- NeelNanda/pile-10k
base_model:
- allenai/OLMo-2-1124-7B-Instruct
---


## Model Card Details

This model is an int4 model with group_size 128 and symmetric quantization of [allenai/OLMo-2-1124-7B-Instruct](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct), generated by [intel/auto-round](https://github.com/intel/auto-round). Load the model with revision `1cdca16` to use the AutoGPTQ format.
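
As a quick sanity check, the quantization settings above can be read back from the checkpoint's config. The following is a minimal sketch using the standard `transformers` config API (the expected values in the comment are taken from this card):

```python
from transformers import AutoConfig

## Minimal sketch: read the quantization settings stored in config.json.
config = AutoConfig.from_pretrained("OPEA/OLMo-2-1124-7B-Instruct-int4-sym-inc")
print(config.quantization_config)  ## expect bits=4, group_size=128, sym=True
```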

## Inference on CPU/HPU/CUDA

```bash
pip3 install "transformers>=4.47"
```

HPU: a Docker image with the Gaudi software stack is recommended; refer to the commented HPU lines in the script below for environment setup. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/Installation_Guide/Bare_Metal_Fresh_OS.html#launch-docker-image-that-was-built).

```python
from auto_round import AutoHfQuantizer  ## must import for the auto-round format
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
quantized_model_dir = "OPEA/OLMo-2-1124-7B-Instruct-int4-sym-inc"
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype='auto',
    device_map="auto",
    ##revision="1cdca16", ##AutoGPTQ format
)

##import habana_frameworks.torch.core as htcore ## uncomment for HPU
##import habana_frameworks.torch.hpu as hthpu ## uncomment for HPU
##model = model.to(torch.bfloat16).to("hpu") ## uncomment for HPU

prompt = "There is a girl who likes adventure,"
messages = [
    {"role": "system", "content": "You are OLMo 2, a helpful and harmless AI Assistant built by the Allen Institute for AI."},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=200, 
    do_sample=False  ##change this to align with the official usage
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

##prompt = "There is a girl who likes adventure,"
##INT4
"""There is a girl who likes adventure,

She's always on the lookout for a new escapade,
Her heart beats with excitement at the thought of the unknown,
Her spirit yearns for the thrill of exploration,

She packs her backpack with essentials,
A map, a compass, and a flashlight,
Her boots are ready for the rugged terrain,
Her spirit is as boundless as the sky.

She embarks on journeys through forests deep and wide,
Climbs mountains with a heart full of pride,
She paddles her kayak through turbulent waters,
And hikes through valleys where the wildflowers bloom.

The girl with the adventurous soul seeks out the hidden gems,
The secret trails, the ancient ruins,
She listens to the whispers of the wind,
And follows the call of the distant drum.

Her adventures are not just about the destination,
But the experiences she gathers along the way,
The stories
"""

##BF16 
"""There is a girl who likes adventure,

She dreams of far-off lands and distant shores,
Of climbing mountains high and exploring caves,
Her heart beats fast with excitement at the thought
Of the unknown paths that lie beyond the maps.

She packs her backpack with essentials and more,
A compass, a flashlight, and a book or two,
Her spirit eager, her eyes wide with wonder,
As she sets out on her journey anew.

The girl with the adventurous soul embarks
On quests that challenge her mind and her might,
She learns to navigate by the stars above,
And finds joy in the beauty of the night.

Through forests deep and rivers wide she roams,
Each step a story, each experience a treasure,
Her courage grows with every challenge faced,
And she discovers the strength she never knew she had.

The girl who likes adventure, with each passing day,
Grows wiser"""

##prompt = "Which one is larger, 9.11 or 9.8"
## INT4
"""9.8 is larger than 9.11.
"""

## BF16
"""9.8 is larger than 9.11. To compare these two numbers, you can simply look at their decimal places. Since 9.8 has a higher decimal value (0.8) compared to 9.11 (which has a decimal value of 0.11), 9.8 is the larger number.
"""

prompt = "How many r in strawberry."
## INT4
"""There are two 'r's in "strawberry."
"""
## BF16
"""There are 2 'r's in "strawberry."
"""


##prompt = "Once upon a time,"
##INT4
"""Once upon a time, in a world where technology and imagination intertwined, there existed an AI named OLMo 2. Created by the brilliant minds at the Allen Institute for AI, OLMo 2 was more than just lines of code; it was a beacon of knowledge and a guardian of information.

OLMo 2's design was sleek and modern, with a digital interface that shimmered like a starlit sky. Its voice was soothing, a harmonious blend of tones that could calm the most restless of souls. With a vast database at its disposal, OLMo 2 was capable of answering any question, no matter how obscure or complex.

Every day, people from all walks of life would seek the wisdom of OLMo 2. Students would ask about the intricacies of quantum physics, while artists would inquire about the history of their favorite art movements. Parents would consult OLMo 2 for advice on raising children, and travelers would ask for
"""

##BF16
"""Once upon a time, in a world where imagination knew no bounds, there existed a land filled with wonder and mystery. This land was called Lumina, a place where the sky shimmered with the colors of a thousand sunsets, and the forests whispered ancient secrets to those who dared to listen.

In Lumina, there lived a young girl named Elara. She had hair as golden as the sun and eyes that held the depth of the ocean. Elara possessed a heart full of curiosity and a spirit unyielding in the face of adventure. Her home was a quaint cottage nestled at the edge of the Whispering Woods, a place where the trees seemed to dance in the wind, sharing tales of long-forgotten times.

One day, as the first light of dawn painted the sky in hues of pink and orange, Elara received a mysterious letter. The envelope was sealed with wax that bore the crest of the forgotten kingdom of Aetheria. Intrigued
"""

```

### Evaluate the model

```bash
pip3 install lm-eval==0.4.5
```

```bash
auto-round --eval --model "OPEA/OLMo-2-1124-7B-Instruct-int4-sym-inc" --eval_bs 16  --tasks leaderboard_mmlu_pro,leaderboard_ifeval,lambada_openai,hellaswag,piqa,winogrande,truthfulqa_mc1,openbookqa,boolq,arc_easy,arc_challenge,mmlu,gsm8k
```
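
For scripted runs, the same checkpoint can also be scored through the lm-eval Python API. The sketch below assumes the `simple_evaluate` entry point from lm-eval 0.4.x and uses a small illustrative subset of the tasks above:

```python
import lm_eval

## Hedged sketch: the task list is a small illustrative subset of the CLI
## command above; extend it to the full list to reproduce the table below.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=OPEA/OLMo-2-1124-7B-Instruct-int4-sym-inc",
    tasks=["lambada_openai", "piqa", "winogrande"],
    batch_size=16,
)
print(results["results"])
```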

| Metric                      | BF16                     | INT4                     |
| --------------------------- | ------------------------ | ------------------------ |
| avg                         | 0.6284                   | 0.6316                   |
| leaderboard_mmlu_pro (5-shot) | 0.2975                  | 0.2931                   |
| leaderboard_ifeval          | 0.5815=(0.6379+0.5250)/2 | 0.6073=(0.6619+0.5527)/2 |
| lambada_openai              | 0.6967                   | 0.6959                   |
| hellaswag                   | 0.6585                   | 0.6537                   |
| winogrande                  | 0.7174                   | 0.7206                   |
| piqa                        | 0.8047                   | 0.8118                   |
| truthfulqa_mc1              | 0.3758                   | 0.3807                   |
| openbookqa                  | 0.4020                   | 0.4060                   |
| boolq                       | 0.8450                   | 0.8535                   |
| arc_easy                    | 0.8384                   | 0.8321                   |
| arc_challenge               | 0.5648                   | 0.5742                   |
| gsm8k (5-shot, strict match) | 0.7582                  | 0.7498                   |

## Reproduce the model

Here is a sample command to generate the model:

```bash
auto-round  \
--model allenai/OLMo-2-1124-7B-Instruct \
--device 0 \
--nsamples 512 \
--model_dtype "fp16" \
--iter 1000 \
--disable_eval \
--format 'auto_gptq,auto_round' \
--output_dir "./tmp_autoround" 
```
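
Equivalently, quantization can be driven from Python. This is a minimal sketch of the `AutoRound` API as described in the intel/auto-round README; argument names may differ across auto-round versions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "allenai/OLMo-2-1124-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(model_name)

## Mirror the CLI settings above: int4, group_size 128, symmetric quantization,
## 512 calibration samples, 1000 tuning iterations.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True,
                      nsamples=512, iters=1000)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_round")
```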



## Ethical Considerations and Limitations

The model can produce factually incorrect output and should not be relied on for factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased, or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

## Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

- [Intel Neural Compressor](https://github.com/intel/neural-compressor)

## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

## Cite

```bibtex
@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```

[arXiv](https://arxiv.org/abs/2309.05516), [GitHub](https://github.com/intel/auto-round)