---
datasets:
- NeelNanda/pile-10k
base_model:
- MiniMaxAI/MiniMax-Text-01
---

## Model Details

This is an int4 model with group_size 128 and symmetric quantization of [MiniMaxAI/MiniMax-Text-01](https://huggingface.co/MiniMaxAI/MiniMax-Text-01), generated by the [intel/auto-round](https://github.com/intel/auto-round) algorithm. The model is in AutoRound format, which is **NOT** supported by other serving frameworks such as vLLM.

Please follow the [license](https://huggingface.co/MiniMaxAI/MiniMax-Text-01/blob/main/LICENSE) of the original model.

## How To Use

**INT4 Inference on CUDA** (**4*80G**)

Requirements

```bash
pip3 install git+https://github.com/intel/auto-round.git@bf16_inference
pip3 install auto-gptq
```

**This model is prone to overflow when run with an int4 kernel using the FP16 computation dtype**, and it does not support CPU because the model files rely explicitly on CUDA operations. We have implemented several workarounds to keep it functional, but **some prompts may still produce unexpected and random outputs**.

~~~python
from auto_round import AutoRoundConfig  ## must import for autoround format
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

quantized_model_dir = "OPEA/MiniMax-Text-01-int4-sym-inc-preview"
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  ## must use bf16
    device_map="auto",
)


## workaround for overflow:
## clamp every linear output to the finite FP16 range (65504 is the largest
## finite FP16 value) and return it as bf16
def forward_hook(module, input, output):
    return torch.clamp(output, -65504, 65504).to(torch.bfloat16)


def register_fp16_pre_hooks(model):
    ## attach the clamp hook to quantized and regular linear layers
    for name, module in model.named_modules():
        if "QuantLinear" in module.__class__.__name__ or isinstance(module, torch.nn.Linear):
            module.register_forward_hook(forward_hook)


register_fp16_pre_hooks(model)

tokenizer.pad_token = tokenizer.eos_token

prompt = "How many r in strawberry."
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant created by MiniMax based on MiniMax-Text-01 model."}]},
    {"role": "user", "content": [{"type": "text", "text": prompt}]},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_new_tokens=512,
    num_return_sequences=1,
    do_sample=False,  ## change this to align with official usage
    eos_token_id=200020,
)
## strip the prompt tokens so that only the newly generated tokens are decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

"""
Prompt: 为什么企鹅没有被北极熊吃掉?
Generated: 在自然界中,**企鹅**和**北极熊**分别生活在地球的两端:**企鹅**主要生活在**南半球**的**南极洲**及其周围的海域,而**北极熊**则生活在**北半球**的**北极地区**。这种地理上的分隔确保了**企鹅**和**北极熊**在自然界中**无法相遇**,因此**北极熊**无法**吃掉****企鹅**。

### 详细解释:

1. **地理分布**:
   - **企鹅**生活在**南半球**,特别是在**南极洲**及其周围的海域。**企鹅**的种类包括**帝企鹅**、**阿德利企鹅**等。
   - **北极熊**生活在**北半球**,主要在**北极**地区,如**加拿大**、**阿拉斯加**、**格陵兰**和**俄罗斯**等。

2. **生态习性**:
   - **企鹅**
--------------------------------------------------
Prompt: 树枝上有十只鸟,如果你射杀了一只,还剩下几只?请用中文回答
Generated: 让我一步步思考这个问题:

1. 原本树枝上有10只鸟
2. 射杀1只后:
   * 射杀1只后, 鸟会受惊飞走
   * 剩下的鸟会全部飞走
3. 所以答案是:
   * 0只鸟会留在树枝上
   * 因为鸟会受惊飞走

所以答案是0只。

这个答案考虑到了自然界中动物对危险的本能反应, 当有同伴被射杀时, 其他鸟会立即飞走, 而不是继续停留在树枝上。
--------------------------------------------------
Prompt: How many r in strawberry.
Generated: Let me help you count the number of "r" in "strawberry" step by step.

1. First, let's break down the word "strawberry" into its letters:
   s - t - r - a - w - b - e - r - r - y

2. Now, let's count the "r" letters:
   - First "r" is at position 3
   - Second "r" is at position 8
   - Third "r" is at position 9

3. So there are 3 "r" letters in "strawberry"

Therefore, there are 3 "r" letters in the word "strawberry".
--------------------------------------------------
Prompt: How many r in strawberry.
Generated: Let me help you solve this step by step.

1) First, let's look at the word "strawberry" and count the letter "r" in it.

2) The word "strawberry" has 11 letters.

3) The letter "r" appears twice in "strawberry" (at the end of the word, and before the last "r")

4) So, the answer is 2.

The number of "r" in "strawberry" is 2.

This is a good example of how the appearance of a word can be misleading - the word "strawberry" has more "r" than it appears to have at first glance.
--------------------------------------------------
Prompt: There is a girl who likes adventure,
Generated: and she is not alone in her love for adventure.
--------------------------------------------------
Prompt: hello
Generated: Hello! How can I assist you today?
"""
~~~

## Generate the model (2*80G)

`pip3 install git+https://github.com/intel/auto-round.git@bf16_inference`

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from auto_round import AutoRound

model_name = "MiniMaxAI/MiniMax-Text-01"
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16)

# Layers kept at 16 bits: every MoE gate plus two individual expert projections
fp_layers = [f"model.layers.{i}.block_sparse_moe.gate" for i in range(config.num_hidden_layers)]
fp_layers.append("model.layers.4.block_sparse_moe.experts.10.w2")
fp_layers.append("model.layers.5.block_sparse_moe.experts.18.w2")

# Spread the 32 experts of each layer over two GPUs: experts 0-13 on GPU 0, experts 14-31 on GPU 1
device_map = {}
for i in range(32):
    key = fr"model\.layers\.\d+\.block_sparse_moe\.experts\.{str(i)}\..*$"
    if i < 14:
        device_map[key] = 0
    else:
        device_map[key] = 1

layer_config = {}
for fp_layer in fp_layers:
    layer_config[fp_layer] = {"bits": 16}

autoround = AutoRound(model=model, tokenizer=tokenizer, layer_config=layer_config, device_map=device_map,
                      low_gpu_mem_usage=False, batch_size=1, gradient_accumulate_steps=4,
                      seqlen=512, iters=50, lr=5e-3)
autoround.quantize()
autoround.save_quantized(format="auto_round", output_dir="tmp_autoround")
exit()
```

## Ethical Considerations and Limitations

The model can produce factually incorrect output, and should not be relied on to produce factually accurate information. Because of the limitations of the pretrained model and the finetuning datasets, it is possible that this model could generate lewd, biased or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

## Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here are a couple of useful links to learn more about Intel's AI software:

- Intel Neural Compressor [link](https://github.com/intel/neural-compressor)

## Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

## Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

[arxiv](https://arxiv.org/abs/2309.05516) [github](https://github.com/intel/auto-round)
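
## Appendix: checking the expert device map

The quantization script builds its `device_map` from one regular expression per expert index. Before launching a long run on a different GPU layout, it may help to check which device a given module name would land on. The following is a minimal sketch using only the standard library. The helper `resolve_device` is hypothetical, and the first-match lookup with `re.match` is an assumption about how such keys could be resolved; this card does not show how auto-round resolves them internally.

```python
import re

# Same key construction as in the quantization script:
# experts 0-13 go to GPU 0, experts 14-31 go to GPU 1.
device_map = {}
for i in range(32):
    key = rf"model\.layers\.\d+\.block_sparse_moe\.experts\.{i}\..*$"
    device_map[key] = 0 if i < 14 else 1


def resolve_device(module_name, mapping):
    """Hypothetical helper: device of the first pattern matching module_name, else None."""
    for pattern, device in mapping.items():
        if re.match(pattern, module_name):
            return device
    return None


# The escaped dot after the expert index keeps "experts.1." from matching "experts.10."
print(resolve_device("model.layers.4.block_sparse_moe.experts.1.w2", device_map))
print(resolve_device("model.layers.4.block_sparse_moe.experts.10.w2", device_map))
print(resolve_device("model.layers.5.block_sparse_moe.experts.18.w2", device_map))
# Modules outside the expert blocks (for example the gate) match no pattern
print(resolve_device("model.layers.4.block_sparse_moe.gate", device_map))
```

To try a different split, change the `i < 14` threshold and rerun the check on a few representative module names.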