---
license: apache-2.0
language:
- en
library_name: transformers
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
pipeline_tag: text-generation
model-index:
- name: Bellatrix-1.5B-xElite
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: IFEval (0-Shot)
      type: wis-k/instruction-following-eval
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: inst_level_strict_acc and prompt_level_strict_acc
      value: 19.64
      name: averaged accuracy
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: BBH (3-Shot)
      type: SaylorTwift/bbh
      split: test
      args:
        num_few_shot: 3
    metrics:
    - type: acc_norm
      value: 9.49
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MATH Lvl 5 (4-Shot)
      type: lighteval/MATH-Hard
      split: test
      args:
        num_few_shot: 4
    metrics:
    - type: exact_match
      value: 12.61
      name: exact match
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GPQA (0-shot)
      type: Idavidrein/gpqa
      split: train
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 3.8
      name: acc_norm
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MuSR (0-shot)
      type: TAUR-Lab/MuSR
      args:
        num_few_shot: 0
    metrics:
    - type: acc_norm
      value: 4.44
      name: acc_norm
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU-PRO (5-shot)
      type: TIGER-Lab/MMLU-Pro
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 7.3
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FBellatrix-1.5B-xElite
      name: Open LLM Leaderboard
tags:
- qwen
- qwq
---
 ____  ____  __    __      __   ____  ____  ____  _  _ 
(  _ \( ___)(  )  (  )    /__\ (_  _)(  _ \(_  _)( \/ )
 ) _ < )__)  )(__  )(__  /(__)\  )(   )   / _)(_  )  ( 
(____/(____)(____)(____)(__)(__)(__) (_)\_)(____)(_/\_)
# **Bellatrix-1.5B-xElite**

Bellatrix-1.5B-xElite is a reasoning-oriented model built on Qwen/Qwen2.5-1.5B-Instruct and designed around QWQ synthetic dataset entries. The pipeline's instruction-tuned, text-only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They are described as outperforming many available open-source options; measured benchmark scores are listed in the Open LLM Leaderboard results at the end of this card. Bellatrix is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

# **Quickstart with Transformers**

The following snippet uses `apply_chat_template` to show how to load the tokenizer and model and how to generate text.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "prithivMLmods/Bellatrix-1.5B-xElite"

# Load the model weights and tokenizer; dtype and device placement are chosen automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template and append the assistant prompt.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)

# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

# **Intended Use:**

1. **Multilingual Dialogue Systems:**
   - Designed for conversational AI applications that handle dialogue across multiple languages.
   - Useful in customer service, chatbots, and other dialogue-centric use cases.

2. **Reasoning and QWQ Dataset Applications:**
   - Optimized for tasks requiring logical reasoning and contextual understanding, particularly on synthetic datasets like QWQ.

3. **Agentic Retrieval:**
   - Supports retrieval-augmented generation tasks, helping systems fetch and synthesize information.

4. **Summarization Tasks:**
   - Intended for summarizing long or complex text while maintaining coherence and relevance.

5. **Instruction-Following Tasks:**
   - Can carry out tasks specified by user instructions, owing to instruction tuning during training.

6. **Language Generation:**
   - Suitable for generating coherent and contextually relevant text in various domains and styles.

# **Limitations:**

1. **Synthetic Dataset Bias:**
   - Optimization for QWQ and similar datasets may make the model less effective on real-world or less structured data.

2. **Data Dependency:**
   - Performance may degrade on tasks or languages that are not well represented in the training dataset.

3. **Computational Requirements:**
   - The optimized transformer architecture may demand significant computational resources, especially for fine-tuning or large-scale deployments.

4. **Potential Hallucinations:**
   - Like most auto-regressive models, it may generate plausible-sounding but factually incorrect or nonsensical outputs.

5. **RLHF-Specific Biases:**
   - Reinforcement learning from human feedback (RLHF) can introduce biases that reflect the preferences of the annotators involved in the feedback process.

6. **Limited Domain Adaptability:**
   - While effective in reasoning and dialogue tasks, it may struggle with highly specialized domains or out-of-distribution tasks.

7. **Multilingual Limitations:**
   - Although optimized for multilingual use, it may perform worse on certain low-resource languages than on high-resource ones.

8. **Ethical Concerns:**
   - May inadvertently generate inappropriate or harmful content if safeguards are not applied, particularly in sensitive applications.

9. **Real-Time Usability:**
   - Inference latency could limit its effectiveness in real-time applications or when scaling to large user bases.

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/prithivMLmods__Bellatrix-1.5B-xElite-details)!
Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=prithivMLmods%2FBellatrix-1.5B-xElite&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!

|      Metric       |Value (%)|
|-------------------|--------:|
|**Average**        |     9.55|
|IFEval (0-Shot)    |    19.64|
|BBH (3-Shot)       |     9.49|
|MATH Lvl 5 (4-Shot)|    12.61|
|GPQA (0-shot)      |     3.80|
|MuSR (0-shot)      |     4.44|
|MMLU-PRO (5-shot)  |     7.30|
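
The **Average** row in the table above is consistent with an unweighted mean of the six benchmark scores. The short check below is only a sketch under that assumption: the card does not state how the average was aggregated, and the helper name `leaderboard_average` is hypothetical. The scores are copied from the table.

```python
# Benchmark scores (percent) copied from the results table in this card.
scores = {
    "IFEval (0-Shot)": 19.64,
    "BBH (3-Shot)": 9.49,
    "MATH Lvl 5 (4-Shot)": 12.61,
    "GPQA (0-shot)": 3.80,
    "MuSR (0-shot)": 4.44,
    "MMLU-PRO (5-shot)": 7.30,
}


def leaderboard_average(values):
    """Unweighted mean rounded to two decimals (assumed aggregation, hypothetical helper)."""
    values = list(values)
    return round(sum(values) / len(values), 2)


average = leaderboard_average(scores.values())
print(f"Average: {average:.2f}")  # prints "Average: 9.55"
```

Under this assumption the computed value matches the reported 9.55, so each benchmark contributes equally regardless of its shot count or dataset size.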