---
base_model:
- unsloth/Qwen2.5-7B-Instruct-bnb-4bit
tags:
- transformers
- unsloth
- trl
- qwen2.5
- lora
license: apache-2.0
language:
- en
- zh
datasets:
- openai/gsm8k
pipeline_tag: text-generation
library_name: peft
---

This model was trained on the GSM8K dataset with reinforcement learning, learning to generate reasoning chains and XML-formatted outputs even though the training signal contains no intermediate steps. A reward function guides the model, prioritizing answer correctness and adherence to the XML format (a hedged sketch of such a reward appears under **Reward Function Sketch** below).

**Training Details:**

* Dataset: GSM8K
* Algorithm: GRPO (see the training sketch below)
* Hardware: single NVIDIA GeForce RTX 3090 Ti
* Training duration: 250 epochs, ~48 minutes

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b36c0a26893eb6a6e63da3/r8Fz5cQtx38wcoZLDKQ_0.png)

**Limitations:**

The output length limit (200 tokens) restricts the model's ability to generate complex reasoning chains, which also makes it hard to observe output length growing during training.

**Example:**

Which one is bigger? 9.11 or 9.8?

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b36c0a26893eb6a6e63da3/gbfcQXMLOn-n_CsbSVpy7.png)

This Qwen2.5 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
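**Reward Function Sketch:**

The actual reward code is not published with this card; below is a minimal sketch of the kind of reward described above, assuming the model is asked to answer inside `<reasoning>...</reasoning><answer>...</answer>` tags. The tag names and weights are illustrative, not the author's code, but the shape matches the description: a small bonus for XML format adherence and a larger one for a correct answer.

```python
import re

# Illustrative XML schema; the actual tag names used in training are an assumption.
XML_PATTERN = re.compile(
    r"<reasoning>.*?</reasoning>\s*<answer>(.*?)</answer>", re.DOTALL
)

def reward(completion: str, gold_answer: str) -> float:
    """Score one completion: format adherence plus answer correctness."""
    score = 0.0
    match = XML_PATTERN.search(completion)
    if match:
        score += 0.5  # partial credit: output follows the expected XML layout
        if match.group(1).strip() == gold_answer.strip():
            score += 2.0  # larger credit: extracted answer matches the label
    return score

# A well-formatted, correct completion earns the full reward (2.5 here).
print(reward(
    "<reasoning>9.8 > 9.11 because 0.80 > 0.11</reasoning><answer>9.8</answer>",
    "9.8",
))
```

Weighting correctness well above formatting matches the card's stated priority: the format bonus shapes outputs early on, while the correctness term dominates the learning signal.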
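**Training Sketch:**

The card states only the essentials (GSM8K, GRPO, a 200-token completion cap, ~48 minutes on a 3090 Ti). The following is a hedged sketch of how such a run looks with TRL's `GRPOTrainer`; all hyperparameters, the LoRA settings, and the crude substring-based correctness reward are illustrative assumptions, not the configuration actually used.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# GSM8K labels end with "#### <number>"; keep only the final number as the gold answer.
def to_prompt(example):
    return {
        "prompt": example["question"],
        "gold": example["answer"].split("####")[-1].strip(),
    }

dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_prompt)

def correctness_reward(completions, gold, **kwargs):
    # TRL passes extra dataset columns (here: "gold") as keyword arguments.
    # Substring matching is a crude stand-in for real answer extraction.
    return [2.0 if g in c else 0.0 for c, g in zip(completions, gold)]

config = GRPOConfig(
    output_dir="qwen2.5-gsm8k-grpo",  # illustrative
    max_completion_length=200,        # matches the cap noted under Limitations
    num_generations=8,                # completions sampled per prompt (assumption)
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model="unsloth/Qwen2.5-7B-Instruct-bnb-4bit",
    reward_funcs=correctness_reward,
    args=config,
    train_dataset=dataset,
    # LoRA adapter on top of the 4-bit base; rank/alpha are assumptions.
    peft_config=LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32),
)
trainer.train()
```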
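**Usage:**

A hedged inference sketch: since this repository hosts a LoRA adapter (`library_name: peft`), one way to use it is to load the base model and attach the adapter with PEFT. `ADAPTER_ID` is a placeholder for this repository's Hub id, and the chat-template prompt format is an assumption based on the training description above.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "unsloth/Qwen2.5-7B-Instruct-bnb-4bit"
ADAPTER_ID = "<this-repo-id>"  # placeholder: replace with this model's Hub id

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER_ID)  # attach the LoRA adapter

messages = [{"role": "user", "content": "Which one is bigger? 9.11 or 9.8?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Same 200-token cap the model saw during training.
outputs = model.generate(inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```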