---
base_model: google/gemma-3-4b-it
library_name: transformers
model_name: trainer_output
tags:
- generated_from_trainer
- trl
- grpo
- reasoning
- math
- step-by-step-thinking
licence: license
---

# gemma3-4b-thinking

This model is a fine-tuned version of [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) trained to enhance its reasoning and step-by-step thinking capabilities. It was trained using [TRL](https://github.com/huggingface/trl) with GRPO (Group Relative Policy Optimization).

## Model Description

This model was tuned specifically to demonstrate step-by-step reasoning when solving problems, particularly mathematical word problems. The training process used reinforcement learning to reward the model for:

- Providing clear reasoning steps
- Using logical deduction
- Arriving at the correct numerical answer

## Quick start

```python
from transformers import AutoProcessor, pipeline

# Load the processor (tokenizer + chat template) and a text-generation pipeline
processor = AutoProcessor.from_pretrained("real-jiakai/gemma3-4b-thinking")
generator = pipeline(
    "text-generation",
    model="real-jiakai/gemma3-4b-thinking",
    tokenizer=processor.tokenizer,
)

# Example math problem
question = (
    "The school principal decided that she wanted every class to have an equal "
    "number of boys and girls in each first-grade classroom. There are 4 classrooms. "
    "There are 56 boys and 44 girls. How many total students are in each classroom?"
)

# Format the input with the chat template, appending the assistant turn marker
input_text = processor.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

# Generate a response with step-by-step reasoning
output = generator(input_text, max_new_tokens=1024)
print(output[0]["generated_text"])
```

## Model Performance

The model demonstrates enhanced reasoning capabilities compared to the base model, particularly for:

- Mathematical word problems
- Step-by-step logical deduction
- Breaking complex problems into solvable components

## Training Procedure

This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
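The exact training script is not included in this card. The sketch below shows how a comparable GRPO run can be set up with TRL's `GRPOTrainer`, using the hyperparameters listed under Training Details below; the `to_prompt` mapping, the `correctness_reward` and `steps_reward` functions, and the `num_generations`/`max_completion_length` values are illustrative assumptions, not the exact implementations used for this model.

```python
import re

from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# GSM8K answers end in "#### <number>"; keep that number as a "gold" column
# and expose the question as the "prompt" column GRPOTrainer expects.
def to_prompt(example):
    return {
        "prompt": example["question"],
        "gold": example["answer"].split("####")[-1].strip(),
    }

dataset = load_dataset("openai/gsm8k", "main", split="train").map(to_prompt)

def correctness_reward(completions, gold, **kwargs):
    """1.0 when the last number in the completion matches the gold answer."""
    rewards = []
    for completion, answer in zip(completions, gold):
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        rewards.append(1.0 if numbers and numbers[-1] == answer else 0.0)
    return rewards

def steps_reward(completions, **kwargs):
    """Small bonus for multi-line answers, a crude proxy for visible reasoning steps."""
    return [0.5 if completion.count("\n") >= 2 else 0.0 for completion in completions]

training_args = GRPOConfig(
    output_dir="trainer_output",
    per_device_train_batch_size=2,
    learning_rate=5e-6,
    max_steps=100,
    num_generations=2,          # completions sampled per prompt (assumed value)
    max_completion_length=512,  # assumed generation budget per completion
)

trainer = GRPOTrainer(
    model="google/gemma-3-4b-it",
    reward_funcs=[correctness_reward, steps_reward],
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```

GRPO samples several completions per prompt and scores each one relative to its group, so the effective batch size must be divisible by `num_generations`; reward functions receive the batch's completions plus any extra dataset columns (here `gold`) as keyword arguments.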
### Training Details

- **Dataset**: GSM8K (Grade School Math 8K), a dataset of diverse grade school math word problems
- **Fine-tuning Method**: GRPO (Group Relative Policy Optimization)
- **Training Steps**: 100
- **Batch Size**: 2
- **Learning Rate**: 5e-6
- **Hardware**: A100 80GB GPU
- **Parameter-Efficient Fine-Tuning**: LoRA with r=16, alpha=32

### Reward Functions

The training used multiple reward functions to guide the model (illustrated in the sketch under Training Procedure above):

- Correctness of the final answer
- Using proper numerical formats
- Demonstrating clear reasoning steps
- Following structured formats

### Framework versions

- TRL: 0.16.0.dev0
- Transformers: 4.50.0.dev0
- PyTorch: 2.6.0
- Datasets: 3.3.2
- Tokenizers: 0.21.1

## Limitations

- The model sometimes reverts to its base output format rather than following the structured reasoning format used during training
- Performance may vary across different types of problems
- The model is optimized primarily for mathematical reasoning and may not show the same level of improvement on other tasks

## Ethics and Responsible Use

- This model is intended to demonstrate reasoning capabilities and should not be used as the sole solution for educational assessments
- Users should verify mathematical results independently for critical applications
- The model can still make reasoning errors despite showing its work

## Citations

```bibtex
@article{gemma_2025,
    title={Gemma 3},
    url={https://goo.gle/Gemma3Report},
    publisher={Kaggle},
    author={Gemma Team},
    year={2025}
}

@article{shao2024deepseekmath,
    title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
    author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Y and others},
    journal={arXiv preprint arXiv:2402.03300},
    year={2024}
}
```