---
base_model: google/gemma-3-4b-it
library_name: transformers
model_name: gemma3-4b-thinking
tags:
- generated_from_trainer
- trl
- grpo
- reasoning
- math
- step-by-step-thinking
license: gemma
---
# gemma3-4b-thinking
This model is a fine-tuned version of [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) trained to enhance its reasoning and step-by-step thinking capabilities. It has been trained using [TRL](https://github.com/huggingface/trl) with GRPO (Group Relative Policy Optimization).
## Model Description
This model was specifically tuned to demonstrate step-by-step reasoning when solving problems, particularly mathematical word problems. The training process used reinforcement learning to reward the model for:
- Providing clear reasoning steps
- Using logical deduction
- Arriving at the correct numerical answer (see the answer-checking sketch after this list)
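GSM8K-style reference answers put the final number after a `####` marker, so "arriving at the correct numerical answer" can be checked mechanically by comparing extracted numbers. A minimal sketch of such a check; the regex and helper name are illustrative, not the exact training code:

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number appearing in the text, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

# GSM8K stores the gold answer after a '####' marker
gold = "56 + 44 = 100 students in total. 100 / 4 = 25. #### 25"
completion = "Each classroom has (56 + 44) / 4 = 25 students."
assert extract_final_number(gold.split("####")[-1]) == extract_final_number(completion)
```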
## Quick start
```python
from transformers import pipeline, AutoProcessor

# Load the model and processor
processor = AutoProcessor.from_pretrained("real-jiakai/gemma3-4b-thinking")
generator = pipeline("text-generation", model="real-jiakai/gemma3-4b-thinking", tokenizer=processor.tokenizer)

# Example math problem
question = "The school principal decided that she wanted every class to have an equal number of boys and girls in each first-grade classroom. There are 4 classrooms. There are 56 boys and 44 girls. How many total students are in each classroom?"

# Format the input with the chat template, returning text (not token IDs)
# and appending the generation prompt
input_text = processor.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

# Generate a response with step-by-step reasoning
output = generator(input_text, max_new_tokens=1024)
print(output[0]["generated_text"])
```
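For reference, the expected final answer to this example is (56 + 44) / 4 = 100 / 4 = 25 students per classroom.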
## Model Performance
The model demonstrates enhanced reasoning capabilities compared to the base model, particularly for:
- Mathematical word problems
- Step-by-step logical deduction
- Breaking complex problems into solvable components
## Training Procedure
This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
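The key idea in GRPO is to sample a group of completions for each prompt and score each completion relative to the group average, which removes the need for a separate value model. A minimal sketch of the group-relative advantage computation (illustrative; TRL's internals differ in detail):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each completion's reward by its group's mean and standard deviation."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Four completions sampled for one prompt; two reached the correct answer
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # ~[1, -1, 1, -1]
```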
### Training Details
- **Dataset**: GSM8K (Grade School Math 8K), a dataset of diverse grade school math word problems
- **Fine-tuning Method**: GRPO (Group Relative Policy Optimization)
- **Training Steps**: 100
- **Batch Size**: 2
- **Learning Rate**: 5e-6
- **Hardware**: A100 80GB GPU
- **Parameter-Efficient Fine-Tuning**: LoRA with r=16, alpha=32 (see the configuration sketch after this list)
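Based on the hyperparameters above, a comparable run could be set up with TRL's `GRPOTrainer` and a PEFT `LoraConfig`. This is a hedged sketch, not the exact training script; the dataset mapping and the `correctness_reward` function (sketched under Reward Functions below) are assumptions:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a "prompt" column; GSM8K provides "question" and "answer"
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

training_args = GRPOConfig(
    output_dir="gemma3-4b-thinking",
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    max_steps=100,
)

trainer = GRPOTrainer(
    model="google/gemma-3-4b-it",
    reward_funcs=[correctness_reward],  # see the sketch under Reward Functions
    args=training_args,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()
```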
### Reward Functions
The training used multiple reward functions to guide the model (see the sketch after this list):
- Correctness of final answer
- Using proper numerical formats
- Demonstrating clear reasoning steps
- Following structured formats
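As an illustration, TRL reward functions receive the sampled completions (plus any dataset columns as keyword arguments) and return one score per completion. A sketch of the correctness and reasoning rewards described above, assuming plain-string completions and GSM8K's `####`-marked answers; the names, weights, and patterns are illustrative, not the exact training code:

```python
import re

def correctness_reward(completions, answer, **kwargs):
    """2.0 if the completion's last number matches the gold answer, else 0.0."""
    rewards = []
    for completion, gold in zip(completions, answer):
        gold_num = re.findall(r"-?\d+(?:\.\d+)?", gold.split("####")[-1])[-1]
        nums = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        rewards.append(2.0 if nums and nums[-1] == gold_num else 0.0)
    return rewards

def reasoning_reward(completions, **kwargs):
    """0.5 if the completion shows intermediate working (an '=' step), else 0.0."""
    return [0.5 if "=" in completion else 0.0 for completion in completions]
```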
### Framework versions
- TRL: 0.16.0.dev0
- Transformers: 4.50.0.dev0
- PyTorch: 2.6.0
- Datasets: 3.3.2
- Tokenizers: 0.21.1
## Limitations
- The model sometimes reverts to its base output format rather than following the structured reasoning format used during training
- Performance may vary across different types of problems
- The model is primarily optimized for mathematical reasoning and may not show the same level of improvement on other tasks
## Ethics and Responsible Use
- This model is intended to demonstrate reasoning capabilities and should not be used as a sole solution for educational assessments
- Users should verify mathematical results independently for critical applications
- The model can still make reasoning errors despite showing its work
## Citations
```bibtex
@article{gemma_2025,
    title={Gemma 3},
    url={https://goo.gle/Gemma3Report},
    publisher={Kaggle},
    author={Gemma Team},
    year={2025}
}

@article{shao2024deepseekmath,
    title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
    author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, Y. K. and Wu, Y. and others},
    journal={arXiv preprint arXiv:2402.03300},
    year={2024}
}
```