---
base_model: google/gemma-3-4b-it
library_name: transformers
model_name: trainer_output
tags:
- generated_from_trainer
- trl
- grpo
- reasoning
- math
- step-by-step-thinking
licence: license
---

# gemma3-4b-thinking

This model is a fine-tuned version of [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it), trained to enhance its reasoning and step-by-step thinking capabilities. It was trained using TRL with GRPO (Group Relative Policy Optimization).

## Model Description

This model was specifically tuned to demonstrate step-by-step reasoning when solving problems, particularly mathematical word problems. The training process used reinforcement learning to reward the model for:

- Providing clear reasoning steps
- Using logical deduction
- Arriving at the correct numerical answer

## Quick start

```python
from transformers import pipeline, AutoProcessor

# Load the model and processor
processor = AutoProcessor.from_pretrained("real-jiakai/gemma3-4b-thinking")
generator = pipeline(
    "text-generation",
    model="real-jiakai/gemma3-4b-thinking",
    tokenizer=processor.tokenizer,
)

# Example math problem
question = "The school principal decided that she wanted every class to have an equal number of boys and girls in each first-grade classroom. There are 4 classrooms. There are 56 boys and 44 girls. How many total students are in each classroom?"

# Format the input with the chat template, returning text with the generation prompt appended
input_text = processor.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

# Generate a response with step-by-step reasoning
output = generator(input_text, max_new_tokens=1024)
print(output[0]["generated_text"])
```

## Model Performance

The model demonstrates enhanced reasoning capabilities compared to the base model, particularly for:

- Mathematical word problems
- Step-by-step logical deduction
- Breaking complex problems into solvable components

## Training Procedure

This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). A hedged configuration sketch follows the training details below.

### Training Details

- Dataset: GSM8K (Grade School Math 8K), a dataset of diverse grade school math word problems
- Fine-tuning Method: GRPO (Group Relative Policy Optimization)
- Training Steps: 100
- Batch Size: 2
- Learning Rate: 5e-6
- Hardware: A100 80GB GPU
- Parameter-Efficient Fine-Tuning: LoRA with r=16, alpha=32
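
The exact training script is not part of this card. As an illustration of how the hyperparameters above could be wired together, here is a minimal sketch using TRL's `GRPOTrainer`; the dataset mapping, the LoRA `task_type`, and the placeholder `correctness_reward` function are assumptions (a fuller reward sketch appears under Reward Functions below).

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# GSM8K questions become the "prompt" column expected by GRPOTrainer;
# the "answer" column is kept so reward functions can check correctness.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

def correctness_reward(completions, **kwargs):
    # Placeholder reward; see the fuller sketch under "Reward Functions".
    return [0.0 for _ in completions]

# LoRA settings from the card (r=16, alpha=32); task_type is an assumption.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Hyperparameters from the card: 100 steps, batch size 2, learning rate 5e-6.
training_args = GRPOConfig(
    output_dir="trainer_output",
    max_steps=100,
    per_device_train_batch_size=2,
    learning_rate=5e-6,
)

trainer = GRPOTrainer(
    model="google/gemma-3-4b-it",
    reward_funcs=[correctness_reward],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```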

### Reward Functions

The training used multiple reward functions to guide the model (a hedged example follows the list):

- Correctness of final answer
- Using proper numerical formats
- Demonstrating clear reasoning steps
- Following structured formats
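
The reward implementations themselves are not included in this card. To illustrate the first criterion, here is a minimal sketch of a correctness reward in the shape TRL's `GRPOTrainer` expects (a callable that receives `completions` plus dataset columns and returns one score per completion); the `answer` column name, plain-string completions, and the GSM8K `####` answer format are assumptions.

```python
import re

def correctness_reward(completions, answer, **kwargs):
    """Return 1.0 when the last number in a completion matches the reference answer, else 0.0."""
    rewards = []
    for completion, ref in zip(completions, answer):
        # GSM8K references end with "#### <number>"; fall back to the raw string otherwise.
        ref_match = re.search(r"####\s*(-?[\d.,]+)", ref)
        target = ref_match.group(1).replace(",", "").rstrip(".") if ref_match else ref.strip()

        # Treat the last number in the completion as the model's final answer.
        numbers = re.findall(r"-?[\d.,]+", completion)
        predicted = numbers[-1].replace(",", "").rstrip(".") if numbers else ""

        rewards.append(1.0 if predicted == target else 0.0)
    return rewards
```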

### Framework versions

- TRL: 0.16.0.dev0
- Transformers: 4.50.0.dev0
- PyTorch: 2.6.0
- Datasets: 3.3.2
- Tokenizers: 0.21.1

## Limitations

- The model sometimes reverts to its base output format rather than following the structured reasoning format used during training
- Performance may vary across different types of problems
- The model is primarily optimized for mathematical reasoning and may not show the same level of improvement on other tasks

## Ethics and Responsible Use

- This model is intended to demonstrate reasoning capabilities and should not be used as the sole solution for educational assessments
- Users should verify mathematical results independently for critical applications
- The model can still make reasoning errors despite showing its work

## Citations

```bibtex
@article{gemma_2025,
  title={Gemma 3},
  url={https://goo.gle/Gemma3Report},
  publisher={Kaggle},
  author={Gemma Team},
  year={2025}
}
```

```bibtex
@article{shao2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Y and others},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```