---
base_model: google/gemma-3-4b-it
library_name: transformers
model_name: trainer_output
tags:
- generated_from_trainer
- trl
- grpo
- reasoning
- math
- step-by-step-thinking
licence: license
---

# gemma3-4b-thinking

This model is a fine-tuned version of [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it), trained to enhance its reasoning and step-by-step thinking capabilities. It was trained using TRL with GRPO (Group Relative Policy Optimization).

## Model Description

This model was specifically tuned to demonstrate step-by-step reasoning when solving problems, particularly mathematical word problems. The training process used reinforcement learning to reward the model for:

- Providing clear reasoning steps
- Using logical deduction
- Arriving at the correct numerical answer

## Quick start

```python
from transformers import pipeline, AutoProcessor

# Load the model and processor
processor = AutoProcessor.from_pretrained("real-jiakai/gemma3-4b-thinking")
generator = pipeline(
    "text-generation",
    model="real-jiakai/gemma3-4b-thinking",
    tokenizer=processor.tokenizer,
)

# Example math problem
question = "The school principal decided that she wanted every class to have an equal number of boys and girls in each first-grade classroom. There are 4 classrooms. There are 56 boys and 44 girls. How many total students are in each classroom?"

# Format the input with the chat template, returning text with the generation prompt appended
input_text = processor.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

# Generate a response with step-by-step reasoning
output = generator(input_text, max_new_tokens=1024)
print(output[0]["generated_text"])
```

## Model Performance

The model demonstrates enhanced reasoning capabilities compared to the base model, particularly for:

- Mathematical word problems
- Step-by-step logical deduction
- Breaking complex problems into solvable components

## Training Procedure

This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/abs/2402.03300). A hedged configuration sketch follows the training details below.

### Training Details

- Dataset: GSM8K (Grade School Math 8K), a dataset of diverse grade school math word problems
- Fine-tuning Method: GRPO (Group Relative Policy Optimization)
- Training Steps: 100
- Batch Size: 2
- Learning Rate: 5e-6
- Hardware: A100 80GB GPU
- Parameter-Efficient Fine-Tuning: LoRA with r=16, alpha=32
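
The exact training script is not part of this card. As an illustration of how the hyperparameters above could be wired together, here is a minimal sketch using TRL's `GRPOTrainer`; the dataset mapping, the LoRA `task_type`, and the placeholder `correctness_reward` function are assumptions (a fuller reward sketch appears under Reward Functions below).

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# GSM8K questions become the "prompt" column expected by GRPOTrainer;
# the "answer" column is kept so reward functions can check correctness.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"]})

def correctness_reward(completions, **kwargs):
    # Placeholder reward; see the fuller sketch under "Reward Functions".
    return [0.0 for _ in completions]

# LoRA settings from the card (r=16, alpha=32); task_type is an assumption.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

# Hyperparameters from the card: 100 steps, batch size 2, learning rate 5e-6.
training_args = GRPOConfig(
    output_dir="trainer_output",
    max_steps=100,
    per_device_train_batch_size=2,
    learning_rate=5e-6,
)

trainer = GRPOTrainer(
    model="google/gemma-3-4b-it",
    reward_funcs=[correctness_reward],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```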

### Reward Functions

The training used multiple reward functions to guide the model (a hedged example follows the list):

- Correctness of final answer
- Using proper numerical formats
- Demonstrating clear reasoning steps
- Following structured formats
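
The reward implementations themselves are not included in this card. To illustrate the first criterion, here is a minimal sketch of a correctness reward in the shape TRL's `GRPOTrainer` expects (a callable that receives `completions` plus dataset columns and returns one score per completion); the `answer` column name, plain-string completions, and the GSM8K `####` answer format are assumptions.

```python
import re

def correctness_reward(completions, answer, **kwargs):
    """Return 1.0 when the last number in a completion matches the reference answer, else 0.0."""
    rewards = []
    for completion, ref in zip(completions, answer):
        # GSM8K references end with "#### <number>"; fall back to the raw string otherwise.
        ref_match = re.search(r"####\s*(-?[\d.,]+)", ref)
        target = ref_match.group(1).replace(",", "").rstrip(".") if ref_match else ref.strip()

        # Treat the last number in the completion as the model's final answer.
        numbers = re.findall(r"-?[\d.,]+", completion)
        predicted = numbers[-1].replace(",", "").rstrip(".") if numbers else ""

        rewards.append(1.0 if predicted == target else 0.0)
    return rewards
```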

### Framework versions

- TRL: 0.16.0.dev0
- Transformers: 4.50.0.dev0
- PyTorch: 2.6.0
- Datasets: 3.3.2
- Tokenizers: 0.21.1

## Limitations

- The model sometimes reverts to its base output format rather than following the structured reasoning format used during training
- Performance may vary across different types of problems
- The model is primarily optimized for mathematical reasoning and may not show the same level of improvement on other tasks

## Ethics and Responsible Use

- This model is intended to demonstrate reasoning capabilities and should not be used as the sole solution for educational assessments
- Users should verify mathematical results independently for critical applications
- The model can still make reasoning errors despite showing its work

## Citations

```bibtex
@article{gemma_2025,
  title={Gemma 3},
  url={https://goo.gle/Gemma3Report},
  publisher={Kaggle},
  author={Gemma Team},
  year={2025}
}
```

```bibtex
@article{shao2024deepseekmath,
  title={DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},
  author={Shao, Zhihong and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Song, Junxiao and Bi, Xiao and Zhang, Haowei and Zhang, Mingchuan and Li, YK and Wu, Y and others},
  journal={arXiv preprint arXiv:2402.03300},
  year={2024}
}
```