πŸ€– Model Card: InfiX-ai/InfiAlign-Qwen-7B-SFT

arXiv Paper | Hugging Face Paper | Hugging Face SFT Model | Hugging Face DPO Model | GitHub Repository

InfiAlign is a scalable and data-efficient post-training framework that combines supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with a high-quality data selection pipeline to enhance reasoning in large language models.

At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources.
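For illustration, the sketch below shows one way such a multi-metric selection step could look: each candidate sample is scored along several quality dimensions and only the top-ranked fraction is kept. The scorers, weights, and field names here are placeholder assumptions, not the paper's actual metrics.

from typing import Callable

def select_samples(samples: list[dict],
                   metrics: dict[str, Callable[[dict], float]],
                   weights: dict[str, float],
                   keep_ratio: float = 0.1) -> list[dict]:
    """Rank samples by a weighted sum of per-dimension quality scores and keep the top fraction."""
    def score(sample: dict) -> float:
        return sum(weights[name] * metric(sample) for name, metric in metrics.items())
    ranked = sorted(samples, key=score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_ratio))]

# Hypothetical scorers standing in for the pipeline's multidimensional quality metrics.
metrics = {
    "response_length": lambda s: min(len(s["answer"]) / 4096, 1.0),
    "shows_steps": lambda s: float("step" in s["answer"].lower()),
}
weights = {"response_length": 0.5, "shows_steps": 0.5}

pool = [
    {"question": "2 + 2 = ?", "answer": "Step 1: add the numbers. The result is 4."},
    {"question": "Capital of France?", "answer": "Paris."},
]
print(select_samples(pool, metrics, weights, keep_ratio=0.5))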

When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks.

Applying Direct Preference Optimization (DPO) on top of the SFT model yields further gains, most notably in mathematical reasoning, with an average improvement of 3.89% on the AIME 24/25 benchmarks.

πŸš€ InfiAlign Model Series

The InfiAlign framework offers multiple variants tailored for different alignment strategies:

  • InfiAlign-Qwen-7B-SFT: A 7B-parameter model optimized for math, code, and science reasoning. [You are here!]
  • InfiAlign-Qwen-7B-DPO: Trained with Direct Preference Optimization for improved reasoning. [Stay Tuned!]
  • InfiAlign-Qwen-7B-R1: Reinforcement learning variant for further reasoning capability. [Stay Tuned!]

πŸ“‹ Model Description

  • Model Name: InfiAlign-Qwen-7B-SFT
  • Developed by: InfiX-ai
  • Finetuned from model: Qwen2.5-7B-math-base
  • Model Type: 7B-parameter decoder-only Transformer
  • Context Length: 32K tokens
  • License: Apache 2.0

πŸ‹οΈ Training Details

πŸ“Š Training Data

To enable sample-efficient alignment via supervised fine-tuning, we curated InfiR-SFT-92K and InfiR-SFT-165K, two compact, high-quality instruction corpora of reasoning-focused QA pairs. These corpora are distilled from over 10M raw Q&A examples drawn from open-source reasoning datasets (a minimal loading sketch follows the list below), including:

  • OpenThoughts-114K: Reasoning and thought process examples
  • OpenThoughts3-1.2M: Extended reasoning dataset
  • AM-DeepSeek-R1-Distilled-1.4M: Distilled general reasoning
  • Mixture-of-Thoughts: Distilled general reasoning
  • Infinity-Instruct: General instruction following
  • NuminaMath-CoT: Mathematical chain-of-thought reasoning
  • OpenCodeReasoning: Code generation and reasoning
  • Llama-Nemotron-Post-Training-Dataset: Post-training alignment data
  • OpenScience: Scientific reasoning and explanations

πŸ—οΈ Training Procedure

  • Training regime: Supervised fine-tuning with two-stage curriculum learning
  • Training stages:
    • Stage 1: Train on 70% relatively simple data (predominantly math and code instructions) to provide structured and accessible reasoning patterns.
    • Stage 2: Expand to the full corpus, incorporating more diverse and domain-specific instructions, especially from scientific and open-ended domains. First-stage samples are retained to ensure distributional continuity and avoid catastrophic forgetting.

βš™οΈ Training Hyperparameters

| Hyperparameter | Value |
|----------------|-------|
| Learning rate  | 2e-5  |
| Warmup ratio   | 0.05  |
| Batch size     | 16    |
| Epochs         | 5     |
| Optimizer      | AdamW |
| Max length     | 32768 |
| Packing        | True  |
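As a rough guide, these settings map onto Hugging Face TrainingArguments as sketched below. Sequence packing and the 32768-token max length are typically handled by the SFT framework (for example TRL's SFTTrainer) rather than by TrainingArguments, and the BF16 setting is an assumption based on the released weights, not a value from the table.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs/infialign-qwen-7b-sft",
    learning_rate=2e-5,
    warmup_ratio=0.05,
    per_device_train_batch_size=16,  # "Batch size" row above
    num_train_epochs=5,
    optim="adamw_torch",             # AdamW
    bf16=True,                       # assumed; matches the released BF16 checkpoints
)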

πŸ“Š Evaluation

🎯 Benchmarks

The model is evaluated on the following reasoning benchmarks:

  • AIME 25/24: American Invitational Mathematics Examination
  • MATH500: Mathematical reasoning benchmark
  • GPQA: Graduate-level physics questions
  • MMLU-PRO: Professional-level multi-task language understanding
  • SuperGPQA: Graduate-level knowledge and reasoning capabilities across 285 disciplines
  • LiveCodeBench: Real-world coding challenges

πŸ† Benchmark Performance

Scores are reported as avg@8 / pass@1; Pass@4 results are shown in parentheses where available.

| Model | Data | AIME 25 | AIME 24 | MATH500 | GPQA | MMLU-PRO | SuperGPQA | LiveCodeBench (8/1/2024–2/1/2025) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-7B | 800K | 37.97 (Pass@4: 53.33%) | 54.95 (Pass@4: 80.00%) | 92.80 | 49.10 | 54.16 | 32.25 | 37.60 | 51.26 |
| InfiAlign-Qwen-7B-SFT-92K | 92K | 43.39 (Pass@4: 63.33%) | 56.46 (Pass@4: 80.00%) | 92.35 (Pass@4: 95.60%) | 48.48 (Pass@4: 74.24%) | 53.51 | 30.90 | 34.05 | 51.30 |
| InfiAlign-Qwen-7B-SFT-165K | 165K | 42.19 | 63.75 | 92.70 | 53.60 | 56.68 | 33.97 | 36.20 | 54.15 |
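As an illustration (not the authors' evaluation harness), avg@k and the standard unbiased pass@k estimator from Chen et al. (2021) can be computed from n sampled completions per problem as follows.

from math import comb

def avg_at_k(num_correct: int, n: int) -> float:
    """Average accuracy over n samples (e.g., avg@8 with n = 8)."""
    return num_correct / n

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k samples is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 samples drawn for one AIME problem, 3 of them correct.
print(avg_at_k(3, 8))      # 0.375
print(pass_at_k(8, 3, 4))  # ~0.93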

Quick Start

Here is a code snippet using apply_chat_template that shows how to load the tokenizer and model and how to generate content.

  • PS: Make sure the model starts with "<think>\n" to avoid generating empty thoughts, which will reduce the output quality. If you use "apply_chat_template" and set "add_generation_prompt=True", this will be automatically implemented, but this may result in a missing "<think>" label at the beginning of the response.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "InfiX-ai/InfiAlign-Qwen-7B-SFT"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
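Continuing from the snippet above, the following optional post-processing sketch separates the reasoning from the final answer. It assumes the generated text contains the reasoning followed by a closing "</think>" tag; this is an assumption about the output format rather than guaranteed behavior.

# The chat template already opens the thinking block, so the model's output
# usually consists of the reasoning, a closing </think> tag, and the answer.
if "</think>" in response:
    thinking, answer = response.split("</think>", 1)
else:
    thinking, answer = "", response

print("Reasoning:", thinking.strip())
print("Answer:", answer.strip())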

🎯 Intended Uses

βœ… Direct Use

This model is intended for research and commercial use. Example use cases include:

  • Instruction following
  • Mathematical reasoning
  • Code generation
  • General reasoning

❌ Out-of-Scope Use

The model should not be used for:

  • Generating harmful, offensive, or inappropriate content
  • Creating misleading information

🎭 Bias

As a post-trained language model optimized for reasoning, InfiAlign inherits and may amplify limitations common to large language models. While our framework prioritizes data efficiency and alignment, users should consider the following risks:

  • Language & Reasoning Biases: InfiAlign is primarily trained on English datasets with curated reasoning tasks. Performance may degrade on low-resource languages or culturally nuanced prompts. Biases in underlying data (e.g., stereotypes in logical premises) could persist despite alignment efforts.
  • Safety in Reasoning Contexts: The model might generate harmful, misleading, or unsafe content when solving complex problems (e.g., adversarial math or ethics puzzles). Rigorous output filtering is recommended for sensitive applications.
  • Reinforcement Learning Pitfalls: While RL improves alignment, it may inadvertently reinforce undesirable behaviors due to reward misalignment or edge cases in the feedback pipeline. Monitor for over-optimization (e.g., "sycophantic" reasoning).
  • Data Efficiency Trade-offs: High-quality data selection reduces low-quality outputs but cannot eliminate hallucinations or factual errors. Always verify critical claims, especially in domains like medicine or law.

πŸ“š Citation

If you find this work helpful, please consider citing it:

@misc{cai2025infialignscalablesampleefficientframework,
      title={InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities}, 
      author={Shuo Cai and Su Lu and Qi Zhou and Kejing Yang and Zhijie Sang and Congkai Xie and Hongxia Yang},
      year={2025},
      eprint={2508.05496},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.05496}, 
}