# Model Card: InfiX-ai/InfiAlign-Qwen-7B-SFT
InfiAlign is a scalable and data-efficient post-training framework that combines supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with a high-quality data selection pipeline to enhance reasoning in large language models.
At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources.
When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks.
Applying Direct Preference Optimization (DPO) on top of the SFT model yields further improvements, most notably in mathematical reasoning: an average gain of 3.89% on the AIME 24/25 benchmarks.
## InfiAlign Model Series
The InfiAlign framework offers multiple variants tailored for different alignment strategies:
- InfiAlign-Qwen-7B-SFT: A 7B-parameter model optimized for math, code, and science reasoning. [You are here!]
- InfiAlign-Qwen-7B-DPO: Trained with Direct Preference Optimization for improved reasoning. [Stay Tuned!]
- InfiAlign-Qwen-7B-R1: Reinforcement learning variant for further reasoning capability. [Stay Tuned!]
## Model Description
- Model Name: InfiAlign-Qwen-7B-SFT
- Developed by: InfiX-ai
- Finetuned from model: Qwen2.5-Math-7B (base)
- Model Type: 7B-parameter decoder-only Transformer
- Context Length: 32K tokens
- License: Apache 2.0
## Training Details
### Training Data
To enable sample-efficient alignment via supervised fine-tuning, we curated InfiR-SFT-92K and InfiR-SFT-165K, two compact, high-quality instruction corpora of reasoning-focused QA pairs. These corpora are constructed from over 10M raw Q&A examples drawn from open-source reasoning datasets, including the following sources (a schematic sketch of the selection step is given after the list):
- OpenThoughts-114K: Reasoning and thought process examples
- OpenThoughts3-1.2M: Extended reasoning dataset
- AM-DeepSeek-R1-Distilled-1.4M: Distilled general reasoning
- Mixture-of-Thoughts: Distilled general reasoning
- Infinity-Instruct: General instruction following
- NuminaMath-CoT: Mathematical chain-of-thought reasoning
- OpenCodeReasoning: Code generation and reasoning
- Llama-Nemotron-Post-Training-Dataset: Post-training alignment data
- OpenScience: Scientific reasoning and explanations
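The exact quality metrics, thresholds, and weighting used by the InfiAlign selection pipeline are described in the paper rather than in this card. Purely as a schematic illustration of multidimensional filtering over raw Q&A pairs, the sketch below uses hypothetical metric names (`difficulty`, `diversity_score`, `verifier_pass`) and thresholds:

```python
# Schematic multi-metric data selection (hypothetical metric names and
# thresholds; the actual InfiAlign pipeline is described in the paper).
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    difficulty: float       # e.g. estimated problem difficulty in [0, 1]
    diversity_score: float  # e.g. embedding-based novelty vs. already-kept samples
    verifier_pass: bool     # e.g. whether the answer passes an automatic check

def select(pairs, max_size=92_000, min_difficulty=0.3, min_diversity=0.5):
    """Keep verified, sufficiently hard and novel samples, then truncate."""
    kept = [
        p for p in pairs
        if p.verifier_pass
        and p.difficulty >= min_difficulty
        and p.diversity_score >= min_diversity
    ]
    # Prefer harder samples when truncating to the target corpus size (e.g. 92K).
    kept.sort(key=lambda p: p.difficulty, reverse=True)
    return kept[:max_size]

demo = [QAPair("1+1=?", "2", 0.1, 0.9, True),
        QAPair("Prove x^2 >= 0 for real x.", "...", 0.7, 0.8, True)]
print(len(select(demo)))  # -> 1 (only the harder, verified sample survives)
```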
### Training Procedure
- Training regime: Supervised fine-tuning with two-stage curriculum learning
- Training stages (a minimal scheduling sketch follows the list):
  - Stage 1: Train on the roughly 70% of the data that is relatively simple (predominantly math and code instructions) to provide structured and accessible reasoning patterns.
  - Stage 2: Expand to the full corpus, incorporating more diverse and domain-specific instructions, especially from scientific and open-ended domains. First-stage samples are retained to ensure distributional continuity and avoid catastrophic forgetting.
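This card does not include the training code itself; the sketch below only illustrates how such a two-stage schedule could be wired up, assuming a generic `train(model, dataset)` routine and a per-sample `difficulty` score (both hypothetical):

```python
# Minimal two-stage curriculum sketch (hypothetical helpers; not the actual
# InfiAlign training code).
def two_stage_sft(model, corpus, train, simple_fraction=0.7):
    # Stage 1: the ~70% "simpler" slice (mostly math and code instructions).
    simple = sorted(corpus, key=lambda s: s["difficulty"])
    simple = simple[: int(simple_fraction * len(simple))]
    model = train(model, simple)

    # Stage 2: the full corpus, including harder science and open-ended samples;
    # Stage-1 samples stay in the mix to avoid catastrophic forgetting.
    model = train(model, corpus)
    return model
```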
### Training Hyperparameters
Hyperparameter | Value |
---|---|
Learning rate | 2e-5 |
Warmup Ratio | 0.05 |
Batch size | 16 |
Epochs | 5 |
Optimizer | AdamW |
Max length | 32768 |
Packing | True |
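The training stack is not specified in this card. Purely as an illustration, the table maps onto Hugging Face `TrainingArguments` roughly as follows; whether the batch size of 16 is per device or global, and how packing to 32,768 tokens is implemented, depend on the actual SFT framework and are assumptions here.

```python
from transformers import TrainingArguments

# Illustrative mapping of the hyperparameter table onto TrainingArguments;
# this is not the authors' actual configuration.
args = TrainingArguments(
    output_dir="infialign-qwen-7b-sft",
    learning_rate=2e-5,
    warmup_ratio=0.05,
    per_device_train_batch_size=16,  # assumed per-device; the card only states "16"
    num_train_epochs=5,
    optim="adamw_torch",             # AdamW
)
# The 32,768-token max length and sample packing are handled by the SFT
# framework (e.g. a packed-sequence data collator), not by TrainingArguments.
```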
## Evaluation
### Benchmarks
The model is evaluated on the following reasoning benchmarks:
- AIME 25/24: American Invitational Mathematics Examination
- MATH500: Mathematical reasoning benchmark
- GPQA: Graduate-level, Google-proof science questions (biology, physics, chemistry)
- MMLU-PRO: Professional-level multi-task language understanding
- SuperGPQA: Graduate-level knowledge and reasoning capabilities across 285 disciplines
- LiveCodeBench: Real-world coding challenges
### Benchmark Performance
Model | Data | AIME 25 | AIME 24 | MATH500 | GPQA | MMLU-PRO | SuperGPQA | LiveCodeBench (8/1/2024–2/1/2025) avg@8 / pass@1 | Avg. |
---|---|---|---|---|---|---|---|---|---|
DeepSeek-R1-Distill-Qwen-7B | 800K | 37.97 (Pass@4: 53.33%) | 54.95 (Pass@4: 80.00%) | 92.80 | 49.10 | 54.16 | 32.25 | 37.60 | 51.26 |
InfiAlign-Qwen-7B-SFT-92K | 92K | 43.39 (Pass@4: 63.33%) | 56.46 (Pass@4: 80.00%) | 92.35 (Pass@4: 95.60%) | 48.48 (Pass@4: 74.24%) | 53.51 | 30.90 | 34.05 | 51.30 |
InfiAlign-Qwen-7B-SFT-165K | 165K | 42.19 | 63.75 | 92.70 | 53.60 | 56.68 | 33.97 | 36.20 | 54.15 |
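The table above reports avg@k / Pass@k-style numbers but does not spell out the estimator. A common convention (assumed here, not confirmed by this card) is the unbiased pass@k estimate from n generated samples of which c are correct, alongside a simple avg@k, i.e. mean accuracy over k samples:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    drawn from n generated samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def avg_at_k(correct_flags) -> float:
    """avg@k: mean accuracy over k independent samples of the same problem."""
    return sum(correct_flags) / len(correct_flags)

# Example: 8 samples per problem, 3 of them correct.
print(round(pass_at_k(8, 3, 4), 4))        # chance that a batch of 4 contains a correct one
print(avg_at_k([1, 0, 0, 1, 0, 1, 0, 0]))  # 0.375
```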
## Quick Start
Here is a code snippet using `apply_chat_template` that shows how to load the tokenizer and model and generate content.
- Note: make sure the model's output begins with `<think>\n`; otherwise it may produce empty thoughts, which reduces output quality. Calling `apply_chat_template` with `add_generation_prompt=True` handles this automatically, but the decoded response may then be missing the opening `<think>` tag.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "InfiX-ai/InfiAlign-Qwen-7B-SFT"

# Load the model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use a raw string so LaTeX escapes such as \theta are not interpreted by Python.
prompt = r"Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$"
messages = [
    {"role": "user", "content": prompt}
]

# Build the chat-formatted prompt; add_generation_prompt=True appends the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens from the output before decoding.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
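Continuing from the snippet above, and as an assumption rather than official usage, a small defensive post-processing step can restore the opening `<think>` tag mentioned in the note before any downstream parsing:

```python
# Defensive post-processing (assumption, not official usage): make sure the
# decoded response starts with "<think>\n" before splitting the reasoning
# from the final answer.
if not response.lstrip().startswith("<think>"):
    response = "<think>\n" + response
# Final answer after the thought block (the whole text if no closing tag is present).
print(response.split("</think>")[-1].strip())
```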
## Intended Uses
### Direct Use
This model is intended for research and commercial use. Example use cases include:
- Instruction following
- Mathematical reasoning
- Code generation
- General reasoning
### Out-of-Scope Use
The model should not be used for:
- Generating harmful, offensive, or inappropriate content
- Creating misleading information
## Bias
As a post-trained language model optimized for reasoning, InfiAlign inherits and may amplify limitations common to large language models. While our framework prioritizes data efficiency and alignment, users should consider the following risks:
- Language & Reasoning Biases: InfiAlign is primarily trained on English datasets with curated reasoning tasks. Performance may degrade on low-resource languages or culturally nuanced prompts. Biases in underlying data (e.g., stereotypes in logical premises) could persist despite alignment efforts.
- Safety in Reasoning Contexts: The model might generate harmful, misleading, or unsafe content when solving complex problems (e.g., adversarial math or ethics puzzles). Rigorous output filtering is recommended for sensitive applications.
- Reinforcement Learning Pitfalls: While RL improves alignment, it may inadvertently reinforce undesirable behaviors due to reward misalignment or edge cases in the feedback pipeline. Monitor for over-optimization (e.g., "sycophantic" reasoning).
- Data Efficiency Trade-offs: High-quality data selection reduces low-quality outputs but cannot eliminate hallucinations or factual errors. Always verify critical claims, especially in domains like medicine or law.
## Citation
If you find this work helpful, please consider citing it:
```bibtex
@misc{cai2025infialignscalablesampleefficientframework,
      title={InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities},
      author={Shuo Cai and Su Lu and Qi Zhou and Kejing Yang and Zhijie Sang and Congkai Xie and Hongxia Yang},
      year={2025},
      eprint={2508.05496},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.05496},
}
```