# Model Card: InfiX-ai/InfiAlign-Qwen-7B-SFT
InfiAlign is a scalable and data-efficient post-training framework that combines supervised fine-tuning (SFT) and Direct Preference Optimization (DPO) with a high-quality data selection pipeline to enhance reasoning in large language models.
At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets using multidimensional quality metrics. This pipeline enables significant performance gains while drastically reducing data requirements and remains extensible to new data sources.
When applied to the Qwen2.5-Math-7B-Base model, our SFT model achieves performance on par with DeepSeek-R1-Distill-Qwen-7B, while using only approximately 12% of the training data, and demonstrates strong generalization across diverse reasoning tasks.
Applying Direct Preference Optimization (DPO) on top of the SFT model yields further improvements, most notably in mathematical reasoning: an average gain of 3.89% on the AIME 24/25 benchmarks.
## InfiAlign Model Series
The InfiAlign framework offers multiple variants tailored for different alignment strategies:
- InfiAlign-Qwen-7B-SFT: A 7B-parameter model optimized for math, code, and science reasoning. [You are here!]
- InfiAlign-Qwen-7B-DPO: Trained with Direct Preference Optimization for improved reasoning. [Stay Tuned!]
- InfiAlign-Qwen-7B-R1: Reinforcement learning variant for further reasoning capability. [Stay Tuned!]
## Model Description
- Model Name: InfiAlign-Qwen-7B-SFT
- Developed by: InfiX-ai
- Finetuned from model: Qwen2.5-Math-7B (base)
- Model Type: 7B-parameter decoder-only Transformer
- Context Length: 32K tokens
- License: Apache 2.0
## Training Details
### Training Data
To enable sample-efficient alignment via supervised fine-tuning, we curated InfiR-SFT-92K and InfiR-SFT-165K, two compact, high-quality instruction corpora of reasoning-focused QA pairs. These corpora are constructed from over 10M raw Q&A examples drawn from open-source reasoning datasets, including the following sources (a schematic sketch of the selection step is given after the list):
- OpenThoughts-114K: Reasoning and thought process examples
- OpenThoughts3-1.2M: Extended reasoning dataset
- AM-DeepSeek-R1-Distilled-1.4M: Distilled general reasoning
- Mixture-of-Thoughts: Distilled general reasoning
- Infinity-Instruct: General instruction following
- NuminaMath-CoT: Mathematical chain-of-thought reasoning
- OpenCodeReasoning: Code generation and reasoning
- Llama-Nemotron-Post-Training-Dataset: Post-training alignment data
- OpenScience: Scientific reasoning and explanations
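The exact quality metrics, thresholds, and weighting used by the InfiAlign selection pipeline are described in the paper rather than in this card. Purely as a schematic illustration of multidimensional filtering over raw Q&A pairs, the sketch below uses hypothetical metric names (`difficulty`, `diversity_score`, `verifier_pass`) and thresholds:

```python
# Schematic multi-metric data selection (hypothetical metric names and
# thresholds; the actual InfiAlign pipeline is described in the paper).
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    difficulty: float       # e.g. estimated problem difficulty in [0, 1]
    diversity_score: float  # e.g. embedding-based novelty vs. already-kept samples
    verifier_pass: bool     # e.g. whether the answer passes an automatic check

def select(pairs, max_size=92_000, min_difficulty=0.3, min_diversity=0.5):
    """Keep verified, sufficiently hard and novel samples, then truncate."""
    kept = [
        p for p in pairs
        if p.verifier_pass
        and p.difficulty >= min_difficulty
        and p.diversity_score >= min_diversity
    ]
    # Prefer harder samples when truncating to the target corpus size (e.g. 92K).
    kept.sort(key=lambda p: p.difficulty, reverse=True)
    return kept[:max_size]

demo = [QAPair("1+1=?", "2", 0.1, 0.9, True),
        QAPair("Prove x^2 >= 0 for real x.", "...", 0.7, 0.8, True)]
print(len(select(demo)))  # -> 1 (only the harder, verified sample survives)
```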
### Training Procedure
- Training regime: Supervised fine-tuning with two-stage curriculum learning
- Training stages (a minimal scheduling sketch follows the list):
  - Stage 1: Train on the roughly 70% of the data that is relatively simple (predominantly math and code instructions) to provide structured and accessible reasoning patterns.
  - Stage 2: Expand to the full corpus, incorporating more diverse and domain-specific instructions, especially from scientific and open-ended domains. First-stage samples are retained to ensure distributional continuity and avoid catastrophic forgetting.
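This card does not include the training code itself; the sketch below only illustrates how such a two-stage schedule could be wired up, assuming a generic `train(model, dataset)` routine and a per-sample `difficulty` score (both hypothetical):

```python
# Minimal two-stage curriculum sketch (hypothetical helpers; not the actual
# InfiAlign training code).
def two_stage_sft(model, corpus, train, simple_fraction=0.7):
    # Stage 1: the ~70% "simpler" slice (mostly math and code instructions).
    simple = sorted(corpus, key=lambda s: s["difficulty"])
    simple = simple[: int(simple_fraction * len(simple))]
    model = train(model, simple)

    # Stage 2: the full corpus, including harder science and open-ended samples;
    # Stage-1 samples stay in the mix to avoid catastrophic forgetting.
    model = train(model, corpus)
    return model
```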
### Training Hyperparameters
Hyperparameter | Value |
---|---|
Learning rate | 2e-5 |
Warmup Ratio | 0.05 |
Batch size | 16 |
Epochs | 5 |
Optimizer | AdamW |
Max length | 32768 |
Packing | True |
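The training stack is not specified in this card. Purely as an illustration, the table maps onto Hugging Face `TrainingArguments` roughly as follows; whether the batch size of 16 is per device or global, and how packing to 32,768 tokens is implemented, depend on the actual SFT framework and are assumptions here.

```python
from transformers import TrainingArguments

# Illustrative mapping of the hyperparameter table onto TrainingArguments;
# this is not the authors' actual configuration.
args = TrainingArguments(
    output_dir="infialign-qwen-7b-sft",
    learning_rate=2e-5,
    warmup_ratio=0.05,
    per_device_train_batch_size=16,  # assumed per-device; the card only states "16"
    num_train_epochs=5,
    optim="adamw_torch",             # AdamW
)
# The 32,768-token max length and sample packing are handled by the SFT
# framework (e.g. a packed-sequence data collator), not by TrainingArguments.
```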
## Evaluation
### Benchmarks
The model is evaluated on the following reasoning benchmarks:
- AIME 25/24: American Invitational Mathematics Examination
- MATH500: Mathematical reasoning benchmark
- GPQA: Graduate-level, Google-proof science questions (biology, physics, chemistry)
- MMLU-PRO: Professional-level multi-task language understanding
- SuperGPQA: Graduate-level knowledge and reasoning capabilities across 285 disciplines
- LiveCodeBench: Real-world coding challenges
### Benchmark Performance
Model | Data | AIME 25 | AIME 24 | MATH500 | GPQA | MMLU-PRO | SuperGPQA | LiveCodeBench (8/1/2024–2/1/2025) avg@8 / pass@1 | Avg. |
---|---|---|---|---|---|---|---|---|---|
DeepSeek-R1-Distill-Qwen-7B | 800K | 37.97 (Pass@4: 53.33%) | 54.95 (Pass@4: 80.00%) | 92.80 | 49.10 | 54.16 | 32.25 | 37.60 | 51.26 |
InfiAlign-Qwen-7B-SFT-92K | 92K | 43.39 (Pass@4: 63.33%) | 56.46 (Pass@4: 80.00%) | 92.35 (Pass@4: 95.60%) | 48.48 (Pass@4: 74.24%) | 53.51 | 30.90 | 34.05 | 51.30 |
InfiAlign-Qwen-7B-SFT-165K | 165K | 42.19 | 63.75 | 92.70 | 53.60 | 56.68 | 33.97 | 36.20 | 54.15 |
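The table above reports avg@k / Pass@k-style numbers but does not spell out the estimator. A common convention (assumed here, not confirmed by this card) is the unbiased pass@k estimate from n generated samples of which c are correct, alongside a simple avg@k, i.e. mean accuracy over k samples:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions
    drawn from n generated samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def avg_at_k(correct_flags) -> float:
    """avg@k: mean accuracy over k independent samples of the same problem."""
    return sum(correct_flags) / len(correct_flags)

# Example: 8 samples per problem, 3 of them correct.
print(round(pass_at_k(8, 3, 4), 4))        # chance that a batch of 4 contains a correct one
print(avg_at_k([1, 0, 0, 1, 0, 1, 0, 0]))  # 0.375
```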
## Quick Start
Here is a code snippet using `apply_chat_template` that shows how to load the tokenizer and model and generate content.
- Note: make sure the model's output begins with `<think>\n`; otherwise it may produce empty thoughts, which reduces output quality. Calling `apply_chat_template` with `add_generation_prompt=True` handles this automatically, but the decoded response may then be missing the opening `<think>` tag.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "InfiX-ai/InfiAlign-Qwen-7B-SFT"

# Load the model and tokenizer.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Use a raw string so LaTeX escapes such as \theta are not interpreted by Python.
prompt = r"Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\theta),$ where $r > 0$ and $0 \le \theta < 2 \pi.$"
messages = [
    {"role": "user", "content": prompt}
]

# Build the chat-formatted prompt; add_generation_prompt=True appends the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens from the output before decoding.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
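Continuing from the snippet above, and as an assumption rather than official usage, a small defensive post-processing step can restore the opening `<think>` tag mentioned in the note before any downstream parsing:

```python
# Defensive post-processing (assumption, not official usage): make sure the
# decoded response starts with "<think>\n" before splitting the reasoning
# from the final answer.
if not response.lstrip().startswith("<think>"):
    response = "<think>\n" + response
# Final answer after the thought block (the whole text if no closing tag is present).
print(response.split("</think>")[-1].strip())
```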
## Intended Uses
### Direct Use
This model is intended for research and commercial use. Example use cases include:
- Instruction following
- Mathematical reasoning
- Code generation
- General reasoning
### Out-of-Scope Use
The model should not be used for:
- Generating harmful, offensive, or inappropriate content
- Creating misleading information
## Bias
As a post-trained language model optimized for reasoning, InfiAlign inherits and may amplify limitations common to large language models. While our framework prioritizes data efficiency and alignment, users should consider the following risks:
- Language & Reasoning Biases: InfiAlign is primarily trained on English datasets with curated reasoning tasks. Performance may degrade on low-resource languages or culturally nuanced prompts. Biases in underlying data (e.g., stereotypes in logical premises) could persist despite alignment efforts.
- Safety in Reasoning Contexts: The model might generate harmful, misleading, or unsafe content when solving complex problems (e.g., adversarial math or ethics puzzles). Rigorous output filtering is recommended for sensitive applications.
- Reinforcement Learning Pitfalls: While RL improves alignment, it may inadvertently reinforce undesirable behaviors due to reward misalignment or edge cases in the feedback pipeline. Monitor for over-optimization (e.g., "sycophantic" reasoning).
- Data Efficiency Trade-offs: High-quality data selection reduces low-quality outputs but cannot eliminate hallucinations or factual errors. Always verify critical claims, especially in domains like medicine or law.
## Citation
If you find this work helpful, please consider citing it:
```bibtex
@misc{cai2025infialignscalablesampleefficientframework,
      title={InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities},
      author={Shuo Cai and Su Lu and Qi Zhou and Kejing Yang and Zhijie Sang and Congkai Xie and Hongxia Yang},
      year={2025},
      eprint={2508.05496},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.05496},
}
```