L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning
L1-Qwen-1.5B-Max Model Introduction
Model Overview
L1-Qwen-1.5B-Max is a reasoning language model optimized with reinforcement learning that generates reasoning chains whose length follows a user-specified constraint. Trained with Length Controlled Policy Optimization (LCPO), which rewards both answer correctness and adherence to the requested length, the model trades off reasoning performance against output length to deliver strong results under a given computational budget.
Model Features
- Precise Length Control: L1-Qwen-1.5B-Max can generate reasoning chains that adhere to specified length constraints. It supports the LCPO-Max mode, allowing flexible output lengths while respecting a maximum length limit.
- Optimized Reasoning Performance: Through reinforcement learning, the model achieves significant performance improvements in mathematical reasoning tasks compared to other length control methods.
- Wide Applicability: L1-Qwen-1.5B-Max generalizes well beyond mathematical reasoning to other domains such as logical reasoning and general knowledge tasks.
- Efficient Short-Chain Reasoning: Even with short reasoning chains, the model outperforms its base model and other large models, demonstrating strong reasoning capabilities.
Model Architecture
L1-Qwen-1.5B-Max is fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B. Using LCPO, the model is optimized jointly for reasoning correctness and adherence to the length constraint during training, enabling precise control over reasoning-chain length.
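To make the training objective concrete, here is a minimal sketch of how an LCPO-style reward can combine answer correctness with a length term, covering both an exact-length regime and the maximum-length ("Max") regime this model uses. The function names, coefficients, and clipping scheme below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of LCPO-style rewards (NOT the authors' code).
# alpha and delta are assumed hyperparameters; all names are hypothetical.

def lcpo_exact_reward(is_correct: bool, n_target: int, n_used: int,
                      alpha: float = 3e-4) -> float:
    """Reward correctness, minus a penalty for deviating from the
    target length in either direction (an "exact length" variant)."""
    return float(is_correct) - alpha * abs(n_target - n_used)

def lcpo_max_reward(is_correct: bool, n_budget: int, n_used: int,
                    alpha: float = 3e-4, delta: float = 0.5) -> float:
    """Reward correctness, scaled by a factor that shrinks as the
    output approaches and then exceeds the budget (a "Max" variant)."""
    length_factor = min(max(alpha * (n_budget - n_used) + delta, 0.0), 1.0)
    return float(is_correct) * length_factor
```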
Usage
Input Format
The model's input consists of the problem description and a length constraint. Users specify the desired reasoning length by appending "Think for [n] tokens." to the prompt, where [n] is the desired token count.
Output Format
The model outputs its reasoning process followed by the final answer. The reasoning is generated to respect the specified length constraint, and the final answer is stated explicitly.
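As a usage sketch, the snippet below runs the GGUF model with llama-cpp-python and appends the length instruction to the prompt as described above. The model file name and the token headroom are assumptions; use the quantization you actually downloaded, and note that some GGUF conversions may additionally expect the model's chat template.

```python
# Minimal inference sketch with llama-cpp-python (pip install llama-cpp-python).
# The .gguf file name below is an assumption; use your downloaded quantization.
from llama_cpp import Llama

llm = Llama(model_path="L1-Qwen-1.5B-Max.Q4_K_M.gguf", n_ctx=4096)

question = "What is the sum of the first 100 positive integers?"
budget = 512  # desired reasoning length in tokens

# Append the length instruction exactly as specified in "Input Format".
prompt = f"{question} Think for {budget} tokens."

# Allow some headroom beyond the budget for the final answer.
out = llm(prompt, max_tokens=budget + 256)
print(out["choices"][0]["text"])
```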
Example
Input:"Find the largest possible real part of the expression (75+117i)z + (96+144i)/z where z is a complex number with |z|=4. Think for 1024 tokens."
Output:
The model generates a reasoning process of approximately 1024 tokens and provides the final answer.
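To sanity-check that a generation roughly honors the budget, you can count the tokens in the output. This sketch reuses the `llm`, `out`, and `budget` names from the snippet above.

```python
# Count generated tokens and compare against the requested budget.
text = out["choices"][0]["text"]
n_generated = len(llm.tokenize(text.encode("utf-8")))
print(f"generated ~{n_generated} tokens against a budget of {budget}")
```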
Performance
L1-Qwen-1.5B-Max shows significant improvements on multiple mathematical reasoning benchmarks. For example, it achieves 20% to 100% higher accuracy than other length-control methods on the AIME and AMC datasets, and in short-chain reasoning settings it outperforms much larger models such as GPT-4o.
Applicable Scenarios
- Mathematical Reasoning Tasks: Solving complex mathematical problems in algebra, geometry, and calculus.
- Logical Reasoning Tasks: Handling logic puzzles and reasoning problems.
- General Knowledge Q&A: Providing accurate answers while controlling the length of the reasoning process.
Notes
- Performance depends on the specified length constraint; choose a token budget appropriate to the difficulty of the task.
- Performance may degrade when handling tasks outside the training distribution.
License and Citation
This model is developed based on the LCPO method. If you use it, please cite the L1 paper: Pranjal Aggarwal and Sean Welleck, "L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning" (arXiv:2503.04697).
Model tree for DataSoul/L1-Qwen-1.5B-Max-GGUF
- Base model: l3lab/L1-Qwen-1.5B-Max