---
license: mit
tags:
- autoquant
- gguf
- llama-cpp
base_model:
- l3lab/L1-Qwen-1.5B-Max
---

The corresponding paper should be this one: [L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning](https://arxiv.org/html/2503.04697v1)

---

Here is the README.md written by Kimi:

---

# L1-Qwen-1.5B-Max Model Introduction

## Model Overview

L1-Qwen-1.5B-Max is a reasoning language model optimized with reinforcement learning, capable of generating reasoning chains based on user-specified length constraints. Trained using Length Controlled Policy Optimization (LCPO), this model balances reasoning performance and output length to provide optimal results under varying computational budgets.

## Model Features

- **Precise Length Control**: L1-Qwen-1.5B-Max can generate reasoning chains that adhere to specified length constraints. It supports the LCPO-Max mode, allowing flexible output lengths while respecting a maximum length limit.
- **Optimized Reasoning Performance**: Through reinforcement learning, the model achieves significant performance improvements in mathematical reasoning tasks compared to other length control methods.
- **Wide Applicability**: L1-Qwen-1.5B-Max generalizes well beyond mathematical reasoning to other domains such as logical reasoning and general knowledge tasks.
- **Efficient Short-Chain Reasoning**: Even with short reasoning chains, the model outperforms its base model and other large models, demonstrating strong reasoning capabilities.

## Model Architecture

L1-Qwen-1.5B-Max is fine-tuned from the DeepSeek-R1-Distill-Qwen-1.5B model. Using LCPO, the model is trained to jointly optimize reasoning correctness and adherence to the requested length, enabling precise control over reasoning chain lengths.
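
For intuition, here is a minimal sketch of the kind of length-aware reward an LCPO-Max-style objective uses: correctness only earns full reward when the generation stays within the token budget. The function name and the `alpha` and `delta` values are placeholders, not the paper's hyperparameters; see the paper for the exact formulation.

```python
# Hypothetical sketch of an LCPO-Max-style reward: correctness is scaled by how
# well the generation respects the token budget. alpha and delta are placeholder
# values, not the hyperparameters used in the paper.

def lcpo_max_reward(is_correct: bool, num_generated_tokens: int,
                    token_budget: int, alpha: float = 0.003,
                    delta: float = 0.5) -> float:
    """Correctness reward, down-weighted when the budget is exceeded."""
    # Soft factor that shrinks toward 0 as the generation overshoots the budget.
    length_factor = alpha * (token_budget - num_generated_tokens) + delta
    length_factor = max(0.0, min(1.0, length_factor))  # clip to [0, 1]
    return float(is_correct) * length_factor


# A correct answer well under a 1024-token budget keeps the full reward (clipped to 1.0).
print(lcpo_max_reward(True, 700, 1024))
# A correct answer that badly overshoots the budget gets 0.0.
print(lcpo_max_reward(True, 1400, 1024))
```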

## Usage

### Input Format

The model's input includes the problem description and length constraint. Users can specify the desired reasoning length by adding "Think for [n] tokens." to the prompt, where `[n]` is the desired length value.
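
As an illustration, a minimal sketch of assembling such a prompt (the `build_prompt` helper and the 512-token budget are arbitrary choices for the example, not part of the model's interface):

```python
def build_prompt(problem: str, num_tokens: int) -> str:
    """Append the length-control instruction to the problem statement."""
    return f"{problem} Think for {num_tokens} tokens."

prompt = build_prompt("What is the sum of the first 100 positive integers?", 512)
# -> "What is the sum of the first 100 positive integers? Think for 512 tokens."
```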

### Output Format

The model outputs the reasoning process and final answer. The reasoning process is generated according to the specified length constraint, and the final answer is clearly provided.
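
To check how closely a generation tracked the requested budget, one can simply count the generated tokens with the model's tokenizer. This sketch assumes the tokenizer is available in the `l3lab/L1-Qwen-1.5B-Max` repository:

```python
from transformers import AutoTokenizer

# Assumes the tokenizer ships with the l3lab/L1-Qwen-1.5B-Max repository.
tokenizer = AutoTokenizer.from_pretrained("l3lab/L1-Qwen-1.5B-Max")

def count_tokens(text: str) -> int:
    """Number of tokens the model's tokenizer assigns to `text`."""
    return len(tokenizer.encode(text, add_special_tokens=False))

# Compare count_tokens(reasoning_text) against the requested budget, e.g. 1024.
```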

### Example

**Input**:  
`"Find the largest possible real part of the expression (75+117i)z + (96+144i)/z where z is a complex number with |z|=4. Think for 1024 tokens."`

**Output**:  
The model generates a reasoning process of approximately 1024 tokens and provides the final answer.
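
For completeness, here is a hedged end-to-end sketch of running this example with Hugging Face `transformers`. The sampling parameters are arbitrary, and the snippet assumes the repository provides standard causal-LM weights and a chat template:

```python
# Illustrative inference sketch; generation settings are arbitrary, and the
# chat template is assumed to be provided by the model repository.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "l3lab/L1-Qwen-1.5B-Max"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

problem = (
    "Find the largest possible real part of the expression "
    "(75+117i)z + (96+144i)/z where z is a complex number with |z|=4. "
    "Think for 1024 tokens."
)
messages = [{"role": "user", "content": problem}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)
# Print only the newly generated reasoning and answer, not the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Since the repository also carries `gguf` and `llama-cpp` tags, the same prompt format should apply to llama.cpp builds; only the loading code changes.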

## Performance

L1-Qwen-1.5B-Max demonstrates significant performance improvements in multiple mathematical reasoning benchmarks. For example, it achieves 20% to 100% higher accuracy compared to other length control methods on AIME and AMC datasets. Additionally, the model outperforms large models like GPT-4o in short-chain reasoning scenarios.

## Applicable Scenarios

- **Mathematical Reasoning Tasks**: Solving complex mathematical problems in algebra, geometry, and calculus.
- **Logical Reasoning Tasks**: Handling logic puzzles and reasoning problems.
- **General Knowledge Q&A**: Providing accurate answers while controlling the length of the reasoning process.

## Notes

- The model's performance may be affected by the specified length constraint. Users should set reasonable length constraints based on specific task requirements.
- Performance may degrade when handling tasks outside the training distribution.

## License and Citation

This model is developed based on the [LCPO method](https://arxiv.org/html/2503.04697v1). Please cite the relevant paper when using this model.

---