L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning


L1-Qwen-1.5B-Max Model Introduction

Model Overview

L1-Qwen-1.5B-Max is a reasoning language model optimized with reinforcement learning that generates reasoning chains conforming to user-specified length constraints. Trained with Length Controlled Policy Optimization (LCPO), the model balances reasoning performance against output length, delivering strong results under varying computational budgets.

Model Features

  • Precise Length Control: L1-Qwen-1.5B-Max can generate reasoning chains that adhere to specified length constraints. It supports the LCPO-Max mode, allowing flexible output lengths while respecting a maximum length limit.
  • Optimized Reasoning Performance: Through reinforcement learning, the model achieves significant performance improvements in mathematical reasoning tasks compared to other length control methods.
  • Wide Applicability: L1-Qwen-1.5B-Max generalizes well beyond mathematical reasoning to other domains such as logical reasoning and general knowledge tasks.
  • Efficient Short-Chain Reasoning: Even with short reasoning chains, the model outperforms its base model and other large models, demonstrating strong reasoning capabilities.

Model Architecture

L1-Qwen-1.5B-Max is fine-tuned from DeepSeek-R1-Distill-Qwen-1.5B. Using LCPO, the model is trained to jointly optimize reasoning correctness and adherence to length constraints, enabling precise control over reasoning chain lengths.

Usage

Input Format

The model's input consists of the problem description and a length constraint. Users specify the desired reasoning length by appending "Think for [n] tokens." to the prompt, where [n] is the desired token budget.
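
As a minimal sketch, the constraint can be appended programmatically; the helper name below is ours, not part of this release:

```python
def build_prompt(problem: str, n_tokens: int) -> str:
    # Append the length-control instruction described above; n_tokens
    # is the desired reasoning budget in tokens.
    return f"{problem} Think for {n_tokens} tokens."

print(build_prompt("What is 7 * 8?", 256))
```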

Output Format

The model outputs the reasoning process and final answer. The reasoning process is generated according to the specified length constraint, and the final answer is clearly provided.
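
If the final answer needs to be separated from the reasoning programmatically, a hedged helper like the one below may work, assuming the answer appears in a \boxed{...} span as is common for R1-style math models; adjust the pattern to the model's actual output format:

```python
import re
from typing import Optional

def extract_boxed_answer(text: str) -> Optional[str]:
    # Return the last \boxed{...} span, assuming that convention holds;
    # None if no boxed answer is present.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None
```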

Example

Input:
"Find the largest possible real part of the expression (75+117i)z + (96+144i)/z where z is a complex number with |z|=4. Think for 1024 tokens."

Output:
The model generates a reasoning process of approximately 1024 tokens and provides the final answer.
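
Since this repository ships GGUF weights, the example above can be run with llama-cpp-python roughly as follows. This is a sketch, not an official recipe: the file name and generation settings are assumptions.

```python
from llama_cpp import Llama

# Load the quantized weights; the file name is an assumption and should
# match the actual GGUF file downloaded from this repository.
llm = Llama(model_path="L1-Qwen-1.5B-Max-Q8_0.gguf", n_ctx=4096)

problem = (
    "Find the largest possible real part of the expression "
    "(75+117i)z + (96+144i)/z where z is a complex number with |z|=4."
)

# Request a 1024-token reasoning budget as described above, and leave
# headroom in max_tokens so the final answer is not cut off.
out = llm(f"{problem} Think for 1024 tokens.", max_tokens=1536)
print(out["choices"][0]["text"])
```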

Performance

L1-Qwen-1.5B-Max demonstrates significant performance improvements on multiple mathematical reasoning benchmarks. For example, it achieves 20% to 100% higher accuracy than other length control methods on the AIME and AMC datasets. In short-chain reasoning scenarios, it also outperforms much larger models such as GPT-4o.

Applicable Scenarios

  • Mathematical Reasoning Tasks: Solving complex mathematical problems in algebra, geometry, and calculus.
  • Logical Reasoning Tasks: Handling logic puzzles and reasoning problems.
  • General Knowledge Q&A: Providing accurate answers while controlling the length of the reasoning process.

Notes

  • The model's performance may be affected by the specified length constraint. Set a length budget appropriate to the task at hand; see the sketch after this list.
  • Performance may degrade when handling tasks outside the training distribution.
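
To gauge how closely outputs track a given budget, one can sweep a few constraint values and count the generated tokens. This sketch reuses the assumed GGUF file name from the example above:

```python
from llama_cpp import Llama

# File name is an assumption; substitute the actual GGUF file.
llm = Llama(model_path="L1-Qwen-1.5B-Max-Q8_0.gguf", n_ctx=4096, verbose=False)

problem = "What is the sum of the first 100 positive integers?"
for budget in (512, 1024, 2048):
    out = llm(f"{problem} Think for {budget} tokens.", max_tokens=budget + 512)
    text = out["choices"][0]["text"]
    generated = len(llm.tokenize(text.encode("utf-8"), add_bos=False))
    print(f"requested {budget:4d} reasoning tokens, generated {generated:4d}")
```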

License and Citation

This model is built on the LCPO method. Please cite the LCPO paper when using this model.


Model Details

  • Repository: DataSoul/L1-Qwen-1.5B-Max-GGUF
  • Format: GGUF (qwen2 architecture, 1.78B params)
  • Available quantizations: 8-bit and 16-bit