Collections including paper arxiv:2412.16849

- OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
  Paper • 2412.16849 • Published • 9
- Offline Reinforcement Learning for LLM Multi-Step Reasoning
  Paper • 2412.16145 • Published • 38
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  Paper • 2501.12948 • Published • 346
- ProcessBench: Identifying Process Errors in Mathematical Reasoning
  Paper • 2412.06559 • Published • 80
- Maya: An Instruction Finetuned Multilingual Multimodal Model
  Paper • 2412.07112 • Published • 27
- OpenAI o1 System Card
  Paper • 2412.16720 • Published • 31
- Diving into Self-Evolving Training for Multimodal Reasoning
  Paper • 2412.17451 • Published • 43
- Extending Llama-3's Context Ten-Fold Overnight
  Paper • 2404.19553 • Published • 34
- ReFT: Representation Finetuning for Language Models
  Paper • 2404.03592 • Published • 94
- Why do small language models underperform? Studying Language Model Saturation via the Softmax Bottleneck
  Paper • 2404.07647 • Published • 4
- SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning
  Paper • 2401.07950 • Published • 4
- Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers
  Paper • 2408.06195 • Published • 70
- Thinking LLMs: General Instruction Following with Thought Generation
  Paper • 2410.10630 • Published • 19
- Democratizing Reasoning Ability: Tailored Learning from Large Language Model
  Paper • 2310.13332 • Published • 15
- OpenRFT: Adapting Reasoning Foundation Model for Domain-specific Tasks with Reinforcement Fine-Tuning
  Paper • 2412.16849 • Published • 9
- Search, Verify and Feedback: Towards Next Generation Post-training Paradigm of Foundation Models via Verifier Engineering
  Paper • 2411.11504 • Published • 20
- Top-nσ: Not All Logits Are You Need
  Paper • 2411.07641 • Published • 20
- Adaptive Decoding via Latent Preference Optimization
  Paper • 2411.09661 • Published • 10
- When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training
  Paper • 2411.13476 • Published • 16
- ORPO: Monolithic Preference Optimization without Reference Model
  Paper • 2403.07691 • Published • 65
- sDPO: Don't Use Your Data All at Once
  Paper • 2403.19270 • Published • 41
- Teaching Large Language Models to Reason with Reinforcement Learning
  Paper • 2403.04642 • Published • 48
- Best Practices and Lessons Learned on Synthetic Data for Language Models
  Paper • 2404.07503 • Published • 30