Trending Papers

Bitnet.cpp: Efficient Edge Inference for Ternary LLMs

Bitnet.cpp enhances edge inference for ternary LLMs using a novel mixed-precision matrix multiplication library, achieving significant speed improvements over baselines.

10 authors · Feb 17, 2025

GitHub 33.9k arXiv Page

Submitted by

taesiri

Fish Audio S2 Technical Report

Fish Audio S2 is an open-source text-to-speech system with multi-speaker capabilities, multi-turn generation, and instruction-following control through natural-language descriptions, utilizing a multi-stage training approach and production-ready inference engine.

Fish Audio · Published on Mar 9, 2026

22

GitHub 26.8k arXiv Page

Submitted by

Lingaaaaaaa

OpenClaw-RL: Train Any Agent Simply by Talking

OpenClaw-RL framework enables policy learning from diverse next-state signals across multiple interaction modalities using asynchronous training with PRM judges and hindsight-guided distillation.

Princeton AI Lab · Published on Mar 10, 2026

86

GitHub 2.36k arXiv Page

TADA: A Generative Framework for Speech Modeling via Text-Acoustic Dual Alignment

A novel tokenization scheme synchronizes acoustic features with text tokens in TTS systems, enabling unified modeling and reduced hallucinations through flow matching and text-only guidance.

Hume AI · Published on Feb 26, 2026

5

GitHub 612 arXiv Page

Submitted by

akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

5 authors · Published on Apr 28, 2025

48

GitHub 49.7k arXiv Page

Submitted by

AdinaY

DeepPlanning: Benchmarking Long-Horizon Agentic Planning with Verifiable Constraints

DeepPlanning benchmark addresses limitations of current LLM planning assessments by introducing complex, real-world tasks requiring both global optimization and local constraint reasoning.

Qwen · Published on Jan 26, 2026

35

GitHub 15.6k arXiv Page

Submitted by

andito

SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

SmolDocling is a compact vision-language model that performs end-to-end document conversion with robust performance across various document types using 256M parameters and a new markup format.

IBM Granite · Published on Mar 14, 2025

155

GitHub 55.7k arXiv Page

Submitted by

akhaliq

Efficient Memory Management for Large Language Model Serving with PagedAttention

PagedAttention algorithm and vLLM system enhance the throughput of large language models by efficiently managing memory and reducing waste in the key-value cache.

9 authors · Published on Sep 12, 2023

46

GitHub 73k arXiv Page

Submitted by

exander

Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

RL3DEdit uses reinforcement learning with rewards from a 3D foundation model to achieve multi-view consistent 3D editing from 2D editing priors.

AMAP-ML · Published on Mar 3, 2026

136

GitHub 124 arXiv Page

Submitted by

UglyToilet

MemOS: A Memory OS for AI System

MemOS, a memory operating system for Large Language Models, addresses memory management challenges by unifying plaintext, activation-based, and parameter-level memories, enabling efficient storage, retrieval, and continual learning.

39 authors · Published on Jul 4, 2025

GitHub 6.83k arXiv Page

Submitted by

Junyi42

LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

LoGeR enables long-term 3D video reconstruction by combining bidirectional priors with a hybrid memory system that includes parametric Test-Time Training and non-parametric sliding window attention mechanisms.

DeepMind · Published on Mar 3, 2026

52

GitHub 371 arXiv Page

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

A novel GPT-based model, OmniFlatten, enables real-time natural full-duplex spoken dialogue through a multi-stage post-training technique that integrates speech and text without altering the original model's architecture.

9 authors · Published on Oct 23, 2024

11

GitHub 54.8k arXiv Page

TradingAgents: Multi-Agents LLM Financial Trading Framework

A multi-agent framework using large language models for stock trading simulates real-world trading firms, improving performance metrics like cumulative returns and Sharpe ratio.

4 authors · Published on Dec 28, 2024

GitHub 32k arXiv Page

OASIS: Open Agent Social Interaction Simulations with One Million Agents

OASIS is a scalable and generalizable social media simulator that models large-scale user interactions and replicates complex social phenomena across platforms.

23 authors · Published on Nov 18, 2024

1

GitHub 3.11k arXiv Page

Submitted by

taesiri

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

MinerU2.5, a 1.2B-parameter document parsing vision-language model, achieves state-of-the-art recognition accuracy with computational efficiency through a coarse-to-fine parsing strategy.

61 authors · Published on Sep 26, 2025

149

GitHub 56.1k arXiv Page

Submitted by

Harold328

DVD: Deterministic Video Depth Estimation with Generative Priors

Video depth estimation framework DVD adapts pre-trained video diffusion models into deterministic single-pass depth regressors using structural anchors, latent manifold rectification, and global affine coherence for improved accuracy and efficiency.

15 authors · Published on Mar 12, 2026

GitHub 72 arXiv Page

Submitted by

lifuguan

Holi-Spatial: Evolving Video Streams into Holistic 3D Spatial Intelligence

Holi-Spatial presents the first fully automated, large-scale, spatially-aware multimodal dataset constructed from raw video inputs, supporting multi-level spatial supervision for 3D scene understanding and spatial reasoning tasks.

17 authors · Published on Mar 8, 2026

GitHub 183 arXiv Page

Submitted by

taesiri

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

PaddleOCR-VL, a vision-language model combining NaViT-style dynamic resolution and ERNIE, achieves state-of-the-art performance in document parsing and element recognition with high efficiency.

PaddlePaddle · Published on Oct 16, 2025

120

GitHub 72.2k arXiv Page

Submitted by

taesiri

Helios: Real Real-Time Long Video Generation Model

Helios is a 14 billion parameter autoregressive diffusion model for video generation that achieves real-time performance and high-quality long-video synthesis without conventional optimization techniques.

ByteDance · Published on Mar 4, 2026

GitHub 1.14k arXiv Page

Submitted by

Liuff23

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

Spatial-TTT enables streaming visual-based spatial intelligence through test-time training that adapts parameters to capture spatial evidence over long video sequences using hybrid architecture and 3D spatiotemporal convolution.

Tencent Hunyuan · Published on Mar 12, 2026

63

GitHub 71 arXiv Page

Submitted by

akhaliq

OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

OpenDevin is a platform for developing AI agents that interact with the world by writing code, using command lines, and browsing the web, with support for multiple agents and evaluation benchmarks.

24 authors · Published on Jul 23, 2024

GitHub 69.1k arXiv Page

Submitted by

akhaliq

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

LlamaFactory is a unified framework enabling efficient fine-tuning of large language models across various tasks using a web-based user interface.

5 authors · Published on Mar 20, 2024

182

GitHub 68.4k arXiv Page

Submitted by

akhaliq

Very Large-Scale Multi-Agent Simulation in AgentScope

Enhancements to the AgentScope platform improve scalability, efficiency, and ease of use for large-scale multi-agent simulations through distributed mechanisms, flexible environments, and user-friendly tools.

8 authors · Published on Jul 25, 2024

Submitted by

taesiri

AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications

AgentScope enhances agentic applications by providing flexible tool-based interactions, unified interfaces, and advanced infrastructure based on the ReAct paradigm, supporting efficient and safe development and deployment.

23 authors · Published on Aug 22, 2025

57

Submitted by

taesiri

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

InternVL-U is a 4-billion parameter unified multimodal model that combines advanced visual generation with robust semantic understanding through specialized modular design and reasoning-centric data synthesis.

29 authors · Published on Mar 10, 2026

39

GitHub 181 arXiv Page

Submitted by

xssstory

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

AReaL, a fully asynchronous reinforcement learning system, decouples generation and training to achieve higher GPU utilization and up to 2.57x training speedup for large language models on reasoning tasks.

13 authors · Published on May 30, 2025

34

GitHub 4.78k arXiv Page

Submitted by

hao-li

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agentic coding tools receive goals written in natural language as input, break them down into specific tasks, and write or execute the actual code with minimal human intervention. Central to this process are agent context files ("READMEs for agents") that provide persistent, project-level instructions. In this paper, we conduct the first large-scale empirical study of 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. We find that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions. Our content analysis of 16 instruction types shows that developers prioritize functional context, such as build and run commands (62.3%), implementation details (69.9%), and architecture (67.7%). We also identify a significant gap: non-functional requirements like security (14.5%) and performance (14.5%) are rarely specified. These findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant, highlighting the need for improved tooling and practices.

11 authors · Published on Nov 17, 2025

27

GitHub 18.9k arXiv Page

Submitted by

taesiri

LTX-2: Efficient Joint Audio-Visual Foundation Model

LTX-2 is an open-source audiovisual diffusion model that generates synchronized video and audio content using a dual-stream transformer architecture with cross-modal attention and classifier-free guidance.

29 authors · Published on Jan 6, 2026

166

GitHub 4.71k arXiv Page

Submitted by

Charlie019

Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

CourtSI is a large-scale spatial intelligence dataset for sports scenarios that enables evaluation and improvement of vision-language models' understanding of human motion and object interactions.

14 authors · Published on Mar 10, 2026

24

GitHub 57 arXiv Page

Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Zep, a memory layer service, outperforms MemGPT in the DMR benchmark and LongMemEval by excelling in dynamic knowledge integration and temporal reasoning, critical for enterprise use cases.

5 authors · Published on Jan 20, 2025

10

GitHub 23.7k arXiv Page

Submitted by

taesiri

Qwen3-TTS Technical Report

The Qwen3-TTS series presents advanced multilingual text-to-speech models with voice cloning and controllable speech generation capabilities, utilizing dual-track LM architecture and specialized speech tokenizers for efficient streaming synthesis.

Qwen · Published on Jan 22, 2026

71

GitHub 9.46k arXiv Page

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

5 authors · Published on Oct 8, 2024

31

GitHub 29.3k arXiv Page

AutoDev: Automated AI-Driven Development

AutoDev is an AI-driven software development framework that automates complex engineering tasks within a secure Docker environment, achieving high performance in code and test generation.

5 authors · Published on Mar 13, 2024

GitHub 9.47k arXiv Page

Self-Supervised Prompt Optimization

A self-supervised framework optimizes prompts for both closed and open-ended tasks by evaluating LLM outputs without external references, reducing costs and required data.

9 authors · Published on Feb 7, 2025

17

GitHub 65.1k arXiv Page

Submitted by

Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Data Intelligence Lab@HKU · Published on Oct 14, 2025

70

GitHub 14.2k arXiv Page

Submitted by

taesiri

PaperBanana: Automating Academic Illustration for AI Scientists

PaperBanana is an agentic framework that automates the creation of publication-ready academic illustrations using advanced vision-language models and image generation techniques.

Google · Published on Jan 30, 2026

217

GitHub 5.06k arXiv Page

Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

ReMe is a framework for experience-driven agent evolution in LLMs, enhancing memory management through distillation, context-adaptive reuse, and refinement, outperforming larger memoryless models.

7 authors · Published on Dec 11, 2025

3

GitHub 2.2k arXiv Page

Submitted by

StreamFormer

OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

OmniStream is a unified visual backbone that processes streaming video data through causal spatiotemporal attention and 3D rotary positional embeddings, enabling general-purpose visual understanding across multiple domains.

5 authors · Published on Mar 12, 2026

8

GitHub 25 arXiv Page

Submitted by

taesiri

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

54 authors · Published on Nov 14, 2025

191

GitHub 6.72k arXiv Page

Moonshine: Speech Recognition for Live Transcription and Voice Commands

Moonshine, an encoder-decoder transformer architecture for speech recognition, uses Rotary Position Embedding, reducing compute requirements without decreasing accuracy.

6 authors · Published on Oct 21, 2024

GitHub 7.31k arXiv Page

Submitted by

evanking

Flavors of Moonshine: Tiny Specialized ASR Models for Edge Devices

Monolingual ASR models trained on a balanced mix of high-quality, pseudo-labeled, and synthetic data outperform multilingual models for small model sizes, achieving superior error rates and enabling on-device ASR for underrepresented languages.

5 authors · Published on Sep 2, 2025

GitHub 7.32k arXiv Page

Submitted by

taesiri

Fara-7B: An Efficient Agentic Model for Computer Use

FaraGen creates synthetic datasets for computer use agents, enabling the training of efficient and high-performing models like Fara-7B on diverse web tasks, outperforming larger models on benchmarks.

Microsoft · Published on Nov 24, 2025

GitHub 4.5k arXiv Page

Submitted by

zpy777

Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Reinforcement learning framework with novel reward modeling and benchmarking approaches improves fidelity and instruction adherence in image editing and text-to-image generation.

SJTU VisionXLab · Published on Mar 12, 2026

GitHub 23 arXiv Page

DeepSeek-V3 Technical Report

DeepSeek-V3 is a parameter-efficient Mixture-of-Experts language model using MLA and DeepSeekMoE architectures, achieving high performance with efficient training and minimal computational cost.

DeepSeek · Published on Dec 27, 2024

78

GitHub 102k arXiv Page

Submitted by

xihc-ucb

Flash-KMeans: Fast and Memory-Efficient Exact K-Means

Flash-kmeans enables efficient online k-means clustering on GPUs through novel kernel-level optimizations that eliminate I/O bottlenecks and reduce atomic write contention.

UC Berkeley · Published on Mar 10, 2026

62

GitHub 192 arXiv Page

Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

mmGRPO, a multi-module extension of GRPO, enhances accuracy in modular AI systems by optimizing LM calls and prompts across various tasks.

13 authors · Published on Aug 6, 2025

2

GitHub 32.8k arXiv Page

Submitted by

zhongwenxu

Single-stream Policy Optimization

Single-stream Policy Optimization (SPO) improves policy-gradient training for Large Language Models by eliminating group-based issues and providing a stable, low-variance learning signal, leading to better performance and efficiency.

Tencent · Published on Sep 16, 2025

36

GitHub 19.9k arXiv Page

Submitted by

wenbowen

Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching

Fast-FoundationStereo achieves real-time stereo matching with strong zero-shot generalization through efficient compression, neural architecture search, and structured pruning techniques.

NVIDIA · Published on Dec 11, 2025

GitHub 769 arXiv Page

Submitted by

taesiri

Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning

We present Nemotron 3 Nano 30B-A3B, a Mixture-of-Experts hybrid Mamba-Transformer language model. Nemotron 3 Nano was pretrained on 25 trillion text tokens, including more than 3 trillion new unique tokens over Nemotron 2, followed by supervised fine-tuning and large-scale RL on diverse environments. Nemotron 3 Nano achieves better accuracy than our previous generation Nemotron 2 Nano while activating less than half of the parameters per forward pass. It achieves up to 3.3x higher inference throughput than similarly sized open models like GPT-OSS-20B and Qwen3-30B-A3B-Thinking-2507, while also being more accurate on popular benchmarks. Nemotron 3 Nano demonstrates enhanced agentic, reasoning, and chat abilities and supports context lengths up to 1M tokens. We release both our pretrained Nemotron 3 Nano 30B-A3B Base and post-trained Nemotron 3 Nano 30B-A3B checkpoints on Hugging Face.

NVIDIA · Published on Dec 23, 2025

GitHub 582 arXiv Page

Multi-Agent Collaboration via Evolving Orchestration

A centralized orchestrator dynamically directs LLM agents via reinforcement learning, achieving superior multi-agent collaboration in varying tasks with reduced computational costs.

14 authors · Published on May 26, 2025

7

GitHub 31.6k arXiv Page
