
AudioStory: Generating Long-Form Narrative Audio with Large Language Models

GitHub: https://github.com/TencentARC/AudioStory

✨ TL;DR: We propose a model for long-form narrative audio generation, built on a unified understanding-generation framework and capable of video dubbing, audio continuation, and long-form narrative audio synthesis.

📖 Release

[2025/09/02] 🔥🔥 Text-to-long audio checkpoint released!
[2025/08/28] 🔥🔥 Inference code released!
[2025/08/28] 🔥🔥 Demo videos released!

🔎 Introduction

[Figure: AudioStory overview]

Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following and reasoning-driven generation capabilities: it employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and a consistent emotional tone. AudioStory has two appealing features:

  1. Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components: a bridging query for intra-event semantic alignment and a consistency query for cross-event coherence preservation (see the sketch after this section).
  2. End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish AudioStory-10K, a benchmark encompassing diverse domains such as animated soundscapes and natural sound narratives.

Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity.
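To make the decoupled bridging idea concrete, here is a minimal PyTorch sketch of how two sets of learnable queries could extract intra-event semantics and cross-event coherence cues from the LLM's hidden states. All class names, dimensions, and head counts below are illustrative assumptions, not the repository's actual implementation.

import torch
import torch.nn as nn

class DecoupledBridge(nn.Module):
    """Illustrative bridge from LLM hidden states to DiT conditioning tokens."""
    def __init__(self, llm_dim=2048, dit_dim=1024, n_bridge=32, n_consist=8):
        super().__init__()
        # Bridging queries target intra-event semantics; consistency queries
        # carry cross-event coherence (both sizes are hypothetical).
        self.bridging_query = nn.Parameter(torch.randn(n_bridge, llm_dim))
        self.consistency_query = nn.Parameter(torch.randn(n_consist, llm_dim))
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(llm_dim, dit_dim)

    def forward(self, llm_hidden):  # llm_hidden: (batch, seq_len, llm_dim)
        queries = torch.cat([self.bridging_query, self.consistency_query], dim=0)
        queries = queries.unsqueeze(0).repeat(llm_hidden.size(0), 1, 1)
        # The learnable queries attend over the LLM states, then are projected
        # to the DiT's conditioning width.
        fused, _ = self.attn(queries, llm_hidden, llm_hidden)
        return self.proj(fused)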

⭐ Demos

1. Video Dubbing (Tom & Jerry style)

Dubbing is achieved using AudioStory (trained on Tom & Jerry) with visual captions extracted from videos.

2. Cross-domain Video Dubbing (Tom & Jerry style)

3. Text-to-Long Audio (Natural sound)

Instruction: "Develop a comprehensive audio that fully represents jake shimabukuro performs a complex ukulele piece in a studio, receives applause, and discusses his career in an interview. The total duration is 49.9 seconds."
Instruction: "Develop a comprehensive audio that fully represents a fire truck leaves the station with sirens blaring, signaling an emergency response, and drives away. The total duration is 35.1 seconds."
Instruction: "Understand the input audio, infer the subsequent events, and generate the continued audio of the coach giving basketball lessons to the players. The total duration is 36.6 seconds."

🔎 Methods

[Figure: AudioStory framework]

To achieve effective instruction-following audio generation, the model must understand the input instruction or audio stream and reason about the relevant audio sub-events. To this end, AudioStory adopts a unified understanding-generation framework (see the figure above). Specifically, given a textual instruction or audio input, the LLM analyzes and decomposes it into structured audio sub-events with context. Based on the inferred sub-events, the LLM performs interleaved reasoning generation, sequentially producing captions, semantic tokens, and residual tokens for each audio clip. The two token types are fused and passed to the DiT, effectively bridging the LLM with the audio generator. Through progressive training, AudioStory ultimately achieves both strong instruction comprehension and high-quality audio generation.
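The paragraph above translates into a simple generation loop. The sketch below captures that control flow; every method name (decompose, caption, semantic_tokens, residual_tokens, synthesize) is a hypothetical placeholder for illustration, not the actual AudioStory API.

import torch

def generate_long_audio(llm, dit, instruction, total_duration):
    # 1) The LLM decomposes the instruction into temporally ordered sub-events.
    sub_events = llm.decompose(instruction, total_duration)
    clips = []
    for event in sub_events:
        # 2) Interleaved reasoning: caption first, then both token types.
        caption = llm.caption(event)
        semantic = llm.semantic_tokens(caption)   # intra-event alignment
        residual = llm.residual_tokens(caption)   # cross-event coherence
        # 3) Fuse the two token streams into one DiT conditioning signal.
        condition = torch.cat([semantic, residual], dim=1)
        clips.append(dit.synthesize(condition, duration=event.duration))
    # 4) Concatenate the clips into a single coherent narrative waveform.
    return torch.cat(clips, dim=-1)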

🔩 Installation

Dependencies

Installation

git clone https://github.com/TencentARC/AudioStory.git
cd AudioStory
conda create -n audiostory python=3.10 -y
conda activate audiostory
bash install_audiostory.sh

📊 Evaluation

Download the model checkpoint from Hugging Face Models.
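If you prefer scripting the download, the checkpoint can be fetched with huggingface_hub; the repo id below is an assumption based on this card's organization:

from huggingface_hub import snapshot_download

# Download the checkpoint into the path expected by the inference script
# (repo_id is assumed, not confirmed by the source).
snapshot_download(repo_id="TencentARC/AudioStory", local_dir="ckpt/audiostory-3B")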

Inference

python evaluate/inference.py \
    --model_path ckpt/audiostory-3B \
    --guidance 4.0 \
    --save_folder_name audiostory \
    --total_duration 50

🔋 Acknowledgement

When building the codebase of continuous denoisers, we referred to SEED-X and TangoFlux. Thanks for their wonderful projects.

📆 TO DO

  • Release our Gradio demo.
  • Release AudioStory model checkpoints.
  • Release the AudioStory-10K dataset.
  • Release training code for all three stages.

📜 License

This repository is released under the Apache 2.0 License.

📚 BibTeX

@misc{guo2025audiostory,
      title={AudioStory: Generating Long-Form Narrative Audio with Large Language Models}, 
      author={Yuxin Guo and Teng Wang and Yuying Ge and Shijie Ma and Yixiao Ge and Wei Zou and Ying Shan},
      year={2025},
      eprint={2508.20088},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2508.20088}, 
}

📧 Contact

If you have further questions, feel free to contact me: [email protected]

Discussions and potential collaborations are also welcome.
