---
# 1. Basic information (core fields that make the model discoverable)
language:
- en
- zh
tags:
- speech-processing
- empathetic-dialogue
- end-to-end-model
- spoken-dialogue
- pytorch
license: apache-2.0
library_name: transformers
datasets:
- EChat-200K # Empathetic dialogue dataset introduced with this model
# 2. Authors and affiliation
author:
- name: Xuelong Geng
  email: [email protected]
organization:
- name: ASLP@NPU
  url: http://www.npu-aslp.org/
# 3. Resource links
links:
- name: Paper
  url: https://www.arxiv.org/abs/2508.09600
- name: GitHub Repo
  url: https://github.com/ASLP-lab/OSUM-EChat
- name: Demo Page
  url: https://www.osum-echat.npu-aslp.org
---
<h1 align="center">OSUM-EChat: Enhancing End-to-End Empathetic Spoken Chatbot via Understanding-Driven Spoken Dialogue</h1>
<p align="center">
Xuelong Geng, Qijie Shao, Hongfei Xue, Shuiyuan Wang, Hanke Xie, Zhao Guo, Yi Zhao, Guojian Li, Wenjie Tian, Chengyou Wang, Zhixian Zhao, Kangxiang Xia, Ziyu Zhang, Zhennan Lin, Tianlun Zuo, Mingchen Shao, Yuang Cao, Guobin Ma, Longhao Li, Yuhang Dai, Dehui Gao, Dake Guo, Lei Xie
</p>
<p align="center">
<img src="images/osum-echat/SUM.png" width="500"/>
</p>
<p align="center">
<a href="https://www.osum-echat.npu-aslp.org/">Test Page</a> &nbsp;&nbsp; <a href="https://github.com/ASLP-lab/OSUM-EChat">Code</a>
<br>
📑 <a href="https://www.arxiv.org/abs/2508.09600">Paper</a> &nbsp;&nbsp;|&nbsp;&nbsp; 📑 <a href="https://aslp-lab.github.io/osum-echat.github.io/">Demo</a> &nbsp;&nbsp;|&nbsp;&nbsp; 💬 <a href="raw/fig/wechat.png">WeChat (微信)</a>
</p>
Empathy is crucial in enabling natural interactions within spoken dialogue systems, allowing machines to recognize and respond appropriately to paralinguistic cues such as age, gender, and emotion.
Recent advancements in end-to-end speech language models, which unify speech understanding and generation, provide promising solutions.
However, several challenges persist, including an over-reliance on large-scale dialogue datasets, insufficient extraction of paralinguistic cues vital for conveying empathy, and the lack of empathy-specific datasets and evaluation frameworks.
To address these issues, we introduce OSUM-EChat, an open-source, end-to-end spoken dialogue system designed to enhance empathetic interactions, particularly in resource-limited settings.
Building on [OSUM](https://github.com/ASLP-lab/OSUM/tree/main/OSUM), OSUM-EChat introduces two key innovations: (1) a three-stage understanding-driven spoken dialogue training strategy that extends the capabilities of a large speech understanding model to spoken dialogue tasks, and (2) a linguistic-paralinguistic dual thinking mechanism that integrates paralinguistic understanding, via a chain of thought, with dialogue generation, enabling the system to produce more empathetic responses. This approach reduces reliance on large-scale dialogue datasets while maintaining high-quality empathetic interactions.

Additionally, we introduce the EChat-200K dataset, a rich corpus of empathetic speech-to-speech dialogues, and the EChat-eval benchmark, a comprehensive framework for evaluating the empathetic capabilities of dialogue systems. Experimental results demonstrate that OSUM-EChat outperforms existing end-to-end spoken dialogue models in empathetic responsiveness, validating its effectiveness.
<p align="center">
<img src="images/osum-echat/demo_en.png" width="80%"/>
</p>
## Architecture
This section presents an overview of the overall architecture and core tasks of OSUM-EChat. OSUM-EChat consists of three modules: a speech encoder (with an adapter), a text LLM (large language model), and a token-to-speech module. The model supports a wide range of speech functions, including speech understanding tasks (speech-to-text), speech synthesis, spoken dialogue, and text dialogue. Meanwhile, by leveraging internally constructed empathetic dialogue data and a paralinguistic reasoning mechanism, OSUM-EChat generates more empathetic responses in spoken dialogue tasks.
<p align="center">
<img src="images/osum-echat/system.png" width="80%"/>
</p>
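To make the three-module data flow concrete, the sketch below traces one dialogue turn from input speech to output audio. The module handles and call signatures here are hypothetical, chosen only to mirror the description above; the actual implementation lives in the GitHub repository.

```python
import torch

# Conceptual sketch of one dialogue turn; module names and signatures
# are assumptions, not the repository's actual API.
def respond(speech_encoder, adapter, llm, token2wav, waveform: torch.Tensor):
    features = speech_encoder(waveform)     # raw speech -> acoustic features
    speech_emb = adapter(features)          # project into the LLM embedding space
    # The text LLM performs understanding and dialogue generation,
    # emitting response text plus discrete speech tokens.
    text, speech_tokens = llm.generate(speech_emb)
    audio = token2wav(speech_tokens)        # speech tokens -> output waveform
    return text, audio
```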
## Training Strategy
To enable OSUM-EChat to achieve empathetic dialogue in resource-constrained environments, we propose a three-stage training strategy called **"Understanding-Driven Spoken Dialogue"**, consisting of understanding, generation, and empathy stages. In the empathy stage, a **linguistic-paralinguistic dual thinking mechanism** is introduced to explicitly separate paralinguistic from semantic information, thereby helping the model generate more empathetic responses.
### Stage 1: Understanding
The goal of this stage is to enable the LLM to understand both linguistic and paralinguistic information in speech. OSUM’s **“ASR+P” strategy** is employed (where *P* represents paralinguistic labels such as emotion, gender, age, and sound events). Multiple “ASR+P” tasks are jointly trained, with only the encoder and adapters being trainable.
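A minimal PyTorch sketch of that trainability split is shown below. The attribute names (`llm`, `speech_encoder`, `adapter`) and the learning rate are assumptions for illustration, not the repository's actual configuration.

```python
import torch

# Stage 1 sketch: freeze the text LLM, train only the speech encoder
# and adapter. Attribute names are illustrative.
for param in model.llm.parameters():
    param.requires_grad = False
for module in (model.speech_encoder, model.adapter):
    for param in module.parameters():
        param.requires_grad = True

# Optimize only the parameters that still receive gradients.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```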
### Stage 2: Generation
This stage aims to equip the OSUM-based understanding model with speech generation capabilities. A two-step training process is adopted: text-to-speech (TTS) generation and speech-to-speech (S2S) dialogue. Additionally, text-to-text (T2T) data is incorporated to maintain the model’s reasoning intelligence.
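One simple way to realize such a mixture is weighted task sampling when building batches, as sketched below. The ratios are illustrative placeholders, not the values used in the paper.

```python
import random

# Stage 2 sketch: TTS and S2S data teach speech generation, while T2T
# data preserves the LLM's text reasoning. Ratios are assumed.
TASK_WEIGHTS = {"tts": 0.4, "s2s": 0.4, "t2t": 0.2}

def sample_task() -> str:
    tasks, weights = zip(*TASK_WEIGHTS.items())
    return random.choices(tasks, weights=weights, k=1)[0]
```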
### Stage 3: Empathy
In this stage, linguistic and paralinguistic information obtained from speech understanding is integrated into the dialogue generation process, significantly improving the model’s ability to produce contextually coherent and empathetic responses. By introducing a **dual-thinking mechanism** before generating text and speech responses, the model first recognizes the linguistic content of the user’s speech, then infers paralinguistic details, and finally integrates these insights to generate appropriate responses.
For the **Chain of Thought (CoT)** design, we explore two textual forms, **label-based CoT** and **natural language-based CoT**, to investigate how each affects the model's empathetic understanding and response generation.
#### Label-Based CoT
This form follows a fixed template structure. Specifically, the model first outputs the transcription text obtained via automatic speech recognition (ASR), extracting semantic information from the user’s input. Then, it sequentially outputs predefined paralinguistic labels such as age, gender, speech events, and emotion. The main advantage is that the CoT stage produces content with relatively fixed format and short length, making the process easy to control and efficient in extracting and integrating core paralinguistic cues. However, this approach has limitations: due to the restricted number and scope of predefined labels, it cannot fully express richer and more nuanced paralinguistic states, such as subtle shifts in tone intensity or fine-grained emotional transitions.
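The snippet below illustrates what such a fixed-template target might look like. The tag names and overall format are hypothetical reconstructions from the description above, not the released training template.

```python
# Hypothetical label-based CoT target: ASR transcript first, then the
# predefined paralinguistic labels, then the response.
label_cot = (
    "<transcript>I just got the results back and I can't believe it.</transcript>"
    "<age>adult</age><gender>female</gender>"
    "<event>sigh</event><emotion>sad</emotion>"
    "<response>I'm so sorry. Do you want to talk about what happened?</response>"
)
```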
#### Natural Language-Based CoT
This form abandons the fixed label template in favor of natural, fluent language descriptions. The model generates coherent textual paragraphs: first interpreting the semantic meaning of the user’s speech (rather than merely transcribing it), then describing the paralinguistic details in depth — including specific manifestations of age characteristics, gender-related vocal traits, the contexts and features of various speech events, and fine-grained emotional layers and dynamics. The advantage of this method is its flexibility in overcoming label limitations, allowing the model to capture and express complex paralinguistic states more comprehensively, thereby providing richer grounding for empathetic response generation. However, its content is harder to control in length and structure, which may increase computational overhead and require stronger language organization abilities from the model.
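For contrast, a natural language-based CoT target for the same input might read as follows; the wording is entirely illustrative.

```python
# Hypothetical natural-language CoT target: free-form description of
# semantics and paralinguistic cues, followed by the response.
language_cot = (
    "The user says the results just came back and she cannot believe them. "
    "The voice sounds like an adult woman; a sigh and a low, unsteady tone "
    "suggest sadness mixed with shock. The reply should first acknowledge "
    "the shock, then offer gentle support. "
    "Response: I'm so sorry. Do you want to talk about what happened?"
)
```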
### Download Model Checkpoints
```python
import os
from huggingface_hub import hf_hub_download

# Natural-language think model
pt_file_path = hf_hub_download(repo_id="ASLP-lab/OSUM-EChat", filename="language_think_final.pt")
# Tag-based think model
pt_file_path2 = hf_hub_download(repo_id="ASLP-lab/OSUM-EChat", filename="tag_think_final.pt")
# Token2wav model (compressed tar file)
pt_file_path3 = hf_hub_download(repo_id="ASLP-lab/OSUM-EChat", filename="CosyVoice-300M-25Hz.tar")

# Extract the token2wav model parameters into the current directory
os.system(f"tar -xvf {pt_file_path3}")
```
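After downloading, the checkpoints can be loaded as ordinary PyTorch state dicts. This is a minimal sketch, assuming the `.pt` files follow the standard `torch.save` format; see the GitHub repository for the full inference pipeline.

```python
import torch

# Load the natural-language think checkpoint on CPU; move to GPU as needed.
state_dict = torch.load(pt_file_path, map_location="cpu")
```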