# CTIM-Gen: Controllable Traditional Chinese Instrument Music Generation

CTIM-Gen is a specialized music generation model fine-tuned from MusicGen Small, designed to generate high-quality traditional Chinese instrument performances, e.g., Guqin (Han Chinese), Guzheng (Han Chinese), Matouqin (Mongolian), Hulusi (Dai), Pipa (Han Chinese), Yangqin (Han Chinese), and Erhu (Han Chinese).
## 🎵 Audio Samples

Below are samples generated by our model using prompt templates.
| Instrument | Prompt | Audio |
|---|---|---|
| Guqin (古琴) | A performance of guqin, featuring woody tone, traditional Chinese folk music style, medium tempo. | |
| Guzheng (古筝) | A performance of guzheng, featuring bright and elegant tone, traditional Chinese folk music style, medium tempo. | |
| Matouqin (马头琴) | A performance of matouqin, featuring wild tone, traditional Chinese folk music style, fast tempo. | |
| Hulusi (葫芦丝) | A performance of hulusi, featuring gentle and lyrical tone, traditional Chinese folk music style, slow tempo. | |
| Pipa (琵琶) | A performance of pipa, featuring crisp and rhythmic tone, traditional Chinese folk music style, fast tempo. | |
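The prompts above all instantiate one fixed template. A minimal helper (hypothetical, not part of the release) that fills it in:

```python
def build_prompt(instrument: str, tone: str, tempo: str) -> str:
    """Fill the fixed prompt template used throughout this card."""
    return (
        f"A performance of {instrument}, featuring {tone} tone, "
        f"traditional Chinese folk music style, {tempo} tempo."
    )

# Reproduces the guzheng row of the table above.
print(build_prompt("guzheng", "bright and elegant", "medium"))
```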
## 🔥 Training Details

The MusicGen-Small model was fine-tuned on the Chinese-trad-inst dataset. Below are the specific hyperparameters used during training:
| Configuration | Value |
|---|---|
| Model Architecture | MusicGen-Small |
| Dataset | Chinese-trad-inst |
| Training Epochs | 10 |
| Optimizer | AdamW |
| Optimizer Betas | (beta1=0.9, beta2=0.95) |
| Weight Decay | 0.1 |
| Learning Rate | 1e-5 |
| Effective Batch Size | 128 |
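The optimizer settings in the table map directly onto PyTorch's `AdamW`. A sketch of the configuration (the `nn.Linear` stand-in replaces the actual `model.lm` parameters; this is not the release training script):

```python
import torch
import torch.nn as nn

# Stand-in module; in practice this would be the MusicGen language model (model.lm).
lm = nn.Linear(8, 8)

# Optimizer configured per the hyperparameter table above.
optimizer = torch.optim.AdamW(
    lm.parameters(),
    lr=1e-5,
    betas=(0.9, 0.95),
    weight_decay=0.1,
)
```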
## 📊 Benchmark Results

Evaluated on CTIM-Bench (500 samples). Lower FAD/KL/JSD is better; higher CLAP is better.
| Model | FAD (β) | CLAP Score (β) | KL Divergence (β) | JSD (β) |
|---|---|---|---|---|
| CTIM-Gen (Ours) (rvq0.1) | 3.27 | 0.431 | 0.0083 | 0.0021 |
| MusicGen Small (Base) | 4.94 | 0.546 | 0.0724 | 0.0180 |
| MusicGen Medium | 5.28 | 0.523 | 0.0920 | 0.0235 |
| MusicGen Large | 8.51 | 0.473 | 0.0232 | 0.0059 |
| AudioLDM2 | 10.24 | 0.402 | 0.0379 | 0.0094 |
| AudioGen | 12.61 | 0.321 | 0.0240 | 0.0059 |
*Note: results are aligned with the ICMR 2026 submission.*
## 🗣️ Subjective Evaluation (MOS)

We conducted a MUSHRA-like subjective evaluation with 32 listeners to assess Semantic Consistency (REL) and Audio Quality (OVL).
| Model | Semantic Consistency (REL) | Audio Quality (OVL) |
|---|---|---|
| CTIM-Gen (Ours) | 3.98 Β± 0.17 | 3.65 Β± 0.18 |
| MusicGen Large | 2.42 Β± 0.22 | 2.91 Β± 0.21 |
| MusicGen Small | 2.37 Β± 0.20 | 2.73 Β± 0.20 |
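Each cell above reads as mean ± interval half-width over listener ratings. A sketch of how such a score could be aggregated (hypothetical ratings; normal-approximation 95% interval, which may differ from the paper's exact protocol):

```python
import math

def mos_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% confidence half-width."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half = z * math.sqrt(var / n)
    return mean, half

# Hypothetical ratings from eight listeners on a 1-5 scale.
m, h = mos_ci([4, 4, 3, 5, 4, 4, 3, 4])
print(f"{m:.2f} +/- {h:.2f}")
```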
## Visual Analysis

Comparison of spectrograms between CTIM-Gen and the baseline.
## 🚀 Usage

### 1. Installation

```shell
pip install audiocraft
```

### 2. Inference

You can download the model weights and use our inference script `code/inference.py`, or use the following Python snippet:
```python
import os

import torch
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# 1. Load Base Model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# 2. Load CTIM-Gen Weights
# Download 'rvq0.1_tokendrop0.0.pt' (or the best model) from 'checkpoints/' and place it locally
model_path = 'rvq0.1_tokendrop0.0.pt'
if os.path.exists(model_path):
    model.lm.load_state_dict(torch.load(model_path, map_location='cpu'))
else:
    print("Please download the weights file first.")

# 3. Set Generation Params (30 s duration, best config)
model.set_generation_params(
    duration=30,
    top_k=250,
    top_p=0.0,
    temperature=1.0,
    cfg_coef=3.0,
)

# 4. Generate
prompts = ["A performance of guqin, featuring deep and resonant tone, traditional Chinese folk music style, slow tempo."]
wav = model.generate(prompts)
audio_write('output', wav[0].cpu(), model.sample_rate, strategy="loudness")
```
## 📁 Repository Structure

- `checkpoints/`: Model weights (best model & ablation studies).
- `code/`: Inference and benchmarking scripts.
- `evidence/`: Logs and metric results supporting the paper.
- `configs/`: Benchmark metadata.
## 📚 Dataset

The training and benchmark dataset Chinese-trad-inst is available at CTIM-Gen/Chinese-trad-inst (benchmark subset available at CTIM-Gen/CTIM-Bench).
## ⚖️ Data Source & Usage Policy

The audio data used in this project was primarily collected from China Music Network (www.china1901.com), with a small amount supplemented from Bilibili.

**Usage Policy:** This dataset is intended solely for academic research in the field of AI music generation. Any commercial use is strictly prohibited.
Submitted to ICMR 2026 (Anonymous Submission)