CTIM-Gen: Controllable Traditional Chinese Instrument Music Generation

CTIM-Gen is a specialized music generation model fine-tuned from MusicGen Small, designed to generate high-quality traditional Chinese instrument performances (e.g., Guqin (Han Chinese), Guzheng (Han Chinese), Matouqin (Mongolian), Hulusi (Dai), Pipa (Han Chinese), Yangqin (Han Chinese), Erhu (Han Chinese)).

🎡 Audio Samples

Below are samples generated by our model using prompt templates.

| Instrument | Prompt |
| --- | --- |
| Guqin (古琴) | A performance of guqin, featuring woody tone, traditional Chinese folk music style, medium tempo. |
| Guzheng (古筝) | A performance of guzheng, featuring bright and elegant tone, traditional Chinese folk music style, medium tempo. |
| Matouqin (马头琴) | A performance of matouqin, featuring wild tone, traditional Chinese folk music style, fast tempo. |
| Hulusi (葫芦丝) | A performance of hulusi, featuring gentle and lyrical tone, traditional Chinese folk music style, slow tempo. |
| Pipa (琵琶) | A performance of pipa, featuring crisp and rhythmic tone, traditional Chinese folk music style, fast tempo. |
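The prompts above all follow one fixed template: instrument, tone description, style, and tempo. As a minimal sketch (the `build_prompt` helper is illustrative, not part of the released code), such prompts can be composed programmatically:

```python
# Hypothetical helper mirroring the prompt template used in the samples above.
def build_prompt(instrument: str, tone: str, tempo: str) -> str:
    """Compose a CTIM-Gen style text prompt from template fields."""
    return (f"A performance of {instrument}, featuring {tone} tone, "
            f"traditional Chinese folk music style, {tempo} tempo.")

prompt = build_prompt("guqin", "deep and resonant", "slow")
```

Keeping prompts on this template during inference matches the distribution the model was fine-tuned on, which generally improves conditioning.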

πŸ”₯ Training Details

The MusicGen-Small model was fine-tuned on the Chinese-trad-inst dataset. Below are the specific hyperparameters used during training:

| Configuration | Value |
| --- | --- |
| Model Architecture | MusicGen-Small |
| Dataset | Chinese-trad-inst |
| Training Epochs | 10 |
| Optimizer | AdamW |
| Optimizer Betas | (beta1=0.9, beta2=0.95) |
| Weight Decay | 0.1 |
| Learning Rate | 1e-5 |
| Effective Batch Size | 128 |
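The optimizer configuration from the table above maps directly onto PyTorch's `AdamW`. A minimal sketch (the `Linear` module is a stand-in for the actual MusicGen language model, which is not loaded here):

```python
import torch

# Placeholder module standing in for model.lm; only the optimizer
# hyperparameters below are taken from the training table.
lm = torch.nn.Linear(8, 8)

optimizer = torch.optim.AdamW(
    lm.parameters(),
    lr=1e-5,            # Learning Rate
    betas=(0.9, 0.95),  # Optimizer Betas
    weight_decay=0.1,   # Weight Decay
)
```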

πŸ“Š Benchmark Results

Evaluated on CTIM-Bench (500 samples). Lower FAD/KL/JSD is better; higher CLAP is better.

| Model | FAD (↓) | CLAP Score (↑) | KL Divergence (↓) | JSD (↓) |
| --- | --- | --- | --- | --- |
| CTIM-Gen (Ours) (rvq0.1) | 3.27 | 0.431 | 0.0083 | 0.0021 |
| MusicGen Small (Base) | 4.94 | 0.546 | 0.0724 | 0.0180 |
| MusicGen Medium | 5.28 | 0.523 | 0.0920 | 0.0235 |
| MusicGen Large | 8.51 | 0.473 | 0.0232 | 0.0059 |
| AudioLDM2 | 10.24 | 0.402 | 0.0379 | 0.0094 |
| AudioGen | 12.61 | 0.321 | 0.0240 | 0.0059 |

(Note: Results aligned with ICMR 2026 submission)
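Of the distributional metrics above, the Jensen-Shannon divergence (JSD) is the easiest to reproduce from scratch. A minimal sketch of its definition over two discrete distributions (this is the standard formula, not the benchmark's exact feature-extraction pipeline):

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence (base e) between two discrete distributions.

    JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), where m = (p + q) / 2.
    Symmetric, and bounded in [0, log 2].
    """
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions score 0; fully disjoint ones score log 2 (≈ 0.693), so the small benchmark values indicate closely matched feature distributions.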

πŸ—£οΈ Subjective Evaluation (MOS)

We conducted a MUSHRA-like subjective evaluation with 32 listeners to assess Semantic Consistency (REL) and Audio Quality (OVL).

| Model | Semantic Consistency (REL) | Audio Quality (OVL) |
| --- | --- | --- |
| CTIM-Gen (Ours) | 3.98 ± 0.17 | 3.65 ± 0.18 |
| MusicGen Large | 2.42 ± 0.22 | 2.91 ± 0.21 |
| MusicGen Small | 2.37 ± 0.20 | 2.73 ± 0.20 |
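Scores of the form "mean ± interval" are conventionally a mean opinion score with a confidence half-width. As a hedged sketch (assuming a normal-approximation 95% interval; the paper's exact statistic may differ), such values can be computed as:

```python
import math
import statistics

def mos_with_ci(ratings, z: float = 1.96):
    """Mean opinion score with an approximate 95% confidence half-width."""
    mean = statistics.mean(ratings)
    half = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half

# Ratings here are made-up example data, not the study's responses.
mean, half = mos_with_ci([4.0, 3.5, 4.5, 4.0, 3.5])
```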

Visual Analysis

Spectrogram comparison between CTIM-Gen and the baseline (figure).

πŸš€ Usage

1. Installation

```shell
pip install audiocraft
```

2. Inference

You can download the model weights and use our inference script code/inference.py, or use the following Python snippet:

```python
import os

import torch
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# 1. Load the base model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# 2. Load the CTIM-Gen weights
# Download 'rvq0.1_tokendrop0.0.pt' (or the best model) from 'checkpoints/' and place it locally
model_path = 'rvq0.1_tokendrop0.0.pt'
if os.path.exists(model_path):
    model.lm.load_state_dict(torch.load(model_path, map_location='cpu'))
else:
    print("Please download the weights file first.")

# 3. Set generation params (30 s duration, best config)
model.set_generation_params(
    duration=30,
    top_k=250,
    top_p=0.0,
    temperature=1.0,
    cfg_coef=3.0
)

# 4. Generate and write a loudness-normalized WAV file
prompts = ["A performance of guqin, featuring deep and resonant tone, traditional Chinese folk music style, slow tempo."]
wav = model.generate(prompts)
audio_write('output', wav[0].cpu(), model.sample_rate, strategy="loudness")
```

πŸ“‚ Repository Structure

  • checkpoints/: Model weights (Best model & Ablation studies).
  • code/: Inference and benchmarking scripts.
  • evidence/: Logs and metric results supporting the paper.
  • configs/: Benchmark metadata.

πŸ”— Dataset

The training and benchmark dataset Chinese-trad-inst is available at: CTIM-Gen/Chinese-trad-inst (Benchmark subset available at CTIM-Gen/CTIM-Bench).

βš–οΈ Data Source & Usage Policy

The audio data used in this project was primarily collected from China Music Network (www.china1901.com), with a small amount supplemented from Bilibili.

Usage Policy: This dataset is intended solely for academic research in the field of AI music generation. Any commercial use is strictly prohibited.


Submitted to ICMR 2026 (Anonymous Submission)
