CTIM-Gen: Controllable Traditional Chinese Instrument Music Generation

CTIM-Gen is a specialized music generation model fine-tuned from MusicGen Small, designed to generate high-quality traditional Chinese instrument performances (e.g., Guqin (Han Chinese), Guzheng (Han Chinese), Matouqin (Mongolian), Hulusi (Dai), Pipa (Han Chinese), Yangqin (Han Chinese), Erhu (Han Chinese)).

🎡 Audio Samples

Below are samples generated by our model using prompt templates.

| Instrument | Prompt |
| --- | --- |
| Guqin (古琴) | A performance of guqin, featuring woody tone, traditional Chinese folk music style, medium tempo. |
| Guzheng (古筝) | A performance of guzheng, featuring bright and elegant tone, traditional Chinese folk music style, medium tempo. |
| Matouqin (马头琴) | A performance of matouqin, featuring wild tone, traditional Chinese folk music style, fast tempo. |
| Hulusi (葫芦丝) | A performance of hulusi, featuring gentle and lyrical tone, traditional Chinese folk music style, slow tempo. |
| Pipa (琵琶) | A performance of pipa, featuring crisp and rhythmic tone, traditional Chinese folk music style, fast tempo. |
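The prompts above all follow one fixed template: instrument, tone description, style, and tempo. As a minimal sketch (the `build_prompt` helper is illustrative, not part of the released code), such prompts can be composed programmatically:

```python
# Hypothetical helper mirroring the prompt template used in the samples above.
def build_prompt(instrument: str, tone: str, tempo: str) -> str:
    """Compose a CTIM-Gen style text prompt from template fields."""
    return (f"A performance of {instrument}, featuring {tone} tone, "
            f"traditional Chinese folk music style, {tempo} tempo.")

prompt = build_prompt("guqin", "deep and resonant", "slow")
```

Keeping prompts on this template during inference matches the distribution the model was fine-tuned on, which generally improves conditioning.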

πŸ”₯ Training Details

The MusicGen-Small model was fine-tuned on the Chinese-trad-inst dataset. Below are the specific hyperparameters used during training:

| Configuration | Value |
| --- | --- |
| Model Architecture | MusicGen-Small |
| Dataset | Chinese-trad-inst |
| Training Epochs | 10 |
| Optimizer | AdamW |
| Optimizer Betas | (beta1=0.9, beta2=0.95) |
| Weight Decay | 0.1 |
| Learning Rate | 1e-5 |
| Effective Batch Size | 128 |
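The optimizer configuration from the table above maps directly onto PyTorch's `AdamW`. A minimal sketch (the `Linear` module is a stand-in for the actual MusicGen language model, which is not loaded here):

```python
import torch

# Placeholder module standing in for model.lm; only the optimizer
# hyperparameters below are taken from the training table.
lm = torch.nn.Linear(8, 8)

optimizer = torch.optim.AdamW(
    lm.parameters(),
    lr=1e-5,            # Learning Rate
    betas=(0.9, 0.95),  # Optimizer Betas
    weight_decay=0.1,   # Weight Decay
)
```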

πŸ“Š Benchmark Results

Evaluated on CTIM-Bench (500 samples). Lower FAD/KL/JSD is better; higher CLAP is better.

| Model | FAD (↓) | CLAP Score (↑) | KL Divergence (↓) | JSD (↓) |
| --- | --- | --- | --- | --- |
| CTIM-Gen (Ours) (rvq0.1) | 3.27 | 0.431 | 0.0083 | 0.0021 |
| MusicGen Small (Base) | 4.94 | 0.546 | 0.0724 | 0.0180 |
| MusicGen Medium | 5.28 | 0.523 | 0.0920 | 0.0235 |
| MusicGen Large | 8.51 | 0.473 | 0.0232 | 0.0059 |
| AudioLDM2 | 10.24 | 0.402 | 0.0379 | 0.0094 |
| AudioGen | 12.61 | 0.321 | 0.0240 | 0.0059 |

(Note: Results aligned with ICMR 2026 submission)
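Of the distributional metrics above, the Jensen-Shannon divergence (JSD) is the easiest to reproduce from scratch. A minimal sketch of its definition over two discrete distributions (this is the standard formula, not the benchmark's exact feature-extraction pipeline):

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Jensen-Shannon divergence (base e) between two discrete distributions.

    JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), where m = (p + q) / 2.
    Symmetric, and bounded in [0, log 2].
    """
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical distributions score 0; fully disjoint ones score log 2 (≈ 0.693), so the small benchmark values indicate closely matched feature distributions.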

πŸ—£οΈ Subjective Evaluation (MOS)

We conducted a MUSHRA-like subjective evaluation with 32 listeners to assess Semantic Consistency (REL) and Audio Quality (OVL).

| Model | Semantic Consistency (REL) | Audio Quality (OVL) |
| --- | --- | --- |
| CTIM-Gen (Ours) | 3.98 ± 0.17 | 3.65 ± 0.18 |
| MusicGen Large | 2.42 ± 0.22 | 2.91 ± 0.21 |
| MusicGen Small | 2.37 ± 0.20 | 2.73 ± 0.20 |
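Scores of the form "mean ± interval" are conventionally a mean opinion score with a confidence half-width. As a hedged sketch (assuming a normal-approximation 95% interval; the paper's exact statistic may differ), such values can be computed as:

```python
import math
import statistics

def mos_with_ci(ratings, z: float = 1.96):
    """Mean opinion score with an approximate 95% confidence half-width."""
    mean = statistics.mean(ratings)
    half = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half

# Ratings here are made-up example data, not the study's responses.
mean, half = mos_with_ci([4.0, 3.5, 4.5, 4.0, 3.5])
```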

Visual Analysis

Spectrogram comparison between CTIM-Gen and the baseline (figure).

πŸš€ Usage

1. Installation

```shell
pip install audiocraft
```

2. Inference

You can download the model weights and use our inference script code/inference.py, or use the following Python snippet:

```python
import os

import torch
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# 1. Load the base model
model = MusicGen.get_pretrained('facebook/musicgen-small')

# 2. Load the CTIM-Gen weights
# Download 'rvq0.1_tokendrop0.0.pt' (or the best model) from 'checkpoints/' and place it locally
model_path = 'rvq0.1_tokendrop0.0.pt'
if os.path.exists(model_path):
    model.lm.load_state_dict(torch.load(model_path, map_location='cpu'))
else:
    print("Please download the weights file first.")

# 3. Set generation params (30 s duration, best config)
model.set_generation_params(
    duration=30,
    top_k=250,
    top_p=0.0,
    temperature=1.0,
    cfg_coef=3.0
)

# 4. Generate and write a loudness-normalized WAV file
prompts = ["A performance of guqin, featuring deep and resonant tone, traditional Chinese folk music style, slow tempo."]
wav = model.generate(prompts)
audio_write('output', wav[0].cpu(), model.sample_rate, strategy="loudness")
```

πŸ“‚ Repository Structure

  • checkpoints/: Model weights (Best model & Ablation studies).
  • code/: Inference and benchmarking scripts.
  • evidence/: Logs and metric results supporting the paper.
  • configs/: Benchmark metadata.

πŸ”— Dataset

The training and benchmark dataset Chinese-trad-inst is available at: CTIM-Gen/Chinese-trad-inst (Benchmark subset available at CTIM-Gen/CTIM-Bench).

βš–οΈ Data Source & Usage Policy

The audio data used in this project was primarily collected from China Music Network (www.china1901.com), with a small amount supplemented from Bilibili.

Usage Policy: This dataset is intended solely for academic research in the field of AI music generation. Any commercial use is strictly prohibited.


Submitted to ICMR 2026 (Anonymous Submission)
