Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Jingyu Lu, Rongjie Huang, Ruiyuan Zhang, Zhiqing Hong, Ziyue Jiang, Zhou Zhao | Zhejiang University
PyTorch implementation of AccompBand, the accompaniment generation module of VersBand (EMNLP 2025): Versatile Framework for Song Generation with Prompt-based Control.
Visit our demo page for song samples.
- 2025.08: We released the code of AccompBand!
- 2025.08: VersBand is accepted by EMNLP 2025!
- We propose VersBand, a multi-task song generation approach for generating high-quality, aligned songs with prompt-based control.
- We design a decoupled model VocalBand, which leverages the flow-matching method to generate singing styles, pitches, and mel-spectrograms, enabling fast and high-quality vocal synthesis with high-level style control.
- We introduce AccompBand, a flow-based transformer model that generates high-quality, controllable, aligned accompaniments, with Band-MOE selecting suitable experts for enhanced quality, alignment, and control (a generic flow-matching sketch follows this list).
- Experimental results demonstrate that VersBand achieves superior objective and subjective evaluations compared to baseline models across multiple song generation tasks.
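Both VocalBand and AccompBand build on flow matching. The sketch below shows only the generic conditional flow-matching training objective (interpolate between noise and data, regress the velocity); the model signature, tensor shapes, and conditioning are assumptions, not the actual VersBand implementation.

```python
# Generic flow-matching training step, for illustration only.
# The real AccompBand architecture, conditioning, and shapes differ.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """x1: clean latents (B, T, D); cond: conditioning features (hypothetical)."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1                      # linear interpolation path
    v_target = x1 - x0                               # constant velocity target
    v_pred = model(x_t, t.squeeze(), cond)           # predicted velocity field
    return F.mse_loss(v_pred, v_target)
```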
Since VocalBand is similar to our other SVS models (such as TCSinger and TechSinger), we only provide the code of AccompBand in this repo. Below is an example of how to train your own model and run inference with AccompBand.
To try on your own song dataset, clone this repo on your local machine with NVIDIA GPU + CUDA cuDNN and follow the instructions below.
A suitable conda environment named versband can be created and activated with:
conda create -n versband python=3.10
conda activate versband
conda install --yes --file requirements.txt
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can specify which GPUs to use by setting the CUDA_DEVICES_AVAILABLE environment variable before running the training module.
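A minimal sketch of restricting the visible GPUs from Python, assuming the training entry point reads CUDA_DEVICES_AVAILABLE (as mentioned above) and that it maps onto CUDA_VISIBLE_DEVICES; check main.py for the exact variable the code expects.

```python
# Restrict training to specific GPUs before torch is imported.
# The mapping onto CUDA_VISIBLE_DEVICES is an assumption for this sketch.
import os

os.environ.setdefault("CUDA_DEVICES_AVAILABLE", "0,1")  # hypothetical GPU selection
os.environ.setdefault("CUDA_VISIBLE_DEVICES", os.environ["CUDA_DEVICES_AVAILABLE"])

import torch  # imported after the environment variables are set

print(f"Visible GPUs: {torch.cuda.device_count()}")
```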
- Crawl websites to build your own song datasets, then annotate them with automatic tools, like source–accompaniment separation, MIDI extraction, beat tracking, and music caption annotation.
- Prepare TSV files that include at least an item_name column, and adapt preprocess/preprocess.py to parse your custom file format accordingly (see the TSV sketch after this list).
- Preprocess the dataset:
export PYTHONPATH=.
python preprocess/preprocess.py
- Compute mel-spectrograms:
python preprocess/mel_spec_24k.py --tsv_path ./data/music24k/music.tsv --num_gpus 4 --max_duration 20
- Post-process:
python preprocess/postprocess_data.py
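The exact TSV layout depends on your annotations; a minimal sketch of writing a metadata file is shown below. Only item_name is required by the repo, and the other columns (wav_path, caption, duration) are hypothetical examples of annotations you might add for preprocess/preprocess.py to parse.

```python
# Minimal sketch of building a metadata TSV for preprocess/preprocess.py.
# item_name comes from the repo's requirements; the remaining columns are
# hypothetical and should be adapted to your own annotation pipeline.
import os
import pandas as pd

os.makedirs("./data/music24k", exist_ok=True)
rows = [
    {"item_name": "song_0001", "wav_path": "data/raw/song_0001.wav",
     "caption": "upbeat pop with piano", "duration": 187.4},
    {"item_name": "song_0002", "wav_path": "data/raw/song_0002.wav",
     "caption": "slow acoustic ballad", "duration": 201.9},
]
pd.DataFrame(rows).to_csv("./data/music24k/music.tsv", sep="\t", index=False)
```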
- Train the VAE module and duration predictor
python main.py --base configs/ae_accomp.yaml -t --gpus 0,1,2,3,4,5,6,7
- Train the main VersBand model
python main.py --base configs/vocal2music.yaml -t --gpus 0,1,2,3,4,5,6,7
Notes
- Adjust the compression ratio in the config files (and related scripts).
- Change the padding length in the dataloader as needed (see the collate sketch below).
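The padding note refers to how variable-length features are batched. A minimal sketch of a fixed-length collate function follows; MAX_FRAMES and the (T, n_mels) feature layout are hypothetical and should be matched to this repo's dataloader.

```python
# Minimal sketch of fixed-length padding in a dataloader collate function.
# MAX_FRAMES and the feature shapes are hypothetical, not the repo's defaults.
import torch

MAX_FRAMES = 2000  # hypothetical padding length in mel frames

def collate_fn(batch):
    """Pad (or truncate) each (T, n_mels) mel tensor to MAX_FRAMES frames."""
    padded, lengths = [], []
    for mel in batch:
        t = min(mel.shape[0], MAX_FRAMES)
        out = torch.zeros(MAX_FRAMES, mel.shape[1])
        out[:t] = mel[:t]
        padded.append(out)
        lengths.append(t)
    return torch.stack(padded), torch.tensor(lengths)
```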
- Run inference with the trained model
python scripts/test_final.py
Replace the checkpoint path and CFG coefficient as required.
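The CFG coefficient above is a classifier-free guidance scale. The sketch below shows only how such a scale typically blends conditional and unconditional predictions; the function and argument names are hypothetical, and scripts/test_final.py exposes the actual interface.

```python
# Minimal sketch of classifier-free guidance (CFG).
# model_fn and its arguments are hypothetical, not the repo's API.
def guided_velocity(model_fn, x_t, t, cond, cfg_scale: float = 3.0):
    """Blend unconditional and conditional predictions with a CFG scale."""
    v_uncond = model_fn(x_t, t, cond=None)   # prompt dropped
    v_cond = model_fn(x_t, t, cond=cond)     # prompt kept
    return v_uncond + cfg_scale * (v_cond - v_uncond)
```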
This implementation uses parts of the code from the following GitHub repos, as described in our code: Make-An-Audio-3, TCSinger2, and Lumina-T2X.
If you find this code useful in your research, please cite our work:
@article{zhang2025versatile,
title={Versatile framework for song generation with prompt-based control},
author={Zhang, Yu and Guo, Wenxiang and Pan, Changhao and Zhu, Zhiyuan and Li, Ruiqi and Lu, Jingyu and Huang, Rongjie and Zhang, Ruiyuan and Hong, Zhiqing and Jiang, Ziyue and others},
journal={arXiv preprint arXiv:2504.19062},
year={2025}
}
Any organization or individual is prohibited from using any technology mentioned in this paper to generate anyone's songs without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply may constitute a violation of copyright law.