Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Jingyu Lu, Rongjie Huang, Ruiyuan Zhang, Zhiqing Hong, Ziyue Jiang, Zhou Zhao | Zhejiang University
PyTorch implementation of AccompBand, the accompaniment generation module of VersBand (EMNLP 2025): Versatile Framework for Song Generation with Prompt-based Control.
Visit our demo page for song samples.
- 2025.08: We released the code of AccompBand!
- 2025.08: VersBand is accepted by EMNLP 2025!
- We propose VersBand, a multi-task song generation approach for generating high-quality, aligned songs with prompt-based control.
- We design a decoupled model VocalBand, which leverages the flow-matching method to generate singing styles, pitches, and mel-spectrograms, enabling fast and high-quality vocal synthesis with high-level style control.
- We introduce AccompBand, a flow-based transformer model that generates high-quality, controllable, aligned accompaniments, with Band-MOE selecting suitable experts for enhanced quality, alignment, and control (a generic flow-matching sketch follows this list).
- Experimental results demonstrate that VersBand achieves superior objective and subjective evaluations compared to baseline models across multiple song generation tasks.
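Both VocalBand and AccompBand build on flow matching. The sketch below shows only the generic conditional flow-matching training objective (interpolate between noise and data, regress the velocity); the model signature, tensor shapes, and conditioning are assumptions, not the actual VersBand implementation.

```python
# Generic flow-matching training step, for illustration only.
# The real AccompBand architecture, conditioning, and shapes differ.
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x1, cond):
    """x1: clean latents (B, T, D); cond: conditioning features (hypothetical)."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1                      # linear interpolation path
    v_target = x1 - x0                               # constant velocity target
    v_pred = model(x_t, t.squeeze(), cond)           # predicted velocity field
    return F.mse_loss(v_pred, v_target)
```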
Since VocalBand is similar to our other SVS models (such as TCSinger and TechSinger), we only provide the code of AccompBand in this repo. Below is an example of how to train your own model and run inference with AccompBand.
To try on your own song dataset, clone this repo on your local machine with NVIDIA GPU + CUDA cuDNN and follow the instructions below.
A suitable conda environment named versband can be created and activated with:
conda create -n versband python=3.10
conda activate versband
conda install --yes --file requirements.txt
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count(). You can specify which GPUs to use by setting the CUDA_DEVICES_AVAILABLE environment variable before running the training module.
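A minimal sketch of restricting the visible GPUs from Python, assuming the training entry point reads CUDA_DEVICES_AVAILABLE (as mentioned above) and that it maps onto CUDA_VISIBLE_DEVICES; check main.py for the exact variable the code expects.

```python
# Restrict training to specific GPUs before torch is imported.
# The mapping onto CUDA_VISIBLE_DEVICES is an assumption for this sketch.
import os

os.environ.setdefault("CUDA_DEVICES_AVAILABLE", "0,1")  # hypothetical GPU selection
os.environ.setdefault("CUDA_VISIBLE_DEVICES", os.environ["CUDA_DEVICES_AVAILABLE"])

import torch  # imported after the environment variables are set

print(f"Visible GPUs: {torch.cuda.device_count()}")
```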
- Crawl websites to build your own song datasets, then annotate them with automatic tools, like source–accompaniment separation, MIDI extraction, beat tracking, and music caption annotation.
- Prepare TSV files that include at least an item_name column, and adapt preprocess/preprocess.py to parse your custom file format accordingly (see the TSV sketch after this list).
- Preprocess the dataset:
export PYTHONPATH=.
python preprocess/preprocess.py
- Compute mel-spectrograms:
python preprocess/mel_spec_24k.py --tsv_path ./data/music24k/music.tsv --num_gpus 4 --max_duration 20
- Post-process:
python preprocess/postprocess_data.py
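The exact TSV layout depends on your annotations; a minimal sketch of writing a metadata file is shown below. Only item_name is required by the repo, and the other columns (wav_path, caption, duration) are hypothetical examples of annotations you might add for preprocess/preprocess.py to parse.

```python
# Minimal sketch of building a metadata TSV for preprocess/preprocess.py.
# item_name comes from the repo's requirements; the remaining columns are
# hypothetical and should be adapted to your own annotation pipeline.
import os
import pandas as pd

os.makedirs("./data/music24k", exist_ok=True)
rows = [
    {"item_name": "song_0001", "wav_path": "data/raw/song_0001.wav",
     "caption": "upbeat pop with piano", "duration": 187.4},
    {"item_name": "song_0002", "wav_path": "data/raw/song_0002.wav",
     "caption": "slow acoustic ballad", "duration": 201.9},
]
pd.DataFrame(rows).to_csv("./data/music24k/music.tsv", sep="\t", index=False)
```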
- Train the VAE module and duration predictor
python main.py --base configs/ae_accomp.yaml -t --gpus 0,1,2,3,4,5,6,7
- Train the main VersBand model
python main.py --base configs/vocal2music.yaml -t --gpus 0,1,2,3,4,5,6,7
Notes
- Adjust the compression ratio in the config files (and related scripts).
- Change the padding length in the dataloader as needed (see the collate sketch below).
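The padding note refers to how variable-length features are batched. A minimal sketch of a fixed-length collate function follows; MAX_FRAMES and the (T, n_mels) feature layout are hypothetical and should be matched to this repo's dataloader.

```python
# Minimal sketch of fixed-length padding in a dataloader collate function.
# MAX_FRAMES and the feature shapes are hypothetical, not the repo's defaults.
import torch

MAX_FRAMES = 2000  # hypothetical padding length in mel frames

def collate_fn(batch):
    """Pad (or truncate) each (T, n_mels) mel tensor to MAX_FRAMES frames."""
    padded, lengths = [], []
    for mel in batch:
        t = min(mel.shape[0], MAX_FRAMES)
        out = torch.zeros(MAX_FRAMES, mel.shape[1])
        out[:t] = mel[:t]
        padded.append(out)
        lengths.append(t)
    return torch.stack(padded), torch.tensor(lengths)
```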
- Run inference with the trained model
python scripts/test_final.py
Replace the checkpoint path and CFG coefficient as required.
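The CFG coefficient above is a classifier-free guidance scale. The sketch below shows only how such a scale typically blends conditional and unconditional predictions; the function and argument names are hypothetical, and scripts/test_final.py exposes the actual interface.

```python
# Minimal sketch of classifier-free guidance (CFG).
# model_fn and its arguments are hypothetical, not the repo's API.
def guided_velocity(model_fn, x_t, t, cond, cfg_scale: float = 3.0):
    """Blend unconditional and conditional predictions with a CFG scale."""
    v_uncond = model_fn(x_t, t, cond=None)   # prompt dropped
    v_cond = model_fn(x_t, t, cond=cond)     # prompt kept
    return v_uncond + cfg_scale * (v_cond - v_uncond)
```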
This implementation uses parts of the code from the following GitHub repos, as described in our code: Make-An-Audio-3, TCSinger2, and Lumina-T2X.
If you find this code useful in your research, please cite our work:
@article{zhang2025versatile,
title={Versatile framework for song generation with prompt-based control},
author={Zhang, Yu and Guo, Wenxiang and Pan, Changhao and Zhu, Zhiyuan and Li, Ruiqi and Lu, Jingyu and Huang, Rongjie and Zhang, Ruiyuan and Hong, Zhiqing and Jiang, Ziyue and others},
journal={arXiv preprint arXiv:2504.19062},
year={2025}
}
Any organization or individual is prohibited from using any technology mentioned in this paper to generate anyone's songs without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply may constitute a violation of copyright law.