
Versatile Framework for Song Generation with Prompt-based Control

Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Ruiqi Li, Jingyu Lu, Rongjie Huang, Ruiyuan Zhang, Zhiqing Hong, Ziyue Jiang, Zhou Zhao | Zhejiang University

PyTorch implementation of the AccompBand component of VersBand (EMNLP 2025): Versatile Framework for Song Generation with Prompt-based Control.


Visit our demo page for song samples.

News

  • 2025.08: We released the code of AccompBand!
  • 2025.08: VersBand is accepted by EMNLP 2025!

Key Features

  • We propose VersBand, a multi-task song generation approach for generating high-quality, aligned songs with prompt-based control.
  • We design VocalBand, a decoupled model that leverages flow matching (sketched after this list) to generate singing styles, pitches, and mel-spectrograms, enabling fast, high-quality vocal synthesis with high-level style control.
  • We introduce AccompBand, a flow-based transformer model that generates high-quality, controllable, aligned accompaniments, with Band-MOE selecting suitable experts for enhanced quality, alignment, and control.
  • Experimental results demonstrate that VersBand achieves superior objective and subjective evaluations compared to baseline models across multiple song generation tasks.
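
Both VocalBand and AccompBand rely on flow matching. As a rough, hypothetical sketch (not the repo's actual training code), a rectified-flow variant of the objective regresses the constant velocity between a noise sample and the data; the model signature and names here are assumptions:

```python
import torch

def flow_matching_loss(model, x1, cond):
    """Toy rectified-flow objective: regress the velocity (x1 - x0)
    along the straight path between noise x0 and data x1."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.size(0), device=x1.device)   # one timestep per item
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over features
    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight path
    v_target = x1 - x0                             # constant velocity target
    v_pred = model(xt, t, cond)                    # hypothetical velocity model
    return torch.nn.functional.mse_loss(v_pred, v_target)
```

At inference time, an ODE solver integrates the learned velocity field from noise to a sample, which is what enables fast generation compared with many-step diffusion.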

Quick Start

Since VocalBand is similar to our other SVS models (such as TCSinger and TechSinger), we only provide the code of AccompBand in this repo. Below, we show how to train your own model and run inference with AccompBand.

To try it on your own song dataset, clone this repo to a local machine with an NVIDIA GPU + CUDA cuDNN and follow the instructions below.

Dependencies

A suitable conda environment named versband can be created and activated with:

conda create -n versband python=3.10
conda activate versband
conda install --yes --file requirements.txt

Multi-GPU

By default, this implementation trains in parallel on all GPUs reported by torch.cuda.device_count(). You can restrict which GPUs are used by setting the CUDA_DEVICES_AVAILABLE environment variable before running the training module.
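
The logic described above amounts to something like the following minimal sketch; the helper name resolve_gpu_ids is ours, not part of the repo:

```python
import os
import torch

def resolve_gpu_ids():
    """Use the GPUs listed in CUDA_DEVICES_AVAILABLE if set,
    otherwise fall back to every visible device."""
    env = os.environ.get("CUDA_DEVICES_AVAILABLE")
    if env:
        return [int(i) for i in env.split(",")]
    return list(range(torch.cuda.device_count()))
```

For example, exporting CUDA_DEVICES_AVAILABLE=0,1 before launching training would restrict it to the first two GPUs.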

Data Preparation

  1. Crawl websites to build your own song datasets, then annotate them with automatic tools, such as source–accompaniment separation, MIDI extraction, beat tracking, and music caption annotation.
  2. Prepare TSV files that include at least an item_name column (see the sketch after this list), and adapt preprocess/preprocess.py to parse your custom file format accordingly.
  3. Preprocess the dataset:
export PYTHONPATH=.
python preprocess/preprocess.py
  4. Compute mel-spectrograms:
python preprocess/mel_spec_24k.py --tsv_path ./data/music24k/music.tsv --num_gpus 4 --max_duration 20
  5. Post-process:
python preprocess/postprocess_data.py
  6. Download HiFi-GAN as the vocoder into useful_ckpts/hifigan and FLAN-T5 into useful_ckpts/flan-t5-large.
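
For step 2, only the item_name column is required by the instructions above; the wav_path and caption fields in this sketch are hypothetical placeholders for whatever metadata your annotation tools produce:

```python
import csv

# Hypothetical rows: item_name is the only column documented as required
# by preprocess/preprocess.py; the remaining fields are illustrative.
rows = [
    {"item_name": "song_0001", "wav_path": "data/wav/song_0001.wav",
     "caption": "upbeat pop with piano and drums"},
    {"item_name": "song_0002", "wav_path": "data/wav/song_0002.wav",
     "caption": "slow acoustic ballad"},
]

with open("data/music24k/music.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
```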

Training AccompBand

  1. Train the VAE module and duration predictor:
python main.py --base configs/ae_accomp.yaml -t --gpus 0,1,2,3,4,5,6,7
  2. Train the main VersBand model:
python main.py --base configs/vocal2music.yaml -t --gpus 0,1,2,3,4,5,6,7

Notes

  • Adjust the compression ratio in the config files (and related scripts).
  • Change the padding length in the dataloader as needed (see the sketch below).
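
As a hypothetical illustration of the padding note, a dataloader's collate step might pad or truncate each mel-spectrogram to a fixed frame count; the function name and shapes here are assumptions:

```python
import torch

def pad_or_truncate(mel, max_frames):
    """Pad a [n_mels, T] mel-spectrogram with zeros (or cut it) to max_frames.
    The fixed length should stay consistent with the VAE compression ratio."""
    n_mels, t = mel.shape
    if t >= max_frames:
        return mel[:, :max_frames]
    pad = torch.zeros(n_mels, max_frames - t, dtype=mel.dtype)
    return torch.cat([mel, pad], dim=1)
```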

Inference with AccompBand

python scripts/test_final.py

Set the checkpoint path and the classifier-free guidance (CFG) coefficient in the script as required.
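
The sampler in scripts/test_final.py is the authoritative implementation; for orientation, classifier-free guidance generically combines conditional and unconditional predictions as follows (a generic sketch, not the repo's exact code):

```python
def cfg_mix(v_cond, v_uncond, cfg_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one."""
    return v_uncond + cfg_scale * (v_cond - v_uncond)
```

Larger coefficients follow the conditioning more closely at the cost of diversity; a value of 1.0 disables guidance.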

Acknowledgements

This implementation uses parts of the code from the following GitHub repos, as described in our code: Make-An-Audio-3, TCSinger2, and Lumina-T2X.

Citations

If you find this code useful in your research, please cite our work:

@article{zhang2025versatile,
  title={Versatile framework for song generation with prompt-based control},
  author={Zhang, Yu and Guo, Wenxiang and Pan, Changhao and Zhu, Zhiyuan and Li, Ruiqi and Lu, Jingyu and Huang, Rongjie and Zhang, Ruiyuan and Hong, Zhiqing and Jiang, Ziyue and others},
  journal={arXiv preprint arXiv:2504.19062},
  year={2025}
}

Disclaimer

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's songs without their consent, including but not limited to government leaders, political figures, and celebrities. Failure to comply may constitute a violation of copyright law.

