Paper link: [arXiv:2510.04474](https://arxiv.org/abs/2510.04474)
While existing methods incorporate length rewards into GRPO to promote concise reasoning, they incur significant performance degradation. We identify the root cause: when rewards for correct but long rollouts are penalized, GRPO's group-relative advantage function can assign them negative advantages, actively discouraging valid reasoning.
DRPO is a novel framework that decouples the length-based learning signal of correct rollouts from that of incorrect ones. DRPO is grounded in integrating an optimized positive data distribution, which maximizes length-based rewards under KL regularization, into a discriminative objective. We derive a closed-form solution for this distribution, enabling efficient learning.
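To make the failure mode and the decoupled treatment concrete, here is a minimal toy sketch (not the exact DRPO objective; the rewards, the penalty coefficient, and the λ value are illustrative assumptions). It contrasts GRPO's group-relative advantage under a length penalty with a reweighting of correct rollouts by the closed-form solution of the KL-regularized length-reward maximization, q(y) ∝ p(y)·exp(r_len(y)/λ), with incorrect rollouts handled by a separate term:

```python
import numpy as np

# Toy group of 6 rollouts for one prompt (values are illustrative).
correct = np.array([1, 1, 1, 1, 1, 0], dtype=float)               # 1 = correct answer
length  = np.array([800, 1600, 2400, 3200, 4000, 2000], float)    # generated tokens

# Length-penalized reward commonly added to GRPO: +1 for a correct answer
# minus a penalty that grows with generation length.
reward = correct - 0.5 * (length / length.max())

# GRPO-style group-relative advantage: standardize rewards within the group.
adv = (reward - reward.mean()) / (reward.std() + 1e-8)
print("GRPO advantages:", np.round(adv, 2))
# -> the longest *correct* rollout receives a negative advantage,
#    so its valid reasoning is actively discouraged.

# Decoupled treatment (sketch): only correct rollouts are reweighted, using the
# closed-form solution of  max_q E_q[r_len] - lam * KL(q || p),
# i.e. weights proportional to exp(r_len / lam). Incorrect rollouts are handled
# by a separate discriminative term and never receive this length-based signal.
lam = 0.1                                  # KL-regularization strength (illustrative)
r_len = -length / length.max()             # shorter is better
w = np.where(correct == 1, np.exp(r_len / lam), 0.0)
w /= w.sum()                               # tilted distribution over correct rollouts
print("Decoupled positive weights:", np.round(w, 4))
# -> every correct rollout keeps a non-negative weight; shorter correct
#    rollouts are simply up-weighted relative to longer ones.
```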
Figure: comparison of the performance-efficiency trade-off for fine-tuning the 1.5B model (left) and the 7B model (right). Grey lines mark the base-model performance before fine-tuning, with generation lengths of 4698 (1.5B) and 4119 (7B). Squares denote models trained with reference methods without length penalties.
- DRPO finetuned DeepSeek-R1-Distill-Qwen-1.5B Model: DRPO-1.5B-Lambda-0.1
- DRPO finetuned DeepSeek-R1-Distill-Qwen-7B Model: DRPO-7B-Lambda-0.1
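For quick inference with a released checkpoint, a minimal sketch using Hugging Face `transformers` is given below. This snippet is not part of the repo; the model id is copied from the evaluation example further down, and the prompt and sampling settings are arbitrary, so substitute the checkpoint you actually want to use.

```python
# Minimal inference sketch (not part of the repository).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ganglii/DRPO-1.5B"  # taken from the evaluation example below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat-formatted prompt and generate a reasoning trace.
messages = [{"role": "user", "content": "Solve: what is 17 * 24? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```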
# Recommend Python 3.10.
conda create -n drpo python=3.10
conda activate drpo
git clone https://github.com/Optimization-AI/DisCO.git DRPO
cd DRPO
pip install -e ./verl
pip install -e ./deepscaler
pip install wandb
If the above commands install a version of vllm other than vllm==0.6.3, and you can't manually install vllm==0.6.3 due to conflicting dependencies related to outlines, please try the following workaround:
pip install --no-deps vllm==0.6.3
pip install outlines==0.0.6 xformers==0.0.27.post2 torchvision==0.19 torch==2.4.0 lm-format-enforcer==0.10.6 gguf==0.10.0 pyzmq partial-json-parser msgspec mistral-common
pip uninstall -y vllm-flash-attn
Datasets used in our training are included in the datasets folder. Feel free to adapt scripts/data/deepscaler_dataset.py to generate your own datasets.
We provide training scripts for both single-node and multi-node setups in scripts/train/.
We start with a single node (8 A100-80GB GPUs) to train the 1.5B Qwen model with an 8k context length.
bash ./scripts/train/run_drpo_1.5b_8k.sh  # DRPO
To train with longer context or larger models, multi-node training is necessary. To achieve this, follow these steps:
- On the head node:
# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS
# Start Ray head node
ray start --head
- On each worker node:
# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS
# Connect to head node (replace with your head node's address)
ray start --address=[RAY_ADDRESS]
- Finally, on the head node, run the training script, such as:
bash ./scripts/train/run_drpo_1.5b_8k.sh
Our evaluation script automatically runs vLLM to generate 16 samples for each problem. To run it, use:
bash ./scripts/eval/eval_model.sh --model [CHECKPOINT_PATH] --datasets [DATASET1] [DATASET2] --output-dir [OUTPUT_DIR]
We report Pass@1 accuracy averaged over 16 samples for each problem. For example, to replicate our reported numbers, run:
bash ./scripts/eval/eval_model.sh --model ganglii/DRPO-1.5B --datasets aime aime25 olympiad_bench math gsm8k --output-dir ./val_results/DRPO-1.5B
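Pass@1 averaged over 16 samples is simply the fraction of correct generations per problem, averaged over problems. The minimal sketch below shows the computation; the input format is an illustrative assumption, not the repo's actual output schema.

```python
import numpy as np

def pass_at_1(correct_matrix: np.ndarray) -> float:
    """Pass@1 averaged over k samples per problem.

    correct_matrix: shape (num_problems, k), entries 1.0 if a sample's
    final answer is correct, else 0.0. (Illustrative input format.)
    """
    per_problem = correct_matrix.mean(axis=1)  # average over the k samples
    return float(per_problem.mean())           # average over problems

# Example: 3 problems, 16 samples each, with dummy correctness values.
rng = np.random.default_rng(0)
dummy = (rng.random((3, 16)) < 0.6).astype(float)
print(f"Pass@1 (avg over 16 samples): {pass_at_1(dummy):.3f}")
```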
- Our training pipeline is built on the GitHub repository DeepScaleR, which is based on the verl framework. We thank the authors for open-sourcing their code.
If you find DRPO useful in your research, please consider citing the following paper:
@article{li2025drpo,
  title={DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization},
  author={Li, Gang and Chen, Yan and Lin, Ming and Yang, Tianbao},
  journal={arXiv preprint arXiv:2510.04474},
  year={2025}
}