Paper link: [arXiv:2510.04474](https://arxiv.org/abs/2510.04474)
While existing methods incorporate length rewards into GRPO to promote concise reasoning, they incur significant performance degradation. We identify the root cause: when rewards for correct but long rollouts are penalized, GRPO's group-relative advantage function can assign them negative advantages, actively discouraging valid reasoning.
DRPO is a novel framework that decouples the length-based learning signal of correct rollouts from that of incorrect ones. DRPO is grounded in integrating an optimized positive data distribution, which maximizes length-based rewards under KL regularization, into a discriminative objective. We derive a closed-form solution for this distribution, enabling efficient learning.
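To make the failure mode and the decoupled treatment concrete, here is a minimal toy sketch (not the exact DRPO objective; the rewards, the penalty coefficient, and the λ value are illustrative assumptions). It contrasts GRPO's group-relative advantage under a length penalty with a reweighting of correct rollouts by the closed-form solution of the KL-regularized length-reward maximization, q(y) ∝ p(y)·exp(r_len(y)/λ), with incorrect rollouts handled by a separate term:

```python
import numpy as np

# Toy group of 6 rollouts for one prompt (values are illustrative).
correct = np.array([1, 1, 1, 1, 1, 0], dtype=float)               # 1 = correct answer
length  = np.array([800, 1600, 2400, 3200, 4000, 2000], float)    # generated tokens

# Length-penalized reward commonly added to GRPO: +1 for a correct answer
# minus a penalty that grows with generation length.
reward = correct - 0.5 * (length / length.max())

# GRPO-style group-relative advantage: standardize rewards within the group.
adv = (reward - reward.mean()) / (reward.std() + 1e-8)
print("GRPO advantages:", np.round(adv, 2))
# -> the longest *correct* rollout receives a negative advantage,
#    so its valid reasoning is actively discouraged.

# Decoupled treatment (sketch): only correct rollouts are reweighted, using the
# closed-form solution of  max_q E_q[r_len] - lam * KL(q || p),
# i.e. weights proportional to exp(r_len / lam). Incorrect rollouts are handled
# by a separate discriminative term and never receive this length-based signal.
lam = 0.1                                  # KL-regularization strength (illustrative)
r_len = -length / length.max()             # shorter is better
w = np.where(correct == 1, np.exp(r_len / lam), 0.0)
w /= w.sum()                               # tilted distribution over correct rollouts
print("Decoupled positive weights:", np.round(w, 4))
# -> every correct rollout keeps a non-negative weight; shorter correct
#    rollouts are simply up-weighted relative to longer ones.
```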
Figure: comparison of the performance-efficiency trade-off for fine-tuning the 1.5B model (left) and the 7B model (right). Grey lines mark the base-model performance before fine-tuning, with generation lengths of 4698 (1.5B) and 4119 (7B). Squares denote models trained with reference methods without length penalties.
- DRPO finetuned DeepSeek-R1-Distill-Qwen-1.5B Model: DRPO-1.5B-Lambda-0.1
- DRPO finetuned DeepSeek-R1-Distill-Qwen-7B Model: DRPO-7B-Lambda-0.1
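For quick inference with a released checkpoint, a minimal sketch using Hugging Face `transformers` is given below. This snippet is not part of the repo; the model id is copied from the evaluation example further down, and the prompt and sampling settings are arbitrary, so substitute the checkpoint you actually want to use.

```python
# Minimal inference sketch (not part of the repository).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ganglii/DRPO-1.5B"  # taken from the evaluation example below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Build a chat-formatted prompt and generate a reasoning trace.
messages = [{"role": "user", "content": "Solve: what is 17 * 24? Think step by step."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```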
# Recommend Python 3.10.
conda create -n drpo python=3.10
conda activate drpo
git clone https://github.com/Optimization-AI/DisCO.git DRPO
cd DRPO
pip install -e ./verl
pip install -e ./deepscaler
pip install wandb
If the above commands install a version of vllm other than vllm==0.6.3, and you can't manually install vllm==0.6.3 due to conflicting dependencies related to outlines, please try the following workaround:
pip install --no-deps vllm==0.6.3
pip install outlines==0.0.6 xformers==0.0.27.post2 torchvision==0.19 torch==2.4.0 lm-format-enforcer==0.10.6 gguf==0.10.0 pyzmq partial-json-parser msgspec mistral-common
pip uninstall -y vllm-flash-attn
Datasets used in our training are included in the datasets folder. Feel free to adapt scripts/data/deepscaler_dataset.py to generate your own datasets.
We provide training scripts for both single-node and multi-node setups in scripts/train/.
We start with a single node (8 A100-80GB GPUs) to train the 1.5B Qwen model with an 8k context length.
bash ./scripts/train/run_drpo_1.5b_8k.sh  # DRPO
To train with longer context or larger models, multi-node training is necessary. To achieve this, follow these steps:
- On the head node:
# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS
# Start Ray head node
ray start --head
- On each worker node:
# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS
# Connect to head node (replace with your head node's address)
ray start --address=[RAY_ADDRESS]
- Finally, on the head node, run the training script, such as:
bash ./scripts/train/run_drpo_1.5b_8k.sh
Our evaluation script automatically runs vLLM to generate 16 samples for each problem. To run it, use:
bash ./scripts/eval/eval_model.sh --model [CHECKPOINT_PATH] --datasets [DATASET1] [DATASET2] --output-dir [OUTPUT_DIR]
We report Pass@1 accuracy averaged over 16 samples for each problem. For example, to replicate our reported numbers, run:
bash ./scripts/eval/eval_model.sh --model ganglii/DRPO-1.5B --datasets aime aime25 olympiad_bench math gsm8k --output-dir ./val_results/DRPO-1.5B
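Pass@1 averaged over 16 samples is simply the fraction of correct generations per problem, averaged over problems. The minimal sketch below shows the computation; the input format is an illustrative assumption, not the repo's actual output schema.

```python
import numpy as np

def pass_at_1(correct_matrix: np.ndarray) -> float:
    """Pass@1 averaged over k samples per problem.

    correct_matrix: shape (num_problems, k), entries 1.0 if a sample's
    final answer is correct, else 0.0. (Illustrative input format.)
    """
    per_problem = correct_matrix.mean(axis=1)  # average over the k samples
    return float(per_problem.mean())           # average over problems

# Example: 3 problems, 16 samples each, with dummy correctness values.
rng = np.random.default_rng(0)
dummy = (rng.random((3, 16)) < 0.6).astype(float)
print(f"Pass@1 (avg over 16 samples): {pass_at_1(dummy):.3f}")
```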
- Our training pipeline is built on the GitHub repository DeepScaleR, which is based on the verl framework. We thank the authors for open-sourcing their code.
If you find DRPO useful in your research, please consider citing the following paper:
@article{li2025drpo,
  title={DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization},
  author={Li, Gang and Chen, Yan and Lin, Ming and Yang, Tianbao},
  journal={arXiv preprint arXiv:2510.04474},
  year={2025}
}