Paper: https://arxiv.org/pdf/2510.06915
Models: ModelScope LongRM Collections / HuggingFace LongRM Collections
Long-RewardBench V1: ModelScope / HuggingFace

Pushing the limits of reward modeling beyond 128K tokens — with memory-efficient training and a new benchmark for long-context reward models.

- 2025-10-09 — 🚀 Code released: Full training & evaluation pipeline now open-sourced.
- 2025-10-11 — LongRMs are released on 🤗 HuggingFace LongRM Collections and Long-RewardBench is released on 🤗 HuggingFace Long-RewardBench Dataset.
- Coming Soon — We are continually improving Long-RewardBench.
Set up your training environment with the following steps:
cd LongRM
pip install -r requirements.txt
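Optionally, sanity-check the base environment before moving on. This is a minimal check we suggest here, not part of the official pipeline; it assumes PyTorch is installed by requirements.txt:

```python
# Optional sanity check for the base environment (illustrative; assumes PyTorch
# is pulled in by requirements.txt).
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```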
For optimal memory efficiency and speed during long-context training, install FlashAttention-2:

- Download the compatible `.whl` file for your CUDA version from:
  👉 https://github.com/Dao-AILab/flash-attention/releases
- Install locally:
  pip install <path_to_downloaded_flash_attn_whl_file>
- Install Ring-Flash-Attention for scalable long-sequence training:
  pip install ring_flash_attn

💡 Tip: Use `nvidia-smi` to verify your CUDA version. Ensure the FlashAttention wheel matches your driver and PyTorch setup.
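A quick way to confirm the attention kernels are importable (an illustrative check, not part of the repository):

```python
# Illustrative check that the long-context attention kernels are installed.
try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn missing: long-context training will be slow or may run out of memory.")

try:
    import ring_flash_attn  # noqa: F401
    print("ring-flash-attn: OK")
except ImportError:
    print("ring-flash-attn missing: required for ring (sequence-parallel) attention training.")
```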
LongRM unlocks long-context reward modeling with two training paths: a two-stage pipeline for generative reward models and a single-stage pipeline for discriminative reward models. Choose your modeling approach below.
Stage 1: Supervised Fine-Tuning (SFT)
Train the base model on instruction-following data:
bash scripts/sft.sh
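Conceptually, this stage minimizes next-token cross-entropy on the target responses only, with prompt tokens masked out of the loss. A minimal sketch of that objective (tensor names and shapes are illustrative, not the repository's actual code):

```python
# Illustrative SFT loss: cross-entropy on response tokens only (prompt/pad positions
# labeled -100 so they are ignored).
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq, vocab]; labels: [batch, seq] with -100 on prompt/pad positions."""
    # Shift so that tokens < n predict token n.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # skip masked prompt and padding tokens
    )
```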
Stage 2: Preference Optimization with SIMPO
Refine the model using long-context preference signals:
bash scripts/simpo_grm.sh
✅ Requires completion of Stage 1.
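For reference, SimPO optimizes a length-normalized, reference-free preference margin. The sketch below follows the SimPO paper's formulation; the hyperparameter values (`beta`, `gamma`) are illustrative defaults, not necessarily what scripts/simpo_grm.sh uses:

```python
# Illustrative SimPO-style preference loss: length-normalized log-likelihood margin.
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps: torch.Tensor,    # [batch] summed log-probs of chosen response tokens
               rejected_logps: torch.Tensor,  # [batch] summed log-probs of rejected response tokens
               chosen_lens: torch.Tensor,     # [batch] number of chosen response tokens
               rejected_lens: torch.Tensor,   # [batch] number of rejected response tokens
               beta: float = 2.0,
               gamma: float = 0.5) -> torch.Tensor:
    # Length-normalized implicit rewards (no reference model needed).
    r_chosen = beta * chosen_logps / chosen_lens
    r_rejected = beta * rejected_logps / rejected_lens
    # Push the chosen reward above the rejected one by at least the margin gamma.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()
```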
Single-Stage: Direct Preference Optimization
Train a binary classifier to score response pairs directly under long contexts:
bash scripts/simpo_disrm.sh
✅ No SFT stage required; the reward model is trained end-to-end in a single stage.
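Discriminative reward models attach a scalar scoring head and are commonly trained with a pairwise ranking (Bradley-Terry-style) objective of the shape sketched below; this is a generic illustration, not necessarily the exact loss used in scripts/simpo_disrm.sh:

```python
# Illustrative pairwise ranking loss for a discriminative reward model.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """score_*: [batch] scalar rewards produced by the model for chosen/rejected responses."""
    # Maximize the probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(score_chosen - score_rejected).mean()
```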
We provide a new benchmark dataset — Long-RewardBench — and pretrained models on ModelScope for easy evaluation.
🔗 Dataset & Models Download
Generative reward model:
modelscope download LCM_group/LongReward_Qwen3-8B --repo-type model --local_dir ./LongReward_Qwen3-8B
python evaluate/eval.py --model-path ./LongReward_Qwen3-8B --data-path ./LongReward-Bench

Discriminative reward model:
modelscope download LCM_group/LongReward_Skywork-Reward-V2-Llama-3.1-8B --repo-type model --local_dir ./LongReward_Skywork-Reward-V2-Llama-3.1-8B
python evaluate/eval.py --model-path ./LongReward_Skywork-Reward-V2-Llama-3.1-8B --data-path ./LongReward-Bench --is-disrm
✅ Add the `--is-disrm` flag only for discriminative models.
📈 Results are reported as accuracy and AUC on long-context preference pairs (up to 131K tokens).
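For intuition, pairwise accuracy and AUC can be computed from per-response reward scores roughly as follows (an illustrative sketch, not the benchmark's official scorer; scikit-learn is assumed to be available):

```python
# Illustrative metric computation over preference pairs: accuracy and AUC from reward scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_metrics(chosen_scores, rejected_scores):
    chosen_scores = np.asarray(chosen_scores, dtype=float)
    rejected_scores = np.asarray(rejected_scores, dtype=float)
    # Accuracy: fraction of pairs where the chosen response is scored higher.
    accuracy = float(np.mean(chosen_scores > rejected_scores))
    # AUC: treat chosen responses as positives and rejected ones as negatives.
    labels = np.concatenate([np.ones_like(chosen_scores), np.zeros_like(rejected_scores)])
    scores = np.concatenate([chosen_scores, rejected_scores])
    auc = float(roc_auc_score(labels, scores))
    return {"accuracy": accuracy, "auc": auc}

# Example over three preference pairs.
print(pairwise_metrics([0.9, 0.4, 0.7], [0.2, 0.5, 0.1]))
```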
We welcome contributions! Whether it’s:
- Adding new datasets or evaluation metrics
- Improving training efficiency
- Porting to other architectures (e.g., Mistral, Gemma)
Please open an Issue or submit a Pull Request.
Questions? Suggestions? Reach out at: zctang2000@gmail.com
If you find this work useful, please cite our papers:
@article{tang2025longrm,
title={LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling},
author={Tang, Zecheng and Ji, Baibei and Qiu, Quantong and Wang, Haitian and Liang, Xiaobo and Li, Juntao and Zhang, Min},
journal={arXiv preprint arXiv:2510.06915},
year={2025}
}
@article{tang2025loom,
title={LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework},
author={Tang, Zecheng and Wang, Haitian and Qiu, Quantong and Ji, Baibei and Sun, Ruoxi and Zhou, Keyan and Li, Juntao and Zhang, Min},
journal={arXiv preprint arXiv:2507.04723},
year={2025}
}