
📜 LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

Paper: https://arxiv.org/pdf/2510.06915
Models: ModelScope LongRM Collections / HuggingFace LongRM Collections
Long-RewardBench V1: ModelScope / HuggingFace

Pushing the limits of reward modeling beyond 128K tokens — with memory-efficient training and a new benchmark for long-context reward models.


📅 Update Log

  1. 2025-10-09 — 🚀 Code released: Full training & evaluation pipeline now open-sourced.

  2. 2025-10-11 — LongRM models are released in the 🤗 Huggingface LongRM Collections, and Long-RewardBench is released as the 🤗 Huggingface Long-RewardBench Dataset.

  3. Coming Soon — We are continually improving Long-RewardBench.


💻 Environment & Installation

Set up your training environment with the following steps:

cd LongRM
pip install -r requirements.txt
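
If you prefer to isolate dependencies, you can create a fresh environment first; a minimal sketch, where the longrm name and Python 3.10 are assumptions rather than versions pinned by the repo:

conda create -n longrm python=3.10 -y
conda activate longrm
cd LongRM
pip install -r requirements.txt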

🔌 Install FlashAttention (Highly Recommended)

For optimal memory efficiency and speed during long-context training, install FlashAttention-2:

  1. Download the compatible .whl file for your CUDA version from:
    👉 https://github.com/Dao-AILab/flash-attention/releases

  2. Install locally:

    pip install <path_to_downloaded_flash_attn_whl_file>

  3. Install Ring-Flash-Attention for scalable long-sequence training:

    pip install ring_flash_attn

💡 Tip: Use nvidia-smi to verify your CUDA version. Ensure the FlashAttention wheel matches your driver and PyTorch setup.
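
For example, a typical check-then-install sequence looks like the following; the wheel filename is purely illustrative, so substitute the release asset matching your CUDA, PyTorch, and Python versions:

# Confirm the CUDA and PyTorch versions the wheel must match
nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Install the matching wheel (filename below is an example only)
pip install flash_attn-2.6.3+cu123torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install ring_flash_attn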


🔥 Training Pipeline

LongRM uses a two-stage training paradigm to unlock long-context reward modeling. Choose your modeling approach below.

1) Generative Reward Model (GRM)

Stage 1: Supervised Fine-Tuning (SFT)
Train the base model on instruction-following data:

bash scripts/sft.sh

Stage 2: Preference Optimization with SIMPO
Refine the model using long-context preference signals:

bash scripts/simpo_grm.sh

Requires completion of Stage 1.
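
For intuition about what this stage optimizes, here is a minimal sketch of the SimPO objective (a length-normalized log-probability margin between chosen and rejected responses); the beta and gamma values are illustrative, not the defaults used in scripts/simpo_grm.sh:

import torch
import torch.nn.functional as F

def simpo_loss(avg_logp_chosen, avg_logp_rejected, beta=2.0, gamma=1.0):
    # avg_logp_*: per-token mean log-probabilities of the chosen/rejected
    # responses under the policy model, shape (batch,)
    margin = beta * (avg_logp_chosen - avg_logp_rejected) - gamma
    # Push the chosen response to beat the rejected one by at least gamma
    return -F.logsigmoid(margin).mean()

# Toy usage: two preference pairs
loss = simpo_loss(torch.tensor([-1.2, -0.8]), torch.tensor([-1.9, -1.5]))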


2) Discriminative Reward Model (DisRM)

Single-Stage: Direct Preference Optimization
Train a binary classifier to score response pairs directly under long contexts:

bash scripts/simpo_disrm.sh

No SFT pretraining needed — end-to-end training from scratch.
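
As a point of reference, discriminative reward models are commonly trained with a Bradley–Terry-style pairwise loss over scalar scores; a minimal sketch, not necessarily the exact objective implemented in scripts/simpo_disrm.sh:

import torch
import torch.nn.functional as F

def pairwise_rm_loss(score_chosen, score_rejected):
    # score_*: scalar reward-head outputs for each response, shape (batch,)
    # Bradley-Terry: maximize the probability that chosen outscores rejected
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage
loss = pairwise_rm_loss(torch.tensor([0.8, 0.1]), torch.tensor([0.2, 0.4]))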


📊 Evaluation

We provide a new benchmark dataset, Long-RewardBench, along with pretrained models on ModelScope for easy evaluation.

🔗 Dataset & Models Download

⚠️ ModelScope links are provided 👉 Here. If you prefer the Hugging Face–hosted models and datasets, use the download links at the beginning of this README.


🤖 Evaluate Generative RM (e.g., Qwen3-8B)

modelscope download LCM_group/LongReward_Qwen3-8B --repo-type model --local_dir ./LongReward_Qwen3-8B

python evaluate/eval.py --model-path ./LongReward_Qwen3-8B --data-path ./LongReward-Bench

🔍 Evaluate Discriminative RM (e.g., Skywork-Reward-V2-Llama-3.1-8B)

modelscope download LCM_group/LongReward_Skywork-Reward-V2-Llama-3.1-8B --repo-type model --local_dir ./LongReward_Skywork-Reward-V2-Llama-3.1-8B

python evaluate/eval.py --model-path ./LongReward_Skywork-Reward-V2-Llama-3.1-8B --data-path ./LongReward-Bench --is-disrm

✅ Add the --is-disrm flag only for discriminative models.
📈 Results are reported as accuracy and AUC on long-context preference pairs (up to 131K tokens).
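
To make the metrics concrete, here is a sketch of how accuracy and AUC can be computed from per-pair scores; the arrays below are hypothetical stand-ins for whatever evaluate/eval.py outputs:

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical RM scores for the chosen and rejected response of each pair
chosen = np.array([0.9, 0.4, 0.7])
rejected = np.array([0.2, 0.5, 0.1])

# Accuracy: fraction of pairs where the chosen response scores higher
accuracy = float((chosen > rejected).mean())

# AUC: treat chosen responses as positives, rejected as negatives
labels = np.concatenate([np.ones_like(chosen), np.zeros_like(rejected)])
scores = np.concatenate([chosen, rejected])
auc = roc_auc_score(labels, scores)
print(f"accuracy={accuracy:.3f} auc={auc:.3f}")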


🤝 Contributing

We welcome contributions of all kinds, including:

  • Adding new datasets or evaluation metrics
  • Improving training efficiency
  • Porting to other architectures (e.g., Mistral, Gemma)

Please open an Issue or submit a Pull Request.


📬 Contact

Questions? Suggestions? Reach out at: zctang2000@gmail.com


📚 Cite Us

If you find this work useful, please cite our papers:

@article{tang2025longrm,
  title={LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling},
  author={Tang, Zecheng and Ji, Baibei and Qiu, Quantong and Wang, Haitian and Liang, Xiaobo and Li, Juntao and Zhang, Min},
  journal={arXiv preprint arXiv:2510.06915},
  year={2025}
}

@article{tang2025loom,
  title={LOOM-Scope: a comprehensive and efficient LOng-cOntext Model evaluation framework},
  author={Tang, Zecheng and Wang, Haitian and Qiu, Quantong and Ji, Baibei and Sun, Ruoxi and Zhou, Keyan and Li, Juntao and Zhang, Min},
  journal={arXiv preprint arXiv:2507.04723},
  year={2025}
}
