This is the official repository for the paper:
Revisiting Long-Context Modeling from a Context Denoising Perspective.
We recommend using transformers==4.46.1
to ensure successful model deployment.
Install the required dependencies with:
pip install -r requirements.txt
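As a quick sanity check of the environment, the sketch below loads the model used in our experiments and generates a few tokens. It is illustrative only, and assumes you have been granted access to the gated meta-llama/Meta-Llama-3.1-8B-Instruct checkpoint on Hugging Face and have at least one GPU available.

```python
# Environment sanity check (illustrative; not part of the repo scripts).
# Assumes transformers==4.46.1 and access to the gated Llama 3.1 checkpoint.
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

print(transformers.__version__)  # expect 4.46.1

model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # bf16 so the 8B model fits on a single modern GPU
    device_map="auto",           # shard layers across all visible GPUs
)

inputs = tokenizer("Long-context modeling is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```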
We use the pg19-test dataset in our experiments. To download it, run:
cd preliminary/data
git clone https://huggingface.co/datasets/emozilla/pg19-test
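To confirm the download worked, you can load the corpus with the datasets library, as in the sketch below. This is only a sketch: it assumes pg19-test exposes a test split whose records carry a "text" field, following the original PG-19 layout; adjust it if your local copy differs.

```python
# Quick inspection of pg19-test (illustrative; not part of the repo scripts).
# Assumes a "test" split whose records carry a "text" field, as in PG-19.
from datasets import load_dataset

ds = load_dataset("emozilla/pg19-test", split="test")
# Alternatively, point load_dataset at the local clone:
# ds = load_dataset("preliminary/data/pg19-test", split="test")

print(f"{len(ds)} books in the test split")
print(ds[0]["text"][:300])  # first 300 characters of the first book
```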
During evaluation, we dynamically generate test data from the source.
A subset of the full evaluation data is provided at:
preliminary/data/full20.jsonl
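If you want to peek at this subset before running the full pipeline, a minimal sketch is shown below. The exact schema of each record is not guaranteed here; the snippet prints the keys so you can confirm them yourself.

```python
# Inspect the provided evaluation subset (illustrative; field names unverified).
import json

with open("preliminary/data/full20.jsonl") as f:
    examples = [json.loads(line) for line in f]

print(f"{len(examples)} examples")
print(list(examples[0].keys()))  # confirm the actual fields of each record
```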
However, we strongly recommend generating the data on the fly for the most accurate results. To do so, run:
cd ../..
python preliminary/src/test_score.py --model=meta-llama/Meta-Llama-3.1-8B-Instruct --context_lengths=11900
⚠️ Note:
This step requires at least 8 GPUs, each with more than 85 GB of memory.
After generation, compute and visualize the IG (Information Gain) and FR (Faithfulness Ratio) scores:
python preliminary/src/stats_igscore.py --context_length=11900
python preliminary/src/stats_frscore.py --context_length=11900
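If you prefer to post-process the scores yourself, the hypothetical sketch below shows one way to plot per-position scores once they have been exported. The file name and format here are assumptions, not the actual output of the stats scripts; check the scripts for where and how they store their results.

```python
# Hypothetical visualization of exported scores (not the repo's own plotting code).
# Assumes a JSON file containing one score per context position; the path
# "ig_scores_11900.json" is a placeholder, not a file the scripts are known to write.
import json
import matplotlib.pyplot as plt

with open("ig_scores_11900.json") as f:
    scores = json.load(f)  # assumed: a flat list of floats

plt.plot(range(len(scores)), scores)
plt.xlabel("Context position")
plt.ylabel("IG score")
plt.title("Information Gain over an 11,900-token context")
plt.savefig("ig_scores_11900.png", dpi=150)
```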
To set up the training environment, first clone the LOOM-Train
framework:
git clone https://github.com/LCM-Lab/LOOM-Train.git
Then follow the setup instructions in the LOOM-Train repository.
Once ready, launch the Context Denoising Training (CDT) process:
cd train
bash train_cdt.sh
We welcome contributions! Whether it's bug fixes, new features, or documentation improvements, feel free to open an issue or a PR.
Questions? Suggestions? Reach out at: zctang2000@gmail.com
If you find this work useful, please cite our paper:
@article{tang2025revisiting,
  title={Revisiting Long-context Modeling from Context Denoising Perspective},
  author={Tang, Zecheng and Ji, Baibei and Li, Juntao and Wu, Lijun and Gui, Haijia and Zhang, Min},
  journal={arXiv preprint arXiv:2510.05862},
  year={2025}
}