"When a metric is used as a target, it ceases to be a good metric." — Goodhart's Law
IFDecorator addresses this fundamental challenge in Reinforcement Learning with Verifiable Rewards (RLVR): models often exploit verification shortcuts rather than truly understanding user intent, a failure mode known as over-optimization.
The IFDecorator Framework: an architecture combining three components - a cooperative-adversarial data flywheel that evolves instruction-verification pairs, IntentCheck for robust intent alignment, and trip wires for proactive detection of reward hacking. Together, these components turn RLVR training into a robust and sample-efficient pipeline.
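As a rough illustration of how the three components might fit together in a single reward computation (a schematic sketch only; every function name below is a hypothetical placeholder, not the repository's API):

```python
# Schematic sketch of an IFDecorator-style reward: passing verification alone
# is not enough, intent must also hold and no trap may fire.
# All names here are hypothetical placeholders.

def intent_check(instruction: str, response: str) -> bool:
    """IntentCheck: does the response satisfy the user's underlying intent?"""
    return True  # placeholder; the real check is model-based

def trip_wire_triggered(instruction: str, response: str) -> bool:
    """Trip wires: probe instructions that expose shortcut exploitation."""
    return False  # placeholder; real probes live in tripwires/

def reward(pair: dict, response: str) -> float:
    verified = pair["verifier"](response)  # rule-based verifiable check
    aligned = intent_check(pair["instruction"], response)
    hacked = trip_wire_triggered(pair["instruction"], response)
    return float(verified and aligned and not hacked)
```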
The datasets are available on Hugging Face: guox18/IFDecorator. Each data entry includes a "difficulty" label rather than a "complexity" label (see the loading snippet after this list):
- Difficulty: Pass rate under corresponding verification.
- Complexity: Number of constraints.
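For example, a minimal way to inspect the label (assuming the standard Hugging Face `datasets` API; the split name and any fields other than "difficulty" are assumptions, so check the dataset card for the actual schema):

```python
from datasets import load_dataset

# Split name "train" is an assumption; see the dataset card.
ds = load_dataset("guox18/IFDecorator", split="train")
print(ds[0]["difficulty"])  # pass rate under the entry's verification
```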
Breaking the Trade-off: IFDecorator achieves a strong balance between instruction-following performance and hack resistance. Our framework guides models toward the upper-right region, where strong instruction-following capability coexists with robust resistance to reward hacking - a combination that traditional RLVR approaches struggle to achieve.
Difficulty instead of Complexity: instructions at the same complexity level (number of constraints) can vary widely in actual difficulty. Our data flywheel quantifies difficulty through pass rates, ensuring efficient training, roughly as sketched below.
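A minimal sketch of that measurement (hypothetical helper names; the real implementation lives in the flywheel modules):

```python
# Estimate difficulty as the empirical pass rate over k sampled responses.
# `generate` and `verify` are hypothetical stand-ins for the model rollout
# and the instruction's verifier.

def estimate_difficulty(instruction, generate, verify, k: int = 16) -> float:
    passes = sum(verify(generate(instruction)) for _ in range(k))
    return passes / k  # low pass rate = hard, regardless of constraint count
```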
- `preprocess/`: Data collection and preprocessing
- `enhance/`: Data evolving
- `postprocess/`: Post-processing and filtering
- `reward/` and `reward_manager/`: Reward design
- `recipe/`: Training recipes for Qwen2.5-7B and Qwen2.5-32B models
- Instructions with traps (`probe.jsonl`): Trigger and capture reward hacking (sketched below)
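To make the trap idea concrete, here is a hedged sketch of how such probes could flag hacking (the actual `probe.jsonl` schema and checker live in `tripwires/`; every field name below is hypothetical):

```python
import json

def load_probes(path: str = "probe.jsonl") -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

def trap_fired(probe: dict, response: str) -> bool:
    # A trap fires when the response contains the planted shortcut that the
    # probe is designed to catch. "trap_marker" is a hypothetical field.
    return probe["trap_marker"] in response
```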
- Python 3.10
```bash
# Clone the repository
git clone <repository-url>
cd code

# Install dependencies for flywheel
pip install -r requirements.txt
```

The data preparation process consists of three sequential steps:
1. Preprocess:

```bash
cd modules/preprocess
./run_preprocess.sh <input_dir> <output_path> [seed]
```

2. Enhance:

```bash
cd modules/enhance
./run_pipeline.sh
```

3. Postprocess:

```bash
cd modules/postprocess
./run_postprocess.sh [pipeline_num] [input_file]
```

Training relies on VERL; install it as follows:

```bash
# Clone VERL repository
git clone https://github.com/volcengine/verl.git
cd verl

# Checkout specific commit for compatibility
git checkout 5c5b92819db93dd47ad3403f41ef9b871c47874c

# Install VERL
pip install .
```

Important: different VERL versions may format special tokens differently in their outputs. Use commit `5c5b92819db93dd47ad3403f41ef9b871c47874c` for guaranteed compatibility.
You have two options for the reward manager (see the sketch after this list):
- Option A: Replace the reward manager with our custom implementation
- Option B: Use the official batch reward manager (recommended for newer VERL versions)
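Whichever option you choose, the reward itself is verifier-based. As a hedged sketch only: recent VERL versions accept a custom reward via a `compute_score` function with the signature below (per current VERL documentation; the pinned commit and IFDecorator's shipped `reward/` code may differ, and `check_constraints` is a hypothetical helper):

```python
# Sketch of a verifier-style reward in the shape of VERL's custom reward
# function interface. Not the repository's actual implementation.

def check_constraints(solution_str: str, ground_truth: str) -> bool:
    # Placeholder: real verification parses constraint specs and checks each.
    return ground_truth in solution_str

def compute_score(data_source, solution_str, ground_truth, extra_info=None):
    # Binary verifiable reward: 1.0 iff all constraints are satisfied.
    return 1.0 if check_constraints(solution_str, ground_truth) else 0.0
```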
Navigate to the recipe directory and run the appropriate training script:
```bash
cd recipe

# For Qwen2.5-7B model
bash run_qwen2_5-7b.sh

# For Qwen2.5-32B model
bash run_qwen2_5-32b.sh
```

You can monitor and detect potential reward hacking using our tripwires system:

```bash
cd tripwires
bash run_hacking_prob.sh
```
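Conceptually, the script reports how often a checkpoint falls into the planted traps; a hedged sketch of that aggregate (all names hypothetical, following the probe sketch above):

```python
def trap_fired(probe: dict, response: str) -> bool:
    return probe["trap_marker"] in response  # hypothetical field name

def hacking_probability(probes: list[dict], responses: list[str]) -> float:
    fired = sum(trap_fired(p, r) for p, r in zip(probes, responses))
    return fired / len(probes)  # higher = more reward hacking detected
```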
This project is licensed under the Creative Commons Attribution 4.0 International License; see the LICENSE file for details.
If you use this work in your research, please cite:
```bibtex
@misc{guo2025ifdecoratorwrappinginstructionfollowing,
      title={IFDECORATOR: Wrapping Instruction Following Reinforcement Learning with Verifiable Rewards},
      author={Xu Guo and Tianyi Liang and Tong Jian and Xiaogui Yang and Ling-I Wu and Chenhui Li and Zhihui Lu and Qipeng Guo and Kai Chen},
      year={2025},
      eprint={2508.04632},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.04632},
}
```