SEELE: A capability-adaptive framework for modulating the proportion of off-policy learning in large reasoning models.
SEELE is a capability-adaptive RLVR (reinforcement learning with verifiable rewards) approach that incorporates dynamic off-policy reasoning-trace prefixes into exploration. Guided by a theoretical analysis of RL optimization efficiency, SEELE adjusts the proportion of off-policy prefixes to track the model's evolving capability, maintaining high learning efficiency throughout training.
- Optimal question difficulty: Theoretically models the relationship between question difficulty and RL optimization efficiency.
- Dynamic difficulty adaptation: Modulates the hint length to keep question difficulty within the optimal region, per instance and in real time.
- Multi-round sampling: Decomposes the rollout step into several rounds to collect hint-accuracy data for fitting a hint-length prediction model.
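
The multi-round idea above can be illustrated with a small sketch. This is not the repository's implementation: the README does not state the form of the hint-length prediction model, so the logistic relationship between hint ratio and accuracy, the target accuracy of 0.5, and the names `logit`, `fit_hint_accuracy`, and `next_hint_ratio` are all assumptions made for illustration.

```python
import math


def logit(p, eps=1e-3):
    """Log-odds of p, clipped away from 0 and 1 so the result stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_hint_accuracy(hint_ratios, accuracies):
    """Least-squares fit of logit(accuracy) = a * hint_ratio + b.

    Assumption (not from the README): accuracy follows a logistic curve
    in the fraction of the reference trace revealed as a hint.
    """
    ys = [logit(p) for p in accuracies]
    n = len(hint_ratios)
    mean_x = sum(hint_ratios) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in hint_ratios)
    if sxx == 0.0:
        # Only one distinct hint ratio observed: the slope cannot be estimated.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(hint_ratios, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def next_hint_ratio(hint_ratios, accuracies, target=0.5):
    """Predict the hint ratio whose expected accuracy equals `target`.

    `target` is a hypothetical stand-in for the optimal difficulty region.
    """
    a, b = fit_hint_accuracy(hint_ratios, accuracies)
    if a <= 0.0:
        # No evidence that longer hints help; keep the most recent ratio.
        return hint_ratios[-1]
    ratio = (logit(target) - b) / a
    return min(max(ratio, 0.0), 1.0)


# Two earlier sampling rounds for one question: hint ratio and observed accuracy.
observed_ratios = [0.2, 0.6]
observed_acc = [0.25, 0.75]
print(next_hint_ratio(observed_ratios, observed_acc))  # close to 0.4, by symmetry
```

Each additional round contributes one more (hint ratio, accuracy) pair, so the fit for a given question can be refreshed before the next round is sampled.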
You can install the SEELE dependencies by running the following commands:

    conda create -n SEELE python=3.11.9
    conda activate SEELE
    cd verl
    pip install -e .
    cd ..
    pip install -r requirements.txt
    cd simpleRL-reason
    pip install -r requirements.txt

If you encounter issues when installing flash-attn, we recommend installing a prebuilt wheel from the flash-attn releases page. For example, we use this version:

    wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
    pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl

This repository includes:
- `seele`: Code for training SEELE.
- `data`: Data, including the step-level trace annotations.
- `scripts`: Example scripts to train and evaluate SEELE.
- `verl`: veRL RL training codebase.
- `simpleRL-reason`: simpleRL evaluation codebase.
We provide an example script to train SEELE on our subset of DeepMath-103k. Run the following command to train SEELE:

    bash scripts/train_qwen2.5-7b.sh

We currently support automated evaluation on six widely used mathematical reasoning benchmarks (GSM8K, AIME24, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-C, GPQA-D, and MMLU-Pro). Our evaluation code is adapted from SimpleRL. We replace Math-Verify with Mathruler to be consistent with training.
You can evaluate by running the following command:

    bash scripts/eval.sh

This script first merges the FSDP checkpoints of the LAST_STEP and then launches the simpleRL evaluation.
Compared with zero-RL and supervision-aided baselines, SEELE achieves the highest average score on both the math and the general-domain benchmarks for all five base models:

| Model | GSM8K | MATH500 | Minerva | Olympiad | AIME24 | AMC23 | Avg.(Math) | ARC-C | GPQA-D | MMLU-Pro | Avg.(Gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 71.6 | 63.2 | 25.7 | 32.0 | 14.9 | 45.2 | 42.1 | 69.5 | 24.7 | 17.7 | 37.3 |
| + SFT | 89.5 | 74.6 | 35.7 | 37.9 | 9.4 | 52.2 | 49.9 | 77.1 | 37.4 | 47.9 | 54.1 |
| + GRPO | 92.0 | 80.6 | 36.0 | 41.2 | 23.9 | 60.2 | 55.7 | 77.1 | 37.4 | 45.2 | 53.2 |
| + LUFFY | 91.7 | 80.0 | 35.3 | 42.4 | 18.5 | 66.2 | 55.7 | 80.5 | 39.9 | 49.9 | 56.8 |
| + UFT | 92.1 | 82.4 | 34.6 | 40.3 | 17.6 | 66.6 | 55.6 | 81.0 | 40.9 | 49.9 | 57.3 |
| + Prefix-RFT | 92.1 | 81.6 | 36.8 | 43.0 | 20.9 | 63.5 | 56.3 | 80.3 | 37.4 | 49.8 | 55.8 |
| + SEELE (Ours) | 92.4 | 82.6 | 37.1 | 46.5 | 25.8 | 69.7 | 59.0 | 80.7 | 42.9 | 52.0 | 58.5 |

| Model | GSM8K | MATH500 | Minerva | Olympiad | AIME24 | AMC23 | Avg.(Math) | ARC-C | GPQA-D | MMLU-Pro | Avg.(Gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B | 73.3 | 29.0 | 7.0 | 10.7 | 0.6 | 11.6 | 22.0 | 66.6 | 23.7 | 15.2 | 35.2 |
| + SFT | 73.1 | 52.6 | 18.4 | 18.5 | 2.3 | 25.0 | 31.7 | 75.7 | 30.3 | 40.1 | 48.7 |
| + GRPO | 78.5 | 45.8 | 17.3 | 15.0 | 1.7 | 23.9 | 30.4 | 75.6 | 36.9 | 40.5 | 51.0 |
| + LUFFY | 81.1 | 56.6 | 21.0 | 21.3 | 2.7 | 27.0 | 35.0 | 78.4 | 32.8 | 41.5 | 50.9 |
| + UFT | 84.8 | 62.6 | 22.1 | 25.9 | 5.8 | 41.6 | 40.5 | 79.4 | 35.4 | 42.9 | 52.6 |
| + Prefix-RFT | 77.4 | 57.0 | 21.3 | 21.9 | 6.0 | 31.5 | 35.9 | 82.2 | 35.9 | 37.9 | 52.0 |
| + SEELE (Ours) | 86.3 | 66.4 | 26.1 | 28.9 | 5.9 | 39.4 | 42.2 | 81.2 | 34.3 | 44.0 | 53.2 |

| Model | GSM8K | MATH500 | Minerva | Olympiad | AIME24 | AMC23 | Avg.(Math) | ARC-C | GPQA-D | MMLU-Pro | Avg.(Gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | 61.9 | 22.8 | 9.6 | 6.7 | 0.7 | 9.1 | 18.5 | 45.1 | 15.7 | 12.2 | 24.3 |
| + SFT | 67.4 | 43.6 | 13.6 | 12.6 | 1.4 | 16.4 | 25.8 | 63.7 | 25.8 | 30.2 | 39.9 |
| + GRPO | 70.1 | 36.4 | 10.7 | 11.1 | 1.8 | 15.2 | 24.2 | 64.8 | 25.3 | 23.1 | 37.7 |
| + LUFFY | 67.2 | 45.4 | 11.0 | 12.9 | 1.6 | 16.5 | 25.8 | 64.3 | 26.8 | 24.7 | 38.6 |
| + UFT | 72.6 | 50.4 | 12.9 | 15.9 | 3.9 | 26.6 | 30.4 | 66.1 | 23.7 | 28.5 | 39.4 |
| + Prefix-RFT | 71.5 | 48.0 | 13.6 | 14.5 | 2.1 | 22.7 | 28.7 | 65.3 | 23.7 | 25.0 | 38.0 |
| + SEELE (Ours) | 76.5 | 58.0 | 16.2 | 19.9 | 4.1 | 30.4 | 34.2 | 68.3 | 27.8 | 31.7 | 42.6 |

| Model | GSM8K | MATH500 | Minerva | Olympiad | AIME24 | AMC23 | Avg.(Math) | ARC-C | GPQA-D | MMLU-Pro | Avg.(Gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA 3.2-3B | 8.1 | 4.2 | 2.6 | 1.9 | 0.0 | 2.7 | 3.3 | 13.7 | 15.2 | 7.3 | 12.1 |
| + SFT | 20.7 | 8.8 | 4.0 | 3.0 | 0.2 | 4.0 | 6.8 | 59.0 | 27.3 | 26.3 | 37.5 |
| + GRPO | 5.9 | 8.2 | 5.5 | 2.5 | 0.0 | 3.2 | 4.2 | 43.5 | 28.8 | 23.9 | 32.1 |
| + LUFFY | 11.9 | 9.6 | 4.8 | 2.4 | 0.2 | 3.5 | 5.4 | 58.8 | 30.8 | 26.0 | 38.5 |
| + UFT | 24.2 | 11.4 | 8.1 | 3.6 | 0.2 | 7.5 | 9.2 | 56.2 | 27.3 | 25.2 | 36.2 |
| + Prefix-RFT | 21.1 | 9.2 | 5.8 | 3.6 | 0.1 | 2.8 | 7.1 | 55.5 | 28.4 | 24.9 | 36.3 |
| + SEELE (Ours) | 29.5 | 12.8 | 7.4 | 5.2 | 0.7 | 6.5 | 10.4 | 62.4 | 28.8 | 26.1 | 39.1 |

| Model | GSM8K | MATH500 | Minerva | Olympiad | AIME24 | AMC23 | Avg.(Math) | ARC-C | GPQA-D | MMLU-Pro | Avg.(Gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mathstral-7B | 76.0 | 34.6 | 15.8 | 14.5 | 1.4 | 15.7 | 26.3 | 62.9 | 29.3 | 16.7 | 36.3 |
| + SFT | 80.1 | 53.4 | 24.6 | 21.6 | 1.8 | 26.6 | 34.7 | 73.1 | 44.4 | 42.2 | 53.2 |
| + GRPO | 85.6 | 44.0 | 22.4 | 16.0 | 0.9 | 21.7 | 31.8 | 80.0 | 38.4 | 43.5 | 54.0 |
| + LUFFY | 88.4 | 60.0 | 25.0 | 24.4 | 6.4 | 32.4 | 39.4 | 80.2 | 32.8 | 44.1 | 52.4 |
| + UFT | 87.9 | 57.6 | 23.2 | 20.6 | 4.3 | 33.1 | 37.8 | 81.8 | 37.9 | 47.7 | 55.8 |
| + Prefix-RFT | 87.9 | 57.6 | 23.2 | 20.6 | 4.3 | 33.1 | 37.8 | 81.8 | 37.9 | 47.7 | 55.8 |
| + SEELE (Ours) | 90.2 | 63.4 | 27.9 | 29.5 | 6.7 | 38.4 | 42.7 | 82.9 | 38.9 | 50.5 | 57.4 |
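
The Avg.(Math) and Avg.(Gen) columns appear to be unweighted means of the six math benchmarks and the three general-domain benchmarks, rounded to one decimal place. That is an inference from the reported numbers rather than something the README states. The sketch below checks it on the SEELE row of the Qwen2.5-Math-7B table; `mean_score` is a hypothetical helper, not part of the repository.

```python
def mean_score(scores):
    """Unweighted mean of benchmark scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


# SEELE row of the Qwen2.5-Math-7B table.
math_scores = [92.4, 82.6, 37.1, 46.5, 25.8, 69.7]  # GSM8K ... AMC23
general_scores = [80.7, 42.9, 52.0]                 # ARC-C, GPQA-D, MMLU-Pro

print(mean_score(math_scores), mean_score(general_scores))  # prints "59.0 58.5"
```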
SEELE builds upon veRL.
