
Official Repository of "Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding"


ChillingDream/seele


Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding

SEELE: A capability-adaptive framework for modulating the proportion of off-policy learning in large reasoning models.

📚 Overview


📖Introduction

SEELE is a capability-adaptive RLVR (reinforcement learning with verifiable rewards) approach that incorporates dynamic off-policy reasoning-trace prefixes into exploration. Guided by a theoretical analysis of RL optimization efficiency, SEELE adjusts the proportion of off-policy prefixes to match the model's evolving capability, maintaining high learning efficiency throughout training.

Key Highlights:

  • Optimal question difficulty: Theoretically models the relationship between question difficulty and RL optimization efficiency.
  • Dynamic difficulty adaptation: Modulates the hint length to keep question difficulty within the optimal region, per instance and in real time.
  • Multi-round sampling: Decomposes the rollout step into several rounds to collect hint-accuracy data for fitting a hint-length prediction model.
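
To make the last two highlights concrete, here is a minimal sketch of how hint-accuracy pairs gathered over several rollout rounds could drive the next hint length. It is an illustration, not the code in this repository. The logit-linear accuracy model, the 0.5 target accuracy, the halving fallback, and the names `fit_hint_accuracy`, `next_hint_ratio`, and `multi_round_hints` are all assumptions made for the example.

```python
import math


def _logit(p, eps=1e-3):
    """Log-odds of p, clipped away from 0 and 1 so the value stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fit_hint_accuracy(observations):
    """Least-squares fit of logit(accuracy) = slope * hint_ratio + intercept.

    observations: list of (hint_ratio, accuracy) pairs for ONE question.
    """
    xs = [h for h, _ in observations]
    ys = [_logit(a) for _, a in observations]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        # Only one distinct hint ratio seen so far: no slope can be estimated.
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def next_hint_ratio(observations, target_accuracy=0.5):
    """Pick the hint ratio whose predicted accuracy equals the target.

    target_accuracy=0.5 is an assumed value for the "optimal region".
    """
    slope, intercept = fit_hint_accuracy(observations)
    if slope <= 0.0:
        # Assumed fallback: step halfway toward a longer hint if the question
        # was too hard, or toward a shorter hint if it was too easy.
        last_ratio, last_acc = observations[-1]
        bound = 1.0 if last_acc < target_accuracy else 0.0
        return (last_ratio + bound) / 2.0
    ratio = (_logit(target_accuracy) - intercept) / slope
    return min(max(ratio, 0.0), 1.0)


def multi_round_hints(measure_accuracy, rounds=3, first_ratio=0.5):
    """Run several sampling rounds, refitting the model after each one.

    measure_accuracy: hypothetical callable mapping a hint ratio to the
    observed rollout accuracy for that round.
    """
    observations = []
    ratio = first_ratio
    for _ in range(rounds):
        observations.append((ratio, measure_accuracy(ratio)))
        ratio = next_hint_ratio(observations)
    return observations
```

In this sketch a hint ratio of 0.0 means no off-policy prefix and 1.0 means the full reference trace is given. Each round adds one data point, so later rounds predict the hint length from a better-informed fit.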

✨Getting Started

Installation

You can install SEELE dependencies by running the following commands:

conda create -n SEELE python=3.11.9
conda activate SEELE
cd verl
pip install -e .
cd ..
pip install -r requirements.txt
cd simpleRL-reason
pip install -r requirements.txt

If you run into problems installing flash-attn, we recommend installing a prebuilt wheel from the flash-attn releases page. For example, we use the following version:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
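
The wheel filename encodes the Python, CUDA, and PyTorch versions it was built for, so a quick check before installing can avoid a mismatched install. The sketch below is not part of this repository. `parse_flash_attn_wheel` and `matches_running_python` are hypothetical helpers, and the parsing assumes the five-field naming pattern of the file above (no optional build tag).

```python
import re
import sys


def parse_flash_attn_wheel(filename):
    """Split a flash-attn wheel filename into its build tags.

    Assumes the pattern name-version-python-abi-platform.whl, where the
    version carries a local suffix such as +cu12torch2.6cxx11abiFALSE.
    """
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: " + filename)
    stem = filename[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    match = re.search(r"\+cu(\d+)torch([\d.]+)cxx11abi(TRUE|FALSE)$", version)
    if match is None:
        raise ValueError("unexpected version suffix: " + version)
    return {
        "name": name,
        "version": version.split("+")[0],
        "cuda": match.group(1),
        "torch": match.group(2),
        "cxx11abi": match.group(3) == "TRUE",
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def matches_running_python(tags):
    """True if the wheel's CPython tag matches the current interpreter."""
    expected = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    return tags["python_tag"] == expected


if __name__ == "__main__":
    wheel = ("flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE"
             "-cp311-cp311-linux_x86_64.whl")
    tags = parse_flash_attn_wheel(wheel)
    print(tags)
    print("matches this Python:", matches_running_python(tags))
```

Inside the conda environment created above (Python 3.11.9), the Python tag `cp311` should match. The CUDA and PyTorch fields still need to be compared by hand against the installed PyTorch build.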

Repo Structure

This repository includes:

  • seele: Code for training SEELE.
  • data: Data, including the step-level trace annotations.
  • scripts: Example scripts for training and evaluating SEELE.
  • verl: veRL RL training codebase.
  • simpleRL-reason: simpleRL evaluation codebase.

🔧Usage

Training

We provide an example script to train SEELE on our subset of DeepMath-103k. You can run the following command to train SEELE:

bash scripts/train_qwen2.5-7b.sh

Evaluation

We currently support automated evaluation on six widely used mathematical reasoning benchmarks (GSM8K, AIME24, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-C, GPQA-D, and MMLU-Pro). Our evaluation code is adapted from SimpleRL. We replace Math-Verify with Mathruler to be consistent with training.

You can evaluate by running the following command:

bash scripts/eval.sh

This script first merges the FSDP checkpoints of LAST_STEP and then launches the simpleRL evaluation.

SEELE on Qwen2.5 models (zero-RL)

SEELE achieves the highest Avg.(Math) and Avg.(Gen) scores among the compared zero-RL and supervision-aided methods on each base model reported here.

| Model | GSM8K | MATH500 | Minerva | Olympiad | AIME24 | AMC23 | Avg.(Math) | ARC-C | GPQA-D | MMLU-Pro | Avg.(Gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-Math-7B | 71.6 | 63.2 | 25.7 | 32.0 | 14.9 | 45.2 | 42.1 | 69.5 | 24.7 | 17.7 | 37.3 |
| + SFT | 89.5 | 74.6 | 35.7 | 37.9 | 9.4 | 52.2 | 49.9 | 77.1 | 37.4 | 47.9 | 54.1 |
| + GRPO | 92.0 | 80.6 | 36.0 | 41.2 | 23.9 | 60.2 | 55.7 | 77.1 | 37.4 | 45.2 | 53.2 |
| + LUFFY | 91.7 | 80.0 | 35.3 | 42.4 | 18.5 | 66.2 | 55.7 | 80.5 | 39.9 | 49.9 | 56.8 |
| + UFT | 92.1 | 82.4 | 34.6 | 40.3 | 17.6 | 66.6 | 55.6 | 81.0 | 40.9 | 49.9 | 57.3 |
| + Prefix-RFT | 92.1 | 81.6 | 36.8 | 43.0 | 20.9 | 63.5 | 56.3 | 80.3 | 37.4 | 49.8 | 55.8 |
| + SEELE (Ours) | 92.4 | 82.6 | 37.1 | 46.5 | 25.8 | 69.7 | 59.0 | 80.7 | 42.9 | 52.0 | 58.5 |

| Model | GSM8K | MATH500 | Minerva | Olympiad | AIME24 | AMC23 | Avg.(Math) | ARC-C | GPQA-D | MMLU-Pro | Avg.(Gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B | 73.3 | 29.0 | 7.0 | 10.7 | 0.6 | 11.6 | 22.0 | 66.6 | 23.7 | 15.2 | 35.2 |
| + SFT | 73.1 | 52.6 | 18.4 | 18.5 | 2.3 | 25.0 | 31.7 | 75.7 | 30.3 | 40.1 | 48.7 |
| + GRPO | 78.5 | 45.8 | 17.3 | 15.0 | 1.7 | 23.9 | 30.4 | 75.6 | 36.9 | 40.5 | 51.0 |
| + LUFFY | 81.1 | 56.6 | 21.0 | 21.3 | 2.7 | 27.0 | 35.0 | 78.4 | 32.8 | 41.5 | 50.9 |
| + UFT | 84.8 | 62.6 | 22.1 | 25.9 | 5.8 | 41.6 | 40.5 | 79.4 | 35.4 | 42.9 | 52.6 |
| + Prefix-RFT | 77.4 | 57.0 | 21.3 | 21.9 | 6.0 | 31.5 | 35.9 | 82.2 | 35.9 | 37.9 | 52.0 |
| + SEELE (Ours) | 86.3 | 66.4 | 26.1 | 28.9 | 5.9 | 39.4 | 42.2 | 81.2 | 34.3 | 44.0 | 53.2 |

| Model | GSM8K | MATH500 | Minerva | Olympiad | AIME24 | AMC23 | Avg.(Math) | ARC-C | GPQA-D | MMLU-Pro | Avg.(Gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-1.5B | 61.9 | 22.8 | 9.6 | 6.7 | 0.7 | 9.1 | 18.5 | 45.1 | 15.7 | 12.2 | 24.3 |
| + SFT | 67.4 | 43.6 | 13.6 | 12.6 | 1.4 | 16.4 | 25.8 | 63.7 | 25.8 | 30.2 | 39.9 |
| + GRPO | 70.1 | 36.4 | 10.7 | 11.1 | 1.8 | 15.2 | 24.2 | 64.8 | 25.3 | 23.1 | 37.7 |
| + LUFFY | 67.2 | 45.4 | 11.0 | 12.9 | 1.6 | 16.5 | 25.8 | 64.3 | 26.8 | 24.7 | 38.6 |
| + UFT | 72.6 | 50.4 | 12.9 | 15.9 | 3.9 | 26.6 | 30.4 | 66.1 | 23.7 | 28.5 | 39.4 |
| + Prefix-RFT | 71.5 | 48.0 | 13.6 | 14.5 | 2.1 | 22.7 | 28.7 | 65.3 | 23.7 | 25.0 | 38.0 |
| + SEELE (Ours) | 76.5 | 58.0 | 16.2 | 19.9 | 4.1 | 30.4 | 34.2 | 68.3 | 27.8 | 31.7 | 42.6 |
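
The Avg.(Math) and Avg.(Gen) columns appear to be unweighted means of the six math benchmarks and the three out-of-distribution tasks, rounded to one decimal place. The README does not state this, so treat it as an assumption. The short check below reproduces the averages for two Qwen2.5-Math-7B rows under that assumption; `mean_score` is a helper written for this example.

```python
def mean_score(scores):
    """Unweighted mean of benchmark scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


# Per-benchmark scores copied from the Qwen2.5-Math-7B table above.
# Math order: GSM8K, MATH500, Minerva, Olympiad, AIME24, AMC23.
# General order: ARC-C, GPQA-D, MMLU-Pro.
base_math = [71.6, 63.2, 25.7, 32.0, 14.9, 45.2]
base_gen = [69.5, 24.7, 17.7]
seele_math = [92.4, 82.6, 37.1, 46.5, 25.8, 69.7]
seele_gen = [80.7, 42.9, 52.0]

print("base  Avg.(Math):", mean_score(base_math))    # table reports 42.1
print("base  Avg.(Gen): ", mean_score(base_gen))     # table reports 37.3
print("SEELE Avg.(Math):", mean_score(seele_math))   # table reports 59.0
print("SEELE Avg.(Gen): ", mean_score(seele_gen))    # table reports 58.5
```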

SEELE on LLaMA 3.2-3B

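| Model | GSM8K | MATH500 | Minerva | Olympiad | AIME24 | AMC23 | Avg.(Math) | ARC-C | GPQA-D | MMLU-Pro | Avg.(Gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA 3.2-3B | 8.1 | 4.2 | 2.6 | 1.9 | 0.0 | 2.7 | 3.3 | 13.7 | 15.2 | 7.3 | 12.1 |
| + SFT | 20.7 | 8.8 | 4.0 | 3.0 | 0.2 | 4.0 | 6.8 | 59.0 | 27.3 | 26.3 | 37.5 |
| + GRPO | 5.9 | 8.2 | 5.5 | 2.5 | 0.0 | 3.2 | 4.2 | 43.5 | 28.8 | 23.9 | 32.1 |
| + LUFFY | 11.9 | 9.6 | 4.8 | 2.4 | 0.2 | 3.5 | 5.4 | 58.8 | 30.8 | 26.0 | 38.5 |
| + UFT | 24.2 | 11.4 | 8.1 | 3.6 | 0.2 | 7.5 | 9.2 | 56.2 | 27.3 | 25.2 | 36.2 |
| + Prefix-RFT | 21.1 | 9.2 | 5.8 | 3.6 | 0.1 | 2.8 | 7.1 | 55.5 | 28.4 | 24.9 | 36.3 |
| + SEELE (Ours) | 29.5 | 12.8 | 7.4 | 5.2 | 0.7 | 6.5 | 10.4 | 62.4 | 28.8 | 26.1 | 39.1 |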

SEELE on Mathstral 7B

| Model | GSM8K | MATH500 | Minerva | Olympiad | AIME24 | AMC23 | Avg.(Math) | ARC-C | GPQA-D | MMLU-Pro | Avg.(Gen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mathstral-7B | 76.0 | 34.6 | 15.8 | 14.5 | 1.4 | 15.7 | 26.3 | 62.9 | 29.3 | 16.7 | 36.3 |
| + SFT | 80.1 | 53.4 | 24.6 | 21.6 | 1.8 | 26.6 | 34.7 | 73.1 | 44.4 | 42.2 | 53.2 |
| + GRPO | 85.6 | 44.0 | 22.4 | 16.0 | 0.9 | 21.7 | 31.8 | 80.0 | 38.4 | 43.5 | 54.0 |
| + LUFFY | 88.4 | 60.0 | 25.0 | 24.4 | 6.4 | 32.4 | 39.4 | 80.2 | 32.8 | 44.1 | 52.4 |
| + UFT | 87.9 | 57.6 | 23.2 | 20.6 | 4.3 | 33.1 | 37.8 | 81.8 | 37.9 | 47.7 | 55.8 |
| + Prefix-RFT | 87.9 | 57.6 | 23.2 | 20.6 | 4.3 | 33.1 | 37.8 | 81.8 | 37.9 | 47.7 | 55.8 |
| + SEELE (Ours) | 90.2 | 63.4 | 27.9 | 29.5 | 6.7 | 38.4 | 42.7 | 82.9 | 38.9 | 50.5 | 57.4 |

🌻Acknowledgement

SEELE builds upon veRL.
