Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding

SEELE: A capability-adaptive framework for modulating the proportion of off-policy learning in large reasoning models.

📚 Overview

📖 Introduction
✨ Getting Started
🔧 Usage
🌻 Acknowledgement

📖 Introduction

SEELE is a capability-adaptive RLVR (reinforcement learning with verifiable rewards) approach that incorporates dynamic off-policy reasoning-trace prefixes into exploration. Guided by a theoretical analysis of when RL optimization is most efficient, SEELE adjusts the proportion of the off-policy prefix to track the model's evolving capability, keeping learning efficiency high throughout training.

Key Highlights:

Optimal question difficulty: Theoretically models the relationship between question difficulty and RL optimization efficiency.
Dynamic difficulty adaptation: Modulates the hint length to keep question difficulty within the optimal region, per instance and in real time.
Multi-round sampling: Decomposes the rollout step into several rounds to collect hint-accuracy data for building a hint-length prediction model.
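
The highlights above can be illustrated with a minimal sketch. It is not the repository's implementation: the names `hint_prefix` and `next_hint_ratio`, the linear interpolation standing in for the hint-length prediction model, and the target accuracy of 0.5 (the point where the variance p(1-p) of a binary reward peaks) are all assumptions made for illustration.

```python
from typing import List, Tuple


def hint_prefix(steps: List[str], ratio: float) -> List[str]:
    """Return the leading fraction `ratio` of annotated reasoning steps (hypothetical helper)."""
    ratio = min(max(ratio, 0.0), 1.0)
    return steps[: int(ratio * len(steps))]


def next_hint_ratio(
    observations: List[Tuple[float, float]], target_acc: float = 0.5
) -> float:
    """Pick the hint ratio for the next sampling round.

    `observations` holds (hint_ratio, rollout_accuracy) pairs from earlier
    rounds on the same question. Accuracy is assumed to be non-decreasing in
    the hint ratio; a running maximum enforces that on noisy data.
    """
    if not observations:
        return 0.0  # first round: try the question without any hint

    pts = sorted(observations)
    ratios = [r for r, _ in pts]
    accs: List[float] = []
    running = 0.0
    for _, acc in pts:
        running = max(running, acc)
        accs.append(running)

    if accs[0] >= target_acc:
        # Already easy enough at the shortest hint tried: shorten further.
        return ratios[0] / 2.0
    if accs[-1] < target_acc:
        # Still too hard at the longest hint tried: move toward the full trace.
        return (ratios[-1] + 1.0) / 2.0

    # Interpolate linearly inside the segment that crosses the target accuracy.
    for i in range(len(ratios) - 1):
        r0, a0, r1, a1 = ratios[i], accs[i], ratios[i + 1], accs[i + 1]
        if a0 < target_acc <= a1:
            return r0 + (target_acc - a0) * (r1 - r0) / (a1 - a0)
    return ratios[-1]


# Two earlier rounds: no hint gave 0% accuracy, half the trace gave 100%.
history = [(0.0, 0.0), (0.5, 1.0)]
ratio = next_hint_ratio(history)
hint = hint_prefix(["s1", "s2", "s3", "s4"], ratio)
```

In this toy run the next round would use a quarter of the reference trace as the hint. Each additional round adds one more (ratio, accuracy) pair, so the estimate tightens around the ratio that yields the target accuracy for that particular question.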

✨ Getting Started

Installation

You can install SEELE dependencies by running the following commands:

conda create -n SEELE python=3.11.9
conda activate SEELE
cd verl
pip install -e .
cd ..
pip install -r requirements.txt
cd simpleRL-reason
pip install -r requirements.txt

If you encounter issues when installing flash-attn, we recommend installing a prebuilt wheel from the flash-attn releases page. For example, we use the following version:

wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
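
A prebuilt wheel only works when its build tags match the local environment. The wheel filename above encodes the CUDA major version, the PyTorch version, the C++11 ABI flag, and the CPython version. The following sketch parses those tags so they can be compared with the active interpreter; the helper names `wheel_requirements` and `python_matches` are hypothetical, and the regular expression assumes the naming pattern seen in this one filename.

```python
import re
import sys

WHEEL = "flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp311-cp311-linux_x86_64.whl"


def wheel_requirements(filename: str) -> dict:
    """Extract the build tags encoded in a flash-attn wheel filename."""
    match = re.search(
        r"\+cu(\d+)torch(\d+\.\d+)cxx11abi(TRUE|FALSE)-cp(\d)(\d+)-", filename
    )
    if match is None:
        raise ValueError(f"unrecognized wheel name: {filename}")
    cuda, torch, abi, py_major, py_minor = match.groups()
    return {
        "cuda": cuda,                 # CUDA major version, e.g. "12"
        "torch": torch,               # PyTorch major.minor, e.g. "2.6"
        "cxx11abi": abi == "TRUE",    # whether the C++11 ABI build is required
        "python": (int(py_major), int(py_minor)),
    }


def python_matches(requirements: dict) -> bool:
    """Check whether the running interpreter matches the wheel's CPython tag."""
    return tuple(sys.version_info[:2]) == requirements["python"]


requirements = wheel_requirements(WHEEL)
print(requirements, "python ok:", python_matches(requirements))
```

If the Python tag does not match, pick a different wheel from the same releases page rather than forcing the install.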

Repo Structure

This repository includes:

seele: Code for training SEELE.
data: Data, including the step-level trace annotations.
scripts: Example scripts to train and evaluate SEELE.
verl: The veRL RL training codebase.
simpleRL-reason: The simpleRL evaluation codebase.

🔧 Usage

Training

We provide an example script to train SEELE on our subset of DeepMath-103k. You can run the following command to train SEELE:

bash scripts/train_qwen2.5-7b.sh

Evaluation

We currently support automated evaluation on six widely used mathematical reasoning benchmarks (GSM8K, AIME24, AMC23, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-C, GPQA-D, and MMLU-Pro). Our evaluation code is adapted from SimpleRL. We replace Math-Verify with Mathruler to be consistent with training.

You can evaluate by running the following command:

bash scripts/eval.sh

This script first merges the FSDP checkpoints of LAST_STEP and then launches the simpleRL evaluation.

SEELE on Qwen2.5 models (zero-RL)

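SEELE achieves the highest average math (Avg.(Math)) and general-domain (Avg.(Gen)) scores among the zero-RL and supervision-aided methods compared, on all three Qwen2.5 base models below.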

Model	GSM8K	MATH500	Minerva	Olympiad	AIME24	AMC23	Avg.(Math)	ARC-C	GPQA-D	MMLU-Pro	Avg.(Gen)
Qwen2.5-Math-7B	71.6	63.2	25.7	32.0	14.9	45.2	42.1	69.5	24.7	17.7	37.3
+ SFT	89.5	74.6	35.7	37.9	9.4	52.2	49.9	77.1	37.4	47.9	54.1
+ GRPO	92.0	80.6	36.0	41.2	23.9	60.2	55.7	77.1	37.4	45.2	53.2
+ LUFFY	91.7	80.0	35.3	42.4	18.5	66.2	55.7	80.5	39.9	49.9	56.8
+ UFT	92.1	82.4	34.6	40.3	17.6	66.6	55.6	81.0	40.9	49.9	57.3
+ Prefix-RFT	92.1	81.6	36.8	43.0	20.9	63.5	56.3	80.3	37.4	49.8	55.8
+ SEELE (Ours)	92.4	82.6	37.1	46.5	25.8	69.7	59.0	80.7	42.9	52.0	58.5
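
The Avg. columns are consistent with unweighted means over the six math benchmarks and the three out-of-distribution tasks. The short check below recomputes both averages for the SEELE row of the Qwen2.5-Math-7B table; the assumption that the authors average this way is inferred from the numbers, not stated in the README.

```python
# Scores copied from the "+ SEELE (Ours)" row of the Qwen2.5-Math-7B table.
math_scores = {
    "GSM8K": 92.4,
    "MATH500": 82.6,
    "Minerva": 37.1,
    "Olympiad": 46.5,
    "AIME24": 25.8,
    "AMC23": 69.7,
}
general_scores = {"ARC-C": 80.7, "GPQA-D": 42.9, "MMLU-Pro": 52.0}


def unweighted_mean(scores: dict) -> float:
    """Average benchmark scores with equal weight, rounded to one decimal."""
    return round(sum(scores.values()) / len(scores), 1)


avg_math = unweighted_mean(math_scores)    # reported Avg.(Math): 59.0
avg_gen = unweighted_mean(general_scores)  # reported Avg.(Gen): 58.5
print(avg_math, avg_gen)
```

Equal weighting means that small benchmarks such as AIME24 count as much as GSM8K in the average, which is worth keeping in mind when comparing methods by the Avg. columns alone.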

Model	GSM8K	MATH500	Minerva	Olympiad	AIME24	AMC23	Avg.(Math)	ARC-C	GPQA-D	MMLU-Pro	Avg.(Gen)
Qwen2.5-3B	73.3	29.0	7.0	10.7	0.6	11.6	22.0	66.6	23.7	15.2	35.2
+ SFT	73.1	52.6	18.4	18.5	2.3	25.0	31.7	75.7	30.3	40.1	48.7
+ GRPO	78.5	45.8	17.3	15.0	1.7	23.9	30.4	75.6	36.9	40.5	51.0
+ LUFFY	81.1	56.6	21.0	21.3	2.7	27.0	35.0	78.4	32.8	41.5	50.9
+ UFT	84.8	62.6	22.1	25.9	5.8	41.6	40.5	79.4	35.4	42.9	52.6
+ Prefix-RFT	77.4	57.0	21.3	21.9	6.0	31.5	35.9	82.2	35.9	37.9	52.0
+ SEELE (Ours)	86.3	66.4	26.1	28.9	5.9	39.4	42.2	81.2	34.3	44.0	53.2

Model	GSM8K	MATH500	Minerva	Olympiad	AIME24	AMC23	Avg.(Math)	ARC-C	GPQA-D	MMLU-Pro	Avg.(Gen)
Qwen2.5-1.5B	61.9	22.8	9.6	6.7	0.7	9.1	18.5	45.1	15.7	12.2	24.3
+ SFT	67.4	43.6	13.6	12.6	1.4	16.4	25.8	63.7	25.8	30.2	39.9
+ GRPO	70.1	36.4	10.7	11.1	1.8	15.2	24.2	64.8	25.3	23.1	37.7
+ LUFFY	67.2	45.4	11.0	12.9	1.6	16.5	25.8	64.3	26.8	24.7	38.6
+ UFT	72.6	50.4	12.9	15.9	3.9	26.6	30.4	66.1	23.7	28.5	39.4
+ Prefix-RFT	71.5	48.0	13.6	14.5	2.1	22.7	28.7	65.3	23.7	25.0	38.0
+ SEELE (Ours)	76.5	58.0	16.2	19.9	4.1	30.4	34.2	68.3	27.8	31.7	42.6

SEELE on LLaMA 3.2-3B

Model	GSM8K	MATH500	Minerva	Olympiad	AIME24	AMC23	Avg.(Math)	ARC-C	GPQA-D	MMLU-Pro	Avg.(Gen)
LLaMA 3.2-3B	8.1	4.2	2.6	1.9	0.0	2.7	3.3	13.7	15.2	7.3	12.1
+ SFT	20.7	8.8	4.0	3.0	0.2	4.0	6.8	59.0	27.3	26.3	37.5
+ GRPO	5.9	8.2	5.5	2.5	0.0	3.2	4.2	43.5	28.8	23.9	32.1
+ LUFFY	11.9	9.6	4.8	2.4	0.2	3.5	5.4	58.8	30.8	26.0	38.5
+ UFT	24.2	11.4	8.1	3.6	0.2	7.5	9.2	56.2	27.3	25.2	36.2
+ Prefix-RFT	21.1	9.2	5.8	3.6	0.1	2.8	7.1	55.5	28.4	24.9	36.3
+ SEELE (Ours)	29.5	12.8	7.4	5.2	0.7	6.5	10.4	62.4	28.8	26.1	39.1

SEELE on Mathstral 7B

Model	GSM8K	MATH500	Minerva	Olympiad	AIME24	AMC23	Avg.(Math)	ARC-C	GPQA-D	MMLU-Pro	Avg.(Gen)
Mathstral-7B	76.0	34.6	15.8	14.5	1.4	15.7	26.3	62.9	29.3	16.7	36.3
+ SFT	80.1	53.4	24.6	21.6	1.8	26.6	34.7	73.1	44.4	42.2	53.2
+ GRPO	85.6	44.0	22.4	16.0	0.9	21.7	31.8	80.0	38.4	43.5	54.0
+ LUFFY	88.4	60.0	25.0	24.4	6.4	32.4	39.4	80.2	32.8	44.1	52.4
+ UFT	87.9	57.6	23.2	20.6	4.3	33.1	37.8	81.8	37.9	47.7	55.8
+ Prefix-RFT	87.9	57.6	23.2	20.6	4.3	33.1	37.8	81.8	37.9	47.7	55.8
+ SEELE (Ours)	90.2	63.4	27.9	29.5	6.7	38.4	42.7	82.9	38.9	50.5	57.4

🌻 Acknowledgement

SEELE builds upon veRL.
