Honghao Chen1,2,3#, Xingzhou Lou1,2#, Xiaokun Feng1,2#, Kaiqi Huang1,2, Xinlong Wang3
In this work, we introduce Chain of Step (CoS) reasoning for vision-language models, which enables accurate assessment of reasoning step quality and leads to effective reinforcement learning and inference-time scaling with fine-grained rewards. Experimental results across multiple benchmarks demonstrate the effectiveness of CoS. More importantly, we conduct extensive empirical analysis and ablations to unveil CoS’s appealing properties. We hope this paper offers insights into more complex multi-modal reasoning.
Note: You can use our SFT dataset directly (special tokens have been added) through the links below, or access the raw step data to customize your own SFT dataset. For customization, modify get_sft_json.py to generate your SFT data accordingly (see the sketch after the table).
| Name | Description | Link |
|---|---|---|
| ShareGPT-Step-300K.jsonl | SFT training JSONL | 🤗 HF link |
| images.zip | Image files | 🤗 HF link |
| raw_jsonl.zip | Raw step JSONL for customization | 🤗 HF link |
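For reference, the sketch below shows one way a customization pass over the raw step data might look. It is a minimal sketch, not the actual get_sft_json.py: the field names (`image`, `question`, `steps`, `answer`) and the step tokens are assumptions, so match them to the raw JSONL schema and to the special tokens the released script uses.

```python
import json

# Hypothetical step delimiters; replace with the special tokens
# actually emitted by get_sft_json.py.
STEP_START, STEP_END = "<step>", "</step>"

def build_sft_record(raw):
    """Wrap each reasoning step in step tokens and assemble a
    conversation-style SFT record (field names are assumed)."""
    reasoning = "".join(f"{STEP_START}{s}{STEP_END}" for s in raw["steps"])
    return {
        "image": raw["image"],
        "conversations": [
            {"from": "human", "value": raw["question"]},
            {"from": "gpt", "value": reasoning + raw["answer"]},
        ],
    }

with open("raw_steps.jsonl") as fin, open("sft_custom.jsonl", "w") as fout:
    for line in fin:
        fout.write(json.dumps(build_sft_record(json.loads(line))) + "\n")
```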
Note: You can use our training JSONL directly to train the PRM (special tokens have been added in a fixed format) through the links below, or access the raw data to customize your own dataset. For customization, modify get_prm_json.py to generate your data accordingly (see the sketch after the table).
| Name | Description | Link |
|---|---|---|
| CoS-PRM | The PRM model | 🤗 HF link |
| prm_data_raw.json | Raw PRM data | 🤗 HF link |
| prm_data_train.jsonl | PRM training JSONL | 🤗 HF link |
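As a rough illustration of the fixed format, the sketch below interleaves each reasoning step with a per-step label token, which is the kind of record a PRM is trained on. It is a sketch only: the field names (`steps`, `labels`, `question`, `image`) and the label tokens are assumptions, so check prm_data_raw.json and get_prm_json.py for the actual schema and special tokens.

```python
import json

# Hypothetical per-step label tokens; the actual special tokens
# are defined by get_prm_json.py.
LABEL_TOKEN = {1: "<+>", 0: "<->"}

def build_prm_record(raw):
    """Interleave steps with correctness labels (assumed schema:
    'steps' is a list of strings, 'labels' a parallel list of 0/1)."""
    body = "".join(
        f"<step>{s}</step>{LABEL_TOKEN[label]}"
        for s, label in zip(raw["steps"], raw["labels"])
    )
    return {"image": raw["image"], "text": raw["question"] + body}

with open("prm_data_raw.json") as f:
    raw_records = json.load(f)

with open("prm_custom.jsonl", "w") as fout:
    for rec in raw_records:
        fout.write(json.dumps(build_prm_record(rec)) + "\n")
```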
| Name | Description | Link |
|---|---|---|
| CoS-SFT | The SFT model | 🤗 HF link |
| CoS | The RL model | 🤗 HF link |
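A quick way to smoke-test a downloaded checkpoint is through Hugging Face Transformers. The snippet below is a generic sketch, not the project's documented loading code: the checkpoint path is a placeholder, and InternVL-style checkpoints typically need `trust_remote_code=True` plus their own image preprocessing for multi-modal inputs, which is omitted here.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/CoS"  # placeholder: point at the downloaded checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, trust_remote_code=True, device_map="auto"
)

# Text-only generation as a sanity check; image inputs follow the
# base model's preprocessing pipeline (not shown).
inputs = tokenizer("Describe chain-of-step reasoning.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```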
- SFT Dataset
- PRM & Dataset
- Training & Inference code
- Checkpoints
@article{chen2025unveiling,
title={Unveiling Chain of Step Reasoning for Vision-Language Models with Fine-grained Rewards},
author={Chen, Honghao and Lou, Xingzhou and Feng, Xiaokun and Huang, Kaiqi and Wang, Xinlong},
journal={arXiv preprint arXiv:2509.19003},
year={2025}
}
We thank the following repositories for their excellent work: InternVL, LLaVA-NeXT, TAP.
