V0 is a Generalist Value Model designed to predict the expected performance of any model on unseen instructions without requiring parameter updates or additional rollouts. By treating a policy's dynamic capability as explicit context, V0 serves as an efficient resource scheduler for LLM training and deployment.
Function: V0 uses a model's historical performance to predict how it will perform on unseen instructions without running the model itself.
In LLM policy-gradient RL training (e.g., PPO), value models are typically coupled to a specific policy. V0 reframes this paradigm by:
- State Zero Estimation: Focusing on the initial prompt to predict success rates before generation.
- Dynamic Profiling: Using instruction-performance pairs to perceive capability shifts without retraining.
- Resource Scheduling: Optimizing sampling budgets in GRPO and acting as an intelligent router during deployment (a minimal scheduling sketch follows this list).
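
As a concrete illustration of the scheduling use case, the sketch below splits a fixed GRPO rollout budget across prompts in proportion to predicted difficulty. The `predict_success` callable, the `allocate_grpo_budget` helper, and the budget rule are illustrative assumptions, not the released API.

```python
# Minimal sketch (not the released API): use state-zero success-rate
# predictions to spend more rollouts on prompts the current policy is
# predicted to struggle with.
from typing import Callable, Dict, List


def allocate_grpo_budget(
    prompts: List[str],
    predict_success: Callable[[str], float],  # hypothetical V0 predictor: prompt -> success rate in [0, 1]
    total_rollouts: int,
    min_per_prompt: int = 2,
) -> Dict[str, int]:
    """Assign more rollouts to prompts with lower predicted success."""
    difficulty = {p: 1.0 - predict_success(p) for p in prompts}
    z = sum(difficulty.values()) or 1.0
    remaining = total_rollouts - min_per_prompt * len(prompts)
    budget = {}
    for p in prompts:
        extra = int(round(remaining * difficulty[p] / z))
        budget[p] = min_per_prompt + max(extra, 0)
    return budget  # note: rounding may shift the exact total slightly


if __name__ == "__main__":
    # Toy predictor standing in for V0's state-zero value estimate.
    toy_predict = lambda p: 0.9 if "easy" in p else 0.3
    print(allocate_grpo_budget(["easy sum", "hard proof"], toy_predict, total_rollouts=16))
```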
The V0 architecture consists of a Semantic Backbone for embedding extraction and a Residual Query Adapter that integrates static instruction queries with dynamic capability queries. The fused representations are then processed by a TabPFN inference head to produce value predictions.
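
A rough picture of how these components might compose, assuming a frozen embedding backbone and using a plain linear head as a stand-in for the TabPFN inference head. Class names, shapes, and the pooling of the performance history are illustrative assumptions, not the repository's code.

```python
import torch
import torch.nn as nn


class ResidualQueryAdapter(nn.Module):
    """Hypothetical adapter that fuses a static instruction query with a
    dynamic capability query (pooled from instruction-performance history)
    through a residual connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, static_q: torch.Tensor, dynamic_q: torch.Tensor) -> torch.Tensor:
        fused = self.proj(torch.cat([static_q, dynamic_q], dim=-1))
        return static_q + fused  # residual integration of the dynamic context


class V0Sketch(nn.Module):
    """Illustrative composition: semantic backbone -> adapter -> value head.
    The backbone and the linear head are stand-ins for the released modules."""

    def __init__(self, backbone: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone                 # frozen embedding extractor
        self.adapter = ResidualQueryAdapter(dim)
        self.value_head = nn.Linear(dim, 1)      # placeholder for the TabPFN inference head

    def forward(self, prompt_emb: torch.Tensor, history_emb: torch.Tensor) -> torch.Tensor:
        static_q = self.backbone(prompt_emb)                 # (B, D) instruction query
        dynamic_q = self.backbone(history_emb).mean(dim=1)   # (B, K, D) -> (B, D) capability query
        fused = self.adapter(static_q, dynamic_q)
        return torch.sigmoid(self.value_head(fused))          # predicted success rate in [0, 1]


# Usage with dummy tensors (batch of 4 prompts, 16 history pairs, 768-d embeddings):
model = V0Sketch(backbone=nn.Identity(), dim=768)
scores = model(torch.randn(4, 768), torch.randn(4, 16, 768))  # shape (4, 1)
```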
Clone the repository and install the dependencies:
```bash
git clone https://github.com/Now-Join-Us/V0.git
cd V0
pip install -r requirements.txt
```

To start training:

```bash
python main_train.py
```

To launch the local demo:
```bash
python demo.py
```

If you find V0 useful in your research, please cite our work:
```bibtex
@article{generalist_value_model_v0,
  author  = {Yi-Kai Zhang and Zhiyuan Yao and Hongyan Hao and Yueqing Sun and Qi Gu and Hui Su and Xunliang Cai and De-Chuan Zhan and Han-Jia Ye},
  title   = {V0: A Generalist Value Model for Any Policy at State Zero},
  journal = {CoRR},
  volume  = {abs/2602.03584},
  year    = {2026}
}
```
