🔮 Spaeing the Unseen
Haoyi Jiang1,
Liu Liu2, Xinjie Wang2, Yonghao He3,
Wei Sui3, Zhizhong Su2,
Wenyu Liu1, Xinggang Wang1
1Huazhong University of Science & Technology,
2Horizon Robotics,
3D-Robotics
Please clone this repository with the `--recursive` flag so that the bundled submodules (e.g. `submodules/vggt`, `submodules/lmms-eval`) are fetched as well.
```shell
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
pip install submodules/vggt
pip install -e submodules/lmms-eval
```
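After installation, a quick sanity check can confirm that the key dependencies are importable. This is an optional sketch; the module names for the submodule packages (`vggt`, `lmms_eval`) are assumptions based on the install paths above.

```python
# Optional sanity check (a sketch): report which key dependencies import cleanly.
# The submodule package names (`vggt`, `lmms_eval`) are assumed from the repo layout.
import importlib.util

def check_deps(mods=("torch", "flash_attn", "vggt", "lmms_eval")):
    """Return a {module_name: bool} map of which dependencies are importable."""
    return {m: importlib.util.find_spec(m) is not None for m in mods}

for mod, ok in check_deps().items():
    print(f"{mod}: {'ok' if ok else 'MISSING'}")
```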
We train on a combination of large-scale indoor scene datasets: ScanNet and ScanNet++.
- For video-centric VSI-Bench, we fine-tune on VSI-590K.
- For image-based benchmarks, we use a composite training set; please refer to the VG-LLM datasets for its composition.
To train the Predictive Spatial Field Modeling (PSFM) framework from scratch:
```shell
export PYTHONPATH=.
python scripts/train_spa3r.py
```
Set the pre-trained Spa3R checkpoint path in the script (`geometry_encoder_path=/path/to/spa3r.ckpt`), then run:

```shell
bash scripts/train_vlm_sft.sh
```
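For reference, the `geometry_encoder_path` checkpoint might be consumed roughly as follows. This is a hypothetical sketch assuming a standard PyTorch checkpoint; `load_geometry_encoder` is illustrative, not the project's actual API.

```python
# Hypothetical sketch of how a geometry-encoder checkpoint could be loaded.
# `load_geometry_encoder` is illustrative only, not the project's actual API;
# it assumes the checkpoint was written with `torch.save`.
import torch

def load_geometry_encoder(path: str):
    # Load onto CPU so the checkpoint can be inspected without a GPU.
    ckpt = torch.load(path, map_location="cpu")
    # Checkpoints often nest the weights under a "state_dict" key.
    if isinstance(ckpt, dict):
        return ckpt.get("state_dict", ckpt)
    return ckpt
```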
To evaluate Spa3-VLM on spatial reasoning benchmarks:
```shell
bash scripts/eval_vlm.sh
```
If you find our work helpful for your research, please consider starring this repository ⭐ and citing our work:
```bibtex
@article{Spa3R,
  title={Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning},
  author={Haoyi Jiang and Liu Liu and Xinjie Wang and Yonghao He and Wei Sui and Zhizhong Su and Wenyu Liu and Xinggang Wang},
  journal={arXiv preprint arXiv:2602.21186},
  year={2026}
}
```

This project is released under the MIT License.