We release ZoomEarth🌍, a vision-language model designed for visual reasoning and question answering on ultra-high-resolution remote sensing imagery through active perception. ZoomEarth also integrates seamlessly with downstream models for tasks such as cloud removal, denoising, segmentation, and image editing via simple tool interfaces, demonstrating strong extensibility.
- 2025.11.22 🎉🎉🎉 We release code that supports faster inference with vLLM!
- 2025.11.18 🎉🎉🎉 ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks is now available on arXiv!
- 2025.11.15 🎉🎉🎉 ZoomEarth-3B is publicly available on huggingface 🤗!
- 2025.11.15 🎉🎉🎉 LRS-GRO is publicly available on huggingface 🤗!
Demo video: ZoomEarth-Demo.mp4
Our model, ZoomEarth, is built upon Qwen2.5-VL-3B, a powerful VLM. It supports fine-grained reasoning, spatial context interpretation, and multi-level object understanding.
- Model weights: ZoomEarth-3B 🤗
- Training scripts: src/train/
- Evaluation scripts: src/eval/
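As a quick-start sketch, the released weights can be loaded through the standard Qwen2.5-VL interface in transformers. This assumes the checkpoint is compatible with those classes; the model path, image, and question below are placeholders, and the full active-perception pipeline is what src/demo.py runs.

# Minimal sketch: plain single-turn inference with the ZoomEarth-3B weights.
# Assumes compatibility with transformers' Qwen2.5-VL classes; paths and prompts are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_path = "PATH_TO_ZOOM_EARTH_MODEL"  # local checkpoint dir or Hugging Face repo id
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)
# max_pixels can be passed here to cap vision tokens for very large images
processor = AutoProcessor.from_pretrained(model_path)

image = Image.open("example_scene.png")  # an ultra-high-resolution remote sensing image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "How many bridges cross the river in this scene?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)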
LRS-GRO contains high-resolution satellite images annotated with:
- Multi-level question types (global, regional, object)
- Bounding boxes and spatial relations
- Reasoning-based and factual QAs
| Split | #Images | #Questions | Avg. Resolution |
|---|---|---|---|
| SFT | 88 | 1011 | 5000 |
| RL | 228 | 2500 | 5000 |
| Test | 908 | 9734 | 5000 |
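For illustration, an annotation record in the spirit of the list above might look like the following; field names and values are hypothetical, so consult the dataset card for the actual schema.

# Hypothetical LRS-GRO-style record, for illustration only; the real schema may differ.
example_record = {
    "image": "scene_00042.png",
    "level": "object",                    # question level: global / regional / object
    "question": "What type of vehicle is parked next to the largest hangar?",
    "answer": "A cargo truck",
    "bbox": [10240, 5632, 10410, 5790],   # [x1, y1, x2, y2] in image pixels
    "qa_type": "reasoning",               # reasoning-based vs. factual QA
}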
Download:
LRS-GRO 🤗
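If you prefer to download programmatically, a minimal sketch with huggingface_hub (the repo id below is a placeholder; use the id behind the LRS-GRO link above):

# Minimal sketch: fetch the dataset snapshot locally with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<LRS-GRO-repo-id>",   # replace with the actual Hugging Face dataset id
    repo_type="dataset",
    local_dir="./data/LRS-GRO",
)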
Step 1. Create a conda environment and activate it.
conda create -n zoom-earth python=3.10 -y
conda activate zoom-earth
Step 2. Install PyTorch (We use PyTorch 2.4.1 / CUDA 12.1)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
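Optionally, verify that the CUDA build was installed (a quick sanity check, not required):

# Optional sanity check for the PyTorch install.
import torch
print(torch.__version__)          # expected: 2.4.1+cu121
print(torch.cuda.is_available())  # expected: True on a machine with a CUDA 12.1 driver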
Step 3. Install other dependencies
pip install -r requirements.txt
Step 4. Configure NLTK local corpora (for WordNet)
# (1) Download WordNet data to a local directory (optional if already exists)
python -m nltk.downloader wordnet -d ./nltk_data
# (2) In your code, add the following before importing WordNet
import nltk
local_corpora = "./nltk_data"
nltk.data.path.insert(0, local_corpora)
from nltk.corpus import wordnet as wn
Then replace local_corpora with the actual path in src/eval/eval.py and src/train/RL/src/open-r1-multimodal/src/open_r1/custom/customized_funcs.py.
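After the snippet above, the following should print a WordNet gloss without triggering any download, which confirms the local corpora are being used (the query word is arbitrary):

# Should succeed using only the local ./nltk_data corpora.
print(wn.synsets("bridge")[0].definition())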
Run the demo with:
python src/demo.py

To train ZoomEarth, first run bash ./run_scripts/train_sft.sh to start the SFT training phase.
After that, run bash ./run_scripts/train_rl.sh to start the RL training phase.
To evaluate the model on LRS-GRO, first run bash ./run_scripts/infer.sh to generate the inference file.
After that, run bash ./run_scripts/eval.sh to get detailed evaluation results.
Or infer with vLLM:
First, install vLLM in your environment, then start the vLLM service:
VLLM_USE_MODELSCOPE=true vllm serve \
PATH_TO_ZOOM_EARTH_MODEL \
--served-model-name ZoomEarth \
--max_model_len 2048 \
--host 0.0.0.0 \
--port 8000
Finally, run python ./src/eval/infer_vllm.py --exp_name zoomearth-infer; after inference finishes, you will find the results in result/zoomearth-infer.jsonl.
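You can also query the running service directly through vLLM's OpenAI-compatible API instead of using the provided script; a minimal sketch, where the image URL and question are placeholders and the model name must match --served-model-name:

# Minimal sketch: query the vLLM OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="ZoomEarth",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/scene.png"}},
            {"type": "text", "text": "How many ships are docked in the harbor?"},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)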
If you have questions or would like to collaborate, please contact us at:
📧 liuruixun6343@gmail.com
If you find our work useful, please cite us:
@misc{liu2025zoomearthactiveperceptionultrahighresolution,
title={ZoomEarth: Active Perception for Ultra-High-Resolution Geospatial Vision-Language Tasks},
author={Ruixun Liu and Bowen Fu and Jiayi Song and Kaiyu Li and Wanchen Li and Lanxuan Xue and Hui Qiao and Weizhan Zhang and Deyu Meng and Xiangyong Cao},
year={2025},
eprint={2511.12267},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2511.12267},
}
Thanks to FAIR1M, GLH-Bridge, and STAR for the images; to LRS-VQA, MME-RealWorld, XLRS-Bench, and GeoLLaVA-8K for the benchmarks; and to VLM-R1 for the training framework code.
© 2025 ZoomEarth Project. Released under the Apache 2.0 License.