SkySense-O

[Paper] [Dataset] [Weights] [Demo]

Introduction✨

1. SkySense Family

Welcome to SkySense-O, part of the SkySense family [homepage], a series of remote sensing foundation models for Earth observation, listed below. We'd be delighted to have your attention, and a star is always appreciated!
(1) SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery
(2) SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling
(3) SkySense-V2: A Unified Foundation Model for Multi-modal Remote Sensing
(4) SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model for Earth Observation
(5) SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding

2. SkySense-O

SkySense-O is a version of SkySense that aggregates CLIP and SAM for remote sensing interpretation, as described in SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling. In addition to a powerful remote sensing vision-language foundation model, we also propose the first open-vocabulary segmentation dataset in the remote sensing domain. Each ground-truth annotation (a mask paired with text) has undergone multiple rounds of annotation and validation by human experts, enabling the model to segment anything in open remote sensing scenarios.

Compared to SAM and GroundingDINO, the primary advantage of our model lies in delivering dense pixel-level output with broader semantic labeling, as shown below.

News 🚀

  • 2025/02/27: 🔥 SkySense-O has been accepted to CVPR 2025!
  • 2025/04/08: 🔥 We introduce SkySense-O, demonstrating impressive zero-shot capabilities in a thorough evaluation spanning 14 datasets, from recognition to reasoning and from classification to localization. Specifically, it outperforms the latest models SegEarth-OV, GeoRSCLIP, and VHM by large margins of 11.95%, 8.04%, and 3.55% on average, respectively.
  • 2025/06/10: 🔥 We release the training and evaluation code.
  • 2025/06/11: 🔥 We release the checkpoints and demo. Welcome to try them!
  • 2025/06/17: 🔥 We release the checkpoints of SkySense-CLIP [ckpt] for future research.
  • 2025/06/29: 🔥 We release the Sky-SA dataset [dataset].
  • 2025/08/06: 🔥 Our new work, SkySense++ [paper][code], has been accepted to Nature Machine Intelligence! Unlike SkySense-O, which uses text prompts, SkySense++ focuses on visual prompts.
  • 2025/08/08: 🔥 The SkySense family homepage is now live. Welcome to follow us!

Try Our Demo 🕹️

  1. Install dependencies (see Dependencies and Installation below).
  2. Download the demo checkpoint [ckpt].
  3. Run the demo according to the demo guide [docs]. A minimal sketch of all three steps follows.
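
Putting the steps together, the flow might look like the sketch below; the script name demo.py, its --ckpt flag, and the paths are placeholders of ours, so follow the demo guide [docs] for the actual interface:

```bash
# Sketch only: the demo entry point and flags are assumptions, not the actual interface.
pip install -r require.txt                          # step 1: dependencies (see section below)
mkdir -p checkpoints                                # step 2: put the downloaded [ckpt] file here
python demo.py --ckpt checkpoints/skysense_o.pth    # step 3: hypothetical launch command
```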

Dependencies and Installation

1. Install Detectron2:

```bash
python -m pip install 'git+https://github.com/MaureenZOU/detectron2-xyz.git'
```

2. Clone this repository and install dependencies:

```bash
git clone https://github.com/zqcrafts/SkySense-O.git
cd SkySense-O
pip install -r require.txt
pip install accelerate -U
```
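
To check that the installation worked, you can import the package directly; this one-liner is just a convenience check and assumes nothing beyond the steps above:

```bash
# Sanity check: confirm the Detectron2 fork is importable and print its version
python -c "import detectron2; print(detectron2.__version__)"
```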

Dataset Preparation

After downloading the Sky-SA dataset, organize the data under ./data as follows:

```
├── Sky-SA
│   ├── img_dir
│   ├── ann_dir
│   ├── skysa_dataset.jsonl
│   └── skysa_graph.jsonl
```
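
The exact schema of the two .jsonl files is not documented here, so before training it can help to peek at a record; this is just a convenience command using the paths from the tree above:

```bash
# Pretty-print the first annotation record to inspect its fields
head -n 1 data/Sky-SA/skysa_dataset.jsonl | python -m json.tool
```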

Model Training and Evaluation

```bash
sh run_train.sh
```

To run evaluation only, add --eval-only to the command in run_train.sh so that it reads python train_net.py --eval-only, then execute the script again.
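
Since the codebase builds on Detectron2, the command in run_train.sh most likely also accepts a config file and a checkpoint override in the usual Detectron2 style; the config and weight paths below are our assumptions for illustration, not the repository's actual defaults:

```bash
# Sketch of an eval-only run; the --config-file path and MODEL.WEIGHTS value are placeholders
python train_net.py --eval-only \
    --config-file configs/skysense_o.yaml \
    MODEL.WEIGHTS checkpoints/skysense_o.pth
```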

Results

Citation

@InProceedings{Zhu_2025_CVPR,
    author    = {Zhu, Qi and Lao, Jiangwei and Ji, Deyi and Luo, Junwei and Wu, Kang and Zhang, Yingying and Ru, Lixiang and Wang, Jian and Chen, Jingdong and Yang, Ming and Liu, Dong and Zhao, Feng},
    title     = {SkySense-O: Towards Open-World Remote Sensing Interpretation with Vision-Centric Visual-Language Modeling},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {14733-14744}
}

@article{wu2025semantic,
  author       = {Wu, Kang and Zhang, Yingying and Ru, Lixiang and Dang, Bo and Lao, Jiangwei and Yu, Lei and Luo, Junwei and Zhu, Zifan and Sun, Yue and Zhang, Jiahao and Zhu, Qi and Wang, Jian and Yang, Ming and Chen, Jingdong and Zhang, Yongjun and Li, Yansheng},
  title        = {A semantic-enhanced multi-modal remote sensing foundation model for Earth observation},
  journal      = {Nature Machine Intelligence},
  year         = {2025},
  doi          = {10.1038/s42256-025-01078-8},
  url          = {https://doi.org/10.1038/s42256-025-01078-8}
}

@inproceedings{guo2024skysense,
    author    = {Guo, Xin and Lao, Jiangwei and Dang, Bo and Zhang, Yingying and Yu, Lei and Ru, Lixiang and Zhong, Liheng and Huang, Ziyuan and Wu, Kang and Hu, Dingxiang and He, Huimei and Wang, Jian and Chen, Jingdong and Yang, Ming and Zhang, Yongjun and Li, Yansheng},
    title     = {SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {27672-27683}
}

@article{luo2024skysensegpt,
  title={SkySenseGPT: A fine-grained instruction tuning dataset and model for remote sensing vision-language understanding},
  author={Luo, Junwei and Pang, Zhen and Zhang, Yongjun and Wang, Tingzhu and Wang, Linlin and Dang, Bo and Lao, Jiangwei and Wang, Jian and Chen, Jingdong and Tan, Yihua and others},
  journal={arXiv preprint arXiv:2406.10100},
  year={2024}
}

Acknowledgement

This implementation is based on Detectron2. Thanks for the awesome work.
