Author: Kaiyu Li*, Zixuan Jiang*, Xiangyong Cao✉, Jiayu Wang, Yuchen Xiao, Deyu Meng, Zhi Wang
Automated textual description of remote sensing images is essential for applications such as environmental monitoring, urban planning, and disaster management. However, most existing methods only generate captions at the image level, lacking fine-grained object-level interpretation.
To address this gap, we propose Geo-DLC, a new task of object-level fine-grained image captioning for remote sensing. To support this task, we introduce:
- DE-Dataset: a large-scale dataset with 25 categories and 261,806 annotated instances, providing detailed descriptions of object attributes, relationships, and contexts.
- DE-Benchmark: an LLM-assisted question-answering evaluation suite to systematically measure model performance on Geo-DLC.
- DescribeEarth: a Multi-modal Large Language Model (MLLM) explicitly designed for Geo-DLC, featuring a scale-adaptive focal strategy and a domain-guided fusion module to capture both high-resolution details and global context.
DescribeEarth consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, achieving superior factual accuracy, descriptive richness, and grammatical soundness across diverse remote sensing scenarios.
see here
cd scripts
python app.py
DE-Dataset can be downloaded from here. The dataset is organized as follows:
DE-Dataset
- {DIOR, DOTA}
- - image
- - description
Use bash scripts/format_data.sh to format the data for training.
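As a rough illustration of the layout above, the following sketch walks the two subsets and pairs each image with a description file. The helper name `pair_images_with_descriptions` is hypothetical, and the pairing rule (an image and its description share the same file stem) is an assumption, not something the dataset documentation states; adjust it to the actual file naming after downloading.

```python
from pathlib import Path


def pair_images_with_descriptions(root, subsets=("DIOR", "DOTA")):
    """Return (subset, image_path, description_path) tuples matched by file stem.

    Hypothetical helper: assumes <root>/<subset>/image and
    <root>/<subset>/description exist and that matching files share a stem.
    """
    pairs = []
    for subset in subsets:
        image_dir = Path(root) / subset / "image"
        desc_dir = Path(root) / subset / "description"
        if not image_dir.is_dir() or not desc_dir.is_dir():
            # Skip subsets that have not been downloaded.
            continue
        # Index description files by stem (assumed naming convention).
        descriptions = {p.stem: p for p in desc_dir.iterdir() if p.is_file()}
        for image_path in sorted(image_dir.iterdir()):
            match = descriptions.get(image_path.stem)
            if image_path.is_file() and match is not None:
                pairs.append((subset, image_path, match))
    return pairs
```

Calling `pair_images_with_descriptions("DE-Dataset")` returns an empty list if the folder structure or naming differs from these assumptions, which makes it a quick sanity check before running the formatting script.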
DE-Benchmark can be downloaded from Hugging Face.
The pretrained checkpoints of DescribeEarth can be downloaded from Hugging Face. To use them, put the whole folder in weights/.
python inference.py --model_dir <model_dir> --image <image_dir> --bbox <4-points-bbox/2-points-bbox>
- Example
python inference.py --model_dir ../weights/DescribeEarth_0930 --image ./example1/image.jpg --bbox 36.0 332.0 311.0 325.0 317.0 584.0 42.0 591.0
- Result
The object of category baseball_field within the specified polygon bounding box is a well-defined outdoor sports facility designed for baseball. The field features a central dirt infield area, clearly demarcated from the surrounding grassy outfield. The infield includes a pitcher's mound and bases, indicating its purpose for baseball games. The surrounding area consists of a large, open grassy field, typical of a baseball diamond layout. Adjacent to the field are structures that appear to be part of a larger complex, possibly including facilities such as dugouts or storage areas. The overall layout and design confirm this as a dedicated baseball field. There are no visible signs of current activity on the field itself.
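The `--bbox` argument accepts either a 4-point or a 2-point box as a flat list of numbers. The sketch below shows one way such a list could be normalized into four corner points. The function `parse_bbox` is hypothetical and is not taken from inference.py; in particular, treating the 2-point form as two opposite corners of an axis-aligned rectangle is an assumption.

```python
def parse_bbox(values):
    """Convert a flat --bbox value list into four (x, y) corner points.

    Hypothetical helper, not the repository's own parser.
    """
    coords = [float(v) for v in values]
    if len(coords) == 8:
        # 4-point form: x1 y1 x2 y2 x3 y3 x4 y4 (polygon corners in order).
        return [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    if len(coords) == 4:
        # 2-point form: assumed to be two opposite corners of a rectangle.
        x1, y1, x2, y2 = coords
        left, right = sorted((x1, x2))
        top, bottom = sorted((y1, y2))
        return [(left, top), (right, top), (right, bottom), (left, bottom)]
    raise ValueError(f"expected 4 or 8 numbers, got {len(coords)}")


# The polygon from the example command above.
corners = parse_bbox("36.0 332.0 311.0 325.0 317.0 584.0 42.0 591.0".split())
print(corners)
# → [(36.0, 332.0), (311.0, 325.0), (317.0, 584.0), (42.0, 591.0)]
```

Normalizing both forms to four corners lets downstream code handle rotated boxes and axis-aligned boxes through a single path.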
Following the Qwen2.5-VL baseline, take these steps to train on DE-Dataset or your own dataset:
- Edit Qwen2.5-VL/qwen-vl-finetune/qwenvl/data/__init__.py to set the path to the formatted dataset.
- Download the pretrained weights (a merged checkpoint of Qwen2.5-VL-3B and RemoteCLIP-vit-b32) from Hugging Face.
- Run bash script/sft.sh under Qwen2.5-VL/qwen-vl-finetune.
Use scripts/openai_valid.py to evaluate DescribeEarth and other models.
python openai_valid.py <path to QA.json> <path to image_dataset> -o <output_dir> --generator <'api' or 'local'> --api-key <api_key> --model_dir <model_dir>
Use calculate_score.py to get the final results.
@article{li2025describeearth,
title={DescribeEarth: Describe Anything for Remote Sensing Images},
author={Li, Kaiyu and Jiang, Zixuan and Cao, Xiangyong and Wang, Jiayu and Xiao, Yuchen and Meng, Deyu and Wang, Zhi},
journal={arXiv preprint arXiv:2509.25654},
year={2025}
}