GeoPlan-bench is a benchmark platform for evaluating agent architectures in remote sensing task planning. The platform provides a complete workflow for task generation, filtering, and evaluation, supporting multiple agent architectures and evaluation metrics.
GeoPlan-bench implements the following 6 agent architectures:
- ReAct: Reasoning-Acting loop agent that solves problems by alternating between reasoning and acting
- Plan-and-Execute: Planning-execution separation agent that plans first and then executes
- CoT (Chain of Thought): Zero-shot chain-of-thought agent
- Debate: Multi-agent debate architecture that improves solutions through discussion among multiple agents
- AFlow: Adaptive flow agent that can dynamically adjust tool calling sequences
- EarthAgent: Hierarchical expert agent with multi-layer architecture optimized for remote sensing tasks
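The reasoning-acting alternation behind ReAct can be sketched in a few lines. This is illustrative only, assuming an `llm` callable that returns `(thought, action, argument)` tuples and a dict of tool callables; it is not GeoPlan-bench's actual agent API (see `geoplan_bench/agents/ReAct.py` for the real implementation).

```python
def react_loop(question, llm, tools, max_steps=5):
    """Alternate reasoning (ask the model) and acting (call a tool)."""
    trace = []
    for _ in range(max_steps):
        # Reasoning step: the model sees the question and the trace so far
        thought, action, argument = llm(question, trace)
        trace.append(("thought", thought))
        if action == "finish":  # the model decides it has the answer
            return argument, trace
        # Acting step: run the chosen tool and record the observation
        observation = tools[action](argument)
        trace.append(("observation", observation))
    return None, trace
```

The loop terminates either when the model emits a `finish` action or when the step budget runs out, so a misbehaving model cannot loop forever.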
The platform provides 3 categories of evaluation metrics:
- Correctness: Evaluates the correctness of agent tool calls
  - Key Step Recall
  - Key Tool Precision
  - F1 Score
- Structural: Evaluates structural similarity of tool calling sequences
  - Tool Flow Similarity
  - Enhanced Edit Distance
- Holistic: Evaluates solution completeness
  - Elo ranking-based completeness score
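The correctness metrics above can be sketched as set overlap between the agent's tool calls and the reference key steps. This is an assumed, simplified formulation; the benchmark's exact definitions (e.g. how "key" steps are selected and weighted) live in `geoplan_bench/evaluation/metrics/correctness.py` and may differ.

```python
def correctness_scores(predicted_calls, reference_key_steps):
    """Set-based recall, precision, and F1 over tool calls."""
    pred, ref = set(predicted_calls), set(reference_key_steps)
    hits = len(pred & ref)
    recall = hits / len(ref) if ref else 0.0        # Key Step Recall
    precision = hits / len(pred) if pred else 0.0   # Key Tool Precision
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # F1 Score
    return recall, precision, f1
```

F1 is the harmonic mean of the two, so an agent must both cover the key steps and avoid spurious tool calls to score well.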
- Task Generation: Automatically generates diverse remote sensing tasks with DAG templates, tool flows, and real-world questions
- Task Filtering: Intelligent filtering and deduplication to ensure task quality and diversity
- Task Evaluation: Automated evaluation pipeline supporting batch evaluation and result analysis
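The deduplication part of task filtering can be sketched as dropping tasks whose tool-flow signature has already been seen. The `"tool_flow"` key is an assumption for illustration; the real filtering in `geoplan_bench/pipeline/task_validation.py` also applies quality checks beyond simple deduplication.

```python
def dedup_by_tool_flow(tasks):
    """Keep only the first task for each distinct tool-flow signature."""
    seen, kept = set(), []
    for task in tasks:
        signature = tuple(task["tool_flow"])  # hashable key for the flow
        if signature not in seen:
            seen.add(signature)
            kept.append(task)
    return kept
```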
- Python >= 3.8
- pip
- Clone the repository:

  ```bash
  git clone https://github.com/earth-insights/geoplan-bench.git
  cd geoplan-bench
  ```

- Create a virtual environment (venv or conda):

  ```bash
  # Using venv (Python 3.8+)
  python -m venv venv
  # Activate the virtual environment
  # On Windows:
  venv\Scripts\activate
  # On macOS/Linux:
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install the package (optional, for the command-line tools):

  ```bash
  pip install -e .
  ```

- Configure environment variables:

  ```bash
  cp env.example .env
  # Edit .env and add your API keys
  ```

Required environment variables:

- OPENAI_API_KEY: OpenAI API key
- OPENAI_API_BASE: OpenAI API base URL (optional)
- GEMINI_API_KEY: Google Gemini API key
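A quick sanity check for the required variables can save a failed run later. The key names mirror the list above; the `missing_keys` helper is illustrative, not part of GeoPlan-bench, which presumably loads the `.env` file itself at startup.

```python
import os

# Required keys from the README's environment-variable list;
# OPENAI_API_BASE is optional and so is not checked here.
REQUIRED_KEYS = ["OPENAI_API_KEY", "GEMINI_API_KEY"]

def missing_keys(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```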
Generate tasks using the command-line tool:

```bash
geoplan-generate --num-dags 1 --output-dir data/tasks/raw
```

Or use the Python script:

```bash
python scripts/generate_tasks.py --num-dags 1 --output-dir data/tasks/raw
```

Filter and deduplicate generated tasks:

```bash
geoplan-filter --input-dir data/tasks/raw --output-dir data/tasks/filtered
```

Or use the Python script:

```bash
python scripts/filter_tasks.py --input-dir data/tasks/raw --output-dir data/tasks/filtered
```

Evaluate agent performance on tasks:

```bash
geoplan-evaluate --start-index 0 --end-index 3
```

Or use the Python script:
```bash
python scripts/evaluate.py --start-index 0 --end-index 3
```

```
geoplan_bench/
├── agents/                  # Agent implementations
│   ├── base.py              # Base agent interface
│   ├── ReAct.py             # ReAct agent
│   ├── Plan_and_Execute.py  # Plan-and-Execute agent
│   ├── CoT.py               # CoT agent
│   ├── Debate.py            # Debate agent
│   ├── AFlow.py             # AFlow agent
│   └── EarthAgent.py        # EarthAgent agent
├── evaluation/              # Evaluation modules
│   └── metrics/             # Evaluation metrics
│       ├── correctness.py   # Correctness evaluation
│       ├── holistic.py      # Holistic evaluation
│       └── structural.py    # Structural evaluation
├── tools/                   # Tool implementations
│   └── tools.py             # Remote sensing toolset
├── pipeline/                # Pipelines
│   ├── task_generation.py   # Task generation pipeline
│   ├── task_evaluation.py   # Task evaluation pipeline
│   └── task_validation.py   # Task filtering logic
├── data/                    # Data management
│   └── schemas.py           # Data schema definitions
├── utils/                   # Utility functions
│   ├── arena.py             # Agent creation utilities
│   └── importance.py        # Tool importance analysis
└── config/                  # Configuration
    ├── constants.py         # Constant definitions
    └── prompts.py           # Prompt templates
scripts/                     # Executable scripts
├── generate_tasks.py        # Task generation script
├── filter_tasks.py          # Task filtering script
└── evaluate.py              # Evaluation script
examples/                    # Example code and data
docs/                        # Documentation
data/                        # Our data files
├── aflow_train_tasks/       # Training set for AFlow
└── test_tasks/              # Test set
```
For detailed usage instructions, please refer to:
- Usage Guide - Detailed feature usage instructions
- Architecture Documentation - System architecture description
- Examples - Example code and data
GeoPlan Benchmark supports the following 7 remote sensing task domains:
- Agriculture & Forestry
- Urban & Regional Planning
- Environmental Monitoring & Climate Change
- Disaster Emergency & Management
- Earth Science & Resource Exploration
- Marine & Water Resources
- Defense & Security
Each domain supports three complexity levels: Simple, Medium, and Complex.
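Spelled out, the domains and complexity levels give a 7 × 3 = 21-cell task space. The list literals below mirror this README; the variable names are illustrative (the real constants presumably live in `geoplan_bench/config/constants.py`).

```python
# Task domains and complexity levels as listed in the README.
DOMAINS = [
    "Agriculture & Forestry",
    "Urban & Regional Planning",
    "Environmental Monitoring & Climate Change",
    "Disaster Emergency & Management",
    "Earth Science & Resource Exploration",
    "Marine & Water Resources",
    "Defense & Security",
]
COMPLEXITY_LEVELS = ["Simple", "Medium", "Complex"]

# Every (domain, complexity) combination the benchmark covers.
TASK_SPACE = [(d, c) for d in DOMAINS for c in COMPLEXITY_LEVELS]
```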
Contributions are welcome! Please check our contributing guidelines.
See LICENSE file for details.
If you use GeoPlan Benchmark, please cite:
@article{li2025designing,
title={Designing Domain-Specific Agents via Hierarchical Task Abstraction Mechanism},
author={Li, Kaiyu and Wang, Jiayu and Wang, Zhi and Qiao, Hui and Zhang, Weizhan and Meng, Deyu and Cao, Xiangyong},
journal={arXiv preprint arXiv:2511.17198},
year={2025}
}