GeoPlan-bench is a benchmark platform for evaluating agent architectures in remote sensing task planning. The platform provides a complete workflow for task generation, filtering, and evaluation, supporting multiple agent architectures and evaluation metrics.
GeoPlan-bench implements the following 6 agent architectures:
- ReAct: Reasoning-Acting loop agent that solves problems by alternating between reasoning and acting
- Plan-and-Execute: Planning-execution separation agent that plans first and then executes
- CoT (Chain of Thought): Zero-shot chain-of-thought agent
- Debate: Multi-agent debate architecture that improves solutions through discussion among multiple agents
- AFlow: Adaptive flow agent that can dynamically adjust tool calling sequences
- EarthAgent: Hierarchical expert agent with multi-layer architecture optimized for remote sensing tasks
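The reasoning-acting alternation behind ReAct can be sketched in a few lines. This is illustrative only, assuming an `llm` callable that returns `(thought, action, argument)` tuples and a dict of tool callables; it is not GeoPlan-bench's actual agent API (see `geoplan_bench/agents/ReAct.py` for the real implementation).

```python
def react_loop(question, llm, tools, max_steps=5):
    """Alternate reasoning (ask the model) and acting (call a tool)."""
    trace = []
    for _ in range(max_steps):
        # Reasoning step: the model sees the question and the trace so far
        thought, action, argument = llm(question, trace)
        trace.append(("thought", thought))
        if action == "finish":  # the model decides it has the answer
            return argument, trace
        # Acting step: run the chosen tool and record the observation
        observation = tools[action](argument)
        trace.append(("observation", observation))
    return None, trace
```

The loop terminates either when the model emits a `finish` action or when the step budget runs out, so a misbehaving model cannot loop forever.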
The platform provides 3 categories of evaluation metrics:
- Correctness: Evaluates the correctness of agent tool calls
  - Key Step Recall
  - Key Tool Precision
  - F1 Score
- Structural: Evaluates structural similarity of tool calling sequences
  - Tool Flow Similarity
  - Enhanced Edit Distance
- Holistic: Evaluates solution completeness
  - Elo ranking-based completeness score
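The correctness metrics above can be sketched as set overlap between the agent's tool calls and the reference key steps. This is an assumed, simplified formulation; the benchmark's exact definitions (e.g. how "key" steps are selected and weighted) live in `geoplan_bench/evaluation/metrics/correctness.py` and may differ.

```python
def correctness_scores(predicted_calls, reference_key_steps):
    """Set-based recall, precision, and F1 over tool calls."""
    pred, ref = set(predicted_calls), set(reference_key_steps)
    hits = len(pred & ref)
    recall = hits / len(ref) if ref else 0.0        # Key Step Recall
    precision = hits / len(pred) if pred else 0.0   # Key Tool Precision
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)           # F1 Score
    return recall, precision, f1
```

F1 is the harmonic mean of the two, so an agent must both cover the key steps and avoid spurious tool calls to score well.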
- Task Generation: Automatically generates diverse remote sensing tasks with DAG templates, tool flows, and real-world questions
- Task Filtering: Intelligent filtering and deduplication to ensure task quality and diversity
- Task Evaluation: Automated evaluation pipeline supporting batch evaluation and result analysis
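The deduplication part of task filtering can be sketched as dropping tasks whose tool-flow signature has already been seen. The `"tool_flow"` key is an assumption for illustration; the real filtering in `geoplan_bench/pipeline/task_validation.py` also applies quality checks beyond simple deduplication.

```python
def dedup_by_tool_flow(tasks):
    """Keep only the first task for each distinct tool-flow signature."""
    seen, kept = set(), []
    for task in tasks:
        signature = tuple(task["tool_flow"])  # hashable key for the flow
        if signature not in seen:
            seen.add(signature)
            kept.append(task)
    return kept
```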
- Python >= 3.8
- pip
- Clone the repository:

  ```bash
  git clone https://github.com/earth-insights/geoplan-bench.git
  cd geoplan-bench
  ```

- Create a virtual environment (venv or conda):

  ```bash
  # Using venv (Python 3.8+)
  python -m venv venv
  # Activate the virtual environment
  # On Windows:
  venv\Scripts\activate
  # On macOS/Linux:
  source venv/bin/activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Install the package (optional, for the command-line tools):

  ```bash
  pip install -e .
  ```

- Configure environment variables:

  ```bash
  cp env.example .env
  # Edit .env and add your API keys
  ```

Required environment variables:

- OPENAI_API_KEY: OpenAI API key
- OPENAI_API_BASE: OpenAI API base URL (optional)
- GEMINI_API_KEY: Google Gemini API key
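A quick sanity check for the required variables can save a failed run later. The key names mirror the list above; the `missing_keys` helper is illustrative, not part of GeoPlan-bench, which presumably loads the `.env` file itself at startup.

```python
import os

# Required keys from the README's environment-variable list;
# OPENAI_API_BASE is optional and so is not checked here.
REQUIRED_KEYS = ["OPENAI_API_KEY", "GEMINI_API_KEY"]

def missing_keys(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```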
Generate tasks using the command-line tool:

```bash
geoplan-generate --num-dags 1 --output-dir data/tasks/raw
```

Or use the Python script:

```bash
python scripts/generate_tasks.py --num-dags 1 --output-dir data/tasks/raw
```

Filter and deduplicate generated tasks:

```bash
geoplan-filter --input-dir data/tasks/raw --output-dir data/tasks/filtered
```

Or use the Python script:

```bash
python scripts/filter_tasks.py --input-dir data/tasks/raw --output-dir data/tasks/filtered
```

Evaluate agent performance on tasks:

```bash
geoplan-evaluate --start-index 0 --end-index 3
```

Or use the Python script:
```bash
python scripts/evaluate.py --start-index 0 --end-index 3
```

```
geoplan_bench/
├── agents/                  # Agent implementations
│   ├── base.py              # Base agent interface
│   ├── ReAct.py             # ReAct agent
│   ├── Plan_and_Execute.py  # Plan-and-Execute agent
│   ├── CoT.py               # CoT agent
│   ├── Debate.py            # Debate agent
│   ├── AFlow.py             # AFlow agent
│   └── EarthAgent.py        # EarthAgent agent
├── evaluation/              # Evaluation modules
│   └── metrics/             # Evaluation metrics
│       ├── correctness.py   # Correctness evaluation
│       ├── holistic.py      # Holistic evaluation
│       └── structural.py    # Structural evaluation
├── tools/                   # Tool implementations
│   └── tools.py             # Remote sensing toolset
├── pipeline/                # Pipelines
│   ├── task_generation.py   # Task generation pipeline
│   ├── task_evaluation.py   # Task evaluation pipeline
│   └── task_validation.py   # Task filtering logic
├── data/                    # Data management
│   └── schemas.py           # Data schema definitions
├── utils/                   # Utility functions
│   ├── arena.py             # Agent creation utilities
│   └── importance.py        # Tool importance analysis
└── config/                  # Configuration
    ├── constants.py         # Constant definitions
    └── prompts.py           # Prompt templates
scripts/                     # Executable scripts
├── generate_tasks.py        # Task generation script
├── filter_tasks.py          # Task filtering script
└── evaluate.py              # Evaluation script
examples/                    # Example code and data
docs/                        # Documentation
data/                        # Our data files
├── aflow_train_tasks/       # Training set for AFlow
└── test_tasks/              # Test set
```
For detailed usage instructions, please refer to:
- Usage Guide - Detailed feature usage instructions
- Architecture Documentation - System architecture description
- Examples - Example code and data
GeoPlan Benchmark supports the following 7 remote sensing task domains:
- Agriculture & Forestry
- Urban & Regional Planning
- Environmental Monitoring & Climate Change
- Disaster Emergency & Management
- Earth Science & Resource Exploration
- Marine & Water Resources
- Defense & Security
Each domain supports three complexity levels: Simple, Medium, and Complex.
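Spelled out, the domains and complexity levels give a 7 × 3 = 21-cell task space. The list literals below mirror this README; the variable names are illustrative (the real constants presumably live in `geoplan_bench/config/constants.py`).

```python
# Task domains and complexity levels as listed in the README.
DOMAINS = [
    "Agriculture & Forestry",
    "Urban & Regional Planning",
    "Environmental Monitoring & Climate Change",
    "Disaster Emergency & Management",
    "Earth Science & Resource Exploration",
    "Marine & Water Resources",
    "Defense & Security",
]
COMPLEXITY_LEVELS = ["Simple", "Medium", "Complex"]

# Every (domain, complexity) combination the benchmark covers.
TASK_SPACE = [(d, c) for d in DOMAINS for c in COMPLEXITY_LEVELS]
```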
Contributions are welcome! Please check our contributing guidelines.
See LICENSE file for details.
If you use GeoPlan Benchmark, please cite:
@article{li2025designing,
title={Designing Domain-Specific Agents via Hierarchical Task Abstraction Mechanism},
author={Li, Kaiyu and Wang, Jiayu and Wang, Zhi and Qiao, Hui and Zhang, Weizhan and Meng, Deyu and Cao, Xiangyong},
journal={arXiv preprint arXiv:2511.17198},
year={2025}
}