# RouteEval: A Benchmark for Evaluating LLM Tool Calling in Running Route Generation


**Authors:** Veejhay Roy, Roger Jin

This repository contains the evaluation data and results for RouteEval, a benchmark designed to evaluate Large Language Models' ability to generate accurate running routes through tool calling.

## Overview

Route generation presents unique challenges beyond general tool use: it requires spatial reasoning, precise numerical constraint satisfaction, and real-world validation. Our benchmark assesses 13 state-of-the-art models across 50 diverse prompts with 16 runs each (totaling 800 evaluations per model), measuring their capability to use a route generation tool effectively while adhering to distance constraints.

## Key Results

| Model | Success Rate | Avg Accuracy | Perfect Rate | High Accuracy (≥0.8) |
|---|---|---|---|---|
| GPT-5 (High) | 91.1% | 0.650 | 12.2% | 44.6% |
| Grok-4 | 98.5% | 0.555 | 6.2% | 26.5% |
| Gemini-2.5-Pro | 98.5% | 0.524 | 5.8% | 20.9% |
| Gemini-2.5-Flash | 99.5% | 0.520 | 6.9% | 23.1% |
| DeepSeek-V3.1 | 92.6% | 0.519 | 6.6% | 24.1% |
| Claude-Opus-4.1 | 99.1% | 0.477 | 2.9% | 25.5% |
| GLM-4.5 | 78.5% | 0.475 | 4.9% | 18.4% |
| Claude-Sonnet-4 | 100.0% | 0.458 | 4.9% | 27.5% |
| Kimi-K2 | 88.8% | 0.444 | 5.1% | 20.0% |
| GPT-4o-mini | 99.0% | 0.430 | 4.2% | 18.9% |
| Hermes-4-70B | 91.6% | 0.410 | 4.1% | 16.0% |
| Qwen3-235B | 77.9% | 0.373 | 2.9% | 15.9% |
| Qwen3-Coder | 75.0% | 0.315 | 2.0% | 8.8% |

**Perfect Rate:** routes with accuracy ≥ 0.95 (≤5% distance error).

**High Accuracy Rate:** routes with accuracy ≥ 0.80 (≤20% distance error).
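
These two threshold columns follow directly from the accuracy formula given in the Evaluation Data Format section below. A minimal Python sketch of that mapping (the function name is illustrative, not part of the repository):

```python
def accuracy_score(actual: float, target: float) -> float:
    """Benchmark accuracy as defined below: max(0, 1 - |actual - target| / target)."""
    return max(0.0, 1.0 - abs(actual - target) / target)

# Example: a 5.3-mile route against a 5.0-mile target (6% distance error).
score = accuracy_score(5.3, 5.0)  # 0.94
print(score >= 0.95)  # False -> misses the "perfect" band (needs <=5% error)
print(score >= 0.80)  # True  -> counts as high accuracy (<=20% error)
```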

## Repository Contents

- `evaluation_data/` — Raw evaluation results per model (JSON). Each file contains ~800 evaluations (50 prompts × 16 runs) with prompts, waypoints, accuracy scores, and tool call details.
- `leaderboards/` — Summary statistics derived from the evaluation data (CSV).
- `paper/` — Full research paper (PDF) and abstract (LaTeX).

## Evaluation Data Format

Each file in `evaluation_data/` is a JSON array of evaluation records. Each record has:

| Field | Type | Description |
|---|---|---|
| `prompt_index` | int | Index of the prompt (0–49) |
| `prompt` | string | Natural-language route request |
| `run` | int | Run number (1–16) for this prompt |
| `target_distance` | float | Target distance in miles |
| `success` | bool | Whether the model produced a valid route |
| `waypoints` | array | List of waypoint addresses the model generated |
| `distance` | float | Actual route distance from Google Maps (0 if failed) |
| `accuracy_score` | float | 0–1, computed as max(0, 1 - \|actual - target\| / target) |
| `tool_calls_used` | array | Raw tool call with arguments and result |
| `errors` | array | Error messages if validation failed |
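
As a usage sketch, the leaderboard columns above can be recomputed directly from these records. The filename below is an assumption; substitute any results file in `evaluation_data/`:

```python
import json

# NOTE: the exact filename is an assumption; use any file from evaluation_data/.
with open("evaluation_data/gpt-5-high.json") as f:
    records = json.load(f)

n = len(records)
success_rate  = sum(r["success"] for r in records) / n
avg_accuracy  = sum(r["accuracy_score"] for r in records) / n
perfect_rate  = sum(r["accuracy_score"] >= 0.95 for r in records) / n
high_acc_rate = sum(r["accuracy_score"] >= 0.80 for r in records) / n

print(f"Success: {success_rate:.1%} | Avg accuracy: {avg_accuracy:.3f} | "
      f"Perfect: {perfect_rate:.1%} | High accuracy: {high_acc_rate:.1%}")
```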

## Models Evaluated

- **OpenAI:** GPT-5 (High), GPT-4o-mini
- **Anthropic:** Claude-Opus-4.1, Claude-Sonnet-4
- **Google:** Gemini-2.5-Pro, Gemini-2.5-Flash
- **xAI:** Grok-4
- **DeepSeek:** DeepSeek-V3.1
- **Zhipu AI:** GLM-4.5
- **Moonshot AI:** Kimi-K2
- **Qwen:** Qwen3-235B, Qwen3-Coder
- **NousResearch:** Hermes-4-70B

## Citation

If you use RouteEval in your research, please cite:

```bibtex
@article{roy2025routeeval,
  title={RouteEval: A Benchmark for Evaluating LLM Tool Calling in Running Route Generation},
  author={Roy, Veejhay and Jin, Roger},
  year={2025}
}
```

You can also use the *Cite this repository* link on GitHub or import `CITATION.cff` into Zotero, Mendeley, or other reference managers.

## License

This work is licensed under CC-BY-4.0. You are free to share and adapt the material with appropriate attribution.
