Authors: Veejhay Roy, Roger Jin
This repository contains the evaluation data and results for RouteEval, a benchmark designed to evaluate Large Language Models' ability to generate accurate running routes through tool calling.
- Live demo: routecraft.io
- Full paper: RouteEval Research Paper (PDF)
Route generation presents unique challenges beyond general tool use: it requires spatial reasoning, precise numerical constraint satisfaction, and real-world validation. Our benchmark assesses 13 state-of-the-art models across 50 diverse prompts with 16 runs each (totaling 800 evaluations per model), measuring their capability to use a route generation tool effectively while adhering to distance constraints.
| Model | Success Rate | Avg Accuracy | Perfect Rate | High Accuracy (≥0.8) |
|---|---|---|---|---|
| GPT-5 (High) | 91.1% | 0.650 | 12.2% | 44.6% |
| Grok-4 | 98.5% | 0.555 | 6.2% | 26.5% |
| Gemini-2.5-Pro | 98.5% | 0.524 | 5.8% | 20.9% |
| Gemini-2.5-Flash | 99.5% | 0.520 | 6.9% | 23.1% |
| DeepSeek-V3.1 | 92.6% | 0.519 | 6.6% | 24.1% |
| Claude-Opus-4.1 | 99.1% | 0.477 | 2.9% | 25.5% |
| GLM-4.5 | 78.5% | 0.475 | 4.9% | 18.4% |
| Claude-Sonnet-4 | 100.0% | 0.458 | 4.9% | 27.5% |
| Kimi-K2 | 88.8% | 0.444 | 5.1% | 20.0% |
| GPT-4o-mini | 99.0% | 0.430 | 4.2% | 18.9% |
| Hermes-4-70B | 91.6% | 0.410 | 4.1% | 16.0% |
| Qwen3-235B | 77.9% | 0.373 | 2.9% | 15.9% |
| Qwen3-Coder | 75.0% | 0.315 | 2.0% | 8.8% |
- Perfect Rate: Routes with accuracy ≥ 0.95 (≤5% distance error)
- High Accuracy Rate: Routes with accuracy ≥ 0.80 (≤20% distance error)
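The accuracy metric and the two threshold columns above can be sketched as small functions (an illustrative sketch; the function names are ours, not from the evaluation code):

```python
def accuracy_score(actual: float, target: float) -> float:
    """Accuracy in [0, 1]: 1.0 at an exact match, 0.0 once the
    distance error reaches 100% of the target."""
    return max(0.0, 1.0 - abs(actual - target) / target)

def is_perfect(actual: float, target: float) -> bool:
    # accuracy >= 0.95, i.e. at most 5% distance error
    return accuracy_score(actual, target) >= 0.95

def is_high_accuracy(actual: float, target: float) -> bool:
    # accuracy >= 0.80, i.e. at most 20% distance error
    return accuracy_score(actual, target) >= 0.80
```

For example, a 5.2-mile route against a 5-mile target has 4% error (accuracy 0.96), so it counts as perfect; a 5.9-mile route has 18% error (accuracy 0.82), so it counts as high accuracy but not perfect.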
- `evaluation_data/` — Raw evaluation results per model (JSON). Each file contains ~800 evaluations (50 prompts × 16 runs) with prompts, waypoints, accuracy scores, and tool call details.
- `leaderboards/` — Summary statistics derived from the evaluation data (CSV).
- `paper/` — Full research paper (PDF) and abstract (LaTeX).
Each file in `evaluation_data/` is a JSON array of evaluation records. Each record has:
| Field | Type | Description |
|---|---|---|
| `prompt_index` | int | Index of the prompt (0–49) |
| `prompt` | string | Natural language route request |
| `run` | int | Run number (1–16) for this prompt |
| `target_distance` | float | Target distance in miles |
| `success` | bool | Whether the model produced a valid route |
| `waypoints` | array | List of waypoint addresses the model generated |
| `distance` | float | Actual route distance from Google Maps (0 if failed) |
| `accuracy_score` | float | 0–1, max(0, 1 − \|actual − target\| / target) |
| `tool_calls_used` | array | Raw tool call with arguments and result |
| `errors` | array | Error messages if validation failed |
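Given this record schema, the leaderboard columns can be recomputed directly from a model's evaluation file. A minimal sketch (the `summarize` helper and the example filename are ours; only the field names come from the schema above):

```python
import json

def summarize(path: str) -> dict:
    """Recompute the leaderboard columns from one model's evaluation file."""
    with open(path) as f:
        records = json.load(f)  # JSON array of evaluation records
    n = len(records)
    scores = [r["accuracy_score"] for r in records]
    return {
        "success_rate": sum(r["success"] for r in records) / n,
        "avg_accuracy": sum(scores) / n,
        "perfect_rate": sum(s >= 0.95 for s in scores) / n,       # ≤5% error
        "high_accuracy_rate": sum(s >= 0.80 for s in scores) / n, # ≤20% error
    }

# e.g. summarize("evaluation_data/gpt-5-high.json")  # hypothetical filename
```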
- OpenAI: GPT-5 (High), GPT-4o-mini
- Anthropic: Claude-Opus-4.1, Claude-Sonnet-4
- Google: Gemini-2.5-Pro, Gemini-2.5-Flash
- xAI: Grok-4
- DeepSeek: DeepSeek-V3.1
- Zhipu AI: GLM-4.5
- Moonshot AI: Kimi-K2
- Qwen: Qwen3-235B, Qwen3-Coder
- NousResearch: Hermes-4-70B
If you use RouteEval in your research, please cite:
```bibtex
@article{roy2025routeeval,
  title={RouteEval: A Benchmark for Evaluating LLM Tool Calling in Running Route Generation},
  author={Roy, Veejhay and Jin, Roger},
  year={2025}
}
```

You can also use the "Cite this repository" link on GitHub or import `CITATION.cff` into Zotero, Mendeley, or other reference managers.
This work is licensed under CC-BY-4.0. You are free to share and adapt the material with appropriate attribution.