Authors: Veejhay Roy, Roger Jin
This repository contains the evaluation data and results for RouteEval, a benchmark designed to evaluate Large Language Models' ability to generate accurate running routes through tool calling.
- Live demo: routecraft.io
- Full paper: RouteEval Research Paper (PDF)
Route generation presents unique challenges beyond general tool use: it requires spatial reasoning, precise numerical constraint satisfaction, and real-world validation. Our benchmark assesses 13 state-of-the-art models across 50 diverse prompts with 16 runs each (totaling 800 evaluations per model), measuring their capability to use a route generation tool effectively while adhering to distance constraints.
| Model | Success Rate | Avg Accuracy | Perfect Rate | High Accuracy (≥0.8) |
|---|---|---|---|---|
| GPT-5 (High) | 91.1% | 0.650 | 12.2% | 44.6% |
| Grok-4 | 98.5% | 0.555 | 6.2% | 26.5% |
| Gemini-2.5-Pro | 98.5% | 0.524 | 5.8% | 20.9% |
| Gemini-2.5-Flash | 99.5% | 0.520 | 6.9% | 23.1% |
| DeepSeek-V3.1 | 92.6% | 0.519 | 6.6% | 24.1% |
| Claude-Opus-4.1 | 99.1% | 0.477 | 2.9% | 25.5% |
| GLM-4.5 | 78.5% | 0.475 | 4.9% | 18.4% |
| Claude-Sonnet-4 | 100.0% | 0.458 | 4.9% | 27.5% |
| Kimi-K2 | 88.8% | 0.444 | 5.1% | 20.0% |
| GPT-4o-mini | 99.0% | 0.430 | 4.2% | 18.9% |
| Hermes-4-70B | 91.6% | 0.410 | 4.1% | 16.0% |
| Qwen3-235B | 77.9% | 0.373 | 2.9% | 15.9% |
| Qwen3-Coder | 75.0% | 0.315 | 2.0% | 8.8% |
- Perfect Rate: Routes with accuracy ≥ 0.95 (≤5% distance error)
- High Accuracy Rate: Routes with accuracy ≥ 0.80 (≤20% distance error)
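The accuracy metric and the two threshold columns above can be sketched as small functions (an illustrative sketch; the function names are ours, not from the evaluation code):

```python
def accuracy_score(actual: float, target: float) -> float:
    """Accuracy in [0, 1]: 1.0 at an exact match, 0.0 once the
    distance error reaches 100% of the target."""
    return max(0.0, 1.0 - abs(actual - target) / target)

def is_perfect(actual: float, target: float) -> bool:
    # accuracy >= 0.95, i.e. at most 5% distance error
    return accuracy_score(actual, target) >= 0.95

def is_high_accuracy(actual: float, target: float) -> bool:
    # accuracy >= 0.80, i.e. at most 20% distance error
    return accuracy_score(actual, target) >= 0.80
```

For example, a 5.2-mile route against a 5-mile target has 4% error (accuracy 0.96), so it counts as perfect; a 5.9-mile route has 18% error (accuracy 0.82), so it counts as high accuracy but not perfect.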
- `evaluation_data/` — Raw evaluation results per model (JSON). Each file contains ~800 evaluations (50 prompts × 16 runs) with prompts, waypoints, accuracy scores, and tool call details.
- `leaderboards/` — Summary statistics derived from the evaluation data (CSV).
- `paper/` — Full research paper (PDF) and abstract (LaTeX).
Each file in `evaluation_data/` is a JSON array of evaluation records. Each record has:
| Field | Type | Description |
|---|---|---|
| `prompt_index` | int | Index of the prompt (0–49) |
| `prompt` | string | Natural language route request |
| `run` | int | Run number (1–16) for this prompt |
| `target_distance` | float | Target distance in miles |
| `success` | bool | Whether the model produced a valid route |
| `waypoints` | array | List of waypoint addresses the model generated |
| `distance` | float | Actual route distance from Google Maps (0 if failed) |
| `accuracy_score` | float | 0–1, max(0, 1 − \|actual − target\| / target) |
| `tool_calls_used` | array | Raw tool call with arguments and result |
| `errors` | array | Error messages if validation failed |
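Given this record schema, the leaderboard columns can be recomputed directly from a model's evaluation file. A minimal sketch (the `summarize` helper and the example filename are ours; only the field names come from the schema above):

```python
import json

def summarize(path: str) -> dict:
    """Recompute the leaderboard columns from one model's evaluation file."""
    with open(path) as f:
        records = json.load(f)  # JSON array of evaluation records
    n = len(records)
    scores = [r["accuracy_score"] for r in records]
    return {
        "success_rate": sum(r["success"] for r in records) / n,
        "avg_accuracy": sum(scores) / n,
        "perfect_rate": sum(s >= 0.95 for s in scores) / n,       # ≤5% error
        "high_accuracy_rate": sum(s >= 0.80 for s in scores) / n, # ≤20% error
    }

# e.g. summarize("evaluation_data/gpt-5-high.json")  # hypothetical filename
```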
- OpenAI: GPT-5 (High), GPT-4o-mini
- Anthropic: Claude-Opus-4.1, Claude-Sonnet-4
- Google: Gemini-2.5-Pro, Gemini-2.5-Flash
- xAI: Grok-4
- DeepSeek: DeepSeek-V3.1
- Zhipu AI: GLM-4.5
- Moonshot AI: Kimi-K2
- Qwen: Qwen3-235B, Qwen3-Coder
- NousResearch: Hermes-4-70B
If you use RouteEval in your research, please cite:
```bibtex
@article{roy2025routeeval,
  title={RouteEval: A Benchmark for Evaluating LLM Tool Calling in Running Route Generation},
  author={Roy, Veejhay and Jin, Roger},
  year={2025}
}
```

You can also use the "Cite this repository" link on GitHub or import `CITATION.cff` into Zotero, Mendeley, or other reference managers.
This work is licensed under CC-BY-4.0. You are free to share and adapt the material with appropriate attribution.