agent-benchmark

Here are 8 public repositories matching this topic...

Cre4T3Tiv3 / ai-agents-reality-check

Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible methodology. Separates architectural theater from real systems through stress testing, network resilience, and failure analysis.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-workflow agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated Aug 8, 2025
Python

hidai25 / eval-view

Star

Proof your AI agent still works. Regression testing with golden baselines, tool-call diffing, and output drift detection. MCP server + Claude Code skills. LangGraph, CrewAI, Anthropic, OpenAI.

testing agent tools evaluation pytest ai-agents mlops llm langchain llmops anthropic openai-assistants crewai langgraph agentic-ai langgraph-python crewai-tools agent-evaluation agent-benchmark

Updated Mar 2, 2026
Python

collinear-ai / tau-trait

Star

TraitBasis applied to TauBench

rl-envs rl-training agent-benchmark

Updated Nov 11, 2025
Python

dataanswer / awesome-agent-benchmarks

Star

A curated collection of the world’s most advanced benchmark datasets for evaluating Large Language Model (LLM) Agents.

agent benchmarks awesome-list agent-based-modeling awesome-list-awesome-list ai-agent llm-agent llm-evaluation llm-agents agentic-ai guiagents agent-benchmark evaluation-dataset

Updated Dec 21, 2025

justindobbs / Tracecore

Star

The CI reliability gate for action-oriented agents.

reliability-engineering agents ai-agents benchmarking-framework autogen fastapi langchain observability-platform ai-evaluation-framework agent-benchmark deterministic-testing

Updated Mar 3, 2026
Python

axxafo / awesome-agent-benchmarks

Star

🧠 Discover and evaluate advanced benchmark datasets for Large Language Model agents to enhance performance assessment in real-world tasks.

search awesome ai benchmarks rl agent-based-modeling reasoning awesome-list-awesome-list ai-models ai-agent for-devs llm-agent agentic llm-evaluation llm-agents agentic-ai guiagents agent-benchmark evaluation-dataset

Updated Mar 3, 2026

edholofy / dojo.md

Star

University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.

Updated Mar 1, 2026
TypeScript

MohamedEmad219 / ai-agents-reality-check

Star

🤖 Benchmark AI agent capabilities, bridging the gap between hype and reality with clear metrics and insights for informed development decisions.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated Mar 3, 2026
Python

Improve this page

Add a description, image, and links to the agent-benchmark topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the agent-benchmark topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

agent-benchmark

Here are 8 public repositories matching this topic...

Cre4T3Tiv3 / ai-agents-reality-check

hidai25 / eval-view

collinear-ai / tau-trait

dataanswer / awesome-agent-benchmarks

justindobbs / Tracecore

axxafo / awesome-agent-benchmarks

edholofy / dojo.md

MohamedEmad219 / ai-agents-reality-check

Improve this page

Add this topic to your repo