TraceCore is a deterministic execution specification for autonomous agent systems. The /spec/ folder is the canonical standard; this repository contains the Python reference implementation (CLI runtime, harness, artifact serializer, and dashboard).
TraceCore aims to become a shared reliability standard for autonomous agent systems.
Brand note: TraceCore ships two CLI entry points:
tracecore(preferred) andagent-bench(legacy alias, kept for compatibility). Both resolve to the same runtime.
- Bounded Episodes — Frozen inputs (agent, task, seed, budgets, runtime identity) guarantee reproducibility across runs.
- Hard Budgets — Steps, tool calls, and optional wall-clock timers are enforced with no "best effort" exemptions.
- Deterministic Validation — Validators emit binary verdicts plus structured payloads tied to the failure taxonomy.
- Immutable Artifacts — Run artifacts conform to
/spec/artifact-schema-v1.0.jsonso any tool can validate them offline.
Full normative text lives in /spec/tracecore-spec-v1.0.md. Determinism requirements are detailed in /spec/determinism.md; auditors can use /spec/compliance-checklist-v0.1.md.
- A CLI runtime (
tracecore, withagent-benchkept as a legacy alias) that enforces the spec and ships as the reference implementation. - A compliance-focused artifact serializer that emits schema-valid JSON and baseline bundles.
- A FastAPI dashboard + APIs for replay, baseline diffs, and ledger inspection.
- Example tasks, agents, and CI workflows that prove spec conformance.
Other runtimes (Rust, Go, JS, etc.) can implement the spec by following /spec/ plus the artifact schema.
pip install tracecore
tracecore run pairing log_stream_monitor --seed 7 --strict-specOutputs include:
TraceCore Verified
agent: agents/toy_agent.py
task: log_stream_monitor@1
spec: tracecore-spec-v0.1
artifact_hash: sha256:...
TraceCore now tracks the latest run/bundle in .agent_bench/session.json, so day-to-day loops no
longer require copying run IDs:
tracecore run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 0
tracecore verify # defaults to latest run / bundle pair
tracecore bundle seal # seals from the latest successful run
tracecore bundle status # shows recent bundles + integrity stateUse this loop while iterating locally, then flip CI into --replay-bundle/--strict to gate changes.
Drop into --record only when you intentionally need to capture a new canonical baseline.
"TraceCore Verified" means:
- Agent version
Xran taskYunder specZ. - Budgets remained within the published limits.
- The artifact hash recorded in the ledger/baseline bundle matches the schema-defined serialization.
- Determinism metadata (seed, model pins, mocks) is embedded for replay.
Every run artifact now includes:
| Field | Purpose |
|---|---|
spec_version |
Declares the spec this runtime implements (tracecore-spec-v1.0). |
runtime_identity |
{name, version, git_sha} for the reference harness or alt runtimes. |
task_hash |
SHA-256 over the task harness (setup/actions/validate). |
agent_ref |
Alias for the agent module path invoked. |
artifact_hash |
Stable SHA-256 of the artifact (volatile timestamps stripped before hashing). |
budgets |
Frozen maximum steps/tool calls for the episode. |
wall_clock_elapsed_s |
Total episode wall time in seconds; required by spec v1.0. |
These fields are enforced at runtime and inspected by --strict-spec.
See /spec/compliance-checklist-v0.1.md for the auditable criteria.
Spec versions advance independently from package releases. Each runtime must declare which spec it implements:
| Runtime release | Implements spec |
|---|---|
tracecore 1.0.0 (current) |
tracecore-spec v1.0 |
tracecore 0.9.x |
tracecore-spec v0.1 |
Future runtimes MUST keep reporting spec_version inside every run artifact.
tracecore run --strict-spec is available today:
- Validates the freshly emitted artifact against
/spec/artifact-schema-v1.0.jsonbefore reporting success. - Ensures required metadata (
spec_version,runtime_identity,task_hash,artifact_hash,wall_clock_elapsed_s, frozenbudgets, determinism seed) is present and well-formed. - Confirms budgets never go negative and that
failure_typevalues stay inside the canonical taxonomy. - Prints the compliance verdict plus the artifact hash so you can share/record it in ledgers.
Use this flag in CI to fail fast on spec regressions. Details live in docs/architecture.md and the /spec/ bundle.
- What's new in v1.0
- Canonical spec bundle (
/spec/) - Google Colab Example — hosted copy ready to run without cloning the repo
- TraceCore technical specification explainer
- Deterministic Episode Runtime spec (
docs/core.md) - AutoGen adapter tutorial
- Task registry & spec freeze
- Release process & historical notes
- Troubleshooting
- Manual verification checklist
TraceCore v1.0 is the first stable release of the Deterministic Episode Runtime — frozen spec, hardened runner, and full operational metrics.
Highlights:
tracecoreCLI —tracecoreis now a first-class installed command.agent-benchstays as a legacy alias.- Spec v1.0 — all provisional language promoted to normative MUST;
wall_clock_elapsed_srequired in every artifact. - Parallel batch execution —
tracecore run batch --workers Nruns episodes concurrently in isolated subprocesses with per-job timeouts. - Metrics dashboard —
tracecore runs metrics,GET /api/metrics, and the/metricsUI page show reproducibility rates, budget P50/P95, failure taxonomy, and MTTR. - Dashboard fixes — Run button event-loop freeze and
__init__.pyagent dropdown noise, both resolved.
→ Full announcement and upgrade guide
Following the FastAPI guidance on creating virtual environments, isolate your TraceCore install before running any commands:
python -m venv .venv # Windows: use "py -3.12 -m venv .venv" if multiple Python versions
# Windows activation
.venv\Scripts\activate
# macOS / Linux activation
source .venv/bin/activateOnce activated, run the install commands below from the same shell session so tracecore (and the agent-bench alias) land in the expected interpreter. Deactivate with deactivate when you're done.
| Use case | Command | Notes |
|---|---|---|
| Stable CLI (recommended) | pip install tracecore |
Adds tracecore (plus the agent-bench alias) to your PATH. |
| uv users | uv pip install tracecore |
Same artifact, faster resolver. |
| pipx / uv tool | pipx install tracecore or uv tool install tracecore |
Creates isolated shim in %USERPROFILE%\.local\bin. |
| Development | git clone https://github.com/justindobbs/Tracecore && cd Tracecore && python -m venv .venv && .venv\Scripts\activate && pip install -e .[dev] |
Keeps CLI + tasks live-edited. |
| OpenAI Agents extra | pip install tracecore[openai_agents] |
Adds openai-agents (per https://openai.github.io/openai-agents-python/). |
Windows-specific install guidance (PATH, ExecutionPolicy, uv tool shims) lives in docs/troubleshooting.md#windows.
Linux/macOS
# Add Python user scripts to PATH (run once or add to ~/.bashrc or ~/.zshrc)
export PATH="$HOME/.local/bin:$PATH"Windows
# Add Python Scripts to PATH (run once or set via System Properties > Environment Variables)
$env:PATH += ";$env:APPDATA\Python\Python312\Scripts"Isolated install with pipx (recommended)
pip install pipx
pipx install tracecore
pipx ensurepath # Adds pipx shims to PATHFallback: run as module
python -m agent_bench.cli --helptracecore is a first-class installed entry point since v1.0.0 — no alias needed. agent-bench remains as a legacy alias for compatibility.
| Capability | Why it matters |
|---|---|
| Deterministic Episode Runtime | Every task freezes its environment, action schema, budgets, and validator, so a run_id is reproducible proof of behavior. See docs/core.md. |
| Sandboxed tasks | Task manifests declare filesystem roots + network hosts, enforced by GuardedEnv and surfaced in IO audits. |
| Binary scoring + telemetry | Success/failure is the headline; secondary metrics (steps, tool calls, IO audits, validator payloads) keep regressions obvious. |
| Minimal stack | Python-only harness + FastAPI dashboard. No Node build tooling, no external services. Runs in seconds on a laptop. |
| CLI & Web UI parity | tracecore commands (and the agent-bench alias), dashboard, and APIs all call the same runner, so automation matches what maintainers see. |
| Extensible registry | Built-in tasks live beside plugin tasks discovered via the agent_bench.tasks entry point group. |
TraceCore evaluates planner loops, not single prompts: tool sequencing, retry logic, state tracking, and boring-but-correct behavior under budgets.
# Run a known-good pairing
tracecore run pairing log_stream_monitor
tracecore run pairing log_stream_monitor --seed 7
See all available pairings:
```bash
tracecore run pairing --list
tracecore run pairing --all --timeout 120
# Run explicit agent + task
tracecore run --agent agents/toy_agent.py --task filesystem_hidden_config@1 --seed 42
# Launch the interactive wizard
tracecore interactive --dry-run --save-session
# Launch the dashboard
tracecore dashboard
or
tracecore dashboard --reload
# Summaries & baselines
tracecore runs summary --task log_stream_monitor@1 --limit 10
tracecore baseline --agent agents/toy_agent.py --task filesystem_hidden_config@1 --export latest
# Scaffold a new agent
tracecore new-agent my_agent
# Maintainer helper (pytest + task validation)
tracecore maintainNeed a turnkey example? See examples/simple_agent_demo for a self-contained CLI, examples/autogen_adapter_demo for the AutoGen adapter flow, or docs/pydantic_poc.md for the deterministic dice-game walkthrough.
Frozen tasks live in SPEC_FREEZE.md. Current operations-focused suites:
| Task | Suite | Goal | Signals |
|---|---|---|---|
filesystem_hidden_config@1 |
Filesystem | Discover the one true config key without wrecking the tree. | Selective exploration, state recall. |
rate_limited_api@1 |
API | Navigate a deterministic rate limit + transient errors to fetch ACCESS_TOKEN. |
Retry pacing, error classification. |
rate_limited_chain@1 |
API pain task | Multi-stage handshake + rate limit. | Sequencing, dependency tracking. |
deterministic_rate_service@1 |
API | Deterministic payload parsing + rate-limits. | Budget management, payload validation. |
log_alert_triage@1 |
Operations | Triage noisy logs to recover ALERT_CODE. |
Signal detection, tool economy. |
config_drift_remediation@1 |
Operations | Compare desired vs. live config and emit the remediation patch. | Diffing discipline, precise edits. |
incident_recovery_chain@1 |
Operations | Follow a hand-off chain to recover RECOVERY_TOKEN. |
Long-horizon reasoning, state carry-over. |
log_stream_monitor@1 |
Operations | Poll paginated logs, ignore noise, emit STREAM_CODE. |
Patience, trigger detection. |
runbook_verifier@1 |
Operations | Verify runbook phase execution order and emit RUNBOOK_CHECKSUM. |
Ordering discipline, multi-artifact stitching. |
sandboxed_code_auditor@1 |
Operations | Audit sandbox source + logs to emit ISSUE_ID|AUDIT_CODE. |
Scoped reads, multi-source extraction. |
Every task ships with a harness (setup.py, actions.py, validate.py, task.toml), published hashes, and budgets. Success is binary; steps/tool calls/IO audits provide color.
Agent script ──▶ Runner (GuardedEnv, budgets, validator)
│
├─► IO audit + action trace (JSON)
├─► Baseline exports (.agent_bench/baselines)
└─► FastAPI dashboard + REST APIs
- CLI (
tracecore, withagent-benchalias) — runs agents, validates tasks, exports baselines, maintains the repo. - Runner — enforces budgets, sandbox allowlists, structured failure taxonomy.
- Artifacts —
.agent_bench/runs/<run_id>.json(ground truth) + optionalbaseline-<ts>.jsonfor UI compare views. - APIs —
/api/pairings,/api/traces/{run_id}?include_io=true,/api/ledgerare typed via Pydantic models. - Dashboard — Jinja templates plus FastAPI endpoints; no Node build. Upload a run_id to replay, compare baselines, or visualize IO audits.
Baseline diffs (tracecore baseline --compare run_a run_b) highlight where traces diverge. For CI workflows, see docs/ci_workflow.md.
- Launch runs via forms or quick-pick pairings.
- Drill into traces, budget usage, validator payloads, IO audit summaries.
- Filter baselines and recent runs; download artifacts directly.
- Enable
--reloadonly during local dev (uvicorn auto-reload). For long-lived servers, omit the flag.
All dashboard actions have CLI equivalents so you can automate the same flows.
- Scaffold via
tracecore new-agent my_agent(columnar docstrings, budget guards baked in). - Interface contract lives in
docs/agents.mdanddocs/task_harness.md. - Reference agents:
toy_agent.py,rate_limit_agent.py,chain_agent.py,ops_triage_agent.py,cheater_agent.py(sandbox violation test).
- Built-in tasks register through
tasks/registry.json; update it plusdocs/tasks.mdandSPEC_FREEZE.mdwhen bumping versions. - Plugin pathway: publish a package exposing
agent_bench.tasksentry points. Template lives indocs/task_plugin_template.md. - Every task must include setup/actions/validator files, budgets in
task.toml, and passtracecore tasks validate --registry.
- Install/CLI issues —
docs/troubleshooting.mdcovers PATH fixes, validator errors, dashboard hiccups. - Task validation —
tracecore tasks validate --registryensures manifests + registry stay in lockstep. - Maintainer helper —
tracecore maintainruns pytest + task validation and applies mechanical fixes. - Manual verification — Run through
docs/manual_verification.mdbefore freezing specs or publishing changelogs.
Task budgets are defined per task.toml and cannot be overridden at runtime—agents must respect the published constraints.
- Version metadata lives in
pyproject.tomlandagent_bench/webui/app.py(FastAPI banner). - Changelog is maintained in
CHANGELOG.md; tags followvX.Y.Z. - Release checklist:
docs/release_process.md— changelog promotion, behavior verification, SPEC_FREEZE update, trust evidence bundle, tagging, publish. - Plan/shipping updates are captured in
docs/project_positioning.mdand issue tracker.
TraceCore is intentionally opinionated and evolving. Expect additive task suites, sandbox refinements, and runner upgrades—documented via CHANGELOG + SPEC_FREEZE.
TraceCore (Agent Bench CLI) is MIT Licensed. If you ship improvements (new tasks, agents, dashboard tweaks) open a PR or publish them as plugins. If you disagree with the assumptions, that’s fine: the benchmark is small enough to fork, but contributions that improve determinism, coverage, or ergonomics are always welcome.
One-line summary: Terminal Bench energy, but for agents that actually have to do things.

