Skip to content

Comments

ci: add inference smoke test on every PR#159

Merged
benvinegar merged 9 commits intomainfrom
ci/inference-smoke-test
Feb 24, 2026
Merged

ci: add inference smoke test on every PR#159
benvinegar merged 9 commits intomainfrom
ci/inference-smoke-test

Conversation

@benvinegar
Copy link
Member

What

Add an end-to-end inference smoke test that runs on every PR. Verifies the control-agent can complete a real LLM turn via session-control RPC.

Why

The existing runtime smoke test only checks process/socket liveness — it doesn't catch provider auth, model, or inference regressions. This closes that gap.

Changes

  • bin/ci/smoke-agent-inference.sh — new script: sends a prompt (Reply with exactly: CI_INFERENCE_OK) through the control-agent Unix socket, subscribes to turn_end, validates the response.
  • Uses anthropic/claude-haiku via BAUDBOT_MODEL override — cheap model for a trivial check.
  • start.sh — respects BAUDBOT_MODEL env var before auto-detecting from API keys.
  • .env.schema — added BAUDBOT_MODEL.
  • bin/ci/droplet.sh run — accepts optional KEY=VALUE args forwarded as env vars to the remote script.
  • integration.yml — passes CI_ANTHROPIC_API_KEY secret to every droplet run.
  • setup-ubuntu.sh / setup-arch.sh — inference smoke wired after runtime smoke.

Setup required

Add CI_ANTHROPIC_API_KEY as a GitHub Actions secret with a valid Anthropic API key.

Testing

  • npm run lint
  • npm test ✅ (138 tests)
  • ShellCheck ✅ (59 files)

Add an end-to-end inference smoke test that verifies the control-agent
can complete a real LLM turn via session-control RPC.

- bin/ci/smoke-agent-inference.sh: sends a prompt via Unix socket RPC,
  subscribes to turn_end, validates the response contains expected token.
- Uses anthropic/claude-haiku (cheap) via new BAUDBOT_MODEL env override.
- CI_ANTHROPIC_API_KEY injected into agent .env for provider auth.
- start.sh: respect BAUDBOT_MODEL override before auto-detect.
- .env.schema: add BAUDBOT_MODEL.
- integration.yml: pass CI_ANTHROPIC_API_KEY to every droplet run.
- bin/ci/droplet.sh run: accept optional KEY=VALUE env var forwarding.
- Wired into setup-ubuntu.sh and setup-arch.sh after runtime smoke.
@greptile-apps
Copy link

greptile-apps bot commented Feb 24, 2026

Greptile Summary

This PR adds end-to-end inference smoke testing to CI. The existing runtime smoke test only validated process liveness and socket connectivity — this new test actually sends a prompt through the control-agent and validates the LLM response, catching provider authentication, model configuration, or inference pipeline regressions.

Key changes:

  • New bin/ci/smoke-agent-inference.sh sends a test prompt via session-control RPC and validates the response contains the expected token
  • Uses anthropic/claude-haiku model via BAUDBOT_MODEL override to minimize CI token costs
  • start.sh now respects BAUDBOT_MODEL env var before falling back to API key auto-detection
  • bin/ci/droplet.sh run enhanced to forward arbitrary configuration variables to remote scripts
  • GitHub Actions workflow passes required credentials to every droplet run
  • Both Ubuntu and Arch setup scripts now run the inference smoke test after the runtime smoke test

The implementation follows established patterns from smoke-agent-runtime.sh with proper timeout handling, cleanup traps, and diagnostic output on failure. The Python RPC client properly handles the async event stream and validates the full turn completion.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • All changes follow established patterns, add valuable test coverage without modifying core logic, and have passed linting and existing tests. The inference smoke test is defensive with proper timeouts, cleanup, and error handling.
  • No files require special attention

Important Files Changed

Filename Overview
bin/ci/smoke-agent-inference.sh New inference smoke test that validates end-to-end LLM turn via RPC, well-structured with proper error handling and cleanup
start.sh Added BAUDBOT_MODEL override before auto-detection, allowing CI to use cheaper model for testing
.env.schema Added BAUDBOT_MODEL schema entry with proper documentation and @sensitive=false annotation
.github/workflows/integration.yml Added CI_ANTHROPIC_API_KEY secret forwarding to droplet run for inference smoke test
bin/ci/setup-ubuntu.sh Added inference smoke test call after runtime smoke test
bin/ci/setup-arch.sh Added inference smoke test call after runtime smoke test
bin/ci/droplet.sh Enhanced cmd_run to accept and forward configuration variables to remote script via export statements

Sequence Diagram

sequenceDiagram
    participant GHA as GitHub Actions
    participant Droplet as CI Droplet
    participant Script as smoke-agent-inference.sh
    participant Baudbot as baudbot service
    participant Agent as control-agent
    participant LLM as Anthropic API
    
    GHA->>Droplet: Forward CI_ANTHROPIC_API_KEY
    Droplet->>Script: Execute smoke test
    Script->>Script: inject_ci_config()<br/>(set ANTHROPIC_API_KEY + BAUDBOT_MODEL)
    Script->>Baudbot: sudo baudbot start
    Baudbot->>Agent: Launch with anthropic/claude-haiku
    Agent->>Agent: Create Unix socket
    Script->>Script: wait_for_control_socket()
    Script->>Agent: RPC send("Reply with exactly: CI_INFERENCE_OK")
    Script->>Agent: RPC subscribe("turn_end")
    Agent->>LLM: Inference request
    LLM->>Agent: Response with "CI_INFERENCE_OK"
    Agent->>Script: turn_end event with content
    Script->>Script: Validate token present
    Script->>Baudbot: sudo baudbot stop
    Script->>GHA: Exit 0 (success)
Loading

Last reviewed commit: 75374b4

The runtime smoke test leaves stale socket files after stopping
baudbot. When the inference smoke starts baudbot fresh, the alias
symlink still points to the old dead socket. Fix by verifying the
socket is connectable (not just that the file exists) before
proceeding with the RPC.
sed -i as root changes file ownership to root:root, making the
.env unreadable by baudbot_agent. Run all .env modifications as
the agent user to preserve 600 permissions.
The control-agent has a full skill loaded, so it won't parrot back
an exact token. Just verify we got a non-trivial response (>= 10
chars), which proves the full inference pipeline works.
Instead of testing for a parroted token, ask the control-agent to
do its actual job: run a health check (session liveness, heartbeat)
and report status. Assert the response contains a positive health
signal (healthy, running, ok, etc). This validates the full stack:
socket RPC → model inference → tool use → coherent response.
Ask the agent to respond with a typed JSON health report:
{status, session_alive, heartbeat_active, message}. Extract,
parse, validate the schema, then assert status=healthy and
session_alive=true. No fuzzy string matching.
- Subscribe to turn_end and drain the startup turn before sending
  the health check prompt (fixes race where we got the startup
  response instead of ours).
- Tail journalctl -u baudbot in the background so the CI log shows
  what the agent is doing in real time.
The control-agent startup is multi-turn (checklist, tool calls, etc).
Instead of trying to drain it, send the health check as a follow_up
with a unique marker string. Re-subscribe to turn_end after each
event and keep consuming until we find the response containing our
marker. Also bump timeouts (300s inference, 15m job) to accommodate
the full startup + health check flow.
The agent correctly reports 'degraded' because the bridge can't
auth with dummy Slack tokens. Both healthy and degraded are valid —
only 'unhealthy' indicates a core runtime failure.
@benvinegar benvinegar merged commit e004988 into main Feb 24, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant