ci: add inference smoke test on every PR #159
Merged
benvinegar merged 9 commits into main on Feb 24, 2026
Conversation
Add an end-to-end inference smoke test that verifies the control-agent can complete a real LLM turn via session-control RPC.

- bin/ci/smoke-agent-inference.sh: sends a prompt via Unix socket RPC, subscribes to turn_end, validates the response contains the expected token.
- Uses anthropic/claude-haiku (cheap) via the new BAUDBOT_MODEL env override.
- CI_ANTHROPIC_API_KEY injected into the agent .env for provider auth.
- start.sh: respect the BAUDBOT_MODEL override before auto-detect.
- .env.schema: add BAUDBOT_MODEL.
- integration.yml: pass CI_ANTHROPIC_API_KEY to every droplet run.
- bin/ci/droplet.sh run: accept optional KEY=VALUE env var forwarding.
- Wired into setup-ubuntu.sh and setup-arch.sh after the runtime smoke test.
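The KEY=VALUE forwarding in `bin/ci/droplet.sh run` could look something like the sketch below. The PR doesn't show the actual parsing, so the function name and quoting strategy are assumptions: any leading `KEY=VALUE` arguments are collected into an env prefix that is prepended to the remote command.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of how `bin/ci/droplet.sh run` might split optional
# KEY=VALUE arguments from the script arguments. Not the shipped code.
build_remote_prefix() {
  local prefix="" arg
  for arg in "$@"; do
    case "$arg" in
      [A-Za-z_][A-Za-z0-9_]*=*)
        # Quote the value so it survives the remote shell unchanged.
        prefix+="${arg%%=*}=$(printf '%q' "${arg#*=}") "
        ;;
    esac
  done
  printf '%s' "$prefix"
}

# e.g. ssh "$droplet" "$(build_remote_prefix "$@") bash -s" < setup-ubuntu.sh
```

Anything that doesn't look like an assignment (script names, flags) is left for the remote script's own argv.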
Greptile Summary

This PR adds end-to-end inference smoke testing to CI. The existing runtime smoke test only validated process liveness and socket connectivity — this new test actually sends a prompt through the control-agent and validates the LLM response, catching provider-auth, model-configuration, or inference-pipeline regressions. The implementation follows established patterns.

Confidence Score: 5/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant GHA as GitHub Actions
    participant Droplet as CI Droplet
    participant Script as smoke-agent-inference.sh
    participant Baudbot as baudbot service
    participant Agent as control-agent
    participant LLM as Anthropic API
    GHA->>Droplet: Forward CI_ANTHROPIC_API_KEY
    Droplet->>Script: Execute smoke test
    Script->>Script: inject_ci_config()<br/>(set ANTHROPIC_API_KEY + BAUDBOT_MODEL)
    Script->>Baudbot: sudo baudbot start
    Baudbot->>Agent: Launch with anthropic/claude-haiku
    Agent->>Agent: Create Unix socket
    Script->>Script: wait_for_control_socket()
    Script->>Agent: RPC send("Reply with exactly: CI_INFERENCE_OK")
    Script->>Agent: RPC subscribe("turn_end")
    Agent->>LLM: Inference request
    LLM->>Agent: Response with "CI_INFERENCE_OK"
    Agent->>Script: turn_end event with content
    Script->>Script: Validate token present
    Script->>Baudbot: sudo baudbot stop
    Script->>GHA: Exit 0 (success)
```
Last reviewed commit: 75374b4
The runtime smoke test leaves stale socket files after stopping baudbot. When the inference smoke starts baudbot fresh, the alias symlink still points to the old dead socket. Fix by verifying the socket is connectable (not just that the file exists) before proceeding with the RPC.
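The connectability check could be sketched as below. The socket path, retry count, and helper names are assumptions (the real script's helper is `wait_for_control_socket()`); python3 is used only as a portable connect probe.

```shell
#!/usr/bin/env bash
# Sketch, not the shipped script: confirm the control socket actually
# accepts a connection before issuing RPC.
set -euo pipefail

socket_connectable() {
  # `-S` only proves a socket *file* exists; a stale symlink left by the
  # runtime smoke test still passes it. Dial the socket to be sure.
  python3 - "$1" <<'PY'
import socket, sys
s = socket.socket(socket.AF_UNIX)
s.settimeout(2)
try:
    s.connect(sys.argv[1])
except OSError:
    sys.exit(1)
PY
}

wait_for_live_socket() {
  local sock="$1" tries="${2:-30}"
  for _ in $(seq 1 "$tries"); do
    if [ -S "$sock" ] && socket_connectable "$sock"; then
      return 0
    fi
    sleep 1
  done
  echo "socket $sock never became connectable" >&2
  return 1
}
```

A dead socket file refuses the connect with ECONNREFUSED, so the probe distinguishes it from a live listener even though both pass `[ -S ]`.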
sed -i as root changes file ownership to root:root, making the .env unreadable by baudbot_agent. Run all .env modifications as the agent user to preserve 600 permissions.
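A minimal sketch of the fix, assuming the user and path names from the comment above; the helper itself is an assumption about how the change might look:

```shell
#!/usr/bin/env bash
# Sketch: rewrite a key in the agent's .env as the file's owner, so that
# sed -i's replace-by-rename doesn't leave the file root-owned.
set_env_var() {
  local user="$1" file="$2" key="$3" value="$4"
  if [ "$(id -un)" = "$user" ]; then
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    # sed -i writes a temp file and renames it into place, so the result
    # is owned by whoever ran sed -- run it as the agent user instead.
    sudo -u "$user" sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  fi
}

# e.g. set_env_var baudbot_agent /home/baudbot_agent/.env \
#        ANTHROPIC_API_KEY "$CI_ANTHROPIC_API_KEY"
```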
The control-agent has a full skill loaded, so it won't parrot back an exact token. Just verify we got a non-trivial response (>= 10 chars), which proves the full inference pipeline works.
Instead of testing for a parroted token, ask the control-agent to do its actual job: run a health check (session liveness, heartbeat) and report status. Assert the response contains a positive health signal (healthy, running, ok, etc.). This validates the full stack: socket RPC → model inference → tool use → coherent response.
Ask the agent to respond with a typed JSON health report: {status, session_alive, heartbeat_active, message}. Extract, parse, validate the schema, then assert status=healthy and session_alive=true. No fuzzy string matching.
- Subscribe to turn_end and drain the startup turn before sending the health check prompt (fixes a race where we got the startup response instead of ours).
- Tail journalctl -u baudbot in the background so the CI log shows what the agent is doing in real time.
The control-agent startup is multi-turn (checklist, tool calls, etc.). Instead of trying to drain it, send the health check as a follow_up with a unique marker string. Re-subscribe to turn_end after each event and keep consuming until we find the response containing our marker. Also bump timeouts (300s inference, 15m job) to accommodate the full startup + health-check flow.
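The marker-consumption loop might look like the sketch below. `rpc` is a hypothetical stand-in for whatever session-control client the script actually uses, and the marker format and default budget (300s, per the comment above) are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: keep consuming turn_end events until one carries our marker.
consume_until_marker() {
  local marker="$1" budget="${2:-300}" deadline event
  deadline=$(( $(date +%s) + budget ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    # Startup is multi-turn, so any given turn_end may belong to the
    # checklist flow rather than our follow_up. Re-subscribe each time
    # and only stop once the event content contains the unique marker.
    event="$(rpc subscribe turn_end)" || return 1
    if printf '%s' "$event" | grep -q "$marker"; then
      return 0
    fi
  done
  echo "no turn_end containing ${marker} within ${budget}s" >&2
  return 1
}

# e.g.:
#   MARKER="CI_HEALTH_$(date +%s)_$$"
#   rpc send "Run a health check; include the literal marker ${MARKER}."
#   consume_until_marker "$MARKER"
```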
The agent correctly reports 'degraded' because the bridge can't auth with dummy Slack tokens. Both healthy and degraded are valid — only 'unhealthy' indicates a core runtime failure.
What
Add an end-to-end inference smoke test that runs on every PR. Verifies the control-agent can complete a real LLM turn via session-control RPC.
Why
The existing runtime smoke test only checks process/socket liveness — it doesn't catch provider auth, model, or inference regressions. This closes that gap.
Changes
- bin/ci/smoke-agent-inference.sh — new script: sends a prompt (Reply with exactly: CI_INFERENCE_OK) through the control-agent Unix socket, subscribes to turn_end, validates the response. Uses anthropic/claude-haiku via the BAUDBOT_MODEL override — a cheap model for a trivial check.
- start.sh — respects the BAUDBOT_MODEL env var before auto-detecting from API keys.
- .env.schema — added BAUDBOT_MODEL.
- bin/ci/droplet.sh run — accepts optional KEY=VALUE args forwarded as env vars to the remote script.
- integration.yml — passes the CI_ANTHROPIC_API_KEY secret to every droplet run.
- setup-ubuntu.sh / setup-arch.sh — inference smoke test wired after the runtime smoke test.

Setup required
Add CI_ANTHROPIC_API_KEY as a GitHub Actions secret with a valid Anthropic API key.

Testing
- npm run lint ✅
- npm test ✅ (138 tests)