ci: add inference smoke test on every PR by benvinegar · Pull Request #159 · modem-dev/baudbot

benvinegar · 2026-02-24T03:27:52Z

What

Add an end-to-end inference smoke test that runs on every PR. Verifies the control-agent can complete a real LLM turn via session-control RPC.

Why

The existing runtime smoke test only checks process/socket liveness — it doesn't catch provider auth, model, or inference regressions. This closes that gap.

Changes

bin/ci/smoke-agent-inference.sh — new script: sends a prompt (Reply with exactly: CI_INFERENCE_OK) through the control-agent Unix socket, subscribes to turn_end, validates the response.
Uses anthropic/claude-haiku via BAUDBOT_MODEL override — cheap model for a trivial check.
start.sh — respects BAUDBOT_MODEL env var before auto-detecting from API keys.
.env.schema — added BAUDBOT_MODEL.
bin/ci/droplet.sh run — accepts optional KEY=VALUE args forwarded as env vars to the remote script.
integration.yml — passes CI_ANTHROPIC_API_KEY secret to every droplet run.
setup-ubuntu.sh / setup-arch.sh — inference smoke wired after runtime smoke.

Setup required

Add CI_ANTHROPIC_API_KEY as a GitHub Actions secret with a valid Anthropic API key.

Testing

npm run lint ✅
npm test ✅ (138 tests)
ShellCheck ✅ (59 files)

Add an end-to-end inference smoke test that verifies the control-agent can complete a real LLM turn via session-control RPC. - bin/ci/smoke-agent-inference.sh: sends a prompt via Unix socket RPC, subscribes to turn_end, validates the response contains expected token. - Uses anthropic/claude-haiku (cheap) via new BAUDBOT_MODEL env override. - CI_ANTHROPIC_API_KEY injected into agent .env for provider auth. - start.sh: respect BAUDBOT_MODEL override before auto-detect. - .env.schema: add BAUDBOT_MODEL. - integration.yml: pass CI_ANTHROPIC_API_KEY to every droplet run. - bin/ci/droplet.sh run: accept optional KEY=VALUE env var forwarding. - Wired into setup-ubuntu.sh and setup-arch.sh after runtime smoke.

greptile-apps · 2026-02-24T03:29:50Z

Greptile Summary

This PR adds end-to-end inference smoke testing to CI. The existing runtime smoke test only validated process liveness and socket connectivity — this new test actually sends a prompt through the control-agent and validates the LLM response, catching provider authentication, model configuration, or inference pipeline regressions.

Key changes:

New bin/ci/smoke-agent-inference.sh sends a test prompt via session-control RPC and validates the response contains the expected token
Uses anthropic/claude-haiku model via BAUDBOT_MODEL override to minimize CI token costs
start.sh now respects BAUDBOT_MODEL env var before falling back to API key auto-detection
bin/ci/droplet.sh run enhanced to forward arbitrary configuration variables to remote scripts
GitHub Actions workflow passes required credentials to every droplet run
Both Ubuntu and Arch setup scripts now run the inference smoke test after the runtime smoke test

The implementation follows established patterns from smoke-agent-runtime.sh with proper timeout handling, cleanup traps, and diagnostic output on failure. The Python RPC client properly handles the async event stream and validates the full turn completion.

Confidence Score: 5/5

This PR is safe to merge with minimal risk
All changes follow established patterns, add valuable test coverage without modifying core logic, and have passed linting and existing tests. The inference smoke test is defensive with proper timeouts, cleanup, and error handling.
No files require special attention

Important Files Changed

Filename	Overview
bin/ci/smoke-agent-inference.sh	New inference smoke test that validates end-to-end LLM turn via RPC, well-structured with proper error handling and cleanup
start.sh	Added `BAUDBOT_MODEL` override before auto-detection, allowing CI to use cheaper model for testing
.env.schema	Added `BAUDBOT_MODEL` schema entry with proper documentation and `@sensitive=false` annotation
.github/workflows/integration.yml	Added `CI_ANTHROPIC_API_KEY` secret forwarding to droplet run for inference smoke test
bin/ci/setup-ubuntu.sh	Added inference smoke test call after runtime smoke test
bin/ci/setup-arch.sh	Added inference smoke test call after runtime smoke test
bin/ci/droplet.sh	Enhanced `cmd_run` to accept and forward configuration variables to remote script via export statements

Sequence Diagram

sequenceDiagram
    participant GHA as GitHub Actions
    participant Droplet as CI Droplet
    participant Script as smoke-agent-inference.sh
    participant Baudbot as baudbot service
    participant Agent as control-agent
    participant LLM as Anthropic API
    
    GHA->>Droplet: Forward CI_ANTHROPIC_API_KEY
    Droplet->>Script: Execute smoke test
    Script->>Script: inject_ci_config()<br/>(set ANTHROPIC_API_KEY + BAUDBOT_MODEL)
    Script->>Baudbot: sudo baudbot start
    Baudbot->>Agent: Launch with anthropic/claude-haiku
    Agent->>Agent: Create Unix socket
    Script->>Script: wait_for_control_socket()
    Script->>Agent: RPC send("Reply with exactly: CI_INFERENCE_OK")
    Script->>Agent: RPC subscribe("turn_end")
    Agent->>LLM: Inference request
    LLM->>Agent: Response with "CI_INFERENCE_OK"
    Agent->>Script: turn_end event with content
    Script->>Script: Validate token present
    Script->>Baudbot: sudo baudbot stop
    Script->>GHA: Exit 0 (success)

_{Last reviewed commit: 75374b4}

The runtime smoke test leaves stale socket files after stopping baudbot. When the inference smoke starts baudbot fresh, the alias symlink still points to the old dead socket. Fix by verifying the socket is connectable (not just that the file exists) before proceeding with the RPC.

sed -i as root changes file ownership to root:root, making the .env unreadable by baudbot_agent. Run all .env modifications as the agent user to preserve 600 permissions.

The control-agent has a full skill loaded, so it won't parrot back an exact token. Just verify we got a non-trivial response (>= 10 chars), which proves the full inference pipeline works.

Instead of testing for a parroted token, ask the control-agent to do its actual job: run a health check (session liveness, heartbeat) and report status. Assert the response contains a positive health signal (healthy, running, ok, etc). This validates the full stack: socket RPC → model inference → tool use → coherent response.

Ask the agent to respond with a typed JSON health report: {status, session_alive, heartbeat_active, message}. Extract, parse, validate the schema, then assert status=healthy and session_alive=true. No fuzzy string matching.

- Subscribe to turn_end and drain the startup turn before sending the health check prompt (fixes race where we got the startup response instead of ours). - Tail journalctl -u baudbot in the background so the CI log shows what the agent is doing in real time.

The control-agent startup is multi-turn (checklist, tool calls, etc). Instead of trying to drain it, send the health check as a follow_up with a unique marker string. Re-subscribe to turn_end after each event and keep consuming until we find the response containing our marker. Also bump timeouts (300s inference, 15m job) to accommodate the full startup + health check flow.

bin/ci/smoke-agent-inference.sh

The agent correctly reports 'degraded' because the bridge can't auth with dummy Slack tokens. Both healthy and degraded are valid — only 'unhealthy' indicates a core runtime failure.

benvinegar added 7 commits February 23, 2026 22:46

ci: preserve .env ownership by running sed as agent user

df5273c

sed -i as root changes file ownership to root:root, making the .env unreadable by baudbot_agent. Run all .env modifications as the agent user to preserve 600 permissions.

ci: assert non-empty inference response instead of exact token

28c1737

The control-agent has a full skill loaded, so it won't parrot back an exact token. Just verify we got a non-trivial response (>= 10 chars), which proves the full inference pipeline works.

ci: validate inference health check via structured JSON schema

0675850

Ask the agent to respond with a typed JSON health report: {status, session_alive, heartbeat_active, message}. Extract, parse, validate the schema, then assert status=healthy and session_alive=true. No fuzzy string matching.

sentry bot reviewed Feb 24, 2026

View reviewed changes

bin/ci/smoke-agent-inference.sh Show resolved Hide resolved

ci: accept degraded status (Slack bridge has dummy tokens in CI)

050179e

The agent correctly reports 'degraded' because the bridge can't auth with dummy Slack tokens. Both healthy and degraded are valid — only 'unhealthy' indicates a core runtime failure.

benvinegar merged commit e004988 into main Feb 24, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

ci: add inference smoke test on every PR#159

ci: add inference smoke test on every PR#159
benvinegar merged 9 commits intomainfrom
ci/inference-smoke-test

benvinegar commented Feb 24, 2026

Uh oh!

greptile-apps bot commented Feb 24, 2026

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

benvinegar commented Feb 24, 2026

What

Why

Changes

Setup required

Testing

Uh oh!

greptile-apps bot commented Feb 24, 2026

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant