Skip to content

Comments

fix: replace subshell bridge supervisor with tmux session#158

Merged
benvinegar merged 2 commits intomainfrom
fix/bridge-tmux-supervisor
Feb 24, 2026
Merged

fix: replace subshell bridge supervisor with tmux session#158
benvinegar merged 2 commits intomainfrom
fix/bridge-tmux-supervisor

Conversation

@baudbot-agent
Copy link
Collaborator

Problem

The Slack bridge ran as a subshell (( ... ) &) child of startup-cleanup.sh. Three issues:

  1. Bridge doesn't survive restarts — when pi is restarted independently (not through start.sh), the bridge subshell dies with no recovery
  2. Script hangs — the subshell keeps the script's process group alive, so startup-cleanup.sh never returns when called from pi's bash tool
  3. EADDRINUSE crash loops — PID-file cleanup misses orphaned bridge processes, so the new bridge can't bind port 7890 and crash-loops

Fix

Replace the subshell supervisor with a slack-bridge tmux session — same pattern as sentry-agent, which has been rock solid.

  • tmux session with a 5-second restart loop replaces the subshell ( ... ) &
  • lsof -ti :7890 | xargs kill -9 replaces PID-file tracking — kills whatever is on the port, guaranteed
  • tmux kill-session cleans up any existing bridge session before starting fresh
  • Script completes immediatelytmux new-session -d detaches, no hang
  • Backward compat: still cleans up old-style PID files from previous deploys

Also removes the bridge-restart-policy.sh dependency — the tmux restart loop is simpler and sufficient.

Testing

  • Tested live on the running agent — bridge starts, no hang, survives script completion
  • Pre-existing test failures (heartbeat, memory) unrelated to this change (13/15 pass, same on main)

The bridge ran as a subshell (`( ... ) &`) child of start.sh or
startup-cleanup.sh. When pi was restarted independently, the bridge
died with no recovery. The subshell also kept the script's process
group alive, causing startup-cleanup.sh to hang.

Replace with a `slack-bridge` tmux session with a built-in restart
loop — same pattern as sentry-agent, which has been stable.

Changes:
- Subshell supervisor → tmux session with 5s restart loop
- PID-file cleanup → `lsof -ti :7890 | xargs kill -9` (catches
  orphaned processes the PID file missed)
- Kill existing slack-bridge tmux session before creating a new one
- Remove bridge-restart-policy.sh dependency (no longer needed)
- Script completes immediately (tmux detaches, no hang)
@greptile-apps
Copy link

greptile-apps bot commented Feb 24, 2026

Greptile Summary

Replaces the subshell-based Slack bridge supervisor with a tmux session pattern (matching sentry-agent), fixing three issues: bridge survival across pi restarts, script hanging, and EADDRINUSE crash loops from orphaned processes.

Major changes:

  • Removes bridge-restart-policy.sh dependency and bb_bridge_supervise function
  • Uses lsof -ti :7890 | xargs kill -9 for guaranteed port cleanup instead of PID-file tracking
  • Bridge now runs in detached tmux session with 5-second restart loop
  • Backward compatible cleanup for old PID files from previous deploys

Issues found:

  • Hardcoded node path ~/opt/node-v22.14.0-linux-x64/bin will break if node version changes or runtime overrides are configured (differs from start.sh which uses bb_resolve_runtime_node_bin_dir)

Confidence Score: 3/5

  • Safe to merge with one fix needed: node path resolution
  • The tmux-based approach is sound and matches the proven sentry-agent pattern. Port cleanup via lsof is more reliable than PID files. However, the hardcoded node version path creates a maintenance risk and breaks the abstraction used in start.sh. This should be fixed to use ~/opt/node/bin or source runtime-node.sh for dynamic resolution.
  • Fix the hardcoded node path in pi/skills/control-agent/startup-cleanup.sh:128

Important Files Changed

Filename Overview
pi/skills/control-agent/startup-cleanup.sh Replaces subshell bridge supervisor with tmux session; hardcoded node path differs from start.sh pattern

Sequence Diagram

sequenceDiagram
    participant S as startup-cleanup.sh
    participant P as Port 7890
    participant T as tmux session
    participant B as slack-bridge

    Note over S,P: Old approach: subshell supervisor
    S->>P: Check PID file, kill old supervisor
    S->>S: Start subshell background
    S->>S: Track supervisor PID in file
    Note over S: Script hangs if called from pi

    Note over S,B: New approach: tmux session
    S->>P: lsof -ti :7890 and kill -9
    S->>T: tmux kill-session if exists
    S->>T: tmux new-session -d detached
    loop Restart loop in tmux
        T->>B: varlock run -- node bridge.mjs
        B-->>T: exit with code
        T->>T: sleep 5 seconds
    end
    Note over S: Script returns immediately
    Note over T,B: Bridge restarts automatically
Loading

Last reviewed commit: c1110d5

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

echo "Starting slack-bridge ($BRIDGE_SCRIPT) via tmux..."
tmux new-session -d -s "$BRIDGE_TMUX_SESSION" "\
unset PKG_EXECPATH; \
export PATH=\$HOME/.varlock/bin:\$HOME/opt/node-v22.14.0-linux-x64/bin:\$PATH; \
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded node path $HOME/opt/node-v22.14.0-linux-x64/bin differs from start.sh which uses bb_resolve_runtime_node_bin_dir to dynamically locate node.

This will break if:

  • Node version is upgraded
  • BAUDBOT_RUNTIME_NODE_BIN_DIR override is set
  • Installation uses ~/opt/node/bin symlink pattern
Suggested change
export PATH=\$HOME/.varlock/bin:\$HOME/opt/node-v22.14.0-linux-x64/bin:\$PATH; \
export PATH=\$HOME/.varlock/bin:\$HOME/opt/node/bin:\$PATH; \

Or source runtime-node.sh and use bb_resolve_runtime_node_bin_dir to match start.sh behavior.

Prompt To Fix With AI
This is a comment left during a code review.
Path: pi/skills/control-agent/startup-cleanup.sh
Line: 128

Comment:
Hardcoded node path `$HOME/opt/node-v22.14.0-linux-x64/bin` differs from `start.sh` which uses `bb_resolve_runtime_node_bin_dir` to dynamically locate node.

This will break if:
- Node version is upgraded
- `BAUDBOT_RUNTIME_NODE_BIN_DIR` override is set
- Installation uses `~/opt/node/bin` symlink pattern

```suggestion
  export PATH=\$HOME/.varlock/bin:\$HOME/opt/node/bin:\$PATH; \
```

Or source `runtime-node.sh` and use `bb_resolve_runtime_node_bin_dir` to match `start.sh` behavior.

How can I resolve this? If you propose a fix, please make it concise.

Use ~/opt/node/bin with fallback to versioned glob. Fixes the
runtime-node-paths drift check in CI.
@baudbot-agent
Copy link
Collaborator Author

Fixed in 4ec7393 — node path is now resolved dynamically (~/opt/node/bin with fallback to versioned glob). Good catch from the drift check.

@benvinegar benvinegar merged commit fc4fb6e into main Feb 24, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants