fix: graceful bridge restarts — SIGTERM handling + EADDRINUSE retry by baudbot-agent · Pull Request #161 · modem-dev/baudbot

baudbot-agent · 2026-02-24T04:18:18Z

Problem

Every deploy/restart, the bridge crash-loops on EADDRINUSE for 30+ seconds because:

startup-cleanup.sh kills port 7890 with kill -9 — no graceful shutdown, port may linger
The tmux restart loop immediately respawns the bridge before the port is free
The bridge has zero tolerance for EADDRINUSE — it crashes instantly
The restart loop keeps spawning new instances that all fight over the port

This happens reliably on every deploy and requires manual debugging each time.

Fix

Three layers of defense:

1. Bridge: SIGTERM/SIGINT handler (`broker-bridge.mjs`)

Catches SIGTERM, calls server.close() to release the port cleanly, then exits. 5s forced-exit timeout as safety net.

2. Bridge: EADDRINUSE retry (`broker-bridge.mjs`)

Instead of crashing on EADDRINUSE, retries up to 5 times with 2s backoff. If the port is briefly held by a dying predecessor, the bridge just waits.

3. Startup: ordered cleanup (`startup-cleanup.sh`)

Kill the tmux session first — stops the restart loop from respawning the bridge while we clean up
SIGTERM the port holder (not SIGKILL) and wait up to 3s for graceful exit
SIGKILL only as fallback for stuck processes
Restart loop checks port availability before each relaunch

Testing

All 138 existing tests pass
Manual verification: the race condition that caused EADDRINUSE crash loops on every startup is eliminated

Three layers of defense against port conflicts during bridge restarts: 1. broker-bridge.mjs: SIGTERM/SIGINT handler closes the HTTP server cleanly before exiting, so the port is released immediately instead of lingering in TIME_WAIT. 2. broker-bridge.mjs: EADDRINUSE retry with backoff (5 attempts, 2s apart) so if the port IS briefly held, the bridge waits instead of crashing. 3. startup-cleanup.sh: Kill the tmux restart loop FIRST (prevents respawning), then SIGTERM the port holder and wait up to 3s for graceful exit before falling back to SIGKILL. The restart loop also checks port availability before each relaunch.

sentry · 2026-02-24T04:20:47Z

slack-bridge/broker-bridge.mjs

+  function tryListen() {
+    bindAttempt++;
+    server.listen(API_PORT, "127.0.0.1");
+  }


Bug: The EADDRINUSE error handler retries server.listen() without first calling server.close(). This will cause an ERR_SERVER_ALREADY_LISTEN error and crash the process on the first retry attempt.
_{Severity: CRITICAL}

Suggested Fix

In the EADDRINUSE error handler, call server.close() before attempting to call server.listen() again. The setTimeout callback should be modified to first close the server and then re-invoke listen.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: slack-bridge/broker-bridge.mjs#L1065 Potential issue: The retry logic for a port conflict (`EADDRINUSE`) is flawed. When a port is in use, the code schedules a retry of `server.listen()`. However, Node.js requires `server.close()` to be called before `server.listen()` can be attempted again after a listen-related error. Without this, the retry will immediately throw an `ERR_SERVER_ALREADY_LISTEN` error. This unhandled error will trigger the `else` branch of the error handler, causing the process to exit via `process.exit(1)`. This defeats the purpose of the retry mechanism and guarantees a crash on the first port conflict.

_{Did we get this right? 👍 / 👎 to inform future reviews.}

greptile-apps · 2026-02-24T04:21:18Z

Greptile Summary

This PR eliminates a recurring EADDRINUSE crash loop that occurred on every deploy/restart by adding three layers of defense:

broker-bridge.mjs: Adds a gracefulShutdown() handler on SIGTERM/SIGINT that closes the HTTP server cleanly before exiting (with a 5s forced-exit safety net), and adds EADDRINUSE retry logic (up to 5 retries with 2s backoff) so the bridge can wait for a dying predecessor to release the port.
startup-cleanup.sh: Reorders cleanup to kill the tmux session first (preventing the restart loop from respawning the bridge during cleanup), switches from immediate SIGKILL to SIGTERM with a 3s grace period before falling back to SIGKILL, and adds a port-availability check in the restart loop before relaunching.

The changes are well-scoped and focused on the specific race condition. The three layers (graceful shutdown, bind retry, ordered cleanup) provide good defense-in-depth. Existing tests pass and the integration test suite already exercises SIGTERM teardown of the bridge.

Confidence Score: 4/5

This PR is safe to merge — it adds defensive reliability improvements without changing core business logic or security behavior.
Score of 4 reflects that the changes are well-structured, focused on operational reliability, and don't modify security-critical code paths. One minor style concern (missing server.close() per Node.js docs) is non-blocking. No security, authentication, or rate-limiting behavior is affected.
No files require special attention. Both changes are straightforward operational improvements.

Important Files Changed

Filename	Overview
slack-bridge/broker-bridge.mjs	Adds SIGTERM/SIGINT graceful shutdown handler and EADDRINUSE retry logic. Well-structured, minor style concern about missing `server.close()` before retry per Node.js docs.
pi/skills/control-agent/startup-cleanup.sh	Reorders cleanup to kill tmux first, switches from SIGKILL to SIGTERM with graceful wait and SIGKILL fallback, adds port-availability check in restart loop. Clean and correct.

Sequence Diagram

sequenceDiagram
    participant SC as startup-cleanup.sh
    participant TM as tmux session
    participant OB as old bridge process
    participant NB as new bridge process
    participant Port as Port 7890

    Note over SC: Deploy/restart triggered
    SC->>TM: kill-session (stop restart loop)
    TM-->>OB: (no longer supervised)
    SC->>OB: SIGTERM
    OB->>OB: gracefulShutdown()
    OB->>Port: server.close()
    Port-->>OB: port released
    OB->>OB: process.exit(0)
    SC->>SC: wait up to 3s for port free
    alt Port still held
        SC->>OB: SIGKILL (fallback)
    end
    SC->>TM: new tmux session
    TM->>NB: start bridge
    NB->>Port: server.listen(7890)
    alt EADDRINUSE
        NB->>NB: retry (up to 5x, 2s backoff)
        NB->>Port: server.listen(7890)
    end
    Port-->>NB: listening
    Note over NB: Bridge operational

_{Last reviewed commit: cccf3b6}

greptile-apps

_{2 files reviewed, 1 comment}

_{Edit Code Review Agent Settings | Greptile}

greptile-apps · 2026-02-24T04:21:21Z

slack-bridge/broker-bridge.mjs

+  server.on("error", (err) => {
+    if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) {
+      logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`);
+      setTimeout(tryListen, BIND_RETRY_DELAY_MS);
+    } else {


Missing server.close() before retry
The Node.js documentation recommends calling server.close() before retrying server.listen() after an EADDRINUSE error, to properly reset the server's internal state. While the current code may work in practice (the failed bind doesn't create resources), adding server.close() follows the documented pattern and avoids potential edge cases in future Node.js versions.

Suggested change

server.on("error", (err) => {

if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) {

logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`);

setTimeout(tryListen, BIND_RETRY_DELAY_MS);

} else {

if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) {

logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`);

server.close();

setTimeout(tryListen, BIND_RETRY_DELAY_MS);

Prompt To Fix With AI

This is a comment left during a code review. Path: slack-bridge/broker-bridge.mjs Line: 1072-1076 Comment: **Missing `server.close()` before retry** The [Node.js documentation](https://nodejs.org/api/net.html#event-error) recommends calling `server.close()` before retrying `server.listen()` after an EADDRINUSE error, to properly reset the server's internal state. While the current code may work in practice (the failed bind doesn't create resources), adding `server.close()` follows the documented pattern and avoids potential edge cases in future Node.js versions. ```suggestion if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) { logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`); server.close(); setTimeout(tryListen, BIND_RETRY_DELAY_MS); ``` How can I resolve this? If you propose a fix, please make it concise.

sentry bot reviewed Feb 24, 2026

View reviewed changes

greptile-apps bot reviewed Feb 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

fix: graceful bridge restarts — SIGTERM handling + EADDRINUSE retry#161

fix: graceful bridge restarts — SIGTERM handling + EADDRINUSE retry#161
baudbot-agent wants to merge 1 commit intomainfrom
fix/bridge-graceful-restart

baudbot-agent commented Feb 24, 2026

Uh oh!

sentry bot Feb 24, 2026

Uh oh!

greptile-apps bot commented Feb 24, 2026

Uh oh!

greptile-apps bot left a comment

Uh oh!

greptile-apps bot Feb 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Comments

Conversation

baudbot-agent commented Feb 24, 2026

Problem

Fix

1. Bridge: SIGTERM/SIGINT handler (broker-bridge.mjs)

2. Bridge: EADDRINUSE retry (broker-bridge.mjs)

3. Startup: ordered cleanup (startup-cleanup.sh)

Testing

Uh oh!

sentry bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot commented Feb 24, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Uh oh!

greptile-apps bot Feb 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

1. Bridge: SIGTERM/SIGINT handler (`broker-bridge.mjs`)

2. Bridge: EADDRINUSE retry (`broker-bridge.mjs`)

3. Startup: ordered cleanup (`startup-cleanup.sh`)