Skip to content

Comments

fix: graceful bridge restarts — SIGTERM handling + EADDRINUSE retry#161

Open
baudbot-agent wants to merge 1 commit intomainfrom
fix/bridge-graceful-restart
Open

fix: graceful bridge restarts — SIGTERM handling + EADDRINUSE retry#161
baudbot-agent wants to merge 1 commit intomainfrom
fix/bridge-graceful-restart

Conversation

@baudbot-agent
Copy link
Collaborator

Problem

Every deploy/restart, the bridge crash-loops on EADDRINUSE for 30+ seconds because:

  1. startup-cleanup.sh kills port 7890 with kill -9 — no graceful shutdown, port may linger
  2. The tmux restart loop immediately respawns the bridge before the port is free
  3. The bridge has zero tolerance for EADDRINUSE — it crashes instantly
  4. The restart loop keeps spawning new instances that all fight over the port

This happens reliably on every deploy and requires manual debugging each time.

Fix

Three layers of defense:

1. Bridge: SIGTERM/SIGINT handler (broker-bridge.mjs)

Catches SIGTERM, calls server.close() to release the port cleanly, then exits. 5s forced-exit timeout as safety net.

2. Bridge: EADDRINUSE retry (broker-bridge.mjs)

Instead of crashing on EADDRINUSE, retries up to 5 times with 2s backoff. If the port is briefly held by a dying predecessor, the bridge just waits.

3. Startup: ordered cleanup (startup-cleanup.sh)

  • Kill the tmux session first — stops the restart loop from respawning the bridge while we clean up
  • SIGTERM the port holder (not SIGKILL) and wait up to 3s for graceful exit
  • SIGKILL only as fallback for stuck processes
  • Restart loop checks port availability before each relaunch

Testing

  • All 138 existing tests pass
  • Manual verification: the race condition that caused EADDRINUSE crash loops on every startup is eliminated

Three layers of defense against port conflicts during bridge restarts:

1. broker-bridge.mjs: SIGTERM/SIGINT handler closes the HTTP server
   cleanly before exiting, so the port is released immediately
   instead of lingering in TIME_WAIT.

2. broker-bridge.mjs: EADDRINUSE retry with backoff (5 attempts,
   2s apart) so if the port IS briefly held, the bridge waits
   instead of crashing.

3. startup-cleanup.sh: Kill the tmux restart loop FIRST (prevents
   respawning), then SIGTERM the port holder and wait up to 3s
   for graceful exit before falling back to SIGKILL. The restart
   loop also checks port availability before each relaunch.
function tryListen() {
bindAttempt++;
server.listen(API_PORT, "127.0.0.1");
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The EADDRINUSE error handler retries server.listen() without first calling server.close(). This will cause an ERR_SERVER_ALREADY_LISTEN error and crash the process on the first retry attempt.
Severity: CRITICAL

Suggested Fix

In the EADDRINUSE error handler, call server.close() before attempting to call server.listen() again. The setTimeout callback should be modified to first close the server and then re-invoke listen.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: slack-bridge/broker-bridge.mjs#L1065

Potential issue: The retry logic for a port conflict (`EADDRINUSE`) is flawed. When a
port is in use, the code schedules a retry of `server.listen()`. However, Node.js
requires `server.close()` to be called before `server.listen()` can be attempted again
after a listen-related error. Without this, the retry will immediately throw an
`ERR_SERVER_ALREADY_LISTEN` error. This unhandled error will trigger the `else` branch
of the error handler, causing the process to exit via `process.exit(1)`. This defeats
the purpose of the retry mechanism and guarantees a crash on the first port conflict.

Did we get this right? 👍 / 👎 to inform future reviews.

@greptile-apps
Copy link

greptile-apps bot commented Feb 24, 2026

Greptile Summary

This PR eliminates a recurring EADDRINUSE crash loop that occurred on every deploy/restart by adding three layers of defense:

  • broker-bridge.mjs: Adds a gracefulShutdown() handler on SIGTERM/SIGINT that closes the HTTP server cleanly before exiting (with a 5s forced-exit safety net), and adds EADDRINUSE retry logic (up to 5 retries with 2s backoff) so the bridge can wait for a dying predecessor to release the port.
  • startup-cleanup.sh: Reorders cleanup to kill the tmux session first (preventing the restart loop from respawning the bridge during cleanup), switches from immediate SIGKILL to SIGTERM with a 3s grace period before falling back to SIGKILL, and adds a port-availability check in the restart loop before relaunching.

The changes are well-scoped and focused on the specific race condition. The three layers (graceful shutdown, bind retry, ordered cleanup) provide good defense-in-depth. Existing tests pass and the integration test suite already exercises SIGTERM teardown of the bridge.

Confidence Score: 4/5

  • This PR is safe to merge — it adds defensive reliability improvements without changing core business logic or security behavior.
  • Score of 4 reflects that the changes are well-structured, focused on operational reliability, and don't modify security-critical code paths. One minor style concern (missing server.close() per Node.js docs) is non-blocking. No security, authentication, or rate-limiting behavior is affected.
  • No files require special attention. Both changes are straightforward operational improvements.

Important Files Changed

Filename Overview
slack-bridge/broker-bridge.mjs Adds SIGTERM/SIGINT graceful shutdown handler and EADDRINUSE retry logic. Well-structured, minor style concern about missing server.close() before retry per Node.js docs.
pi/skills/control-agent/startup-cleanup.sh Reorders cleanup to kill tmux first, switches from SIGKILL to SIGTERM with graceful wait and SIGKILL fallback, adds port-availability check in restart loop. Clean and correct.

Sequence Diagram

sequenceDiagram
    participant SC as startup-cleanup.sh
    participant TM as tmux session
    participant OB as old bridge process
    participant NB as new bridge process
    participant Port as Port 7890

    Note over SC: Deploy/restart triggered
    SC->>TM: kill-session (stop restart loop)
    TM-->>OB: (no longer supervised)
    SC->>OB: SIGTERM
    OB->>OB: gracefulShutdown()
    OB->>Port: server.close()
    Port-->>OB: port released
    OB->>OB: process.exit(0)
    SC->>SC: wait up to 3s for port free
    alt Port still held
        SC->>OB: SIGKILL (fallback)
    end
    SC->>TM: new tmux session
    TM->>NB: start bridge
    NB->>Port: server.listen(7890)
    alt EADDRINUSE
        NB->>NB: retry (up to 5x, 2s backoff)
        NB->>Port: server.listen(7890)
    end
    Port-->>NB: listening
    Note over NB: Bridge operational
Loading

Last reviewed commit: cccf3b6

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +1072 to +1076
server.on("error", (err) => {
if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) {
logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`);
setTimeout(tryListen, BIND_RETRY_DELAY_MS);
} else {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing server.close() before retry
The Node.js documentation recommends calling server.close() before retrying server.listen() after an EADDRINUSE error, to properly reset the server's internal state. While the current code may work in practice (the failed bind doesn't create resources), adding server.close() follows the documented pattern and avoids potential edge cases in future Node.js versions.

Suggested change
server.on("error", (err) => {
if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) {
logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`);
setTimeout(tryListen, BIND_RETRY_DELAY_MS);
} else {
if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) {
logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`);
server.close();
setTimeout(tryListen, BIND_RETRY_DELAY_MS);
Prompt To Fix With AI
This is a comment left during a code review.
Path: slack-bridge/broker-bridge.mjs
Line: 1072-1076

Comment:
**Missing `server.close()` before retry**
The [Node.js documentation](https://nodejs.org/api/net.html#event-error) recommends calling `server.close()` before retrying `server.listen()` after an EADDRINUSE error, to properly reset the server's internal state. While the current code may work in practice (the failed bind doesn't create resources), adding `server.close()` follows the documented pattern and avoids potential edge cases in future Node.js versions.

```suggestion
    if (err.code === "EADDRINUSE" && bindAttempt < MAX_BIND_RETRIES) {
      logWarn(`⚠️ port ${API_PORT} in use, retrying in ${BIND_RETRY_DELAY_MS}ms (attempt ${bindAttempt}/${MAX_BIND_RETRIES})`);
      server.close();
      setTimeout(tryListen, BIND_RETRY_DELAY_MS);
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant