@flfeurmou-indeed
Contributor

Summary

This PR fixes the issue where MCP servers fail to restart after laptop sleep/wake cycles due to zombie supervisor processes.

Problem

After laptop sleep, old supervisor processes may:

  1. Still have PID files but not actually be running
  2. Be running but unresponsive to SIGTERM signals
  3. Hold ports and prevent new servers from starting

The current code only checks whether a PID file exists (GetWorkloadPID), not whether the process is actually alive. Additionally, KillProcess only sends SIGTERM, which zombie processes may ignore.

Solution

1. Proper process liveness detection (manager.go)

isSupervisorProcessAlive now uses process.FindProcess to actually check if the process is running (via signal 0), not just if a PID file exists. If a process is dead but its PID file remains, the stale file is cleaned up automatically.

2. Forceful kill for zombie processes (kill_unix.go)

KillProcess now:

  1. Sends SIGTERM for graceful shutdown
  2. Waits 500ms
  3. If process is still alive, sends SIGKILL to force termination

This handles zombie processes that survive laptop sleep and don't respond to SIGTERM.

Testing

  • Updated existing tests to mock the new process checking behavior
  • All existing tests pass
  • Tested manually with Datadog and Glean MCP servers through sleep/wake cycles

Fixes #3429

Checklist

  • Code follows project style guidelines
  • Tests added/updated
  • Documentation updated (N/A - internal change)
  • Signed-off-by trailer included

Contributor

Copilot AI left a comment

Pull request overview

This PR fixes zombie supervisor processes that can persist after laptop sleep/wake cycles, preventing MCP servers from restarting due to stale PID files and unresponsive processes holding ports.

Changes:

  • Enhanced process liveness detection to verify processes are actually running, not just that PID files exist
  • Added forceful termination (SIGKILL) for zombie processes that don't respond to SIGTERM
  • Updated tests to mock the new process detection behavior

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| pkg/workloads/manager.go | Added ProcessFinder injection point and improved isSupervisorProcessAlive to actually check if processes are running, with stale PID cleanup |
| pkg/workloads/manager_test.go | Added mockFindProcess field to test structs and injected mocks into DefaultManager for testing |
| pkg/process/kill_unix.go | Implemented two-stage kill (SIGTERM then SIGKILL after 500ms) to handle zombie processes |
| pkg/process/kill_windows.go | Added SPDX license header only, no functional changes |
| FLFEURMOU_LOCAL_FIXES.md | Critical issue: Personal development tracking file that should not be committed |

@amirejaz
Contributor

@flfeurmou-indeed Copilot added some comments

This commit fixes two issues that cause MCP servers to fail after laptop
sleep/wake cycles:

1. isSupervisorProcessAlive now actually checks if the process is running
   using process.FindProcess (signal 0) instead of just checking if a PID
   file exists. Stale PID files are cleaned up automatically.

2. KillProcess now sends SIGKILL after SIGTERM if the process doesn't
   terminate gracefully. This handles zombie processes that may survive
   laptop sleep and don't respond to SIGTERM.

Implementation notes:
- Uses Toolhive's existing FindProcess function (signal 0 check)
- Uses dependency injection (ProcessFinder field) for testability
- No data races reported by go test -race
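
As an illustration of the dependency injection note above, a process finder can be held as a struct field and swapped for a fake in tests. This is only a sketch: the ProcessFinder signature, the manager type, and the in-memory pids map (standing in for PID files) are assumptions, not the actual DefaultManager code.

```go
package main

import "fmt"

// ProcessFinder reports whether a PID refers to a running process.
// The signature is an assumption made for this sketch.
type ProcessFinder func(pid int) (bool, error)

// manager is a hypothetical stand-in for DefaultManager. PIDs live in a map
// here in place of PID files on disk.
type manager struct {
	findProcess ProcessFinder
	pids        map[string]int
}

// isSupervisorAlive checks the recorded PID through the injected finder and
// drops the record when the process is gone.
func (m *manager) isSupervisorAlive(workload string) bool {
	pid, ok := m.pids[workload]
	if !ok {
		return false
	}
	alive, err := m.findProcess(pid)
	if err != nil {
		// Lookup failed: report not alive but keep the record for a retry.
		return false
	}
	if !alive {
		delete(m.pids, workload) // stale record
	}
	return alive
}

func main() {
	m := &manager{
		// Fake finder simulating a dead supervisor whose PID record remains.
		findProcess: func(int) (bool, error) { return false, nil },
		pids:        map[string]int{"example-workload": 4242},
	}
	fmt.Println(m.isSupervisorAlive("example-workload"), len(m.pids))
}
```

In production code the field would be set to the real lookup function, while tests inject a fake like the one in main.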

These changes ensure that:
- Dead supervisor processes are detected even if PID files remain
- Stubborn/zombie processes are forcefully terminated
- MCP servers can restart cleanly after sleep/wake events

Fixes stacklok#3429

Signed-off-by: Frederic Le Feurmou <flfeurmou@indeed.com>
@flfeurmou-indeed force-pushed the fix/zombie-process-detection branch from bdae7de to c9e1f32 on January 27, 2026 01:47
- Use errors.Is(err, os.ErrProcessDone) instead of string comparison
- Remove extra blank line after imports in manager.go
- Remove duplicate DefaultManager comment
- Add test case for zombie scenario (PID exists but process dead)
@github-actions bot added the size/S (Small PR: 100-299 lines changed) label on Jan 28, 2026
@codecov

codecov bot commented Jan 28, 2026

Codecov Report

❌ Patch coverage is 40.00000% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 65.29%. Comparing base (f9ecdf7) to head (17d4c7e).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pkg/process/kill_unix.go | 33.33% | 8 Missing ⚠️ |
| pkg/workloads/manager.go | 46.15% | 4 Missing and 3 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3430      +/-   ##
==========================================
- Coverage   65.33%   65.29%   -0.04%     
==========================================
  Files         403      403              
  Lines       39231    39253      +22     
==========================================
+ Hits        25631    25632       +1     
- Misses      11615    11634      +19     
- Partials     1985     1987       +2     


@flfeurmou-indeed
Contributor Author

Addressed the Copilot review feedback:

Error handling: Changed string comparison to errors.Is(err, os.ErrProcessDone) for robustness
Extra blank line: Removed the extra blank line after imports
Duplicate comment: Removed duplicate "DefaultManager" comment, ProcessFinder now has its own proper comment
Zombie test case: Added test case for the scenario where PID file exists but process is dead
Personal file: Removed FLFEURMOU_LOCAL_FIXES.md from the PR

⏸️ 500ms wait time: Kept as-is for now. The value is documented in the comment. Happy to make it configurable if that's a blocker.

@theJC
Contributor

theJC commented Jan 30, 2026

❤️ ❤️
I am so happy to see this MR. This is a constant problem for me and has kept me away from using toolhive lately because it's so disruptive! Thank you @flfeurmou-indeed for digging into this issue!

@github-actions bot added and removed the size/S (Small PR: 100-299 lines changed) label on Jan 30, 2026
@amirejaz
Contributor

@flfeurmou-indeed Let’s keep this PR open for now, as it may no longer be needed since we’ve merged the DCR persistence PR. If the issue still persists, we can discuss a better solution for handling this in the file state manager.

Labels

size/S Small PR: 100-299 lines changed

Development

Successfully merging this pull request may close these issues.

Remote MCP servers leave zombie processes after laptop sleep, blocking restart

3 participants