-
Notifications
You must be signed in to change notification settings - Fork 173
fix: properly detect and kill zombie supervisor processes #3430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: properly detect and kill zombie supervisor processes #3430
Conversation
1e25563 to
bdae7de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes zombie supervisor processes that can persist after laptop sleep/wake cycles, preventing MCP servers from restarting due to stale PID files and unresponsive processes holding ports.
Changes:
- Enhanced process liveness detection to verify processes are actually running, not just that PID files exist
- Added forceful termination (SIGKILL) for zombie processes that don't respond to SIGTERM
- Updated tests to mock the new process detection behavior
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/workloads/manager.go | Added ProcessFinder injection point and improved isSupervisorProcessAlive to actually check if processes are running, with stale PID cleanup |
| pkg/workloads/manager_test.go | Added mockFindProcess field to test structs and injected mocks into DefaultManager for testing |
| pkg/process/kill_unix.go | Implemented two-stage kill (SIGTERM then SIGKILL after 500ms) to handle zombie processes |
| pkg/process/kill_windows.go | Added SPDX license header only, no functional changes |
| FLFEURMOU_LOCAL_FIXES.md | Critical issue: Personal development tracking file that should not be committed |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@flfeurmou-indeed Copilot added some comments |
This commit fixes two issues that cause MCP servers to fail after laptop sleep/wake cycles: 1. isSupervisorProcessAlive now actually checks if the process is running using process.FindProcess (signal 0) instead of just checking if a PID file exists. Stale PID files are cleaned up automatically. 2. KillProcess now sends SIGKILL after SIGTERM if the process doesn't terminate gracefully. This handles zombie processes that may survive laptop sleep and don't respond to SIGTERM. Implementation notes: - Uses Toolhive's existing FindProcess function (signal 0 check) - Uses dependency injection (ProcessFinder field) for testability - No race conditions - verified with go test -race These changes ensure that: - Dead supervisor processes are detected even if PID files remain - Stubborn/zombie processes are forcefully terminated - MCP servers can restart cleanly after sleep/wake events Fixes stacklok#3429 Signed-off-by: Frederic Le Feurmou <flfeurmou@indeed.com>
bdae7de to
c9e1f32
Compare
- Use errors.Is(err, os.ErrProcessDone) instead of string comparison - Remove extra blank line after imports in manager.go - Remove duplicate DefaultManager comment - Add test case for zombie scenario (PID exists but process dead)
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3430 +/- ##
==========================================
- Coverage 65.33% 65.29% -0.04%
==========================================
Files 403 403
Lines 39231 39253 +22
==========================================
+ Hits 25631 25632 +1
- Misses 11615 11634 +19
- Partials 1985 1987 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Addressed the Copilot review feedback: ✅ Error handling: Changed string comparison to ⏸️ 500ms wait time: Kept as-is for now. The value is documented in the comment. Happy to make it configurable if that's a blocker. |
|
❤️ ❤️ |
|
@flfeurmou-indeed Let’s keep this PR open for now, as it may no longer be needed since we’ve merged the DCR persistence PR. If the issue still persists, we can discuss a better solution for handling this in the file state manager. |
Summary
This PR fixes the issue where MCP servers fail to restart after laptop sleep/wake cycles due to zombie supervisor processes.
Problem
After laptop sleep, old supervisor processes may:
The current code only checks if a PID file exists (
GetWorkloadPID), not if the process is actually alive. Additionally,KillProcessonly sends SIGTERM which zombie processes may ignore.Solution
1. Proper process liveness detection (
manager.go)isSupervisorProcessAlivenow usesprocess.FindProcessto actually check if the process is running (via signal 0), not just if a PID file exists. If a process is dead but its PID file remains, the stale file is cleaned up automatically.2. Forceful kill for zombie processes (
kill_unix.go)KillProcessnow:This handles zombie processes that survive laptop sleep and don't respond to SIGTERM.
Testing
Fixes #3429
Checklist