Skip to content

fix(core): use native graceful process tree shutdown#33655

Draft
leosvelperez wants to merge 12 commits intomasterfrom
nxc-3480
Draft

fix(core): use native graceful process tree shutdown#33655
leosvelperez wants to merge 12 commits intomasterfrom
nxc-3480

Conversation

@leosvelperez
Copy link
Member

@leosvelperez leosvelperez commented Nov 28, 2025

Current Behavior

When running continuous tasks (e.g. nx e2e with a dev server dependency), shutting down via Ctrl+C or SIGTERM can leave orphaned processes and leaked resources (e.g. Docker containers still running). The JS tree-kill package kills all processes flat — parents die before children can run cleanup handlers. Children also receive redundant signals (OS SIGINT + orchestrator SIGTERM + per-process handlers), causing race conditions and unpredictable shutdown ordering.

Expected Behavior

Graceful, ordered shutdown: leaf processes are signaled first, allowed to run cleanup handlers (e.g. docker compose stop), then parents are signaled. At most 1 orchestrator-dispatched signal per child. No orphaned processes, no leaked resources, cleanup subprocesses are protected during shutdown.

Changes

Rust process killer (process_killer/mod.rs)

Two native functions via sysinfo process table, replacing the tree-kill npm package:

  • killProcessTree (sync, fire-and-forget) — snapshots tree, signals all descendants at once. For process.on('exit') handlers. Falls back to SIGKILL on Windows.
  • killProcessTreeGraceful (async, bottom-up) — signals leaves first, waits for exit, promotes parents, repeats. Tracks and protects cleanup subprocesses spawned during shutdown. SIGKILL escalation after grace period (default 5s).

exec → spawn with detached groups (running-tasks.ts)

  • exec()spawn({ shell: true, detached: true }) — children get own process group via setsid(), don't receive OS SIGINT on Ctrl+C. Only orchestrator dispatches signals.
  • killPromise caching — concurrent callers (e.g. cleanup() after fire-and-forget kill from cleanUpUnneededContinuousTasks) await the same in-progress kill instead of no-oping. Prevents orphan detached processes on normal task completion.
  • Signal handlers guarded by NX_FORKED_TASK_EXECUTOR — only active in forked path (no orchestrator).

Orchestrator (task-orchestrator.ts)

  • process.on() (persistent) instead of process.once() — prevents signal-exit re-raise before async cleanup completes.
  • Single handleSignal with stopRequested guard for SIGINT/SIGTERM/SIGHUP.
  • Kill deduplication — continuous tasks already being killed are skipped in the run-commands kill pass.
  • Awaits forked runner cleanup.

Forked process task runner (forked-process-task-runner.ts)

  • Async cleanup() awaited by orchestrator.
  • Removed SIGTERM/SIGHUP handlers — orchestrator owns dispatch.
  • SIGINT handler calls cleanup without process.exit().

Other migrations

  • pseudo-terminal.ts, node-child-process.ts, run-script.impl.ts: replace tree-kill with native killers.
  • Remove tree-kill dependency.

Tests

  • Integration tests for Rust napi bindings: tree traversal, SIGTERM handler execution, grace period, SIGKILL fallback, dead PID.
  • Updated run-commands tests for spawn assertions.

Related Issue(s)

Fixes #32438

@leosvelperez leosvelperez self-assigned this Nov 28, 2025
@leosvelperez leosvelperez requested review from a team as code owners November 28, 2025 16:10
@vercel
Copy link

vercel bot commented Nov 28, 2025

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Preview Updated (UTC)
nx-dev Ready Ready Preview Nov 28, 2025 4:13pm

@netlify
Copy link

netlify bot commented Nov 28, 2025

Deploy Preview for nx-docs ready!

Name Link
🔨 Latest commit 47f266b
🔍 Latest deploy log https://app.netlify.com/projects/nx-docs/deploys/69b4320494fcab0008470ad5
😎 Deploy Preview https://deploy-preview-33655--nx-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@nx-cloud
Copy link
Contributor

nx-cloud bot commented Nov 28, 2025

View your CI Pipeline Execution ↗ for commit 47f266b

Command Status Duration Result
nx affected --targets=lint,test,build,e2e,e2e-c... ❌ Failed 35m 21s View ↗
nx run-many -t check-imports check-lock-files c... ✅ Succeeded 4m 5s View ↗
nx-cloud record -- nx-cloud conformance:check ✅ Succeeded 9s View ↗
nx-cloud record -- nx format:check ✅ Succeeded 2s View ↗
nx-cloud record -- nx sync:check ✅ Succeeded <1s View ↗

☁️ Nx Cloud last updated this comment at 2026-03-13 16:49:09 UTC

@github-actions
Copy link
Contributor

🐳 We have a release for that!

This PR has a release associated with it. You can try it out using this command:

npx create-nx-workspace@0.0.0-pr-33655-040845c my-workspace

Or just copy this version and use it in your own command:

0.0.0-pr-33655-040845c
Release details 📑
Published version 0.0.0-pr-33655-040845c
Triggered by @leosvelperez
Branch nxc-3480
Commit 040845c
Workflow run 19768936373

To request a new release for this pull request, mention someone from the Nx team or the @nrwl/nx-pipelines-reviewers.

@agcty
Copy link

agcty commented Feb 26, 2026

Hey @leosvelperez — I've been working on fixing the graceful shutdown regression that's been blocking this PR. I built on top of your nxc-3480 branch and have a working solution.

Branch: agcty/nx:fix/32438-graceful-process-tree-kill (6 commits on top of your branch)

What it does

Added killProcessTreeGraceful() — an async Rust function that replaces the fire-and-forget killProcessTree in shutdown paths:

  1. Snapshots the process tree via BFS before sending signals
  2. Sends SIGTERM to all descendants (leaves first)
  3. Polls every 100ms for up to 5s (configurable) for graceful exit
  4. SIGKILL survivors after the grace period
  5. Detects reparented descendants (when root exits quickly, children get reparented to init)

Then wired it into all the TypeScript kill() methods and fixed several race conditions in the signal handling:

  • forked-process-task-runner: cleanup() is now async, SIGINT handler awaits it before process.exit()
  • running-tasks.ts: Per-child SIGINT handlers no longer call process.exit() (was causing a race where first child to finish killed the parent before others could clean up)
  • task-orchestrator.ts: Now awaits forked runner cleanup alongside other running tasks
  • pseudo-terminal.ts: kill() goes through graceful tree kill directly; shutdown() stays sync for exit handlers

Test results

Tested with the reproduction repo (3 concurrent services via continuous: true):

  • ✅ No orphaned processes after Ctrl+C
  • ✅ Ports freed (cleanup handlers ran server.close())
  • ✅ Cleanup handler execution confirmed via log file output
  • ✅ 6 new integration tests for killProcessTree / killProcessTreeGraceful (all passing)

What's left

  • Haven't run the full NX test suite (nx affected -t build,test,lint, e2e) — would appreciate if you could run that against CI
  • Only tested Unix (macOS). Windows behavior is unchanged from your PR (all signals map to TerminateProcess)
  • The sysinfo crate's kill_with(Signal::Kill) has a quirk on macOS with detached processes — pre-existing from your branch, not introduced here

Happy to adjust anything based on your feedback. This has been a fun one to debug!

@agcty
Copy link

agcty commented Feb 26, 2026

@leosvelperez had a go at it with opus 4.6 and codex 5.3 xhigh, seems working in my local repro, wonder what you think about the solution

@netlify
Copy link

netlify bot commented Feb 27, 2026

Deploy Preview for nx-dev ready!

Name Link
🔨 Latest commit 47f266b
🔍 Latest deploy log https://app.netlify.com/projects/nx-dev/deploys/69b43204c1c50b00087838fc
😎 Deploy Preview https://deploy-preview-33655--nx-dev.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@leosvelperez
Copy link
Member Author

@agcty thanks so much for your continued investigation and work here!

I'll take a look at your changes. At a quick glance, it's a subset of what I'm trying to do, but it might provide a quick/intermediate point we could use before I wrap up my changes. So, I'll review/test it locally and analyze whether it's worth using as a quick win. Will let you know later today.

This branch was outdated, so I updated it. I apologize for any issues it might cause with your work on top of it.

@leosvelperez
Copy link
Member Author

leosvelperez commented Feb 27, 2026

@agcty did you test it against the repro/too-aggressive branch or your repo? I cherry-picked your commits into a separate local branch (having rebased master) to test things locally against that branch of your repro, but running bun nx dev frontend and killing it leaves the Docker instance behind.

I can follow it up and fix whatever is left, but I want to understand what exactly you tested so I have the full picture of what's expected vs what I see.

@agcty
Copy link

agcty commented Mar 2, 2026

@leosvelperez I think it's not quite there yet, it seemed like it worked but I think I was on the other branch, sorry about that. This seems very very hard to debug, e.g nx changes env vars when it finds a claude md file amongst some other things which makes it not that easy to reason about. Will look into it some more today! Did you have another fix in mind that be architecturally more sound?

@leosvelperez
Copy link
Member Author

@agcty, yes, I'm working on a bigger change to centralize/coordinate process shutdown requests/flow, but I'll explore your solution a bit more to see if, with some tweaks, we could still have a quicker win.

@agcty
Copy link

agcty commented Mar 2, 2026

@leosvelperez I see, thanks for the update!

@leosvelperez leosvelperez changed the title fix(core): add rust-based process tree killer and kill pseudo terminal process tree fix(core): native graceful process tree shutdown Mar 4, 2026
@leosvelperez leosvelperez changed the title fix(core): native graceful process tree shutdown fix(core): use native graceful process tree shutdown Mar 4, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Mar 4, 2026

Failed to publish a PR release of this pull request, triggered by @leosvelperez.
See the failed workflow run at: https://github.com/nrwl/nx/actions/runs/22667277116

@github-actions
Copy link
Contributor

github-actions bot commented Mar 4, 2026

🐳 We have a release for that!

This PR has a release associated with it. You can try it out using this command:

npx create-nx-workspace@22.6.0-pr.33655.b3efa09 my-workspace

Or just copy this version and use it in your own command:

22.6.0-pr.33655.b3efa09
Release details 📑
Published version 22.6.0-pr.33655.b3efa09
Triggered by @leosvelperez
Branch nxc-3480
Commit b3efa09
Workflow run 22670992340

To request a new release for this pull request, mention someone from the Nx team or the @nrwl/nx-pipelines-reviewers.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

@leosvelperez leosvelperez force-pushed the nxc-3480 branch 2 times, most recently from df03045 to 4de9081 Compare March 12, 2026 16:52
nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

nx-cloud[bot]

This comment was marked as outdated.

leosvelperez and others added 12 commits March 13, 2026 16:48
Building on the Rust-based process tree killer, this adds a
`killProcessTreeGraceful` async function that implements
SIGTERM → wait → SIGKILL escalation, preventing two issues:

1. Orphaned processes: the tree walker sends signals to all
   descendants regardless of process group, so deep child processes
   (e.g. doppler → bunx → bun → app) all receive SIGTERM.

2. Premature cleanup interruption: the main process now stays alive
   during a configurable grace period (default 5s), keeping the PTY
   master open and preventing the kernel from sending SIGHUP to
   children before they finish their cleanup handlers.

The fire-and-forget `killProcessTree` is kept for synchronous
`process.on('exit')` handlers as a last resort.

Fixes #32438

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- forked-process-task-runner: await kill promises before process.exit()
- running-tasks: per-child SIGINT handlers no longer call process.exit()
  to prevent first-child-exits race
- pseudo-terminal: shutdown() uses sync killProcessTree for exit handler
- pseudo-terminal: kill() uses killProcessTreeGraceful directly
- process_killer: graceful kill detects reparented descendants

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
TaskOrchestrator.cleanup() now awaits the forked process runner's
cleanup promise alongside other running tasks, ensuring all process
trees are fully shut down before the orchestrator continues.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…reeGraceful

Tests cover:
- Killing a simple process
- Killing a multi-level process tree (parent + children)
- Graceful SIGTERM with cleanup handler verification
- Slow cleanup within grace period (verifies process isn't killed prematurely)
- Multi-level tree graceful shutdown
- Already-dead process (no-op fast path)

Skipped on Windows where Unix signals don't apply.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rewrite killProcessTreeGraceful to signal leaf processes first, wait
for them to exit, then signal their parents (now leaves). Keeps the
PTY shell alive while descendants run cleanup handlers.

- Signal only new leaves each iteration (signaled HashSet prevents
  duplicate SIGTERMs that interrupted execSync cleanup)
- Reuse single System instance and Vec buffer across iterations
- Defensive SIGKILL on !has_leaves to prevent process leaks
- Accept string | number signal type (consistent with killProcessTree)
- Add ?. null check in SeriallyRunningTasks.kill()
- Add missing signalToCode import in forked-process-task-runner
Rust:
- Log JoinError instead of silently swallowing with .ok()
- Inline collect_process_tree_inner, drop unused require_root param
- Remove duplicate buf copy before ProcessesToUpdate::Some

Tests:
- Fix SIGTERM test to use single node process (bottom-up kills shell
  children first, preventing trap from firing)
- Fix tree test to poll for marker file instead of silently skipping
  grandchild assertion
- Remove misleading graceful tree test (cleanup marker never written
  due to bottom-up kill order)
- Add SIGKILL fallback test for SIGTERM-ignoring processes
- Add afterEach cleanup, remove unused imports, widen timing bounds
- Early-return in collect_process_tree when root PID not found, skipping unnecessary process table iteration
- Return children map from collect_process_tree to avoid redundant reconstruction in graceful kill path
- Replace duck-typed .then check with Promise.resolve() wrapper in signal handler
…hutdown

Orchestrator is now the single authority for signal dispatch in the
direct-execution path. Children spawned with detached process groups
(setsid) don't receive OS SIGINT; the orchestrator sends SIGTERM via
killProcessTreeGraceful during cleanup.

- exec() -> spawn(detached) in RunningNodeProcess for process group isolation
- Guard per-child signal handlers with NX_FORKED_TASK_EXECUTOR (direct path
  delegates to orchestrator, forked path keeps handlers as primary mechanism)
- Remove FPTR SIGTERM/SIGHUP handlers that consumed events before orchestrator
- Remove FPTR SIGINT process.exit() (orchestrator owns exit decision)
- Unify orchestrator signal handlers via handleSignal with process.on() to
  absorb signal-exit v3 re-raises, stopRequested as re-entrancy guard
- Deduplicate kills between continuousSnapshot and runningRunCommandsTasks
- Add killing flag and closedPromise to RunningNodeProcess for idempotent
  graceful shutdown with stdio drain
- Rust: unified map_signal across platforms, force_kill_fallback for
  fire-and-forget killProcessTree, cleanup subprocess discovery,
  ProcessesToUpdate::All for tracking children spawned during shutdown
- Windows: detached=false (children share console, get CTRL_C_EVENT),
  unsupported signals fall back to SIGKILL via force_kill_fallback
cleanUpUnneededContinuousTasks() fires kill() without await. With the
boolean `killing` flag, cleanup() at the end of run() would call kill()
again but no-op (flag already set), leaving the async kill unwaited.
Detached children survive the parent's exit.

Replace the boolean with a cached killPromise. Concurrent callers
(cleanup after fire-and-forget kill) now await the same in-progress
operation.
Copy link
Contributor

@nx-cloud nx-cloud bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nx Cloud has identified a possible root cause for your failed CI:

We classified these CI failures as environment state rather than a code change. The detox:build-base failure is a TypeScript build dependency ordering issue where the react package output was not available — unrelated to our signal handling and process killer changes. The @nx/nx-source:populate-local-registry-storage failure (exit code 130) is a pre-existing flaky interruption with a 3% historical failure rate; a rerun should resolve both.

No code changes were suggested for this issue.

Trigger a rerun:

Rerun CI

Nx Cloud View detailed reasoning on Nx Cloud ↗

🔔 Heads up, your workspace has pending recommendations ↗ to auto-apply fixes for similar failures.


🎓 Learn more about Self-Healing CI on nx.dev

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Node processes not closing after ending NX session

2 participants