Aurashk/improve process manager terminate behaviour by Aurashk · Pull Request #733 · DUNE-DAQ/drunc

Aurashk · 2025-12-05T15:24:26Z

Description

Fixes #658

Modifies both flavours of SSH process manager (shell and paramiko) to query/kill remote processes directly rather than through the local ssh client. This is achieved by running headless remote processes and storing the pid of the remote process in metadata. Then the pid can be used to send signals through ssh directly to the remote process.

This has some desired effects:

We get informative exit codes from the remote processes rather than the ssh client processes
More control of cleanup through sending signals directly, otherwise all we can do is kill the client which sends a SIGHUP to the remote process
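
A minimal sketch of the idea, not the PR's actual code: start the remote command detached from the ssh session, record its PID, then address later signals to that PID over a separate ssh call. The helper names, the `nohup ... & echo $!` launch line and the PID-file path are illustrative assumptions.

```python
import shlex


def build_launch_argv(host: str, command: str, pid_file: str) -> list[str]:
    """Build an ssh argv that starts `command` in the background on `host`.

    The remote shell backgrounds the command under nohup, so it is not tied
    to the ssh client, and writes the PID of that background job ($!) to
    `pid_file` (hypothetical location for the stored metadata).
    """
    remote = (
        f"nohup {command} > /dev/null 2>&1 & "
        f"echo $! > {shlex.quote(pid_file)}"
    )
    return ["ssh", host, remote]


def build_signal_argv(host: str, pid: int, sig: str = "QUIT") -> list[str]:
    """Build an ssh argv that sends `sig` straight to the remote PID."""
    return ["ssh", host, f"kill -s {sig} {int(pid)}"]
```

Either list could be handed to `subprocess.run`; building the argv separately keeps the sketch testable without a remote host.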

Note: There is an adjacent comment in the issue about fixing the terminate order to match K8s. I think that would be straightforward to add to this PR but I will wait for feedback on the approach first

Type of change

Documentation (non-breaking change that adds or improves the documentation)
New feature (non-breaking change which adds functionality)
Optimization (non-breaking, back-end change that speeds up the code)
Bug fix (non-breaking change which fixes an issue)
Breaking change (whatever its nature)

Key checklist

All tests pass (eg. python -m pytest)
Pre-commit hooks run successfully (eg. pre-commit run --all-files)

Further checks

Code is commented, particularly in hard-to-understand areas
Tests added or an issue has been opened to tackle that in the future.
(Indicate issue here: # (issue))

jamesturner246

Not sure if you can still see our meeting chat any more, but I think I have signal forwarding working without any monitor threads or other hacks.

The secret is to run the ssh client with the -t option, which forces the remote side to allocate a PTY and run the command inside it, rather than running the command directly without one. The advantage is that signals such as SIGTERM are forwarded to the remote command properly, thanks to the PTY's built-in signal handling.

So all it means in practice is using ssh -t ... instead of ssh ..., and it should work without the hacks.

jamesturner246 · 2025-12-08T11:05:40Z

One caveat though: SIGTERMing the local ssh means that a SIGHUP is actually what appears on the remote command side. But for this use case I don't think it matters which of SIGTERM or SIGHUP reaches the remote command, just the fact that a TERM-ish signal is reaching the remote side reliably.

jamesturner246 · 2025-12-08T11:13:05Z

One can in fact be EXTRA safe (probably would recommend) by sending the ^C byte through explicitly, and letting the remote program deal with shutting itself down. The ssh return code would be the return code of the remote command. It would fall back to getting SIGHUP if that fails though.
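
To make the point about return codes concrete, here is a small hedged sketch of how a status relayed back through a shell or ssh might be interpreted. It assumes the common shell convention that a command killed by signal N is reported as 128 + N, and relies on the documented OpenSSH behaviour that `ssh` exits with the remote command's status, or 255 if an error occurred. The function name is made up for illustration, and whether a signal death is reported as 128 + N depends on how the remote command is wrapped.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Turn an exit status relayed through a shell or ssh into readable text."""
    if status == 0:
        return "exited cleanly"
    if status == 255:
        # OpenSSH uses 255 for its own errors, e.g. a failed connection.
        return "ssh itself failed (or the command exited with 255)"
    if status > 128:
        try:
            # Shell convention: 128 + signal number.
            return f"killed by {signal.Signals(status - 128).name}"
        except ValueError:
            pass  # not a known signal number, fall through
    return f"exited with code {status}"
```

With this convention a SIGHUP caused by a dropped connection (129) can be told apart from a requested SIGQUIT (131) or a forced SIGKILL (137).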

Aurashk · 2026-01-16T11:06:00Z

> One caveat though: SIGTERMing the local ssh means that a SIGHUP is actually what appears on the remote command side. But for this use case I don't think it matters which of SIGTERM or SIGHUP reaches the remote command, just the fact that a TERM-ish signal is reaching the remote side reliably.

From my understanding of #658 and #649 this is exactly the problem we want to solve. We want to explicitly send signals to the remote process, but the current implementation sends signals to the local process. If a remote process gets a SIGHUP it leads to ambiguity, as it just means the connection/terminal was closed from the client side. The current intended behaviour attempts to shut everything down cleanly with a SIGQUIT (but this doesn't work because it SIGQUITs the client instead).

This is all my interpretation of what the code should be doing, from the issues and from looking at the current implementation. Ideally we should have these nuances documented somewhere, so it's clear what boot and kill should do. Maybe add a few lines in process_manager.md.

Aurashk · 2026-01-16T11:15:13Z

> Not sure if you can still see our meeting chat any more, but I think I have signal forwarding working without any monitor threads or other hacks.
>
> The secret is to run the ssh client with the -t option, which forces the remote side to allocate a PTY and run the command inside it, rather than running the command directly without one. The advantage is that signals such as SIGTERM are forwarded to the remote command properly, thanks to the PTY's built-in signal handling.
>
> So all it means in practice is using ssh -t ... instead of ssh ..., and it should work without the hacks.

Monitor threads are used for a few things in the shell ssh process manager, which we should probably document better: reading logs from remote processes asynchronously, checking the remote process is alive, and running a callback function when the remote process exits. I don't think it's that straightforward to remove them.

We are already using -t in all ssh commands. I actually think we might be overusing it and should use -T for some of the ssh calls for better efficiency. I'm not too sure why -t is already used everywhere - it may be that some of the processes require a full interactive shell so the process manager wouldn't work robustly without it.

Aurashk · 2026-01-22T14:42:48Z

Just want to flag something about the terminate behaviour now that our signal propagation is working.
SIGQUIT rarely succeeds for the remote processes within the timeout provided by the configuration here (usually 100 ms, i.e. 0.1 seconds).

If you try, for example

drunc-unified-shell ssh-standalone daqsystemtest/config/daqsystemtest/example-configs.data.xml local-1x1-config MyTest
boot
terminate

You will get most of your processes forcibly closed with SIGKILL because the timeout was too short. If you add 5.0 seconds to the timeout argument linked above you'll be far more successful in your SIGQUIT attempts, although each one takes a few seconds to shut down cleanly, so it makes terminate considerably slower. It would be good to be clear about what our requirements are here so we can pick the right tradeoff.
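
The escalation being described can be sketched as follows. This is a local stand-in using `subprocess`, not the process manager's code: the function name is hypothetical, and in the PR the signals travel over ssh to a remote PID rather than to a local child.

```python
import signal
import subprocess


def quit_then_kill(proc: subprocess.Popen, timeout: float) -> str:
    """Send SIGQUIT to `proc`; escalate to SIGKILL after `timeout` seconds.

    Returns the name of the last signal that had to be sent.
    """
    proc.send_signal(signal.SIGQUIT)
    try:
        proc.wait(timeout=timeout)
        return "SIGQUIT"
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL cannot be caught or ignored
        proc.wait()
        return "SIGKILL"
```

With a process that needs a few seconds to shut down cleanly, `timeout=0.1` would always take the SIGKILL branch, which matches the behaviour reported above.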

Aurashk · 2026-01-23T16:34:02Z

As discussed with @jamesturner246 we can use the existing monitor threads to have one persistent ssh connection per process which will block until the remote process dies. This means we don't need to reconnect to every remote process for each ps command. With the latest commit the only outstanding concern is the slowness of killing the processes mentioned in the comment above.
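
A rough sketch of the monitor-thread pattern mentioned here, under the assumption that one blocking call per process is enough to learn when it dies. A local child process stands in for the persistent ssh connection, and `watch_process` is a hypothetical name.

```python
import subprocess
import threading
from typing import Callable


def watch_process(
    proc: subprocess.Popen, on_exit: Callable[[int], None]
) -> threading.Thread:
    """Start a daemon thread that blocks until `proc` exits, then calls `on_exit`."""

    def _wait() -> None:
        exit_code = proc.wait()  # blocks for the lifetime of the process
        on_exit(exit_code)

    thread = threading.Thread(target=_wait, daemon=True)
    thread.start()
    return thread
```

Because the thread simply blocks, no polling (for example repeated `ps` calls over new connections) is needed to notice that the process has gone.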

The fundamental limitation I'm seeing is:

Each remote process takes ~2-6 seconds to shut down cleanly with SIGQUIT and give us an exit code (on my fairly slow laptop, mind). We weren't seeing this delay before because we were only checking that the ssh client process ended locally (the remote process can still be running after you disconnect).
When we do terminate we want to shut down processes synchronously, because the order matters, so we need to wait for the previous process to shut down before shutting down the next one. This makes the terminate command much slower than before (adds seconds per process).

I think this means we need to do one of the following but open to other solutions:

Accept the slower performance with ssh process manager terminate in testing/prototyping
Kill all the processes at once asynchronously, and accept that we can't order the termination of processes. (Or something more sophisticated which would kill the processes in batches)
Don't display exit codes/status interactively in the shell, write it to a log in the background instead. Then the Process cleanup can happen correctly in the background independently of the shell instance and problematic exit codes can still get flagged at the end.
Don't bother with SIGQUIT and default straight to SIGKILL as it's faster if a clean shutdown isn't important for all processes in this situation.

PawelPlesniak · 2026-01-28T16:45:48Z

Thank you both for the discussion on this PR thread. @Aurashk you have made the correct assumptions about the code.

When we terminate a session, we want to be able to kill all the applications directly, not their connections. The timeout was indeed reduced to make this faster. If we can send the signals in parallel or asynchronously it would be a lot faster, but let's do this only if it does not require much time: SSH is used now, but it will eventually not be the primary option as we shift to k8s/systemd. If we are to increase the timeout associated with killing the processes, then the parallel approach would be preferred: when we run the SSH PM in production we have O(30) processes, and if we have to wait 30 processes x 5 seconds x 2 attempts (SIGQUIT, SIGKILL), this is 5 mins, which is too much.

The order for process shutdown does matter, in the sense that we want to kill daq_applications, then segment controllers, then the root_controller, then the infrastructure applications, if defined in the session (e.g. local-connectivity-service but not ehn1-connectivity-service). We could have these sets of processes be defined to be terminated in set order, which can be identified by the tree ID, which is formatted as a 3 digit number using session_index.segment_index.application_index.
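
As a back-of-the-envelope check of the numbers above, the worst case can be compared for one wait per process against one wait per ordered group. The second figure assumes that every process in a group is signalled in parallel and that there are four groups (daq_applications, segment controllers, root_controller, infrastructure); the function name is illustrative only.

```python
def worst_case_seconds(n_waits: int, timeout_s: float, attempts: int = 2) -> float:
    """Upper bound on terminate time when `n_waits` waits happen back to back.

    `attempts` counts the signals tried per wait (SIGQUIT, then SIGKILL).
    """
    return n_waits * timeout_s * attempts


sequential = worst_case_seconds(30, 5.0)  # one wait per process: 300.0 s, i.e. 5 minutes
staged = worst_case_seconds(4, 5.0)       # one wait per group: 40.0 s
print(f"sequential: {sequential} s, staged: {staged} s")
```

Under these assumptions the bound depends on the number of groups rather than on the number of processes.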

Your suggestions on process ordering are good, but it would be better to contain the process termination within the reporting period: leaving the processes to clean up after the result of terminate has been reported could lead to zombie processes, which we have had several iterations of in the past.

Doing SIGQUIT before SIGKILL should stay too. I understand the suggestion, but from a wider group perspective it is better to attempt a graceful shutdown before force killing.

TLDR

The termination order should be daq_applications, then segment controllers, then the root_controller, then the infrastructure applications. This can be done synchronously, but only if it is not complex to do so.
Let's keep SIGQUIT and SIGKILL
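
One way the requested ordering could be expressed is as a list of roles, with processes grouped into stages by role. This is a hedged sketch: the role labels, the `plan_shutdown` name and the choice to handle unrecognised roles first are assumptions made for illustration, not the project's configuration.

```python
from collections import defaultdict

# Order taken from the comment above; the label strings are assumed.
SHUTDOWN_ORDER = ["application", "segment-controller", "root-controller", "infrastructure"]


def plan_shutdown(processes: dict[str, str]) -> list[list[str]]:
    """Group process names into ordered shutdown stages.

    `processes` maps a process name to its role. Names whose role is not in
    SHUTDOWN_ORDER go into a first stage of their own so nothing is skipped.
    Empty stages are dropped.
    """
    by_role: dict[str, list[str]] = defaultdict(list)
    unknown: list[str] = []
    for name, role in processes.items():
        if role in SHUTDOWN_ORDER:
            by_role[role].append(name)
        else:
            unknown.append(name)
    stages = [sorted(unknown)] + [sorted(by_role[role]) for role in SHUTDOWN_ORDER]
    return [stage for stage in stages if stage]
```

Each inner list can then be terminated as a unit before moving on to the next one.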

Aurashk · 2026-01-30T17:29:39Z

Thanks for the helpful comments @PawelPlesniak. It wasn't too complicated to do the terminate in asynchronous batches (in the order specified, same as k8s), so I went with that. It seems to work nicely and is much more performant than killing the processes one after the other.

I turned up the timeout in drunc/src/drunc/data/process_manager/ssh-standalone.json slightly, from 0.1 to 0.5. Otherwise both the SIGQUIT and SIGKILL can fail to kill within the timeout specified, and this line will log it as an error. Is that okay? You could probably crank it up even further to 2 seconds to give a good chance that all the SIGQUITs will succeed. The performance scales with the number of items in the shutdown order, so even with a lot more processes it should still be pretty fast.
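
The batching described in this comment can be illustrated roughly as below: stages run one after another, while the processes inside a stage are handled concurrently. This is a sketch under assumptions, not the code in the PR. `terminate_in_stages` and the `kill_one` callable are hypothetical, and a thread pool is just one way to get the concurrency.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def terminate_in_stages(
    stages: list[list[str]], kill_one: Callable[[str], str]
) -> dict[str, str]:
    """Run `kill_one` on every process, one stage at a time.

    Processes inside a stage are handled concurrently; the next stage starts
    only after every call in the current stage has returned.
    """
    outcomes: dict[str, str] = {}
    for stage in stages:
        if not stage:
            continue
        with ThreadPoolExecutor(max_workers=len(stage)) as pool:
            # pool.map yields results in input order.
            for name, outcome in zip(stage, pool.map(kill_one, stage)):
                outcomes[name] = outcome
    return outcomes
```

The total wait is then bounded by the slowest process of each stage, which is why the run time follows the number of stages rather than the number of processes.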

Aurashk · 2026-01-30T17:33:47Z

@MRiganSUSX I made a small edit to the k8s process manager so that it shares the same shutdown ordering list as the ssh process manager stored as a config variable. It's a bit nicer but not strictly necessary for this PR so I can revert it if it's going to mess with anything you are doing.

PawelPlesniak · 2026-02-02T12:49:47Z

Hi Aurash, thank you for this cleanup, the code is much easier to read now. I have a request: could we suppress the output from the sh library on terminate? It is throwing off the integration tests because it is reported as an error:

[2026/02/02 12:47:46 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

Note - this was tested with increasing kill_timeout in the config to 10s.
Full output of boot then terminate:

drunc-unified-shell ssh-standalone config/daqsystemtest/example-configs.data.xml local-1x1-config pawel boot
[2026/02/02 12:47:21 UTC] INFO       shell.py:180                             drunc.unified_shell                                Setting up to use the process manager with configuration ssh-standalone and configuration id "local-1x1-config" from 
oksconflibs:config/daqsystemtest/example-configs.data.xml
[2026/02/02 12:47:21 UTC] INFO       shell.py:202                             drunc.unified_shell                                Starting process manager
[2026/02/02 12:47:21 UTC] INFO       process_manager.py:108                   drunc.process_manager                              process_manager communicating through address 10.73.136.85:33043
[2026/02/02 12:47:21 UTC] INFO       shell.py:538                             drunc.unified_shell                                unified_shell ready with process_manager and controller commands
[2026/02/02 12:47:21 UTC] INFO       process_manager_driver.py:96             drunc.process_manager_driver                       Booting session pawel
[2026/02/02 12:47:22 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'local-connection-server' from session 'pawel' with UUID df940ba6-73aa-44c9-b6af-3d6f4e08a3f1
[2026/02/02 12:47:22 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'root-controller' from session 'pawel' with UUID fa989303-00ef-42ad-8a08-7e85de3fb918
[2026/02/02 12:47:23 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ru-controller' from session 'pawel' with UUID 98cc4980-5fdb-4342-a47a-acaad56a3340
[2026/02/02 12:47:23 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'ru-01' from session 'pawel' with UUID 05b0fdab-ca04-4bc7-b1c8-3f3b1a11a2f3
[2026/02/02 12:47:23 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'df-controller' from session 'pawel' with UUID ea92f3e4-263d-4ab9-b20b-63101e8d63bf
[2026/02/02 12:47:23 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'tp-stream-writer' from session 'pawel' with UUID 4c894d81-481d-4e9a-a5a1-0c85a5a9844f
[2026/02/02 12:47:23 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'dfo-01' from session 'pawel' with UUID 53633d24-1be5-4ce1-bd0f-38e46d86cb3e
[2026/02/02 12:47:23 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'df-01' from session 'pawel' with UUID 12c626ac-19a7-422d-a2a5-c1a48358fe38
[2026/02/02 12:47:23 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'trg-controller' from session 'pawel' with UUID f98a1a47-942a-495e-84ea-e7445a6ef423
[2026/02/02 12:47:23 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'tc-maker-1' from session 'pawel' with UUID 1c926c21-af31-4998-a587-75574c9eb9f5
[2026/02/02 12:47:23 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'mlt' from session 'pawel' with UUID 3dfc7468-cd6f-4ccc-963a-e7ff3e421f42
[2026/02/02 12:47:23 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'hsi-fake-controller' from session 'pawel' with UUID 0dcbce50-711e-4424-b752-525adf3c1bc6
[2026/02/02 12:47:24 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'hsi-fake-01' from session 'pawel' with UUID 238ca5d3-8506-400c-b78a-6e566abafd99
[2026/02/02 12:47:24 UTC] INFO       ssh_process_manager.py:366               drunc.process_manager.SSH_SHELL_process_manager    Booted 'hsi-fake-to-tc-app' from session 'pawel' with UUID ee1b20f5-e519-40ff-91a1-aafdc3c2308c
  Looking for root-controller on the connectivity service... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0:00:00 0:00:00
⠋ Trying to talk to the root controller... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ -:--:-- 0:00:00
                                              pawel status                                              
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Name                   ┃ Info ┃ State   ┃ Substate ┃ In error ┃ Included ┃ Endpoint                  ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ root-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.85:30006 │
│   df-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.85:38085 │
│     df-01              │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.85:60629 │
│     dfo-01             │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.85:51387 │
│     tp-stream-writer   │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.85:52437 │
│   hsi-fake-controller  │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.85:37363 │
│     hsi-fake-01        │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.85:40325 │
│     hsi-fake-to-tc-app │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.85:33801 │
│   ru-controller        │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.85:45131 │
│     ru-01              │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.85:43837 │
│   trg-controller       │      │ initial │ initial  │ No       │ Yes      │ grpc://10.73.136.85:35577 │
│     mlt                │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.85:57253 │
│     tc-maker-1         │      │ initial │ idle     │ No       │ Yes      │ rest://10.73.136.85:37067 │
└────────────────────────┴──────┴─────────┴──────────┴──────────┴──────────┴───────────────────────────┘
Waiting on tree initialisation... ━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   6% 0:01:06
[2026/02/02 12:47:29 UTC] INFO       commands.py:80                           drunc.unified_shell.boot                           Booted successfully
[2026/02/02 12:47:29 UTC] INFO       shell.py:411                             drunc.unified_shell                                Shutting down the unified_shell
[2026/02/02 12:47:29 UTC] INFO       shell_utils.py:135                       drunc.utils.ShellContext                           You will not be able to issue commands to the controller anymore.
[2026/02/02 12:47:29 UTC] INFO       shell_utils.py:137                       drunc.utils.ShellContext                           Controller driver has been deleted.
[2026/02/02 12:47:29 UTC] INFO       ssh_process_manager.py:202               drunc.process_manager.SSH_SHELL_process_manager    Terminating
[2026/02/02 12:47:29 UTC] INFO       ssh_process_manager.py:205               drunc.process_manager.SSH_SHELL_process_manager    Killing all the known processes before exiting
[2026/02/02 12:47:29 UTC] INFO       ssh_process_lifetime_manager_shell.py:56 drunc.process_manager.SSH_SHELL_process_manager    --- Shutdown stage: Terminating role 'unknown' from provided UUIDs ---
[2026/02/02 12:47:29 UTC] INFO       ssh_process_lifetime_manager_shell.py:56 drunc.process_manager.SSH_SHELL_process_manager    --- Shutdown stage: Terminating role 'application' from provided UUIDs ---
[2026/02/02 12:47:29 UTC] INFO       ssh_process_lifetime_manager_shell.py:50 drunc.process_manager.SSH_SHELL_process_manager    Killing 8 process(es) with role 'application' from 14 candidates
[2026/02/02 12:47:32 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:32 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'hsi-fake-01' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:32 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:32 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'hsi-fake-to-tc-app' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:33 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process 238ca5d3-8506-400c-b78a-6e566abafd99 (PID 3531572) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:33 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process ee1b20f5-e519-40ff-91a1-aafdc3c2308c (PID 3531785) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:34 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:34 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:34 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process 4c894d81-481d-4e9a-a5a1-0c85a5a9844f (PID 3529781) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:34 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process 05b0fdab-ca04-4bc7-b1c8-3f3b1a11a2f3 (PID 3529421) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:34 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:34 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'tp-stream-writer' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:34 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'ru-01' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:34 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'dfo-01' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:34 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:34 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'tc-maker-1' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:34 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:34 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process 53633d24-1be5-4ce1-bd0f-38e46d86cb3e (PID 3529880) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:34 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process 3dfc7468-cd6f-4ccc-963a-e7ff3e421f42 (PID 3530934) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:34 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'mlt' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:34 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process 1c926c21-af31-4998-a587-75574c9eb9f5 (PID 3530675) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:42 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:42 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process 12c626ac-19a7-422d-a2a5-c1a48358fe38 (PID 3530180) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:42 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'df-01' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:42 UTC] INFO       ssh_process_lifetime_manager_shell.py:57 drunc.process_manager.SSH_SHELL_process_manager    --- Shutdown stage: Role 'application' complete ---
[2026/02/02 12:47:42 UTC] INFO       ssh_process_lifetime_manager_shell.py:56 drunc.process_manager.SSH_SHELL_process_manager    --- Shutdown stage: Terminating role 'segment-controller' from provided UUIDs ---
[2026/02/02 12:47:42 UTC] INFO       ssh_process_lifetime_manager_shell.py:50 drunc.process_manager.SSH_SHELL_process_manager    Killing 4 process(es) with role 'segment-controller' from 14 candidates
[2026/02/02 12:47:43 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:43 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process f98a1a47-942a-495e-84ea-e7445a6ef423 (PID 3530458) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:43 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'trg-controller' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:43 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:43 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process 0dcbce50-711e-4424-b752-525adf3c1bc6 (PID 3531258) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:43 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'hsi-fake-controller' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:44 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:44 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'ru-controller' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:44 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:44 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'df-controller' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:45 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process 98cc4980-5fdb-4342-a47a-acaad56a3340 (PID 3529283) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:45 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process ea92f3e4-263d-4ab9-b20b-63101e8d63bf (PID 3529531) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:45 UTC] INFO       ssh_process_lifetime_manager_shell.py:57 drunc.process_manager.SSH_SHELL_process_manager    --- Shutdown stage: Role 'segment-controller' complete ---
[2026/02/02 12:47:45 UTC] INFO       ssh_process_lifetime_manager_shell.py:56 drunc.process_manager.SSH_SHELL_process_manager    --- Shutdown stage: Terminating role 'root-controller' from provided UUIDs ---
[2026/02/02 12:47:45 UTC] INFO       ssh_process_lifetime_manager_shell.py:50 drunc.process_manager.SSH_SHELL_process_manager    Killing 1 process(es) with role 'root-controller' from 14 candidates
[2026/02/02 12:47:46 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:46 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process fa989303-00ef-42ad-8a08-7e85de3fb918 (PID 3529123) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:46 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'root-controller' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:46 UTC] INFO       ssh_process_lifetime_manager_shell.py:57 drunc.process_manager.SSH_SHELL_process_manager    --- Shutdown stage: Role 'root-controller' complete ---
[2026/02/02 12:47:46 UTC] INFO       ssh_process_lifetime_manager_shell.py:56 drunc.process_manager.SSH_SHELL_process_manager    --- Shutdown stage: Terminating role 'local-connection-server' from provided UUIDs ---
[2026/02/02 12:47:46 UTC] INFO       ssh_process_lifetime_manager_shell.py:50 drunc.process_manager.SSH_SHELL_process_manager    Killing 1 process(es) with role 'local-connection-server' from 14 candidates
[2026/02/02 12:47:47 UTC] ERROR      sh.py:1717                               drunc.process_manager.SSH_SHELL_process_manager    Connection to localhost closed.

[2026/02/02 12:47:47 UTC] INFO       ssh_process_manager.py:303               drunc.process_manager.SSH_SHELL_process_manager    Process 'local-connection-server' (session: 'pawel', user: 'pplesnia') process exited with exit code 0
[2026/02/02 12:47:47 UTC] INFO       ssh_process_lifetime_manager_shell.py:10 drunc.process_manager.SSH_SHELL_process_manager    Remote process df940ba6-73aa-44c9-b6af-3d6f4e08a3f1 (PID 3528869) terminated gracefully following SIGQUIT signal.
[2026/02/02 12:47:48 UTC] INFO       ssh_process_lifetime_manager_shell.py:57 drunc.process_manager.SSH_SHELL_process_manager    --- Shutdown stage: Role 'local-connection-server' complete ---
[2026/02/02 12:47:48 UTC] INFO       shell_utils.py:135                       drunc.utils.ShellContext                           You will not be able to issue commands to the process_manager anymore.
[2026/02/02 12:47:48 UTC] INFO       shell_utils.py:137                       drunc.utils.ShellContext                           Process_manager driver has been deleted.
[2026/02/02 12:47:48 UTC] INFO       process_manager.py:128                   drunc.process_manager                              Shutting down the process manager server
[2026/02/02 12:47:48 UTC] INFO       shell.py:513                             drunc.unified_shell                                unified_shell exited successfully
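For readers following the log above, the staged shutdown it shows (pick the processes matching one role out of all candidates, terminate them, then move to the next role) can be sketched roughly as below. This is only an illustration and not the drunc implementation: the function name `plan_shutdown_stages`, the `candidates` mapping shape, the placeholder UUID strings, and the `some-other-role` role are all assumptions made for the example. Only the two role names and their relative order are taken from the log.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def plan_shutdown_stages(
    candidates: Dict[str, str],
    role_order: Sequence[str],
) -> List[Tuple[str, List[str]]]:
    """Group candidate processes by role and return them in shutdown order.

    `candidates` maps a process UUID to its role (hypothetical shape).
    Roles that do not appear in `role_order` are ignored in this sketch;
    a real process manager would need an explicit policy for them.
    """
    by_role: Dict[str, List[str]] = defaultdict(list)
    for uuid, role in candidates.items():
        by_role[role].append(uuid)

    stages: List[Tuple[str, List[str]]] = []
    for role in role_order:
        uuids = by_role.get(role, [])
        if uuids:  # skip roles with no matching candidates
            stages.append((role, uuids))
    return stages


if __name__ == "__main__":
    # Placeholder UUIDs and an invented extra role, for illustration only.
    candidates = {
        "uuid-a": "root-controller",
        "uuid-b": "local-connection-server",
        "uuid-c": "some-other-role",
    }
    order = ["root-controller", "local-connection-server"]
    for role, uuids in plan_shutdown_stages(candidates, order):
        print(
            f"Killing {len(uuids)} process(es) with role {role!r} "
            f"from {len(candidates)} candidates"
        )
```

Each stage would then be handed to whatever actually delivers the signal to the remote PID. That part is deliberately left out here, since it depends on the SSH backend in use.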

MRiganSUSX · 2026-02-02T12:54:17Z

@Aurashk I have not tested anything and only looked at the k8s PM-related bits, but it seems good to me. Thanks for checking.

Aurashk added 5 commits December 4, 2025 17:26

add support for saving metadata ssh process manager

95367c9

improve kill handling to kill remote processes directly via ssh via sigterm

improve boot efficiency

1423994

Merge branch 'develop' into aurashk/improve-process-manager-terminate-behaviour

ce2e65c

make shell process manager a background remote process to fix remote signal behaviour

a97a724

add tmp/ fall back for ssh process metadata

d465e76

Aurashk requested review from PawelPlesniak and jamesturner246 December 5, 2025 16:44

Aurashk marked this pull request as ready for review December 8, 2025 10:10

jamesturner246 requested changes Dec 8, 2025


improve commenting and naming

ed3ea5e

PawelPlesniak added the enhancement New feature or request label Jan 22, 2026

Aurashk added 2 commits January 22, 2026 14:45

improve logging clarity

0ad329b

reduce number of ssh connections required for remote monitoring of processes

0d625bb

jamesturner246 self-requested a review January 29, 2026 11:52

Aurashk and others added 4 commits January 30, 2026 15:55

add terminate boot order and asynchronous remote process killing

9073c08

Merge branch 'develop' into aurashk/improve-process-manager-terminate-behaviour

f112ede

simplify and fix shutdown ordering

43a5cd3

Merge branch 'aurashk/improve-process-manager-terminate-behaviour' of https://github.com/DUNE-DAQ/drunc into aurashk/improve-process-manager-terminate-behaviour

e1828fd

Aurashk/improve process manager terminate behaviour #733
Aurashk wants to merge 12 commits into develop from aurashk/improve-process-manager-terminate-behaviour
