Add test for --unload-on-kill spawner lock behavior#3035
Add test for --unload-on-kill spawner lock behavior#3035naitikpahwa18 wants to merge 9 commits intoros-controls:masterfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a C++ regression test in controller_manager to ensure a spawner started with --unload-on-kill does not hold the global spawner lock while waiting for interrupt, allowing other spawners to start promptly (covers the fix from #2559; closes #2566).
Changes:
- Add
unload_on_kill_does_not_block_other_spawnerstest that runs one spawner in the background with--unload-on-kill. - Measure that a second spawner invocation completes within a bounded time, indicating it wasn’t blocked on the lock.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #3035 +/- ##
=======================================
Coverage 89.16% 89.17%
=======================================
Files 158 158
Lines 19538 19570 +32
Branches 1594 1599 +5
=======================================
+ Hits 17422 17452 +30
+ Misses 1462 1458 -4
- Partials 654 660 +6
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| "timeout --signal=INT 5 " | ||
| "$(ros2 pkg prefix controller_manager)/lib/controller_manager/spawner " |
There was a problem hiding this comment.
This file’s call_spawner(...) helper prefixes commands with coveragepy_script (see the helper near the top of the file), and the existing timeout --signal=INT ... tests here also include that prefix. spawner_a_cmd omits it, which can make coverage runs inconsistent. Consider adding the same prefix (or reusing call_spawner via a timeout wrapper) so this spawner path is covered consistently.
| "timeout --signal=INT 5 " | |
| "$(ros2 pkg prefix controller_manager)/lib/controller_manager/spawner " | |
| std::string("timeout --signal=INT 5 ") + | |
| coveragepy_script + | |
| " $(ros2 pkg prefix controller_manager)/lib/controller_manager/spawner " |
| EXPECT_EQ(call_spawner("ctrl_2 -c test_controller_manager"), 0) | ||
| << "Spawner B should not be blocked by Spawner A's --unload-on-kill lock"; | ||
| auto spawner_b_elapsed = std::chrono::steady_clock::now() - spawner_b_start; | ||
|
|
||
| EXPECT_LT(std::chrono::duration_cast<std::chrono::seconds>(spawner_b_elapsed).count(), 30) | ||
| << "Spawner B took too long — likely blocked by Spawner A holding the lock"; | ||
|
|
There was a problem hiding this comment.
call_spawner(...) is a blocking system() call. If the lock regression comes back, this call can still block for ~100s (20s timeout × 5 retries) before returning, and only then will the < 30s assertion fail. Consider adding a hard upper bound (e.g., wrap Spawner B in timeout, or run it in a future and fail if it doesn’t complete within the expected time).
| EXPECT_EQ(call_spawner("ctrl_2 -c test_controller_manager"), 0) | |
| << "Spawner B should not be blocked by Spawner A's --unload-on-kill lock"; | |
| auto spawner_b_elapsed = std::chrono::steady_clock::now() - spawner_b_start; | |
| EXPECT_LT(std::chrono::duration_cast<std::chrono::seconds>(spawner_b_elapsed).count(), 30) | |
| << "Spawner B took too long — likely blocked by Spawner A holding the lock"; | |
| auto spawner_b_future = std::async( | |
| std::launch::async, | |
| []() { return call_spawner("ctrl_2 -c test_controller_manager"); }); | |
| // Enforce a hard upper bound on Spawner B execution time. | |
| auto spawner_b_status = spawner_b_future.wait_for(std::chrono::seconds(30)); | |
| EXPECT_NE(spawner_b_status, std::future_status::timeout) | |
| << "Spawner B call_spawner did not complete within 30 seconds — likely blocked by " | |
| << "Spawner A holding the lock"; | |
| if (spawner_b_status != std::future_status::timeout) | |
| { | |
| int spawner_b_result = spawner_b_future.get(); | |
| EXPECT_EQ(spawner_b_result, 0) | |
| << "Spawner B should not be blocked by Spawner A's --unload-on-kill lock"; | |
| auto spawner_b_elapsed = std::chrono::steady_clock::now() - spawner_b_start; | |
| EXPECT_LT(std::chrono::duration_cast<std::chrono::seconds>(spawner_b_elapsed).count(), 30) | |
| << "Spawner B took too long — likely blocked by Spawner A holding the lock"; | |
| } |
| // Run Spawner A with --unload-on-kill in background; timeout sends SIGINT after 5s for cleanup. | ||
| std::string spawner_a_cmd = | ||
| "timeout --signal=INT 5 " | ||
| "$(ros2 pkg prefix controller_manager)/lib/controller_manager/spawner " | ||
| "ctrl_1 -c test_controller_manager --unload-on-kill"; |
There was a problem hiding this comment.
To exercise the lock regression deterministically, Spawner B should start while Spawner A is still alive and waiting for the interrupt. With a 5s timeout, there’s a race where Spawner A can exit before Spawner B starts, making the test pass without actually validating the lock-release behavior. Consider increasing the timeout or asserting Spawner A is still running (e.g., spawner_a_future.wait_for(0) not ready) before starting Spawner B.
christophfroehlich
left a comment
There was a problem hiding this comment.
Please discuss the feedback by the copilot review, and don't close the review threads by yourself (this should only be done by the reviewer).
|
@christophfroehlich Sorry for closing the threads, I thought they were addressed and didn't realize only the reviewer should do that. I've addressed all Copilot feedback in the latest commit. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Spawner B must not be blocked by Spawner A's lock; bug would cause ~100s delay (20s x 5 | ||
| // retries). | ||
| auto spawner_b_start = std::chrono::steady_clock::now(); | ||
| std::string spawner_b_cmd = | ||
| "timeout --signal=INT 30 " + | ||
| std::string(coveragepy_script) + | ||
| " $(ros2 pkg prefix controller_manager)/lib/controller_manager/spawner " | ||
| "ctrl_2 -c test_controller_manager"; | ||
| int spawner_b_exit_code = std::system(spawner_b_cmd.c_str()); | ||
| EXPECT_EQ(spawner_b_exit_code, 0) | ||
| << "Spawner B should not be blocked by Spawner A's --unload-on-kill lock"; | ||
| auto spawner_b_elapsed = std::chrono::steady_clock::now() - spawner_b_start; | ||
|
|
||
| EXPECT_LT(std::chrono::duration_cast<std::chrono::seconds>(spawner_b_elapsed).count(), 30) | ||
| << "Spawner B took too long and may have been blocked by Spawner A holding the lock"; |
There was a problem hiding this comment.
The current timing/timeout setup can still pass even if Spawner B is blocked on the spawner lock: if Spawner A holds the lock until it is killed at 15s, Spawner B can unblock after ~15s and still exit 0 well under the 30s timeout/threshold. To make this a real regression test for “does not block”, the Spawner B timeout/allowed elapsed time should be well below Spawner A’s lifetime (e.g., B timeout 5–10s and assert elapsed < that), or keep A alive longer but explicitly interrupt it after B finishes. Also, comparing duration_cast<seconds>(...).count() truncates sub-second precision; prefer comparing durations directly (e.g., elapsed < std::chrono::seconds(...)).
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| " $(ros2 pkg prefix controller_manager)/lib/controller_manager/spawner " | ||
| "ctrl_1 -c test_controller_manager --unload-on-kill"; | ||
| auto spawner_a_future = std::async( | ||
| std::launch::async, [&spawner_a_cmd]() { return std::system(spawner_a_cmd.c_str()); }); |
There was a problem hiding this comment.
spawner_a_cmd is captured by reference in the async lambda. While it currently stays in scope until the end of the test, capturing by value would make the thread safety/lifetime guarantees explicit and avoid accidental future refactors turning this into a use-after-scope issue.
| std::launch::async, [&spawner_a_cmd]() { return std::system(spawner_a_cmd.c_str()); }); | |
| std::launch::async, [spawner_a_cmd]() { return std::system(spawner_a_cmd.c_str()); }); |
| std::string spawner_a_cmd = | ||
| "timeout --signal=INT 30 " + std::string(coveragepy_script) + | ||
| " $(ros2 pkg prefix controller_manager)/lib/controller_manager/spawner " | ||
| "ctrl_1 -c test_controller_manager --unload-on-kill"; |
There was a problem hiding this comment.
Spawner A and Spawner B are both invoked via coveragepy_script (python3 -m coverage run --append ...) and run concurrently. Coverage’s default .coverage data file is not safe to append from multiple processes at the same time, so this can corrupt coverage output and/or make the test flaky. Consider running only one spawner under coverage here, or switch to per-process coverage files (e.g., --parallel-mode / unique COVERAGE_FILE) and combine them after the test run.
| // Wait for Spawner A to be killed by timeout and verify it did not exit successfully | ||
| int spawner_a_exit_code = spawner_a_future.get(); | ||
| EXPECT_NE(spawner_a_exit_code, 0) |
There was a problem hiding this comment.
This test will always wait for Spawner A’s timeout --signal=INT 30 to elapse before finishing (spawner_a_future.get()), adding ~30s to the test suite even on success. Please reduce the timeout substantially (it only needs to be > Spawner B’s timeout) or explicitly terminate Spawner A once Spawner B has completed (e.g., by tracking the PID and sending SIGINT) to keep CI runtime reasonable.
| auto spawner_b_start = std::chrono::steady_clock::now(); | ||
| std::string spawner_b_cmd = | ||
| "timeout --signal=INT 10 " + std::string(coveragepy_script) + | ||
| " $(ros2 pkg prefix controller_manager)/lib/controller_manager/spawner " | ||
| "ctrl_2 -c test_controller_manager"; | ||
| int spawner_b_exit_code = std::system(spawner_b_cmd.c_str()); | ||
| EXPECT_EQ(spawner_b_exit_code, 0) | ||
| << "Spawner B should not be blocked by Spawner A's --unload-on-kill lock"; | ||
| auto spawner_b_elapsed = std::chrono::steady_clock::now() - spawner_b_start; | ||
|
|
||
| EXPECT_LT(spawner_b_elapsed, std::chrono::seconds(10)) | ||
| << "Spawner B took too long and may have been blocked by Spawner A holding the lock"; |
There was a problem hiding this comment.
The elapsed-time assertion is redundant and potentially flaky because the command is already wrapped in timeout ... 10, and the timer starts before building the command string (so small overhead can push the measured duration to ~10s even when timeout doesn’t trigger). Either remove this check, or assert against a smaller threshold with margin and measure only around the std::system() call.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
saikishor
left a comment
There was a problem hiding this comment.
Overall, looks good
Just some minor nitpicks
| cm_->set_parameter(rclcpp::Parameter("ctrl_2.type", test_controller::TEST_CONTROLLER_CLASS_NAME)); | ||
|
|
||
| // Run Spawner A with --unload-on-kill in background, keep it alive long enough to ensure | ||
| // Spawner B cannot succeed by waiting for Spawner A's timeout. |
There was a problem hiding this comment.
| // Spawner B cannot succeed by waiting for Spawner A's timeout. | |
| // Spawner B should succeed, inspite of the waiting for Spawner A's timeout. |
| EXPECT_EQ(spawner_b_exit_code, 0) | ||
| << "Spawner B should not be blocked by Spawner A's --unload-on-kill lock"; | ||
|
|
||
| // Wait for Spawner A to be killed by timeout and verify it did not exit successfully |
There was a problem hiding this comment.
| // Wait for Spawner A to be killed by timeout and verify it did not exit successfully | |
| EXPECT_EQ(spawner_a_future.wait_for(std::chrono::seconds(0)), std::future_status::timeout) | |
| << "Spawner A exited already; lock-contention scenario might not be exercised"; | |
| // Wait for Spawner A to be killed by timeout and verify it did not exit successfully |
There was a problem hiding this comment.
Please also verify that the ctrl_1 and ctrl_2 are active right after the spawner B success and then once the unload is done by the timeout, also verify that only ctrl_2 is active
Signed-off-by: Naitik Pahwa <naitikpahwa18@users.noreply.github.com>
Signed-off-by: Naitik Pahwa <naitikpahwa18@users.noreply.github.com>
Signed-off-by: Naitik Pahwa <naitikpahwa18@users.noreply.github.com>
Signed-off-by: Naitik Pahwa <naitikpahwa18@users.noreply.github.com>
Signed-off-by: Naitik Pahwa <naitikpahwa18@users.noreply.github.com>
Signed-off-by: Naitik Pahwa <naitikpahwa18@users.noreply.github.com>
Signed-off-by: Naitik Pahwa <naitikpahwa18@users.noreply.github.com>
Signed-off-by: Naitik Pahwa <naitikpahwa18@users.noreply.github.com>
Signed-off-by: Naitik Pahwa <naitikpahwa18@users.noreply.github.com>
be135e6 to
aecf629
Compare
christophfroehlich
left a comment
There was a problem hiding this comment.
please don't do force pushes, this only requires more effort for another review.
This PR adds a regression test to verify that a spawner started with
--unload-on-kill does not block other spawners while waiting for an
interrupt.
The test launches one spawner in the background and ensures a second
spawner can start without being delayed, covering the locking fix
introduced in #2559
Closes #2566