RBMC: Check again for dead sibling service by spinler · Pull Request #69 · ibm-openbmc/phosphor-state-manager

spinler · 2025-01-31T17:07:26Z

During bad-path testing, the sibling daemon on each BMC would pass the existing check that it was running and then die. The wait for the sibling interface to appear on D-Bus would then time out. At that point each BMC became active, because it assumed the sibling daemon was fine and that only the sibling BMC had a problem.

Fix this by checking again whether the sibling daemon is running when the sibling interface still isn't on D-Bus after waiting for it. If the daemon isn't running, become passive.
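The recheck described above can be sketched as a small decision function. This is only an illustration under stated assumptions: `SiblingStatus`, `RoleDecision`, and `recheckSiblingService` are hypothetical names, not the phosphor-state-manager API, and the real change runs inside a coroutine after `waitForSiblingUp()`. Only the reason string is taken from the test output in this PR.

```cpp
#include <optional>
#include <string>

// All names below are hypothetical and exist only for illustration.
enum class Role
{
    Active,
    Passive
};

// Snapshot taken after the wait for the sibling has finished.
struct SiblingStatus
{
    bool interfacePresent; // sibling interface is on D-Bus
    bool serviceRunning;   // sibling daemon is still running
};

struct RoleDecision
{
    Role role;
    std::string reason;
};

// Returns a Passive decision when the sibling interface never appeared
// and the sibling daemon has died in the meantime. Returns std::nullopt
// in every other case, meaning the normal role selection should continue.
std::optional<RoleDecision> recheckSiblingService(const SiblingStatus& status)
{
    if (!status.interfacePresent && !status.serviceRunning)
    {
        return RoleDecision{Role::Passive,
                            "Sibling BMC service is not running"};
    }
    return std::nullopt;
}
```

A caller would invoke this right after the wait returns and only fall through to the existing role logic when the result is empty. Returning an optional keeps the new check separate from the existing branches, so it can be tested on its own.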

Tested:

This is seen on each BMC:

Waiting for sibling interface and/or heartbeat: Present = False, Heartbeat = False
Done waiting for sibling. Interface present = False, heartbeat = False
Sibling service state is failed
Role = xyz.openbmc_project.State.BMC.Redundancy.Role.Passive due to: Sibling BMC service is not running

Signed-off-by: Matt Spinler <spinler@us.ibm.com>

geissonator · 2025-02-04T02:32:20Z

            co_await sibling->waitForSiblingUp(siblingTimeout);

-            if (previousRole == Role::Passive)
+            // Sibling service may have died.  Check again.


I've looked at this for a while now, and I'm sure it's right, but it feels like we're starting to work ourselves into an if/else wormhole, with tests now needed for every path and a lot of added complexity. Is the sibling service dying a real use case, or something that was only possible with special error injection?

spinler · Feb 4, 2025

It happened when the cfam daemon failed to access the CFAM registers, which I think is a valid failure, and right now I just have the daemon crash when that happens. I did it that way because I couldn't think of another way to alert the rbmc manager that the sibling can't get any info from it. If there were another way for the rbmc manager to know that FSI is broken, I wouldn't worry about this case here. I'll think a bit more on it.

spinler · 2025-03-13T19:02:26Z

Moved this over to an 1120 PR: #77. Will close this one.

spinler requested review from RameshIyyar and geissonator January 31, 2025 17:07

RameshIyyar approved these changes Feb 1, 2025

geissonator reviewed Feb 4, 2025

spinler closed this Mar 13, 2025

spinler deleted the rbmc_dead_sibling_check branch March 13, 2025 19:02

spinler wants to merge 1 commit into ibm-openbmc:1110 from spinler:rbmc_dead_sibling_check
