RBMC: Check again for dead sibling service #69
spinler wants to merge 1 commit into ibm-openbmc:1110
During some bad path testing the sibling daemon on each BMC would make it past the existing check done to make sure it was running and then die. This would cause the wait for the sibling interface to be on D-Bus to time out. At that point each BMC became active since it thought the sibling daemon was fine and just the sibling BMC had the problem.

Fix this by checking again if the sibling daemon is running when the sibling interface still isn't on D-Bus after waiting for it. If it isn't, become passive.

Tested:
This is seen on each BMC:
```
Waiting for sibling interface and/or heartbeat: Present = False, Heartbeat = False
Done waiting for sibling. Interface present = False, heartbeat = False
Sibling service state is failed
Role = xyz.openbmc_project.State.BMC.Redundancy.Role.Passive due to: Sibling BMC service is not running
```

Signed-off-by: Matt Spinler <spinler@us.ibm.com>
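For readers following the description, the added recheck can be sketched roughly as below. This is a minimal illustration, not the code in this PR: `SiblingStatus`, `RoleDecision`, and `recheckSiblingService` are hypothetical names, and the real manager gets these values over D-Bus inside a coroutine. The sketch only shows the one new decision: if the sibling interface is still missing after the wait and the sibling daemon is no longer running, become passive; otherwise fall through to the existing role selection.

```cpp
#include <optional>
#include <string>

// Hypothetical stand-ins for state the RBMC manager would read.
// None of these names come from the PR itself.
enum class Role
{
    Active,
    Passive
};

struct SiblingStatus
{
    bool interfacePresent; // sibling interface is on D-Bus
    bool heartbeat;        // sibling BMC heartbeat was seen
    bool serviceRunning;   // local sibling daemon is still running
};

struct RoleDecision
{
    Role role;
    std::string reason;
};

// Called after the wait for the sibling has finished. Returns a Passive
// decision when the sibling daemon died during the wait, or std::nullopt
// to mean "continue with the existing role selection logic".
std::optional<RoleDecision> recheckSiblingService(const SiblingStatus& status)
{
    if (!status.interfacePresent && !status.serviceRunning)
    {
        return RoleDecision{Role::Passive,
                            "Sibling BMC service is not running"};
    }
    return std::nullopt;
}
```

Returning `std::nullopt` for every other case keeps the sketch from guessing at the rest of the role rules, which depend on inputs such as the previous role that are not shown here.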
    co_await sibling->waitForSiblingUp(siblingTimeout);

    if (previousRole == Role::Passive)
    // Sibling service may have died. Check again.
I've looked at this for a while now, and I'm sure it's right, but it just feels like we're starting to work ourselves into the if/else wormhole, with tests now needed for every path and a lot of added complexity. Is the sibling service dying a real use case? Or just something that was only possible with some special error injections?
It happened when the cfam daemon failed accessing the CFAM regs, which I think is a valid failure, and right now I just have the daemon crash when that happens. I did it that way because I couldn't think of how else to alert the rbmc manager that the sibling can't get any info from it. If there were another way for the rbmc manager to know that FSI is broken, then I wouldn't worry about this case here. I'll think a bit more on it.
Moved this over to an 1120 PR: #77. Will close this one.