Fix silent hang in flush when nodeShutdown returns non-recoverable error #1267

Open
alexgim961101 wants to merge 1 commit into percona:dev from alexgim961101:fix/nodeshutdown-error-handling

@alexgim961101

Problem

During physical restore, flush() calls nodeShutdown() to shut down mongod.
If nodeShutdown() returns any error other than ConflictingOperationInProgress
(e.g., (Unauthorized) shutdown must run from localhost when running db without auth),
the if err != nil && strings.Contains(...) condition evaluates to false, so the
error is silently dropped instead of being returned. The code then proceeds to
waitMgoShutdown(), which hangs indefinitely because mongod never shuts down.

Root Cause

err = nodeShutdown(ctx, r.node)
if err != nil &&
    strings.Contains(err.Error(), "(ConflictingOperationInProgress)...") {
    return errors.Wrap(err, "shutdown server")
}
break  // ← non-ConflictingOp errors silently ignored
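
To make the fall-through concrete, here is a minimal sketch of the old condition in isolation. The name oldCheckReturns and the two sample errors are hypothetical and exist only for illustration; the text after (ConflictingOperationInProgress) is a placeholder, not the real server message.

```go
package main

import (
	"errors"
	"strings"
)

// oldCheckReturns mirrors the original condition in flush(): it is true
// only when the error mentions ConflictingOperationInProgress.
// Hypothetical helper name, used here for illustration only.
func oldCheckReturns(err error) bool {
	return err != nil &&
		strings.Contains(err.Error(), "(ConflictingOperationInProgress)")
}

// Sample errors. The message after the code in errConflicting is a
// placeholder; errUnauthorized uses the message quoted in this PR.
var (
	errConflicting  = errors.New("(ConflictingOperationInProgress) placeholder message")
	errUnauthorized = errors.New("(Unauthorized) shutdown must run from localhost when running db without auth")
)
```

With these inputs, oldCheckReturns(errConflicting) is true and oldCheckReturns(errUnauthorized) is false, so the Unauthorized error reaches the break and is never returned.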

Changes

flush() — Extract error classification

Replaced the inline strings.Contains check with a call to the new
isNonRecoverableShutdownErr() helper:

err = nodeShutdown(ctx, r.node)
if err != nil && isNonRecoverableShutdownErr(err) {
    return errors.Wrap(err, "shutdown server")
}
break

isNonRecoverableShutdownErr() — New helper function

Returns true if the error indicates mongod will not shut down and waiting
would be futile. Currently checks for:

  • (Unauthorized) — e.g., shutdown from non-localhost without auth
  • (ConflictingOperationInProgress) — node already stepping down

New error patterns can be added to the nonRecoverablePatterns slice.

Testing

Tested with a 3-node replica set (MongoDB 7.0, no auth, Docker):

  • Before: restore hangs indefinitely at waitMgoShutdown()
  • After: restore fails immediately with a clear error message:
    shutdown server: (Unauthorized) shutdown must run from localhost when running db without auth

@it-percona-cla

it-percona-cla commented Feb 16, 2026

CLA assistant check
All committers have signed the CLA.

During physical restore, flush() calls nodeShutdown() to shut down mongod.
Previously, if nodeShutdown() returned an error other than
ConflictingOperationInProgress (e.g., Unauthorized when not connecting
from localhost), the error was silently ignored and the code proceeded
to waitMgoShutdown(), which would hang indefinitely.

Extract error classification into isNonRecoverableShutdownErr() helper
to make it easy to add new non-recoverable error patterns in the future.

Signed-off-by: alexgim961101 <[email protected]>
@alexgim961101 alexgim961101 force-pushed the fix/nodeshutdown-error-handling branch from 25d4f7d to ee031ae Compare February 16, 2026 06:02
@alexgim961101 alexgim961101 changed the title from "fix: handle non-recoverable nodeShutdown errors in flush to prevent silent restore hang" to "Fix silent hang in flush when nodeShutdown returns non-recoverable error" Feb 16, 2026
@boris-ilijic
Member

boris-ilijic commented Feb 18, 2026

Hey @alexgim961101, thank you for your contribution. It looks good; we’ll perform a deeper review shortly and get back to you.

@alexgim961101
Author

@boris-ilijic Thanks for the response! Take your time with the review. I'm happy to address any feedback.
