Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b57849f0d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review"
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
```go
if err := flushSnapshotHostFilesystem(ctx, driverCfg.Snapshots.Driver); err != nil {
	return nil, err
```
Resume sandbox after host sync failure
CreateSnapshot pauses the VM process and then returns immediately on flushSnapshotHostFilesystem errors, but the resume defer is only installed afterward. If host sync fails (including context cancellation or timeouts), the function exits with the Firecracker process still SIGSTOPed, leaving the sandbox unusable until manual intervention. Move the defer setup (or an explicit resume-on-error path) to immediately after pauseSandboxProcess succeeds, so that every early return unpauses the sandbox.
Fixed in e14e8eb: the resume defer now installs immediately after pauseSandboxProcess succeeds, so a failed host sync still sends SIGCONT before CreateSnapshot returns. I also added a regression test for the ZFS path that forces host sync to fail and asserts the sandbox is resumed and snapshotting never starts.
@codex review
Codex Review: Didn't find any major issues. 👍
Summary
Root cause
ZFS snapshots were being taken after the guest ran sync, but before the host had flushed its own buffered writes for the zvol. That meant the stored snapshot could capture an older rootfs state than the live sandbox, which showed up as restored sandboxes missing `/workspace` after repo bootstrap.
Impact
Validation
- `mise exec -- go test ./internal/backend/firecracker ./internal/volumestore ./scripts -run 'Test(CreateSnapshot|SnapshotDriverNeedsHostSync|ZFSDriver|RootHelper|Helper)'`
- `bash -n scripts/cleanroom-root-helper.sh`
- Manual run on cleanroom-prod-apse2-linux-i-0b35e6770b45a7db1 (100.127.229.48): repo-aware create, ZFS snapshot create, create-from-snapshot, exec from `/workspace`, interactive console, cleanup, and `cleanroom doctor` (summary: 34 pass, 3 warn, 0 fail)

Notes
During validation I cleaned up one leftover test snapshot dataset directly, after the listener had already dropped sandbox state. The restore regression itself is fixed by this change, but that cleanup edge case may still be worth a separate follow-up if it reproduces consistently.