fix: harden Firecracker ZFS snapshot lifecycle by lox · Pull Request #142 · buildkite/cleanroom

lox · 2026-04-03T21:37:24Z

Summary

flush host-side buffered writes before taking Firecracker ZFS snapshots so restored sandboxes see the latest rootfs state
keep the root helper aligned with the ZFS runtime by allowing the dataset-access doctor probe and waiting for cloned zvol device nodes to appear
add Firecracker snapshot regression coverage for the ZFS host-sync path

Root cause

ZFS snapshots were being taken after the guest ran sync, but before the host had flushed buffered writes for the zvol. That meant the stored snapshot could capture an older rootfs state than the live sandbox, which showed up as restored sandboxes missing /workspace after repo bootstrap.

Impact

snapshot-backed Firecracker sandboxes on ZFS restore the expected workspace contents again
ZFS doctor checks and clone/create flows work reliably with the privileged helper

Validation

mise exec -- go test ./internal/backend/firecracker ./internal/volumestore ./scripts -run 'Test(CreateSnapshot|SnapshotDriverNeedsHostSync|ZFSDriver|RootHelper|Helper)'
bash -n scripts/cleanroom-root-helper.sh
remote validation on cleanroom-prod-apse2-linux-i-0b35e6770b45a7db1 (100.127.229.48): repo-aware create, ZFS snapshot create, create-from-snapshot, exec from /workspace, interactive console, cleanup, and cleanroom doctor (summary: 34 pass, 3 warn, 0 fail)

Notes

During validation I manually removed one leftover test snapshot dataset after the listener had already dropped its sandbox state. The restore regression itself is fixed by this change, but that cleanup edge case may still be worth a separate follow-up if it reproduces consistently.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b57849f0d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-03T22:23:50Z

internal/backend/firecracker/backend.go

+	if err := flushSnapshotHostFilesystem(ctx, driverCfg.Snapshots.Driver); err != nil {
+		return nil, err


Resume sandbox after host sync failure

CreateSnapshot pauses the VM process and then immediately returns on flushSnapshotHostFilesystem errors, but the resume defer is only installed afterward. If host sync fails (including context cancellation/timeouts), the function exits with the Firecracker process still SIGSTOPed, leaving the sandbox unusable until manual intervention. Move defer setup (or an explicit resume-on-error path) to immediately after pauseSandboxProcess succeeds so all early returns unpause the sandbox.


Fixed in e14e8eb: the resume defer now installs immediately after pauseSandboxProcess succeeds, so a failed host sync still sends SIGCONT before CreateSnapshot returns. I also added a regression test for the ZFS path that forces host sync to fail and asserts the sandbox is resumed and snapshotting never starts.

lox · 2026-04-03T23:11:58Z

@codex review

chatgpt-codex-connector · 2026-04-03T23:17:49Z

Codex Review: Didn't find any major issues. 👍


fix(firecracker): harden zfs snapshot lifecycle

d618016

lox changed the title ~~[codex] Harden Firecracker ZFS snapshot lifecycle~~ fix: harden Firecracker ZFS snapshot lifecycle Apr 3, 2026

lox added 2 commits April 4, 2026 08:50

fix(ci): satisfy shellcheck in zfs helper

543038f

fix(cli): avoid racing stdin error propagation

7b57849

lox marked this pull request as ready for review April 3, 2026 22:19

chatgpt-codex-connector bot reviewed Apr 3, 2026


fix(firecracker): resume paused sandbox on sync failure

e14e8eb

lox wants to merge 4 commits into main from codex/fix-zfs-snapshot-durability
