
Warm-pool v2: state-machine-driven reusable GCE lifecycle#48

Merged
lukacf merged 22 commits into main from feat/warm-pool-v2
Mar 26, 2026

Conversation

@lukacf
Owner

@lukacf lukacf commented Mar 26, 2026

Summary

  • replace the ad hoc warm-pool lifecycle with an explicit instance state machine and lease model
  • simplify warm reuse to a VM-driven pickup protocol and harden the GCE startup / status / cleanup paths
  • add broad regression coverage for warm-pool dispatch, daemon reconciliation, cancellation, startup failures, and lifecycle cleanup

What changed

  • add instance lifecycle primitives (instance_types, instance_transitions, instance_core, instance_controller) and persist warm instance state + leases in the database
  • move warm-pool behavior onto state-machine transitions instead of scattered status rewrites
  • rework warm reuse in the GCE backend, startup script, daemon polling, cleanup, and cancel flow
  • harden edge cases before release: per-run exit metadata, signal clearing, launch-time cleanup, timeout alignment, strict audit behavior, and warm-pool/reused-run env handling
  • add unit and integration coverage for warm-pool claim adapter behavior, startup-loop behavior, daemon recovery, cancel semantics, launch failures, cleanup ownership, and end-to-end reuse lifecycle

Verification

  • make lint
  • make test
  • make test-integration

Notes for reviewers

  • the important architectural shift is that warm-pool lifecycle is now state-machine-driven instead of being coordinated by ad hoc DB status updates
  • the warm reuse protocol no longer relies on the old ACK handshake; the VM picks up work from GCS/metadata and the daemon reconciles based on explicit instance state + leases

Test User and others added 22 commits March 23, 2026 10:09
…gle controller

Replace scattered warm-pool lifecycle logic (~10 code locations making
ad-hoc status updates) with a first-class state machine:

- 7 states (launching→busy→draining→idle_ready→claimed→deleting→gone)
- 13 events with explicit transition table (20 transitions)
- CAS semantics via instance_transition() mirroring stage run state machine
- Explicit leases (instance_leases table) replace ownership guesswork
- Atomic capacity gate via conditional INSERT (no separate has_capacity check)
- Reuse ordering gate: busy→draining→idle_ready enforced by state machine

InstanceController is the single entry point for all instance lifecycle
decisions. No code writes warm_instances.state directly.

Key fixes found via multi-model bug hunting (GPT-5.4 + Opus 4.6 + Gemini):
- Claim signaling uses --metadata-from-file (JSON-safe, no comma mangling)
- Flat signal JSON (spec_gcs_path at top level, not nested in payload)
- Pre-registration uses launcher's _sanitize_name (name matches actual VM)
- Launch failures call on_launch_failed (no stranded launching rows)
- Launching timeout in daemon (15min safety net)
- Lease creation failure in on_fresh_launch fails the operation (no strandable busy)
- ACK after spec validation (not before)
- set -e safe wait pattern (EXIT_CODE=0; wait $PID || EXIT_CODE=$?)
- Env vars passed to Docker via -e KEY=VALUE (not sourced into host shell)
- goldfish_instance_state=idle_ready metadata on both first-boot and reuse
- Cleanup preserves job_spec.json, resets GCS_LOG_PATH/GCS_STAGE_LOG_PATH
- log_syncer snapshots GCS paths per-job (no stale first-boot paths)
- Stale lease detection for draining instances with terminal runs
- Claim retry loop (3 attempts) for CAS races
- export GCS_BUCKET for Python child processes
- find_claimable_instance ORDER BY state_entered_at (oldest idle first)
- warm_pool_cleanup cancels owning runs before deleting (no orphaned runs)
- Accurate deleted vs deletion_pending counts in cleanup response

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
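The state/event/transition-table shape described above can be sketched roughly as follows. This is an illustrative subset only: the state and event names are taken from this PR's text, but the enum layout, the `next_state` helper, and the specific transitions shown are assumptions, not the real `instance_types`/`instance_transitions` modules (which define 13 events and 20 transitions).

```python
from enum import Enum

class InstanceState(str, Enum):
    LAUNCHING = "launching"
    BUSY = "busy"
    DRAINING = "draining"
    IDLE_READY = "idle_ready"
    CLAIMED = "claimed"
    DELETING = "deleting"
    GONE = "gone"

class InstanceEvent(str, Enum):
    LAUNCH_OK = "launch_ok"
    CLAIM_REQUESTED = "claim_requested"
    CLAIM_ACKED = "claim_acked"
    JOB_FINISHED = "job_finished"
    DRAIN_COMPLETE = "drain_complete"
    DELETE_REQUESTED = "delete_requested"
    DELETE_CONFIRMED = "delete_confirmed"

# (from_state, event) -> to_state; an illustrative subset of the real table
TRANSITIONS = {
    (InstanceState.LAUNCHING, InstanceEvent.LAUNCH_OK): InstanceState.BUSY,
    (InstanceState.BUSY, InstanceEvent.JOB_FINISHED): InstanceState.DRAINING,
    (InstanceState.DRAINING, InstanceEvent.DRAIN_COMPLETE): InstanceState.IDLE_READY,
    (InstanceState.IDLE_READY, InstanceEvent.CLAIM_REQUESTED): InstanceState.CLAIMED,
    (InstanceState.CLAIMED, InstanceEvent.CLAIM_ACKED): InstanceState.BUSY,
    (InstanceState.IDLE_READY, InstanceEvent.DELETE_REQUESTED): InstanceState.DELETING,
    (InstanceState.DELETING, InstanceEvent.DELETE_CONFIRMED): InstanceState.GONE,
}

def next_state(current, event):
    """Look up the target state; None means the event is rejected in this state."""
    return TRANSITIONS.get((current, event))
```

Any code path that wants to move an instance consults this table; there is no way to reach, say, `idle_ready` from `busy` without passing through `draining` first, which is the reuse-ordering gate the commit message describes.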
…ocker failures

Two ship-blocking fixes:

1. Migration v8 now always creates warm_instances via CREATE TABLE IF NOT
   EXISTS, not just when upgrading from the old schema. Fresh upgrades
   from main (no prior warm pool table) were left without the table,
   causing "no such table: warm_instances" on first warm-pool operation.

2. Input staging and image pull failures in the warm reuse path now set
   goldfish_instance_state=idle_ready before continuing to idle. Without
   this, the run finalizes busy→draining but poll_warm_instances never
   sees the metadata it needs for DRAIN_COMPLETE, stranding the instance
   in draining permanently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…and controller routing

Fixes from bug-hunter rounds:

- Transition-first lease ordering: on_run_terminal, on_launch_failed, and
  on_claim_timeout now transition state BEFORE releasing the lease, preventing
  orphaned busy instances when transition fails after lease release

- Clear stale goldfish_instance_state metadata on new job claim: sets
  metadata to "busy" before ACK to prevent daemon from seeing old idle_ready
  value during the next draining phase

- Consecutive liveness failures (3) required before declaring preemption:
  transient gcloud/API errors no longer immediately mark live VMs as gone

- on_claim_acked result checked: failed ACK transition now rolls back the
  claim instead of returning a handle to an unowned instance

- on_fresh_launch failure cleanup: calls on_launch_failed to remove the
  pre-registered row instead of leaving it for the launching timeout

- find_claimable_instance excludes instances with active leases: prevents
  claim retry loop from repeatedly picking the same lost-race instance

- delete_instance routes through controller: on_delete_requested +
  on_delete_confirmed for audit trail, instead of bypassing state machine

- Remove FK CASCADE on instance_state_transitions: audit history is
  preserved after warm instance row deletion

- cancel_run skips direct backend cleanup when controller handled it:
  avoids redundant synchronous gcloud delete for warm-pool instances

- Warm-pool variable initialization before try block: prevents
  UnboundLocalError in exception handler on early failures

- gcloud delete timeout reduced to 15s: daemon retries on next poll

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
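The transition-first lease ordering from the first fix above can be sketched like this. The callable parameters are hypothetical stand-ins for the real `InstanceController` internals; only the ordering rule itself comes from the commit text.

```python
def on_run_terminal(transition, release_lease):
    """Transition state BEFORE releasing the lease.

    If the busy -> draining CAS fails, the lease is deliberately kept, so
    no orphaned busy instance is left behind; the daemon can retry on a
    later poll while ownership is still recorded.
    """
    if not transition():
        return False  # transition lost/failed: lease intentionally NOT released
    release_lease()
    return True

released = []
assert on_run_terminal(lambda: False, lambda: released.append(True)) is False
assert released == []  # failed transition: lease untouched
assert on_run_terminal(lambda: True, lambda: released.append(True)) is True
assert released == [True]
```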
…ightening

- on_preempted/on_delete_requested: transition-first, then release lease
  (consistent with on_run_terminal/on_claim_timeout/on_launch_failed)

- on_claim_timeout: always release lease regardless of transition result
  (caller abandoned the claim; stale lease blocks instance forever)

- pre_register_warm_instance: INSERT OR IGNORE prevents IntegrityError
  on duplicate stage_run_id retries

- find_claimable_instance: NOT EXISTS subquery excludes instances with
  active leases (prevents claim retry picking same lost-race instance)

- instance_transition idempotency: require from_state != current_state
  to prevent spurious "already_in_target_state" for unrelated events

- cancel_run: require ctrl_result.success before skipping backend cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Controller registration failure: remove pre-registered row directly
  instead of calling on_launch_failed (which would transition to deleting
  and eventually kill the live, running VM)

- LAST_JOB_REQ_ID assignment moved after ACK: transient spec-download
  failures no longer permanently suppress retries for the same request

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
If the same (instance_name, stage_run_id) lease already exists (e.g.
retried call), treat it as idempotent success instead of failing and
triggering the "delete pre-registered row" cleanup path that would
orphan a successfully launched VM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ase recovery

- Async delete verification: after issuing --async gcloud delete, verify
  the VM is actually gone via liveness check before removing the DB row.
  Prevents pool from exceeding max_instances while old VM is still alive.

- Dotfile cleanup: default cleanup uses find instead of rm -rf * so
  hidden directories like .goldfish/ (metrics, SVS artifacts) are removed
  between warm-pool reuse cycles. Prevents stale artifacts from being
  uploaded under the next run's path.

- Busy-instance lease recovery: poll_warm_instances now checks busy
  instances for stale leases pointing to terminal/non-owning runs (same
  pattern as draining). Handles executor crash after run goes terminal
  but before on_run_terminal releases the lease.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eliminate the entire class of partial-failure bugs between lease and
state operations by making them a single CAS UPDATE.

Before: create_instance_lease() + instance_transition() were separate
DB transactions. A crash between them left inconsistent state (orphaned
leases, busy instances without leases, stale leases blocking claims).

After: instance_transition(set_lease_run_id=...) atomically sets both
state and current_lease_run_id in one UPDATE ... WHERE state = ? CAS.
No partial-failure window exists.

Changes:
- warm_instances gains current_lease_run_id column (nullable TEXT)
- instance_transition() gains set_lease_run_id parameter
- Controller methods pass lease ops through to instance_transition()
- find_claimable_instance uses current_lease_run_id IS NULL (no subquery)
- instance_leases table kept as append-only audit log
- Daemon reads lease from instance row directly (no separate lookup)
- All tests updated to use atomic lease model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
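A minimal sketch of the atomic state-plus-lease CAS described above, assuming a `warm_instances` table keyed by instance name. The column names and signature here are illustrative, not necessarily the real schema in db/database.py:

```python
import sqlite3

def instance_transition(conn, name, from_state, to_state, set_lease_run_id=None):
    """One CAS UPDATE: state and current_lease_run_id change together or not
    at all, so no crash window can leave a lease without a matching state."""
    cur = conn.execute(
        "UPDATE warm_instances "
        "SET state = ?, current_lease_run_id = ? "
        "WHERE instance_name = ? AND state = ?",
        (to_state, set_lease_run_id, name, from_state),
    )
    conn.commit()
    return cur.rowcount == 1  # False: another actor won the CAS race

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warm_instances ("
    "instance_name TEXT PRIMARY KEY, state TEXT, current_lease_run_id TEXT)"
)
conn.execute("INSERT INTO warm_instances VALUES ('vm-1', 'idle_ready', NULL)")

assert instance_transition(conn, "vm-1", "idle_ready", "claimed", "run-42")
# Second claimant loses the CAS: the WHERE state = 'idle_ready' no longer matches.
assert not instance_transition(conn, "vm-1", "idle_ready", "claimed", "run-43")
```

Because the state check and the lease write are a single `UPDATE ... WHERE state = ?`, the partial-failure window between separate lease and transition transactions simply does not exist.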
…s check

- cancel_run always runs _cleanup_backend for immediate GCE termination,
  not deferred to daemon polling. The controller transition still happens
  for warm-pool state tracking, but the VM is killed right away.

- get_logs uses handle.stage_run_id for GCS path resolution, not
  handle.backend_handle (instance name). Warm-pool reuse uploads logs
  under runs/<stage_run_id>/, so using the instance name returned the
  wrong run's logs.

- idle_ready instances now get liveness checks (3 consecutive failures
  before preemption). Spot VMs preempted while idle were previously
  invisible until idle timeout, consuming pool capacity and causing
  failed claims.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- check_instance_status returns "dead" (not "not_found") for VMs in
  TERMINATED/STOPPED/SUSPENDED state. DELETING branch accepts both
  "dead" and "not_found" as deletion confirmation, but transient
  gcloud errors ("error") no longer falsely confirm deletion.

- Orphaned controller failure: set goldfish_warm_pool_disabled metadata
  on the VM so it self-deletes after first job instead of entering an
  untracked idle loop.

- Missing/purged run handling: BUSY and DRAINING instances with a lease
  pointing to a non-existent run are now treated as terminal and
  finalized, preventing indefinite stranding.

- _force_release_lease also updates instance_leases audit table to
  prevent stale active lease rows from persisting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…errors

- Missing/purged runs are now treated as "terminated" when the daemon
  detects stale leases, so the controller emits DELETE_REQUESTED (kill
  the VM) instead of JOB_FINISHED (recycle it for reuse).

- Liveness checks now use check_instance_status() tri-state instead of
  the boolean check_instance_alive(). Only "not_found" and "dead" count
  toward the 3-strike preemption threshold. Transient gcloud errors
  ("error") neither increment the counter nor reset it, preventing
  healthy VMs from being killed during API outages.

- Fixed stale comment about --async in DELETING branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
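The tri-state 3-strike liveness rule can be sketched as below. The function name and counter dict are hypothetical; the behavior (increment on "not_found"/"dead", reset on a healthy check, leave untouched on "error") follows the commit text.

```python
def update_liveness(counters, name, status, threshold=3):
    """Tri-state liveness: only hard-dead statuses count toward preemption."""
    if status in ("not_found", "dead"):
        counters[name] = counters.get(name, 0) + 1
    elif status == "alive":
        counters[name] = 0
    # "error" (transient gcloud/API failure): neither increment nor reset,
    # so healthy VMs are not killed during an API outage.
    return counters.get(name, 0) >= threshold

counters = {}
assert not update_liveness(counters, "vm-1", "dead")
assert not update_liveness(counters, "vm-1", "error")      # no increment, no reset
assert not update_liveness(counters, "vm-1", "not_found")
assert update_liveness(counters, "vm-1", "dead")           # third strike: preempted
```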
…paths

- cancel_run skips backend cleanup when the backend_handle is a known
  warm-pool instance that this run no longer owns. Prevents canceling
  an already-finalized run from killing a VM that was recycled for a
  different job.

- Idle-loop error branches (spec download failure, missing spec_gcs_path,
  missing required fields) now sleep 1s before continue, preventing
  tight CPU spin when the same metadata signal is retried.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a warm-pool claim times out, the timed-out VM still has the
goldfish job metadata and can ACK late, starting the same stage_run_id
that the caller is about to launch on a fresh VM. Both VMs then write
to the same GCS paths, causing duplicate execution and clobbered outputs.

Fix: clear the goldfish metadata key on the timed-out VM before
returning None, so even if the VM reads the signal late, it finds
an empty value and stays idle. Best-effort — if the VM is unreachable
(which is likely why it timed out), the metadata clear fails silently
and the VM will be deleted by the daemon anyway.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e release

- Launch failure zone discovery: when launch_instance() raises after
  capacity search may have created the VM in a different zone, discover
  the actual zone via _find_instance_zone() before marking deleting.
  Prevents daemon from looking in the wrong zone and dropping the row
  while the real VM runs untracked.

- Stale draining lease: the recovery path incorrectly called
  on_run_terminal (which emits JOB_FINISHED, only valid from busy) on
  draining instances, then unconditionally cleared the local lease flag
  even when the transition failed. Now uses _force_release_lease()
  directly for draining instances (already past busy) and only clears
  the flag on success, preventing idle_ready instances with stale
  current_lease_run_id from becoming permanently unclaimable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

- cancel_run: skip cleanup only if the warm-pool instance is owned by a
  different run (checks current_lease_run_id != run_id), not merely if
  a warm-pool row exists. Prevents skipping cleanup for runs that still
  own their VM.

- Draining timeout: instances stuck in draining for >30 minutes are now
  transitioned to deleting. Previously there was no timeout, so a failed
  metadata update could strand a VM in draining permanently.

- Delete confirmation: only confirm deletion on "not_found" (VM truly
  gone from GCE), not "dead" (TERMINATED/STOPPED — still exists as a
  resource). Also clean up liveness counter on delete confirmation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gging

- on_claim_timeout: only force-release the lease if this run still owns
  the instance (current_lease_run_id == stage_run_id). Prevents wiping
  the lease from a VM that already advanced to busy via CLAIM_ACKED race.

- try_claim: early return if bucket is None, preventing gs://None/...
  GCS paths when warm pool is enabled without GCS config.

- Lease audit: INSERT ... ON CONFLICT DO UPDATE reactivates released
  lease rows for the same (instance, run) pair instead of silently
  ignoring the insert.

- _get_warm_pool_manager: log warning on failure instead of silent
  swallow, so warm-pool polling failures are visible to operators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
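The lease-audit upsert from the third bullet can be sketched with SQLite's `ON CONFLICT DO UPDATE` clause (available since SQLite 3.24). The table layout here is an assumption; only the reactivate-instead-of-ignore behavior comes from the commit text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instance_leases ("
    "instance_name TEXT, stage_run_id TEXT, active INTEGER, "
    "PRIMARY KEY (instance_name, stage_run_id))"
)

def record_lease(conn, name, run_id):
    # Reactivate a previously released row for the same (instance, run)
    # pair instead of silently ignoring the insert.
    conn.execute(
        "INSERT INTO instance_leases (instance_name, stage_run_id, active) "
        "VALUES (?, ?, 1) "
        "ON CONFLICT(instance_name, stage_run_id) DO UPDATE SET active = 1",
        (name, run_id),
    )

record_lease(conn, "vm-1", "run-42")
conn.execute("UPDATE instance_leases SET active = 0")  # lease released
record_lease(conn, "vm-1", "run-42")                   # retried claim reactivates
assert conn.execute("SELECT active FROM instance_leases").fetchone()[0] == 1
```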
Split the shared exclude_args into outputs_excludes and tmp_excludes
so /mnt/outputs paths only apply to the /mnt/outputs find, and /tmp
paths only apply to the /tmp find. Previously all exclusions were
injected into both commands, making cross-root exclusions meaningless.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…force-release

Two claim-race fixes:

1. When on_claim_acked fails (instance left claimed state), clear the
   goldfish metadata before returning None. The VM already ACKed and
   has the job signal — without clearing it, both the warm VM and
   the fresh fallback launch execute the same stage_run_id.

2. on_claim_timeout fallback: require state == "claimed" in addition
   to current_lease_run_id == stage_run_id before force-releasing.
   If CLAIM_ACKED won the race, the instance is now busy with a valid
   lease for the running job — releasing it would strand the job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@lukacf lukacf merged commit e1e3d98 into main Mar 26, 2026
12 of 13 checks passed
@lukacf lukacf deleted the feat/warm-pool-v2 branch March 26, 2026 10:43
@claude

claude Bot commented Mar 26, 2026

test comment

@claude

claude Bot commented Mar 26, 2026

PR #48 Review: Warm-Pool v2 State Machine

@claude

claude Bot commented Mar 26, 2026

PR 48 Review - Warm-Pool v2 State Machine

This is a well-structured architectural upgrade. The state-machine-driven lifecycle is a clear improvement over scattered status rewrites, and the test coverage is extensive. A few issues worth fixing before merge, roughly ordered by severity.


HIGH SEVERITY

  1. executescript() breaks migration atomicity (db/database.py)

Python sqlite3.executescript() issues an implicit COMMIT before executing. If the process crashes between executescript() returning and the final RENAME TO warm_instances, the database is left with an empty warm_instances_v2 and no warm_instances table - an unrecoverable state. Fix: replace executescript() with individual conn.execute() calls, keeping everything inside the parent context-manager transaction.
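A sketch of the suggested fix, with an explicitly managed transaction so the rename can never be separated from the preceding steps. Column names and the copy step are illustrative placeholders for the real v8 schema:

```python
import sqlite3

def migrate_v8(conn: sqlite3.Connection) -> None:
    """All migration steps in one explicit transaction.

    sqlite3.executescript() issues an implicit COMMIT before running, so a
    crash before the final RENAME could strand an empty warm_instances_v2
    with no warm_instances. Individual execute() calls inside one BEGIN /
    COMMIT make the whole migration atomic.
    """
    conn.isolation_level = None  # autocommit; we manage the transaction ourselves
    conn.execute("BEGIN")
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS warm_instances_v2 ("
            "instance_name TEXT PRIMARY KEY, state TEXT, "
            "current_lease_run_id TEXT, state_entered_at TEXT)"
        )
        old = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name = 'warm_instances'"
        ).fetchone()
        if old:
            conn.execute(
                "INSERT INTO warm_instances_v2 (instance_name, state) "
                "SELECT instance_name, state FROM warm_instances"
            )
            conn.execute("DROP TABLE warm_instances")
        conn.execute("ALTER TABLE warm_instances_v2 RENAME TO warm_instances")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    conn.execute("COMMIT")

# Fresh database (no prior warm pool table): the table still gets created.
conn = sqlite3.connect(":memory:")
migrate_v8(conn)
assert conn.execute(
    "SELECT name FROM sqlite_master WHERE name = 'warm_instances'"
).fetchone()
```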

  2. awaiting_user_finalization treated as non-owning in daemon (stage_daemon.py)

The non_owning set includes awaiting_user_finalization, causing the daemon to release the warm VM (emit DELETE_REQUESTED) while a run is still awaiting user review. The VM holds outputs the user is reviewing - deleting it mid-review is incorrect. awaiting_user_finalization is not terminal; the run can still be approved/rejected and write outputs. It should not be in non_owning.

  3. create_warm_pool_manager ignores storage: section (cloud/factory.py)

raw_bucket = config.gcs.bucket if config.gcs else None only reads from the legacy gcs: config section. Per the config hierarchy in CLAUDE.md, storage: takes precedence over gcs:. An operator using storage.backend: gcs with storage.gcs.bucket: my-bucket and no legacy gcs: section gets bucket=None, silently disabling warm pool dispatch with no error. Use the same bucket-resolution logic as create_run_backend().


MEDIUM SEVERITY

  1. DELETING has no DELETE_REQUESTED transition (instance_transitions.py)

If on_delete_requested() is called on an instance already in DELETING (e.g. warm_pool_cleanup retrying), the CAS finds no matching transition and returns no_transition instead of being idempotent. A self-transition should be added. Also, DELETABLE_STATES is defined but never consulted by instance_transition() - it is dead code.

  2. Self-transition idempotency check fails in instance_core.py

The guard condition t.from_state != current_state means that for DELETE_FAILED (which stays in DELETING), t.from_state == t.to_state == current_state, so the condition is False. A second on_delete_failed() call returns no_transition instead of idempotency. Fix: remove the guard or handle self-transitions explicitly.
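One way to handle self-transitions explicitly, as a pure-logic sketch (the real `instance_core.py` does this inside the CAS UPDATE; the function and result strings here are assumptions based on the review text):

```python
def apply_transition(current_state, t_from, t_to, event):
    """Return (new_state, result) with self-transitions handled explicitly."""
    if current_state == t_to and t_from != t_to:
        # Already moved to the target by an earlier call: idempotent no-op.
        return current_state, "already_in_target_state"
    if current_state != t_from:
        return current_state, "no_transition"
    # Reached for both normal moves and self-transitions such as
    # DELETE_FAILED (deleting -> deleting), which the blanket
    # `from_state != current_state` guard incorrectly rejected.
    return t_to, "ok"

assert apply_transition("busy", "busy", "draining", "JOB_FINISHED") == ("draining", "ok")
assert apply_transition("draining", "busy", "draining", "JOB_FINISHED") == (
    "draining", "already_in_target_state")
assert apply_transition("deleting", "deleting", "deleting", "DELETE_FAILED") == (
    "deleting", "ok")
```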

  3. Unbounded retry loop in try_claim() (warm_pool.py)

The while True loop in try_claim has no iteration cap. If find_claimable_instance returns the same row repeatedly due to a DB bug, this hangs the executor thread permanently. A cap of ~20 iterations with a fall-through to fresh launch is much safer than no bound at all.
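The suggested bounded loop could look roughly like this; the callable parameters stand in for the real `find_claimable_instance` query and CAS claim, which are not reproduced here:

```python
def try_claim(find_claimable, claim, max_attempts=20):
    """Bounded claim loop: after max_attempts CAS losses, fall through to a
    fresh launch instead of spinning forever on a buggy candidate query."""
    for _ in range(max_attempts):
        candidate = find_claimable()
        if candidate is None:
            return None  # pool empty: caller launches a fresh VM
        handle = claim(candidate)
        if handle is not None:
            return handle  # CAS won: warm instance claimed
    return None  # cap hit: treat as no warm instance available

# Simulated DB bug: the same lost-race instance is returned forever.
calls = {"n": 0}
def find_stuck():
    calls["n"] += 1
    return "vm-stuck"

assert try_claim(find_stuck, lambda c: None) is None  # terminates instead of hanging
assert calls["n"] == 20
```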

  4. Dead code: create_instance_lease / release_instance_lease (db/database.py)

These methods are never called by the state machine. Calling them directly would bypass the CAS and create lease records without updating current_lease_run_id - a split-brain inconsistency. Either remove them or clearly document they must never mutate lease state.

  5. WarmPoolManager recreated on every daemon poll tick (stage_daemon.py)

_get_warm_pool_manager() calls create_warm_pool_manager() on every poll cycle, constructing a new WarmPoolManager + InstanceController every few seconds. Pass the manager at construction time instead.

  6. JSON parsed with grep/sed in idle loop shell script (startup_builder.py)

The grep/sed pattern breaks on JSON with escaped characters or if the key appears earlier in a string value. python3 is already used in the same loop - use python3 -c with json.load() instead.
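A sketch of the Python side of that suggestion: the idle-loop shell would pipe the signal file into `python3 -c` running something like the helper below (function name and fallback behavior are assumptions), which survives inputs that break the grep/sed pattern.

```python
import io
import json

def read_signal_key(stream, key):
    """Extract a string field from a JSON signal; empty string on any failure,
    mirroring what a missing grep match would produce in the shell."""
    try:
        value = json.load(stream).get(key)
    except (json.JSONDecodeError, AttributeError):
        return ""
    return value if isinstance(value, str) else ""

# Escaped quotes and an earlier occurrence of the key name inside a string
# value would both defeat a grep/sed extraction; json.load is unaffected.
signal = (
    '{"note": "spec_gcs_path appears \\"here\\" too", '
    '"spec_gcs_path": "gs://bucket/spec.json"}'
)
assert read_signal_key(io.StringIO(signal), "spec_gcs_path") == "gs://bucket/spec.json"
```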

  7. abort_warm_job uploads logs to wrong GCS path on early failures (startup_builder.py)

abort_warm_job uses the global GCS_STDOUT_PATH/GCS_STDERR_PATH. When called before the path-reset section, these still point to previous job paths. Pass GCS paths as explicit arguments to abort_warm_job.


LOW SEVERITY / CLAUDE.md Conventions

  1. Dead enum members in instance_types.py: InstanceEvent.JOB_STARTED and InstanceEvent.IDLE_READY appear in no transition in the table.

  2. _find_instance_for_run() missing cast() and TypedDict return type (instance_controller.py). Per CLAUDE.md: when returning TypedDict, ALWAYS use cast().

  3. Imports inside method bodies: import os inside finally block in warm_pool.py; datetime import inside _force_release_lease() when already at module top; InstanceController imported twice in cancel.py. Per CLAUDE.md: move ALL imports to top.

  4. Timeout constants not YAML-configurable (stage_daemon.py): _LAUNCHING_TIMEOUT_SECONDS and _DRAINING_TIMEOUT_SECONDS should be fields in WarmPoolConfig.


Minor Nitpicks

  • on_launch_failed has no transition from DRAINING or IDLE_READY - a racing claim that calls on_launch_failed will silently stall the instance with no lease.
  • cancel.py: the error-sanitization comment was removed - restore it so the generic message looks intentional.
  • cancel.py: when lease is None and state is deleting, _cleanup_backend fires a redundant delete call - harmless but unnecessary.

Overall this is a solid, well-tested feature. The three high-severity issues (migration atomicity, awaiting_user_finalization VM release, storage: config bypass) should be resolved before merging. The medium items are real correctness bugs likely to surface under operational conditions. The rest is polish.

