Warm-pool v2: state-machine-driven reusable GCE lifecycle #48
Conversation
…gle controller

Replace scattered warm-pool lifecycle logic (~10 code locations making ad-hoc status updates) with a first-class state machine:

- 7 states (launching→busy→draining→idle_ready→claimed→deleting→gone)
- 13 events with explicit transition table (20 transitions)
- CAS semantics via instance_transition() mirroring stage run state machine
- Explicit leases (instance_leases table) replace ownership guesswork
- Atomic capacity gate via conditional INSERT (no separate has_capacity check)
- Reuse ordering gate: busy→draining→idle_ready enforced by state machine

InstanceController is the single entry point for all instance lifecycle decisions. No code writes warm_instances.state directly.

Key fixes found via multi-model bug hunting (GPT-5.4 + Opus 4.6 + Gemini):

- Claim signaling uses --metadata-from-file (JSON-safe, no comma mangling)
- Flat signal JSON (spec_gcs_path at top level, not nested in payload)
- Pre-registration uses launcher's _sanitize_name (name matches actual VM)
- Launch failures call on_launch_failed (no stranded launching rows)
- Launching timeout in daemon (15min safety net)
- Lease creation failure in on_fresh_launch fails the operation (no strandable busy)
- ACK after spec validation (not before)
- set -e safe wait pattern (EXIT_CODE=0; wait $PID || EXIT_CODE=$?)
- Env vars passed to Docker via -e KEY=VALUE (not sourced into host shell)
- goldfish_instance_state=idle_ready metadata on both first-boot and reuse
- Cleanup preserves job_spec.json, resets GCS_LOG_PATH/GCS_STAGE_LOG_PATH
- log_syncer snapshots GCS paths per-job (no stale first-boot paths)
- Stale lease detection for draining instances with terminal runs
- Claim retry loop (3 attempts) for CAS races
- export GCS_BUCKET for Python child processes
- find_claimable_instance ORDER BY state_entered_at (oldest idle first)
- warm_pool_cleanup cancels owning runs before deleting (no orphaned runs)
- Accurate deleted vs deletion_pending counts in cleanup response

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
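The CAS transition at the heart of the design can be sketched roughly as follows. This is an illustrative sketch only: the transition subset, table name, and column names are assumptions, not the PR's actual schema.

```python
import sqlite3

# Subset of the 20-entry transition table (illustrative, not the PR's full table).
TRANSITIONS = {
    # (from_state, event) -> to_state
    ("launching", "LAUNCH_DONE"): "busy",
    ("busy", "JOB_FINISHED"): "draining",
    ("draining", "DRAIN_COMPLETE"): "idle_ready",
    ("idle_ready", "CLAIM_REQUESTED"): "claimed",
    ("claimed", "CLAIM_ACKED"): "busy",
}

def instance_transition(conn, name, from_state, event):
    """Apply one event; succeed only if the row is still in from_state."""
    to_state = TRANSITIONS.get((from_state, event))
    if to_state is None:
        return "no_transition"
    cur = conn.execute(
        "UPDATE warm_instances SET state = ?, state_entered_at = datetime('now') "
        "WHERE name = ? AND state = ?",  # CAS: the WHERE clause compares-and-swaps
        (to_state, name, from_state),
    )
    conn.commit()
    return to_state if cur.rowcount == 1 else "lost_race"
```

Because the current state is part of the WHERE clause, two concurrent callers cannot both win the same transition; the loser sees rowcount 0 and must re-read state.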
…ocker failures

Two ship-blocking fixes:

1. Migration v8 now always creates warm_instances via CREATE TABLE IF NOT EXISTS, not just when upgrading from the old schema. Fresh upgrades from main (no prior warm pool table) were left without the table, causing "no such table: warm_instances" on the first warm-pool operation.
2. Input staging and image pull failures in the warm reuse path now set goldfish_instance_state=idle_ready before continuing to idle. Without this, the run finalizes busy→draining but poll_warm_instances never sees the metadata it needs for DRAIN_COMPLETE, stranding the instance in draining permanently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…and controller routing

Fixes from bug-hunter rounds:

- Transition-first lease ordering: on_run_terminal, on_launch_failed, and on_claim_timeout now transition state BEFORE releasing the lease, preventing orphaned busy instances when the transition fails after lease release
- Clear stale goldfish_instance_state metadata on new job claim: sets metadata to "busy" before ACK to prevent the daemon from seeing an old idle_ready value during the next draining phase
- Consecutive liveness failures (3) required before declaring preemption: transient gcloud/API errors no longer immediately mark live VMs as gone
- on_claim_acked result checked: a failed ACK transition now rolls back the claim instead of returning a handle to an unowned instance
- on_fresh_launch failure cleanup: calls on_launch_failed to remove the pre-registered row instead of leaving it for the launching timeout
- find_claimable_instance excludes instances with active leases: prevents the claim retry loop from repeatedly picking the same lost-race instance
- delete_instance routes through controller: on_delete_requested + on_delete_confirmed for audit trail, instead of bypassing the state machine
- Remove FK CASCADE on instance_state_transitions: audit history is preserved after warm instance row deletion
- cancel_run skips direct backend cleanup when the controller handled it: avoids a redundant synchronous gcloud delete for warm-pool instances
- Warm-pool variable initialization before try block: prevents UnboundLocalError in the exception handler on early failures
- gcloud delete timeout reduced to 15s: daemon retries on next poll

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ightening

- on_preempted/on_delete_requested: transition-first, then release lease (consistent with on_run_terminal/on_claim_timeout/on_launch_failed)
- on_claim_timeout: always release the lease regardless of transition result (the caller abandoned the claim; a stale lease blocks the instance forever)
- pre_register_warm_instance: INSERT OR IGNORE prevents IntegrityError on duplicate stage_run_id retries
- find_claimable_instance: NOT EXISTS subquery excludes instances with active leases (prevents claim retry picking the same lost-race instance)
- instance_transition idempotency: require from_state != current_state to prevent spurious "already_in_target_state" for unrelated events
- cancel_run: require ctrl_result.success before skipping backend cleanup

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Controller registration failure: remove the pre-registered row directly instead of calling on_launch_failed (which would transition to deleting and eventually kill the live, running VM)
- LAST_JOB_REQ_ID assignment moved after ACK: transient spec-download failures no longer permanently suppress retries for the same request

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
If the same (instance_name, stage_run_id) lease already exists (e.g. retried call), treat it as idempotent success instead of failing and triggering the "delete pre-registered row" cleanup path that would orphan a successfully launched VM. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ase recovery

- Async delete verification: after issuing an --async gcloud delete, verify the VM is actually gone via a liveness check before removing the DB row. Prevents the pool from exceeding max_instances while the old VM is still alive.
- Dotfile cleanup: default cleanup uses find instead of rm -rf * so hidden directories like .goldfish/ (metrics, SVS artifacts) are removed between warm-pool reuse cycles. Prevents stale artifacts from being uploaded under the next run's path.
- Busy-instance lease recovery: poll_warm_instances now checks busy instances for stale leases pointing to terminal/non-owning runs (same pattern as draining). Handles an executor crash after the run goes terminal but before on_run_terminal releases the lease.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Eliminate the entire class of partial-failure bugs between lease and state operations by making them a single CAS UPDATE.

Before: create_instance_lease() + instance_transition() were separate DB transactions. A crash between them left inconsistent state (orphaned leases, busy instances without leases, stale leases blocking claims).

After: instance_transition(set_lease_run_id=...) atomically sets both state and current_lease_run_id in one UPDATE ... WHERE state = ? CAS. No partial-failure window exists.

Changes:

- warm_instances gains current_lease_run_id column (nullable TEXT)
- instance_transition() gains set_lease_run_id parameter
- Controller methods pass lease ops through to instance_transition()
- find_claimable_instance uses current_lease_run_id IS NULL (no subquery)
- instance_leases table kept as append-only audit log
- Daemon reads lease from instance row directly (no separate lookup)
- All tests updated to use atomic lease model

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
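A minimal sketch of the atomic state+lease CAS, assuming illustrative table and column names (the real function signature and schema may differ):

```python
import sqlite3

def instance_transition(conn, name, from_state, to_state,
                        set_lease_run_id=None, clear_lease=False):
    """Move state AND (optionally) the lease in ONE CAS UPDATE, so no crash
    window exists between the lease write and the state write."""
    if set_lease_run_id is not None:
        sql = ("UPDATE warm_instances SET state = ?, current_lease_run_id = ? "
               "WHERE name = ? AND state = ? AND current_lease_run_id IS NULL")
        args = (to_state, set_lease_run_id, name, from_state)
    elif clear_lease:
        sql = ("UPDATE warm_instances SET state = ?, current_lease_run_id = NULL "
               "WHERE name = ? AND state = ?")
        args = (to_state, name, from_state)
    else:
        sql = "UPDATE warm_instances SET state = ? WHERE name = ? AND state = ?"
        args = (to_state, name, from_state)
    ok = conn.execute(sql, args).rowcount == 1
    conn.commit()
    return ok

def find_claimable_instance(conn):
    # current_lease_run_id IS NULL replaces the NOT EXISTS subquery;
    # ORDER BY state_entered_at keeps the oldest-idle-first gate.
    row = conn.execute(
        "SELECT name FROM warm_instances WHERE state = 'idle_ready' "
        "AND current_lease_run_id IS NULL ORDER BY state_entered_at LIMIT 1"
    ).fetchone()
    return row[0] if row else None
```

The key property: a crash at any point leaves the row either fully pre-claim or fully post-claim, never in an orphaned-lease middle state.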
…s check

- cancel_run always runs _cleanup_backend for immediate GCE termination, not deferred to daemon polling. The controller transition still happens for warm-pool state tracking, but the VM is killed right away.
- get_logs uses handle.stage_run_id for GCS path resolution, not handle.backend_handle (the instance name). Warm-pool reuse uploads logs under runs/<stage_run_id>/, so using the instance name returned the wrong run's logs.
- idle_ready instances now get liveness checks (3 consecutive failures before preemption). Spot VMs preempted while idle were previously invisible until idle timeout, consuming pool capacity and causing failed claims.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- check_instance_status returns "dead" (not "not_found") for VMs in
TERMINATED/STOPPED/SUSPENDED state. DELETING branch accepts both
"dead" and "not_found" as deletion confirmation, but transient
gcloud errors ("error") no longer falsely confirm deletion.
- Orphaned controller failure: set goldfish_warm_pool_disabled metadata
on the VM so it self-deletes after first job instead of entering an
untracked idle loop.
- Missing/purged run handling: BUSY and DRAINING instances with a lease
pointing to a non-existent run are now treated as terminal and
finalized, preventing indefinite stranding.
- _force_release_lease also updates instance_leases audit table to
prevent stale active lease rows from persisting.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
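The dead/not_found/error distinction above can be sketched as a small classifier. The real check shells out to gcloud; this hypothetical helper takes the already-parsed result instead, and the function name is an assumption:

```python
# GCE statuses that mean "exists as a resource but is not running".
DEAD_STATUSES = {"TERMINATED", "STOPPED", "SUSPENDED"}

def classify_instance_status(gcloud_status=None, gcloud_failed=False, not_found=False):
    """Return one of 'alive', 'dead', 'not_found', 'error'."""
    if gcloud_failed:
        return "error"      # transient API failure: must never confirm deletion
    if not_found:
        return "not_found"  # VM gone from GCE entirely
    if gcloud_status in DEAD_STATUSES:
        return "dead"       # not running, but still a billable/extant resource
    return "alive"
```

The DELETING branch then accepts "dead" or "not_found" as confirmation, while "error" is treated as no information at all.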
…errors
- Missing/purged runs are now treated as "terminated" when the daemon
detects stale leases, so the controller emits DELETE_REQUESTED (kill
the VM) instead of JOB_FINISHED (recycle it for reuse).
- Liveness checks now use check_instance_status() tri-state instead of
the boolean check_instance_alive(). Only "not_found" and "dead" count
toward the 3-strike preemption threshold. Transient gcloud errors
("error") neither increment the counter nor reset it, preventing
healthy VMs from being killed during API outages.
- Fixed stale comment about --async in DELETING branch.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
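The 3-strike rule above can be sketched as follows; the counter dict and function name are assumptions, but the threshold of 3 and the handling of each status come from the commit message:

```python
LIVENESS_THRESHOLD = 3  # consecutive dead/not_found results before preemption

def update_liveness(counters, name, status):
    """Return True once the instance should be declared preempted.
    Only 'not_found'/'dead' count toward the threshold; 'error' neither
    increments nor resets, so API outages cannot kill healthy VMs."""
    if status == "alive":
        counters[name] = 0                      # healthy: reset the strike count
    elif status in ("not_found", "dead"):
        counters[name] = counters.get(name, 0) + 1
    # status == "error": transient failure, leave the counter untouched
    return counters.get(name, 0) >= LIVENESS_THRESHOLD
```

Note how an interleaved "error" preserves accumulated strikes without adding one, which is exactly the behavior the fix describes.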
…paths

- cancel_run skips backend cleanup when the backend_handle is a known warm-pool instance that this run no longer owns. Prevents canceling an already-finalized run from killing a VM that was recycled for a different job.
- Idle-loop error branches (spec download failure, missing spec_gcs_path, missing required fields) now sleep 1s before continue, preventing a tight CPU spin when the same metadata signal is retried.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a warm-pool claim times out, the timed-out VM still has the goldfish job metadata and can ACK late, starting the same stage_run_id that the caller is about to launch on a fresh VM. Both VMs then write to the same GCS paths, causing duplicate execution and clobbered outputs. Fix: clear the goldfish metadata key on the timed-out VM before returning None, so even if the VM reads the signal late, it finds an empty value and stays idle. Best-effort — if the VM is unreachable (which is likely why it timed out), the metadata clear fails silently and the VM will be deleted by the daemon anyway. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e release

- Launch failure zone discovery: when launch_instance() raises after the capacity search may have created the VM in a different zone, discover the actual zone via _find_instance_zone() before marking deleting. Prevents the daemon from looking in the wrong zone and dropping the row while the real VM runs untracked.
- Stale draining lease: the recovery path incorrectly called on_run_terminal (which emits JOB_FINISHED, only valid from busy) on draining instances, then unconditionally cleared the local lease flag even when the transition failed. Now uses _force_release_lease() directly for draining instances (already past busy) and only clears the flag on success, preventing idle_ready instances with a stale current_lease_run_id from becoming permanently unclaimable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tion

- cancel_run: skip cleanup only if the warm-pool instance is owned by a different run (checks current_lease_run_id != run_id), not merely if a warm-pool row exists. Prevents skipping cleanup for runs that still own their VM.
- Draining timeout: instances stuck in draining for >30 minutes are now transitioned to deleting. Previously there was no timeout, so a failed metadata update could strand a VM in draining permanently.
- Delete confirmation: only confirm deletion on "not_found" (VM truly gone from GCE), not "dead" (TERMINATED/STOPPED — still exists as a resource). Also clean up the liveness counter on delete confirmation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gging

- on_claim_timeout: only force-release the lease if this run still owns the instance (current_lease_run_id == stage_run_id). Prevents wiping the lease from a VM that already advanced to busy via a CLAIM_ACKED race.
- try_claim: early return if bucket is None, preventing gs://None/... GCS paths when the warm pool is enabled without GCS config.
- Lease audit: INSERT ... ON CONFLICT DO UPDATE reactivates released lease rows for the same (instance, run) pair instead of silently ignoring the insert.
- _get_warm_pool_manager: log a warning on failure instead of silently swallowing it, so warm-pool polling failures are visible to operators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Split the shared exclude_args into outputs_excludes and tmp_excludes so /mnt/outputs paths only apply to the /mnt/outputs find, and /tmp paths only apply to the /tmp find. Previously all exclusions were injected into both commands, making cross-root exclusions meaningless. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…force-release

Two claim-race fixes:

1. When on_claim_acked fails (instance left in claimed state), clear the goldfish metadata before returning None. The VM already ACKed and has the job signal — without clearing it, both the warm VM and the fresh fallback launch execute the same stage_run_id.
2. on_claim_timeout fallback: require state == "claimed" in addition to current_lease_run_id == stage_run_id before force-releasing. If CLAIM_ACKED won the race, the instance is now busy with a valid lease for the running job — releasing it would strand the job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #48 Review: Warm-Pool v2 State Machine
This is a well-structured architectural upgrade. The state-machine-driven lifecycle is a clear improvement over scattered status rewrites, and the test coverage is extensive. A few issues worth fixing before merge, roughly ordered by severity.

HIGH SEVERITY
Python sqlite3.executescript() issues an implicit COMMIT before executing. If the process crashes between executescript() returning and the final RENAME TO warm_instances, the database is left with an empty warm_instances_v2 and no warm_instances table - an unrecoverable state. Fix: replace executescript() with individual conn.execute() calls, keeping everything inside the parent context-manager transaction.
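A sketch of the suggested fix, with an illustrative schema (the real migration's columns and steps may differ): manage the transaction explicitly so every statement, including the final RENAME, commits or rolls back as one unit.

```python
import sqlite3

def migrate_warm_instances(conn):
    """v8-style migration without executescript(): all statements run inside
    one explicit transaction, so a crash mid-migration rolls back instead of
    leaving warm_instances_v2 populated and warm_instances missing."""
    conn.isolation_level = None  # autocommit mode; we issue BEGIN/COMMIT ourselves
    try:
        conn.execute("BEGIN")
        conn.execute(
            "CREATE TABLE warm_instances_v2 ("
            " name TEXT PRIMARY KEY, state TEXT NOT NULL,"
            " current_lease_run_id TEXT, state_entered_at TEXT)"
        )
        conn.execute(
            "INSERT INTO warm_instances_v2 (name, state)"
            " SELECT name, state FROM warm_instances"
        )
        conn.execute("DROP TABLE warm_instances")
        conn.execute("ALTER TABLE warm_instances_v2 RENAME TO warm_instances")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

SQLite DDL is transactional, so even the CREATE/DROP/RENAME sequence rolls back cleanly; executescript() forfeits this by committing any open transaction before it runs.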
The non_owning set includes awaiting_user_finalization, causing the daemon to release the warm VM (emit DELETE_REQUESTED) while a run is still awaiting user review. The VM holds outputs the user is reviewing - deleting it mid-review is incorrect. awaiting_user_finalization is not terminal; the run can still be approved/rejected and write outputs. It should not be in non_owning.
raw_bucket = config.gcs.bucket if config.gcs else None only reads from the legacy gcs: config section. Per the config hierarchy in CLAUDE.md, storage: takes precedence over gcs:. An operator using storage.backend: gcs with storage.gcs.bucket: my-bucket and no legacy gcs: section gets bucket=None, silently disabling warm pool dispatch with no error. Use the same bucket-resolution logic as create_run_backend().

MEDIUM SEVERITY
If on_delete_requested() is called on an instance already in DELETING (e.g. warm_pool_cleanup retrying), the CAS finds no matching transition and returns no_transition instead of being idempotent. A self-transition should be added. Also, DELETABLE_STATES is defined but never consulted by instance_transition() - it is dead code.
The guard condition t.from_state != current_state means that for DELETE_FAILED (which stays in DELETING), t.from_state == t.to_state == current_state, so the condition is False. A second on_delete_failed() call returns no_transition instead of idempotency. Fix: remove the guard or handle self-transitions explicitly.
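One way to handle it, sketched with hypothetical names: classify the retry before attempting the CAS, so a self-transition (from == to) is always applicable rather than being blocked by the from_state != current_state guard.

```python
def classify_retry(current_state, from_state, to_state):
    """Decide how to treat a transition request against the row's current state.
    Self-transitions (from_state == to_state, e.g. DELETE_FAILED staying in
    deleting) hit the first branch and are re-applied idempotently, instead of
    falling through to no_transition as under the buggy guard."""
    if current_state == from_state:
        return "apply"                    # normal CAS path, incl. self-transitions
    if current_state == to_state:
        return "already_in_target_state"  # idempotent retry of a completed move
    return "no_transition"                # genuinely invalid for this event
```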
The while True loop in try_claim has no iteration cap. If find_claimable_instance returns the same row repeatedly due to a DB bug, this hangs the executor thread permanently. A cap of ~20 iterations with a fall-through to fresh launch is much safer than no bound at all.
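The capped loop could look roughly like this; the three callables stand in for the real helpers (find_claimable_instance, the claim attempt, and the fresh-launch fallback), whose actual signatures are not shown in the PR:

```python
MAX_CLAIM_ATTEMPTS = 20  # hard cap instead of an unbounded `while True`

def try_claim(find_claimable_instance, attempt_claim, fresh_launch):
    for _ in range(MAX_CLAIM_ATTEMPTS):
        name = find_claimable_instance()
        if name is None:
            break                 # pool empty: fall through to fresh launch
        handle = attempt_claim(name)
        if handle is not None:
            return handle         # won the CAS race on this instance
        # lost the race; loop and pick another candidate
    return fresh_launch()         # bounded fallback, can never hang the executor
```

Even if a DB bug makes the query return the same row forever, the thread exits after at most MAX_CLAIM_ATTEMPTS iterations.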
These methods are never called by the state machine. Calling them directly would bypass the CAS and create lease records without updating current_lease_run_id - a split-brain inconsistency. Either remove them or clearly document they must never mutate lease state.
_get_warm_pool_manager() calls create_warm_pool_manager() on every poll cycle, constructing a new WarmPoolManager + InstanceController every few seconds. Pass the manager at construction time instead.
The grep/sed pattern breaks on JSON with escaped characters or if the key appears earlier in a string value. python3 is already used in the same loop - use python3 -c with json.load() instead.
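The json.load() approach might look like the following in Python; the key name spec_gcs_path is from the PR, while the helper itself (and calling it via python3 -c or a small script from the idle loop) is illustrative:

```python
import json

def extract_key(raw_json, key):
    """Parse the signal JSON properly and return the value for `key`.
    Unlike grep/sed, this is immune to escaped characters and to the key
    name appearing inside an earlier string value."""
    return json.loads(raw_json).get(key)
```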
abort_warm_job uses the global GCS_STDOUT_PATH/GCS_STDERR_PATH. When called before the path-reset section, these still point to the previous job's paths. Pass GCS paths as explicit arguments to abort_warm_job.

LOW SEVERITY / CLAUDE.md Conventions
Minor Nitpicks
Overall this is a solid, well-tested feature. The three high-severity issues (migration atomicity, awaiting_user_finalization VM release, storage: config bypass) should be resolved before merging. The medium items are real correctness bugs likely to surface under operational conditions. The rest is polish.
Summary
What changed
- … (instance_types, instance_transitions, instance_core, instance_controller) and persist warm instance state + leases in the database

Verification

- make lint
- make test
- make test-integration

Notes for reviewers