This document defines the authoritative state machine for lifecycle Workers. It establishes the semantic meaning of each state, the allowed transitions, and the philosophy behind "Intent vs Outcome".
To resolve ambiguity between "finished" (success) and "stopped" (interruption), we distinguish based on termination intent:
- Natural Termination: The worker finished its task because it was done. -> Finished
- Requested Termination: The worker finished because it was asked to stop. -> Stopped
- Error Termination: The worker finished because it crashed or errored. -> Failed
- State: `Created`
- Definition: The worker instance exists and is initialized, but `Start()` has not been called.
- Context: Resources (structs) are allocated, but operational resources (goroutines, file handles) are not.
- Next Valid: `Starting`, `Pending`.
- State: `Pending`
- Definition: The worker is blocked from execution by an active governor (e.g., Supervisor backoff).
- Context: Waiting for a restart delay timer, rate limit slot, or circuit breaker reset.
- Next Valid: `Starting`, `Stopped` (if cancelled while pending).
- State: `Starting`
- Definition: `Start()` has been called, and initialization logic is executing.
- Context: Opening connections, spawning background goroutines. This state is transient.
- Next Valid: `Running`, `Stopped`, `Failed` (if init fails).
- State: `Running`
- Definition: The worker is actively executing its primary workload.
- Context: The main loop is active.
- Next Valid: `Suspended`, `Stopping`, `Stopped`, `Finished`, `Failed`.
- State: `Suspended`
- Definition: The worker has paused its execution (quiescence) but retains its resources.
- Context: Traffic is drained, no new work is accepted, but memory state is preserved.
- Next Valid: `Running` (via Resume), `Stopping` (via Stop).
- State: `Stopping`
- Definition: `Stop()` has been called (or the Context was cancelled), and the worker is performing graceful shutdown.
- Context: Draining final requests, closing connections, flushing buffers.
- Next Valid: `Stopped`, `Failed`, `Killed`.
- State: `Stopped`
- Semantic: Compliance.
- Definition: The worker terminated successfully (exit code 0 / no error) in response to a stop request or context cancellation.
- Logic: `StopRequested == true` OR `termio.IsInterrupted(err)`.
- State: `Finished`
- Semantic: Completion.
- Definition: The worker terminated successfully (exit code 0 / no error) without being requested to stop.
- Logic: `StopRequested == false` AND `err == nil`.
- State: `Failed`
- Semantic: Error.
- Definition: The worker terminated with a non-nil error or non-zero exit code.
- Logic: `err != nil` (and not an Interruption).
- State: `Killed`
- Semantic: Non-Compliance / Obstinacy.
- Definition: The worker failed to stop within the grace period and was forcefully terminated.
- Logic: `Killed == true`.
While Status describes the worker's position in the lifecycle, Health describes its internal operational viability. A worker in StatusRunning might be technically executing but logically "Broken" (e.g., disconnected from a database).
Starting in v1.8, workers can implement Probe(ctx) ProbeResult.
- Healthy ❤️: Internal invariants are met. System is proceeding normally.
- Unhealthy 💔: Failures detected (timeouts, dependency loss).
The Supervisor actively triggers these probes during state inspection, enabling high-fidelity diagnostic dashboards and diagrams.
To support auditing, state transitions capture precise timestamps:
- StartedAt: When the current instance reached `StatusRunning`.
- UpdatedAt: When the last `SetStatus` or `Health` change occurred.
| From | To | Trigger |
|---|---|---|
| Created | Starting | `Start()` called directly. |
| Created | Pending | Supervisor delays start (Backoff). |
| Pending | Starting | Backoff timer expires. |
| Pending | Stopped | Context cancelled while waiting. |
| Starting | Running | Initialization complete. |
| Starting | Failed | Initialization error. |
| Running | Suspended | `Suspend()` called. |
| Running | Stopping | `Stop()` or Context cancelled. |
| Running | Finished | Main loop returns nil. |
| Running | Failed | Main loop returns error. |
| Suspended | Running | `Resume()` called. |
| Suspended | Stopping | `Stop()` called while suspended. |
| Stopping | Stopped | Graceful shutdown complete. |
| Stopping | Failed | Shutdown error (non-timeout). |
| Stopping | Killed | Shutdown timeout (forced kill). |
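The table above can be encoded as a lookup for validating transitions. The following is a minimal sketch, not the library's implementation: the `Status` string type, the `allowed` map, and `CanTransition` are hypothetical names, and the map encodes only the rows listed in the table.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for the library's status type.
type Status string

const (
	StatusCreated   Status = "Created"
	StatusPending   Status = "Pending"
	StatusStarting  Status = "Starting"
	StatusRunning   Status = "Running"
	StatusSuspended Status = "Suspended"
	StatusStopping  Status = "Stopping"
	StatusStopped   Status = "Stopped"
	StatusFinished  Status = "Finished"
	StatusFailed    Status = "Failed"
	StatusKilled    Status = "Killed"
)

// allowed mirrors the transition table row by row. Terminal states have no
// entry, so every transition out of them is rejected.
var allowed = map[Status][]Status{
	StatusCreated:   {StatusStarting, StatusPending},
	StatusPending:   {StatusStarting, StatusStopped},
	StatusStarting:  {StatusRunning, StatusFailed},
	StatusRunning:   {StatusSuspended, StatusStopping, StatusFinished, StatusFailed},
	StatusSuspended: {StatusRunning, StatusStopping},
	StatusStopping:  {StatusStopped, StatusFailed, StatusKilled},
}

// CanTransition reports whether the table lists a from -> to row.
func CanTransition(from, to Status) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(StatusRunning, StatusSuspended)) // true
	fmt.Println(CanTransition(StatusStopped, StatusRunning))   // false
}
```

A map of slices keeps the encoding visually close to the table, which makes it easy to review the two side by side.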
The BaseWorker struct centralizes the state logic to ensure consistency across all worker types (Process, Func, Container).
Workers track their Termination Intent using BaseWorker fields:

    type BaseWorker struct {
        // ...
        status        Status
        StopRequested bool  // Was Stop() called?
        Killed        bool  // Was it force-killed?
        Err           error // Did it exit with an error?
    }

During execution, workers use SetStatus(new) to update their state. This method handles locking and emits StateChange events for introspection.

    // Example in ProcessWorker.Start
    p.SetStatus(StatusRunning)

When a worker exits (run loop returns), it calls Finish(err). This method calculates the final status using DeriveFinalStatus():

    func (b *BaseWorker) DeriveFinalStatus() Status {
        if b.Killed {
            return StatusKilled
        }
        // Interrupted errors (ContextCanceled, Terminated) are considered "Stopped"
        if b.Err != nil && termio.IsInterrupted(b.Err) {
            return StatusStopped
        }
        if b.Err != nil {
            return StatusFailed
        }
        if b.StopRequested {
            return StatusStopped
        }
        return StatusFinished
    }

This ensures that a worker which was asked to stop (StopRequested) and exits cleanly (err == nil) is correctly classified as Stopped, distinguishing it from one that finished naturally (Finished).
There is no StatusCancelled state. Such a status would typically mirror context.Canceled, but in lifecycle, context cancellation is the mechanism for stopping, not a distinct state. Cancellation is therefore mapped to StatusStopped (Compliance), which simplifies Supervisor logic.
Distinguishing Killed from Failed is crucial for reliability engineering. A Failed worker might be buggy, but a Killed worker is obstinate: it hangs during shutdown. This distinction allows Supervisors to apply different policies (e.g., alert on Killed, retry on Failed).
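A Supervisor policy keyed on the final status might look like the sketch below. The `Action` type and `policyFor` function are hypothetical and only illustrate the "alert on Killed, retry on Failed" example; real Supervisors may apply richer policies.

```go
package main

import "fmt"

type Status string

const (
	StatusStopped  Status = "Stopped"
	StatusFinished Status = "Finished"
	StatusFailed   Status = "Failed"
	StatusKilled   Status = "Killed"
)

// Action is a hypothetical description of what a Supervisor does next.
type Action string

const (
	ActionNone  Action = "none"
	ActionRetry Action = "retry"
	ActionAlert Action = "alert"
)

// policyFor maps a terminal status to an action. Stopped and Finished need
// no follow-up; Failed is retried; Killed raises an alert because the
// worker ignored the grace period.
func policyFor(final Status) Action {
	switch final {
	case StatusFailed:
		return ActionRetry
	case StatusKilled:
		return ActionAlert
	default:
		return ActionNone
	}
}

func main() {
	for _, s := range []Status{StatusStopped, StatusFinished, StatusFailed, StatusKilled} {
		fmt.Printf("%s -> %s\n", s, policyFor(s))
	}
}
```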