Note: This document describes the architecture of `lifecycle`, spanning its v1.0–v1.4 Foundation (Death Management) and the v1.5+ Control Plane (Life Management). For a history of architectural choices, see DECISIONS.md.
This section defines the architectural pillars that govern the library.
Architecture Note (Facade Pattern): The root `lifecycle` package acts as a Facade, exposing a curated subset of functionality from `pkg/core` and `pkg/events` for 90% of use cases. Deep consumers should import from the core packages directly, while application authors should prefer the root package for ergonomics. Tests in `lifecycle_test.go` verify this wiring but do not duplicate the exhaustive behavioral tests found in the core packages.
Technically, lifecycle is a Signal-Aware Control Plane and Interruptible I/O Supervisor for modern applications (Services, Agents, CLIs).
- Signal-Aware: It allows the application to distinguish between "User Requests" (`SIGINT`) and "System Demands" (`SIGTERM`), enabling intelligent shutdown policies (e.g., "Press Ctrl+C again to force quit").
- Interruptible: It creates a layer over blocking system calls (like `read`), allowing them to be abandoned instantly via Context cancellation, preventing goroutine leaks.
- Supervisor: It manages the lifecycle of child components (Processes, Containers, Goroutines), ensuring they are bound to the parent's lifetime.
To prevent "Memory Leaks" and "Zombie Processes", the system imposes explicit constraints:
We acknowledge that OS signals are inherently global. Instead of pretending they aren't, lifecycle manages this global state for you.
- Default Router: Like `net/http`, we provide a default multiplexer for ease of use.
- Clean Logic: Your business logic remains free of global side effects, relying on `Context` propagation and `Handler` interfaces.
We adopt a Fail-Closed default for child processes.
If the parent process crashes or is killed (SIGKILL), all child processes must die immediately. This is enforced via OS primitives on supported platforms:
- Linux: `SysProcAttr.Pdeathsig`
- Windows: Job Objects (`JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`)
- macOS/BSD: Not supported (best-effort cleanup via signals only).
Windows is a first-class citizen.
- We explicitly handle `CONIN$` to ensure `Ctrl+C` works reliably in interactive prompts.
- We normalize file system paths and signals to ensure behavior matches Unix expectations where possible.
Internal state changes are not black boxes. They are exposed via:
- Metrics: Counts and Histograms for every signal, hook, and I/O event.
- Introspection: Immutable `State()` methods that allow the application to visualize its own topology.
The lifecycle is bound to the Main Job (`lifecycle.Run(fn)`). When the main function returns, the application is considered Complete. `lifecycle` automatically cancels the Global Context, signaling all background tasks (`lifecycle.Go`, Supervisor) to shut down immediately. This prevents "Orphaned Processes", where a finished CLI tool hangs indefinitely waiting for a metrics reporter.
We believe in Simple Primitives, Rich Behaviors. Instead of a monolithic "Exit" function with 20 flags, we provide atomic events (Suspend, Resume, Shutdown, Reload) that can be chained.
- Composition is King: A "Power Command" (like `x` or `terminate`) is simply a sequence: a `SuspendEvent` (to ensure state is saved) followed by a `ShutdownEvent`.
- Context Cancellation is Normal: During shutdown, receiving `context.Canceled` is not an "Error" to be warned about. It is the sign of a healthy, responding system fulfilling its contract.
This section details the internal state machines and I/O handling strategies.
Our SignalContext manages the transition from Graceful to Forced shutdown based on a configurable Force-Exit Threshold.
```mermaid
stateDiagram-v2
    [*] --> Running
    Running --> Graceful: SIGTERM (1st) or SIGINT (Count == Threshold)
    note right of Graceful
        Context cancelled.
        App starts cleanup.
    end note
    Graceful --> ForceExit: Any Signal (Count > Threshold)
    note right of ForceExit
        os.Exit(1) called.
        Immediate termination.
    end note
    Running --> Running: SIGINT (Escalation Mode, Threshold >= 2)
    note left of Running
        Count < Threshold:
        ClearLineEvent emitted.
    end note
    ForceExit --> [*]
    Graceful --> [*]: Natural Cleanup Completes
```
Key Behaviors:
- Mode: Industry Standard (Threshold=1): The first `SIGINT` (Ctrl+C) or `SIGTERM` cancels the context. The second signal triggers `os.Exit(1)`. This is the default.
- Mode: Escalation (Threshold=N): `SIGINT` is captured and emitted as an event (`InterruptEvent`) without cancelling the context. Only the N-th signal triggers `os.Exit(1)`. `SIGTERM` always cancels on the first signal.
- Interactive Offset: If `WithCancelOnInterrupt(false)` is set, the runtime implicitly increments the threshold by 2. This preserves the "Distance Invariant" (kill distance relative to the last software action) and prevents races during interactive shutdowns.
- Mode: Unsafe (Threshold=0): Automatic forced exit is disabled. The user is responsible for process status.
- Async Hooks: `OnShutdown` hooks run concurrently or sequentially (LIFO) depending on configuration, but always after context cancellation.
- Reasoning: `ctx.Reason()` differentiates whether closure was manual (`Stop()`), signal-based (Interrupt), or time-based (Timeout).
- Shutdown Diagnostics: If cleanup exceeds `WithShutdownTimeout` (default 2s), the runtime automatically dumps all goroutine stacks to `stderr` to help diagnose hangs.
```mermaid
sequenceDiagram
    participant OS
    participant SignalContext
    participant Hook_B
    participant Hook_A
    participant App
    OS->>SignalContext: SIGTERM
    SignalContext->>App: Cancel Context (ctx.Done closed)
    rect rgb(30, 30, 30)
        note right of SignalContext: Async Cleanup (LIFO)
        SignalContext->>Hook_B: Execute()
        Hook_B-->>SignalContext: Return
        SignalContext->>Hook_A: Execute()
        Hook_A-->>SignalContext: Return (or Panic recovered)
    end
```
Traditional I/O is binary: it either reads or it blocks. `lifecycle` (via `procio`/`termio`) introduces Context-Aware I/O to balance Data vs. Safety.
| Strategy | Use Case | Behavior |
|---|---|---|
| Shielded Return | Automation / Logs | Data first. If data arrives alongside a Cancel, return the data. |
| Strict Discard | Interactive Prompts | Safety first. If a Cancel occurs, discard partial input. |
| Regret Window | Critical Ops | Pause first. `Sleep(ctx)` returns early on Cancel, aborting the operation. |
```mermaid
sequenceDiagram
    participant App
    participant Reader
    participant OS_Stdin
    participant Context
    note over App: Strategy Selection
    alt Strategy A (Data First)
        App->>Reader: Read()
        OS_Stdin-->>Reader: Returns "Data"
        Context-->>Reader: Returns "Cancelled"
        Reader-->>App: Return "Data", nil
        note right of App: Process Data
    else Strategy B (Error First)
        App->>Reader: ReadInteractive()
        OS_Stdin-->>Reader: Returns "Data"
        Context-->>Reader: Returns "Cancelled"
        Reader-->>App: Return 0, ErrInterrupted
        note right of App: Abort Operation (Strict)
    else Strategy C (Regret Window)
        App->>App: Input Accepted
        App->>lifecycle: Sleep(ctx, 3s)
        Context-->>lifecycle: Cancelled (User Regret)
        lifecycle-->>App: Return ctx.Err()
        note right of App: Abort Execution
    end
```
lifecycle provides primitives to manage goroutines safely, ensuring they respect shutdown signals and provide visibility.
The most common pattern. Fire-and-forget but tracked.
- Context Propagation: Inherits cancellation from the parent.
- Wait Tracking: `lifecycle.Run` automatically waits for these tasks.
- Safety: Panics are recovered and logged.
```go
lifecycle.Run(func(ctx context.Context) error {
	lifecycle.Go(ctx, func(ctx context.Context) error {
		// Runs in background, but tracked.
		// If it panics, app stays alive.
		return nil
	})
	return nil
})
```

`lifecycle.Do` executes a function synchronously with safety guarantees.
- Observability: Metrics for duration and success/failure.
- Recovery: Captures panics.
- Usage: Used internally by `Go` and `Group`.
For complex parallelism requiring limits or gang-scheduling.
- API: Wrapper around `errgroup.Group`.
- Features: `SetLimit(n)`, panic recovery, and metric tracking.
```go
g, ctx := lifecycle.NewGroup(ctx)
g.SetLimit(10)
g.Go(func(ctx context.Context) error { ... })
g.Wait()
```

To ensure safe access to shared worker state, we use the `withLock` and `withLockResult` helpers:
```go
value := withLockResult(p, func() int { return p.myField })
withLock(p, func() { p.myField = 42 })
```

Attention: Do not use these helpers in methods that already perform locking internally (e.g., `ExportState`), to avoid deadlocks.
This pattern reduces boilerplate, prevents improper unlocks, and simplifies maintenance.
See the formal decision in ADR05 in DECISIONS.md.
Ensures child processes do not outlive the parent. This logic is delegated to the procio library.
- Linux: Uses `SysProcAttr.Pdeathsig` to signal the child when the parent thread dies.
- Windows: Uses Job Objects (`JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE`) to ensure the OS terminates the child tree when the parent handle is closed.
- macOS: Falls back to standard `exec.Cmd` (OS limitations prevent strict guarantees).
Starting with v1.7.2, the preferred way to create processes is `lifecycle.NewProcessCmd(ctx, name, args...)`.
This function performs Lazy Construction: it stores the command parameters and only creates the underlying OS process when `.Start()` is called.
This ensures that the Context passed at creation is the one actually used to monitor the process, preventing "Context Detachment", where a process is started with an expired or irrelevant context.
It also automatically configures:
- Cancellation: `cmd.Wait()` returns when the context is cancelled, and the process is signalled/killed based on platform hygiene.
- Hygiene: `Pdeathsig` (Linux) or Job Objects (Windows) are applied automatically.
While `procio` provides OS-level guarantees, the Control Plane must enforce strict contextual control to prevent "Pathological Detachments" (where a worker spawns a process using `context.Background()`, losing all connection to the parent lifecycle).
The Rule: Every child process or goroutine MUST accept a context derived from its immediate parent.
```mermaid
sequenceDiagram
    participant Main as Main Context
    participant Work as Worker Context
    participant Timed as Chained Context (Timeout)
    participant Child as Child Process (procio)
    Main->>Work: Derived Cancel
    Work->>Timed: context.WithTimeout(Work)
    Timed->>Child: lifecycle.NewProcessCmd(Timed)
    Note over Child: Running...
    alt Parent Cancelled
        Main--xTimed: Cancel Cascades
        Timed-->>Child: SIGINT/SIGKILL (via procio)
    else Timeout Expired
        Timed--xTimed: Self-Cancel
        Timed-->>Child: SIGINT/SIGKILL (via procio)
    end
```
By chaining contexts, we ensure that software-level failsafes (deadlines) and lifecycle events (Ctrl+C) propagate instantly down the chain, with OS-level Job Objects/PDeathSig acting as the final safety net for hard crashes.
To support Durable Execution engines (like Trellis), we provide primitives that shield critical operations.
`lifecycle.DoDetached(ctx, fn)` (formerly `Do`) allows executing a function that cannot be cancelled by the parent context until it completes. It returns any error produced by the shielded function.
Note: `lifecycle.Do(ctx, fn)` now represents a "Safe Executor" that respects cancellation but provides panic recovery and observability. `DoDetached` wraps `Do` with `context.WithoutCancel`.
```mermaid
sequenceDiagram
    participant P as Parent Context
    participant D as lifecycle.DoDetached
    participant F as Function
    P->>D: Call DoDetached(ctx, fn)
    D->>F: Run fn(shieldedCtx) -> error
    note right of P: User hits Ctrl+C
    P--xP: Cancelled!
    note over D: DoDetached detects cancellation<br/>but WAITS for fn
    F->>F: Complete Critical Work
    F-->>D: Return error
    D-->>P: Return error (or Canceled if shielded ctx ignored)
```
(Introduced in v1.3) The Supervisor manages a set of Workers, forming a Supervision Tree.
Uniform interface for Process, Container, and Goroutine management.
```mermaid
sequenceDiagram
    participant Manager
    participant Worker
    Manager->>Worker: Start(ctx)
    activate Worker
    rect rgb(30, 30, 30)
        note right of Worker: Work happens...
    end
    alt Graceful Stop
        Manager->>Worker: Stop(ctx)
        Worker-->>Manager: Returns nil
    else Crash
        Worker->>Worker: Closes Wait() channel (w/ error)
    end
    deactivate Worker
```
Asynchronous callbacks that send on channels can race with shutdown. To avoid panics and leaked goroutines, adopt a three-stage cleanup protocol:
- STOP: Reject new work (e.g., set `closed = true`).
- WAIT: Track in-flight callbacks via `sync.WaitGroup` and wait for completion.
- CLOSE: Close channels or release resources only after callbacks are drained.

Use `BlockWithTimeout` to avoid indefinite waits during shutdown.
```go
type debouncer struct {
	closed atomic.Bool // STOP flag; atomic so callbacks can check it safely
	wg     sync.WaitGroup
	out    chan Event
}

func (d *debouncer) stopAndWait(timeout time.Duration) {
	d.closed.Store(true) // 1. STOP: reject new work

	// 2. WAIT: drain in-flight callbacks, but never block forever.
	done := make(chan struct{})
	go func() {
		d.wg.Wait()
		close(done)
	}()
	_ = lifecycle.BlockWithTimeout(done, timeout)

	close(d.out) // 3. CLOSE: safe now that callbacks are drained
}
```

By default, an OS-level worker's `Stop` method sends a signal and waits using the `BaseWorker` timeout logic. A `FuncWorker` stops and returns immediately upon context cancellation. However, `Stop()` calls may return before the child process's stdout and stderr buffers have completely flushed, or before a background goroutine has fully exited.
To address race conditions in heavily coordinated systems (such as executing tools strictly sequentially), the library provides a universal utility: `lifecycle.StopAndWait(ctx, worker)`.
It internally calls `worker.Stop(ctx)` but blocks until `<-worker.Wait()` fully resolves, ensuring all background I/O and detached routines are cleanly closed before yielding control back to the caller.
Manual use of locks across multiple worker implementations creates high risk for deadlocks (especially during recursive state inspection) and repetitive boilerplate.
Standard workers now satisfy the `sync.Locker` interface and leverage internal generic helpers to ensure atomic operations:
- `withLock(w, fn)`: Standard mutex acquisition and deferred release.
- `withLockResult(w, fn)`: Mutex acquisition for operations returning a value (state snapshots).
Determination of the final status code in `BaseWorker.DeriveFinalStatus` follows a strict precedence:
- Killed: If `ctx.Done()` was reached during `Stop()` or a force kill was explicitly triggered.
- Stopped: If the error returned by the operation matches `termio.IsInterrupted` or specific OS signal patterns.
- Failed: Any other non-nil error.
- Finished: Natural exit with a `nil` error.
> [!CAUTION]
> Linux Signal Detection: Since Go's `os.Process.Wait` returns string-based error patterns for signals on Linux, `lifecycle` performs explicit string matching for `"signal: interrupt"`, `"signal: terminated"`, and `"signal: killed"`. This ensures platform isolation without deep dependencies on `sys/unix`, but requires awareness if custom shells or non-standard exit patterns are used.
- OneForOne: Restart only the failed child.
- OneForAll: Restart all children if one fails (tight coupling).
- Restart Policies: Per-child control via `Always`, `OnFailure`, or `Never`.
- Circuit Breaker: Sliding-window restart limit (`MaxRestarts` within `MaxDuration`).
- Backoff: Exponential backoff (with jitter) limits restart loops.
- Introspection Metadata: Supervisors attach reliability metrics to their state:
  - `restarts`: Total restart count for the child.
  - `circuit_breaker`: Set to `triggered` if the max restarts threshold is exceeded.
Supervisors can manage other supervisors, forming deep trees. The `State()` method supports recursive inspection by propagating `Children` fields:
```go
rootSup := supervisor.New("root", supervisor.StrategyOneForOne,
	supervisor.Spec{
		Name: "child-sup",
		Factory: func() (worker.Worker, error) {
			return childSup, nil // Nested supervisor
		},
	},
)

state := rootSup.State()
// state.Children[0].Children = childSup's children ✅
```

This enables full topology visualization in introspection diagrams, showing the complete supervision tree regardless of nesting depth.
To prevent race conditions during rapid failures and restarts (e.g., OneForAll strategy), the supervisor implements a Worker Identity Shield.
- The Problem: In asynchronous environments, an exit event from a previous worker instance might arrive after a new instance has already been started. Without protection, the supervisor might process this stale event and accidentally shut down or remove the new, healthy worker.
- The Solution: Every `childExit` event carries a reference to the specific `worker.Worker` instance that triggered it. The supervisor's monitor loop verifies this identity against the currently active child before taking action. Stale events are logged and ignored.
Allows "Durable Execution" across restarts. The Supervisor injects environment variables into the restarted worker:
- `LIFECYCLE_RESUME_ID`: Stable UUID for the worker session.
- `LIFECYCLE_PREV_EXIT`: Exit code of the previous run.
```mermaid
sequenceDiagram
    participant Sup as Supervisor
    participant W as Worker (Instance 1)
    participant W2 as Worker (Instance 2)
    Sup->>W: Start (Injected: RESUME_ID=ABC, PREV_EXIT=0)
    W-->>Sup: Crash!
    note over Sup: Strategy OneForOne
    Sup->>W2: Start (Injected: RESUME_ID=ABC, PREV_EXIT=-1)
    note right of W2: Worker resumes work for session 'ABC'
```
(Introduced in v1.5) The Control Plane generalized the "Signal" concept into generic "Events".
The Router is the central nervous system of the Control Plane, inspired by `net/http.ServeMux`. It routes generalized Events to specialized Handlers.
Note (Facade): The router and handlers are exposed via the top-level `lifecycle` package for ease of use (e.g., `lifecycle.NewRouter()`).
Routes are defined using string patterns:
- Exact Match: `"webhook/reload"` (O(1) map lookup)
- Glob Match: `"signal.*"` (O(n) linear search using `path.Match`)
```go
router.HandleFunc("signal.*", func(ctx context.Context, e Event) error {
	log.Println("Received signal:", e)
	return nil
})
```

Pattern Syntax & Performance: For detailed pattern syntax, performance benchmarks (scaling with route count), and examples, see LIMITATIONS.md (Router Pattern Matching) and `pkg/events/router_benchmark_test.go`.
The library provides predefined events for common lifecycle transitions:
| Event | Topic | Trigger | Typical Action |
|---|---|---|---|
| `SuspendEvent` | `lifecycle/suspend` | Escalation logic / API | Pause workers, persist state. |
| `ResumeEvent` | `lifecycle/resume` | Escalation logic / API | Resume workers from state. |
| `ShutdownEvent` | `lifecycle/shutdown` | Input: `exit`, `quit` | Cancel SignalContext. |
| `TerminateEvent` | `lifecycle/terminate` | Input: `x`, `terminate` | Suspend (Save) + Shutdown. |
| `ClearLineEvent` | `lifecycle/clear-line` | Ctrl+C Escalation Mode | Clear CLI prompt, re-print `>`. |
| `UnknownCommandEvent` | `input/unknown` | Input: `?`, unknown | Print generic help message. |
Middleware wraps handlers to provide cross-cutting concerns (logging, recovery, tracing).
```go
router.Use(RecoveryMiddleware)
router.Use(LoggingMiddleware)
```

Events might be triggered multiple times (e.g., a "Quit" signal followed by a manual "Exit" command). To prevent side effects like double-closing channels, `lifecycle` provides the `control.Once(handler)` middleware.
This utility ensures the wrapped handler's logic is executed exactly once, providing a standard safety mechanism for shutdown and cleanup operations.
```go
// Protected shutdown logic
quitHandler := control.Once(control.HandlerFunc(func(ctx context.Context, _ control.Event) error {
	close(quitCh) // Safe against multiple calls
	return nil
}))
```

The Router exposes registered routes and its own status via the `Introspectable` interface.
```go
type Introspectable interface {
	State() any
}
```

Calls to `State()` return a snapshot of the component's internal state (topology, metrics, flags) for visualization tools.
```go
state := router.State().(RouterState)
// {Routes: [...], Middlewares: 2, Running: true}
```

To support Durable Execution systems, `lifecycle` introduces `SuspendEvent` and `ResumeEvent`, managed by `handlers.SuspendHandler`.
```mermaid
stateDiagram-v2
    [*] --> Running
    Running --> Suspended: SuspendEvent
    Suspended --> Running: ResumeEvent
    Running --> Graceful: SIGTERM
    Suspended --> Graceful: SIGTERM
```
- Suspend: Application is asked to pause processing, persist state, and stop accepting new work.
- Resume: Application restarts processing from the persisted state.
- Transitioning State: The `SuspendHandler` uses an internal `transitioning` flag to ensure that while hooks are running, duplicate events (e.g., rapid-fire suspend signals) are ignored. This prevents race conditions and ensures hook execution is atomic.
- Idempotency: Requests to `Suspend` while already suspended (or `Resume` while running) are ignored safely.
- Quiescence Safety: The `SuspendGate` primitive is context-aware, ensuring buffered workers can abort instantly if the application shuts down while they are paused.
The `SuspendHandler` (and most `control.Router` logic) executes hooks sequentially in the order they were registered. This has critical implications for UI feedback:
- The Problem: If a UI hook ("System Suspended") is registered before a heavy worker (e.g., a Supervisor with a slow `Blocker`), the UI will announce success immediately while the system is still technically in transition.
- The Solution (Registration Hierarchy): To ensure "Final State" messages are accurate, register heavy components (Supervisors/Gateways) before UI reporting hooks.
```go
// Correct Order:
suspendHandler.Manage(supervisor)    // 1. Heavy lifting (blocking)
suspendHandler.OnSuspend(func(...) { // 2. UI feedback (runs after #1)
	fmt.Println("🛑 SYSTEM SUSPENDED")
})
```

```mermaid
sequenceDiagram
    participant S as Source (OS/HTTP)
    participant R as Router
    participant M as Middleware
    participant H as Handler
    S->>R: Emit(Event)
    R->>R: Match(Event.Topic)
    R->>M: Dispatch(Event)
    M->>H: Handle(Event)
    H-->>M: Return error?
    M-->>R: Complete
```
To reduce boilerplate for CLI applications, lifecycle provides a pre-configured router helper.
```go
// Wires up:
// - OS Signals (Interrupt/Term)  -> Escalator (Interrupt first, then Quit)
// - Input (Stdin)                -> Router (reads lines as commands)
// - Commands "suspend", "resume" -> SuspendHandler
// - Commands "quit", "q"         -> shutdownFunc
router := lifecycle.NewInteractiveRouter(
	lifecycle.WithSuspendOnInterrupt(suspendHandler),
	lifecycle.WithShutdown(func() { ... }),
)
```

This helper ensures standard behavior ("q" to quit, "Ctrl+C" to suspend first) without manual wiring.
To reduce boilerplate across source implementations, `lifecycle` provides `BaseSource`, an embeddable helper following the same pattern as `BaseWorker`.
Problem: Seven source types repeated an identical `Events()` method implementation.
Solution: An embedding pattern with auto-exposed methods.
Before (per source):

```go
type MySource struct {
	events chan control.Event // Repeated
}

func NewMySource() *MySource {
	return &MySource{
		events: make(chan control.Event, 10), // Repeated
	}
}

func (s *MySource) Events() <-chan control.Event { // Repeated
	return s.events
}

func (s *MySource) Start(ctx context.Context) error {
	s.events <- event // Direct access
	return nil
}
```

After (with BaseSource):
```go
type MySource struct {
	control.BaseSource // Embedding!
}

func NewMySource() *MySource {
	return &MySource{
		BaseSource: control.NewBaseSource(10), // Explicit buffer
	}
}

// Events() comes for free via embedding.

func (s *MySource) Start(ctx context.Context) error {
	s.Emit(event) // Clean helper
	return nil
}
```

Benefits:
- DRY: Single implementation, not repeated 7x
- Consistency: All sources use same pattern
- Explicit: Buffer size visible at construction
- Future-Proof: Add features (metrics, filtering) in one place
API:
```go
func NewBaseSource(bufferSize int) BaseSource
func (b *BaseSource) Events() <-chan Event // Auto via embedding
func (b *BaseSource) Emit(e Event)         // Helper
func (b *BaseSource) Close()               // Cleanup
```

Usage: `FileWatchSource`, `WebhookSource`, `TickerSource`, `InputSource`, `HealthCheckSource`, `ChannelSource`, `OSSignalSource`.
High-frequency event sources (like recursive filesystem watchers) can overwhelm the system. The Control Plane provides `events.DebounceHandler` to buffer bursts and emit a single, stable event after a quiet window (trailing edge).
- Anti-Starvation: Continuous event bursts would normally prevent a trailing edge from ever firing. The `WithMaxWait` option guarantees a synchronous payload flush after a specified maximum duration.
- Custom Aggregation: Users can provide a `mergeFunc` to combine arriving events (e.g., accumulating changed file paths) rather than blindly dropping them.
While the default Router uses a callback-based Handler interface, some Go applications prefer idiomatic select loops or range iterations over channels.
The `events.Notify(ch)` bridge converts a standard Go channel into a `Handler`. It performs non-blocking sends, dropping events cleanly (`ErrNotHandled`) if the consumer's channel buffer is full, preventing the Control Plane from deadlocking on a slow reader.
```go
// Allows idiomatic integration with other select loops
ch := make(chan events.Event, 100)
router.Handle("file/*", events.Notify(ch))
```

To provide Zero Config yet safe concurrency, we use Context Propagation.
```go
// 1. Run injects a TaskTracker into the context
runtime.Run(func(ctx context.Context) error {
	// 2. Go() uses the tracker from the context
	runtime.Go(ctx, func(ctx context.Context) error {
		// ... safe background work ...
		return nil
	})
	return nil
})
```

Features:
- Context-Aware: `Go` looks for a tracker in `ctx`. If found, the goroutine is tracked by `lifecycle.Run`.
- Safe Fallback: If `Run` is not used, `Go` falls back to a global tracker. You can wait for these tasks with `lifecycle.WaitForGlobal()`.
- Leak Prevention: `Run()` waits for all tracked goroutines to finish before exiting.
- Panic Recovery: Panics are caught, logged, and do not crash the main process.
`lifecycle` adopts the Introspection Pattern: components expose `State()` methods returning immutable DTOs, which are rendered into Mermaid diagrams via the `github.com/aretw0/introspection` library.
Visualization is delegated to the external introspection library, following the same Primitive Promotion strategy used for procio (see ADR-0010 and ADR-0011).
- `lifecycle` provides domain-specific styling logic (`signal.PrimaryStyler`, `worker.NodeLabeler`) and configuration via `LifecycleDiagramConfig()`.
- `introspection` handles structural rendering (Mermaid syntax, graph traversal, CSS class application).
This separation ensures that:
- Diagram logic is DRY: Rendering logic is not duplicated across the `signal`, `worker`, and `supervisor` packages.
- Visualization is reusable: Other projects (e.g., `trellis`, `arbour`) can use `introspection` for their own topologies.
- Maintenance is centralized: Visual improvements or Mermaid syntax changes happen in one place.
- Logic/FSM: Rendered via `introspection.StateMachineDiagram` as `stateDiagram-v2`.
- Topology: Rendered via `introspection.TreeDiagram` or `introspection.ComponentDiagram` as `graph TD`.
To support future reactive frontends and type-safe observation, high-value metadata and internal metrics have been promoted to first-class fields in the `worker.State` struct:
- `Type`: Canonical worker category (Goroutine, Process, Supervisor, Container, Func).
- `Restarts`: Total count of restart attempts.
- `StartedAt` / `UpdatedAt`: Precise timestamps for uptime and transition auditing.
- `Health`: Active diagnostic status (see Probing below).

Hybrid Bridge: For backward compatibility with reflection-based tools, these fields are mirrored in the `Metadata` map until the ecosystem fully transitions to typed state consumption.
Beyond simple lifecycle status (Running/Stopped), workers can implement the `Prober` interface to expose deep internal health.
```go
type Prober interface {
	Probe(ctx context.Context) ProbeResult
}
```

The Supervisor automatically discovers `Prober` implementations in its children and triggers a probe during its own `State()` collection. This enables visual health indicators (❤️/💔) in Mermaid diagrams and real-time health monitoring in control planes.
The `lifecycle.SystemDiagram(sig, work)` function synthesizes the Control Plane (Signal Context) and Data Plane (Worker Tree) into a single Mermaid diagram:

```go
diagram := lifecycle.SystemDiagram(ctx.State(), supervisor.State())
```

This delegates to `introspection.ComponentDiagram`, which applies the configuration from `LifecycleDiagramConfig()`.
The following CSS classes are applied by stylers to represent component states:
- 🟡 pending: Defined, not active.
- 🔵 active: Running & healthy.
- 🟢 stopped: Clean exit.
- 🔴 failed: Crashed/Error.
These classes are defined in the domain packages (pkg/core/signal, pkg/core/worker) and consumed by introspection via the NodeStyler and PrimaryStyler hooks.
For implementation details, see docs/ecosystem/introspection.md.
The library is instrumented via `pkg/metrics`, `pkg/log`, and the optional `Observer` hook.
- Signals: `IncSignalReceived`
- Processes: `IncProcessStarted`, `IncProcessFailed`
- Hooks: `ObserveHookDuration`
- Data Safety: `IncTerminalUpgrade` (Windows `CONIN$` usage)
When a background task panics (`lifecycle.Go`), the runtime invokes:

```go
Observer.OnGoroutinePanicked(recovered, stack)
```
To avoid breaking the base `Observer` contract while supporting extended observability from dependencies like `procio`, `lifecycle` uses an Optional Interface Discovery pattern via the `ProcioDiscoveryBridge`.
When an observer is registered via `SetObserver(o)`, it is wrapped in an internal bridge that uses Go's structural typing (duck typing) to detect whether `o` implements additional methods:
- `OnIOError(op string, err error)`: Invoked when a low-level I/O operation (read/write/scan) fails.
- `OnScanError(err error)`: Invoked specifically for terminal/buffer scanning failures.

This allows users who need deep I/O observability to simply add these methods to their `Observer` implementation, without `lifecycle` needing to know about `procio`'s specific interfaces at the API level.
Stack capture is controlled by `WithStackCapture(bool)`:
- `true`: Always capture stack bytes (useful in production for critical tasks).
- `false`: Never capture stack bytes (performance testing).
- Unset: Auto-detect based on debug logging.
Configuration and ObserverBridge examples live in docs/CONFIGURATION.md.
For a comprehensive list of platform-specific constraints, API stability status, performance unknowns, and compatibility matrices, see LIMITATIONS.md.
Key Highlights:
- Windows: Requires Go 1.20+ for Job Objects (zombie prevention); CONIN$ needs explicit opt-in
- macOS: No PDeathSig support; hard crashes may leave orphans
- Router: Pattern matching is glob-only (no regex); O(n) route lookup
- Performance: ~5-10µs overhead per `lifecycle.Go()` call; stack capture adds +1-2µs if enabled
- Coverage: Intentional exclusions for `metrics`, `termio` (external), and flaky FS code; see TESTING.md
lifecycle adopts an "Honest Coverage" baseline. Instead of pursuing an arbitrary 100% or even 80% statement coverage across every line of code, we prioritize the verification of Behavioral Logic and Critical Path Resilience.
We distinguish between two types of code:
- Primary Behavioral Logic: The state machines, concurrent primitives, and event routing logic. These must have near-total coverage (effectively 100% of meaningful states).
- Secondary Plumbing & Syscall Wrappers: Code that interfaces with OS primitives (e.g., job objects, pdeathsig) or provides boilerplate (NoOp/Mock providers).
In certain packages, "100% coverage" often indicates Test Theater—tests that exercise no-op paths or force unreachable error states (like mock syscall failures) just to satisfy a metric.
We consider a package "Satisfactory" even with lower metrics if the missing coverage falls into these categories:
- Unreachable Syscall Errors: Windows and Linux syscall error paths that only trigger under extreme or non-existent conditions.
- Boilerplate/NoOp Paths: Methods in `NoOpProvider` or `LogProvider` that exist for interface compatibility and possess no complex logic.
- Platform-Specific Mocks: Code used primarily for testing other components that doesn't hold its own business logic.
By setting an Honest Baseline, we ensure that our engineering efforts are spent on validating the reliability of the system, not on maintaining the theater of perfect metrics.