To get all the dependencies required to develop the node services:

$ npm install

To build:

$ npm run build

A node services Docker image is used by the verification tests in Kubernetes.

$ docker build --platform linux/arm64,linux/amd64 -t ghcr.io/restatedev/e2e-verification-runner --push .

Linting is run together with `gradle check`. You can format using:
$ npm run format

SERVICES=InterpreterDriver node dist/app.js

SERVICES=InterpreterDriverJob node dist/app.js

During the verification phase the driver watches whether the state keeps converging towards the expected counters. If the total difference stops shrinking for a while (the run is wedged, e.g. an invocation got stuck), the driver collects diagnostics and fails fast instead of letting the CI job run until its timeout. The diagnostics include:
- the not-yet-completed invocations from the restate admin API (`sys_invocation`: status, what each is `suspended_waiting_for_*`, the caller chain, last failure),
- for each differing interpreter key, that object's current state, invocation history (`sys_invocation`, all statuses), `sys_journal`, and `sys_idempotency` mapping, so a lost (or extra) increment can be localized to a specific object/invocation/journal entry, and
- a goroutine dump of every SDK service container (via `SIGQUIT`, which is why `GOTRACEBACK=all` is set for those containers) plus a tail of every container's logs.

The journal/invocation history is only retained for already-completed
invocations if retention was enabled for the interpreter services (see
`INTERPRETER_JOURNAL_RETENTION` below).
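
The no-progress check described above can be illustrated with a small sketch. This is not the driver's actual code: the class name `StuckDetector`, its method `observe`, and the choice to count only a strictly smaller total difference as progress are assumptions made for illustration.

```typescript
// Hypothetical sketch of a no-progress watchdog; names are illustrative only.
class StuckDetector {
  private readonly timeoutMs: number;
  private bestDiff: number = Number.POSITIVE_INFINITY;
  private lastProgressMs: number;

  constructor(timeoutMs: number, startMs: number) {
    this.timeoutMs = timeoutMs;
    this.lastProgressMs = startMs;
  }

  // Record the total difference seen by one verification poll.
  // Returns true once the difference has not shrunk for at least timeoutMs.
  observe(totalDiff: number, nowMs: number): boolean {
    if (totalDiff < this.bestDiff) {
      // The state moved closer to the expected counters: reset the window.
      this.bestDiff = totalDiff;
      this.lastProgressMs = nowMs;
      return false;
    }
    return nowMs - this.lastProgressMs >= this.timeoutMs;
  }
}

// Example with the default 2700 second window, timestamps in milliseconds.
const detector = new StuckDetector(2700 * 1000, 0);
detector.observe(120, 60_000);    // first poll counts as progress
detector.observe(80, 120_000);    // difference shrank, window resets
const stuck = detector.observe(80, 2_820_000); // no shrink for 2700 s
```

In this sketch `stuck` ends up `true`, which is the point where a driver would collect the diagnostics and fail the run. Passing the clock in as an argument keeps the logic testable without real waiting.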
Configure via environment variables:
- `STUCK_DETECTOR_TIMEOUT_SECONDS`: no-progress window before declaring the run stuck (default `2700`). Must be comfortably larger than one verification poll, which can take many minutes for large key spaces.
- `STUCK_DETECTOR_DUMP_GOROUTINES`: set to `false` to skip `SIGQUIT`ing the SDK containers (default `true`).
- `STUCK_DETECTOR_DISABLED`: set to any value to disable the watchdog entirely.
- `INTERPRETER_JOURNAL_RETENTION`: journal/idempotency retention applied to the interpreter services after registration, in the restate "friendly" duration format (default `3 hours`; set to `off` to leave service retention untouched). Retaining ~1M journals is storage-heavy, and the offending invocation may complete early in a run, so to reliably capture it the retention must span the whole run. Tune this (and/or `tests` in the params file) for focused hunts.
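
As a rough illustration of how these variables and their defaults fit together, the sketch below parses them into a typed config object. The function `readStuckDetectorConfig` and the `StuckDetectorConfig` shape are hypothetical and not taken from the driver. Treating an empty `STUCK_DETECTOR_DISABLED` as "not set" is also an assumption.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface StuckDetectorConfig {
  disabled: boolean;
  timeoutSeconds: number;
  dumpGoroutines: boolean;
  // null means "leave service retention untouched" (the `off` setting).
  journalRetention: string | null;
}

function readStuckDetectorConfig(
  env: Record<string, string | undefined>,
): StuckDetectorConfig {
  const rawTimeout = env["STUCK_DETECTOR_TIMEOUT_SECONDS"];
  const timeoutSeconds = rawTimeout === undefined ? 2700 : Number(rawTimeout);
  if (!Number.isFinite(timeoutSeconds) || timeoutSeconds <= 0) {
    throw new Error(`invalid STUCK_DETECTOR_TIMEOUT_SECONDS: ${rawTimeout}`);
  }

  const rawDisabled = env["STUCK_DETECTOR_DISABLED"];
  const retention = env["INTERPRETER_JOURNAL_RETENTION"] ?? "3 hours";

  return {
    // Assumption: any non-empty value disables the watchdog.
    disabled: rawDisabled !== undefined && rawDisabled !== "",
    timeoutSeconds,
    // Only the literal string "false" turns the goroutine dump off.
    dumpGoroutines: env["STUCK_DETECTOR_DUMP_GOROUTINES"] !== "false",
    journalRetention: retention === "off" ? null : retention,
  };
}

// With nothing set, the documented defaults apply.
const defaults = readStuckDetectorConfig({});
```

In a Node process this would typically be called as `readStuckDetectorConfig(process.env)`. The retention string is passed through unparsed here, since the exact grammar of the "friendly" duration format is not specified in this document.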