Observability overhaul + traces support by mnafees · Pull Request #3213 · hatchet-dev/hatchet

mnafees · 2026-03-09T12:31:24Z

Description

Introduces Hatchet observability (o11y) and the corresponding changes to the Python, TypeScript, and Go SDKs.

Type of change

New feature (non-breaking change which adds functionality)

Note

Medium Risk
Adds new persisted trace data (new DB table/partitioning) and new API endpoints surfaced in the UI; correctness and performance depend on batching/truncation and pagination behavior. Risk is mitigated by gating collection behind HatchetO11y.Enabled, but changes touch engine ingest, RBAC, and frontend navigation.

Overview
Adds end-to-end OpenTelemetry trace support: the engine can now ingest/store spans (new v1_otel_trace table + enums/partitions) via the OTLP collector with configurable max batch size and retry-count correlation, gated by HatchetO11y.Enabled.

Exposes new stable APIs to fetch traces for a task or workflow-run (GET .../trace with pagination), wires them through RBAC and OpenAPI clients/models, and updates server handlers/transformers (including a CEL enum naming fix).

Replaces the workflow/task-run Waterfall UI with an Observability tab that fetches all spans, builds a span tree, and renders an interactive timeline/tree view; adds new example apps for Go/Python/TypeScript OTel instrumentation and small CI/lint tweaks (enable o11y in python SDK workflow, ignore flaky OTel tests, expand linter excludes).

Written by Cursor Bugbot for commit 27aefc0. This will update automatically on new commits.
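The overview mentions that the Observability tab fetches all spans and builds a span tree. A minimal sketch of that step is shown here, assuming a flat span shape with `spanId`, `parentSpanId`, `name`, and `startMs` fields. These field names and the `buildSpanTree` helper are hypothetical and are not taken from the PR's API models or frontend code.

```typescript
// Hypothetical flat span shape; the field names are assumptions for illustration.
interface FlatSpan {
  spanId: string;
  parentSpanId?: string;
  name: string;
  startMs: number;
}

// A span plus the spans that name it as their parent.
interface SpanNode extends FlatSpan {
  children: SpanNode[];
}

// Builds a forest from a flat list. Spans whose parent is missing from the
// list (for example, a parent that was never ingested) are treated as roots.
function buildSpanTree(spans: FlatSpan[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const span of spans) {
    nodes.set(span.spanId, { ...span, children: [] });
  }

  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parent =
      node.parentSpanId !== undefined ? nodes.get(node.parentSpanId) : undefined;
    if (parent !== undefined && parent !== node) {
      parent.children.push(node);
    } else {
      roots.push(node);
    }
  });

  // Order siblings by start time so a timeline renders top to bottom.
  const sortByStart = (list: SpanNode[]): void => {
    list.sort((a, b) => a.startMs - b.startMs);
    list.forEach((n) => sortByStart(n.children));
  };
  sortByStart(roots);
  return roots;
}
```

Treating orphaned spans as roots keeps them visible in the view instead of silently dropping them, which seems a reasonable default when a trace is still arriving.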

* feat: first pass at auto otel impl
* refactor: clean up a bit, naming, etc.
* refactor: rm instance vars
* fix: rm one more instance var
* chore: notes to self
* traces view
* minor changes
* trace view by task external id
* go sdk instrumentation
* e2e tests for Py SDK trace

Co-authored-by: Mohammed Nafees <[email protected]>

* add: otel as optional dep on ts packages
* feat: opentelemetry instrumentor for TS sdk, with example
* fix: lint
* revert: debug print
* remove: trailing space
* fix: ts otel patch file path, throw handlesteprun error upstream, ts otel examples
* fix: lint
* feat: add schedule_workflow instrumentor, add otel config loader tests
* add: more robust wrap unwrap for patched modules
* fix: lint, update version
* refactor: ts otel config type assertion
* revert: rebase issues
* fix: lint
* fix: update worker patch for ts otel with InternalWorker
* fix: lint
* refactor: parsejson on otel
* fix: pnpm-lock
* fix: lint
* docs: add otel instrumented method warnings

Co-authored-by: Jishnu <[email protected]>

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix prepared fixes for both issues found in the latest run.

✅ Fixed: Wrong status enum causes infinite polling for succeeded runs
- I added WorkflowRunStatus.SUCCEEDED to the shared terminal-status list so succeeded workflow runs are treated as terminal and observability polling stops.
✅ Fixed: Division by zero when span duration is zero
- I guarded getTimelineData against totalRange <= 0 and return stable fallback percentages to avoid NaN timeline CSS values.

Or push these changes by commenting:

@cursor push 33d8ac818f

Preview (33d8ac818f)

diff --git a/frontend/app/src/components/v1/agent-prism/agent-prism-data.ts b/frontend/app/src/components/v1/agent-prism/agent-prism-data.ts
--- a/frontend/app/src/components/v1/agent-prism/agent-prism-data.ts
+++ b/frontend/app/src/components/v1/agent-prism/agent-prism-data.ts
@@ -49,9 +49,15 @@
   maxEnd: number;
 }): { durationMs: number; startPercent: number; widthPercent: number } => {
   const startMs = new Date(spanCard.createdAt).getTime();
+  const durationMs = spanCard.durationNs / 1_000_000;
   const totalRange = maxEnd - minStart;
-  const durationMs = spanCard.durationNs / 1_000_000;
+
+  if (totalRange <= 0) {
+    return { durationMs, startPercent: 0, widthPercent: 100 };
+  }
+
   const startPercent = ((startMs - minStart) / totalRange) * 100;
   const widthPercent = (durationMs / totalRange) * 100;
+
   return { durationMs, startPercent, widthPercent };
 };

diff --git a/frontend/app/src/pages/main/v1/workflow-runs-v1/$run/v2components/step-run-detail/step-run-detail.tsx b/frontend/app/src/pages/main/v1/workflow-runs-v1/$run/v2components/step-run-detail/step-run-detail.tsx
--- a/frontend/app/src/pages/main/v1/workflow-runs-v1/$run/v2components/step-run-detail/step-run-detail.tsx
+++ b/frontend/app/src/pages/main/v1/workflow-runs-v1/$run/v2components/step-run-detail/step-run-detail.tsx
@@ -19,7 +19,12 @@
   TabsList,
   TabsTrigger,
 } from '@/components/v1/ui/tabs';
-import { V1TaskStatus, V1TaskSummary, queries } from '@/lib/api';
+import {
+  V1TaskStatus,
+  V1TaskSummary,
+  WorkflowRunStatus,
+  queries,
+} from '@/lib/api';
 import { emptyGolangUUID, formatDuration } from '@/lib/utils';
 import { TaskRunActionButton } from '@/pages/main/v1/task-runs-v1/actions';
 import { WorkflowDefinitionLink } from '@/pages/main/workflow-runs/$run/v2components/workflow-definition';
@@ -50,6 +55,7 @@
   V1TaskStatus.CANCELLED,
   V1TaskStatus.FAILED,
   V1TaskStatus.COMPLETED,
+  WorkflowRunStatus.SUCCEEDED,
 ];
 
 const TaskRunPermalinkOrBacklink = ({

cursor · 2026-03-19T08:21:59Z

frontend/app/src/pages/main/v1/workflow-runs-v1/$run/index.tsx

+              workflowRunExternalId={id}
+              isRunning={
+                !TASK_RUN_TERMINAL_STATUSES.includes(workflowRun.status)
+              }


Wrong status enum causes infinite polling for succeeded runs

High Severity

TASK_RUN_TERMINAL_STATUSES contains V1TaskStatus.COMPLETED (string "COMPLETED"), but workflowRun.status is a WorkflowRunStatus whose success value is SUCCEEDED (string "SUCCEEDED"). Since "SUCCEEDED" is never in TASK_RUN_TERMINAL_STATUSES, isRunning remains true after a workflow succeeds, causing the Observability component to poll every 5 seconds indefinitely.

Additional Locations (1)

frontend/app/src/pages/main/v1/workflow-runs-v1/$run/v2components/step-run-detail/step-run-detail.tsx#L48-L53
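The mismatch described above can be reproduced in isolation. The sketch below uses local stand-in enums: the `COMPLETED` and `SUCCEEDED` string values come from the review comment, while the `RUNNING` member and the `refetchIntervalMs` helper are assumptions added for illustration and are not the actual generated API types or frontend code.

```typescript
// Local stand-ins for the enums named in the review. Only the COMPLETED and
// SUCCEEDED values are quoted in the review; RUNNING is an assumed member.
enum V1TaskStatus {
  CANCELLED = "CANCELLED",
  FAILED = "FAILED",
  COMPLETED = "COMPLETED",
}

enum WorkflowRunStatus {
  RUNNING = "RUNNING",
  SUCCEEDED = "SUCCEEDED",
}

// Before the fix: only task statuses count as terminal.
const terminalBefore: string[] = [
  V1TaskStatus.CANCELLED,
  V1TaskStatus.FAILED,
  V1TaskStatus.COMPLETED,
];

// After the fix: the workflow-run success value is terminal as well.
const terminalAfter: string[] = terminalBefore.concat([
  WorkflowRunStatus.SUCCEEDED,
]);

const isRunning = (terminal: string[], status: string): boolean =>
  terminal.indexOf(status) === -1;

// Hypothetical helper: poll every 5 seconds while running, otherwise stop.
const refetchIntervalMs = (running: boolean): number | false =>
  running ? 5000 : false;

// A succeeded workflow run is still considered running before the fix...
const before = isRunning(terminalBefore, WorkflowRunStatus.SUCCEEDED);
// ...and is correctly treated as finished after it.
const after = isRunning(terminalAfter, WorkflowRunStatus.SUCCEEDED);

console.log(refetchIntervalMs(before), refetchIntervalMs(after));
```

Mixing two enums in one list works at runtime because both are plain strings, but a separate terminal-status list per enum would let the type checker catch this class of bug.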

cursor · 2026-03-19T08:21:59Z

frontend/app/src/components/v1/agent-prism/agent-prism-data.ts

+  const durationMs = spanCard.durationNs / 1_000_000;
+  const startPercent = ((startMs - minStart) / totalRange) * 100;
+  const widthPercent = (durationMs / totalRange) * 100;
+  return { durationMs, startPercent, widthPercent };


Division by zero when span duration is zero

Low Severity

In getTimelineData, if maxEnd equals minStart (possible when all spans have zero duration), totalRange is 0 and both startPercent and widthPercent become NaN. These NaN values flow into CSS left and width style properties in SpanCardTimeline, producing undefined rendering behavior.
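A self-contained sketch of the guarded calculation follows, mirroring the autofix preview. The `SpanCard` interface here is reduced to the two fields the function reads and is an assumption about the real type.

```typescript
// Reduced stand-in for the span card type; only the fields used here.
interface SpanCard {
  createdAt: string; // ISO 8601 timestamp
  durationNs: number;
}

interface TimelineData {
  durationMs: number;
  startPercent: number;
  widthPercent: number;
}

function getTimelineData({
  spanCard,
  minStart,
  maxEnd,
}: {
  spanCard: SpanCard;
  minStart: number;
  maxEnd: number;
}): TimelineData {
  const startMs = new Date(spanCard.createdAt).getTime();
  const durationMs = spanCard.durationNs / 1_000_000;
  const totalRange = maxEnd - minStart;

  // Without this guard, a zero range yields 0 / 0 = NaN for both percentages.
  if (totalRange <= 0) {
    return { durationMs, startPercent: 0, widthPercent: 100 };
  }

  const startPercent = ((startMs - minStart) / totalRange) * 100;
  const widthPercent = (durationMs / totalRange) * 100;
  return { durationMs, startPercent, widthPercent };
}

const t0 = Date.parse("2026-03-19T08:00:00.000Z");

// Degenerate case: every span starts and ends at the same instant.
const degenerate = getTimelineData({
  spanCard: { createdAt: "2026-03-19T08:00:00.000Z", durationNs: 0 },
  minStart: t0,
  maxEnd: t0,
});

// Normal case: a 500 ms span starting 250 ms into a 1000 ms window.
const normal = getTimelineData({
  spanCard: { createdAt: "2026-03-19T08:00:00.250Z", durationNs: 500_000_000 },
  minStart: t0,
  maxEnd: t0 + 1000,
});
```

In the degenerate case the span fills the full width, which gives the CSS `left` and `width` properties finite values.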

mnafees · 2026-03-19T09:29:52Z

cmd/hatchet-engine/engine/run.go

 			return fmt.Errorf("could not create admin service (v1): %w", err)
 		}

-		oc, err := otelcol.NewOTelCollector(


This is fine since this was previously a no-op and will now only be enabled when the env var is set, as shown below.

mrkaye97 · 2026-03-20T14:51:08Z

going to leave some review comments for myself then will go through and fix them

.github/workflows/sdk-python.yml

sdks/guides/python/ai_agents/__pycache__/worker.cpython-310.pyc

sdks/python/examples/run_details/test_run_detail_getter.py

sdks/python/examples/opentelemetry_instrumentation/hatchet/trigger.py

sdks/python/hatchet_sdk/opentelemetry/instrumentor.py

sdks/python/tests/otel_traces/test_otel_traces.py

mrkaye97

and last one: need a minor version for python

mnafees self-assigned this Mar 9, 2026

mnafees added 2 commits March 9, 2026 13:57

fix CI

05f3b0f

fix black lint

139a8d1

Merge branch 'main' into feat-o11y-overhaul

3bb8236

fix example

8c145bf

Merge branch 'main' into feat-o11y-overhaul

1008b05

mnafees added 2 commits March 9, 2026 18:37

fix lint

c147b91

inject traceparent in Go SDK

c49e1d8

ctx propagation

374342e

many many random spans

e2cdadf

refetch polling

0c8839b

mnafees added 3 commits March 10, 2026 13:12

Merge branch 'main' into feat-o11y-overhaul

bd54b97

otel postgres traces

06915fb

add to rbac.yaml

9396872

some refactor

35c59b9

Merge branch 'main' into feat-o11y-overhaul

27aefc0

cursor bot reviewed Mar 19, 2026

bug fixes

6ecadf7

docs push

a3a2ad9

mnafees commented Mar 19, 2026

restore comments

87207b1

delete older observability docs

cc6b54a

Merge branch 'main' into feat-o11y-overhaul

ad1312d
