Observability overhaul + traces support #3213

Open
mnafees wants to merge 63 commits into main from feat-o11y-overhaul

@mnafees mnafees commented Mar 9, 2026

Description

Introduces Hatchet observability (o11y), along with the corresponding changes to the Python, TypeScript, and Go SDKs.

Type of change

  • New feature (non-breaking change which adds functionality)

Note

Medium Risk
Adds new persisted trace data (new DB table/partitioning) and new API endpoints surfaced in the UI; correctness and performance depend on batching/truncation and pagination behavior. Risk is mitigated by gating collection behind HatchetO11y.Enabled, but changes touch engine ingest, RBAC, and frontend navigation.

Overview
Adds end-to-end OpenTelemetry trace support: the engine can now ingest/store spans (new v1_otel_trace table + enums/partitions) via the OTLP collector with configurable max batch size and retry-count correlation, gated by HatchetO11y.Enabled.

Exposes new stable APIs to fetch traces for a task or workflow-run (GET .../trace with pagination), wires them through RBAC and OpenAPI clients/models, and updates server handlers/transformers (including a CEL enum naming fix).

Replaces the workflow/task-run Waterfall UI with an Observability tab that fetches all spans, builds a span tree, and renders an interactive timeline/tree view; adds new example apps for Go/Python/TypeScript OTel instrumentation and small CI/lint tweaks (enable o11y in python SDK workflow, ignore flaky OTel tests, expand linter excludes).
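The "builds a span tree" step in the overview can be illustrated roughly as follows. This is a minimal sketch, not the PR's implementation: the `FlatSpan` and `SpanNode` shapes, their field names, and the `buildSpanTree` helper are all assumptions made for illustration, and the real API model may differ. It also assumes span IDs are unique within the fetched list.

```typescript
// Hypothetical span shape; the PR's actual API model may use different field names.
interface FlatSpan {
  spanId: string;
  parentSpanId?: string;
  name: string;
  startMs: number;
}

interface SpanNode extends FlatSpan {
  children: SpanNode[];
}

// Builds a forest from a flat span list (assumes unique spanId values).
// Spans whose parent is absent from the list are treated as roots.
function buildSpanTree(spans: FlatSpan[]): SpanNode[] {
  // Prototype-free object so IDs such as "constructor" cannot hit inherited keys.
  const nodes: Record<string, SpanNode> = Object.create(null);
  for (const span of spans) {
    nodes[span.spanId] = { ...span, children: [] };
  }

  const roots: SpanNode[] = [];
  for (const span of spans) {
    const node = nodes[span.spanId];
    const parent =
      span.parentSpanId !== undefined ? nodes[span.parentSpanId] : undefined;
    // A span that names itself as parent is also treated as a root.
    if (parent !== undefined && parent !== node) {
      parent.children.push(node);
    } else {
      roots.push(node);
    }
  }

  // Sort siblings by start time so the tree reads chronologically.
  const sortByStart = (list: SpanNode[]): void => {
    list.sort((a, b) => a.startMs - b.startMs);
    list.forEach((n) => sortByStart(n.children));
  };
  sortByStart(roots);
  return roots;
}
```

Treating orphaned spans as roots keeps them visible while their parents are still missing from the response, which seems preferable to dropping them silently in a view that polls while a run is in progress.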

Written by Cursor Bugbot for commit 27aefc0. This will update automatically on new commits.

* feat: first pass at auto otel impl

* refactor: clean up a bit, naming, etc.

* refactor: rm instance vars

* fix: rm one more instance var

* chore: notes to self

* traces view

* minor changes

* trace view by task external id

* go sdk instrumentation

* e2e tests for Py SDK trace

---------

Co-authored-by: Mohammed Nafees <[email protected]>
@mnafees mnafees self-assigned this Mar 9, 2026

* add: otel as optional dep on ts packages

* feat: opentelemetry instrumentor for TS sdk, with example

* fix: lint

* revert: debug print

* remove: trailing space

* fix: ts otel patch file path, throw handlesteprun error upstream, ts otel examples

* fix: lint

* feat: add schedule_workflow instrumentor, add otel config loader tests

* add: more robust wrap unwrap for patched modules

* fix: lint, update version

* refactor: ts otel config type assertion

* revert: rebase issues

* fix: lint

* fix: update worker patch for ts otel with InternalWorker

* fix: lint

* refactor: parsejson on otel

* fix: pnpm-lock

* fix: lint

* docs: add otel instrumented method warnings

Co-authored-by: Jishnu <[email protected]>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Wrong status enum causes infinite polling for succeeded runs
    • I added WorkflowRunStatus.SUCCEEDED to the shared terminal-status list so succeeded workflow runs are treated as terminal and observability polling stops.
  • ✅ Fixed: Division by zero when span duration is zero
    • I guarded getTimelineData against totalRange <= 0 and return stable fallback percentages to avoid NaN timeline CSS values.

Preview (33d8ac818f)
diff --git a/frontend/app/src/components/v1/agent-prism/agent-prism-data.ts b/frontend/app/src/components/v1/agent-prism/agent-prism-data.ts
--- a/frontend/app/src/components/v1/agent-prism/agent-prism-data.ts
+++ b/frontend/app/src/components/v1/agent-prism/agent-prism-data.ts
@@ -49,9 +49,15 @@
   maxEnd: number;
 }): { durationMs: number; startPercent: number; widthPercent: number } => {
   const startMs = new Date(spanCard.createdAt).getTime();
+  const durationMs = spanCard.durationNs / 1_000_000;
   const totalRange = maxEnd - minStart;
-  const durationMs = spanCard.durationNs / 1_000_000;
+
+  if (totalRange <= 0) {
+    return { durationMs, startPercent: 0, widthPercent: 100 };
+  }
+
   const startPercent = ((startMs - minStart) / totalRange) * 100;
   const widthPercent = (durationMs / totalRange) * 100;
+
   return { durationMs, startPercent, widthPercent };
 };

diff --git a/frontend/app/src/pages/main/v1/workflow-runs-v1/$run/v2components/step-run-detail/step-run-detail.tsx b/frontend/app/src/pages/main/v1/workflow-runs-v1/$run/v2components/step-run-detail/step-run-detail.tsx
--- a/frontend/app/src/pages/main/v1/workflow-runs-v1/$run/v2components/step-run-detail/step-run-detail.tsx
+++ b/frontend/app/src/pages/main/v1/workflow-runs-v1/$run/v2components/step-run-detail/step-run-detail.tsx
@@ -19,7 +19,12 @@
   TabsList,
   TabsTrigger,
 } from '@/components/v1/ui/tabs';
-import { V1TaskStatus, V1TaskSummary, queries } from '@/lib/api';
+import {
+  V1TaskStatus,
+  V1TaskSummary,
+  WorkflowRunStatus,
+  queries,
+} from '@/lib/api';
 import { emptyGolangUUID, formatDuration } from '@/lib/utils';
 import { TaskRunActionButton } from '@/pages/main/v1/task-runs-v1/actions';
 import { WorkflowDefinitionLink } from '@/pages/main/workflow-runs/$run/v2components/workflow-definition';
@@ -50,6 +55,7 @@
   V1TaskStatus.CANCELLED,
   V1TaskStatus.FAILED,
   V1TaskStatus.COMPLETED,
+  WorkflowRunStatus.SUCCEEDED,
 ];
 
 const TaskRunPermalinkOrBacklink = ({

workflowRunExternalId={id}
isRunning={
!TASK_RUN_TERMINAL_STATUSES.includes(workflowRun.status)
}

Wrong status enum causes infinite polling for succeeded runs

High Severity

TASK_RUN_TERMINAL_STATUSES contains V1TaskStatus.COMPLETED (string "COMPLETED"), but workflowRun.status is a WorkflowRunStatus whose success value is SUCCEEDED (string "SUCCEEDED"). Since "SUCCEEDED" is never in TASK_RUN_TERMINAL_STATUSES, isRunning remains true after a workflow succeeds, causing the Observability component to poll every 5 seconds indefinitely.
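The mismatch can be reproduced in isolation. The sketch below is illustrative only: the enum members are limited to the string values quoted in this comment and the diff (plus an assumed `RUNNING` member), and `isRunning` is a hypothetical helper standing in for the inline check in the component.

```typescript
// Illustrative subsets; the generated API client defines the full enums.
enum V1TaskStatus {
  CANCELLED = "CANCELLED",
  FAILED = "FAILED",
  COMPLETED = "COMPLETED",
}

enum WorkflowRunStatus {
  RUNNING = "RUNNING", // assumed member, used only for the example
  SUCCEEDED = "SUCCEEDED",
}

// Terminal list as it was before the fix: task statuses only.
const TERMINAL_BEFORE: string[] = [
  V1TaskStatus.CANCELLED,
  V1TaskStatus.FAILED,
  V1TaskStatus.COMPLETED,
];

// Terminal list after the autofix: also covers a succeeded workflow run.
const TERMINAL_AFTER: string[] = [
  ...TERMINAL_BEFORE,
  WorkflowRunStatus.SUCCEEDED,
];

// Hypothetical helper: a run counts as running unless its status is terminal.
function isRunning(status: string, terminal: string[]): boolean {
  return !terminal.some((s) => s === status);
}

// Before the fix a succeeded workflow run still looks "running", so polling continues.
const stuckPolling = isRunning(WorkflowRunStatus.SUCCEEDED, TERMINAL_BEFORE);
// After the fix it is recognized as terminal and polling can stop.
const stillPolling = isRunning(WorkflowRunStatus.SUCCEEDED, TERMINAL_AFTER);
```

Typing the shared list as `string[]` is what lets values from two different enums coexist. A stricter alternative would be a separate terminal list per status type, so the compiler rejects checks against the wrong enum.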

const durationMs = spanCard.durationNs / 1_000_000;
const startPercent = ((startMs - minStart) / totalRange) * 100;
const widthPercent = (durationMs / totalRange) * 100;
return { durationMs, startPercent, widthPercent };

Division by zero when span duration is zero

Low Severity

In getTimelineData, if maxEnd equals minStart (possible when all spans have zero duration), totalRange is 0 and both startPercent and widthPercent become NaN. These NaN values flow into CSS left and width style properties in SpanCardTimeline, producing undefined rendering behavior.
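A standalone version of the guarded calculation makes the failure mode easy to check. This is a sketch under assumptions: the `SpanTiming` and `TimelineData` interfaces and the positional signature are simplified stand-ins for the component's real types, and the fallback of `0` / `100` percent follows the autofix preview rather than any documented requirement.

```typescript
// Simplified stand-ins for the component's types (assumed shapes).
interface SpanTiming {
  createdAt: string; // ISO 8601 timestamp of the span start
  durationNs: number;
}

interface TimelineData {
  durationMs: number;
  startPercent: number;
  widthPercent: number;
}

function getTimelineData(
  span: SpanTiming,
  minStart: number, // earliest span start, in ms since epoch
  maxEnd: number, // latest span end, in ms since epoch
): TimelineData {
  const startMs = new Date(span.createdAt).getTime();
  const durationMs = span.durationNs / 1_000_000;
  const totalRange = maxEnd - minStart;

  // Without this guard, 0 / 0 yields NaN, which would reach the CSS left/width values.
  if (totalRange <= 0) {
    return { durationMs, startPercent: 0, widthPercent: 100 };
  }

  return {
    durationMs,
    startPercent: ((startMs - minStart) / totalRange) * 100,
    widthPercent: (durationMs / totalRange) * 100,
  };
}

const t0 = new Date("2026-03-19T00:00:00.000Z").getTime();

// Degenerate trace: every span starts and ends at the same instant.
const degenerate = getTimelineData(
  { createdAt: "2026-03-19T00:00:00.000Z", durationNs: 0 },
  t0,
  t0,
);

// Ordinary trace: a 250 ms span starting halfway through a 1 s window.
const ordinary = getTimelineData(
  { createdAt: "2026-03-19T00:00:00.500Z", durationNs: 250_000_000 },
  t0,
  t0 + 1000,
);
```

With these inputs, `degenerate` falls back to a full-width bar, while `ordinary` starts at 50% and spans 25% of the timeline.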

return fmt.Errorf("could not create admin service (v1): %w", err)
}

oc, err := otelcol.NewOTelCollector(
Member Author

This is fine: it was previously a no-op, and it will now only be enabled when the env var is set, as in the code below.

@mrkaye97
Contributor

going to leave some review comments for myself then will go through and fix them

@mrkaye97 mrkaye97 left a comment

and last one: need a minor version for python

5 participants