Preserve span table across otel_span_ets crashes #954

Open
QuinnWilton wants to merge 1 commit into open-telemetry:main from QuinnWilton:span-table

Conversation

@QuinnWilton commented Feb 10, 2026

The span ETS table is created without {heir, Pid, Data}. When the
otel_span_ets process crashes, the table is destroyed and all in-flight
spans are silently lost. The process restarts and creates a fresh empty
table, but spans that were active between crash and restart are gone.
The try/catch in storage_insert/1 prevents cascading badarg errors but
masks the data loss.

Use the supervisor (otel_span_sup) as heir. On crash, ownership
transfers atomically to the supervisor. The restarted otel_span_ets
sees the table already exists via ets:info/2, skips ets:new, and
resumes operating on the preserved data. The table is public, so all
read/write operations continue to work regardless of which process
owns it.
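
The heir mechanism described above can be sketched in isolation. This is a minimal illustration, not the code in this PR: the module `heir_demo`, the table name `heir_demo_spans`, and the helper `ensure_table/1` are hypothetical, and the calling process stands in for `otel_span_sup`.

```erlang
-module(heir_demo).
-export([run/0]).

%% Hypothetical table name, used only for this illustration.
-define(TABLE, heir_demo_spans).

%% Create the table only if it does not exist yet. ets:info/2 returns
%% 'undefined' for a table that does not exist.
ensure_table(Heir) ->
    case ets:info(?TABLE, name) of
        undefined ->
            ets:new(?TABLE, [named_table, public, {heir, Heir, []}]);
        _ ->
            ?TABLE
    end.

%% Stand-in for the table-owning worker process.
worker(Heir) ->
    _ = ensure_table(Heir),
    Heir ! {ready, self()},
    receive stop -> ok end.

run() ->
    Heir = self(),  %% the caller plays the role of the supervisor
    W1 = spawn(fun() -> worker(Heir) end),
    receive {ready, W1} -> ok end,
    true = ets:insert(?TABLE, {span_1, in_flight}),

    %% Simulate a crash of the owner. The heir is told about the transfer
    %% with an 'ETS-TRANSFER' message carrying the heir data ([] here).
    exit(W1, kill),
    receive {'ETS-TRANSFER', _Tab, W1, []} -> ok end,
    OwnedByHeir = ets:info(?TABLE, owner) =:= Heir,

    %% The "restarted" worker finds the table and skips ets:new/2.
    {W2, Ref2} = spawn_monitor(fun() -> worker(Heir) end),
    receive {ready, W2} -> ok end,
    Preserved = ets:lookup(?TABLE, span_1),

    %% Clean up so run/0 can be called again.
    W2 ! stop,
    receive {'DOWN', Ref2, process, W2, _} -> ok end,
    true = ets:delete(?TABLE),
    {OwnedByHeir, Preserved}.
```

Note that in this sketch the table stays owned by the heir after the restart; the second worker only reads and writes it, which works because the table is `public`.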

@QuinnWilton QuinnWilton requested a review from a team as a code owner February 10, 2026 04:25
linux-foundation-easycla bot commented Feb 10, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: QuinnWilton / name: Quinn Wilton (5a7956a)

@QuinnWilton changed the title from "Preserve span table across otel_span_ets crashes + enable read_concurrency" to "Preserve span table across otel_span_ets crashes" Feb 10, 2026
@QuinnWilton
Author

I originally had another commit in here for enabling read_concurrency on the table, but after running some benchmarks, it wasn't a clear win and I opted to remove that change and keep things simple.

@tsloughter
Member

Hm, what I don't remember is why the table isn't created in start_link instead of init...

codecov bot commented Feb 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 17.66%. Comparing base (98be90e) to head (5a7956a).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #954   +/-   ##
=======================================
  Coverage   17.66%   17.66%           
=======================================
  Files          24       24           
  Lines         719      719           
=======================================
  Hits          127      127           
  Misses        592      592           
Flag     Coverage Δ
api      17.66% <ø> (ø)
elixir   17.66% <ø> (ø)


@QuinnWilton
Author

> Hm, what I don't remember is why the table isn't created in start_link instead of init...

I've been thinking it through and can't come up with any technical reason not to: most of what comes to mind is just the convention of a server initializing its resources in init/1.

As far as I can tell, the two approaches are roughly equivalent, as long as you set up the heir like this.
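
For reference, one difference between the two placements is which process ends up owning the table: code in a `start_link` function runs in the calling process, while `init/1` runs in the newly spawned one. A minimal sketch of that, using a hypothetical `owner_demo` module with plain spawned processes standing in for the supervisor and the gen_server:

```erlang
-module(owner_demo).
-export([run/0]).

run() ->
    Caller = self(),  %% stands in for the process that calls start_link

    %% Analogous to creating the table inside start_link: this runs in
    %% the calling process, so the caller owns the table.
    T1 = ets:new(made_in_start_link, [public]),

    %% Analogous to creating the table inside init/1: this runs in the
    %% new process, so the new process owns the table.
    {Pid, Ref} = spawn_monitor(fun() ->
        InitTab = ets:new(made_in_init, [public]),
        Caller ! {created, self(), InitTab},
        receive stop -> ok end
    end),
    T2 = receive {created, Pid, Tab} -> Tab end,

    CallerOwnsT1 = ets:info(T1, owner) =:= Caller,
    WorkerOwnsT2 = ets:info(T2, owner) =:= Pid,

    %% Killing the worker does not affect the table owned by the caller.
    exit(Pid, kill),
    receive {'DOWN', Ref, process, Pid, _} -> ok end,
    T1Survives = ets:info(T1, owner) =:= Caller,

    true = ets:delete(T1),
    {CallerOwnsT1, WorkerOwnsT2, T1Survives}.
```

Whether this matters for `otel_span_ets` depends on how it is started, which this sketch does not model.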
