Preserve span table across otel_span_ets crashes #954
QuinnWilton wants to merge 1 commit into open-telemetry:main
The span ETS table is created without {heir, Pid, Data}. When the
otel_span_ets process crashes, the table is destroyed and all in-flight
spans are silently lost. The process restarts and creates a fresh empty
table, but spans that were active between crash and restart are gone.
The try/catch in storage_insert/1 prevents cascading badarg errors but
masks the data loss.
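
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: the module name, the table name `demo_span_tab`, and `safe_insert/2` are hypothetical stand-ins (the last one imitates the try/catch in storage_insert/1), and a plain spawned process plays the role of otel_span_ets.

```erlang
-module(ets_no_heir_demo).
-export([run/0, safe_insert/2]).

%% Hypothetical stand-in for the try/catch in storage_insert/1:
%% swallow the badarg raised when the named table does not exist.
safe_insert(Tab, Object) ->
    try
        ets:insert(Tab, Object)
    catch
        error:badarg -> false
    end.

run() ->
    Parent = self(),
    %% The owner creates a public named table WITHOUT an heir option.
    {Owner, Ref} = spawn_monitor(fun() ->
        ets:new(demo_span_tab, [named_table, public]),
        Parent ! ready,
        receive stop -> ok end
    end),
    receive ready -> ok end,
    true = safe_insert(demo_span_tab, {span_1, in_flight}),
    %% Simulate a crash of the owning process and wait for it to die.
    exit(Owner, kill),
    receive {'DOWN', Ref, process, Owner, _Reason} -> ok end,
    %% The table was destroyed with its owner. ets:info/2 reports it as
    %% missing, and the guarded insert fails without raising.
    {ets:info(demo_span_tab, name),
     safe_insert(demo_span_tab, {span_2, in_flight})}.
```

Calling `ets_no_heir_demo:run()` shows both halves of the problem: the first span disappears with the table, and the second write is dropped without any error reaching the caller.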
Use the supervisor (otel_span_sup) as heir. On crash, ownership
transfers atomically to the supervisor. The restarted otel_span_ets
sees the table already exists via ets:info/2, skips ets:new, and
resumes operating on the preserved data. The table is public, so all
read/write operations continue to work regardless of which process
owns it.
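
The heir approach can be sketched the same way. This is an illustration under stated assumptions rather than the code in this PR: the module name, the table name `demo_heir_span_tab`, the helper `ensure_table/1`, and the heir data `none` are all hypothetical, and the calling process stands in for otel_span_sup.

```erlang
-module(ets_heir_demo).
-export([run/0]).

-define(TAB, demo_heir_span_tab).

%% Create the table only if it does not already exist, naming Heir as
%% the process that inherits it when the owner dies.
ensure_table(Heir) ->
    case ets:info(?TAB, name) of
        undefined ->
            ets:new(?TAB, [named_table, public, {heir, Heir, none}]);
        _Exists ->
            ?TAB
    end.

%% Stand-in for the worker process: set up the table, then idle.
worker(Heir) ->
    ensure_table(Heir),
    Heir ! {ready, self()},
    receive stop -> ok end.

run() ->
    Heir = self(),  % plays the supervisor's role in this sketch
    {Worker1, Ref} = spawn_monitor(fun() -> worker(Heir) end),
    receive {ready, Worker1} -> ok end,
    true = ets:insert(?TAB, {span_1, in_flight}),
    %% Simulate a crash of the owning worker.
    exit(Worker1, kill),
    receive {'DOWN', Ref, process, Worker1, _Reason} -> ok end,
    %% The heir is notified that it now owns the table.
    receive {'ETS-TRANSFER', _Tab, Worker1, none} -> ok end,
    OwnerAfterCrash = ets:info(?TAB, owner),
    %% The "restarted" worker finds the table and skips ets:new/2.
    Worker2 = spawn(fun() -> worker(Heir) end),
    receive {ready, Worker2} -> ok end,
    %% The table is public, so writes work no matter who owns it.
    true = ets:insert(?TAB, {span_2, in_flight}),
    Result = {OwnerAfterCrash =:= Heir, lists:sort(ets:tab2list(?TAB))},
    %% Clean up so the sketch can be run again.
    Worker2 ! stop,
    ets:delete(?TAB),
    Result.
```

In this sketch the span written before the crash is still present afterwards, and ownership stays with the heir once the replacement worker starts, which mirrors the behavior described above.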
I originally had another commit in here that enabled read_concurrency on the table, but after running some benchmarks it wasn't a clear win, so I removed that change to keep things simple.
Hm, what I don't remember is why the table isn't created in
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage diff:
            main     #954
Coverage    17.66%   17.66%
Files       24       24
Lines       719      719
Hits        127      127
Misses      592      592
I'm thinking through things, and can't really come up with any technical reason you wouldn't have: most of what comes to mind is just the convention of a server initializing its resources in As far as I can tell, both approaches are about equivalent, as long as you set up the heir like this.