in_opentelemetry: SIGSEGV in cmetrics compute_metric_hash on histogram payload from Quarkus / Micrometer #11764

@enoquefcd

Bug Report

Describe the bug

Fluent Bit 5.0.3 (cmetrics 2.1.2) segfaults inside the opentelemetry input plugin when receiving OTLP/HTTP histogram payloads from Quarkus services that use Micrometer's OTLP exporter. The crash is a NULL dereference in compute_metric_hash (lib/cmetrics/src/cmt_decode_opentelemetry.c). The function dereferences map->opts->fqname and label_value->name without NULL checks, while its peer function get_or_create_metric_metadata_context, called immediately after from the same caller, does guard against a NULL fqname. Reading the two functions side by side makes the missing checks obvious.

Each histogram publish kills the worker, the pod restarts, the buffered chunk on disk replays the same payload, and fluent-bit re-crashes on startup (CrashLoopBackOff) until the chunk hits its retry limit and is dropped.

To Reproduce

Identical stack trace on every crash:

[engine] caught signal (SIGSEGV)
#0  get_or_create_data_point_metadata_context() at lib/cmetrics/src/cmt_decode_opentelemetry.c:331
#1  decode_histogram_data_point()                at lib/cmetrics/src/cmt_decode_opentelemetry.c:1021
#2  decode_histogram_data_point_list()           at lib/cmetrics/src/cmt_decode_opentelemetry.c:1061
#3  decode_histogram_entry()                     at lib/cmetrics/src/cmt_decode_opentelemetry.c:1158
#4  decode_metrics_entry()                       at lib/cmetrics/src/cmt_decode_opentelemetry.c:1545
#5  decode_scope_metrics_entry()                 at lib/cmetrics/src/cmt_decode_opentelemetry.c:1763
#6  decode_resource_metrics_entry()              at lib/cmetrics/src/cmt_decode_opentelemetry.c:1871
#7  decode_service_request()                     at lib/cmetrics/src/cmt_decode_opentelemetry.c:1931
#8  cmt_decode_opentelemetry_create()            at lib/cmetrics/src/cmt_decode_opentelemetry.c:1954
#9  process_payload_metrics_ng()                 at plugins/in_opentelemetry/opentelemetry_prot.c:526
#10 opentelemetry_prot_handle_ng()               at plugins/in_opentelemetry/opentelemetry_prot.c:1019
#11 flb_http_server_client_activity_event_handler() at src/http_server/flb_http_server.c:365
#12 flb_engine_start()                           at src/flb_engine.c:1267
#13 flb_lib_worker()                             at src/flb_lib.c:909

Steps to reproduce:

  1. Run a Quarkus 3.32.x service. Either of these emitter setups reproduces:

    • io.quarkus:quarkus-micrometer-opentelemetry (the core bridge), or
    • io.quarkiverse.micrometer.registry:quarkus-micrometer-registry-otlp:3.5.0 (the native Micrometer OTLP registry — bypasses the Quarkus OTel SDK entirely).
  2. Configure metric publication to fluent-bit OTLP HTTP every 10s, e.g.:

    quarkus.micrometer.export.otlp.url=http://<fluent-bit>:4318/v1/metrics
    quarkus.micrometer.export.otlp.step=10s
    
  3. Fluent-bit opentelemetry input on :4318:

    [INPUT]
        Name           opentelemetry
        Listen         0.0.0.0
        Port           4318
        Tag            otlp.app
  4. Within ~10 s of app boot, the first histogram-bearing publish (default Micrometer binders emit jvm.gc.pause, http.server.connections.duration, etc., as Timer → OTLP Histogram) reaches the OTLP input → SIGSEGV.

The crash reproduces with both emitters, so the root cause is downstream of Quarkus, in cmetrics.

Root cause analysis

decode_data_point_labels falls into its else branch when an OTLP attribute's AnyValue.value_case is unrecognised (e.g. NOT_SET = 0):

else {
    result = append_new_metric_label_value(metric, NULL, 0);
}

This stores a label with name == NULL in sample->labels. compute_metric_hash then calls cfl_sds_len(label_value->name) without a NULL guard. cfl_sds_len(NULL) evaluates CFL_SDS_HEADER(NULL)->len, which dereferences (struct cfl_sds *)(NULL - 16) and segfaults. The peer function get_or_create_metric_metadata_context, called two lines later from the same caller, already guards against map->opts->fqname == NULL; compute_metric_hash does not.
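To make the failure mode concrete, here is a minimal, self-contained sketch of a header-prefixed string and a hash routine that tolerates NULL inputs. It is not the cmetrics code: every sketch_* name is hypothetical, the 16-byte header layout is an assumption taken from the (NULL - 16) offset described above, and FNV-1a is swapped in purely to avoid external dependencies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a header-prefixed string: callers hold a
 * pointer to buf, and the 16-byte header sits directly in front of it. */
struct sketch_sds_hdr {
    uint64_t len;
    uint64_t alloc;
    char buf[];
};

typedef char *sketch_sds_t;

#define SKETCH_HDR(s) \
    ((struct sketch_sds_hdr *) ((s) - offsetof(struct sketch_sds_hdr, buf)))

sketch_sds_t sketch_sds_create(const char *str)
{
    size_t len = strlen(str);
    struct sketch_sds_hdr *hdr = malloc(sizeof(struct sketch_sds_hdr) + len + 1);

    if (hdr == NULL) {
        return NULL;
    }
    hdr->len = len;
    hdr->alloc = len;
    memcpy(hdr->buf, str, len + 1);
    return hdr->buf;
}

void sketch_sds_destroy(sketch_sds_t s)
{
    if (s != NULL) {
        free(SKETCH_HDR(s));
    }
}

/* Guarded length: a NULL string counts as empty instead of stepping
 * back 16 bytes from address zero to read a header that is not there. */
size_t sketch_sds_len(sketch_sds_t s)
{
    return (s == NULL) ? 0 : (size_t) SKETCH_HDR(s)->len;
}

/* FNV-1a, used here only to keep the sketch dependency-free. */
uint64_t sketch_fnv1a(uint64_t h, const char *data, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        h ^= (unsigned char) data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Hashes the metric name plus every label value, tolerating NULLs. */
uint64_t sketch_metric_hash(sketch_sds_t fqname, sketch_sds_t *labels, size_t count)
{
    uint64_t h = 14695981039346656037ULL;
    size_t i;

    h = sketch_fnv1a(h, fqname, sketch_sds_len(fqname));
    for (i = 0; i < count; i++) {
        h = sketch_fnv1a(h, labels[i], sketch_sds_len(labels[i]));
        h = sketch_fnv1a(h, ";", 1); /* separator between labels */
    }
    return h;
}
```

With the guard in sketch_sds_len, a label array such as { "GET", NULL } hashes without touching the missing header. Without the guard, the same call would read from just below address zero, which is the crash reported here.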

Expected behavior

OTLP histogram payloads from Java/Quarkus clients should be ingested without crashing the worker. Malformed or unknown attribute value cases should not produce a NULL-named label that subsequently segfaults the hash function.
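One way to meet that expectation on the decode side is to always produce a string, falling back to "" when the value case is unset or unknown. The sketch below is illustrative only: the enum, struct, and function names are hypothetical stand-ins rather than the generated OTLP types or the cmetrics API, and the handling of the bool and int cases is an assumption made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical mirror of a oneof discriminator; 0 means "not set". */
enum sketch_value_case {
    SKETCH_VALUE_NOT_SET = 0,
    SKETCH_VALUE_STRING  = 1,
    SKETCH_VALUE_BOOL    = 2,
    SKETCH_VALUE_INT     = 3
};

/* Hypothetical attribute value carrying one of the cases above. */
struct sketch_any_value {
    enum sketch_value_case value_case;
    const char *string_value;
    int bool_value;
    long long int_value;
};

/* Renders the attribute value as label text. The output buffer always
 * holds a valid, NUL-terminated string; unset or unknown cases yield "".
 * Returns 0 on success, -1 on bad arguments or truncation. */
int sketch_label_text(const struct sketch_any_value *v, char *out, size_t out_size)
{
    int n = 0;

    if (out == NULL || out_size == 0) {
        return -1;
    }
    out[0] = '\0';
    if (v == NULL) {
        return 0;
    }

    switch (v->value_case) {
    case SKETCH_VALUE_STRING:
        n = snprintf(out, out_size, "%s",
                     v->string_value != NULL ? v->string_value : "");
        break;
    case SKETCH_VALUE_BOOL:
        n = snprintf(out, out_size, "%s", v->bool_value ? "true" : "false");
        break;
    case SKETCH_VALUE_INT:
        n = snprintf(out, out_size, "%lld", v->int_value);
        break;
    default:
        /* NOT_SET or a case this decoder does not know: keep "". */
        break;
    }

    return (n < 0 || (size_t) n >= out_size) ? -1 : 0;
}
```

Because the fallback is an empty string rather than NULL, later stages such as hashing never see a missing label value, even if they lack their own guard.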

Your Environment

  • Version used: Fluent Bit 5.0.3 (Helm chart fluent/fluent-bit-0.57.3, app version 5.0.3)
  • cmetrics: 2.1.2 (current master also affected)
  • Configuration: opentelemetry input on :4318, prometheus_remote_write output to VictoriaMetrics
  • Environment: Kubernetes (k3s)
  • Operating System: Linux
  • Client: Quarkus 3.32.4 with quarkus-micrometer-registry-otlp 3.5.0 (also reproduced with quarkus-micrometer-opentelemetry core bridge)

Additional context

  • The crash address 0x...028 points to compute_metric_hash; the symbol the unwinder reports (get_or_create_data_point_metadata_context) is the calling function because debug info maps to its entry line.
  • Fix: cmetrics#265 ("decode_opentelemetry: guard NULL fqname and label name in compute_metric_hash"). decode_data_point_labels stores "" instead of NULL for unrecognised AnyValue.value_case; compute_metric_hash guards fqname and label->name for defence in depth. Regression test included.
  • Possibly related but distinct: Quarkus issue #51741 (histogram buckets miscounted in the quarkus-micrometer-opentelemetry bridge). Our crash reproduces with both Quarkus paths, so the cmetrics NULL deref is independent of #51741.
