fix(metrics): Prevent race condition crash during metrics collection on shutdown by adduali1310 · Pull Request #3741 · falcosecurity/falco

adduali1310 · 2025-11-17T09:49:07Z

This fixes a segmentation fault that occurs when /metrics endpoint is accessed during Falco shutdown. The crash happens as there is a very minute window, wherein the webserver continues serving /metrics requests after outputs have been destroyed.

The race condition is timing-dependent and occurs in this window:

process_events()
Webserver still running and can receive /metrics requests
Metrics code accesses NULL outputs elements→ crash

Changes:

Create cleanup_outputs action to handle outputs destruction
Reorder teardown steps to stop webserver before destorying outputs
Move outputs.reset() from process_events to cleanup_outputs()

This eliminates the race condition by ensuring the webserver stops accepting requests before any subsystems are destroyed. The synchronisation behaviour of output.reset() block till queue flushed is preserved.

What type of PR is this?

/kind bug

Any specific area of the project related to this PR?

/area engine

What this PR does / why we need it:
Fixes a race condition that causes a segmentation fault during Falco shutdown when prometheus metrics endpoint is accessed. The crash occurs in a very short time window before the webserver is stopped but after outputs are destroyed.
Due to the destruction, accessing them leads to a SEGV and invalid access.
The fix reorders the shutdown sequence to stop the webserver before destroying the subsystems.

Which issue(s) this PR fixes:

Fixes #3739

Special notes for your reviewer:
Created a simple bash script to reproduce the issue.

#!/bin/bash

FALCO_BIN="Binary Path"
METRICS_URL="http://localhost:3018/metrics"

echo "Starting race condition test..."

$FALCO_BIN -c /etc/falco/falco.yaml &
FALCO_PID=$!

sleep 3

echo "Falco started (PID: $FALCO_PID), launching 20 parallel scrapers..."

for i in {1..20}; do
    while true; do
        curl -s $METRICS_URL > /dev/null 2>&1
    done &
done

sleep 2

echo "Sending SIGTERM to Falco..."
kill -SIGTERM $FALCO_PID

sleep 3

echo "Cleaning up scrapers..."

pkill -P $$ curl 2>/dev/null
killall curl 2>/dev/null


wait $FALCO_PID 2>/dev/null
EXIT_CODE=$?

echo "Falco exit code: $EXIT_CODE"

if [ $EXIT_CODE -eq 139 ]; then
    echo "CRASH DETECTED: Segmentation fault (exit code 139)"
    exit 1
elif [ $EXIT_CODE -eq 0 ]; then
    echo "SUCCESS: Clean shutdown"
    exit 0
else
    echo "UNEXPECTED: Exit code $EXIT_CODE"
    exit $EXIT_CODE
fi

Before Fix:

Test run 1
Starting race condition test...
Mon Nov 17 08:08:24 2025: Falco version: 0.42.1-2025.11.2 (x86_64)
Mon Nov 17 08:08:24 2025: Falco initialized with configuration files:
Mon Nov 17 08:08:24 2025:    /etc/falco/falco.yaml | schema validation: ok
Mon Nov 17 08:08:24 2025: Loading rules from:
Mon Nov 17 08:08:24 2025:    /etc/falco/falco_rules.local.yaml | schema validation: none
Mon Nov 17 08:08:24 2025: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Mon Nov 17 08:08:24 2025: Starting health webserver with threadiness 96, listening on 0.0.0.0:3018
Mon Nov 17 08:08:24 2025: Setting metrics interval to 1h, equivalent to 3600000 (ms)
Mon Nov 17 08:08:24 2025: Loaded event sources: syscall
Mon Nov 17 08:08:24 2025: Enabled event sources: syscall
Mon Nov 17 08:08:24 2025: Opening 'syscall' source with modern BPF probe.
Mon Nov 17 08:08:24 2025: One ring buffer every '2' CPUs.
Mon Nov 17 08:08:24 2025: [libs]: Trying to open the right engine!
Falco started (PID: 317455), launching 20 parallel scrapers...
Sending SIGTERM to Falco...
Cleaning up scrapers...
Terminated
Terminated
Terminated
Terminated
Terminated
Mon Nov 17 08:08:44 2025: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 0 occurrences
   - num times actions taken: 0
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:
Falco exit code: 139
CRASH DETECTED: Segmentation fault (exit code 139)
FAILED on run 1

After Fix:

Starting race condition test...
...
Mon Nov 17 08:06:07 2025: Falco initialized with configuration files:
Mon Nov 17 08:06:07 2025:    /etc/falco/falco.yaml | schema validation: ok
Mon Nov 17 08:06:07 2025: Loading rules from:
Mon Nov 17 08:06:07 2025:    /etc/falco/falco_rules.local.yaml | schema validation: none
Mon Nov 17 08:06:07 2025: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Mon Nov 17 08:06:07 2025: Starting health webserver with threadiness 96, listening on 0.0.0.0:3018
Mon Nov 17 08:06:07 2025: Setting metrics interval to 1h, equivalent to 3600000 (ms)
Mon Nov 17 08:06:07 2025: Loaded event sources: syscall
Mon Nov 17 08:06:07 2025: Enabled event sources: syscall
Mon Nov 17 08:06:07 2025: Opening 'syscall' source with modern BPF probe.
Mon Nov 17 08:06:07 2025: One ring buffer every '2' CPUs.
Mon Nov 17 08:06:07 2025: [libs]: Trying to open the right engine!
Falco started (PID: 153669), launching 20 parallel scrapers...
Sending SIGTERM to Falco...
Cleaning up scrapers...
Terminated
Terminated
Terminated
Terminated
Terminated
Terminated
Mon Nov 17 08:06:26 2025: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 0 occurrences
   - num times actions taken: 0
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:
Falco exit code: 0
SUCCESS: Clean shutdown

Post the fix in this PR, can confirm haven't seen the crashes.
Can be confirmed by running the script aggressively as well.

Does this PR introduce a user-facing change?:

None

poiana · 2025-11-17T09:49:12Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adduali1310
Once this PR has been reviewed and has the lgtm label, please assign ekoops for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

leogr · 2025-11-17T14:29:14Z

good catch! thank you

/milestone 0.43.0

ekoops · 2026-01-07T13:07:03Z

Hey @adduali1310 , sorry, but we had experienced some delay due to rate-limiting issues. The change looks good, even if that print_stats() embedded in that action could be difficult to spot. Do you mind to move that print in a separate action?

adduali1310 · 2026-01-19T11:03:09Z

Apologies for the delay @ekoops . Separated print_stats into a separate action as requested and is also called separately in the teardown sequence.

adduali1310 · 2026-02-03T14:35:05Z

Hey @ekoops @leogr

Checking in as to when do we plan to merge this change?

ekoops · 2026-02-04T13:02:21Z

Hey @ekoops @leogr

Checking in as to when do we plan to merge this change?

Hey! Sorry, we were busy with the new release. Now we can work for merging this. I see tests failure in CI, but I haven't had time to go through it. Can you rebase on top of master? I hope this fill fix the issue. If the CI become green, I'll definitely merge this (notice that the failing jobs are marked as "required").

adduali1310 · 2026-02-19T19:13:04Z

Hey @ekoops

So took me a while to get to the bottom of things but quick update on the CI status after rebasing. I ran into a couple of issues.

After moving print_stats() to a separate teardown action as requested, the check-engine-checksum and check-engine-version CI checks started failing as the teardown steps run on every exit path, including info commands like --list-events and --version. On master, print_stats() only ran at the end of process_events(), which is never reached for those commands. As a teardown action, it was unconditionally appending extra output to --list-events:

Events detected: 0
Rule counts by severity:
Triggered rules by rule name:

This changed the computed engine checksum. I built locally and confirmed by diffing falco --list-events output between master and my branch.
To fix this, I added an if(s.outputs) guard to both print_stats.cpp and cleanup_outputs.cpp, since s.outputs is only initialized in init_outputs which runs right before process_events and for info commands it remains null.

While adding the guard did fix the checksum, but TestFalcoKmod and TestFalcoModernBpf are now failing with:

Error: Expect "" to match "Events detected:"

These tests run Falco with -M 3 and check stderr for the "Events detected:" string from print_stats(). The problem is a teardown ordering conflict:

cleanup_outputs runs first, calling s.outputs.reset() which sets s.outputs to null
print_stats runs next, but the if(s.outputs) guard sees null and skips it

Swapping the order would fix the test but breaks the original intent on master, s.outputs.reset() was intentionally called before print_stats() to flush the output queue and avoid intermixed output.

So s.outputs isn't a viable guard since it's always null by the time print_stats needs to run, and there's no clean flag in state to indicate "we actually processed events."

I think the simplest correct fix is to move print_stats() back into process_events() (at the end, before the return) and only keep cleanup_outputs as the separate teardown action. print_stats() itself doesn't have the race condition with the webserver — only s.outputs.reset() does. This preserves the original ordering (reset outputs, then print stats) and the teardown safety.

Unless you have a better or alternative guard in mind to keep print_stats as a separate teardown action let me know, how you would prefer to proceed.

ekoops · 2026-02-20T08:56:37Z

Mmm I see. Yes please, let's bring back print_stats() inside the action. However, I would add a comment, where the action is used, that says that internally it calls print_stats()... WDYT?

…on shutdown This fixes a segmentation fault that occurs when /metrics endpoint is accessed during Falco shutdown. The crash happens as the webserver continues serving /metrics requests after outputs and inspectors have been destroyed. Changes: - Create cleanup_outputs action to handle outputs destruction - Create print_stats action for stats printing - Reorder teardown steps to stop webserver before destorying outputs - Move outputs.reset() from process_events to cleanup_outputs() This eliminates the race condition by ensuring the webserver stops accepting requests before any subsystems are destroyed. The synchronisation behaviour of output.reset() block till queue flushed is preserved. Signed-off-by: Adnan Ali <adduali1310@hotmail.com>

adduali1310 · 2026-03-10T16:00:27Z

Hey @ekoops

Made the requested change. Could you take a look when possible?
Also, all checks pass except test-dev-packages-arm64 / test-packages with the failure for TestFalcoctl_Artifact_InstallPlugin and is unrelated to this change.

I would presume rerunning this test should fix it.

leogr · 2026-03-12T11:33:14Z

I would presume rerunning this test should fix it.

Job re-launched. Let's see.

adduali1310 · 2026-03-13T12:27:48Z

Seems to have worked? @leogr

github-project-automation bot added this to Falco Roadmap Nov 17, 2025

github-project-automation bot moved this to Todo in Falco Roadmap Nov 17, 2025

poiana added release-note-none kind/bug dco-signoff: yes area/engine labels Nov 17, 2025

poiana requested review from Kaizhe and leogr November 17, 2025 09:49

poiana added the size/M label Nov 17, 2025

poiana added this to the 0.43.0 milestone Nov 17, 2025

adduali1310 mentioned this pull request Nov 17, 2025

fix(metrics): Add null check for state.outputs in metrics collection #3740

Merged

ekoops modified the milestones: 0.43.0, 0.44.0 Jan 14, 2026

adduali1310 force-pushed the fix/metrics-race-condition-shutdown branch from 213f8d3 to 5eb3468 Compare January 19, 2026 11:00

adduali1310 force-pushed the fix/metrics-race-condition-shutdown branch 2 times, most recently from 3bda69d to 0730de3 Compare February 19, 2026 18:41

leogr mentioned this pull request Mar 5, 2026

SIGSEGV when snapshotting metrics libs::metrics::libs_metrics_collector::snapshot #3737

Open

adduali1310 force-pushed the fix/metrics-race-condition-shutdown branch from 0730de3 to 8a2f071 Compare March 10, 2026 15:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(metrics): Prevent race condition crash during metrics collection on shutdown#3741

fix(metrics): Prevent race condition crash during metrics collection on shutdown#3741
adduali1310 wants to merge 1 commit intofalcosecurity:masterfrom
adduali1310:fix/metrics-race-condition-shutdown

adduali1310 commented Nov 17, 2025 •

edited

Loading

Uh oh!

poiana commented Nov 17, 2025

Uh oh!

leogr commented Nov 17, 2025

Uh oh!

ekoops commented Jan 7, 2026 •

edited

Loading

Uh oh!

adduali1310 commented Jan 19, 2026

Uh oh!

adduali1310 commented Feb 3, 2026

Uh oh!

ekoops commented Feb 4, 2026

Uh oh!

adduali1310 commented Feb 19, 2026

Uh oh!

ekoops commented Feb 20, 2026

Uh oh!

adduali1310 commented Mar 10, 2026

Uh oh!

leogr commented Mar 12, 2026

Uh oh!

adduali1310 commented Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

adduali1310 commented Nov 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

poiana commented Nov 17, 2025

Uh oh!

leogr commented Nov 17, 2025

Uh oh!

ekoops commented Jan 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adduali1310 commented Jan 19, 2026

Uh oh!

adduali1310 commented Feb 3, 2026

Uh oh!

ekoops commented Feb 4, 2026

Uh oh!

adduali1310 commented Feb 19, 2026

Uh oh!

ekoops commented Feb 20, 2026

Uh oh!

adduali1310 commented Mar 10, 2026

Uh oh!

leogr commented Mar 12, 2026

Uh oh!

adduali1310 commented Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

adduali1310 commented Nov 17, 2025 •

edited

Loading

ekoops commented Jan 7, 2026 •

edited

Loading