
fix(metrics): Prevent race condition crash during metrics collection on shutdown#3741

Open
adduali1310 wants to merge 1 commit into falcosecurity:master from adduali1310:fix/metrics-race-condition-shutdown

Conversation

@adduali1310
Contributor

@adduali1310 adduali1310 commented Nov 17, 2025

This fixes a segmentation fault that occurs when the /metrics endpoint is accessed during Falco shutdown. The crash happens because there is a very small window in which the webserver continues serving /metrics requests after the outputs have been destroyed.

The race condition is timing-dependent and occurs in this window:

  1. process_events() returns and the outputs are destroyed
  2. The webserver is still running and can receive /metrics requests
  3. The metrics code accesses the now-null outputs elements → crash

Changes:

  • Create a cleanup_outputs action to handle outputs destruction
  • Reorder the teardown steps to stop the webserver before destroying outputs
  • Move outputs.reset() from process_events to cleanup_outputs()

This eliminates the race condition by ensuring the webserver stops accepting requests before any subsystems are destroyed. The synchronisation behaviour of outputs.reset(), which blocks until the queue is flushed, is preserved.

What type of PR is this?

/kind bug

Any specific area of the project related to this PR?

/area engine

What this PR does / why we need it:
Fixes a race condition that causes a segmentation fault during Falco shutdown when the Prometheus metrics endpoint is accessed. The crash occurs in a very short window after the outputs are destroyed but before the webserver is stopped.
Because the outputs have been destroyed, accessing them is an invalid access and leads to a SEGV.
The fix reorders the shutdown sequence to stop the webserver before destroying the subsystems.

Which issue(s) this PR fixes:

Fixes #3739

Special notes for your reviewer:
I created a simple bash script to reproduce the issue:

#!/bin/bash

FALCO_BIN="/path/to/falco"   # placeholder: set to your Falco binary path
METRICS_URL="http://localhost:3018/metrics"

echo "Starting race condition test..."

$FALCO_BIN -c /etc/falco/falco.yaml &
FALCO_PID=$!

sleep 3

echo "Falco started (PID: $FALCO_PID), launching 20 parallel scrapers..."

for i in {1..20}; do
    while true; do
        curl -s "$METRICS_URL" > /dev/null 2>&1
    done &
done

sleep 2

echo "Sending SIGTERM to Falco..."
kill -SIGTERM $FALCO_PID

sleep 3

echo "Cleaning up scrapers..."

pkill -P $$ curl 2>/dev/null
killall curl 2>/dev/null


wait $FALCO_PID 2>/dev/null
EXIT_CODE=$?

echo "Falco exit code: $EXIT_CODE"

if [ $EXIT_CODE -eq 139 ]; then
    echo "CRASH DETECTED: Segmentation fault (exit code 139)"
    exit 1
elif [ $EXIT_CODE -eq 0 ]; then
    echo "SUCCESS: Clean shutdown"
    exit 0
else
    echo "UNEXPECTED: Exit code $EXIT_CODE"
    exit $EXIT_CODE
fi

Before Fix:

Test run 1
Starting race condition test...
Mon Nov 17 08:08:24 2025: Falco version: 0.42.1-2025.11.2 (x86_64)
Mon Nov 17 08:08:24 2025: Falco initialized with configuration files:
Mon Nov 17 08:08:24 2025:    /etc/falco/falco.yaml | schema validation: ok
Mon Nov 17 08:08:24 2025: Loading rules from:
Mon Nov 17 08:08:24 2025:    /etc/falco/falco_rules.local.yaml | schema validation: none
Mon Nov 17 08:08:24 2025: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Mon Nov 17 08:08:24 2025: Starting health webserver with threadiness 96, listening on 0.0.0.0:3018
Mon Nov 17 08:08:24 2025: Setting metrics interval to 1h, equivalent to 3600000 (ms)
Mon Nov 17 08:08:24 2025: Loaded event sources: syscall
Mon Nov 17 08:08:24 2025: Enabled event sources: syscall
Mon Nov 17 08:08:24 2025: Opening 'syscall' source with modern BPF probe.
Mon Nov 17 08:08:24 2025: One ring buffer every '2' CPUs.
Mon Nov 17 08:08:24 2025: [libs]: Trying to open the right engine!
Falco started (PID: 317455), launching 20 parallel scrapers...
Sending SIGTERM to Falco...
Cleaning up scrapers...
Terminated
Terminated
Terminated
Terminated
Terminated
Mon Nov 17 08:08:44 2025: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 0 occurrences
   - num times actions taken: 0
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:
Falco exit code: 139
CRASH DETECTED: Segmentation fault (exit code 139)
FAILED on run 1

After Fix:

Starting race condition test...
...
Mon Nov 17 08:06:07 2025: Falco initialized with configuration files:
Mon Nov 17 08:06:07 2025:    /etc/falco/falco.yaml | schema validation: ok
Mon Nov 17 08:06:07 2025: Loading rules from:
Mon Nov 17 08:06:07 2025:    /etc/falco/falco_rules.local.yaml | schema validation: none
Mon Nov 17 08:06:07 2025: The chosen syscall buffer dimension is: 8388608 bytes (8 MBs)
Mon Nov 17 08:06:07 2025: Starting health webserver with threadiness 96, listening on 0.0.0.0:3018
Mon Nov 17 08:06:07 2025: Setting metrics interval to 1h, equivalent to 3600000 (ms)
Mon Nov 17 08:06:07 2025: Loaded event sources: syscall
Mon Nov 17 08:06:07 2025: Enabled event sources: syscall
Mon Nov 17 08:06:07 2025: Opening 'syscall' source with modern BPF probe.
Mon Nov 17 08:06:07 2025: One ring buffer every '2' CPUs.
Mon Nov 17 08:06:07 2025: [libs]: Trying to open the right engine!
Falco started (PID: 153669), launching 20 parallel scrapers...
Sending SIGTERM to Falco...
Cleaning up scrapers...
Terminated
Terminated
Terminated
Terminated
Terminated
Terminated
Mon Nov 17 08:06:26 2025: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 0 occurrences
   - num times actions taken: 0
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:
Falco exit code: 0
SUCCESS: Clean shutdown

Since applying the fix in this PR, I can confirm I haven't seen the crashes.
This can also be verified by running the script aggressively.

Does this PR introduce a user-facing change?:

None

@poiana
Contributor

poiana commented Nov 17, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: adduali1310
Once this PR has been reviewed and has the lgtm label, please assign ekoops for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@poiana poiana requested review from Kaizhe and leogr November 17, 2025 09:49
@poiana poiana added the size/M label Nov 17, 2025
@leogr
Member

leogr commented Nov 17, 2025

good catch! thank you

/milestone 0.43.0

@ekoops
Contributor

ekoops commented Jan 7, 2026

Hey @adduali1310, sorry, but we have experienced some delay due to rate-limiting issues. The change looks good, even if that print_stats() embedded in that action could be difficult to spot. Do you mind moving that print into a separate action?

@ekoops ekoops modified the milestones: 0.43.0, 0.44.0 Jan 14, 2026
@adduali1310 adduali1310 force-pushed the fix/metrics-race-condition-shutdown branch from 213f8d3 to 5eb3468 on January 19, 2026 at 11:00
@adduali1310
Contributor Author

Apologies for the delay @ekoops. I separated print_stats into its own action as requested; it is also called separately in the teardown sequence.

@adduali1310
Contributor Author

Hey @ekoops @leogr

Checking in: when do we plan to merge this change?

@ekoops
Contributor

ekoops commented Feb 4, 2026

Hey @ekoops @leogr

Checking in: when do we plan to merge this change?

Hey! Sorry, we were busy with the new release. Now we can work on merging this. I see test failures in CI, but I haven't had time to go through them. Can you rebase on top of master? I hope this will fix the issue. If the CI becomes green, I'll definitely merge this (note that the failing jobs are marked as "required").

@adduali1310 adduali1310 force-pushed the fix/metrics-race-condition-shutdown branch 2 times, most recently from 3bda69d to 0730de3 on February 19, 2026 at 18:41
@adduali1310
Contributor Author

Hey @ekoops

It took me a while to get to the bottom of things, but here's a quick update on the CI status after rebasing. I ran into a couple of issues.

  • After moving print_stats() to a separate teardown action as requested, the check-engine-checksum and check-engine-version CI checks started failing, because the teardown steps run on every exit path, including info commands like --list-events and --version. On master, print_stats() only ran at the end of process_events(), which is never reached for those commands. As a teardown action, it was unconditionally appending extra output to --list-events:
Events detected: 0
Rule counts by severity:
Triggered rules by rule name:

This changed the computed engine checksum. I built locally and confirmed this by diffing falco --list-events output between master and my branch.
To fix this, I added an if(s.outputs) guard to both print_stats.cpp and cleanup_outputs.cpp, since s.outputs is only initialized in init_outputs, which runs right before process_events; for info commands it remains null.

  • Adding the guard fixed the checksum, but TestFalcoKmod and TestFalcoModernBpf are now failing with:

Error: Expect "" to match "Events detected:"

These tests run Falco with -M 3 and check stderr for the "Events detected:" string from print_stats(). The problem is a teardown ordering conflict:

  1. cleanup_outputs runs first, calling s.outputs.reset() which sets s.outputs to null
  2. print_stats runs next, but the if(s.outputs) guard sees null and skips it

Swapping the order would fix the test but breaks the original intent: on master, s.outputs.reset() was intentionally called before print_stats() to flush the output queue and avoid intermixed output.

So s.outputs isn't a viable guard since it's always null by the time print_stats needs to run, and there's no clean flag in state to indicate "we actually processed events."

I think the simplest correct fix is to move print_stats() back into process_events() (at the end, before the return) and only keep cleanup_outputs as the separate teardown action. print_stats() itself doesn't have the race condition with the webserver — only s.outputs.reset() does. This preserves the original ordering (reset outputs, then print stats) and the teardown safety.

Unless you have a better or alternative guard in mind to keep print_stats as a separate teardown action, let me know how you would prefer to proceed.

@ekoops
Copy link
Contributor

ekoops commented Feb 20, 2026

Mmm, I see. Yes please, let's bring print_stats() back inside the action. However, I would add a comment, where the action is used, saying that it internally calls print_stats()... WDYT?

…on shutdown

This fixes a segmentation fault that occurs when the /metrics endpoint is accessed during Falco shutdown. The crash happens because the webserver continues serving /metrics requests after outputs and inspectors have been destroyed.

Changes:

- Create cleanup_outputs action to handle outputs destruction
- Create print_stats action for stats printing
- Reorder teardown steps to stop the webserver before destroying outputs
- Move outputs.reset() from process_events to cleanup_outputs()

This eliminates the race condition by ensuring the webserver stops accepting requests before any subsystems are destroyed. The synchronisation behaviour of outputs.reset(), which blocks until the queue is flushed, is preserved.

Signed-off-by: Adnan Ali <adduali1310@hotmail.com>
@adduali1310 adduali1310 force-pushed the fix/metrics-race-condition-shutdown branch from 0730de3 to 8a2f071 on March 10, 2026 at 15:03
@adduali1310
Contributor Author

Hey @ekoops

Made the requested change. Could you take a look when possible?
Also, all checks pass except test-dev-packages-arm64 / test-packages, which fails on TestFalcoctl_Artifact_InstallPlugin and is unrelated to this change.

I would presume rerunning this test should fix it.

@leogr
Member

leogr commented Mar 12, 2026

I would presume rerunning this test should fix it.

Job re-launched. Let's see.

@adduali1310
Contributor Author

Seems to have worked? @leogr


Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Segmentation Fault when accessing outputs_queue_num_drops prometheus metrics

4 participants