GC heap corruption with GCLargePages #126903

@neiljohari

Description

We (@BenV and @andyblox) are experiencing what appears to be heap corruption when using GCLargePages in production.

We can force it to happen by calling GC.Collect with GCCollectionMode.Aggressive while GCLargePages
is enabled on Linux.

When DOTNET_GCLargePages=1 is enabled (with real kernel huge pages via hugetlbfs),
calling GC.Collect(2, GCCollectionMode.Aggressive, true, true) can cause GC
managed heap corruption. The corruption manifests as NullReferenceException and/or
AccessViolationException inside ConcurrentDictionary internals, with multiple
threads faulting simultaneously, which is characteristic of a GC heap corruption event.

Switching to GCCollectionMode.Forced (same generation, same blocking/compacting
parameters) works as expected with no corruption.

Although GCCollectionMode.Aggressive appears to make the heap corruption significantly more likely,
we do not use that mode in our application, so we would like help tracking down the root cause.

Analysis

The GCCollectionMode.Aggressive documentation states it "requests that the garbage
collector decommit as much memory as possible." We believe the aggressive decommit
code path interacts incorrectly with the GC's large-page memory management:

When GCLargePages is enabled, the garbage collector skips de-commits at the OS level
and also has special logic for clearing regions.
When an aggressive GC is induced, all regions are flagged for de-commit.
Because de-commit always reports success for large pages, the bookkeeping is updated even though
the de-commit never actually happened, which could cause corruption.

It is possible that the aggressive de-commit needs to be skipped when GCLargePages is active.
However, it is also possible that forcing an Aggressive GC simply exposes
existing bugs within the de-commit logic that lead to re-using pages which have not been
zeroed out by the OS.

We discovered this issue while tracking down a problem we have recently been hitting in production
when GCLargePages mode is active. We occasionally get NullReferenceExceptions from
ConcurrentDictionary at points where throwing one does not seem possible:

System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.GrowTable(Tables tables, Boolean resizeDesired, Boolean forceRehashIfNonRandomized)

We created this example program to try to reproduce that original issue. While it is possible the two are
related, we do not believe our application calls GC.Collect with GCCollectionMode.Aggressive, either directly or indirectly.
We would be very interested in your thoughts on whether other bugs
could be causing the similar failures we see in production.

Reproduction

Minimal repro program

gc-largepages-repro.zip

Option 1: Docker (requires --privileged for huge page setup)

# Triggers corruption, usually within seconds:
./run.sh aggressive

# Control, identical workload, runs clean:
./run.sh forced

# To test with a different .NET version:
DOTNET_VERSION_ARG=10.0 ./run.sh aggressive

Option 2: Run locally on Linux

Requires real kernel huge pages allocated ahead of time (at least 4000 pages, which is about 8 GB at the default 2 MiB huge page size on x86_64):

# Reserve huge pages (requires root):
echo 4000 | sudo tee /proc/sys/vm/nr_hugepages

# Build (env vars must NOT be set during build):
dotnet build -c Release

# Triggers corruption, usually within seconds:
DOTNET_GCLargePages=1 DOTNET_GCHeapHardLimit=0xC0000000 \
    dotnet bin/Release/net8.0/GCLargePagesRepro.dll aggressive

# Control, identical workload, runs clean:
DOTNET_GCLargePages=1 DOTNET_GCHeapHardLimit=0xC0000000 \
    dotnet bin/Release/net8.0/GCLargePagesRepro.dll forced
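
Before running either command, it can help to confirm that the huge pages are actually reserved and free, since the commands above assume they are. The following is a minimal sketch, assuming the standard HugePages_Free and Hugepagesize fields in /proc/meminfo. The helper names meminfo_field and check_hugepages are hypothetical and are not part of the attached repro.

```shell
# Hypothetical helpers (not part of gc-largepages-repro.zip).

# Print the numeric value of one /proc/meminfo-style field, e.g. HugePages_Free.
# $1 = field name, $2 = optional file to read instead of /proc/meminfo.
meminfo_field() {
    awk -v f="$1:" '$1 == f { print $2 }' "${2:-/proc/meminfo}"
}

# Succeed if at least $1 huge pages are free.
# $2 optionally names a meminfo-style file (useful for dry runs).
check_hugepages() {
    required="$1"
    file="${2:-/proc/meminfo}"
    free=$(meminfo_field HugePages_Free "$file")
    size_kb=$(meminfo_field Hugepagesize "$file")
    if [ -z "$free" ] || [ -z "$size_kb" ]; then
        echo "huge page counters not found in $file"
        return 1
    fi
    # Hugepagesize is reported in kB, so pages * kB / 1024 gives MiB.
    free_mib=$(( free * size_kb / 1024 ))
    if [ "$free" -ge "$required" ]; then
        echo "ok: $free free huge pages (${free_mib} MiB)"
    else
        echo "insufficient: $free free huge pages (${free_mib} MiB), need $required"
        return 1
    fi
}
```

Running `check_hugepages 4000` before the dotnet commands would fail fast if fewer pages are free than the 4000 requested above.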

The repro:

  1. Creates 4 writer threads writing to a shared ConcurrentDictionary<string, byte[]>
  2. A separate thread allocates temporary SOH/LOH arrays (creating GC region churn) and periodically calls GC.Collect(2, mode, true, true)
  3. With GCCollectionMode.Aggressive: corruption typically occurs within seconds
  4. With GCCollectionMode.Forced: runs clean for the full duration

Configuration that triggers the bug

DOTNET_GCLargePages=1
GC.Collect(2, GCCollectionMode.Aggressive, true, true)

Configurations that do NOT trigger the bug

DOTNET_GCLargePages=1
GC.Collect(2, GCCollectionMode.Forced, true, true)

DOTNET_GCLargePages=0
GC.Collect(2, GCCollectionMode.Aggressive, true, true)
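
The three configurations listed above can be run back to back with a small driver. This is only a sketch. The names run_config, run_matrix and the REPRO_CMD override are hypothetical. The DLL path and the hard limit are copied from the local-run commands, and applying the same hard limit to the GCLargePages=0 run is an assumption. It also assumes that the repro exits with a non-zero status when it crashes.

```shell
# Hypothetical driver (not part of gc-largepages-repro.zip).

# Run one configuration and print its exit status.
# $1 = value for DOTNET_GCLargePages, $2 = repro mode (aggressive or forced).
# REPRO_CMD is a hypothetical override for the command to launch.
run_config() {
    large_pages="$1"
    mode="$2"
    status=0
    # Output is discarded here; redirect to a file instead to keep the logs.
    DOTNET_GCLargePages="$large_pages" DOTNET_GCHeapHardLimit=0xC0000000 \
        ${REPRO_CMD:-dotnet bin/Release/net8.0/GCLargePagesRepro.dll} "$mode" \
        >/dev/null 2>&1 || status=$?
    echo "GCLargePages=$large_pages mode=$mode exit=$status"
}

# The one configuration reported to trigger the bug, then the two controls.
run_matrix() {
    run_config 1 aggressive
    run_config 1 forced
    run_config 0 aggressive
}
```

If the assumption about exit codes holds, only the first line printed by `run_matrix` should show a non-zero exit status.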

Expected behavior

GC.Collect(2, GCCollectionMode.Aggressive, true, true) should not corrupt the
managed heap regardless of whether large pages are enabled.

Observed behavior

Heap corruption and access violations or null reference exceptions. Corruption typically
occurs within seconds of starting, before the first status line is printed:

=== GCLargePages + Aggressive GC: Heap Corruption Repro ===
GC collect mode: Aggressive
Server GC:       True
GCLargePages:    1
Runtime:         .NET 8.0.25

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.Collections.Concurrent.ConcurrentDictionary`2[[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].TryAddInternal(...)
   at Program.WriterLoop(Int32)
   at Program+<>c__DisplayClass3_1.<Main>b__0()
[11:24:19.849] *** NRE on thread 3 — HEAP CORRUPTION ***
[11:24:19.847] *** NRE on thread 1 — HEAP CORRUPTION ***
Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.GrowTable(Tables tables, Boolean resizeDesired, Boolean forceRehashIfNonRandomized)
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryAddInternal(...)
   at Program.WriterLoop(Int32 tid) in /app/Program.cs

The corruption cascades rapidly: multiple threads fault within a few milliseconds of each other,
consistent with a single GC event corrupting a heap region that multiple threads
then read from.

Regression?

No response

Known Workarounds

No response

Configuration

Environment

  • Versions: .NET 8.0/10.0 (for .NET 10, disabling DATAS helps reproduce)
  • OS: Linux (tested on official Microsoft .NET Docker images)
  • Arch: x86_64
  • Server/Concurrent GC enabled
  • DOTNET_GCLargePages=1 with real kernel huge pages

Labels: area-GC-coreclr, untriaged (new issue has not been triaged by the area owner)
