Description
We (@BenV and @andyblox) are experiencing what appears to be heap corruption when using GCLargePages in production.
It can be forced by calling GC.Collect with GCCollectionMode.Aggressive when GCLargePages
mode is enabled on Linux.
When DOTNET_GCLargePages=1 is set (with real kernel huge pages via hugetlbfs),
calling GC.Collect(2, GCCollectionMode.Aggressive, true, true) can corrupt the
managed heap. The corruption manifests as NullReferenceException and/or
AccessViolationException inside ConcurrentDictionary internals, with multiple
threads faulting almost simultaneously, which is characteristic of a GC heap corruption event.
Switching to GCCollectionMode.Forced (same generation, same blocking/compacting
parameters) works as expected with no corruption.
While GCCollectionMode.Aggressive makes the heap corruption significantly more likely,
we do not use that mode in our application, so we would like assistance in tracking down the root cause.
Analysis
The GCCollectionMode.Aggressive documentation states it "requests that the garbage
collector decommit as much memory as possible." We believe the aggressive decommit
code path interacts incorrectly with the GC's large-page memory management:
When GCLargePages is enabled, the garbage collector skips de-commits at the OS level
and also has special logic for clearing regions.
When an aggressive GC is induced, all regions are flagged for de-commit.
Because de-commit is always reported as successful for large pages, the bookkeeping is updated even though
the de-commit never really happened, which could potentially cause corruption.
It is possible that the aggressive de-commit needs to be skipped when GCLargePages is active.
However, it is also possible that forcing an Aggressive GC in this reproduction is simply exposing
existing bugs in the de-commit logic that lead to re-using pages which have not been
zeroed out by the OS.
We discovered this issue while tracking down a problem we have recently been hitting in production
when GCLargePages mode is active: we occasionally get NullReferenceExceptions in
ConcurrentDictionary at a point where it does not seem possible to throw a NullReferenceException:
System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.GrowTable(Tables tables, Boolean resizeDesired, Boolean forceRehashIfNonRandomized)
We created this example program to try to reproduce that original issue. While it is possible the two are
related, we do not believe we are calling GC.Collect with GCCollectionMode.Aggressive, either directly or indirectly.
We would be very interested in your thoughts on whether there could be bugs
that cause the similar issues we are encountering in production.
Reproduction
Minimal repro program
gc-largepages-repro.zip
Option 1: Docker (requires --privileged for huge page setup)
# Triggers corruption, usually within seconds:
./run.sh aggressive
# Control, identical workload, runs clean:
./run.sh forced
# To test with a different .NET version:
DOTNET_VERSION_ARG=10.0 ./run.sh aggressive
Option 2: Run locally on Linux
Requires real kernel huge pages allocated ahead of time (at least 4000 pages, about 8 GB at the default 2 MB huge page size on x86_64):
# Reserve huge pages (requires root):
echo 4000 | sudo tee /proc/sys/vm/nr_hugepages
# Build (env vars must NOT be set during build):
dotnet build -c Release
# Triggers corruption, usually within seconds:
DOTNET_GCLargePages=1 DOTNET_GCHeapHardLimit=0xC0000000 \
dotnet bin/Release/net8.0/GCLargePagesRepro.dll aggressive
# Control, identical workload, runs clean:
DOTNET_GCLargePages=1 DOTNET_GCHeapHardLimit=0xC0000000 \
dotnet bin/Release/net8.0/GCLargePagesRepro.dll forced
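Before running either command it can help to confirm that the huge page reservation actually succeeded, since the kernel may reserve fewer pages than requested. The following is a minimal sketch of such a pre-flight check. The function name check_hugepages is hypothetical and is not part of the repro archive. It assumes the standard HugePages_Free and Hugepagesize fields in /proc/meminfo. With DOTNET_GCHeapHardLimit=0xC0000000 (3 GiB) and 2 MB huge pages, the heap hard limit alone corresponds to 1536 pages.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repro zip): checks that enough free
# huge pages exist to back a given GC heap hard limit.
# Usage: check_hugepages LIMIT_BYTES [MEMINFO_FILE]
check_hugepages() {
    # Accepts decimal or hex (e.g. 0xC0000000) via shell arithmetic.
    limit_bytes=$(( $1 ))
    meminfo=${2:-/proc/meminfo}

    # Read the free page count and the huge page size (in kB) from meminfo.
    free_pages=$(awk '/^HugePages_Free:/ {print $2}' "$meminfo")
    page_kb=$(awk '/^Hugepagesize:/ {print $2}' "$meminfo")
    if [ -z "$free_pages" ] || [ -z "$page_kb" ]; then
        echo "huge page counters not found in $meminfo" >&2
        return 2
    fi

    # Round the limit up to a whole number of huge pages.
    page_bytes=$(( page_kb * 1024 ))
    needed=$(( (limit_bytes + page_bytes - 1) / page_bytes ))

    echo "need $needed huge pages, $free_pages free"
    [ "$free_pages" -ge "$needed" ]
}
```

For example, running `check_hugepages 0xC0000000 || echo "reserve more huge pages first"` before the dotnet commands above would flag a failed reservation. The check only covers the heap hard limit; the 4000 pages requested above leave headroom beyond it.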
The repro:
- Creates 4 writer threads writing to a shared ConcurrentDictionary<string, byte[]>
- A separate thread allocates temporary SOH/LOH arrays (creating GC region churn) and periodically calls GC.Collect(2, mode, true, true)
- With GCCollectionMode.Aggressive: corruption typically occurs within seconds
- With GCCollectionMode.Forced: runs clean for the full duration
Configuration that triggers the bug
- DOTNET_GCLargePages=1 with GC.Collect(2, GCCollectionMode.Aggressive, true, true)
Configurations that do NOT trigger the bug
- DOTNET_GCLargePages=1 with GC.Collect(2, GCCollectionMode.Forced, true, true)
- DOTNET_GCLargePages=0 with GC.Collect(2, GCCollectionMode.Aggressive, true, true)
Expected behavior
GC.Collect(2, GCCollectionMode.Aggressive, true, true) should not corrupt the
managed heap regardless of whether large pages are enabled.
Observed behavior
Heap corruption and access violations or null reference exceptions. Corruption typically
occurs within seconds of starting, before the first status line is printed:
=== GCLargePages + Aggressive GC: Heap Corruption Repro ===
GC collect mode: Aggressive
Server GC: True
GCLargePages: 1
Runtime: .NET 8.0.25
Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at System.Collections.Concurrent.ConcurrentDictionary`2[[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].TryAddInternal(...)
at Program.WriterLoop(Int32)
at Program+<>c__DisplayClass3_1.<Main>b__0()
[11:24:19.849] *** NRE on thread 3 — HEAP CORRUPTION ***
[11:24:19.847] *** NRE on thread 1 — HEAP CORRUPTION ***
Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
at System.Collections.Concurrent.ConcurrentDictionary`2.GrowTable(Tables tables, Boolean resizeDesired, Boolean forceRehashIfNonRandomized)
at System.Collections.Concurrent.ConcurrentDictionary`2.TryAddInternal(...)
at Program.WriterLoop(Int32 tid) in /app/Program.cs
The corruption cascades rapidly and multiple threads fault within a few milliseconds of each other,
consistent with a single GC event corrupting a heap region that multiple threads
then read from.
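To quantify how tightly the faults cluster in a captured log, the timestamps on the "NRE on thread" lines can be compared. This is a rough sketch only: the function name fault_spread_ms is hypothetical, it assumes the `[HH:MM:SS.mmm] *** NRE` line format shown in the output above, and it does not handle a run that crosses midnight.

```shell
#!/bin/sh
# Hypothetical helper: reads repro output on stdin and reports how many
# "NRE on thread" lines were logged and the time span between the first
# and last of them, in milliseconds.
fault_spread_ms() {
    awk '
        /^\[[0-9][0-9]:[0-9][0-9]:[0-9][0-9]\.[0-9][0-9][0-9]\] \*\*\* NRE/ {
            # $1 looks like "[11:24:19.849]"; strip the brackets and split.
            split(substr($1, 2, 12), t, /[:.]/)
            ms = ((t[1] * 60 + t[2]) * 60 + t[3]) * 1000 + t[4]
            if (n == 0 || ms < min) min = ms
            if (n == 0 || ms > max) max = ms
            n++
        }
        END { if (n > 0) printf "%d faults within %d ms\n", n, max - min }
    '
}
```

Piping the repro output through it (for example `./run.sh aggressive 2>&1 | fault_spread_ms`) prints one summary line when at least one such fault line was logged, and nothing otherwise.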
Regression?
No response
Known Workarounds
No response
Configuration
Environment
- Versions: .NET 8.0/10.0 (for .NET 10, disabling DATAS helps reproduce)
- OS: Linux (tested on official Microsoft .NET Docker images)
- Arch: x86_64
- Server/Concurrent GC enabled
- DOTNET_GCLargePages=1 with real kernel huge pages
Other information
When GCLargePages is enabled, the garbage collector skips de-commits at the OS level
and also has special logic for clearing regions.
When an aggressive GC is induced, all regions are flagged for de-commit.
Because de-commit is always reported as successful for large pages, the bookkeeping is updated even though
the de-commit never really happened, which could potentially cause corruption.
It is possible that the aggressive de-commit needs to be skipped when GCLargePages is active.
However, it is also possible that forcing an Aggressive GC in this reproduction is simply exposing
existing bugs in the de-commit logic that lead to re-using pages which have not been
zeroed out by the OS.