GC heap corruption with GCLargePages

## Description

We (@BenV and @andyblox) are experiencing what appears to be heap corruption when using `GCLargePages` in production.

This can be forced to happen by using `GC.Collect` with `GCCollectionMode.Aggressive` when `GCLargePages`
mode is enabled on Linux.

When `DOTNET_GCLargePages=1` is enabled (with real kernel huge pages via hugetlbfs),
calling `GC.Collect(2, GCCollectionMode.Aggressive, true, true)` can corrupt the
managed heap. The corruption manifests as `NullReferenceException` and/or
`AccessViolationException` inside `ConcurrentDictionary` internals, with multiple
threads faulting at nearly the same time, which is characteristic of GC heap corruption.

Switching to `GCCollectionMode.Forced` (same generation, same blocking/compacting
parameters) works as expected with no corruption.

Although `GCCollectionMode.Aggressive` makes the heap corruption significantly more likely,
our application does not use it, so we would like assistance in tracking down the root cause.

## Analysis

The `GCCollectionMode.Aggressive` documentation states it "requests that the garbage
collector decommit as much memory as possible." We believe the aggressive decommit
code path interacts incorrectly with the GC's large-page memory management:

When `GCLargePages` is enabled the garbage collector [skips de-commits at the OS level](https://github.com/dotnet/runtime/blob/v10.0.5/src/coreclr/gc/gc.cpp#L7588)
and also has [special logic regarding clearing regions](https://github.com/dotnet/runtime/blob/v10.0.5/src/coreclr/gc/gc.cpp#L45174).
When an aggressive GC is induced, [all regions are flagged for de-commit](https://github.com/dotnet/runtime/blob/v10.0.5/src/coreclr/gc/gc.cpp#L13445)
and since de-commit success is always true for large pages the [bookkeeping is updated](https://github.com/dotnet/runtime/blob/v10.0.5/src/coreclr/gc/gc.cpp#L7552) even though
the de-commit never really happened, which could potentially cause corruption.

The aggressive de-commit may need to be skipped when `GCLargePages` is active.
It is also possible that forcing an Aggressive GC simply exposes existing bugs in the
de-commit logic that lead to reusing pages which have not been zeroed out by the OS.

We discovered this issue while tracking down a problem we have recently been hitting in production
when `GCLargePages` mode is active: we occasionally get a `NullReferenceException` inside
`ConcurrentDictionary` at a point where a `NullReferenceException` should not be possible:

```
System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.GrowTable(Tables tables, Boolean resizeDesired, Boolean forceRehashIfNonRandomized)
```

We created this example program to try to reproduce that original issue. While the two may be
related, we do not believe we are calling `GC.Collect` with `GCCollectionMode.Aggressive`, either directly or indirectly.
We would be very interested in your thoughts on whether other bugs could be causing
the similar failures we see in production.


## Reproduction

### Minimal repro program
[gc-largepages-repro.zip](https://github.com/user-attachments/files/26726770/gc-largepages-repro.zip)

#### Option 1: Docker (requires `--privileged` for huge page setup)

```bash
# Triggers corruption, usually within seconds:
./run.sh aggressive

# Control, identical workload, runs clean:
./run.sh forced

# To test with a different .NET version:
DOTNET_VERSION_ARG=10.0 ./run.sh aggressive
```

#### Option 2: Run locally on Linux

Requires real kernel huge pages allocated ahead of time (at least 4000 pages, about 8 GB at a 2 MB huge page size):

```bash
# Reserve huge pages (requires root):
echo 4000 | sudo tee /proc/sys/vm/nr_hugepages

# Build (env vars must NOT be set during build):
dotnet build -c Release

# Triggers corruption, usually within seconds:
DOTNET_GCLargePages=1 DOTNET_GCHeapHardLimit=0xC0000000 \
    dotnet bin/Release/net8.0/GCLargePagesRepro.dll aggressive

# Control, identical workload, runs clean:
DOTNET_GCLargePages=1 DOTNET_GCHeapHardLimit=0xC0000000 \
    dotnet bin/Release/net8.0/GCLargePagesRepro.dll forced
```
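Before running either mode, it can help to confirm that the reservation actually took effect, since the kernel may grant fewer huge pages than requested when memory is fragmented. The following is a minimal sketch, not part of the attached zip: `hugepage_summary` is a hypothetical helper name, and it only reads the standard `HugePages_Total`, `HugePages_Free`, and `Hugepagesize` fields from a meminfo-format file.

```bash
# Hypothetical helper (not part of the repro zip): summarize the huge page
# reservation from a meminfo-format file, defaulting to /proc/meminfo.
hugepage_summary() {
    awk '
        /^HugePages_Total:/ { total = $2 }    # pages currently reserved
        /^HugePages_Free:/  { free = $2 }     # reserved pages not yet in use
        /^Hugepagesize:/    { size_kb = $2 }  # size of one huge page, in kB
        END {
            printf "total=%d free=%d page_kb=%d reserved_mb=%d\n",
                   total, free, size_kb, total * size_kb / 1024
        }
    ' "${1:-/proc/meminfo}"
}
```

Calling `hugepage_summary` with no arguments reads `/proc/meminfo`. If `total` is below 4000 after the `echo 4000 | sudo tee ...` step, the reservation was only partially satisfied.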

The repro:

1. Creates 4 writer threads writing to a shared `ConcurrentDictionary<string, byte[]>`
2. Runs a separate thread that allocates temporary SOH/LOH arrays (creating GC region churn) and periodically calls `GC.Collect(2, mode, true, true)`

Results:

- With `GCCollectionMode.Aggressive`: corruption typically occurs within seconds
- With `GCCollectionMode.Forced`: runs clean for the full duration

### Configuration that triggers the bug

```
DOTNET_GCLargePages=1
GC.Collect(2, GCCollectionMode.Aggressive, true, true)
```

### Configurations that do NOT trigger the bug

```
DOTNET_GCLargePages=1
GC.Collect(2, GCCollectionMode.Forced, true, true)
```

```
DOTNET_GCLargePages=0
GC.Collect(2, GCCollectionMode.Aggressive, true, true)
```

## Expected behavior

`GC.Collect(2, GCCollectionMode.Aggressive, true, true)` should not corrupt the
managed heap regardless of whether large pages are enabled.

## Observed behavior

The heap is corrupted, surfacing as `AccessViolationException` or `NullReferenceException`. Corruption typically
occurs within seconds of starting, before the first status line is printed:

```
=== GCLargePages + Aggressive GC: Heap Corruption Repro ===
GC collect mode: Aggressive
Server GC:       True
GCLargePages:    1
Runtime:         .NET 8.0.25

Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.Collections.Concurrent.ConcurrentDictionary`2[[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].TryAddInternal(...)
   at Program.WriterLoop(Int32)
   at Program+<>c__DisplayClass3_1.<Main>b__0()
```

```
[11:24:19.849] *** NRE on thread 3 — HEAP CORRUPTION ***
[11:24:19.847] *** NRE on thread 1 — HEAP CORRUPTION ***
Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at System.Collections.Concurrent.ConcurrentDictionary`2.GrowTable(Tables tables, Boolean resizeDesired, Boolean forceRehashIfNonRandomized)
   at System.Collections.Concurrent.ConcurrentDictionary`2.TryAddInternal(...)
   at Program.WriterLoop(Int32 tid) in /app/Program.cs
```

The corruption cascades rapidly and multiple threads fault within milliseconds of each other,
consistent with a single GC event corrupting a heap region that multiple threads
then read from.
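
When running the repro repeatedly (for example across runtime versions or GC settings), the two failure signatures shown above can be detected mechanically from captured output. This is a minimal sketch under that assumption; `classify_repro_log` is a hypothetical helper, not part of the attached zip, and it only looks for the two exception type names seen in the logs above.

```bash
# Hypothetical helper: classify one captured repro run as "corrupt" or "clean".
# $1 is a file holding the combined stdout/stderr of a single run.
classify_repro_log() {
    if grep -Eq 'System\.(AccessViolationException|NullReferenceException)' "$1"; then
        echo corrupt
    else
        echo clean
    fi
}
```

A run could be captured with `./run.sh aggressive > run.log 2>&1` and then checked with `classify_repro_log run.log`.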

### Regression?

_No response_

### Known Workarounds

_No response_

### Configuration

### Environment

- Versions: .NET 8.0/10.0 (for .NET 10, disabling DATAS helps reproduce)
- OS: Linux (tested on official Microsoft .NET Docker images)
- Arch: x86_64
- Server/Concurrent GC enabled
- `DOTNET_GCLargePages=1` with real kernel huge pages


