ICICLE: MSM chunking, safety on proof generation and GPU memory reduction by p4u · Pull Request #1665 · Consensys/gnark

p4u · 2025-12-19T00:04:37Z

Description

This PR addresses critical stability and memory issues in the ICICLE Groth16 backend, enabling reliable proof generation for large circuits while drastically reducing GPU memory consumption.

Problems Solved

Illegal Memory Access Crashes: Large MSMs (especially BLS12-377 G2 with >4M constraints) were causing "illegal memory access" errors due to bugs in the C++ backend's internal chunking logic.
Race Conditions & Deadlocks: Concurrent MSM execution was causing data races, deadlocks, and occasional proof corruptions due to instability in the C++/CUDA backend's concurrency handling.
Excessive GPU Memory Usage: Pre-loading all proving key vectors at initialization consumed 80-90% of available GPU memory, preventing multiple circuits from being loaded and limiting maximum circuit size.
Memory Leaks: Device memory was not being freed promptly, causing accumulation over multiple proofs and eventual OOM errors.

Solution Overview

Commit 1: Go-side MSM Chunking
Commit 2: Sequential MSM Execution
Commit 3: On-Demand Vector Loading

Impact

Stability:

Eliminates "illegal memory access" crashes on large circuits
Prevents race conditions and deadlocks
Stable proof generation across all supported curves

Performance:

Proof times remain similar (PCIe transfer overhead is negligible vs MSM computation)
Sequential execution trades minimal performance for significant stability gains
Verified on RTX 5070 Ti (16GB):
- BLS12-377 (3.5M constraints): ~3.5s
- BW6-761 (4M constraints): ~6s
- BN254 (16M constraints): ~10s

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How has this been tested?

Integration Tests: Full test suite with custom circuits
Memory Stability: Sequential proof generation with circuit switching (3 circuits in loop)
Proof Verification: All generated proofs verified successfully
Long-running Tests: Multi-hour stress tests without memory leaks or crashes (TODO)

Test Environment:

NVIDIA RTX 5070 Ti (16GB VRAM)
CUDA 12.x
Ubuntu Linux
Multiple circuit types (BN254, BLS12-377, BLS12-381, BW6-761)

Checklist:

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (docs update can follow if needed)
I have added tests that prove my fix is effective or that my feature works (existing tests now pass reliably)
I did not modify files generated from templates (only modified template, regenerated all curves)
golangci-lint does not output errors locally
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules

Additional Notes

This PR prioritizes stability and reliability over maximum parallelism. The C++/CUDA backend's concurrency bugs made parallel MSM execution unreliable, leading to occasional proof corruptions and crashes.

Predictable behavior across all circuit sizes and curves
Efficient memory usage enabling production deployments
Maintainable code with simplified control flow

Note

Improves stability and memory usage across all ICICLE Groth16 curves (BN254, BLS12‑377/381, BW6‑761).

Implement Go-side MSM chunking for G1/G2 with memory-aware tuning (chunk cap, env ICICLE_MSM_MAX_WINDOW); replaces backend internal chunking
Serialize all GPU operations per device (mutex) and track/init/release NTT domain per device to avoid races/witness corruption
Switch to on-demand GPU loading of proving key vectors; add PinToGPU mode to pre-load/persist vectors (incl. commitment keys)
Add FreeGPUResources to release pinned/device buffers; fix leaks with timely Free() and runtime.KeepAlive
Update commitment/POK MSMs and KRS/BS1/AR1/BS2 paths to use chunked MSM; improve scheduling to avoid concurrency issues
Minor: add ICICLE_DEBUG, safer setup locking; docs add "GPU pinning" section

^{Written by Cursor Bugbot for commit 876dedf. This will update automatically on new commits. Configure here.}

…d fix data race - Implemented msmChunkedG1/G2 in Go to handle large MSMs by slicing inputs. This bypasses the internal C++ multi-chunking logic which was crashing on large BLS12-377 G2 MSMs (illegal memory access). - Enforced maximum chunk size of 2^18 (262144) in Go wrapper to ensure stability across all curves. - Aligned memory safety margins (bucketFactor=4.0, reduced=0.7) and indices estimation (4x) with C++ backend requirements. - Fixed a data race in Prove where the function returned before the G2 MSM computation was complete. - Updated generator template to reflect these changes for all curves.

- Removed parallel RunOnDevice calls for AR1, BS1, and BS2, consolidating them into a single sequential execution block. - Removed channel-based synchronization (chArDone, chBs1Done, chBs2Done) in favor of direct sequential ordering to eliminate deadlocks and simplify control flow. - Ensured computeKRS runs after all other MSMs within the same device context, preventing potential backend concurrency issues.

Drastically reduce GPU memory usage by loading proving key vectors only when needed during proof generation, instead of pre-loading everything at initialization. - Modified setupDevicePointers() to skip bulk loading of large vectors - Added on-demand loading helpers: * loadG1(): loads with Montgomery→Standard conversion (for A,B,K,Z) * loadG1Raw(): preserves Montgomery form (for Commitment Keys) * loadG2(): loads with Montgomery→Standard conversion (for G2.B) - Updated Prove() to load vectors on-demand in compute functions: * computeAR1(): loads G1.A, uses for MSM, frees immediately * computeBS1(): loads G1.B, uses for MSM, frees immediately * computeBS2(): loads G2.B, uses for MSM, frees immediately * computeKRS(): loads G1.K and G1.Z, uses for MSMs, frees immediately - Fixed memory leak: privateCommittedValuesDevice now freed after POK computation - Added nil pointer guards on all .Free() calls to prevent panics - Commitment keys loaded on-demand during hint execution Impact: - Memory usage peak reduced (80-90%) - Enables larger circuits and more concurrent proof operations - Maintains correctness with proper Montgomery form handling

… scope Concurrent calls to Prove() could simultaneously initialize pk.deviceInfo, causing double initialization and GPU memory corruption. Add setupMu sync.Mutex to ProvingKey struct and protect the entire setupDevicePointers() function with Lock/Unlock. This ensures atomic check-and-initialize semantics. Fix race condition where the nttDomainMu mutex was released before ReleaseDomain() and InitDomain() operations completed, allowing concurrent threads to corrupt domain state.

…ry leaks GPU Memory Leak in load functions: loadG1() and loadG2() now free GPU memory before returning on Montgomery NTT Domain Race Condition: added per-device RWMutex (nttDomainRWByDevice) for NTT domain protection

p4u · 2025-12-19T12:17:03Z

I think it is now quite solid, thanks to cursor bugbot too

If true, the proving key is not uloaded from GPU memory, allowing faster proving (win of 30-50 ms on a bls12-377 3M constraints circuit)

p4u · 2025-12-19T17:10:50Z

Added PinToGPU option for icicle proving keys. So the caller can choose to keep the Proving key in GPU memory, to avoid spending time on load/unload. On my tests, the win is about 30-50 ms for a bls12-377 3.5M constraint circuit. By default this option is set to false.

Replace RWMutex with exclusive per-device mutex to prevent concurrent CUDA stream access to shared NTT domain, fixing sporadic constraint violations under load.

ivokub

Thanks for the PR! Seems quite solid. I locally tested with some example circuits (1.5M constraints) and seems to work very well, but it doesn't hit the edge cases yet.

Can you confirm the cursor comment about the NTT domain initialization? Perhaps Cursor comment doesn't make sense (i.e. when new domain is smaller than existing, then we don't release it but still initialize with new root of unity)?

And secondly imo we could also make the commitment key loading conditional on PinToGPU or remove field from deviceInfo as we now only have a local variable for it.

There is also CI failure as code generation embeds the current year in the file headers and it has updated. We'll update it on repo master branch soon, otherwise there would be hundreds of changed files.

… per‑device max Signed-off-by: Pau Escrich <p4u@dabax.net>

CommitmentKeysDevice was unused and PinToGPU still reloaded commitment bases every proof. Signed-off-by: Pau Escrich <p4u@dabax.net>

Also added a guarded re‑check after acquiring the lock to avoid double‑free if FreeGPUResources is called concurrently. Signed-off-by: Pau Escrich <p4u@dabax.net>

…tupMu Prove now takes a locked snapshot to decide logging, then always calls setupDevicePointers, which already guards with the same mutex. Signed-off-by: Pau Escrich <p4u@dabax.net>

p4u · 2026-01-09T14:41:55Z

All comments have been addressed @ivokub. If there is nothing else, I'd say this is ready to merge.

I tried to rebase with master, but as you said, it means hundred of file modifications. Thus, if you agree, I'd let the final rebase to you after the year is changed on the headers.

If you want me to squash, I could do it. However, I think the individual commits are quite curated and provide a good vision of the history to reach the final state.

ivokub · 2026-01-14T11:04:40Z

All comments have been addressed @ivokub. If there is nothing else, I'd say this is ready to merge.

I tried to rebase with master, but as you said, it means hundred of file modifications. Thus, if you agree, I'd let the final rebase to you after the year is changed on the headers.

If you want me to squash, I could do it. However, I think the individual commits are quite curated and provide a good vision of the history to reach the final state.

Thanks for the fixes! Looks good implementation-wise. AWS doesn't currently have GPU capacity for the instance I have, I'll run some tests and then merge if they pass.

ivokub

Thanks! I merged master and added option to be able to conveniently set the Pinning option without having to type assert the interface.

p4u added 3 commits December 16, 2025 17:00

cursor bot reviewed Dec 19, 2025

View reviewed changes

Comment thread backend/accelerated/icicle/groth16/bls12-377/icicle.go

cursor bot reviewed Dec 19, 2025

View reviewed changes

Comment thread backend/accelerated/icicle/internal/generator/templates/groth16.icicle.go.tmpl

Comment thread backend/accelerated/icicle/internal/generator/templates/groth16.icicle.go.tmpl

p4u force-pushed the p4u/icicle_fix branch from 8e3dedc to 69fff73 Compare December 19, 2025 09:32

icicle: add pk.PinToGPU

126036d

If true, the proving key is not uloaded from GPU memory, allowing faster proving (win of 30-50 ms on a bls12-377 3M constraints circuit)

cursor bot reviewed Dec 19, 2025

View reviewed changes

Comment thread backend/accelerated/icicle/groth16/bls12-377/icicle.go

Comment thread backend/accelerated/icicle/groth16/bls12-377/icicle.go

icicle: serialize GPU operations to prevent witness corruption

373b7ed

Replace RWMutex with exclusive per-device mutex to prevent concurrent CUDA stream access to shared NTT domain, fixing sporadic constraint violations under load.

p4u force-pushed the p4u/icicle_fix branch from 4c06237 to 373b7ed Compare December 25, 2025 17:23

cursor bot reviewed Dec 25, 2025

View reviewed changes

Comment thread backend/accelerated/icicle/groth16/bls12-377/icicle.go

Comment thread backend/accelerated/icicle/groth16/bls12-377/icicle.go

ivokub reviewed Jan 5, 2026

View reviewed changes

Comment thread backend/accelerated/icicle/internal/generator/templates/groth16.icicle.provingkey.go.tmpl

ivokub reviewed Jan 5, 2026

View reviewed changes

p4u added 2 commits January 9, 2026 10:59

icicle: updated the template to only call InitDomain when we grow the…

4e504c6

… per‑device max Signed-off-by: Pau Escrich <p4u@dabax.net>

icicle: implement PinToGPU support for commitment keys

ec132af

CommitmentKeysDevice was unused and PinToGPU still reloaded commitment bases every proof. Signed-off-by: Pau Escrich <p4u@dabax.net>

cursor bot reviewed Jan 9, 2026

View reviewed changes

Comment thread backend/accelerated/icicle/groth16/bls12-377/icicle.go

icicle: clear pk.deviceInfo after freeing resources,

2111d37

Also added a guarded re‑check after acquiring the lock to avoid double‑free if FreeGPUResources is called concurrently. Signed-off-by: Pau Escrich <p4u@dabax.net>

cursor bot reviewed Jan 9, 2026

View reviewed changes

Comment thread backend/accelerated/icicle/groth16/bls12-377/icicle.go Outdated

icicle: fixed data race by moving all pk.deviceInfo reads under pk.se…

2f8a385

…tupMu Prove now takes a locked snapshot to decide logging, then always calls setupDevicePointers, which already guards with the same mutex. Signed-off-by: Pau Escrich <p4u@dabax.net>

Merge branch 'master' into pr/p4u/1665

767a18c

ivokub added 3 commits January 14, 2026 22:33

Merge branch 'master' into pr/p4u/1665

4bd000c

feat: add GPU pinning as option

9a22036

chore: go generate

876dedf

ivokub approved these changes Jan 14, 2026

View reviewed changes

cursor bot reviewed Jan 14, 2026

View reviewed changes

Comment thread backend/accelerated/icicle/groth16/bls12-377/icicle.go

ivokub merged commit 0d7ec1c into Consensys:master Jan 14, 2026
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ICICLE: MSM chunking, safety on proof generation and GPU memory reduction#1665

ICICLE: MSM chunking, safety on proof generation and GPU memory reduction#1665
ivokub merged 15 commits intoConsensys:masterfrom
p4u:p4u/icicle_fix

p4u commented Dec 19, 2025 •

edited by cursor bot

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

p4u commented Dec 19, 2025

Uh oh!

p4u commented Dec 19, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ivokub left a comment

Uh oh!

Uh oh!

Uh oh!

p4u commented Jan 9, 2026

Uh oh!

ivokub commented Jan 14, 2026

Uh oh!

ivokub left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

p4u commented Dec 19, 2025 • edited by cursor bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Problems Solved

Solution Overview

Impact

Type of change

How has this been tested?

Checklist:

Additional Notes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

p4u commented Dec 19, 2025

Uh oh!

p4u commented Dec 19, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ivokub left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

p4u commented Jan 9, 2026

Uh oh!

ivokub commented Jan 14, 2026

Uh oh!

ivokub left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

p4u commented Dec 19, 2025 •

edited by cursor bot

Loading