
perf: optimize emulated multi-miller loops via sparse×sparse line multiplications for 0-bits #1701

Merged: yelhousni merged 6 commits into master from perf/pairing on Feb 17, 2026

Conversation


@yelhousni commented on Feb 6, 2026

Description

This PR optimizes the Miller loop in emulated pairing circuits by batching sparse line multiplications across pairs. On 0-bit iterations, where each pair contributes a single line, lines are no longer multiplied into the accumulator one at a time: they are batched 2-by-2 with a sparse×sparse multiplication, and the semi-sparse product is then multiplied with the accumulator.
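
The batching schedule described above can be sketched on a toy model. This is only an illustration, not the gnark code: `elem`, `mul`, `accumulateNaive` and `accumulateBatched` are hypothetical names, and integers modulo a small prime stand in for Fp12 elements and sparse lines. It shows that both schedules give the same product while the batched one touches the accumulator roughly half as often.

```go
package main

import "fmt"

// Toy stand-in for field elements: integers modulo a small prime.
// In the real circuit these would be Fp12 elements and sparse lines.
const p = 1_000_003

type elem = int

func mul(a, b elem) elem { return a * b % p }

// accumulateNaive multiplies every line into the accumulator one by one,
// which costs len(lines) accumulator multiplications.
func accumulateNaive(acc elem, lines []elem) (result elem, accMuls int) {
	for _, l := range lines {
		acc = mul(acc, l)
		accMuls++
	}
	return acc, accMuls
}

// accumulateBatched multiplies lines 2-by-2 first (the sparse×sparse step),
// then folds each product into the accumulator (the semi-sparse step).
// An odd leftover line is multiplied into the accumulator on its own.
func accumulateBatched(acc elem, lines []elem) (result elem, accMuls, lineMuls int) {
	i := 0
	for ; i+1 < len(lines); i += 2 {
		prod := mul(lines[i], lines[i+1])
		lineMuls++
		acc = mul(acc, prod)
		accMuls++
	}
	if i < len(lines) {
		acc = mul(acc, lines[i])
		accMuls++
	}
	return acc, accMuls, lineMuls
}

func main() {
	lines := []elem{3, 5, 7, 11, 13}
	r1, n1 := accumulateNaive(2, lines)
	r2, n2, m2 := accumulateBatched(2, lines)
	fmt.Println(r1 == r2, n1, n2, m2) // true 5 3 2
}
```

The saving in the real circuit comes from the fact that a line×line product is much cheaper than a line×accumulator product; the toy model counts operations but does not model that cost difference.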

Optimization Pattern

For BLS12-381 and BN254, the optimization applies to:

  1. Main loop case 0: When the loop counter bit is 0, there's only one line per pair
  2. First iteration (k ≥ 2): Initial accumulation of lines beyond the first two pairs

The key insight is that multiplying two sparse lines together produces a semi-sparse result (with fewer non-zero coefficients than a dense element), which can then be multiplied more efficiently with the dense accumulator.
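
A minimal numeric sketch of this insight, under loud assumptions: the 12 coefficients are treated as a polynomial in `w` reduced by `w^12 = 2w^6 - 2`, the prime `p = 101` is a toy value rather than the BLS12-381 base field, and `E12`, `mul` and `line` are illustrative names, not the library's API. Lines are non-zero only at positions 0, 2, 3, 6 and 8, matching the `02368` naming used in this PR.

```go
package main

import "fmt"

const (
	p = 101 // toy prime, NOT the BLS12-381 base field
	n = 12
)

// E12 holds the 12 coefficients of a polynomial in w.
type E12 [n]int

// mul multiplies in F_p[w]/(w^12 - 2w^6 + 2): schoolbook product, then
// reduction with w^d = 2*w^(d-6) - 2*w^(d-12) from the top degree down.
func mul(a, b E12) E12 {
	var t [2*n - 1]int
	for i := 0; i < n; i++ {
		for j := 0; j < n; j++ {
			t[i+j] = (t[i+j] + a[i]*b[j]) % p
		}
	}
	for d := 2*n - 2; d >= n; d-- {
		c := t[d]
		t[d] = 0
		t[d-6] = (t[d-6] + 2*c) % p
		t[d-12] = (t[d-12] + (p-2)*c) % p
	}
	var r E12
	copy(r[:], t[:n])
	return r
}

// line builds a sparse element that is non-zero only at positions 0,2,3,6,8.
func line(c0, c2, c3, c6, c8 int) E12 {
	var l E12
	l[0], l[2], l[3], l[6], l[8] = c0, c2, c3, c6, c8
	return l
}

func main() {
	l1, l2 := line(1, 2, 3, 4, 5), line(6, 7, 8, 9, 10)
	var acc E12 // dense accumulator
	for i := range acc {
		acc[i] = i + 1
	}
	prod := mul(l1, l2)               // sparse×sparse
	sequential := mul(mul(acc, l1), l2) // two sparse multiplications
	batched := mul(acc, prod)           // one semi-sparse multiplication
	fmt.Println(prod[1], prod[7], sequential == batched) // 0 0 true
}
```

Positions 1 and 7 of the product stay zero for any coefficient values, because no pair of exponents from {0, 2, 3, 6, 8} sums to 1, 7, 13 or 19. A dedicated multiplication by such an element can skip those two columns.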

Changes by Curve

BLS12-381 (sw_bls12381/pairing.go, fields_bls12381/e12_pairing.go):

  • Added Mul02368By02368ThenMul: combines sparse×sparse product with dense multiplication
  • Added MulBySemiSparse1_7: specialized multiplication where positions 1 and 7 are zero
  • Updated millerLoopLines to batch lines 2-by-2 across pairs

BN254 (sw_bn254/pairing.go):

  • Updated case 0 in main loop to use Mul01379By01379 + MulBy012346789 for 2-by-2 batching
  • Updated first iteration (k ≥ 2) with same optimization
  • Note: BN254 already had the sparse×sparse batching within pairs for non-zero bits

BW6-761 (sw_bw6761/pairing.go):

  • Updated first iteration (k ≥ 2) to use Mul023By023 + MulBy02345
  • Note: Main loop case 0 already had this optimization
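
The helper names above can be read as lists of coefficient positions. The sketch below computes which positions of a sparse×sparse product can be non-zero. It is structural only and rests on assumptions not stated in this PR: `productSupport` is a hypothetical helper, the degree-12 case assumes `w^12` reduces to a combination of `w^6` and a constant, and the degree-6 case assumes `w^6` reduces to a constant.

```go
package main

import (
	"fmt"
	"sort"
)

// productSupport returns the positions that can be non-zero in the product of
// two elements with supports a and b, in a degree-deg extension where w^deg
// is rewritten as a combination of the positions listed in tail.
// It is an upper bound: accidental cancellations are ignored.
func productSupport(a, b []int, deg int, tail []int) []int {
	seen := map[int]bool{}
	var visit func(d int)
	visit = func(d int) {
		if d < deg {
			seen[d] = true
			return
		}
		for _, t := range tail { // w^d = w^(d-deg) * w^deg
			visit(d - deg + t)
		}
	}
	for _, i := range a {
		for _, j := range b {
			visit(i + j)
		}
	}
	out := make([]int, 0, len(seen))
	for d := range seen {
		out = append(out, d)
	}
	sort.Ints(out)
	return out
}

func main() {
	l12 := []int{0, 2, 3, 6, 8}
	fmt.Println(productSupport(l12, l12, 12, []int{6, 0})) // [0 2 3 4 5 6 8 9 10 11]
	l6 := []int{0, 2, 3}
	fmt.Println(productSupport(l6, l6, 6, []int{0})) // [0 2 3 4 5]
}
```

Under these assumptions the first result is missing exactly positions 1 and 7, consistent with `MulBySemiSparse1_7`, and the second matches the positions in `MulBy02345`.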

Type of change

  • Performance improvement (non-breaking change that improves efficiency)

Benchmarks

PairingCheck SCS Constraint Counts in a BN254 circuit

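| Curve | n | Before | After | Δ | Improvement |
|---|---|---|---|---|---|
| BLS12-381 | 2 | 2,063,666 | 1,915,970 | -147,696 | 7.2% |
| BLS12-381 | 4 | 3,507,950 | 3,212,558 | -295,392 | 8.4% |
| BLS12-381 | 10 | 7,840,802 | 7,102,322 | -738,480 | 9.4% |
| BN254 | 2 | 1,780,197 | 1,711,117 | -69,080 | 3.9% |
| BN254 | 4 | 2,963,267 | 2,825,107 | -138,160 | 4.7% |
| BN254 | 10 | 6,512,477 | 6,167,077 | -345,400 | 5.3% |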
  • BW6-761 already had the main-loop optimization; only the first iteration was updated (minimal impact).
  • BN254 already had the sparse×sparse batching within pairs for non-zero bits.
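
As a quick sanity check on the table, the sketch below recomputes the saving per pair and the relative improvement from the Before/After columns. The `row` type and helper names are illustrative; the figures are copied from the table above.

```go
package main

import "fmt"

// row is one benchmark line: curve, number of pairs n, constraint counts.
type row struct {
	curve         string
	n             int
	before, after int
}

// perPair returns the number of constraints saved per pair.
func perPair(r row) int { return (r.before - r.after) / r.n }

// improvement returns the relative saving in percent.
func improvement(r row) float64 {
	return 100 * float64(r.before-r.after) / float64(r.before)
}

func main() {
	rows := []row{
		{"BLS12-381", 2, 2063666, 1915970},
		{"BLS12-381", 4, 3507950, 3212558},
		{"BLS12-381", 10, 7840802, 7102322},
		{"BN254", 2, 1780197, 1711117},
		{"BN254", 4, 2963267, 2825107},
		{"BN254", 10, 6512477, 6167077},
	}
	for _, r := range rows {
		fmt.Printf("%-9s n=%-2d saved/pair=%d improvement=%.1f%%\n",
			r.curve, r.n, perPair(r), improvement(r))
	}
}
```

With these numbers the saving is 73,848 constraints per pair for BLS12-381 and 34,540 for BN254 at every listed n, which suggests the percentage grows with n mainly because the parts that do not scale with n are amortized over more pairs.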

Applications

ECPairBLS precompile:

| n | Before | After | Δ | % |
|---|---|---|---|---|
| 2 | 2,785,603 | 2,763,742 | -21,861 | 0.8% |
| 3 | 4,023,178 | 3,990,316 | -32,862 | 0.8% |
| 4 | 5,260,753 | 5,216,890 | -43,863 | 0.8% |
| 10 | 12,686,203 | 12,576,334 | -109,869 | 0.9% |
| 20 | 25,061,953 | 24,842,074 | -219,879 | 0.9% |
| 50 | 62,189,203 | 61,639,294 | -549,909 | 0.9% |
  • BLS12-381 gained ~0.9% because the non-zero bit handling changed from 2× MulBy02368 to 1× Mul02368By02368ThenMul
  • BN254 shows no change because it already had Mul01379By01379 + MulBy012346789 for non-zero bits
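
The ECPairBLS deltas follow an affine pattern in n. The sketch below derives slope and intercept from the first two rows and checks the remaining rows against them; `point` and `fitAffine` are illustrative names and the figures are the Δ column above with the sign dropped.

```go
package main

import "fmt"

// point is one row of the ECPairBLS table: n pairs and constraints saved.
type point struct{ n, saved int }

// fitAffine derives slope and intercept of saved = slope*n + intercept
// from two rows.
func fitAffine(a, b point) (slope, intercept int) {
	slope = (b.saved - a.saved) / (b.n - a.n)
	intercept = a.saved - slope*a.n
	return slope, intercept
}

func main() {
	rows := []point{
		{2, 21861}, {3, 32862}, {4, 43863},
		{10, 109869}, {20, 219879}, {50, 549909},
	}
	slope, intercept := fitAffine(rows[0], rows[1])
	allMatch := true
	for _, r := range rows {
		if slope*r.n+intercept != r.saved {
			allMatch = false
		}
	}
	fmt.Println(slope, intercept, allMatch) // 11001 -141 true
}
```

Every row fits saved = 11,001·n − 141 exactly, so in this application each additional pair saves 11,001 constraints.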

How has this been tested?

  • All existing pairing tests pass for BLS12-381, BN254, and BW6-761
  • TestPairTestSolve, TestPairFixedTestSolve, TestPairingCheckTestSolve pass
  • TestPairingMuxes passes with varying pair counts (0-5)

Checklist:

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I did not modify files generated from templates
  • golangci-lint does not output errors locally
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

Note

Medium Risk
Touches core pairing arithmetic (Miller loop multiplication paths) across multiple curves, so any formula/indexing mistake could silently break correctness despite being a performance-focused change.

Overview
Performance optimization for emulated pairing circuits by batching sparse line evaluations during multi-Miller loops, replacing repeated accumulator×line multiplications with sparse×sparse line products followed by a cheaper semi-sparse multiply.

For BLS12-381, adds new Ext12 helpers (Mul02368By02368ThenMul and MulBySemiSparse1_7) and updates millerLoopLines to batch 0-bit iterations across pairs and to combine the two within-pair line multiplications into a single fused operation; it also factors final exponentiation’s hard part into finalExpHardPart (logic preserved).

For BN254 and BW6-761, updates the first-iteration accumulation and 0-bit handling to batch independent lines 2-by-2 using existing sparse×sparse helpers (Mul01379By01379/MulBy012346789, Mul023By023/MulBy02345). Benchmark stats in internal/stats/latest_stats.csv are updated accordingly.

Written by Cursor Bugbot for commit 4178589.

@yelhousni added this to the v0.14.N milestone on Feb 6, 2026
@yelhousni requested review from Copilot and ivokub on February 6, 2026 00:08
@yelhousni self-assigned this on Feb 6, 2026
@yelhousni added the labels type: perf and dep: linea (Issues affecting Linea downstream) on Feb 6, 2026

Copilot AI left a comment


Pull request overview

This PR optimizes the Miller loop in emulated pairing circuits by batching sparse line evaluations 2-by-2 across pairs when processing single lines per pair (0-bit iterations). Instead of multiplying each sparse line individually with the dense accumulator, pairs of sparse lines are first multiplied together using sparse×sparse multiplication to produce a semi-sparse result, which is then more efficiently multiplied with the accumulator. This optimization applies to BLS12-381, BN254, and BW6-761 curves.

Changes:

  • Added Mul02368By02368ThenMul and MulBySemiSparse1_7 methods for BLS12-381 to support batched sparse×sparse line multiplication
  • Updated Miller loop implementations in BLS12-381, BN254, and BW6-761 to batch lines 2-by-2 in 0-bit cases and initial iterations
  • Refactored BLS12-381 FinalExponentiation to extract hard part into separate finalExpHardPart method

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| std/algebra/emulated/fields_bls12381/e12_pairing.go | Adds Mul02368By02368ThenMul for sparse×sparse line multiplication and MulBySemiSparse1_7 for multiplying by semi-sparse elements with zeros at positions 1 and 7 |
| std/algebra/emulated/sw_bls12381/pairing.go | Applies 2-by-2 batching optimization to first iteration and 0-bit cases in Miller loop; refactors final exponentiation hard part into separate method |
| std/algebra/emulated/sw_bn254/pairing.go | Applies 2-by-2 batching to case 0 in main loop and k≥2 in first iteration |
| std/algebra/emulated/sw_bw6761/pairing.go | Applies 2-by-2 batching to k≥2 in first iteration (main loop already had optimization) |


Comment thread std/algebra/emulated/sw_bls12381/pairing.go
@yelhousni merged commit df2294c into master on Feb 17, 2026
13 checks passed
@yelhousni deleted the perf/pairing branch on February 17, 2026 15:24

Labels

dep: linea Issues affecting Linea downstream type: perf


4 participants