ORE v2 (3/n): efficient unary encoding + hardware AES on aarch64#80

Draft
coderdan wants to merge 3 commits into feat/ore-v2-core-refactor from feat/ore-v2-efficient-encoding

@coderdan

Stacked on #79. Plan §2 (docs/plans/2026-06-12-ore-v2-architecture.md).

Headline numbers (Apple M1 Max, u64 encrypt)

| Configuration | encrypt-8 | vs old default |
|---|---|---|
| main, software AES (old default on ARM) | 381.1 µs | 1.0× |
| main, hardware AES | 39.4 µs | 9.7× |
| this PR, hardware AES (new default) | 25.1 µs | 15.2× |

Full matrix in docs/benchmarks/2026-06-13-pr3-results.md.

What

All changes below are byte-identical on the wire: the 12 compatibility vectors pass unchanged.

  1. Hardware AES on aarch64: aes v0.8 requires --cfg aes_armv8 and, without it, was silently running software AES (~60× slower per block on M1 Max) on every default ARM build. Workspace .cargo/config.toml now sets the cfg; README replaces the outdated nightly advice (stable ≥1.61 suffices). Downstream crates need the cfg in their own builds, which is worth an announcement when this releases.
  2. Bulk right-block encoding (−36% encrypt under hw AES): hash LSBs packed straight into the block bitvector (hash_all_into, no Vec), indicator mask XORed over the top in one linear pass of the PRP's inverse table (indicator_mask_xor) — replacing 256 per-bit invert lookups + branchy bit-sets per block. Equivalence vs the per-bit reference pinned by quickcheck.
  3. Hasher hoist: the nonce key schedule was rebuilt once per block; now once per encryption.
  4. PRNG pointer flattening (−18% encrypt-left): one branch per byte instead of three, byte stream exactly preserved (including the historical skip-byte-0-on-regeneration quirk, now documented as load-bearing).
  5. Knuth shuffle skips its degenerate i=0 iteration: it always lands on swap(0,0) after an expected 256 rejection draws, and the RNG is dropped immediately after — unobservable, saves a full PRNG regeneration per PRP.
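The bulk right-block encoding in item 2 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the names `encode_per_bit` and `encode_bulk`, the `indicator` predicate (`jstar > y`), and the fixed 256-entry domain are hypothetical. Only the two-pass shape (pack hash LSBs, then XOR an indicator mask from one linear walk of the inverse table) is taken from the description above.

```rust
const DOMAIN: usize = 256;
const BLOCK_BYTES: usize = DOMAIN / 8;

/// Hypothetical comparison predicate; the real scheme's predicate may differ.
fn indicator(jstar: u8, y: u8) -> u8 {
    (jstar > y) as u8
}

/// Per-bit reference: one inverse lookup and one conditional bit-set per position.
fn encode_per_bit(inverse: &[u8; DOMAIN], hash: &[u8; DOMAIN], y: u8) -> [u8; BLOCK_BYTES] {
    let mut block = [0u8; BLOCK_BYTES];
    for j in 0..DOMAIN {
        let bit = (hash[j] & 1) ^ indicator(inverse[j], y);
        if bit == 1 {
            block[j / 8] |= 1u8 << (j % 8);
        }
    }
    block
}

/// Bulk variant: pack the hash LSBs, then XOR the indicator mask over the top.
fn encode_bulk(inverse: &[u8; DOMAIN], hash: &[u8; DOMAIN], y: u8) -> [u8; BLOCK_BYTES] {
    let mut block = [0u8; BLOCK_BYTES];
    // Pass 1: eight hash LSBs per output byte, LSB-first.
    for (out, chunk) in block.iter_mut().zip(hash.chunks_exact(8)) {
        *out = chunk
            .iter()
            .enumerate()
            .fold(0u8, |acc, (k, h)| acc | ((*h & 1) << k));
    }
    // Pass 2: one linear walk of the inverse table builds and XORs the mask.
    for (out, chunk) in block.iter_mut().zip(inverse.chunks_exact(8)) {
        let mask = chunk
            .iter()
            .enumerate()
            .fold(0u8, |acc, (k, &jstar)| acc | (indicator(jstar, y) << k));
        *out ^= mask;
    }
    block
}
```

Comparing the two functions over random inputs is the kind of equivalence a quickcheck property would pin; the sketch uses plain arrays so the comparison is easy to follow.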

What's now dominant

PRP construction: ~2.2 µs/block, ~70% of encrypt under hardware AES, and wire-frozen for this scheme. That elevates the plan's open question 1 (constant-time small-domain PRP for new schemes) — see the PR 5 notes.
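For context on why the shuffle's i=0 step can be skipped during PRP construction, here is a small sketch. It assumes the PRP is built by a descending Knuth shuffle with plain rejection sampling (draw a byte until it is at most i), which is consistent with the "expected 256 draws" figure above but is not confirmed against the crate's code. `XorShift`, `sample_at_most`, and `build_prp` are hypothetical names, and the xorshift generator is only a stand-in for the AES-based PRNG.

```rust
/// Stand-in byte source (xorshift64); NOT the AES-based PRNG used by the crate.
struct XorShift(u64);

impl XorShift {
    fn next_byte(&mut self) -> u8 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        (self.0 >> 32) as u8
    }
}

/// Draw bytes until one falls in 0..=i (assumed rejection rule).
fn sample_at_most(rng: &mut XorShift, i: u8) -> u8 {
    loop {
        let b = rng.next_byte();
        if b <= i {
            return b;
        }
    }
}

/// Hypothetical PRP builder: Knuth shuffle from index 255 down to `stop`,
/// returning the forward permutation and its inverse table.
/// `stop = 0` runs the degenerate last step; `stop = 1` skips it.
fn build_prp(seed: u64, stop: usize) -> ([u8; 256], [u8; 256]) {
    let mut rng = XorShift(seed);
    let mut forward = [0u8; 256];
    for (j, slot) in forward.iter_mut().enumerate() {
        *slot = j as u8;
    }
    for i in (stop..256).rev() {
        // At i = 0 the only accepted draw is 0, so this is always swap(0, 0).
        let j = sample_at_most(&mut rng, i as u8) as usize;
        forward.swap(i, j);
    }
    let mut inverse = [0u8; 256];
    for (j, &p) in forward.iter().enumerate() {
        inverse[p as usize] = j as u8;
    }
    (forward, inverse)
}
```

Calling `build_prp(seed, 0)` and `build_prp(seed, 1)` with the same nonzero seed yields identical tables: the skipped step only consumes RNG output that nothing reads afterwards.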

@coderdan force-pushed the feat/ore-v2-core-refactor branch from ab17b5a to 7170514 on June 12, 2026 15:23
@coderdan force-pushed the feat/ore-v2-efficient-encoding branch from eb98f36 to 980ae10 on June 12, 2026 15:23
coderdan added a commit that referenced this pull request Jun 16, 2026
…dead code)

Code-review follow-ups on the efficient-encoding PR (no wire change; compat
vectors remain byte-identical):

- Extract pack_bits_lsb_first (primitives.rs) as the single home of the
  right-ciphertext LSB-first bit-packing convention, shared by
  Hash::hash_all_into and Prp::indicator_mask_xor (previously duplicated in
  both). It carries a real assert!(out.len()*8 == src.len()), replacing both
  debug_asserts and the hardcoded 256 in indicator_mask_xor, so the
  chunks_exact(8) remainder can no longer be dropped silently in release.
- Remove the dead RightBitVec::set_bit (trait method + impl); the bulk encoder
  writes via as_mut_bytes and the only set_bit callers are tests hitting the
  inherent RightBlock32::set_bit. Fix the dangling [Self::set_bit] doc link.
- Correct the stale Aes128Prng comment (it buffers and regenerates 256 bytes;
  it does not panic).
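
A helper along these lines could look like the sketch below. The signature is an assumption (the real `pack_bits_lsb_first` in primitives.rs may take different arguments, or XOR into `out` rather than overwrite it); what the sketch keeps from the commit message is the LSB-first convention and the unconditional `assert!` on the length relationship.

```rust
/// Hypothetical shape of the helper: packs the low bit of each `src` byte
/// into `out`, eight bits per output byte, LSB-first.
fn pack_bits_lsb_first(src: &[u8], out: &mut [u8]) {
    // A real assert (not debug_assert): without it, chunks_exact(8) would
    // silently ignore a trailing remainder of `src` in release builds.
    assert!(
        out.len() * 8 == src.len(),
        "src must hold exactly 8 bits per output byte"
    );
    for (byte, chunk) in out.iter_mut().zip(src.chunks_exact(8)) {
        *byte = chunk
            .iter()
            .enumerate()
            .fold(0u8, |acc, (k, b)| acc | ((*b & 1) << k));
    }
}
```

With this shape, a 9-element `src` and a 1-byte `out` panics in both debug and release builds instead of dropping the ninth bit.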

The aes_armv8 RUSTFLAGS footgun is tracked separately for the aes 0.8->0.9
upgrade (#86), which removes the cfg entirely.
coderdan added 3 commits June 16, 2026 19:35
… on aarch64

All changes are byte-identical to the previous implementation, verified by
the PR 1 compatibility vectors:

- Right-block encoding is now two bulk passes: hash bits packed straight
  into the block bitvector (hash_all_into, no Vec), then the PRP XORs its
  indicator mask over the top in one linear walk of its inverse table
  (indicator_mask_xor) — replacing DOMAIN per-bit invert lookups and
  branchy bit-sets per block. A quickcheck test pins mask equivalence
  against the per-bit reference.
- The nonce-keyed hasher is constructed once per encryption instead of
  once per block (the key schedule was being rebuilt N times).
- PRNG pointer flattened (one branch per byte instead of three), exactly
  preserving the historical byte stream including the skip-byte-0-after-
  regeneration quirk, which is load-bearing for ciphertext bytes.
- The Knuth shuffle skips its degenerate i=0 iteration (always swap(0,0),
  reached only after an expected 256 rejection-sampled draws; the RNG is
  dropped immediately after, so the skip is unobservable).
- .cargo/config.toml enables ARMv8 hardware AES for workspace builds: aes
  v0.8 requires --cfg aes_armv8 on aarch64 and falls back to software AES
  (~60x slower per block on M1 Max) without it. README updated for
  downstream builds (stable Rust suffices; the old nightly advice was
  outdated).
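
Because the cfg has to be present in each downstream build, a crate could surface it with a small check like the one below. This is a hedged sketch: `expected_aes_backend` is a hypothetical helper, not part of this PR or of the aes crate, and it only inspects the cfg flags visible at compile time. It does not prove that hardware instructions are actually executed at runtime.

```rust
/// Reports which AES path a build is expected to take under aes v0.8,
/// judged only from compile-time cfg flags (hypothetical helper).
#[allow(unexpected_cfgs)]
pub fn expected_aes_backend() -> &'static str {
    if cfg!(not(target_arch = "aarch64")) {
        "non-aarch64 target: aes_armv8 is not relevant"
    } else if cfg!(aes_armv8) {
        "aarch64 with aes_armv8: hardware AES expected"
    } else {
        "aarch64 without aes_armv8: software AES fallback expected"
    }
}
```

A downstream test suite could assert on the returned string for its aarch64 targets, turning a silent software fallback into a visible failure.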

Plan updates riding along: §5(b) benchmark gate result (key expansion ~84%
of per-block work at Bit6 width on M1 Max -> decision rule selects
Candidate B, CMAC with cached prefix state) and the corrected §3 aarch64
AES assumption.

Part of the ORE v2 program (docs/plans/2026-06-12-ore-v2-architecture.md, PR 3).