ORE v2 (3/n): efficient unary encoding + hardware AES on aarch64 by coderdan · Pull Request #80 · cipherstash/ore.rs

coderdan · 2026-06-12T15:12:29Z

Stacked on #79. Plan §2 (docs/plans/2026-06-12-ore-v2-architecture.md).

Headline numbers (Apple M1 Max, u64 encrypt)

Configuration	encrypt-8	vs old default
main, software AES (old default on ARM)	381.1 µs	1.0×
main, hardware AES	39.4 µs	9.7×
this PR, hardware AES (new default)	25.1 µs	15.2×

Full matrix in docs/benchmarks/2026-06-13-pr3-results.md.

What

All changes are byte-identical: the 12 compatibility vectors pass unchanged.

- Hardware AES on aarch64: aes v0.8 requires --cfg aes_armv8 on aarch64; without it, every default ARM build was silently running software AES (~60× slower per block on M1 Max). The workspace .cargo/config.toml now sets the cfg, and the README replaces the outdated nightly advice (stable Rust ≥1.61 suffices). Downstream crates need the cfg in their own builds, which is worth an announcement when this releases.
- Bulk right-block encoding (−36% encrypt under hardware AES): hash LSBs are packed straight into the block bitvector (hash_all_into, no Vec), then the indicator mask is XORed over the top in one linear pass of the PRP's inverse table (indicator_mask_xor). This replaces 256 per-bit invert lookups plus branchy bit-sets per block. Equivalence with the per-bit reference is pinned by a quickcheck test.
- Hasher hoist: the nonce key schedule was rebuilt once per block; it is now built once per encryption.
- PRNG pointer flattening (−18% encrypt-left): one branch per byte instead of three, with the byte stream exactly preserved (including the historical skip-byte-0-on-regeneration quirk, now documented as load-bearing).
- Knuth shuffle skips its degenerate i=0 iteration: that iteration always lands on swap(0,0) after an expected 256 rejection draws, and the RNG is dropped immediately afterwards, so the skip is unobservable and saves a full PRNG regeneration per PRP.
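To make the bulk right-block encoding concrete, here is a minimal sketch of the equivalence between a per-bit indicator pass and a single linear walk of the inverse table. The function names, the toy permutation, and the direction of the comparison (`inverse[j] < y`) are assumptions made for illustration; the crate's actual indicator_mask_xor may differ in signature and predicate.

```rust
const DOMAIN: usize = 256;

/// Per-bit reference: one inverse lookup and one branchy bit flip per domain element.
fn mask_xor_per_bit(inverse: &[u8; DOMAIN], y: u8, block: &mut [u8; DOMAIN / 8]) {
    for j in 0..DOMAIN {
        // Assumed indicator predicate; the real scheme's comparison may differ.
        if inverse[j] < y {
            block[j / 8] ^= 1u8 << (j % 8);
        }
    }
}

/// Bulk variant: one linear walk of the inverse table, eight entries per output byte,
/// packed LSB-first and XORed over whatever the block already holds.
fn mask_xor_bulk(inverse: &[u8; DOMAIN], y: u8, block: &mut [u8; DOMAIN / 8]) {
    for (byte, chunk) in block.iter_mut().zip(inverse.chunks_exact(8)) {
        let mut mask = 0u8;
        for (bit, &v) in chunk.iter().enumerate() {
            mask |= ((v < y) as u8) << bit;
        }
        *byte ^= mask;
    }
}

/// Toy stand-in for a PRP inverse table: an odd multiplier mod 256 is a bijection.
fn toy_inverse_table() -> [u8; DOMAIN] {
    let mut table = [0u8; DOMAIN];
    for (j, slot) in table.iter_mut().enumerate() {
        *slot = (j as u8).wrapping_mul(167).wrapping_add(13);
    }
    table
}

fn main() {
    let inverse = toy_inverse_table();
    for y in [0u8, 1, 42, 255] {
        // Start from non-zero contents, standing in for already-packed hash bits.
        let mut per_bit = [0xA5u8; DOMAIN / 8];
        let mut bulk = per_bit;
        mask_xor_per_bit(&inverse, y, &mut per_bit);
        mask_xor_bulk(&inverse, y, &mut bulk);
        assert_eq!(per_bit, bulk);
    }
    println!("bulk mask matches the per-bit reference");
}
```

The bulk form touches each output byte once and has no data-dependent branch, which is the kind of saving the list above attributes to the new encoder.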

What's now dominant

PRP construction: ~2.2 µs/block, ~70% of encrypt under hardware AES, and wire-frozen for this scheme. That elevates the plan's open question 1 (constant-time small-domain PRP for new schemes) — see the PR 5 notes.

…dead code)

Code-review follow-ups on the efficient-encoding PR (no wire change; compat vectors remain byte-identical):

- Extract pack_bits_lsb_first (primitives.rs) as the single home of the right-ciphertext LSB-first bit-packing convention, shared by Hash::hash_all_into and Prp::indicator_mask_xor (previously duplicated in both). It carries a real assert!(out.len()*8 == src.len()), replacing both debug_asserts and the hardcoded 256 in indicator_mask_xor, so the chunks_exact(8) remainder can no longer be dropped silently in release.
- Remove the dead RightBitVec::set_bit (trait method + impl); the bulk encoder writes via as_mut_bytes and the only set_bit callers are tests hitting the inherent RightBlock32::set_bit. Fix the dangling [Self::set_bit] doc link.
- Correct the stale Aes128Prng comment (it buffers and regenerates 256 bytes; it does not panic).

The aes_armv8 RUSTFLAGS footgun is tracked separately for the aes 0.8->0.9 upgrade (#86), which removes the cfg entirely.
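As an illustration of the LSB-first packing convention and of why the length check is a hard assert, here is a small sketch. The signature (one source byte per bit, overwriting `out`) is a guess for demonstration purposes and is not taken from primitives.rs.

```rust
/// Packs the least-significant bit of each source byte into `out`,
/// with source element 0 landing in bit 0 of `out[0]` (LSB-first).
/// Hypothetical signature; the real helper may take different types.
fn pack_bits_lsb_first(src: &[u8], out: &mut [u8]) {
    // Hard assert rather than debug_assert: chunks_exact(8) silently skips a
    // trailing partial chunk, so a length mismatch would otherwise drop bits
    // in release builds.
    assert!(
        out.len() * 8 == src.len(),
        "src must hold exactly 8 entries per output byte"
    );
    for (byte, chunk) in out.iter_mut().zip(src.chunks_exact(8)) {
        let mut packed = 0u8;
        for (bit, &v) in chunk.iter().enumerate() {
            packed |= (v & 1) << bit;
        }
        *byte = packed;
    }
}

fn main() {
    // First byte: only element 0 set. Second byte: elements 0, 1 and 7 set.
    let bits = [1u8, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1];
    let mut out = [0u8; 2];
    pack_bits_lsb_first(&bits, &mut out);
    assert_eq!(out, [0b0000_0001, 0b1000_0011]);
}
```

Keeping the convention in one function means the hash pass and the mask pass cannot drift apart on bit order.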

… on aarch64

All changes are byte-identical to the previous implementation, verified by the PR 1 compatibility vectors:

- Right-block encoding is now two bulk passes: hash bits packed straight into the block bitvector (hash_all_into, no Vec), then the PRP XORs its indicator mask over the top in one linear walk of its inverse table (indicator_mask_xor). This replaces DOMAIN per-bit invert lookups and branchy bit-sets per block. A quickcheck test pins mask equivalence against the per-bit reference.
- The nonce-keyed hasher is constructed once per encryption instead of once per block (the key schedule was being rebuilt N times).
- PRNG pointer flattened (one branch per byte instead of three), exactly preserving the historical byte stream including the skip-byte-0-after-regeneration quirk, which is load-bearing for ciphertext bytes.
- The Knuth shuffle skips its degenerate i=0 iteration (always swap(0,0), reached only after an expected 256 rejection-sampled draws; the RNG is dropped immediately after, so the skip is unobservable).
- .cargo/config.toml enables ARMv8 hardware AES for workspace builds: aes v0.8 requires --cfg aes_armv8 on aarch64 and falls back to software AES (~60x slower per block on M1 Max) without it. README updated for downstream builds (stable Rust suffices; the old nightly advice was outdated).

Plan updates riding along: §5(b) benchmark gate result (key expansion ~84% of per-block work at Bit6 width on M1 Max -> decision rule selects Candidate B, CMAC with cached prefix state) and the corrected §3 aarch64 AES assumption.

Part of the ORE v2 program (docs/plans/2026-06-12-ore-v2-architecture.md, PR 3).
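The shuffle point can be checked with a toy model. The sketch below assumes a descending Fisher-Yates loop with byte-wise rejection sampling and uses a xorshift generator as a stand-in for the AES-based PRNG; `ToyRng`, `below_or_equal`, and `shuffle` are hypothetical names, not the crate's API. At i = 0 the only accepted draw is 0 (probability 1/256 per byte, hence about 256 draws on average), and the resulting swap(0,0) changes nothing.

```rust
/// Stand-in byte generator (xorshift32). The seed must be non-zero.
struct ToyRng(u32);

impl ToyRng {
    fn next_byte(&mut self) -> u8 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 17;
        self.0 ^= self.0 << 5;
        (self.0 >> 24) as u8
    }

    /// Rejection sampling: draw bytes until one falls in 0..=max.
    fn below_or_equal(&mut self, max: u8) -> u8 {
        loop {
            let b = self.next_byte();
            if b <= max {
                return b;
            }
        }
    }
}

/// Fisher-Yates over 0..256, iterating i from 255 down to `lowest`.
/// `lowest = 0` includes the degenerate final iteration; `lowest = 1` skips it.
fn shuffle(seed: u32, lowest: usize) -> [u8; 256] {
    let mut rng = ToyRng(seed);
    let mut perm = [0u8; 256];
    for (i, p) in perm.iter_mut().enumerate() {
        *p = i as u8;
    }
    for i in (lowest..256).rev() {
        let j = rng.below_or_equal(i as u8) as usize;
        perm.swap(i, j);
    }
    // The generator is dropped here, so extra draws at i = 0 are never observed.
    perm
}

fn main() {
    let full = shuffle(0x1234_5678, 0);
    let skipped = shuffle(0x1234_5678, 1);
    assert_eq!(full, skipped);
    println!("skipping i = 0 leaves the permutation unchanged");
}
```

The skip is only safe because nothing reads from the generator after the shuffle; if the same RNG were reused afterwards, the discarded draws would shift the byte stream.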


coderdan mentioned this pull request Jun 12, 2026

ORE v2 (4/n): NEON + AVX2 SIMD backends #81

Draft

coderdan force-pushed the feat/ore-v2-core-refactor branch from ab17b5a to 7170514 Compare June 12, 2026 15:23

coderdan force-pushed the feat/ore-v2-efficient-encoding branch from eb98f36 to 980ae10 Compare June 12, 2026 15:23

This was referenced Jun 12, 2026

ORE v2 (5/n): 6-bit block scheme + v2 wire format #82

Draft

Upgrade aes 0.8 -> 0.9 to drop the aes_armv8 cfg workaround #86

Open

coderdan added 3 commits June 16, 2026 19:35

docs: PR 3 benchmark results (hardware + software AES matrix)

4c5c47a

coderdan force-pushed the feat/ore-v2-core-refactor branch from 7170514 to 220c498 Compare June 16, 2026 10:02

coderdan force-pushed the feat/ore-v2-efficient-encoding branch from 6c361b4 to 183ba9c Compare June 16, 2026 10:02

coderdan mentioned this pull request Jun 16, 2026

ORE v2: variable-length & string ORE, v2 wire format, SIMD + constant-time hardening (integration branch) #90

Draft

coderdan wants to merge 3 commits into feat/ore-v2-core-refactor from feat/ore-v2-efficient-encoding
