ORE v2 (3/n): efficient unary encoding + hardware AES on aarch64 #80
Draft
coderdan wants to merge 3 commits into
coderdan added a commit that referenced this pull request on Jun 16, 2026
…dead code)

Code-review follow-ups on the efficient-encoding PR (no wire change; compat vectors remain byte-identical):

- Extract pack_bits_lsb_first (primitives.rs) as the single home of the right-ciphertext LSB-first bit-packing convention, shared by Hash::hash_all_into and Prp::indicator_mask_xor (previously duplicated in both). It carries a real assert!(out.len()*8 == src.len()), replacing both debug_asserts and the hardcoded 256 in indicator_mask_xor, so the chunks_exact(8) remainder can no longer be dropped silently in release.
- Remove the dead RightBitVec::set_bit (trait method + impl); the bulk encoder writes via as_mut_bytes and the only set_bit callers are tests hitting the inherent RightBlock32::set_bit. Fix the dangling [Self::set_bit] doc link.
- Correct the stale Aes128Prng comment (it buffers and regenerates 256 bytes; it does not panic).

The aes_armv8 RUSTFLAGS footgun is tracked separately for the aes 0.8->0.9 upgrade (#86), which removes the cfg entirely.
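The LSB-first packing helper described in the first bullet can be sketched as follows. This is a minimal illustration, not the code in primitives.rs: only the name pack_bits_lsb_first, the length assert, and the use of chunks_exact(8) come from the commit message, while the &[bool] source type and the exact signature are assumptions.

```rust
/// Packs `src` into `out`, least-significant bit first within each byte:
/// src[0] lands in bit 0 of out[0], src[8] in bit 0 of out[1], and so on.
///
/// ASSUMPTION: the real helper's signature and source type are not shown in
/// the PR text; `&[bool]` is chosen here purely for illustration.
pub fn pack_bits_lsb_first(src: &[bool], out: &mut [u8]) {
    // A real assert (not debug_assert) so that a length mismatch fails in
    // release builds too, instead of chunks_exact(8) silently dropping the
    // trailing remainder.
    assert!(
        out.len() * 8 == src.len(),
        "bit source must fill the output exactly"
    );
    for (byte, chunk) in out.iter_mut().zip(src.chunks_exact(8)) {
        let mut packed = 0u8;
        for (i, &bit) in chunk.iter().enumerate() {
            packed |= (bit as u8) << i;
        }
        *byte = packed;
    }
}
```

With a debug_assert in place of the assert, a 9-bit source and a 1-byte output would pack the first 8 bits and discard the ninth in a release build; the hard assert turns that into a panic.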
… on aarch64

All changes are byte-identical to the previous implementation, verified by the PR 1 compatibility vectors:

- Right-block encoding is now two bulk passes: hash bits packed straight into the block bitvector (hash_all_into, no Vec), then the PRP XORs its indicator mask over the top in one linear walk of its inverse table (indicator_mask_xor), replacing DOMAIN per-bit invert lookups and branchy bit-sets per block. A quickcheck test pins mask equivalence against the per-bit reference.
- The nonce-keyed hasher is constructed once per encryption instead of once per block (the key schedule was being rebuilt N times).
- PRNG pointer flattened (one branch per byte instead of three), exactly preserving the historical byte stream including the skip-byte-0-after-regeneration quirk, which is load-bearing for ciphertext bytes.
- The Knuth shuffle skips its degenerate i=0 iteration (always swap(0,0), reached only after an expected 256 rejection-sampled draws; the RNG is dropped immediately after, so the skip is unobservable).
- .cargo/config.toml enables ARMv8 hardware AES for workspace builds: aes v0.8 requires --cfg aes_armv8 on aarch64 and falls back to software AES (~60x slower per block on M1 Max) without it. README updated for downstream builds (stable Rust suffices; the old nightly advice was outdated).

Plan updates riding along: §5(b) benchmark gate result (key expansion ~84% of per-block work at Bit6 width on M1 Max -> decision rule selects Candidate B, CMAC with cached prefix state) and the corrected §3 aarch64 AES assumption.

Part of the ORE v2 program (docs/plans/2026-06-12-ore-v2-architecture.md, PR 3).
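The equivalence between the per-bit reference and the two bulk passes in the first bullet can be illustrated with a small sketch. Everything here is hypothetical except the overall shape (pack the hash bits, then XOR an indicator mask built from one walk of the inverse table): the function names, the [bool; 256] hash-bit input, and in particular the indicator predicate are assumptions, since the PR text does not show the scheme's actual comparison.

```rust
const DOMAIN: usize = 256;

/// HYPOTHETICAL indicator predicate; the real scheme's comparison is not
/// shown in the PR text. Any fixed predicate works for the equivalence.
fn indicator(preimage: u8, y: u8) -> bool {
    preimage < y
}

/// Per-bit reference: one inverse lookup and one conditional bit-set per
/// position in the domain.
fn encode_per_bit(inverse: &[u8; DOMAIN], hash_bits: &[bool; DOMAIN], y: u8) -> [u8; DOMAIN / 8] {
    let mut block = [0u8; DOMAIN / 8];
    for j in 0..DOMAIN {
        if hash_bits[j] ^ indicator(inverse[j], y) {
            block[j / 8] |= 1 << (j % 8); // LSB-first within each byte
        }
    }
    block
}

/// Bulk version: pass 1 packs the hash bits, pass 2 XORs the indicator mask
/// over the top while walking the inverse table linearly.
fn encode_bulk(inverse: &[u8; DOMAIN], hash_bits: &[bool; DOMAIN], y: u8) -> [u8; DOMAIN / 8] {
    let mut block = [0u8; DOMAIN / 8];
    // Pass 1: hash bits straight into the block, LSB-first.
    for (byte, chunk) in block.iter_mut().zip(hash_bits.chunks_exact(8)) {
        for (i, &bit) in chunk.iter().enumerate() {
            *byte |= (bit as u8) << i;
        }
    }
    // Pass 2: build one mask byte per 8 inverse-table entries and XOR it in.
    for (byte, chunk) in block.iter_mut().zip(inverse.chunks_exact(8)) {
        let mut mask = 0u8;
        for (i, &preimage) in chunk.iter().enumerate() {
            mask |= (indicator(preimage, y) as u8) << i;
        }
        *byte ^= mask;
    }
    block
}
```

Both functions compute hash_bits[j] XOR indicator(inverse[j], y) for every position j, so they agree bit for bit; the bulk form simply replaces per-position branching with byte-wide ORs and XORs. A property test comparing the two over random inputs is the kind of check the commit message attributes to quickcheck.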
Stacked on #79. Plan §2 (docs/plans/2026-06-12-ore-v2-architecture.md).

Headline numbers (Apple M1 Max, u64 encrypt)

Full matrix in docs/benchmarks/2026-06-13-pr3-results.md.

What
All byte-identical: the 12 compatibility vectors pass unchanged.

- aes v0.8 requires --cfg aes_armv8 and was silently running software AES (~60× slower per block) on every default ARM build. The workspace .cargo/config.toml sets it; the README replaces the outdated nightly advice (stable ≥1.61 suffices). Downstream crates need the cfg in their own builds, which is worth an announcement when this releases.
- Right-block encoding is now two bulk passes: hash bits packed straight into the block bitvector (hash_all_into, no Vec), then the indicator mask XORed over the top in one linear pass of the PRP's inverse table (indicator_mask_xor), replacing 256 per-bit invert lookups + branchy bit-sets per block. Equivalence vs the per-bit reference is pinned by quickcheck.

What's now dominant

PRP construction: ~2.2 µs/block, ~70% of encrypt under hardware AES, and wire-frozen for this scheme. That elevates the plan's open question 1 (constant-time small-domain PRP for new schemes); see the PR 5 notes.
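For downstream builds, the cfg mentioned under "What" has to be set by the consuming crate or workspace. A minimal sketch of one way to do that follows; the contents of this repository's own .cargo/config.toml are not shown in the PR, so the table name and layout below are an assumption rather than a copy of it.

```toml
# .cargo/config.toml at the root of the downstream crate or workspace.
# ASSUMPTION: illustrative layout, not the file from this PR.
# Scopes the flag to aarch64 targets so other architectures are unaffected.
[target.'cfg(target_arch = "aarch64")']
rustflags = ["--cfg", "aes_armv8"]
```

Cargo takes rustflags from a single source, and the RUSTFLAGS environment variable takes precedence over config files, so a build that exports RUSTFLAGS for another purpose would need to include --cfg aes_armv8 there as well.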