WIP: ENH: Algorithm dispatch for OOC filter optimizations#1545

Draft
joeykleingers wants to merge 7 commits into BlueQuartzSoftware:develop from joeykleingers:worktree-identify-sample-optimizations

Conversation

Contributor

@joeykleingers joeykleingers commented Feb 24, 2026

Summary

  • Add reusable AlgorithmDispatch.hpp utility (IsOutOfCore, AnyOutOfCore, DispatchAlgorithm) for dispatching between in-core and OOC algorithm implementations at runtime
  • Add ForceOocAlgorithm() static flag and ForceOocAlgorithmGuard RAII class so unit tests can exercise the OOC algorithm path even in in-core builds
  • Tests use Catch2 GENERATE(false, true) with the guard to run both algorithm paths automatically
  • Add documentation (docs/AlgorithmDispatch.md) explaining when and how to use algorithm dispatch
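The dispatch and test-override mechanism summarized above can be pictured roughly as follows. This is a minimal sketch, not the contents of AlgorithmDispatch.hpp: the `ArrayHandle` type and all signatures are assumptions made for illustration, and only the names `ForceOocAlgorithm`, `ForceOocAlgorithmGuard`, `AnyOutOfCore`, and `DispatchAlgorithm` come from the PR.

```cpp
#include <cassert>
#include <initializer_list>
#include <utility>

// Hypothetical stand-in for an array handle; the real simplnx types differ.
struct ArrayHandle
{
  bool outOfCore = false; // true when backed by chunked (out-of-core) storage
};

// Process-wide test override, modeled on the ForceOocAlgorithm() flag named in the PR.
inline bool& ForceOocAlgorithm()
{
  static bool s_Force = false;
  return s_Force;
}

// RAII guard: sets the override for one scope and restores the previous value on exit.
class ForceOocAlgorithmGuard
{
public:
  explicit ForceOocAlgorithmGuard(bool force)
  : m_Previous(ForceOocAlgorithm())
  {
    ForceOocAlgorithm() = force;
  }
  ~ForceOocAlgorithmGuard()
  {
    ForceOocAlgorithm() = m_Previous;
  }
  ForceOocAlgorithmGuard(const ForceOocAlgorithmGuard&) = delete;
  ForceOocAlgorithmGuard& operator=(const ForceOocAlgorithmGuard&) = delete;

private:
  bool m_Previous;
};

// True if at least one of the given arrays is out-of-core.
inline bool AnyOutOfCore(std::initializer_list<const ArrayHandle*> arrays)
{
  for(const ArrayHandle* array : arrays)
  {
    assert(array != nullptr);
    if(array->outOfCore)
    {
      return true;
    }
  }
  return false;
}

// Runs oocAlgo when any array is out-of-core or the override is set; otherwise runs inCoreAlgo.
// Both callables must return the same type.
template <typename InCoreFn, typename OocFn>
auto DispatchAlgorithm(std::initializer_list<const ArrayHandle*> arrays, InCoreFn&& inCoreAlgo, OocFn&& oocAlgo)
{
  if(ForceOocAlgorithm() || AnyOutOfCore(arrays))
  {
    return std::forward<OocFn>(oocAlgo)();
  }
  return std::forward<InCoreFn>(inCoreAlgo)();
}
```

In a Catch2 test, the pattern the summary describes would presumably look like `const bool forceOoc = GENERATE(false, true); ForceOocAlgorithmGuard guard(forceOoc);` at the top of the test case, so the same assertions run once per algorithm path.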

IdentifySample

  • Split into BFS flood fill (in-core, 1 bit/voxel) and scanline CCL with union-find (OOC, chunk-sequential)
  • CCL uses 2-slice rolling buffer (O(slice) memory) with deterministic replay pass to avoid O(N) label storage
| Config | Before (BFS) | After | Speedup |
|---|---|---|---|
| In-core (200³) | 0.23s | 0.16s (BFS) | 1.4x |
| OOC (200³) | 841s | 4.14s (CCL) | 203x |

BadDataNeighborOrientationCheck

  • Split into Worklist (in-core, std::deque for eligible voxels) and Scanline (OOC, chunk-sequential multi-pass with chunk-skip optimization)
  • Both share Phase 1 (chunk-sequential neighbor counting) and a single neighborCount vector (4 bytes/voxel) with no additional large allocations
| Config | Before | After | Speedup |
|---|---|---|---|
| In-core (200³) | 1.78s | 0.35s | 5x |
| OOC (200³) | 97.1s | 53.6s | 1.8x |
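The worklist idea behind the in-core path can be illustrated with a deliberately simplified model. The sketch below is an assumption-heavy toy, not the filter's code: voxels sit on a 1D line, the orientation comparison is omitted, and `FlipWithWorklist` and `required` are hypothetical names. It only shows how a neighbor count plus a `std::deque` avoids rescanning the whole volume after each flip.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Simplified model: a voxel's neighbors are index-1 and index+1, and a bad voxel is
// flipped to good once at least `required` of its neighbors are good.
// The real filter also checks orientations; that test is left out here.
// Returns the number of voxels flipped.
inline std::size_t FlipWithWorklist(std::vector<bool>& good, int32_t required)
{
  assert(required >= 1);
  const std::size_t n = good.size();
  std::vector<int32_t> neighborCount(n, 0);

  // Phase 1: one sequential pass counting the good neighbors of every bad voxel.
  for(std::size_t i = 0; i < n; i++)
  {
    if(good[i])
    {
      continue;
    }
    if(i > 0 && good[i - 1])
    {
      neighborCount[i]++;
    }
    if(i + 1 < n && good[i + 1])
    {
      neighborCount[i]++;
    }
  }

  // Phase 2: seed the worklist with the voxels that are already eligible.
  std::deque<std::size_t> worklist;
  for(std::size_t i = 0; i < n; i++)
  {
    if(!good[i] && neighborCount[i] >= required)
    {
      worklist.push_back(i);
    }
  }

  std::size_t flipped = 0;
  while(!worklist.empty())
  {
    const std::size_t i = worklist.front();
    worklist.pop_front();
    if(good[i])
    {
      continue; // defensive: already flipped
    }
    good[i] = true;
    flipped++;

    // Flipping i raises the count of its bad neighbors; enqueue any that just became eligible.
    // For i == 0, i - 1 wraps around and is rejected by the bounds check below.
    const std::size_t neighbors[2] = {i - 1, i + 1};
    for(std::size_t nb : neighbors)
    {
      if(nb >= n || good[nb])
      {
        continue;
      }
      neighborCount[nb]++;
      if(neighborCount[nb] == required)
      {
        worklist.push_back(nb);
      }
    }
  }
  return flipped;
}
```

Each voxel enters the queue at most once, so Phase 2 touches only eligible voxels and their neighbors. That access pattern is random, which suits in-memory arrays and is presumably why the OOC path uses sequential scans instead.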

FillBadData

  • Split into BFS (in-core, original algorithm with O(N) neighbors vector) and CCL (OOC, chunk-sequential with on-disk deferred fill)
  • CCL algorithm eliminates all O(N) memory allocations:
    • Phase 1–3 (feature boundary classification): Scanline CCL with rolling 2-slice buffer (O(slice) memory) replaces BFS flood fill
    • Phase 4 (iterative morphological fill): On-disk deferred fill using std::tmpfile() replaces O(N) neighbors vector. Voting pass writes (dest, src) pairs to temp file (featureIds read-only), apply pass reads pairs back sequentially.
  • OOC runtime is comparable to BFS baseline despite dramatically lower RAM usage — disk I/O dominates both paths
| Config | Before (BFS) | After (CCL) | Notes |
|---|---|---|---|
| In-core (200³) | 0.18s | 0.28s | CCL path adds ~0.1s overhead; BFS still used for in-core |
| OOC (200³) | 6.02s | 6.05s | Equivalent speed, O(slice) RAM instead of O(N) |

SegmentFeatures (Scalar, EBSD, CAxis)

  • Optimize executeCCL() in the shared SegmentFeatures base class with a 2-slice rolling buffer for provisional labels (O(slice) memory instead of O(N))
  • CCL uses chunk-sequential forward scan with backward neighbor checks, union-find for equivalence tracking, and chunk-sequential relabeling
  • DFS (execute()) is unchanged and still used for in-core; CCL dispatched automatically for OOC data
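| Filter | Config | Before (DFS) | After (CCL) | Speedup |
|---|---|---|---|---|
| ScalarSegmentFeatures | In-core (200³) | 0.36s | 0.23s | 1.6x |
| ScalarSegmentFeatures | OOC (200³) | >1500s (timeout) | 12.9s | >115x |
| EBSDSegmentFeatures | In-core (200³) | 0.77s | 0.62s | 1.2x |
| EBSDSegmentFeatures | OOC (200³) | >1500s (timeout) | 35.9s | >42x |
| CAxisSegmentFeatures | In-core (200³) | 0.60s | 0.55s | ~1.1x |
| CAxisSegmentFeatures | OOC (200³) | >1500s (timeout) | 32.8s | >46x |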
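The scanline CCL scheme described for SegmentFeatures (forward scan, backward neighbor checks, union-find, sequential relabel) can be sketched on a plain binary mask. This is an illustrative sketch under stated assumptions, not `executeCCL()`: `LabelSets` and `LabelComponents` are hypothetical names, connectivity is fixed to 6 neighbors, the grouping criterion is reduced to "both voxels are foreground", and the `labels` vector stands in for the on-disk output store.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Minimal union-find over provisional labels (path halving, smaller root wins).
struct LabelSets
{
  std::vector<int32_t> parent{0}; // label 0 is reserved for background

  int32_t MakeLabel()
  {
    parent.push_back(static_cast<int32_t>(parent.size()));
    return parent.back();
  }
  int32_t Find(int32_t x)
  {
    while(parent[x] != x)
    {
      parent[x] = parent[parent[x]];
      x = parent[x];
    }
    return x;
  }
  void Unite(int32_t a, int32_t b)
  {
    a = Find(a);
    b = Find(b);
    if(a != b)
    {
      parent[std::max(a, b)] = std::min(a, b);
    }
  }
};

// Labels the 6-connected foreground components of a dimX*dimY*dimZ mask (x varies fastest).
// Returns the component count; `labels` receives ids 1..count, with 0 for background.
inline int32_t LabelComponents(const std::vector<uint8_t>& mask, std::size_t dimX, std::size_t dimY, std::size_t dimZ, std::vector<int32_t>& labels)
{
  assert(mask.size() == dimX * dimY * dimZ);
  const std::size_t sliceSize = dimX * dimY;
  labels.assign(mask.size(), 0);
  LabelSets sets;

  // Rolling buffer: provisional labels of the previous and current Z slice only.
  std::vector<int32_t> prev(sliceSize, 0);
  std::vector<int32_t> curr(sliceSize, 0);

  // Pass 1: forward scan that inspects only the already visited -x, -y, -z neighbors.
  for(std::size_t z = 0; z < dimZ; z++)
  {
    for(std::size_t y = 0; y < dimY; y++)
    {
      for(std::size_t x = 0; x < dimX; x++)
      {
        const std::size_t inSlice = y * dimX + x;
        int32_t label = 0;
        if(mask[z * sliceSize + inSlice] != 0)
        {
          const int32_t candidates[3] = {x > 0 ? curr[inSlice - 1] : 0, y > 0 ? curr[inSlice - dimX] : 0, z > 0 ? prev[inSlice] : 0};
          for(int32_t candidate : candidates)
          {
            if(candidate == 0)
            {
              continue;
            }
            if(label == 0)
            {
              label = candidate;
            }
            else
            {
              sets.Unite(label, candidate);
            }
          }
          if(label == 0)
          {
            label = sets.MakeLabel();
          }
        }
        curr[inSlice] = label;
        labels[z * sliceSize + inSlice] = label; // sequential write of the provisional label
      }
    }
    std::swap(prev, curr);
  }

  // Pass 2: sequential relabel that maps every root to a compact final id.
  std::vector<int32_t> finalId(sets.parent.size(), 0);
  int32_t count = 0;
  for(int32_t& label : labels)
  {
    if(label == 0)
    {
      continue;
    }
    const int32_t root = sets.Find(label);
    if(finalId[root] == 0)
    {
      finalId[root] = ++count;
    }
    label = finalId[root];
  }
  return count;
}
```

Neighbor lookups read only `prev` and `curr`, so the scan never reaches back into the large array. Working memory is two slices plus the union-find table, which matches the O(slice) + O(features) figure quoted in this PR.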

Test plan

  • IdentifySample: All 177 correctness assertions pass (both algorithm paths) on in-core and OOC
  • BadDataNeighborOrientationCheck: All 28 tests pass (both algorithm paths) on in-core and OOC
  • FillBadData: All 11 test cases pass (both algorithm paths) on in-core and OOC
  • SegmentFeatures: All 13 tests pass (Scalar, EBSD, CAxis — both algorithm paths) on in-core and OOC
  • Benchmark tests pass on both configurations
  • OOC tests confirm ZarrStore usage via chunk shape output

@joeykleingers joeykleingers added the enhancement New feature or request label Feb 24, 2026
@joeykleingers joeykleingers force-pushed the worktree-identify-sample-optimizations branch from d801a0a to 1bda2d9 Compare February 24, 2026 18:43
Split IdentifySample into two algorithm implementations selected at
runtime based on whether arrays use out-of-core (chunked) storage:

- IdentifySampleBFS: BFS flood fill optimized for in-core data (1 bit/voxel)
- IdentifySampleCCL: Scanline CCL with union-find optimized for OOC data

Add reusable AlgorithmDispatch.hpp utility (IsOutOfCore, AnyOutOfCore,
DispatchAlgorithm) so other filters can adopt the same pattern.

In-core benchmark: 0.17s (was 0.23s, 1.4x faster)
OOC benchmark: 1.9s (was 841s, 475x faster)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@joeykleingers joeykleingers force-pushed the worktree-identify-sample-optimizations branch from 1bda2d9 to c88c7f6 Compare February 24, 2026 18:43
Contributor

@imikejackson imikejackson left a comment

As we discussed, there needs to be a clear way to exercise both code paths even when out-of-core is not being built. This PR should not be merged until this is figured out.

@joeykleingers joeykleingers force-pushed the worktree-identify-sample-optimizations branch from c88c7f6 to fb7dd1b Compare February 27, 2026 16:08
…re builds

DispatchAlgorithm now checks a static ForceOocAlgorithm() flag in addition
to array storage type. Tests use ForceOocAlgorithmGuard with Catch2 GENERATE
to exercise both BFS and CCL paths regardless of build configuration.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
@joeykleingers joeykleingers force-pushed the worktree-identify-sample-optimizations branch from fb7dd1b to 8e79b9c Compare February 28, 2026 18:42
@joeykleingers joeykleingers force-pushed the worktree-identify-sample-optimizations branch from 8e79b9c to fd57058 Compare February 28, 2026 18:43
@joeykleingers joeykleingers changed the title ENH: Add algorithm dispatch for IdentifySample OOC optimization WIP: ENH: Add algorithm dispatch for IdentifySample OOC optimization Feb 28, 2026
…ithms

Split BadDataNeighborOrientationCheck into two dispatched algorithms
using DispatchAlgorithm for optimal performance in both configurations:

- Worklist (in-core): Uses std::deque worklist for Phase 2 to process
  only eligible voxels with fast random access. ~5x speedup vs original.

- Scanline (OOC): Uses chunk-sequential multi-pass scans for Phase 2
  to avoid random access chunk thrashing. Includes chunk-skip
  optimization that checks in-memory neighborCount before loading
  chunks, skipping those with no eligible voxels. ~1.8x speedup vs
  original.

Both algorithms share Phase 1 (chunk-sequential neighbor counting)
and use only a single neighborCount vector (4 bytes/voxel) with no
additional large allocations.

Updated tests with GENERATE + ForceOocAlgorithmGuard to exercise both
algorithm paths in in-core builds. Added 200x200x200 benchmark test.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
@joeykleingers joeykleingers force-pushed the worktree-identify-sample-optimizations branch from fd57058 to 4e6e1f6 Compare February 28, 2026 19:57
@joeykleingers joeykleingers marked this pull request as draft February 28, 2026 20:02
@joeykleingers joeykleingers changed the title WIP: ENH: Add algorithm dispatch for IdentifySample OOC optimization WIP: ENH: Algorithm dispatch for OOC filter optimizations Feb 28, 2026
@joeykleingers joeykleingers force-pushed the worktree-identify-sample-optimizations branch from 4e6e1f6 to 9004622 Compare February 28, 2026 20:03
…nFind

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Split FillBadData into two algorithm implementations dispatched at
runtime based on whether data is in-memory or disk-backed:

- FillBadDataBFS: In-core algorithm preserving the original BFS
  flood-fill approach for maximum in-memory performance.

- FillBadDataCCL: OOC-optimized algorithm eliminating all O(N) memory
  allocations that made the previous implementation (d4fec00) unable to
  handle datasets larger than available RAM.

Changes from the baseline OOC algorithm (d4fec00):

Phase 1 (CCL): Replaced the O(N) unordered_map<usize, int64>
provisional labels buffer and hash-map-based ChunkAwareUnionFind with a
2-slice rolling buffer (O(slice) memory) that reads backward neighbors
from an in-memory buffer instead of the OOC featureIdsStore. Provisional
labels are written directly into featureIdsStore using positive labels
starting at numFeatures+1 to avoid collision with the -1 fill sentinel.
Includes a lastClearedZ fix for Z-slices spanning multiple chunks.

Phase 3 (Classification): Replaced unordered_map/unordered_set lookups
with a dense vector<int8> indexed by label for O(1) classification.

Phase 4 (Iterative Fill): Replaced the O(N) neighbors vector with an
on-disk deferred fill using std::tmpfile(). Pass 1 scans voxels in chunk
order and writes (dest, src) pairs to a temp file. Pass 2 reads pairs
back and applies fills. This preserves the two-pass semantics (all
voting before any fills) without any per-voxel memory allocation. Also
changed Pass 1 from linear iteration to chunk-sequential iteration for
better OOC access patterns.
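
The two-pass deferred fill can be sketched as below. This is a minimal illustration under assumptions, not the FillBadDataCCL code: the data is a 1D line, the "vote" is reduced to taking the left neighbor if it is valid and otherwise the right one, and `FillPair` and `DeferredFillIteration` are hypothetical names. Only the use of `std::tmpfile()` and the (dest, src) pair layout come from the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

struct FillPair
{
  int64_t dest; // index of the voxel to fill
  int64_t src;  // index of the neighbor whose feature id it takes
};

// Runs one fill iteration over a line of feature ids, where -1 marks a voxel to fill
// and ids greater than 0 are valid sources.
// Returns the number of voxels filled, or -1 if the temp file could not be used.
inline int64_t DeferredFillIteration(std::vector<int32_t>& featureIds)
{
  std::FILE* file = std::tmpfile();
  if(file == nullptr)
  {
    return -1;
  }

  const int64_t n = static_cast<int64_t>(featureIds.size());
  int64_t pairCount = 0;
  bool ok = true;

  // Pass 1 (vote): featureIds is only read here; every decision goes to the temp file.
  for(int64_t i = 0; i < n && ok; i++)
  {
    if(featureIds[i] != -1)
    {
      continue;
    }
    int64_t src = -1;
    if(i > 0 && featureIds[i - 1] > 0)
    {
      src = i - 1;
    }
    else if(i + 1 < n && featureIds[i + 1] > 0)
    {
      src = i + 1;
    }
    if(src < 0)
    {
      continue; // no valid neighbor yet; a later iteration may fill this voxel
    }
    const FillPair pair{i, src};
    ok = std::fwrite(&pair, sizeof(pair), 1, file) == 1;
    if(ok)
    {
      pairCount++;
    }
  }

  // Pass 2 (apply): rewind and replay the recorded pairs sequentially.
  std::rewind(file);
  int64_t applied = 0;
  FillPair pair{};
  while(ok && applied < pairCount && std::fread(&pair, sizeof(pair), 1, file) == 1)
  {
    assert(pair.dest >= 0 && pair.dest < n && pair.src >= 0 && pair.src < n);
    featureIds[pair.dest] = featureIds[pair.src];
    applied++;
  }
  std::fclose(file); // a std::tmpfile() file is deleted automatically when closed
  return (ok && applied == pairCount) ? applied : -1;
}
```

For the input `{5, -1, -1, -1, 7}`, one iteration fills only the two outer gaps; the middle voxel is filled on the next call. Because no fill is applied until all votes are recorded, a freshly filled voxel cannot influence a vote within the same iteration.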

UnionFind: Uses the shared vector-based UnionFind with path-halving
compression and union-by-rank, replacing the hash-map-based
ChunkAwareUnionFind.
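
For readers unfamiliar with the two techniques named here, a vector-based union-find with path halving and union-by-rank typically looks like the sketch below. It is a generic textbook version with an assumed interface, not the shared UnionFind class from this repository.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

class UnionFind
{
public:
  // Creates `count` singleton sets, one per label 0..count-1.
  explicit UnionFind(std::size_t count)
  : m_Parent(count)
  , m_Rank(count, 0)
  {
    for(std::size_t i = 0; i < count; i++)
    {
      m_Parent[i] = i;
    }
  }

  // Returns the representative of x's set.
  std::size_t Find(std::size_t x)
  {
    assert(x < m_Parent.size());
    while(m_Parent[x] != x)
    {
      m_Parent[x] = m_Parent[m_Parent[x]]; // path halving: point x at its grandparent
      x = m_Parent[x];
    }
    return x;
  }

  // Merges the sets containing a and b; returns false if they were already one set.
  bool Unite(std::size_t a, std::size_t b)
  {
    a = Find(a);
    b = Find(b);
    if(a == b)
    {
      return false;
    }
    // Union by rank: attach the shallower tree under the deeper one.
    if(m_Rank[a] < m_Rank[b])
    {
      std::swap(a, b);
    }
    m_Parent[b] = a;
    if(m_Rank[a] == m_Rank[b])
    {
      m_Rank[a]++;
    }
    return true;
  }

private:
  std::vector<std::size_t> m_Parent;
  std::vector<uint8_t> m_Rank;
};
```

Storing parents in a dense vector gives O(1) indexed access per step, whereas a hash map pays for hashing and pointer chasing on every lookup.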

Peak memory for the OOC path went from ~2*O(N) to O(slice)+O(features),
making it viable for datasets that exceed available RAM (e.g., 1000^3
volumes where the previous O(N) buffers alone would require ~7.5 GB).
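
The ~7.5 GB figure is consistent with one 8-byte value per voxel, as the quick check below shows. The 8-byte entry size is an assumption made here to reproduce the number; the commit message does not state which buffers were counted, and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a buffer holding one entry per voxel of a dimX*dimY*dimZ volume.
inline uint64_t VolumeBufferBytes(uint64_t dimX, uint64_t dimY, uint64_t dimZ, uint64_t bytesPerEntry)
{
  assert(bytesPerEntry > 0);
  return dimX * dimY * dimZ * bytesPerEntry;
}

// Bytes needed for a 2-slice rolling buffer over the same volume.
inline uint64_t RollingBufferBytes(uint64_t dimX, uint64_t dimY, uint64_t bytesPerEntry)
{
  assert(bytesPerEntry > 0);
  return 2 * dimX * dimY * bytesPerEntry;
}

inline double ToGiB(uint64_t bytes)
{
  return static_cast<double>(bytes) / static_cast<double>(1ULL << 30);
}

// Assumed case: 1000^3 voxels with 8-byte entries.
// VolumeBufferBytes(1000, 1000, 1000, 8) is 8,000,000,000 bytes, about 7.45 GiB.
// RollingBufferBytes(1000, 1000, 8) is 16,000,000 bytes, about 15.3 MiB.
```

Under that assumption the rolling buffer is 500 times smaller than a full-volume buffer for a 1000-slice volume, since it holds 2 slices instead of 1000.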

Benchmark (200x200x200, OOC 1-slice chunks):
  Baseline BFS OOC: 6.02s avg
  Optimized CCL OOC: 6.05s avg (equivalent speed, dramatically less RAM)

All 11 FillBadData tests pass on both in-core and OOC configurations.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
@joeykleingers joeykleingers force-pushed the worktree-identify-sample-optimizations branch from 9004622 to 4070458 Compare March 2, 2026 20:15
…g buffer

Refactor executeCCL() to use a chunk-sequential 2-slice rolling buffer
for provisional labels instead of an O(N) vector. Add 200x200x200
benchmark tests for ScalarSegmentFeatures, EBSDSegmentFeatures, and
CAxisSegmentFeatures demonstrating >115x OOC speedup over DFS baseline.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
…y pass

Replace O(N) labels vector with a 2-slice rolling buffer (O(slice) memory)
and a deterministic replay pass to re-derive labels without storing them.
Fix PreferencesSentinel threshold from 100 to 625 bytes (1 Z-slice of
25x25x25 uint8 dataset). Remove unnecessary template parameters from
CCLResult struct.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Labels

enhancement New feature or request Out-of-Core

2 participants