WIP: ENH: Algorithm dispatch for OOC filter optimizations by joeykleingers · Pull Request #1545 · BlueQuartzSoftware/simplnx

joeykleingers · 2026-02-24T18:40:39Z

Summary

Add reusable AlgorithmDispatch.hpp utility (IsOutOfCore, AnyOutOfCore, DispatchAlgorithm) for dispatching between in-core and OOC algorithm implementations at runtime
Add ForceOocAlgorithm() static flag and ForceOocAlgorithmGuard RAII class so unit tests can exercise the OOC algorithm path even in in-core builds
Tests use Catch2 GENERATE(false, true) with the guard to run both algorithm paths automatically
Add documentation (docs/AlgorithmDispatch.md) explaining when and how to use algorithm dispatch
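
The dispatch pattern described above could look roughly like the sketch below. The names IsOutOfCore, AnyOutOfCore, DispatchAlgorithm, ForceOocAlgorithm and ForceOocAlgorithmGuard come from this PR, but every signature here is an assumption, and ArrayStub and ChoosePath are hypothetical stand-ins added only so the example compiles on its own (C++17).

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a data array; only the storage kind matters here.
struct ArrayStub
{
  bool outOfCore = false;
};

// Process-wide override so tests can force the OOC path (signature assumed).
inline bool& ForceOocAlgorithm()
{
  static bool s_Force = false;
  return s_Force;
}

// RAII guard: sets the flag for the enclosing scope and restores the previous value on exit.
class ForceOocAlgorithmGuard
{
public:
  explicit ForceOocAlgorithmGuard(bool force)
  : m_Previous(ForceOocAlgorithm())
  {
    ForceOocAlgorithm() = force;
  }
  ~ForceOocAlgorithmGuard()
  {
    ForceOocAlgorithm() = m_Previous;
  }
  ForceOocAlgorithmGuard(const ForceOocAlgorithmGuard&) = delete;
  ForceOocAlgorithmGuard& operator=(const ForceOocAlgorithmGuard&) = delete;

private:
  bool m_Previous;
};

inline bool IsOutOfCore(const ArrayStub& array)
{
  return array.outOfCore;
}

// True if at least one of the given arrays is disk-backed.
template <typename... Arrays>
bool AnyOutOfCore(const Arrays&... arrays)
{
  return (IsOutOfCore(arrays) || ...);
}

// Runs oocAlgo when any array is out-of-core or the test flag is set; otherwise runs inCoreAlgo.
template <typename InCoreFn, typename OocFn, typename... Arrays>
auto DispatchAlgorithm(InCoreFn&& inCoreAlgo, OocFn&& oocAlgo, const Arrays&... arrays)
{
  if(ForceOocAlgorithm() || AnyOutOfCore(arrays...))
  {
    return std::forward<OocFn>(oocAlgo)();
  }
  return std::forward<InCoreFn>(inCoreAlgo)();
}

// Hypothetical helper: reports which algorithm path would be taken for an array.
inline std::string ChoosePath(const ArrayStub& featureIds)
{
  return DispatchAlgorithm([] { return std::string("BFS"); }, [] { return std::string("CCL"); }, featureIds);
}
```

In a Catch2 test, the guard would presumably be constructed from the generated value, for example `const bool forceOoc = GENERATE(false, true);` followed by `ForceOocAlgorithmGuard guard(forceOoc);`, so the same test body runs once per algorithm path.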

IdentifySample

Split into BFS flood fill (in-core, 1 bit/voxel) and scanline CCL with union-find (OOC, chunk-sequential)
CCL uses 2-slice rolling buffer (O(slice) memory) with deterministic replay pass to avoid O(N) label storage
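
As an illustration of the scanline CCL idea, here is a minimal sketch that keeps only two slices of provisional labels in memory and uses a union-find to merge labels across backward neighbors (-x, -y, -z). It is not the code in this PR: instead of the replay pass, it only tracks per-component voxel counts so the largest component can be reported without storing O(N) labels. All names (UnionFind, CclSummary, SummarizeComponents) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr std::size_t k_NoLabel = static_cast<std::size_t>(-1);

// Vector-based union-find with path halving and union by rank.
// Each root also tracks how many voxels belong to its set.
class UnionFind
{
public:
  std::size_t makeSet()
  {
    m_Parent.push_back(m_Parent.size());
    m_Rank.push_back(0);
    m_Size.push_back(0);
    return m_Parent.size() - 1;
  }
  std::size_t find(std::size_t x)
  {
    while(m_Parent[x] != x)
    {
      m_Parent[x] = m_Parent[m_Parent[x]]; // path halving
      x = m_Parent[x];
    }
    return x;
  }
  std::size_t unite(std::size_t a, std::size_t b)
  {
    a = find(a);
    b = find(b);
    if(a == b) { return a; }
    if(m_Rank[a] < m_Rank[b]) { std::swap(a, b); }
    m_Parent[b] = a;
    m_Size[a] += m_Size[b];
    if(m_Rank[a] == m_Rank[b]) { ++m_Rank[a]; }
    return a;
  }
  void addVoxel(std::size_t label) { ++m_Size[find(label)]; }
  std::size_t setCount() const
  {
    std::size_t count = 0;
    for(std::size_t i = 0; i < m_Parent.size(); i++) { count += (m_Parent[i] == i) ? 1 : 0; }
    return count;
  }
  std::size_t largestSet() const
  {
    std::size_t largest = 0;
    for(std::size_t i = 0; i < m_Parent.size(); i++)
    {
      if(m_Parent[i] == i) { largest = std::max(largest, m_Size[i]); }
    }
    return largest;
  }

private:
  std::vector<std::size_t> m_Parent;
  std::vector<std::uint8_t> m_Rank;
  std::vector<std::size_t> m_Size;
};

struct CclSummary
{
  std::size_t componentCount;
  std::size_t largestComponent;
};

// Scans the mask slice by slice (6-connectivity), holding only two slices of labels.
CclSummary SummarizeComponents(const std::vector<std::uint8_t>& mask, std::size_t dimX, std::size_t dimY, std::size_t dimZ)
{
  assert(mask.size() == dimX * dimY * dimZ);
  const std::size_t sliceSize = dimX * dimY;
  std::vector<std::size_t> prev(sliceSize, k_NoLabel);
  std::vector<std::size_t> curr(sliceSize, k_NoLabel);
  UnionFind uf;
  for(std::size_t z = 0; z < dimZ; z++)
  {
    for(std::size_t y = 0; y < dimY; y++)
    {
      for(std::size_t x = 0; x < dimX; x++)
      {
        const std::size_t idx = y * dimX + x;
        curr[idx] = k_NoLabel; // clear stale data left over from two slices ago
        if(mask[z * sliceSize + idx] == 0) { continue; }
        std::size_t label = k_NoLabel;
        auto merge = [&](std::size_t neighbor) {
          if(neighbor == k_NoLabel) { return; }
          label = (label == k_NoLabel) ? uf.find(neighbor) : uf.unite(label, neighbor);
        };
        if(x > 0) { merge(curr[idx - 1]); }
        if(y > 0) { merge(curr[idx - dimX]); }
        if(z > 0) { merge(prev[idx]); }
        if(label == k_NoLabel) { label = uf.makeSet(); }
        uf.addVoxel(label);
        curr[idx] = label;
      }
    }
    std::swap(prev, curr); // the slice just written becomes the backward-neighbor slice
  }
  return {uf.setCount(), uf.largestSet()};
}
```

The label buffers are O(slice); the union-find grows with the number of provisional labels rather than with the voxel count.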

| Config | Before (BFS) | After | Speedup |
| --- | --- | --- | --- |
| In-core (200³) | 0.23s | 0.16s (BFS) | 1.4x |
| OOC (200³) | 841s | 4.14s (CCL) | 203x |

BadDataNeighborOrientationCheck

Split into Worklist (in-core, std::deque for eligible voxels) and Scanline (OOC, chunk-sequential multi-pass with chunk-skip optimization)
Both share Phase 1 (chunk-sequential neighbor counting) and a single neighborCount vector (4 bytes/voxel) with no additional large allocations
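
The chunk-skip optimization mentioned for the Scanline path can be pictured as a pre-check on the in-memory neighborCount vector before any chunk is loaded. The sketch below assumes a flat voxel index, a fixed number of voxels per chunk, and a simple "count meets threshold" eligibility rule; the actual eligibility test in the filter may involve more state, and ChunksToLoad is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the indices of chunks containing at least one voxel whose neighbor count
// meets the threshold. Chunks not listed can be skipped without touching the disk.
std::vector<std::size_t> ChunksToLoad(const std::vector<std::int32_t>& neighborCount, std::size_t voxelsPerChunk, std::int32_t threshold)
{
  assert(voxelsPerChunk > 0);
  std::vector<std::size_t> chunks;
  std::size_t chunk = 0;
  for(std::size_t begin = 0; begin < neighborCount.size(); begin += voxelsPerChunk, ++chunk)
  {
    const std::size_t end = std::min(begin + voxelsPerChunk, neighborCount.size());
    const auto first = neighborCount.begin() + static_cast<std::ptrdiff_t>(begin);
    const auto last = neighborCount.begin() + static_cast<std::ptrdiff_t>(end);
    const bool hasEligible = std::any_of(first, last, [threshold](std::int32_t count) { return count >= threshold; });
    if(hasEligible)
    {
      chunks.push_back(chunk);
    }
  }
  return chunks;
}
```

Because neighborCount already lives in RAM, this check costs one linear scan per pass and avoids loading chunks that cannot change.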

| Config | Before | After | Speedup |
| --- | --- | --- | --- |
| In-core (200³) | 1.78s | 0.35s | 5x |
| OOC (200³) | 97.1s | 53.6s | 1.8x |

FillBadData

Split into BFS (in-core, original algorithm with O(N) neighbors vector) and CCL (OOC, chunk-sequential with on-disk deferred fill)
CCL algorithm eliminates all O(N) memory allocations:
- Phase 1–3 (feature boundary classification): Scanline CCL with rolling 2-slice buffer (O(slice) memory) replaces BFS flood fill
- Phase 4 (iterative morphological fill): On-disk deferred fill using std::tmpfile() replaces O(N) neighbors vector. Voting pass writes (dest, src) pairs to temp file (featureIds read-only), apply pass reads pairs back sequentially.
OOC runtime is comparable to BFS baseline despite dramatically lower RAM usage — disk I/O dominates both paths
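
To make the deferred-fill idea concrete, here is a minimal sketch of one fill iteration using std::tmpfile(). It assumes a 1D line of voxels, -1 as the fill sentinel, and a simplified "left neighbor, else right neighbor" rule in place of the real voting; FillPair and FillOnePass are hypothetical names, and the PR's actual record layout and error handling are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// One deferred fill: copy the feature id at index src into index dest.
struct FillPair
{
  std::uint64_t dest;
  std::uint64_t src;
};

// Runs one fill iteration. Returns the number of voxels filled, or -1 on I/O failure.
std::int64_t FillOnePass(std::vector<std::int32_t>& featureIds)
{
  std::FILE* tmp = std::tmpfile(); // anonymous temp file, removed automatically when closed
  if(tmp == nullptr)
  {
    return -1;
  }

  // Pass 1 (voting): featureIds is only read; decisions are appended to the temp file.
  const std::size_t count = featureIds.size();
  for(std::size_t i = 0; i < count; i++)
  {
    if(featureIds[i] != -1)
    {
      continue;
    }
    FillPair pair{static_cast<std::uint64_t>(i), static_cast<std::uint64_t>(i)};
    bool found = false;
    if(i > 0 && featureIds[i - 1] > 0)
    {
      pair.src = i - 1;
      found = true;
    }
    else if(i + 1 < count && featureIds[i + 1] > 0)
    {
      pair.src = i + 1;
      found = true;
    }
    if(found && std::fwrite(&pair, sizeof(FillPair), 1, tmp) != 1)
    {
      std::fclose(tmp);
      return -1;
    }
  }

  // Pass 2 (apply): read the pairs back sequentially and apply them.
  std::rewind(tmp);
  std::int64_t filled = 0;
  FillPair pair{};
  while(std::fread(&pair, sizeof(FillPair), 1, tmp) == 1)
  {
    featureIds[pair.dest] = featureIds[pair.src];
    ++filled;
  }
  std::fclose(tmp);
  return filled;
}
```

Since every source voxel already held a valid id during the voting pass and the apply pass only writes to sentinel voxels, no fill made in this iteration can influence another decision from the same iteration. Calling FillOnePass repeatedly until it returns 0 gives the iterative behavior.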

| Config | Before (BFS) | After (CCL) | Notes |
| --- | --- | --- | --- |
| In-core (200³) | 0.18s | 0.28s | CCL path adds ~0.1s overhead; BFS still used for in-core |
| OOC (200³) | 6.02s | 6.05s | Equivalent speed, O(slice) RAM instead of O(N) |

SegmentFeatures (Scalar, EBSD, CAxis)

Optimize executeCCL() in the shared SegmentFeatures base class with a 2-slice rolling buffer for provisional labels (O(slice) memory instead of O(N))
CCL uses chunk-sequential forward scan with backward neighbor checks, union-find for equivalence tracking, and chunk-sequential relabeling
DFS (execute()) is unchanged and still used for in-core; CCL dispatched automatically for OOC data

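| Filter | Config | Before (DFS) | After (CCL) | Speedup |
| --- | --- | --- | --- | --- |
| ScalarSegmentFeatures | In-core (200³) | 0.36s | 0.23s | 1.6x |
| ScalarSegmentFeatures | OOC (200³) | >1500s (timeout) | 12.9s | >115x |
| EBSDSegmentFeatures | In-core (200³) | 0.77s | 0.62s | 1.2x |
| EBSDSegmentFeatures | OOC (200³) | >1500s (timeout) | 35.9s | >41x |
| CAxisSegmentFeatures | In-core (200³) | 0.60s | 0.55s | ~1.1x |
| CAxisSegmentFeatures | OOC (200³) | >1500s (timeout) | 32.8s | >45x |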

Test plan

IdentifySample: All 177 correctness assertions pass (both algorithm paths) on in-core and OOC
BadDataNeighborOrientationCheck: All 28 tests pass (both algorithm paths) on in-core and OOC
FillBadData: All 11 test cases pass (both algorithm paths) on in-core and OOC
SegmentFeatures: All 13 tests pass (Scalar, EBSD, CAxis — both algorithm paths) on in-core and OOC
Benchmark tests pass on both configurations
OOC tests confirm ZarrStore usage via chunk shape output

Split IdentifySample into two algorithm implementations selected at runtime based on whether arrays use out-of-core (chunked) storage:
- IdentifySampleBFS: BFS flood fill optimized for in-core data (1 bit/voxel)
- IdentifySampleCCL: Scanline CCL with union-find optimized for OOC data

Add reusable AlgorithmDispatch.hpp utility (IsOutOfCore, AnyOutOfCore, DispatchAlgorithm) so other filters can adopt the same pattern.

In-core benchmark: 0.17s (was 0.23s, 1.4x faster)
OOC benchmark: 1.9s (was 841s, 475x faster)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

imikejackson

As we discussed, there needs to be a clear way to exercise both code paths even when out-of-core is not being built. This PR should not be merged until this is figured out.

…re builds

DispatchAlgorithm now checks a static ForceOocAlgorithm() flag in addition to array storage type. Tests use ForceOocAlgorithmGuard with Catch2 GENERATE to exercise both BFS and CCL paths regardless of build configuration.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

…ithms

Split BadDataNeighborOrientationCheck into two dispatched algorithms using DispatchAlgorithm for optimal performance in both configurations:
- Worklist (in-core): Uses std::deque worklist for Phase 2 to process only eligible voxels with fast random access. ~5x speedup vs original.
- Scanline (OOC): Uses chunk-sequential multi-pass scans for Phase 2 to avoid random access chunk thrashing. Includes chunk-skip optimization that checks in-memory neighborCount before loading chunks, skipping those with no eligible voxels. ~1.8x speedup vs original.

Both algorithms share Phase 1 (chunk-sequential neighbor counting) and use only a single neighborCount vector (4 bytes/voxel) with no additional large allocations.

Updated tests with GENERATE + ForceOocAlgorithmGuard to exercise both algorithm paths in in-core builds. Added 200x200x200 benchmark test.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

…nFind Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

Split FillBadData into two algorithm implementations dispatched at runtime based on whether data is in-memory or disk-backed:
- FillBadDataBFS: In-core algorithm preserving the original BFS flood-fill approach for maximum in-memory performance.
- FillBadDataCCL: OOC-optimized algorithm eliminating all O(N) memory allocations that made the previous implementation (d4fec00) unable to handle datasets larger than available RAM.

Changes from the baseline OOC algorithm (d4fec00):

Phase 1 (CCL): Replaced the O(N) unordered_map<usize, int64> provisional labels buffer and hash-map-based ChunkAwareUnionFind with a 2-slice rolling buffer (O(slice) memory) that reads backward neighbors from an in-memory buffer instead of the OOC featureIdsStore. Provisional labels are written directly into featureIdsStore using positive labels starting at numFeatures+1 to avoid collision with the -1 fill sentinel. Includes a lastClearedZ fix for Z-slices spanning multiple chunks.

Phase 3 (Classification): Replaced unordered_map/unordered_set lookups with a dense vector<int8> indexed by label for O(1) classification.

Phase 4 (Iterative Fill): Replaced the O(N) neighbors vector with an on-disk deferred fill using std::tmpfile(). Pass 1 scans voxels in chunk order and writes (dest, src) pairs to a temp file. Pass 2 reads pairs back and applies fills. This preserves the two-pass semantics (all voting before any fills) without any per-voxel memory allocation. Also changed Pass 1 from linear iteration to chunk-sequential iteration for better OOC access patterns.

UnionFind: Uses the shared vector-based UnionFind with path-halving compression and union-by-rank, replacing the hash-map-based ChunkAwareUnionFind.

Peak memory for the OOC path went from ~2*O(N) to O(slice)+O(features), making it viable for datasets that exceed available RAM (e.g., 1000^3 volumes where the previous O(N) buffers alone would require ~7.5 GB).

Benchmark (200x200x200, OOC 1-slice chunks):
- Baseline BFS OOC: 6.02s avg
- Optimized CCL OOC: 6.05s avg (equivalent speed, dramatically less RAM)

All 11 FillBadData tests pass on both in-core and OOC configurations.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
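
For a rough sense of scale, the arithmetic behind O(N) versus O(slice) label storage can be written out as below. The 8-byte label width is an assumption (the commit mentions int64 provisional labels), and FullLabelBytes and RollingBufferBytes are hypothetical helpers, not part of the PR.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold one label per voxel for the whole volume.
constexpr std::uint64_t FullLabelBytes(std::uint64_t dimX, std::uint64_t dimY, std::uint64_t dimZ, std::uint64_t bytesPerLabel)
{
  return dimX * dimY * dimZ * bytesPerLabel;
}

// Bytes needed for a 2-slice rolling buffer of labels.
constexpr std::uint64_t RollingBufferBytes(std::uint64_t dimX, std::uint64_t dimY, std::uint64_t bytesPerLabel)
{
  return 2 * dimX * dimY * bytesPerLabel;
}

// 1000^3 volume with 8-byte labels: 8,000,000,000 bytes for a full buffer...
static_assert(FullLabelBytes(1000, 1000, 1000, 8) == 8'000'000'000ULL);
// ...versus 16,000,000 bytes for two slices.
static_assert(RollingBufferBytes(1000, 1000, 8) == 16'000'000ULL);
// 200^3 benchmark volume: 64,000,000 bytes versus 640,000 bytes.
static_assert(FullLabelBytes(200, 200, 200, 8) == 64'000'000ULL);
static_assert(RollingBufferBytes(200, 200, 8) == 640'000ULL);
```

Under these assumptions a single full-volume label buffer at 1000^3 is about 7.45 GiB, while the rolling buffer stays near 16 MB regardless of the Z dimension.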

…g buffer

Refactor executeCCL() to use a chunk-sequential 2-slice rolling buffer for provisional labels instead of an O(N) vector. Add 200x200x200 benchmark tests for ScalarSegmentFeatures, EBSDSegmentFeatures, and CAxisSegmentFeatures demonstrating >115x OOC speedup over DFS baseline.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

…y pass

Replace O(N) labels vector with a 2-slice rolling buffer (O(slice) memory) and a deterministic replay pass to re-derive labels without storing them. Fix PreferencesSentinel threshold from 100 to 625 bytes (1 Z-slice of 25x25x25 uint8 dataset). Remove unnecessary template parameters from CCLResult struct.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

joeykleingers requested a review from imikejackson February 24, 2026 18:40

joeykleingers added the enhancement New feature or request label Feb 24, 2026

joeykleingers force-pushed the worktree-identify-sample-optimizations branch from d801a0a to 1bda2d9 Compare February 24, 2026 18:43

joeykleingers force-pushed the worktree-identify-sample-optimizations branch from 1bda2d9 to c88c7f6 Compare February 24, 2026 18:43

imikejackson requested changes Feb 24, 2026

joeykleingers force-pushed the worktree-identify-sample-optimizations branch from c88c7f6 to fb7dd1b Compare February 27, 2026 16:08

joeykleingers force-pushed the worktree-identify-sample-optimizations branch from fb7dd1b to 8e79b9c Compare February 28, 2026 18:42

joeykleingers requested a review from imikejackson February 28, 2026 18:43

joeykleingers force-pushed the worktree-identify-sample-optimizations branch from 8e79b9c to fd57058 Compare February 28, 2026 18:43

joeykleingers changed the title ~~ENH: Add algorithm dispatch for IdentifySample OOC optimization~~ WIP: ENH: Add algorithm dispatch for IdentifySample OOC optimization Feb 28, 2026

joeykleingers force-pushed the worktree-identify-sample-optimizations branch from fd57058 to 4e6e1f6 Compare February 28, 2026 19:57

joeykleingers marked this pull request as draft February 28, 2026 20:02

joeykleingers changed the title ~~WIP: ENH: Add algorithm dispatch for IdentifySample OOC optimization~~ WIP: ENH: Algorithm dispatch for OOC filter optimizations Feb 28, 2026

joeykleingers force-pushed the worktree-identify-sample-optimizations branch from 4e6e1f6 to 9004622 Compare February 28, 2026 20:03

WIP: Add CCL dispatch for SegmentFeatures + optimize FillBadData Unio…

711b837

…nFind Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>

imikejackson added the Out-of-Core label Mar 2, 2026

joeykleingers force-pushed the worktree-identify-sample-optimizations branch from 9004622 to 4070458 Compare March 2, 2026 20:15

joeykleingers added 2 commits March 3, 2026 09:25

imikejackson mentioned this pull request Mar 3, 2026

ENH: Optimize various filters for out-of-core data access patterns. #1542

Closed

joeykleingers wants to merge 7 commits into BlueQuartzSoftware:develop from joeykleingers:worktree-identify-sample-optimizations
