WIP: ENH: Algorithm dispatch for OOC filter optimizations #1545
Draft
joeykleingers wants to merge 7 commits into BlueQuartzSoftware:develop from
Split IdentifySample into two algorithm implementations selected at runtime based on whether arrays use out-of-core (chunked) storage:
- IdentifySampleBFS: BFS flood fill optimized for in-core data (1 bit/voxel)
- IdentifySampleCCL: Scanline CCL with union-find optimized for OOC data

Add reusable AlgorithmDispatch.hpp utility (IsOutOfCore, AnyOutOfCore, DispatchAlgorithm) so other filters can adopt the same pattern.

In-core benchmark: 0.17s (was 0.23s, 1.4x faster)
OOC benchmark: 1.9s (was 841s, 475x faster)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
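For readers unfamiliar with the union-find structure that scanline CCL relies on, here is a minimal vector-based sketch using path halving and union by rank. The class name `UnionFindSketch` and its interface are hypothetical and chosen for illustration only; this is not the UnionFind class used by the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical illustration; not the UnionFind class from this PR.
class UnionFindSketch
{
public:
  explicit UnionFindSketch(std::size_t count)
  : m_Parent(count)
  , m_Rank(count, 0)
  {
    // Every element starts as the root of its own set.
    std::iota(m_Parent.begin(), m_Parent.end(), std::size_t{0});
  }

  std::size_t find(std::size_t x)
  {
    assert(x < m_Parent.size());
    while(m_Parent[x] != x)
    {
      // Path halving: point x at its grandparent, then step to it.
      m_Parent[x] = m_Parent[m_Parent[x]];
      x = m_Parent[x];
    }
    return x;
  }

  void unite(std::size_t a, std::size_t b)
  {
    a = find(a);
    b = find(b);
    if(a == b)
    {
      return;
    }
    // Union by rank: attach the shallower tree under the deeper one.
    if(m_Rank[a] < m_Rank[b])
    {
      std::swap(a, b);
    }
    m_Parent[b] = a;
    if(m_Rank[a] == m_Rank[b])
    {
      ++m_Rank[a];
    }
  }

private:
  std::vector<std::size_t> m_Parent;
  std::vector<std::uint8_t> m_Rank;
};
```

In a scanline pass, each voxel that touches two differently labeled backward neighbors would call `unite` on the two labels, and a final pass would map every provisional label through `find`.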
imikejackson (Contributor) requested changes on Feb 24, 2026 and left a comment:
As we discussed, there needs to be a clear path to exercise both code paths even when out-of-core is not being built. This PR should not be merged until this is figured out.
…re builds

DispatchAlgorithm now checks a static ForceOocAlgorithm() flag in addition to array storage type. Tests use ForceOocAlgorithmGuard with Catch2 GENERATE to exercise both BFS and CCL paths regardless of build configuration.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
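The dispatch-plus-override pattern described in this commit can be sketched roughly as follows. Every name here (`ForceOocFlag`, `ForceOocGuardSketch`, `ArraySketch`, `AnyOutOfCoreSketch`, `DispatchSketch`) is a hypothetical stand-in, and the real declarations in AlgorithmDispatch.hpp may look quite different. The sketch assumes C++17 for the fold expression.

```cpp
#include <cassert>
#include <utility>

// All names below are hypothetical stand-ins, not the declarations in AlgorithmDispatch.hpp.

// Process-wide override flag, stored as a function-local static.
inline bool& ForceOocFlag()
{
  static bool s_Force = false;
  return s_Force;
}

// RAII guard: sets the flag on construction, restores the previous value on destruction.
class ForceOocGuardSketch
{
public:
  explicit ForceOocGuardSketch(bool force)
  : m_Previous(ForceOocFlag())
  {
    ForceOocFlag() = force;
  }
  ~ForceOocGuardSketch()
  {
    ForceOocFlag() = m_Previous;
  }
  ForceOocGuardSketch(const ForceOocGuardSketch&) = delete;
  ForceOocGuardSketch& operator=(const ForceOocGuardSketch&) = delete;

private:
  bool m_Previous;
};

// Stand-in for a data array that knows whether its storage is chunked on disk.
struct ArraySketch
{
  bool outOfCore = false;
};

// True if at least one of the given arrays is out-of-core.
template <class... ArraysT>
bool AnyOutOfCoreSketch(const ArraysT&... arrays)
{
  return (false || ... || arrays.outOfCore);
}

// Runs the OOC callable when forced or when any array is out-of-core, else the in-core one.
template <class InCoreFn, class OocFn, class... ArraysT>
auto DispatchSketch(InCoreFn&& inCore, OocFn&& ooc, const ArraysT&... arrays)
{
  if(ForceOocFlag() || AnyOutOfCoreSketch(arrays...))
  {
    return std::forward<OocFn>(ooc)();
  }
  return std::forward<InCoreFn>(inCore)();
}
```

In a Catch2 test, a value produced by `GENERATE(false, true)` could be passed to the guard's constructor so the same test body runs once per algorithm path, and the flag is restored when the guard leaves scope.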
…ithms

Split BadDataNeighborOrientationCheck into two dispatched algorithms using DispatchAlgorithm for optimal performance in both configurations:
- Worklist (in-core): Uses std::deque worklist for Phase 2 to process only eligible voxels with fast random access. ~5x speedup vs original.
- Scanline (OOC): Uses chunk-sequential multi-pass scans for Phase 2 to avoid random access chunk thrashing. Includes chunk-skip optimization that checks in-memory neighborCount before loading chunks, skipping those with no eligible voxels. ~1.8x speedup vs original.

Both algorithms share Phase 1 (chunk-sequential neighbor counting) and use only a single neighborCount vector (4 bytes/voxel) with no additional large allocations.

Updated tests with GENERATE + ForceOocAlgorithmGuard to exercise both algorithm paths in in-core builds. Added 200x200x200 benchmark test.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
…nFind

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
Split FillBadData into two algorithm implementations dispatched at runtime based on whether data is in-memory or disk-backed:
- FillBadDataBFS: In-core algorithm preserving the original BFS flood-fill approach for maximum in-memory performance.
- FillBadDataCCL: OOC-optimized algorithm eliminating all O(N) memory allocations that made the previous implementation (d4fec00) unable to handle datasets larger than available RAM.

Changes from the baseline OOC algorithm (d4fec00):

Phase 1 (CCL): Replaced the O(N) unordered_map<usize, int64> provisional labels buffer and hash-map-based ChunkAwareUnionFind with a 2-slice rolling buffer (O(slice) memory) that reads backward neighbors from an in-memory buffer instead of the OOC featureIdsStore. Provisional labels are written directly into featureIdsStore using positive labels starting at numFeatures+1 to avoid collision with the -1 fill sentinel. Includes a lastClearedZ fix for Z-slices spanning multiple chunks.

Phase 3 (Classification): Replaced unordered_map/unordered_set lookups with a dense vector<int8> indexed by label for O(1) classification.

Phase 4 (Iterative Fill): Replaced the O(N) neighbors vector with an on-disk deferred fill using std::tmpfile(). Pass 1 scans voxels in chunk order and writes (dest, src) pairs to a temp file. Pass 2 reads pairs back and applies fills. This preserves the two-pass semantics (all voting before any fills) without any per-voxel memory allocation. Also changed Pass 1 from linear iteration to chunk-sequential iteration for better OOC access patterns.

UnionFind: Uses the shared vector-based UnionFind with path-halving compression and union-by-rank, replacing the hash-map-based ChunkAwareUnionFind.

Peak memory for the OOC path went from ~2*O(N) to O(slice)+O(features), making it viable for datasets that exceed available RAM (e.g., 1000^3 volumes where the previous O(N) buffers alone would require ~7.5 GB).

Benchmark (200x200x200, OOC 1-slice chunks):
  Baseline BFS OOC: 6.02s avg
  Optimized CCL OOC: 6.05s avg (equivalent speed, dramatically less RAM)

All 11 FillBadData tests pass on both in-core and OOC configurations.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
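The two-pass deferred fill from Phase 4 can be illustrated on a toy 1D array. This is only a sketch under simplifying assumptions: the "vote" is reduced to picking the left neighbor, then the right, and the record layout `FillPairSketch` and the function `DeferredFillPassSketch` are hypothetical. The PR's actual voting rule and on-disk format are not shown in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical record layout; the PR's on-disk format is not shown in this thread.
struct FillPairSketch
{
  std::uint64_t dest;
  std::uint64_t src;
};

// One fill iteration over a 1D array where -1 marks a bad voxel.
// Returns false if the temp file cannot be created, written, or read back.
inline bool DeferredFillPassSketch(std::vector<std::int32_t>& featureIds)
{
  std::FILE* file = std::tmpfile();
  if(file == nullptr)
  {
    return false;
  }

  bool ok = true;
  const std::size_t count = featureIds.size();

  // Pass 1 (voting): featureIds is only read, so every vote sees the pre-fill state.
  for(std::size_t i = 0; i < count && ok; i++)
  {
    if(featureIds[i] != -1)
    {
      continue;
    }
    FillPairSketch pair{static_cast<std::uint64_t>(i), static_cast<std::uint64_t>(i)};
    if(i > 0 && featureIds[i - 1] != -1)
    {
      pair.src = static_cast<std::uint64_t>(i - 1); // simplified vote: prefer left neighbor
    }
    else if(i + 1 < count && featureIds[i + 1] != -1)
    {
      pair.src = static_cast<std::uint64_t>(i + 1); // otherwise take the right neighbor
    }
    if(pair.src != pair.dest)
    {
      ok = std::fwrite(&pair, sizeof(pair), 1, file) == 1;
    }
  }

  // Pass 2 (apply): read the pairs back sequentially and copy src into dest.
  if(ok)
  {
    std::rewind(file);
    FillPairSketch pair{};
    while(std::fread(&pair, sizeof(pair), 1, file) == 1)
    {
      featureIds[pair.dest] = featureIds[pair.src];
    }
    ok = std::ferror(file) == 0;
  }

  std::fclose(file); // a file opened with std::tmpfile() is removed when closed
  return ok;
}
```

With the input `{1, -1, -1, 2, -1}`, index 2 is filled from its right neighbor because its left neighbor was still bad during voting. A single in-place pass would instead have propagated the value 1 into index 2, which is the difference the two-pass semantics are meant to preserve.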
…g buffer

Refactor executeCCL() to use a chunk-sequential 2-slice rolling buffer for provisional labels instead of an O(N) vector. Add 200x200x200 benchmark tests for ScalarSegmentFeatures, EBSDSegmentFeatures, and CAxisSegmentFeatures demonstrating >115x OOC speedup over DFS baseline.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
…y pass

Replace O(N) labels vector with a 2-slice rolling buffer (O(slice) memory) and a deterministic replay pass to re-derive labels without storing them. Fix PreferencesSentinel threshold from 100 to 625 bytes (1 Z-slice of 25x25x25 uint8 dataset). Remove unnecessary template parameters from CCLResult struct.

Signed-off-by: Joey Kleingers <joey.kleingers@bluequartz.net>
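To show why two slices of labels are enough for a scanline CCL pass, here is a small self-contained sketch that counts 6-connected components in a binary volume while holding only the current and previous Z-slice of provisional labels. The function name `CountComponentsSketch`, the flat `mask` layout (x fastest, then y, then z), and the simple merge rule are assumptions made for illustration; this is not `executeCCL()` and it omits the replay pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch; not SegmentFeatures::executeCCL().
inline std::size_t CountComponentsSketch(const std::vector<std::uint8_t>& mask, std::size_t nx, std::size_t ny, std::size_t nz)
{
  assert(mask.size() == nx * ny * nz);

  // parent[0] is reserved for "background / no label".
  std::vector<std::size_t> parent{0};
  auto find = [&parent](std::size_t x) {
    while(parent[x] != x)
    {
      parent[x] = parent[parent[x]]; // path halving
      x = parent[x];
    }
    return x;
  };
  auto unite = [&parent, &find](std::size_t a, std::size_t b) {
    a = find(a);
    b = find(b);
    if(a != b)
    {
      parent[std::max(a, b)] = std::min(a, b); // keep the smaller label as root
    }
  };

  const std::size_t sliceSize = nx * ny;
  std::vector<std::size_t> prev(sliceSize, 0); // labels of slice z-1
  std::vector<std::size_t> cur(sliceSize, 0);  // labels of slice z

  for(std::size_t z = 0; z < nz; z++)
  {
    for(std::size_t y = 0; y < ny; y++)
    {
      for(std::size_t x = 0; x < nx; x++)
      {
        const std::size_t idx = y * nx + x;
        cur[idx] = 0; // clear the stale label left over from two slices ago
        if(mask[z * sliceSize + idx] == 0)
        {
          continue;
        }
        // Backward neighbors only: -x and -y in this slice, -z in the previous slice.
        const std::size_t candidates[3] = {x > 0 ? cur[idx - 1] : 0, y > 0 ? cur[idx - nx] : 0, z > 0 ? prev[idx] : 0};
        std::size_t label = 0;
        for(std::size_t candidate : candidates)
        {
          if(candidate == 0)
          {
            continue;
          }
          if(label == 0)
          {
            label = candidate;
          }
          else
          {
            unite(label, candidate);
          }
        }
        if(label == 0)
        {
          label = parent.size(); // new provisional label
          parent.push_back(label);
        }
        cur[idx] = label;
      }
    }
    std::swap(prev, cur); // the finished slice becomes "previous"
  }

  // Each remaining root corresponds to one connected component.
  std::size_t components = 0;
  for(std::size_t i = 1; i < parent.size(); i++)
  {
    if(find(i) == i)
    {
      components++;
    }
  }
  return components;
}
```

Label memory here is O(slice) for the two buffers plus O(labels) for the parent vector. Writing final labels back out would need either stored provisional labels or a second pass that re-derives them.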
Summary
- AlgorithmDispatch.hpp utility (IsOutOfCore, AnyOutOfCore, DispatchAlgorithm) for dispatching between in-core and OOC algorithm implementations at runtime
- ForceOocAlgorithm() static flag and ForceOocAlgorithmGuard RAII class so unit tests can exercise the OOC algorithm path even in in-core builds
- GENERATE(false, true) with the guard to run both algorithm paths automatically
- … (docs/AlgorithmDispatch.md) explaining when and how to use algorithm dispatch

IdentifySample

BadDataNeighborOrientationCheck
- neighborCount vector (4 bytes/voxel) with no additional large allocations

FillBadData
- std::tmpfile() replaces O(N) neighbors vector. Voting pass writes (dest, src) pairs to temp file (featureIds read-only), apply pass reads pairs back sequentially.

SegmentFeatures (Scalar, EBSD, CAxis)
- executeCCL() in the shared SegmentFeatures base class with a 2-slice rolling buffer for provisional labels (O(slice) memory instead of O(N))
- … (execute()) is unchanged and still used for in-core; CCL dispatched automatically for OOC data

Test plan