
Commit c88c7f6

joeykleingers and claude committed

ENH: Add algorithm dispatch for IdentifySample OOC optimization

Split IdentifySample into two algorithm implementations selected at runtime based on whether arrays use out-of-core (chunked) storage:

- IdentifySampleBFS: BFS flood fill optimized for in-core data (1 bit/voxel)
- IdentifySampleCCL: Scanline CCL with union-find optimized for OOC data

Add a reusable AlgorithmDispatch.hpp utility (IsOutOfCore, AnyOutOfCore, DispatchAlgorithm) so other filters can adopt the same pattern.

In-core benchmark: 0.17 s (was 0.23 s, 1.4x faster)
OOC benchmark: 1.9 s (was 841 s, 475x faster)

Co-Authored-By: Claude Opus 4.6 <[email protected]>
1 parent 616fd40 commit c88c7f6

File tree

11 files changed: +1320, −393 lines


CMakeLists.txt

Lines changed: 1 addition & 0 deletions
```diff
@@ -541,6 +541,7 @@ set(SIMPLNX_HDRS
   ${SIMPLNX_SOURCE_DIR}/Utilities/ColorTableUtilities.hpp
   ${SIMPLNX_SOURCE_DIR}/Utilities/FileUtilities.hpp
   ${SIMPLNX_SOURCE_DIR}/Utilities/FilterUtilities.hpp
+  ${SIMPLNX_SOURCE_DIR}/Utilities/AlgorithmDispatch.hpp
   ${SIMPLNX_SOURCE_DIR}/Utilities/GeometryUtilities.hpp
   ${SIMPLNX_SOURCE_DIR}/Utilities/GeometryHelpers.hpp
   ${SIMPLNX_SOURCE_DIR}/Utilities/HistogramUtilities.hpp
```

docs/AlgorithmDispatch.md

Lines changed: 175 additions & 0 deletions
# Algorithm Dispatch for Out-of-Core Optimization

## Overview

Some filter algorithms that perform well on in-memory data become extremely slow when data is stored in disk-backed chunks (out-of-core, or OOC, mode). This document explains why this happens, when a filter needs separate algorithm implementations, and how to use the `DispatchAlgorithm` utility to select between them at runtime.
## Why Out-of-Core Needs Different Algorithms

### In-Core Storage (DataStore)

In-core arrays store data in a single contiguous memory buffer. Any element can be accessed in O(1) time via pointer arithmetic. Algorithms that use random access patterns (BFS flood fill, graph traversal, etc.) work efficiently because every `operator[]` call is a simple memory read.

### Out-of-Core Storage (ZarrStore)

Out-of-core arrays partition data into compressed chunks stored on disk. A fixed-size FIFO cache (6 chunks) keeps recently accessed chunks in memory. When code accesses an element, the system:

1. Determines which chunk contains that element
2. If the chunk is cached, returns the value directly
3. If not cached, evicts the oldest chunk (writing it to disk if dirty) and loads the needed chunk from disk

This means algorithms with random access patterns cause **chunk thrashing**: each random jump may load a new chunk and evict another, turning O(1) memory reads into O(disk) operations. A BFS flood fill that visits neighbors across chunk boundaries can trigger thousands of chunk load/evict cycles, making a sub-second algorithm take 10+ minutes.
### The Solution: Chunk-Sequential Algorithms

Algorithms designed for OOC data process chunks sequentially using the chunk API:

```cpp
for(uint64 chunkIdx = 0; chunkIdx < store.getNumberOfChunks(); chunkIdx++)
{
  store.loadChunk(chunkIdx);
  auto lower = store.getChunkLowerBounds(chunkIdx);
  auto upper = store.getChunkUpperBounds(chunkIdx);
  for(uint64 z = lower[0]; z <= upper[0]; z++)
  {
    for(uint64 y = lower[1]; y <= upper[1]; y++)
    {
      for(uint64 x = lower[2]; x <= upper[2]; x++)
      {
        // process voxel (z, y, x)
      }
    }
  }
}
```

This pattern ensures sequential disk access and keeps the working set within the cache. However, chunk-sequential algorithms often require different data structures (e.g., union-find instead of BFS queues) and may use more auxiliary memory.
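The payoff of this pattern (each chunk loaded exactly once) can be exercised against a toy store. The following is an illustrative mock: `ToyChunkedStore` and `SumSequential` are invented names for this sketch, not the simplnx ZarrStore API.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a chunked store; it only tracks how many chunk
// loads a traversal performs. Not the real simplnx chunk API.
struct ToyChunkedStore
{
  std::vector<std::vector<int>> chunks; // each inner vector is one chunk
  int loads = 0;

  std::size_t getNumberOfChunks() const
  {
    return chunks.size();
  }

  const std::vector<int>& loadChunk(std::size_t idx)
  {
    loads++; // one "disk read" per chunk in a sequential pass
    return chunks[idx];
  }
};

// Sum every element while touching each chunk exactly once.
int64_t SumSequential(ToyChunkedStore& store)
{
  int64_t total = 0;
  for(std::size_t c = 0; c < store.getNumberOfChunks(); c++)
  {
    for(int v : store.loadChunk(c))
    {
      total += v;
    }
  }
  return total;
}
```

A random-access traversal over the same store could reload chunks many times; the sequential pass above performs exactly one load per chunk.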
## When to Use Algorithm Dispatch

Use the dispatch pattern when:

- The in-core algorithm uses random access (BFS, graph traversal, hash-map lookups by voxel index)
- The OOC-optimized algorithm uses fundamentally different data structures or traversal order
- The two approaches have different memory/performance trade-offs that make one unsuitable for the other's use case

Do **not** use dispatch when:

- The algorithm already uses sequential access patterns (a single algorithm with chunk-aware loops suffices)
- The only difference is minor (e.g., adding `loadChunk()` calls) rather than a fundamentally different approach
- The data is small enough that OOC overhead is negligible
## The DispatchAlgorithm Utility

**Header:** `simplnx/Utilities/AlgorithmDispatch.hpp`

### IsOutOfCore

```cpp
bool IsOutOfCore(const IDataArray& array);
```

Returns `true` if the array's data store has a chunk shape (indicating ZarrStore or similar chunked storage). Returns `false` for an in-memory DataStore.

### AnyOutOfCore

```cpp
bool AnyOutOfCore(std::initializer_list<const IDataArray*> arrays);
```

Returns `true` if **any** of the given arrays uses OOC storage. Null pointers in the list are skipped. Use this when a filter operates on multiple input/output arrays and any one of them being OOC should trigger the chunk-sequential algorithm path.

### DispatchAlgorithm

```cpp
template <typename InCoreAlgo, typename OocAlgo, typename... ArgsT>
Result<> DispatchAlgorithm(std::initializer_list<const IDataArray*> arrays, ArgsT&&... args);
```

Checks whether any array in `arrays` uses OOC storage. If so, constructs `OocAlgo(args...)` and calls its `operator()()`. Otherwise, constructs `InCoreAlgo(args...)` and calls its `operator()()`.

**Requirements for algorithm classes:**

1. Both must be constructible from the same argument types (`ArgsT...`)
2. Both must provide `Result<> operator()()` as their execution entry point
3. Both must produce identical results for the same input data
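The dispatch logic itself is small enough to sketch in isolation. The following is a simplified, self-contained illustration: `FakeArray`, the stub `Result`, and the two recording algorithms are stand-ins invented for this sketch, not the actual simplnx header.

```cpp
#include <initializer_list>
#include <string>
#include <utility>

// Stub stand-ins for the simplnx types (illustration only).
struct Result
{
  bool valid = true;
};

struct FakeArray
{
  bool chunked = false; // true would mean ZarrStore-style OOC storage
};

// Mirrors AnyOutOfCore: true if any non-null array is chunked.
bool AnyOutOfCore(std::initializer_list<const FakeArray*> arrays)
{
  for(const FakeArray* array : arrays)
  {
    if(array != nullptr && array->chunked)
    {
      return true;
    }
  }
  return false;
}

// Mirrors DispatchAlgorithm: pick the OOC algorithm if any array is
// chunked, otherwise the in-core algorithm. Both receive the same
// forwarded constructor arguments.
template <typename InCoreAlgo, typename OocAlgo, typename... ArgsT>
Result DispatchAlgorithm(std::initializer_list<const FakeArray*> arrays, ArgsT&&... args)
{
  if(AnyOutOfCore(arrays))
  {
    OocAlgo algo(std::forward<ArgsT>(args)...);
    return algo();
  }
  InCoreAlgo algo(std::forward<ArgsT>(args)...);
  return algo();
}

// Two trivial algorithms that just record which path ran.
struct BfsAlgo
{
  std::string& log;
  explicit BfsAlgo(std::string& l) : log(l) {}
  Result operator()() { log = "bfs"; return {}; }
};

struct CclAlgo
{
  std::string& log;
  explicit CclAlgo(std::string& l) : log(l) {}
  Result operator()() { log = "ccl"; return {}; }
};
```

With only in-core arrays in the list the `InCoreAlgo` path runs; as soon as any array is chunked, the `OocAlgo` path runs instead.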
## Example: IdentifySample

The IdentifySample filter identifies the largest contiguous region of "good" voxels in a 3D volume. It has two algorithm implementations:

### IdentifySampleBFS (In-Core)

Uses BFS flood fill with `std::vector<bool>` tracking arrays:

- Memory: 2 bits per voxel (checked + sample flags)
- Access pattern: Random (BFS visits neighbors in all 6 directions)
- In-core performance: Fast (random access is O(1) in memory)
- OOC performance: Extremely slow (random neighbor access thrashes the chunk cache)

### IdentifySampleCCL (Out-of-Core)

Uses scanline Connected Component Labeling with union-find:

- Memory: 8 bytes per voxel (int64 label array) + union-find overhead
- Access pattern: Sequential (processes chunks in order, only checks backward neighbors)
- In-core performance: Good, but uses more RAM than BFS
- OOC performance: Fast (sequential chunk access, no thrashing)
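The union-find used by the CCL path lives in `IdentifySampleCommon.hpp` as `VectorUnionFind`. A generic vector-backed disjoint-set with path halving looks roughly like this; it is an illustrative sketch, not the exact simplnx implementation.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Generic vector-backed disjoint-set (illustration only). During the
// scanline pass, provisional labels that turn out to touch are merged.
class UnionFind
{
public:
  explicit UnionFind(std::size_t count) : m_Parent(count)
  {
    // Every element starts in its own singleton set.
    std::iota(m_Parent.begin(), m_Parent.end(), std::size_t{0});
  }

  // Find the set representative, halving the path as we walk up.
  std::size_t Find(std::size_t x)
  {
    while(m_Parent[x] != x)
    {
      m_Parent[x] = m_Parent[m_Parent[x]]; // path halving
      x = m_Parent[x];
    }
    return x;
  }

  // Merge the sets containing a and b.
  void Union(std::size_t a, std::size_t b)
  {
    m_Parent[Find(a)] = m_Parent[Find(b)];
  }

private:
  std::vector<std::size_t> m_Parent;
};
```

Because all state is a single flat vector, the structure itself stays in memory and never touches the chunked voxel data during a merge.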
### Dispatcher Code

The `IdentifySample` algorithm class acts as a thin dispatcher:

```cpp
// IdentifySample.cpp
#include "IdentifySampleBFS.hpp"
#include "IdentifySampleCCL.hpp"
#include "simplnx/Utilities/AlgorithmDispatch.hpp"

Result<> IdentifySample::operator()()
{
  auto* inputData = m_DataStructure.getDataAs<IDataArray>(m_InputValues->MaskArrayPath);
  return DispatchAlgorithm<IdentifySampleBFS, IdentifySampleCCL>(
      {inputData}, m_DataStructure, m_MessageHandler, m_ShouldCancel, m_InputValues);
}
```

The filter (`IdentifySampleFilter`) is unchanged: it creates an `IdentifySample` instance and calls `operator()()` as before. The dispatch is transparent.
### File Organization

```
Filters/Algorithms/
  IdentifySample.hpp       # InputValues struct + dispatcher class declaration
  IdentifySample.cpp       # Thin dispatcher using DispatchAlgorithm<BFS, CCL>
  IdentifySampleBFS.hpp    # BFS algorithm class declaration
  IdentifySampleBFS.cpp    # BFS flood-fill implementation
  IdentifySampleCCL.hpp    # CCL algorithm class declaration
  IdentifySampleCCL.cpp    # Chunk-sequential CCL implementation
  IdentifySampleCommon.hpp # Shared code (VectorUnionFind, slice-by-slice functor)
```
## How to Add Algorithm Dispatch to a Filter

1. **Identify the bottleneck**: Profile the filter in OOC mode. If it is orders of magnitude slower than in-core, random access patterns are likely the cause.

2. **Design the OOC algorithm**: Replace random access with chunk-sequential processing. Common techniques:
   - Scanline CCL with union-find instead of BFS flood fill
   - Worklists sorted by chunk instead of global queues
   - Multi-pass algorithms where each pass processes chunks sequentially

3. **Create separate algorithm classes**: Both must share the same `InputValues` struct and constructor signature. Follow the existing pattern:
   ```
   FilterNameBFS.hpp/cpp # In-core algorithm
   FilterNameCCL.hpp/cpp # OOC algorithm (or other descriptive suffix)
   FilterName.hpp/cpp    # Dispatcher
   FilterNameCommon.hpp  # Shared code (if any)
   ```

4. **Make the dispatcher**: In the original algorithm's `operator()()`, use `DispatchAlgorithm`. Pass all input and output arrays so the dispatch triggers OOC mode if any of them are chunked:
   ```cpp
   return DispatchAlgorithm<FilterNameBFS, FilterNameCCL>(
       {inputArrayA, inputArrayB, outputArray},
       dataStructure, messageHandler, shouldCancel, inputValues);
   ```

5. **Register new files**: Add the new algorithm names to the plugin's `AlgorithmList` in `CMakeLists.txt`.

6. **Test both paths**: Run the filter's tests with both in-core and OOC configurations. Both must produce identical results.
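To make the scanline technique named in step 2 concrete, here is a deliberately minimal one-dimensional first pass: each voxel looks only at its backward neighbor, so the traversal is strictly sequential. In 3D the backward neighbors are -x, -y, and -z, and touching labels are merged through union-find; this sketch (with the invented name `LabelRow`) omits that.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal 1-D scanline labeling pass (illustration only). Each "good"
// voxel either extends the run to its left or starts a new label; no
// random access is ever needed.
std::vector<int64_t> LabelRow(const std::vector<bool>& mask)
{
  std::vector<int64_t> labels(mask.size(), 0); // 0 = background
  int64_t nextLabel = 1;
  for(std::size_t i = 0; i < mask.size(); i++)
  {
    if(!mask[i])
    {
      continue;
    }
    if(i > 0 && mask[i - 1])
    {
      labels[i] = labels[i - 1]; // backward neighbor: extend the run
    }
    else
    {
      labels[i] = nextLabel++; // start a new component
    }
  }
  return labels;
}
```

For the mask `{1, 1, 0, 1, 1}` this yields labels `{1, 1, 0, 2, 2}`: two components, discovered in a single left-to-right sweep.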
## Performance Expectations

- **In-core**: The dispatched in-core algorithm should match or exceed the original single-algorithm performance.
- **OOC**: The OOC algorithm should be dramatically faster than the original (often 100x+), though it may still be slower than in-core due to inherent disk I/O costs.
- **Small datasets**: On small datasets (< 1000 voxels), the dispatch overhead is negligible and both algorithms complete in milliseconds regardless.

src/Plugins/SimplnxCore/CMakeLists.txt

Lines changed: 2 additions & 0 deletions
```diff
@@ -248,6 +248,8 @@ set(AlgorithmList
   FlyingEdges3D
   IdentifyDuplicateVertices
   IdentifySample
+  IdentifySampleBFS
+  IdentifySampleCCL
   InitializeData
   # InitializeImageGeomCellData
   # InterpolatePointCloudToRegularGrid
```
