Version: 0.1.0
License: TBD
Language: C++20
libdop is a modern C++ header-only library designed for high-performance parallel data processing. Built on C++20, it provides a flexible and efficient framework for managing data containers, dispatching parallel operations, and composing reusable computational operators.
The library leverages TaskFlow as its backend for task-based parallelism, enabling efficient multi-threaded execution with minimal overhead. All components are designed to work seamlessly together, providing a cohesive API for parallel computation.
- Header-Only Library: Easy integration - just include the headers
- Modern C++20: Leverages concepts, ranges, and other C++20 features
- Zero-Copy Design: Efficient data management with span-based views
- Task-Based Parallelism: Built on TaskFlow for efficient parallel execution
- Type-Safe Dispatching: Compile-time type checking for operators and containers
- Flexible Container System: Support for vectors, spans, tensors, and custom containers
- Composable Operators: Build complex pipelines from simple operator building blocks
libdop is organized around three core components that work together to enable efficient parallel computation:
┌─────────────────────────────────────────────────────────┐
│ libdop │
├─────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Container │──▶│ Dispatcher │◀──│ Operator │ │
│ └─────────────┘ └──────────────┘ └──────────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ Data Storage Task Management Computation │
│ │
└─────────────────────────────────────────────────────────┘
│
▼
┌──────────────┐
│ Backend │
│ (TaskFlow) │
└──────────────┘
The Container component provides data storage and memory management abstractions. Containers in libdop are designed for zero-copy operations and efficient slicing.
A dynamic array container with custom allocator support.
Features:
- Dynamic resizing with power-of-two growth strategy
- Custom allocator support for specialized memory management
- Efficient element access and iteration
- Subspan creation for zero-copy slicing
Usage:
#include "Container/Vector.h"
Vector<int> vec;
vec.PushBack(10);
vec.PushBack(20);
auto span = vec.Subspan(0, vec.Size());A non-owning view into contiguous memory, inspired by std::span.
Features:
- Zero-copy views into existing memory
- Slicing support via
Subspan() - Both mutable and const variants
- Lightweight (pointer + size)
Usage:
#include "Container/Span.h"
Vector<int> data = {1, 2, 3, 4, 5};
Span<int> view = data.Subspan(1, 3); // Views elements [2, 3, 4]A smart wrapper for parallel tiling operations that automatically manages ownership.
Features:
- Automatic ownership detection based on value category
- Lvalue references: stores pointer (non-owning)
- Rvalue references: takes ownership via move
- Uses
std::variantfor type-safe storage (32 bytes) - Essential for
ParallelRangeFortiling behavior
Usage:
#include "Container/Tileable.h"
Vector<int> data(1000);
TileableVector<int> tileable(data); // Non-owning reference
// Tileable automatically handles ownership:
TileableVector<int> owned(std::move(data)); // Takes ownershipA simple wrapper for passing plain objects by value to operators.
Use Case: Passing functors, comparators, or small objects to parallel operations without reference complications.
A placeholder for future multi-dimensional tensor support.
Status: Under development
The Dispatcher component manages task scheduling, parallel execution, and dependency tracking. It provides high-level abstractions for common parallel patterns.
The fundamental dispatching primitive for simple operators.
Signature:
template <typename TOperator, typename... TContainer>
TaskHandle Dispatch(TOperator op, TaskHandle deps, TContainer&&... containers);Features:
- Accepts any callable operator
- Automatic container type conversion
- Dependency management via
TaskHandle - Perfect forwarding of containers
- Debug mode support (
IMMEDIATE_COMPLETE_TASK)
Usage:
#include "Dispatcher/Dispatch.h"
auto task = Dispatch(
[](Span<int> data) {
for (int i = 0; i < data.Size(); ++i) {
data[i] *= 2;
}
},
TaskHandle(), // No dependencies
myVector
);
task.Wait();Parallel tiled execution for data-parallel operations.
Signature:
template <typename TOperator, typename... TContainer>
TaskHandle ParallelRangeFor(TOperator op, TaskHandle deps, TContainer&&... containers);Requirements:
- Operator must derive from
IParallelRangeForOperator<TOperator> - First container must be
TileableVectorto determine tiling size - Operator signature:
TaskHandle operator()(TContainer... containers, int n, int threadId)
Features:
- Automatic work tiling based on thread count
- Block-wise parallel execution
- Thread-aware operators (receives
threadId) - Optimal load balancing
Usage:
#include "Dispatcher/ParallelRangeFor.h"
struct MyOperator : IParallelRangeForOperator<MyOperator> {
TaskHandle operator()(Span<int> data, int n, int threadId) {
for (int i = 0; i < n; ++i) {
data[i] = data[i] * threadId;
}
return TaskHandle();
}
};
Vector<int> data(10000);
auto task = ParallelRangeFor(MyOperator{}, TaskHandle(), TileableVector(data));
task.Wait();Sequential range-based iteration (single-threaded).
Use Case: Sequential operations or small datasets where parallelism overhead isn't justified.
Parallel reduction operations (sum, min, max, etc.).
Signature:
template <typename T, typename TOp>
TaskHandle ParallelReduction(const Span<T> data, Span<T> output, TOp op, TaskHandle deps);Features:
- Multi-threaded reduction with per-thread partial results
- Custom reduction operators (addition, multiplication, min, max, etc.)
- Automatic thread-local result aggregation
Usage:
#include "Dispatcher/Reduction.h"
Vector<int> data = {1, 2, 3, 4, 5};
Vector<int> result(1);
auto task = ParallelReduction(
data.Subspan(0, data.Size()),
result.Subspan(0, 1),
[](int a, int b) { return a + b; }, // Sum reduction
TaskHandle()
);
task.Wait();
// result[0] contains the sumPrefix sum and scan operations.
Signature:
template <typename T, typename TOp>
TaskHandle Scan(const Span<T> data, Span<T> output, TOp op, TaskHandle deps);Features:
- Inclusive scan (prefix sum)
- Custom scan operators
- Both sequential and parallel variants
Usage:
#include "Dispatcher/Scan.h"
Vector<int> data = {1, 2, 3, 4, 5};
Vector<int> output(5);
auto task = Scan(
data.Subspan(0, data.Size()),
output.Subspan(0, output.Size()),
[](int a, int b) { return a + b; },
TaskHandle()
);
task.Wait();
// output = {1, 3, 6, 10, 15}Represents an asynchronous task with dependency tracking.
Features:
- Wraps backend task representation
Wait()method for synchronization- Chainable for task dependencies
Internal utility for automatic container type conversion during dispatch.
Purpose: Converts containers to appropriate views (e.g., Vector → Span) for zero-copy operations.
The Operator component defines computational kernels that process data. Operators are type-safe, composable, and can be parallelized automatically.
CRTP base class for simple, single-threaded operators.
Requirements:
template <typename TOperator>
struct ISimpleOperator {
template <typename ...TContainer>
TaskHandle operator()(TContainer&&... container, int n);
};Usage:
#include "Operator/ISimpleOperator.h"
struct DoubleValues : ISimpleOperator<DoubleValues> {
TaskHandle operator()(Span<int> data, int n) {
for (int i = 0; i < n; ++i) {
data[i] *= 2;
}
return TaskHandle();
}
};CRTP base class for parallel, thread-aware operators.
Requirements:
template <typename TOperator>
struct IParallelRangeForOperator {
template <typename ...TContainer>
TaskHandle operator()(TContainer&&... containers, int n, int threadId);
};Key Points:
n: Number of elements in the current tilethreadId: Logical thread ID (0 to threadCount-1)- Must be used with
ParallelRangeFordispatcher
Usage:
#include "Operator/IParallelRangeForOperator.h"
struct ThreadAwareOp : IParallelRangeForOperator<ThreadAwareOp> {
TaskHandle operator()(Span<int> data, int n, int threadId) {
// Process n elements, using threadId for thread-local state
for (int i = 0; i < n; ++i) {
data[i] = threadId; // Example: fill with thread ID
}
return TaskHandle();
}
};Base class for operators that need per-frame or periodic updates.
Use Case: Operators that maintain state across multiple invocations (e.g., simulation steps, animations).
libdop uses a pluggable backend architecture, currently implemented with TaskFlow.
Global backend instance management.
Features:
GetGlobalBackend(): Access to the global backend instance- Compile-time backend selection via
USE_TASKFLOW - Extensible for future backends
The default parallel execution backend.
Features:
- Task-based parallelism
- Automatic thread pool management
- Work stealing for load balancing
- Efficient task dependency resolution
Configuration:
option(USE_TASKFLOW "Enable TaskFlow as backend for libdop" ON)- C++20 compliant compiler (GCC 10+, Clang 11+, MSVC 2019+)
- CMake 3.16+
- TaskFlow (automatically fetched via CMake FetchContent)
add_subdirectory(path/to/libdop)
target_link_libraries(your_target PUBLIC libdop::libdop)Simply add the Source/ directory to your include path and include the headers you need.
#include "Container/Vector.h"
#include "Container/Tileable.h"
#include "Dispatcher/ParallelRangeFor.h"
#include "Operator/IParallelRangeForOperator.h"
// Define a parallel operator
struct MultiplyByTwo : IParallelRangeForOperator<MultiplyByTwo> {
TaskHandle operator()(Span<int> data, int n, int threadId) {
for (int i = 0; i < n; ++i) {
data[i] *= 2;
}
return TaskHandle();
}
};
int main() {
// Create data
Vector<int> numbers(10000);
for (int i = 0; i < 10000; ++i) {
numbers.PushBack(i);
}
// Execute in parallel
auto task = ParallelRangeFor(
MultiplyByTwo{},
TaskHandle(),
TileableVector(numbers)
);
// Wait for completion
task.Wait();
return 0;
}Chain tasks using TaskHandle dependencies:
auto task1 = Dispatch(operator1, TaskHandle(), data1);
auto task2 = Dispatch(operator2, task1, data2); // Waits for task1
auto task3 = Dispatch(operator3, task2, data3); // Waits for task2
task3.Wait(); // Wait for entire pipelineUse custom allocators for specialized memory management:
struct MyAllocator {
static T* Allocate(size_t n) {
// Custom allocation logic
return static_cast<T*>(my_allocate(n * sizeof(T)));
}
static void Deallocate(T* ptr) {
// Custom deallocation logic
my_deallocate(ptr);
}
};
Vector<int, MyAllocator> customVector;Enable immediate task completion for debugging:
#define IMMEDIATE_COMPLETE_TASKThis forces all tasks to execute synchronously, simplifying debugging.
- Use TileableVector for ParallelRangeFor: Ensures proper tiling and load balancing
- Prefer Span over Vector: Avoid unnecessary copies with zero-copy views
- Batch Operations: Combine multiple operations into single operators when possible
- Minimize TaskHandle::Wait(): Allow tasks to run asynchronously
- Profile Before Parallelizing: Small datasets may not benefit from parallelism
- Span: 16 bytes (pointer + size)
- TileableVector: 32 bytes (std::variant-based)
- Vector: Variable (depends on capacity)
Build and run tests:
mkdir build && cd build
cmake .. -DBUILD_TESTS=ON
cmake --build .
ctestTest directories:
Tests/TileableVectorTest/: Tests for TileableVector ownership and slicingTests/RadixSortTest/: Parallel radix sort implementationTests/hiding-game/: Game-related test scenarios
Version: 0.1.0 (Early Development)
✅ Core container system (Vector, Span, TileableVector)
✅ TaskFlow backend integration
✅ Basic dispatchers (Dispatch, ParallelRangeFor, RangeFor)
✅ Reduction and Scan operations
✅ Operator interfaces (ISimpleOperator, IParallelRangeForOperator)
🚧 DenseTensor implementation
🚧 TypedContainer runtime type system
🚧 Advanced debugging tools (DispatcherDebugger)
📋 Additional backends (OpenMP, TBB)
📋 GPU support (CUDA, SYCL)
📋 Automatic task graph optimization
📋 Profiling and performance analysis tools
Contributions are welcome! Please follow these guidelines:
- Code Style: Follow existing code conventions (C++20, header-only)
- Testing: Add tests for new features in
Tests/directory - Documentation: Update this README for significant changes
- Performance: Benchmark performance-critical changes
TBD - Please contact the project maintainers for licensing information.
For questions, issues, or contributions, please open an issue in the project repository.
- TaskFlow: High-performance task-based parallelism library
- C++20 Standard: Modern language features enabling elegant abstractions
Last Updated: January 3, 2026