PIF, Higher Order and Optimized Scatter/Gather, by PaulFisch · Pull Request #501 · IPPL-framework/ippl

PaulFisch · 2026-05-04T10:06:33Z

This PR adds the Particle-in-Fourier method to IPPL, together with restructured FFT, scatter/gather, particle layout, communication, and timing infrastructure. A list of changes:

Build system

New IPPL_ENABLE_FINUFFT (CPU finufft / GPU cufinufft via FetchContent)
and IPPL_ENABLE_CUFFTMP (cuFFTMp + NVSHMEM) build flags, both default OFF.
Kokkos default bumped to 5.0.0
cmake/AddIpplTest.cmake passes TEST_LINK_LIBS through to non-integration
tests.
cmake/AutoTunePresets.cmake, IpplAutoTunePresets.h.in, and pre-tuned
sweeps under cmake/auto_tune/sm_90/ ship the autotuner data alongside
the build (for new scatter/gather kernels).

FFT refactor

The old FFT.h/.hpp is split into a Backend layer
(Backend/{Backend,CuFFT,CuFFTMp,Heffte}.h) handling plan
management, and a Transform layer
(Transform/{CC,RC,PrunedCC,PrunedRC,Trig,NUFFT}.h) exposing per-transform
front-ends.
A new native NUFFT (Type-1 and Type-2) lives under src/FFT/NUFFT/:
- NativeNUFFT.h drives the upsampling-FFT pipeline.
- ESKernel.h is an exponential-of-semicircle spreading kernel with
  width-specialised optimized evaluation.
- Correction.h provides deconvolution factors, NUFFTUtilities.h
  provides helpers.
The new NUFFT type plugs into the Transform layer via NUFFT.{h,hpp}
and supports both the native engine and the (cu)FINUFFT backend behind
IPPL_ENABLE_FINUFFT.

Interpolation

New scatter dispatch in Interpolation/Scatter/: Scatter.h selects
between AtomicScatter, GridParallelScatter (output-focused), and
TiledScatter back-ends through ScatterConfig and a runtime
TileSizeCache loaded from the autotune presets.
Matching gather front-end (AtomicGather, AtomicSort variants) under
Interpolation/Gather/.
Binning.h bin-sorts particles by tile so hot tiles avoid atomic
contention. Kernels.h, WidthDispatcher.h, and CoordinateTransform.h
are shared with the native NUFFT.
AutoTune.{h,cpp} and TileSizeCache.cpp provide the runtime
cache-lookup that lets the same binary pick a different tile size on
different hardware.

Particle layout

ParticleSpatialLayout::update is rewritten from the per-iteration
vector-of-vectors model into a packed locate + single-pass send/recv
pipeline with reusable scratch (rankSendCount, sendOffsets, sendIds,
cursor). Buffers are reused across updates and host-side hashes are
avoided.
ParticleSort.h / SortBuffer.h provide a bin-sort with a shared
device-side buffer pool (BinSortBuffers) so repeated sorts in
scatter/gather hot paths reuse storage.
FieldLayout now stores nghost; HaloCells.{h,hpp} thread it through
accumulateHalo / fillHalo, enabling kernel-width-aware halos (>1
ghost cell). (Before this PR, nghosts>1 was broken in IPPL)
ParticleAttrib exposes scatterPIFNUFFT / gatherPIFNUFFT guarded by
IPPL_ENABLE_FFT.
ParticleAttrib::scatter(...) / ParticleAttrib::gather(...) (CIC overloads used by
every alpine PIC example) now
forward to the kernel-aware path with LinearKernel<MeshType> and a
default ScatterConfig / GatherConfig. PIC paths therefore inherit the
same atomic / atomic-sort / tiled / output-focused dispatch and
TileSizeCache lookup as NUFFT.

Communication and Particle Update

This PR removes a per-attribute buf_m send/recv staging view for every
ParticleAttrib. The old update path went
attribute view -> per-attribute buf_m -> MPI buffer → per-attribute buf_m->→ attribute view, with one buf_m per attribute per exchange and copies on
both ends. The new path serializes attribute data directly into a shared,
reused MPI buffer managed by a singleton BufferHandler, and
deserializes straight back into the destination attribute's storage at a
caller-supplied offset. The per-attribute buf_m is not needed for the common path.

ParticleAttrib::serialize / deserialize now operate on dview_m
directly and accept a hash view (gather-on-the-fly of the live
particle indices going to one rank) and a destination offset. There is
no intermediate per-attribute buffer to size, allocate, copy into, or
copy out of. getView() returns a subview over the live range only, so
attribute kernels never see the overallocated tail.
A granularity-aware GPU allocator is added to Archive'. Device
archives are allocated with raw cudaMalloc / hipMalloc so the
pointer is page-aligned. Host
archives keep using a regular Kokkos::View.

Utilities

IpplTimings.{h,cpp}: track per-call measurements alongside cumulative
wall time, support resetAllTimers() (for warmup), and add
dumpToCSV() for offline analysis. print() reports measurement counts
and per-occurrence statistics (mean, stddev, min, max).
BufferView.h: lightweight, type-erased view over a contiguous device
byte buffer, used by the new send/recv path.
Tuning.h: TileSizeTuner interface to Kokkos Tools autotuner that
enumerates tile-size candidates for the new scatter/gather back-ends.
ParallelDispatch.h: helpers for selecting parallel-for vs.
parallel-reduce policies based on view rank/extents and execution space.

Alpine

alpine/ElectrostaticPIF/{LandauDampingPIF, LandauDampingPIFPruned, BumponTailInstabilityPIF, PenningTrapPIF}.cpp: added as Alpine PIF examples.
alpine/ElectrostaticPIF/{ChargedParticlesPIF, ChargedParticlesPIFPruned}.hpp:
particle-bunch container specialised for the PIF schemes

Unit tests

FFT/NUFFT.cpp + NUFFTTestUtils.h: test matrix for NUFFT Type-1
and Type-2 (small / medium grid, with / without upsampling, three spread
methods, tolerance sweep at 1e-4 / 1e-7 / 1e-10, FINUFFT cross-check if compiled,
three gather methods).
FFT/NUFFTAccuracy.cpp: tolerance-vs-error sweep that compares the
native NUFFT against a direct DFT reference.
Interpolation/KernelGatherScatterTest.cpp: charge conservation,
periodicity, gather-of-constant-field, scatter/sort comparison,
roundtrip; covers Atomic, Tiled, OutputFocused
Interpolation/Binning.cpp: bin-sort correctness and stability tests.
Particle/ParticleUpdate.cpp: spatial-layout update tests under uniform
decomposition.
Particle/ParticleUpdateNonuniform.cpp: same for non-uniform domain
decompositions

Integration tests

test/FFT/TestFFTCC.cpp extended for the new pruned variants and the
heFFTe backend trait.
test/particle/benchmarkParticleUpdate.cpp switched to
Kokkos::Random_XorShift64_Pool device init (avoids host->device copies
between iterations) and adapted to the new ParticleSpatialLayout
signature.

- Top-level CMakeLists.txt: add option(IPPL_ENABLE_FINUFFT) and option(IPPL_ENABLE_CUFFTMP), both defaulting OFF. - cmake/Dependencies.cmake: add FetchContent block for (CU)FINUFFT and the cuFFTMp / NVSHMEM discovery block. Bump Kokkos default to 5.0.0 and enable Kokkos_ENABLE_COMPILE_AS_CMAKE_LANGUAGE. - cmake/InstallIppl.cmake: install finufft / cufinufft targets when built in-tree. - cmake/AddIpplTest.cmake: pass TEST_LINK_LIBS through to non-integration tests as well. - src/CMakeLists.txt: link Heffte/finufft/cufinufft only when FFT and the corresponding option are enabled; embed build metadata as an opt-in to reduce rebuild churn. - .gitignore: exclude .idea*, cmake-build-*.

- Split the original FFT.h/FFT.hpp into: - Backend/{Backend,CuFFT,CuFFTMp,Heffte,Interface}.h: per-library backend selection and plan management. - Transform/{CC,RC,PrunedCC,PrunedRC,Trig,Common,Transform, NUFFT.{h,hpp}}.h: per-transform front-ends (complex-complex, real-complex, pruned variants, trigonometric, NUFFT). - Traits.h: backend availability traits and selection helpers. - Add a native Kokkos NUFFT (Type-1 / Type-2) engine under FFT/NUFFT/: NativeNUFFT.h driving the upsampling-FFT pipeline, ESKernel.h exponential-of-semicircle spreading kernel with width-specialised Estrin polynomial evaluators, Correction.h deconvolution factors, Quadrature.h Gauss-Legendre quadrature, NUFFTUtilities.h helpers. - The new NUFFT type plugs into the Transform layer via NUFFT.{h,hpp} and supports both the native engine and the (cu)FINUFFT backend behind IPPL_ENABLE_FINUFFT.

- Add Scatter/{Scatter,GridParallelScatter,AtomicScatter,TiledScatter, ScatterConfig,ScatterArgumentsBase,TileSizeCache}.h: a configurable scatter front-end that selects between atomic, output-focused, and tiled grid-parallel back-ends, with an optional tile-size cache for hardware-specific tuning data loaded at runtime. - Add Gather/{Gather,AtomicGather,GatherConfig,GatherArgumentsBase}.h: matching gather front-end (Atomic / AtomicSort variants). - Add Binning.h: bin-sort utilities used to group particles by tile so scatter/gather kernels can avoid atomics on hot tiles. - Add Kernels.h: shared kernel-evaluation helpers consumed by the new scatter/gather kernels. - Add WidthDispatcher.h: compile-time switch from runtime kernel width to the corresponding template specialization. - Add CoordinateTransform.h: physical <-> grid coordinate helpers used by both scatter/gather and the native NUFFT. - CIC.hpp: minor signature alignment to the new kernel concept.

- Particle/ParticleSpatialLayout.{h,hpp}: rewrite the spatial layout's update path to use a packed locate + single-pass send/recv pipeline with reusable scratch (rankSendCount, sendOffsets, sendIds, cursor) instead of the per-iteration vector-of-vectors approach. Reuses buffers across updates and avoids host-side hashes for the common case. - Particle/ParticleSort.h, SortBuffer.h: bin-sort by spatial layout with a shared device-side buffer pool (BinSortBuffers) so repeated sorts in scatter/gather hot paths reuse storage. - Particle/ParticleAttrib.{h,hpp}, ParticleAttribBase.h, ParticleBase.{h, hpp}: thread the new sort/bin-aware scatter and gather entry points through the attribute interface; add scatterPIFNUFFT/gatherPIFNUFFT guarded by IPPL_ENABLE_FFT. - FieldLayout/{FieldLayout.h,FieldLayout.hpp,SubFieldLayout.hpp}: store nghost on the layout so the halo and neighbor lookup pick up the correct ghost width when more than one cell is needed. - Field/HaloCells.{h,hpp}: thread nghost through accumulateHalo and fillHalo so the kernel-width-aware halo path works.

- IpplTimings.{h,cpp}: track per-call measurements alongside the cumulative wall time, support resetAllTimers() (for warmup), and add dumpToCSV() for offline analysis. Print() now also reports measurement counts and per-occurrence statistics (mean, stddev, min, max). - BufferView.h: lightweight, type-erased view over a contiguous device byte buffer used by the new send/recv path. - Tuning.h: TileSizeTuner — Kokkos-tools-driven autotuner that enumerates tile-size candidates for the new scatter/gather back-ends. - ParallelDispatch.h: helpers for selecting parallel-for vs parallel-reduce policies based on view rank/extents and execution space. - TypeUtils.h: detail::is_complex_v / MultispaceContainer helpers used by the buffer-handler-aware logging path and by the field inner-product (Hermitian for complex T). - ParameterList.h, Timer.cpp, ViewUtils.h: small follow-on tweaks.

- Archive.{h,hpp}: extend the device-side serialize/deserialize path to handle Vector<T,Dim> + hash views (used by the new packed particle send/recv pipeline) and add a granularity-aware GPU buffer allocator. On HIP, the GPU page granularity (64 KiB on MI250X / MI300X) is required by HSA IPC for large-message GPU-aware MPI. - BufferHandler.hpp: round buffer requests up to a 4 KiB page so we do not churn the GPU-aware MPI registration cache with tiny allocations; allocate a fresh buffer on grow rather than realloc to avoid stale-IPC issues with reused virtual addresses. - Communicator.{h,cpp}: add archive-only isend/recv overloads (used by the new layout); switch buffer handlers to a static singleton via getBufferHandler() so all communicators share one pool. - CommunicatorLogging.cpp: gatherLocalLogs() now uses an is_a_logger SFINAE guard so we only call getLogs() on a LoggingBufferHandler. - Buffers.{cpp,hpp}: route deleteAllBuffers/freeAllBuffers/getBuffer through the singleton handler. - Environment.{h,cpp}: initialize MPI with MPI_THREAD_MULTIPLE and expose threadMultiple() so callers know whether multithreaded sends are safe.

- Field/BareField.{h,hpp}: drop the nghost default on initialize() / updateLayout() now that callers have to pick the correct width; use ippl::detail::infinity<T>() for the Vector min/max identities so reductions with integer T compile. - Field/BareFieldOperations.hpp: support complex-valued inner products via Hermitian conjugation and split allreduce of real/imag parts (std::plus<complex> is not a standard MPI Op). - Field/Field.hpp: small helper for the new scatter integration. - PoissonSolvers/FFT{Open,Periodic,TruncatedGreenPeriodic}PoissonSolver.h: pull heffteBackend from the new fft::HeffteBackend trait instead of FFT_t::heffteBackend. - Types/IpplTypes.h, Vector.hpp: expose detail::infinity<T> for both floating-point and integer T (used by the BareField reduction identities). - FEM/LagrangeSpace.hpp: simplify a parameter type (point_t alias). - Ippl.cpp: call detail::finalizeBinSortBuffers() on shutdown to release the new sort scratch. - Ippl.h: drop the implicit ParallelDispatch include; consumers that need it (HaloCells, ParticleSpatialLayout) include it directly.

- alpine/ElectrostaticPIF/{LandauDampingPIF,LandauDampingPIFPruned, BumponTailInstabilityPIF,PenningTrapPIF}.cpp: physics drivers using the new NUFFT engine to scatter charge onto Fourier modes and gather the resulting field at particle locations. - alpine/ElectrostaticPIF/{ChargedParticlesPIF,ChargedParticlesPIFPruned}.hpp: particle-bunch container specialized for the PIF schemes (NUFFT-based density and force evaluation, energy diagnostics). - alpine/ElectrostaticPIF/CMakeLists.txt: build the four executables; link (cu)finufft when IPPL_ENABLE_FINUFFT is on. - alpine/CMakeLists.txt: register ElectrostaticPIF as an alpine subdirectory. - alpine/{BumponTailInstabilityManager,LandauDampingManager, PenningTrapManager}.h, ParticleContainer.hpp: small adjustments picked up via the new ParticleSpatialLayout signatures.

…ticleUpdate - FFT/NUFFT.cpp + NUFFTTestUtils.h: TYPED_TEST matrix for NUFFT Type-1 and Type-2 (small/medium grid, with/without upsampling, three spread methods, tolerance sweep at 1e-4/1e-7/1e-10, FINUFFT cross-check, three gather methods). - FFT/NUFFTAccuracy.cpp: tolerance-vs-error sweep that compares the native NUFFT against a direct DFT reference. - Interpolation/KernelGatherScatterTest.cpp + Utils.h: charge conservation, periodicity, gather-of-constant-field, scatter/sort comparison, roundtrip; covers Atomic, Tiled, OutputFocused variants (incl. Z-batched). - Interpolation/Binning.cpp: bin-sort correctness and stability tests for the new bin_sort path. - Particle/ParticleUpdate.cpp: spatial-layout update tests under uniform decomposition. - Particle/ParticleUpdateNonuniform.cpp: same but for non-uniform domain decompositions to exercise the rebinning logic. - Particle/ParticleSendRecv.cpp: minor updates for the new Communicator API. - Wire all new tests into the CMakeLists.txts for unit_tests/{FFT, Interpolation,Particle}.

- test/FFT/TestFFTCC.cpp: extend the existing FFT correctness check to exercise the new pruned variants and the heFFTe backend trait. - test/particle/benchmarkParticleUpdate.cpp: switch warmup to Kokkos::Random_XorShift64_Pool device init (avoids host->device copies between iterations) and adjust to the new ParticleSpatialLayout signature. - test/particle/benchmarkParticleUpdateScaling.cpp: new strong-scaling benchmark that exercises ParticleSpatialLayout::update across a range of particle counts and ranks. - test/particle/CMakeLists.txt: register the new scaling benchmark. - test/{kokkos/TestVectorField{,2,3,4},field/TestFieldBC,particle/PICnd, serialization/serialize01}.cpp: minor signature touch-ups (host_mirror_type, isAllPeriodic flag, etc.) so the existing integration tests still build against the merged-in upstream and the new layout APIs. - test/CMakeLists.txt: minor adjustment to keep all subdirectories registering correctly under the new structure.

- Correction.h: applyPreCorrectionType2 now multiplies by factor (not conj(factor)), matching the working pruned path. - AtomicScatter.h: clamp team_size and vector_length to 1 on Serial. - BufferHandler test: use page-aligned expected sizes; resized larger test now expects the smaller free buffer to be retained. - NUFFT test: zero the non-corner-DC band of the upsampled input. - ParticleBase test: re-acquire host mirror after destroy().

- LandauDampingManager: open the manager CSV in OVERWRITE on the first step and APPEND afterwards, so re-running the test does not produce a file with multiple header blocks. - NUFFTAccuracy: call computeDFTReference before the owned-mode early- exit so every rank participates in the same MPI_Allreduce sequence. - GatherScatterTest::ScatterCustomHashTest: size the host fill loop by the post-update local count (getHostMirror() now returns a mirror of the live-particle range). - ParticleUpdateNonuniform::ParticleInjectionBetweenOrbRepartitions: the test creates a fresh bunch each cycle, so compare against injectPerCycle rather than a cumulative count. - unit_tests/FFT/CMakeLists.txt: bump NUFFTAccuracy timeout to 600s.

- CMakeLists.txt: add -fno-lto for the CUDA platform to avoid host-LTO conflicts with the duplicate fatbinData symbols nvcc emits. - AddIpplTest.cmake: introduce IPPL_DEFAULT_INTEGRATION_TIMEOUT=300 and use it for INTEGRATION tests (the unit-test default of 60s was too tight on slower OpenMP runs). - BareFieldOperations.hpp: pre-capture view1/view2 with (void)-casts so nvcc accepts the extended host/device lambda when the constexpr-if branches first-use them. - NUFFT.cpp: in runType2Test, compare absolute error against the L-infinity-norm-scaled tolerance instead of a pointwise relative error. The pointwise refReal can be arbitrarily close to zero for random fields, which made the test sensitive to the (rank, thread) layout that drives the random seeding. - Bump TIMEOUTs that were overshooting on multi-thread OpenMP: unit_tests/FFT NUFFTAccuracy (1800), Interpolation KernelGatherScatterTest (600), PIC ORB (300), test/solver/fem TestNonhomDirichlet_1d_preconditioned/TestPeriodicBC_sin{,_preconditioned} (600). Also fix the misplaced TIMEOUT keyword inside LABELS lists in the solver/fem CMakeLists.

…ives Archive<Properties...> previously took the cudaMalloc/hipMalloc path unconditionally whenever KOKKOS_ENABLE_CUDA or KOKKOS_ENABLE_HIP was defined. That meant Archive<Kokkos::HostSpace> — used by host-side exec spaces such as Kokkos::Serial / Kokkos::OpenMP — was backed by a device pointer, but its serialize() runs std::memcpy on the host and therefore segfaulted on the very first MPI exchange under a CUDA build with more than one rank. Switch the storage choice to depend on the archive's memory space rather than the build configuration. Device archives still get the raw cuda/hipMalloc path (preserving the page-aligned / IPC-stable behaviour needed for the Cray-MPICH HIP IPC bug); host-accessible archives use a regular Kokkos::View<char*, MemorySpace> and the host-side memcpy in serialize() now writes to a host pointer.

…work ParticleAttrib::scatter(...) and ParticleAttrib::gather(...) (the kernel-less, atomic-CIC overloads used by every alpine PIC example) now forward to the kernel-aware scatter_kernel/gather path with LinearKernel<MeshType> and a default ScatterConfig/GatherConfig. This gives the PIC paths the same atomic / atomic-sort / tiled / output- focused dispatch (and TileSizeCache lookup) as NUFFT, while keeping the legacy hash-and-custom-range overload available for TestHashedScatter / GatherScatterTest::ScatterCustomHashTest. Higher-dimensional fields (Dim > 3) still use the direct CIC kernel because the new framework's AtomicScatter/AtomicGather only have 1-, 2-, 3-D specialised paths. Interpolation::Gather/Scatter::DeduceTypes now derives RealType from the kernel's value_type instead of from the position attribute. With mixed precisions (float positions on a double mesh, e.g. PICTest with float T) this avoids silently downcasting the mesh spacing and preserves bit-identical results vs. the legacy CIC path (ASSERT_DOUBLE_EQ in PICTest::Gather is now exact again, no relaxation needed).

…t have tests

srikrrish · 2026-05-05T15:15:21Z

cscs-ci run cscs-ci-gh200, cscs-ci-mi300, cscs-ci-openmp

PaulFisch · 2026-05-05T15:23:39Z

I still have a commit fixing some of the compile failures on juwels, I will also address your comments @srikrrish

aliemen · 2026-05-07T15:56:13Z

Alright, after the size_type fix, the cmake path fix and the R.extent(0) vs R.size() fix, I was able to compile and run OPALX. Then, I had to fix a unit test that couldn't handle the temporary subview returned by getView - but that was an OPALX problem again.

Now, all of our unit and regression tests passed cleanly on CPU and GPU (GH200 on Daint). Our regression tests only test single rank. But from my local runs, all of them looked good with up to 8 ranks (CPU).

Note that I pushed a branch here to IPPL, test-pr-501, that I used to compile OPALX against where I applied my fixes. Feel free to look at it if you want to see what would work with OPALX. Changes I made to OPALX are in 412-..., will also make PR as soon as possible.

s-mayani · 2026-05-08T08:27:41Z

Alps scaling studies, using the same configurations (512^3 and 1024^3) as the paper:

TOTAL_scalings_1024_paper_alps.pdf
TOTAL_scalings_512_paper_alps.pdf

I have added a line comparing with the paper. The FFT clearly benefits from the update's improvement. Something that caught my eye is in the 1024^3 case, at the very end (512 GPUs) this branch ("Paul" in the legend) seems to flatten and stop scaling, whereas in the paper this was not the case. I don't know why this is.

srikrrish · 2026-05-08T08:36:35Z

Alps scaling studies, using the same configurations (512^3 and 1024^3) as the paper:

TOTAL_scalings_1024_paper_alps.pdf TOTAL_scalings_512_paper_alps.pdf

I have added a line comparing with the paper. The FFT clearly benefits from the update's improvement. Something that caught my eye is in the 1024^3 case, at the very end (512 GPUs) this branch ("Paul" in the legend) seems to flatten and stop scaling, whereas in the paper this was not the case. I don't know why this is.

Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.

s-mayani · 2026-05-08T08:40:52Z

Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.

Yeah sorry I mean only the PCG solver.

srikrrish · 2026-05-08T08:43:17Z

Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.

Yeah sorry I mean only the PCG solver.

Could you may be just submit the run corresponding to that data point only to see if it may be some one time effect?

PaulFisch · 2026-05-08T09:04:13Z

Generally the PCG performance decreased, right? Can you send me the script that you submitted with, then I can try to analyze the cause

After hours of debugging yesterday night, I found that heffte is fundamentally broken with the ROCM/MPICH combination that we pass as linker flags in the LUMI IPPL builds. I was able to reproduce in a standalone project without Kokkos. The segfault has nothing to do with my branch and just seems to appear more often for some reason in my branch than in master. When enabling enough HSA debugging env variables, the segfault happens reliable in the first warmup FFT. I will try building on different environments to see if that makes it work. Unsure whether this is a heffte, Cray MPICH, HSA or rocm compiler bug.

srikrrish · 2026-05-08T09:13:37Z

Generally the PCG performance decreased, right? Can you send me the script that you submitted with, then I can try to analyze the cause

After hours of debugging yesterday night, I found that heffte is fundamentally broken with the ROCM/MPICH combination that we pass as linker flags in the LUMI IPPL builds. I was able to reproduce in a standalone project without Kokkos. The segfault has nothing to do with my branch and just seems to appear more often for some reason in my branch than in master. When enabling enough HSA debugging env variables, the segfault happens reliable in the first warmup FFT. I will try building on different environments to see if that makes it work. Unsure whether this is a heffte, Cray MPICH, HSA or rocm compiler bug.

Thanks a lot. Seems something that @paubauer may be able to help. The standalone reproducer could be quite useful for him. No need to spend a lot of time on this especially if it seems to lead to a rabbit hole.

s-mayani · 2026-05-08T09:16:35Z

I launched the job for that specific data point, it's in the queue. As for the heffte issue, seems related to Issue #481 .

aliemen · 2026-05-08T11:30:47Z

+                    if constexpr (Types::KernelType::has_width_template)
+                        kw_ptr[D * W + i] = args.kernel.template eval<W>(xi);
+                    else
+                        kw_ptr[D * W + i] = args.kernel(xi);


For some reason, this is a problem with cuda 12.1 and gcc 12 on Gwendolen. I cannot compile OPALX against it:

.../cuda-ampere80/build/_deps/ippl-src/src/Interpolation/Scatter/AtomicScatter.h(218): error: class "ippl::Interpolation::LinearKernel<double>" has no member "eval" kw_ptr[D * W + i] = args.kernel.template eval<W>(xi); ^

CUDA 12.1 with GCC 12 is diagnosing the discarded eval<W> expression inside a Kokkos device lambda/team kernel. Newer CUDA, like 12.9 with GCC 14, apparently handles that template/device-lambda case correctly...

We could either find a workaround (like making all interpolation kernels provide a harmless width-templated wrapper) or we say "doesn't work with cuda/12.1" and ignore it. But that would mean we need a new module on Gwendolen (which is the case for Kokkos 5 anyways). @aaadelmann what's your opinion?

There were still a lot of compiler bugs with this kind of expressions in cuda 12.9, probably it's not too difficult to build a workaround on my side, I will check.

s-mayani · 2026-05-08T13:01:59Z

The new run still produced a similar timing, with the PCG flattening at 512 GPUs.

aaadelmann · 2026-05-08T21:02:38Z

@s-mayani if you still have time until departure, send me all timing files.

PaulFisch · 2026-05-11T06:44:51Z

I tried to look a but at PCG profiles - most time is spent in the cudaMallocs and cudaFree in every iteration. On my tests, my branch was a bit faster, but maybe I used a different preconditioner. I have two theories: a) My branch does allocation of other stuff differently, and this affects the timing of allocations in the PCG, b) My branch needs more iterations to converge for some reason. I will investigate further. Likely the best long-term solution would be to pre-allocate the views in the PCG

srikrrish · 2026-05-11T07:26:14Z

I tried to look a but at PCG profiles - most time is spent in the cudaMallocs and cudaFree in every iteration. On my tests, my branch was a bit faster, but maybe I used a different preconditioner. I have two theories: a) My branch does allocation of other stuff differently, and this affects the timing of allocations in the PCG, b) My branch needs more iterations to converge for some reason. I will investigate further. Likely the best long-term solution would be to pre-allocate the views in the PCG

More iterations to converge seems to be the most concerning thing for me. @s-mayani Do you know whether you are also seeing the same behaviour in your tests?

PaulFisch · 2026-05-11T07:36:18Z

I should clarify that I did not actually see my branch needing more iterations to converge. I was just speculating what might be the issue for the difference in performance. If it is the mallocs, the cleaner way is likely to slightly refactor the PCG not to malloc in every iteration of the CG, as that would give a massive performance boost. Given that I did not observe a slow down within an iteration in my tests, I just stated that theoretically it could be that my branch needs more iterations, which would also explain longer runtime. That would be something that I would need to investigate, if that would be the case.

aaadelmann · 2026-05-11T09:11:30Z

If you could paste the timing-profiles into this PR, I will make an issue out of it.

Squashes four small cleanup commits previously prepared on pif-bo: - Doxygen comments on public APIs across FFT, Interpolation, Particle, Utility, and Communicate. - Trailing-whitespace / tab / newline normalization. - Replace raw `int` overalloc counters with `size_type`. - Replace `size_t` / `uint64_t` with `size_type` where appropriate. Pure code-quality changes; no behavior change.

NSYS profiles showed cudaMalloc / cudaFree firing inside every (P)CG iteration. Two sources: - PCG::operator() created r, d, s, q as locals on every solve and used .deepCopy() on the preconditioner result inside the loop. Move r/d/q to CG protected members, add s and pcond_out as PCG members, allocate all of them once via an overridden initializeFields(), and use updateLayout() per solve to track load-balance repartitions. - The preconditioner returned a fresh Field per call. Switch the API to operator()(u, result), writing into a caller-provided buffer. Each concrete preconditioner gains an init_fields() override with a fields_initialized_m lazy-init flag so scratch is allocated once and only updateLayout()'d thereafter. The PCG result staging field pcond_out keeps its default NoBcFace BCs so the preconditioner's internal operator chain never triggers PeriodicFace::apply MPI calls -- this preserves the master code path's MPI sequence and avoids an intermittent multi-rank halo deadlock seen during solves. Kept as a separate commit so it can be reverted independently while the cluster deadlock is investigated.

The PIF example pair (ChargedParticlesPIF + LandauDampingPIF) and its "Pruned" near-duplicates differed only in a boolean (use_upsampled_inputs true vs false) and the upsampling step that grew the field-storage grid. Drop the duplicate files and expose the choice as: - ChargedParticlesPIF constructor parameter useUpsampledInputs (default true, matching the original ChargedParticlesPIF behaviour). The flag drives the NUFFT plan's use_upsampled_inputs setting. - LandauDampingPIF optional 10th positional argv: "pruned" selects the pruned pipeline (no 2x grid upsample, original-resolution rho/Sk fields); anything else / absent keeps the original upsampled behaviour. Output CSV filename gets the matching "Pruned" suffix in that mode. Net: -996 LOC across two deleted files; both variants reachable from one source.

…build transformFinufft and the cuFINUFFT/FINUFFT scratch buffers are hardcoded for 3D (Rank<3> MDRangePolicy + tempField_m typed as Kokkos::View<***>). Until those gain 2D variants the constexpr branch in FFT::transform() must also check Dim == 3, otherwise FINUFFT-enabled builds eagerly instantiate the 3D-only transformFinufft body for 2D FFT<NUFFTransform, ...> types exercised by unit_tests/FFT/NUFFT.cpp and the CUDA TU fails to compile. No behaviour change for 3D users (the runtime check on useFinufft_m is preserved). 2D NUFFT continues to route through transformNative as before on FINUFFT-disabled builds.

The earlier Dim==3 guard on FFT::transform() is necessary but not sufficient: the FFT<NUFFTransform, ...> constructor unconditionally calls allocateFinufftBuffers() (which reads lDom[2]) and initBackend() calls initFinufft() when use_finufft is requested. For 2D test instantiations of FFT<NUFFTransform, ...> on a FINUFFT-enabled build, those paths read past the local NDIndex and segfault at construction. Wrap both call sites in `if constexpr (Dim == 3)`; the native NUFFT backend already handles 2D so transform()/initBackend() fall through to it. Restores NUFFT unit-test pass (single-rank 60/60) on the CUDA backend.

…udaMalloc in halo) NSYS profile (cg_n2 chebyshev 14 outer * 31 inner = 434 inner ops) showed ~1 cudaMalloc per inner op in multi-rank runs, which the previous PCG/ Preconditioner hoist (40438d1) did not eliminate. Root cause: the op_m wrapper macro #define IPPL_SOLVER_OPERATOR_WRAPPER(fun, type) [](type arg) { ... } takes the Field by value and the OperatorF typedefs match (std::function<OperatorRet(lhs_type)>). On every op_m(d) call the Field is copy-constructed. BareField's Kokkos::View members are reference-counted-shared so the data is not copied, but the embedded HaloCells::haloData_m.buffer is a 0-initialised View on first use -- pack()'s Kokkos::realloc(buffer, size*overalloc) then grew the *copy* and the grown buffer was destroyed when the lambda returned, so the next op_m(d) call copied a 0-sized buffer again and reallocated. Fix in three places, all converging on the same signature change: - LinearSolvers/Preconditioner.h: macro takes `type& arg`. - LinearSolvers/PCG.h: OperatorF/LowerF/UpperF/UpperLowerF/InverseDiagF/ DiagF use std::function<...(lhs_type&)>. Also update is_same_v<InvDiagF, std::function<double(Field&)>> checks in Preconditioner.h to match. - PoissonSolvers/PoissonCG.h: drop the duplicated macro definition that silently shadowed the corrected one with the old by-value form. NSYS numbers (n=2 chebyshev 14 outer * 31 inner, rank 1): cudaMalloc before: 505 cudaMalloc after: 28 Same iteration counts and residue norms (verified across {noprecond, jacobi, newton-3, chebyshev-31, gauss-seidel, ssor, richardson} * {OpenMP, CUDA} * {n=1, 2, 4}).

PaulFisch · 2026-05-29T20:23:10Z

I have now fixed all the reviews and tried to address the PCG problem.

I measured now a speedup on PCG, let me know if there is still a regression observed. I just realised that I did not update the branch wrt to the master, will try to merge with the master.

srikrrish · 2026-05-29T20:31:03Z

I have now fixed all the reviews and tried to address the PCG problem. I measured now a speedup on PCG, let me know if there is still a regression observed. I just realised that I did not update the branch wrt to the master, will try to merge with the master.

This looks great! Thanks a lot @PaulFisch !

# Conflicts: # src/Communicate/Archive.hpp # src/Field/BareField.hpp # src/Field/BareFieldOperations.hpp # src/Field/FieldBufferOps.hpp # src/Particle/ParticleSpatialLayout.h # src/Particle/ParticleSpatialLayout.hpp

PaulFisch · 2026-05-29T21:04:21Z

Will see if CI passes now, otherwise will try to fix. The two problems discussed earlier with hangs and PCG slowdown seemed to be connected - I was not able to reproduce the deadlock without PCG anymore, and with my updates to PCG, it did not hang anymore. I now modified so that PCG does not do any rellocations within the CG loop, but I pushed that (i.e. asked an LLM to squash onto) a single commit, so it should be somewhat easily separable from the rest.

aaadelmann · 2026-05-29T21:06:14Z

Awesome !!!!!

srikrrish · 2026-05-30T13:33:38Z

cscs-ci run cscs-ci-gh200, cscs-ci-mi300, cscs-ci-openmp

aaadelmann · 2026-05-30T14:03:11Z

@PaulFisch are you done with working on this PR?

PaulFisch · 2026-05-30T14:56:44Z

Yes, I think it should work now, I saw the mi300 CI failed due to some CI issues - let me know if there are any problems

PaulFisch added 15 commits May 4, 2026 11:40

aaadelmann requested review from aaadelmann, aliemen and srikrrish and removed request for aliemen May 4, 2026 10:31

aaadelmann added enhancement New feature or request gitlab-mirror labels May 4, 2026

Fix compilation problems in files that are only in examples and do no…

2702034

…t have tests

srikrrish requested changes May 5, 2026

View reviewed changes

Comment thread alpine/ParticleContainer.hpp Outdated

Comment thread cmake/Dependencies.cmake

Comment thread src/FFT/Backend/Heffte.h Outdated

Comment thread src/FFT/NUFFT/ESKernel.h

Comment thread unit_tests/FFT/NUFFT.cpp Outdated

PaulFisch added 5 commits May 5, 2026 20:39

Fix diverse compilation problems and address Sri review

540d9f7

Add back BO

9bbe077

try fixing BO

9ce5a76

update sm90 sweep

f65989c

landau warmup

6677a53

aliemen mentioned this pull request May 8, 2026

Proof OPALX for Kokkos 5 and IPPLs PIF PR OPALX-project/OPALX#414

Merged

aliemen reviewed May 8, 2026

View reviewed changes

PaulFisch added 6 commits May 29, 2026 16:46

Merge remote-tracking branch 'main/master' into pif-pr

b1d77b4

# Conflicts: # src/Communicate/Archive.hpp # src/Field/BareField.hpp # src/Field/BareFieldOperations.hpp # src/Field/FieldBufferOps.hpp # src/Particle/ParticleSpatialLayout.h # src/Particle/ParticleSpatialLayout.hpp

Conversation

PaulFisch commented May 4, 2026

Build system

FFT refactor

Interpolation

Particle layout

Communication and Particle Update

Utilities

Alpine

Unit tests

Integration tests

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

srikrrish commented May 5, 2026

Uh oh!

PaulFisch commented May 5, 2026

Uh oh!

aliemen commented May 7, 2026

Uh oh!

s-mayani commented May 8, 2026

Uh oh!

srikrrish commented May 8, 2026

Uh oh!

s-mayani commented May 8, 2026

Uh oh!

srikrrish commented May 8, 2026

Uh oh!

PaulFisch commented May 8, 2026

Uh oh!

srikrrish commented May 8, 2026

Uh oh!

s-mayani commented May 8, 2026

Uh oh!

aliemen May 8, 2026

Choose a reason for hiding this comment

Uh oh!

PaulFisch May 8, 2026

Choose a reason for hiding this comment

Uh oh!

s-mayani commented May 8, 2026

Uh oh!

aaadelmann commented May 8, 2026

Uh oh!

PaulFisch commented May 11, 2026

Uh oh!

srikrrish commented May 11, 2026

Uh oh!

PaulFisch commented May 11, 2026

Uh oh!

aaadelmann commented May 11, 2026

Uh oh!

PaulFisch commented May 29, 2026

Uh oh!

srikrrish commented May 29, 2026

Uh oh!

PaulFisch commented May 29, 2026

Uh oh!

aaadelmann commented May 29, 2026

Uh oh!

srikrrish commented May 30, 2026

Uh oh!

aaadelmann commented May 30, 2026

Uh oh!

PaulFisch commented May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants