Skip to content

PIF, Higher Order and Optimized Scatter/Gather, #501

Open
PaulFisch wants to merge 41 commits into
IPPL-framework:masterfrom
PaulFisch:pif-pr
Open

PIF, Higher Order and Optimized Scatter/Gather, #501
PaulFisch wants to merge 41 commits into
IPPL-framework:masterfrom
PaulFisch:pif-pr

Conversation

@PaulFisch
Copy link
Copy Markdown
Collaborator

This PR adds the Particle-in-Fourier method to IPPL, together with restructured FFT, scatter/gather, particle layout, communication, and timing infrastructure. A list of changes:

Build system

  • New IPPL_ENABLE_FINUFFT (CPU finufft / GPU cufinufft via FetchContent)
    and IPPL_ENABLE_CUFFTMP (cuFFTMp + NVSHMEM) build flags, both default OFF.
  • Kokkos default bumped to 5.0.0
  • cmake/AddIpplTest.cmake passes TEST_LINK_LIBS through to non-integration
    tests.
  • cmake/AutoTunePresets.cmake, IpplAutoTunePresets.h.in, and pre-tuned
    sweeps under cmake/auto_tune/sm_90/ ship the autotuner data alongside
    the build (for new scatter/gather kernels).

FFT refactor

  • The old FFT.h/.hpp is split into a Backend layer
    (Backend/{Backend,CuFFT,CuFFTMp,Heffte}.h) handling plan
    management, and a Transform layer
    (Transform/{CC,RC,PrunedCC,PrunedRC,Trig,NUFFT}.h) exposing per-transform
    front-ends.
  • A new native NUFFT (Type-1 and Type-2) lives under src/FFT/NUFFT/:
    • NativeNUFFT.h drives the upsampling-FFT pipeline.
    • ESKernel.h is an exponential-of-semicircle spreading kernel with
      width-specialised optimized evaluation.
    • Correction.h provides deconvolution factors, NUFFTUtilities.h
      provides helpers.
  • The new NUFFT type plugs into the Transform layer via NUFFT.{h,hpp}
    and supports both the native engine and the (cu)FINUFFT backend behind
    IPPL_ENABLE_FINUFFT.

Interpolation

  • New scatter dispatch in Interpolation/Scatter/: Scatter.h selects
    between AtomicScatter, GridParallelScatter (output-focused), and
    TiledScatter back-ends through ScatterConfig and a runtime
    TileSizeCache loaded from the autotune presets.
  • Matching gather front-end (AtomicGather, AtomicSort variants) under
    Interpolation/Gather/.
  • Binning.h bin-sorts particles by tile so hot tiles avoid atomic
    contention. Kernels.h, WidthDispatcher.h, and CoordinateTransform.h
    are shared with the native NUFFT.
  • AutoTune.{h,cpp} and TileSizeCache.cpp provide the runtime
    cache-lookup that lets the same binary pick a different tile size on
    different hardware.

Particle layout

  • ParticleSpatialLayout::update is rewritten from the per-iteration
    vector-of-vectors model into a packed locate + single-pass send/recv
    pipeline with reusable scratch (rankSendCount, sendOffsets, sendIds,
    cursor). Buffers are reused across updates and host-side hashes are
    avoided.
  • ParticleSort.h / SortBuffer.h provide a bin-sort with a shared
    device-side buffer pool (BinSortBuffers) so repeated sorts in
    scatter/gather hot paths reuse storage.
  • FieldLayout now stores nghost; HaloCells.{h,hpp} thread it through
    accumulateHalo / fillHalo, enabling kernel-width-aware halos (>1
    ghost cell). (Before this PR, nghosts>1 was broken in IPPL)
  • ParticleAttrib exposes scatterPIFNUFFT / gatherPIFNUFFT guarded by
    IPPL_ENABLE_FFT.
  • ParticleAttrib::scatter(...) / ParticleAttrib::gather(...) (CIC overloads used by
    every alpine PIC example) now
    forward to the kernel-aware path with LinearKernel<MeshType> and a
    default ScatterConfig / GatherConfig. PIC paths therefore inherit the
    same atomic / atomic-sort / tiled / output-focused dispatch and
    TileSizeCache lookup as NUFFT.

Communication and Particle Update

This PR removes a per-attribute buf_m send/recv staging view for every
ParticleAttrib. The old update path went
attribute view -> per-attribute buf_m -> MPI buffer → per-attribute buf_m->→ attribute view, with one buf_m per attribute per exchange and copies on
both ends. The new path serializes attribute data directly into a shared,
reused MPI buffer managed by a singleton BufferHandler, and
deserializes straight back into the destination attribute's storage at a
caller-supplied offset. The per-attribute buf_m is not needed for the common path.

  • ParticleAttrib::serialize / deserialize now operate on dview_m
    directly and accept a hash view (gather-on-the-fly of the live
    particle indices going to one rank) and a destination offset. There is
    no intermediate per-attribute buffer to size, allocate, copy into, or
    copy out of. getView() returns a subview over the live range only, so
    attribute kernels never see the overallocated tail.
  • A granularity-aware GPU allocator is added to Archive'. Device
    archives are allocated with raw cudaMalloc / hipMalloc so the
    pointer is page-aligned. Host
    archives keep using a regular Kokkos::View.

Utilities

  • IpplTimings.{h,cpp}: track per-call measurements alongside cumulative
    wall time, support resetAllTimers() (for warmup), and add
    dumpToCSV() for offline analysis. print() reports measurement counts
    and per-occurrence statistics (mean, stddev, min, max).
  • BufferView.h: lightweight, type-erased view over a contiguous device
    byte buffer, used by the new send/recv path.
  • Tuning.h: TileSizeTuner interface to Kokkos Tools autotuner that
    enumerates tile-size candidates for the new scatter/gather back-ends.
  • ParallelDispatch.h: helpers for selecting parallel-for vs.
    parallel-reduce policies based on view rank/extents and execution space.

Alpine

  • alpine/ElectrostaticPIF/{LandauDampingPIF, LandauDampingPIFPruned, BumponTailInstabilityPIF, PenningTrapPIF}.cpp: added as Alpine PIF examples.
  • alpine/ElectrostaticPIF/{ChargedParticlesPIF, ChargedParticlesPIFPruned}.hpp:
    particle-bunch container specialised for the PIF schemes

Unit tests

  • FFT/NUFFT.cpp + NUFFTTestUtils.h: test matrix for NUFFT Type-1
    and Type-2 (small / medium grid, with / without upsampling, three spread
    methods, tolerance sweep at 1e-4 / 1e-7 / 1e-10, FINUFFT cross-check if compiled,
    three gather methods).
  • FFT/NUFFTAccuracy.cpp: tolerance-vs-error sweep that compares the
    native NUFFT against a direct DFT reference.
  • Interpolation/KernelGatherScatterTest.cpp: charge conservation,
    periodicity, gather-of-constant-field, scatter/sort comparison,
    roundtrip; covers Atomic, Tiled, OutputFocused
  • Interpolation/Binning.cpp: bin-sort correctness and stability tests.
  • Particle/ParticleUpdate.cpp: spatial-layout update tests under uniform
    decomposition.
  • Particle/ParticleUpdateNonuniform.cpp: same for non-uniform domain
    decompositions

Integration tests

  • test/FFT/TestFFTCC.cpp extended for the new pruned variants and the
    heFFTe backend trait.
  • test/particle/benchmarkParticleUpdate.cpp switched to
    Kokkos::Random_XorShift64_Pool device init (avoids host->device copies
    between iterations) and adapted to the new ParticleSpatialLayout
    signature.

PaulFisch added 15 commits May 4, 2026 11:40
- Top-level CMakeLists.txt: add option(IPPL_ENABLE_FINUFFT) and
  option(IPPL_ENABLE_CUFFTMP), both defaulting OFF.
- cmake/Dependencies.cmake: add FetchContent block for (CU)FINUFFT and
  the cuFFTMp / NVSHMEM discovery block. Bump Kokkos default to 5.0.0
  and enable Kokkos_ENABLE_COMPILE_AS_CMAKE_LANGUAGE.
- cmake/InstallIppl.cmake: install finufft / cufinufft targets when
  built in-tree.
- cmake/AddIpplTest.cmake: pass TEST_LINK_LIBS through to non-integration
  tests as well.
- src/CMakeLists.txt: link Heffte/finufft/cufinufft only when FFT and
  the corresponding option are enabled; embed build metadata as an
  opt-in to reduce rebuild churn.
- .gitignore: exclude .idea*, cmake-build-*.
- Split the original FFT.h/FFT.hpp into:
  - Backend/{Backend,CuFFT,CuFFTMp,Heffte,Interface}.h: per-library
    backend selection and plan management.
  - Transform/{CC,RC,PrunedCC,PrunedRC,Trig,Common,Transform,
    NUFFT.{h,hpp}}.h: per-transform front-ends (complex-complex,
    real-complex, pruned variants, trigonometric, NUFFT).
  - Traits.h: backend availability traits and selection helpers.
- Add a native Kokkos NUFFT (Type-1 / Type-2) engine under FFT/NUFFT/:
  NativeNUFFT.h driving the upsampling-FFT pipeline, ESKernel.h
  exponential-of-semicircle spreading kernel with width-specialised
  Estrin polynomial evaluators, Correction.h deconvolution factors,
  Quadrature.h Gauss-Legendre quadrature, NUFFTUtilities.h helpers.
- The new NUFFT type plugs into the Transform layer via NUFFT.{h,hpp}
  and supports both the native engine and the (cu)FINUFFT backend
  behind IPPL_ENABLE_FINUFFT.
- Add Scatter/{Scatter,GridParallelScatter,AtomicScatter,TiledScatter,
  ScatterConfig,ScatterArgumentsBase,TileSizeCache}.h: a configurable
  scatter front-end that selects between atomic, output-focused, and
  tiled grid-parallel back-ends, with an optional tile-size cache for
  hardware-specific tuning data loaded at runtime.
- Add Gather/{Gather,AtomicGather,GatherConfig,GatherArgumentsBase}.h:
  matching gather front-end (Atomic / AtomicSort variants).
- Add Binning.h: bin-sort utilities used to group particles by tile so
  scatter/gather kernels can avoid atomics on hot tiles.
- Add Kernels.h: shared kernel-evaluation helpers consumed by the new
  scatter/gather kernels.
- Add WidthDispatcher.h: compile-time switch from runtime kernel width
  to the corresponding template specialization.
- Add CoordinateTransform.h: physical <-> grid coordinate helpers used
  by both scatter/gather and the native NUFFT.
- CIC.hpp: minor signature alignment to the new kernel concept.
- Particle/ParticleSpatialLayout.{h,hpp}: rewrite the spatial layout's
  update path to use a packed locate + single-pass send/recv pipeline
  with reusable scratch (rankSendCount, sendOffsets, sendIds, cursor)
  instead of the per-iteration vector-of-vectors approach. Reuses
  buffers across updates and avoids host-side hashes for the common
  case.
- Particle/ParticleSort.h, SortBuffer.h: bin-sort by spatial layout
  with a shared device-side buffer pool (BinSortBuffers) so repeated
  sorts in scatter/gather hot paths reuse storage.
- Particle/ParticleAttrib.{h,hpp}, ParticleAttribBase.h, ParticleBase.{h,
  hpp}: thread the new sort/bin-aware scatter and gather entry points
  through the attribute interface; add scatterPIFNUFFT/gatherPIFNUFFT
  guarded by IPPL_ENABLE_FFT.
- FieldLayout/{FieldLayout.h,FieldLayout.hpp,SubFieldLayout.hpp}: store
  nghost on the layout so the halo and neighbor lookup pick up the
  correct ghost width when more than one cell is needed.
- Field/HaloCells.{h,hpp}: thread nghost through accumulateHalo and
  fillHalo so the kernel-width-aware halo path works.
- IpplTimings.{h,cpp}: track per-call measurements alongside the
  cumulative wall time, support resetAllTimers() (for warmup), and
  add dumpToCSV() for offline analysis. Print() now also reports
  measurement counts and per-occurrence statistics (mean, stddev,
  min, max).
- BufferView.h: lightweight, type-erased view over a contiguous
  device byte buffer used by the new send/recv path.
- Tuning.h: TileSizeTuner — Kokkos-tools-driven autotuner that
  enumerates tile-size candidates for the new scatter/gather back-ends.
- ParallelDispatch.h: helpers for selecting parallel-for vs
  parallel-reduce policies based on view rank/extents and execution
  space.
- TypeUtils.h: detail::is_complex_v / MultispaceContainer helpers
  used by the buffer-handler-aware logging path and by the field
  inner-product (Hermitian for complex T).
- ParameterList.h, Timer.cpp, ViewUtils.h: small follow-on tweaks.
- Archive.{h,hpp}: extend the device-side serialize/deserialize path
  to handle Vector<T,Dim> + hash views (used by the new packed
  particle send/recv pipeline) and add a granularity-aware GPU buffer
  allocator. On HIP, the GPU page granularity (64 KiB on MI250X /
  MI300X) is required by HSA IPC for large-message GPU-aware MPI.
- BufferHandler.hpp: round buffer requests up to a 4 KiB page so we
  do not churn the GPU-aware MPI registration cache with tiny
  allocations; allocate a fresh buffer on grow rather than realloc to
  avoid stale-IPC issues with reused virtual addresses.
- Communicator.{h,cpp}: add archive-only isend/recv overloads (used
  by the new layout); switch buffer handlers to a static singleton
  via getBufferHandler() so all communicators share one pool.
- CommunicatorLogging.cpp: gatherLocalLogs() now uses an is_a_logger
  SFINAE guard so we only call getLogs() on a LoggingBufferHandler.
- Buffers.{cpp,hpp}: route deleteAllBuffers/freeAllBuffers/getBuffer
  through the singleton handler.
- Environment.{h,cpp}: initialize MPI with MPI_THREAD_MULTIPLE and
  expose threadMultiple() so callers know whether multithreaded sends
  are safe.
- Field/BareField.{h,hpp}: drop the nghost default on initialize() /
  updateLayout() now that callers have to pick the correct width;
  use ippl::detail::infinity<T>() for the Vector min/max identities
  so reductions with integer T compile.
- Field/BareFieldOperations.hpp: support complex-valued inner products
  via Hermitian conjugation and split allreduce of real/imag parts
  (std::plus<complex> is not a standard MPI Op).
- Field/Field.hpp: small helper for the new scatter integration.
- PoissonSolvers/FFT{Open,Periodic,TruncatedGreenPeriodic}PoissonSolver.h:
  pull heffteBackend from the new fft::HeffteBackend trait instead of
  FFT_t::heffteBackend.
- Types/IpplTypes.h, Vector.hpp: expose detail::infinity<T> for both
  floating-point and integer T (used by the BareField reduction
  identities).
- FEM/LagrangeSpace.hpp: simplify a parameter type (point_t alias).
- Ippl.cpp: call detail::finalizeBinSortBuffers() on shutdown to
  release the new sort scratch.
- Ippl.h: drop the implicit ParallelDispatch include; consumers that
  need it (HaloCells, ParticleSpatialLayout) include it directly.
- alpine/ElectrostaticPIF/{LandauDampingPIF,LandauDampingPIFPruned,
  BumponTailInstabilityPIF,PenningTrapPIF}.cpp: physics drivers using
  the new NUFFT engine to scatter charge onto Fourier modes and
  gather the resulting field at particle locations.
- alpine/ElectrostaticPIF/{ChargedParticlesPIF,ChargedParticlesPIFPruned}.hpp:
  particle-bunch container specialized for the PIF schemes
  (NUFFT-based density and force evaluation, energy diagnostics).
- alpine/ElectrostaticPIF/CMakeLists.txt: build the four executables;
  link (cu)finufft when IPPL_ENABLE_FINUFFT is on.
- alpine/CMakeLists.txt: register ElectrostaticPIF as an alpine
  subdirectory.
- alpine/{BumponTailInstabilityManager,LandauDampingManager,
  PenningTrapManager}.h, ParticleContainer.hpp: small adjustments
  picked up via the new ParticleSpatialLayout signatures.
…ticleUpdate

- FFT/NUFFT.cpp + NUFFTTestUtils.h: TYPED_TEST matrix for NUFFT
  Type-1 and Type-2 (small/medium grid, with/without upsampling,
  three spread methods, tolerance sweep at 1e-4/1e-7/1e-10, FINUFFT
  cross-check, three gather methods).
- FFT/NUFFTAccuracy.cpp: tolerance-vs-error sweep that compares the
  native NUFFT against a direct DFT reference.
- Interpolation/KernelGatherScatterTest.cpp + Utils.h: charge
  conservation, periodicity, gather-of-constant-field, scatter/sort
  comparison, roundtrip; covers Atomic, Tiled, OutputFocused
  variants (incl. Z-batched).
- Interpolation/Binning.cpp: bin-sort correctness and stability tests
  for the new bin_sort path.
- Particle/ParticleUpdate.cpp: spatial-layout update tests under
  uniform decomposition.
- Particle/ParticleUpdateNonuniform.cpp: same but for non-uniform
  domain decompositions to exercise the rebinning logic.
- Particle/ParticleSendRecv.cpp: minor updates for the new
  Communicator API.
- Wire all new tests into the CMakeLists.txts for unit_tests/{FFT,
  Interpolation,Particle}.
- test/FFT/TestFFTCC.cpp: extend the existing FFT correctness check
  to exercise the new pruned variants and the heFFTe backend trait.
- test/particle/benchmarkParticleUpdate.cpp: switch warmup to
  Kokkos::Random_XorShift64_Pool device init (avoids host->device
  copies between iterations) and adjust to the new
  ParticleSpatialLayout signature.
- test/particle/benchmarkParticleUpdateScaling.cpp: new strong-scaling
  benchmark that exercises ParticleSpatialLayout::update across a
  range of particle counts and ranks.
- test/particle/CMakeLists.txt: register the new scaling benchmark.
- test/{kokkos/TestVectorField{,2,3,4},field/TestFieldBC,particle/PICnd,
  serialization/serialize01}.cpp: minor signature touch-ups
  (host_mirror_type, isAllPeriodic flag, etc.) so the existing
  integration tests still build against the merged-in upstream and
  the new layout APIs.
- test/CMakeLists.txt: minor adjustment to keep all subdirectories
  registering correctly under the new structure.
- Correction.h: applyPreCorrectionType2 now multiplies by factor (not
  conj(factor)), matching the working pruned path.
- AtomicScatter.h: clamp team_size and vector_length to 1 on Serial.
- BufferHandler test: use page-aligned expected sizes; resized larger
  test now expects the smaller free buffer to be retained.
- NUFFT test: zero the non-corner-DC band of the upsampled input.
- ParticleBase test: re-acquire host mirror after destroy().
- LandauDampingManager: open the manager CSV in OVERWRITE on the first
  step and APPEND afterwards, so re-running the test does not produce a
  file with multiple header blocks.
- NUFFTAccuracy: call computeDFTReference before the owned-mode early-
  exit so every rank participates in the same MPI_Allreduce sequence.
- GatherScatterTest::ScatterCustomHashTest: size the host fill loop by
  the post-update local count (getHostMirror() now returns a mirror of
  the live-particle range).
- ParticleUpdateNonuniform::ParticleInjectionBetweenOrbRepartitions:
  the test creates a fresh bunch each cycle, so compare against
  injectPerCycle rather than a cumulative count.
- unit_tests/FFT/CMakeLists.txt: bump NUFFTAccuracy timeout to 600s.
- CMakeLists.txt: add -fno-lto for the CUDA platform to avoid host-LTO
  conflicts with the duplicate fatbinData symbols nvcc emits.
- AddIpplTest.cmake: introduce IPPL_DEFAULT_INTEGRATION_TIMEOUT=300 and
  use it for INTEGRATION tests (the unit-test default of 60s was too
  tight on slower OpenMP runs).
- BareFieldOperations.hpp: pre-capture view1/view2 with (void)-casts so
  nvcc accepts the extended host/device lambda when the constexpr-if
  branches first-use them.
- NUFFT.cpp: in runType2Test, compare absolute error against the
  L-infinity-norm-scaled tolerance instead of a pointwise relative
  error. The pointwise refReal can be arbitrarily close to zero for
  random fields, which made the test sensitive to the (rank, thread)
  layout that drives the random seeding.
- Bump TIMEOUTs that were overshooting on multi-thread OpenMP:
  unit_tests/FFT NUFFTAccuracy (1800), Interpolation
  KernelGatherScatterTest (600), PIC ORB (300), test/solver/fem
  TestNonhomDirichlet_1d_preconditioned/TestPeriodicBC_sin{,_preconditioned}
  (600). Also fix the misplaced TIMEOUT keyword inside LABELS lists in
  the solver/fem CMakeLists.
…ives

Archive<Properties...> previously took the cudaMalloc/hipMalloc path
unconditionally whenever KOKKOS_ENABLE_CUDA or KOKKOS_ENABLE_HIP was
defined. That meant Archive<Kokkos::HostSpace> — used by host-side
exec spaces such as Kokkos::Serial / Kokkos::OpenMP — was backed by a
device pointer, but its serialize() runs std::memcpy on the host and
therefore segfaulted on the very first MPI exchange under a CUDA
build with more than one rank.

Switch the storage choice to depend on the archive's memory space
rather than the build configuration. Device archives still get the raw
cuda/hipMalloc path (preserving the page-aligned / IPC-stable behaviour
needed for the Cray-MPICH HIP IPC bug); host-accessible archives use a
regular Kokkos::View<char*, MemorySpace> and the host-side memcpy in
serialize() now writes to a host pointer.
…work

ParticleAttrib::scatter(...) and ParticleAttrib::gather(...) (the
kernel-less, atomic-CIC overloads used by every alpine PIC example)
now forward to the kernel-aware scatter_kernel/gather path with
LinearKernel<MeshType> and a default ScatterConfig/GatherConfig. This
gives the PIC paths the same atomic / atomic-sort / tiled / output-
focused dispatch (and TileSizeCache lookup) as NUFFT, while keeping
the legacy hash-and-custom-range overload available for
TestHashedScatter / GatherScatterTest::ScatterCustomHashTest.

Higher-dimensional fields (Dim > 3) still use the direct CIC kernel
because the new framework's AtomicScatter/AtomicGather only have
1-, 2-, 3-D specialised paths.

Interpolation::Gather/Scatter::DeduceTypes now derives RealType from
the kernel's value_type instead of from the position attribute. With
mixed precisions (float positions on a double mesh, e.g. PICTest with
float T) this avoids silently downcasting the mesh spacing and
preserves bit-identical results vs. the legacy CIC path
(ASSERT_DOUBLE_EQ in PICTest::Gather is now exact again, no
relaxation needed).
@aaadelmann aaadelmann requested review from aaadelmann, aliemen and srikrrish and removed request for aliemen May 4, 2026 10:31
@aaadelmann aaadelmann added enhancement New feature or request gitlab-mirror labels May 4, 2026
Comment thread alpine/ParticleContainer.hpp Outdated
Comment thread cmake/Dependencies.cmake
Comment thread src/FFT/Backend/Heffte.h Outdated
Comment thread src/FFT/NUFFT/ESKernel.h
Comment thread unit_tests/FFT/NUFFT.cpp Outdated
@srikrrish
Copy link
Copy Markdown
Member

cscs-ci run cscs-ci-gh200, cscs-ci-mi300, cscs-ci-openmp

@PaulFisch
Copy link
Copy Markdown
Collaborator Author

I still have a commit fixing some of the compile failures on juwels, I will also address your comments @srikrrish

@aliemen
Copy link
Copy Markdown
Collaborator

aliemen commented May 7, 2026

Alright, after the size_type fix, the cmake path fix and the R.extent(0) vs R.size() fix, I was able to compile and run OPALX. Then, I had to fix a unit test that couldn't handle the temporary subview returned by getView - but that was an OPALX problem again.

Now, all of our unit and regression tests passed cleanly on CPU and GPU (GH200 on Daint). Our regression tests only test single rank. But from my local runs, all of them looked good with up to 8 ranks (CPU).

Note that I pushed a branch here to IPPL, test-pr-501, that I used to compile OPALX against where I applied my fixes. Feel free to look at it if you want to see what would work with OPALX. Changes I made to OPALX are in 412-..., will also make PR as soon as possible.

@s-mayani
Copy link
Copy Markdown
Collaborator

s-mayani commented May 8, 2026

Alps scaling studies, using the same configurations (512^3 and 1024^3) as the paper:

TOTAL_scalings_1024_paper_alps.pdf
TOTAL_scalings_512_paper_alps.pdf

I have added a line comparing with the paper. The FFT clearly benefits from the update's improvement. Something that caught my eye is in the 1024^3 case, at the very end (512 GPUs) this branch ("Paul" in the legend) seems to flatten and stop scaling, whereas in the paper this was not the case. I don't know why this is.

@srikrrish
Copy link
Copy Markdown
Member

Alps scaling studies, using the same configurations (512^3 and 1024^3) as the paper:

TOTAL_scalings_1024_paper_alps.pdf TOTAL_scalings_512_paper_alps.pdf

I have added a line comparing with the paper. The FFT clearly benefits from the update's improvement. Something that caught my eye is in the 1024^3 case, at the very end (512 GPUs) this branch ("Paul" in the legend) seems to flatten and stop scaling, whereas in the paper this was not the case. I don't know why this is.

Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.

@s-mayani
Copy link
Copy Markdown
Collaborator

s-mayani commented May 8, 2026

Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.

Yeah sorry I mean only the PCG solver.

@srikrrish
Copy link
Copy Markdown
Member

Thanks a lot Sonali. Yeah the FFT PIC is vastly improved now! For the flattening at 512 GPUs (for 1024^3 case) do you mean the PCG solver? Because I don't see that in FFT or FEM.

Yeah sorry I mean only the PCG solver.

Could you may be just submit the run corresponding to that data point only to see if it may be some one time effect?

@PaulFisch
Copy link
Copy Markdown
Collaborator Author

Generally the PCG performance decreased, right? Can you send me the script that you submitted with, then I can try to analyze the cause

After hours of debugging yesterday night, I found that heffte is fundamentally broken with the ROCM/MPICH combination that we pass as linker flags in the LUMI IPPL builds. I was able to reproduce in a standalone project without Kokkos. The segfault has nothing to do with my branch and just seems to appear more often for some reason in my branch than in master. When enabling enough HSA debugging env variables, the segfault happens reliable in the first warmup FFT. I will try building on different environments to see if that makes it work. Unsure whether this is a heffte, Cray MPICH, HSA or rocm compiler bug.

@srikrrish
Copy link
Copy Markdown
Member

Generally the PCG performance decreased, right? Can you send me the script that you submitted with, then I can try to analyze the cause

After hours of debugging yesterday night, I found that heffte is fundamentally broken with the ROCM/MPICH combination that we pass as linker flags in the LUMI IPPL builds. I was able to reproduce in a standalone project without Kokkos. The segfault has nothing to do with my branch and just seems to appear more often for some reason in my branch than in master. When enabling enough HSA debugging env variables, the segfault happens reliable in the first warmup FFT. I will try building on different environments to see if that makes it work. Unsure whether this is a heffte, Cray MPICH, HSA or rocm compiler bug.

Thanks a lot. Seems something that @paubauer may be able to help. The standalone reproducer could be quite useful for him. No need to spend a lot of time on this especially if it seems to lead to a rabbit hole.

@s-mayani
Copy link
Copy Markdown
Collaborator

s-mayani commented May 8, 2026

I launched the job for that specific data point, it's in the queue. As for the heffte issue, seems related to Issue #481 .

Comment on lines +217 to +220
if constexpr (Types::KernelType::has_width_template)
kw_ptr[D * W + i] = args.kernel.template eval<W>(xi);
else
kw_ptr[D * W + i] = args.kernel(xi);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, this is a problem with cuda 12.1 and gcc 12 on Gwendolen. I cannot compile OPALX against it:

.../cuda-ampere80/build/_deps/ippl-src/src/Interpolation/Scatter/AtomicScatter.h(218): error: class "ippl::Interpolation::LinearKernel<double>" has no member "eval"
                          kw_ptr[D * W + i] = args.kernel.template eval<W>(xi);
                                                                   ^

CUDA 12.1 with GCC 12 is diagnosing the discarded eval<W> expression inside a Kokkos device lambda/team kernel. Newer CUDA, like 12.9 with GCC 14, apparently handles that template/device-lambda case correctly...

We could either find a workaround (like making all interpolation kernels provide a harmless width-templated wrapper) or we say "doesn't work with cuda/12.1" and ignore it. But that would mean we need a new module on Gwendolen (which is the case for Kokkos 5 anyways). @aaadelmann what's your opinion?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There were still a lot of compiler bugs with this kind of expressions in cuda 12.9, probably it's not too difficult to build a workaround on my side, I will check.

@s-mayani
Copy link
Copy Markdown
Collaborator

s-mayani commented May 8, 2026

The new run still produced a similar timing, with the PCG flattening at 512 GPUs.

@aaadelmann
Copy link
Copy Markdown
Member

@s-mayani if you still have time until departure, send me all timing files.

@PaulFisch
Copy link
Copy Markdown
Collaborator Author

I tried to look a but at PCG profiles - most time is spent in the cudaMallocs and cudaFree in every iteration. On my tests, my branch was a bit faster, but maybe I used a different preconditioner. I have two theories: a) My branch does allocation of other stuff differently, and this affects the timing of allocations in the PCG, b) My branch needs more iterations to converge for some reason. I will investigate further. Likely the best long-term solution would be to pre-allocate the views in the PCG

@srikrrish
Copy link
Copy Markdown
Member

I tried to look a but at PCG profiles - most time is spent in the cudaMallocs and cudaFree in every iteration. On my tests, my branch was a bit faster, but maybe I used a different preconditioner. I have two theories: a) My branch does allocation of other stuff differently, and this affects the timing of allocations in the PCG, b) My branch needs more iterations to converge for some reason. I will investigate further. Likely the best long-term solution would be to pre-allocate the views in the PCG

More iterations to converge seems to be the most concerning thing for me. @s-mayani Do you know whether you are also seeing the same behaviour in your tests?

@PaulFisch
Copy link
Copy Markdown
Collaborator Author

I should clarify that I did not actually see my branch needing more iterations to converge. I was just speculating what might be the issue for the difference in performance. If it is the mallocs, the cleaner way is likely to slightly refactor the PCG not to malloc in every iteration of the CG, as that would give a massive performance boost. Given that I did not observe a slow down within an iteration in my tests, I just stated that theoretically it could be that my branch needs more iterations, which would also explain longer runtime. That would be something that I would need to investigate, if that would be the case.

@aaadelmann
Copy link
Copy Markdown
Member

If you could paste the timing-profiles into this PR, I will make an issue out of it.

PaulFisch added 6 commits May 29, 2026 16:46
Squashes four small cleanup commits previously prepared on pif-bo:

  - Doxygen comments on public APIs across FFT, Interpolation, Particle,
    Utility, and Communicate.
  - Trailing-whitespace / tab / newline normalization.
  - Replace raw `int` overalloc counters with `size_type`.
  - Replace `size_t` / `uint64_t` with `size_type` where appropriate.

Pure code-quality changes; no behavior change.
NSYS profiles showed cudaMalloc / cudaFree firing inside every (P)CG
iteration. Two sources:

  - PCG::operator() created r, d, s, q as locals on every solve and
    used .deepCopy() on the preconditioner result inside the loop.
    Move r/d/q to CG protected members, add s and pcond_out as PCG
    members, allocate all of them once via an overridden
    initializeFields(), and use updateLayout() per solve to track
    load-balance repartitions.

  - The preconditioner returned a fresh Field per call. Switch the
    API to operator()(u, result), writing into a caller-provided
    buffer. Each concrete preconditioner gains an init_fields()
    override with a fields_initialized_m lazy-init flag so scratch
    is allocated once and only updateLayout()'d thereafter.

The PCG result staging field pcond_out keeps its default NoBcFace
BCs so the preconditioner's internal operator chain never triggers
PeriodicFace::apply MPI calls -- this preserves the master code
path's MPI sequence and avoids an intermittent multi-rank halo
deadlock seen during solves.

Kept as a separate commit so it can be reverted independently while
the cluster deadlock is investigated.
The PIF example pair (ChargedParticlesPIF + LandauDampingPIF) and its
"Pruned" near-duplicates differed only in a boolean (use_upsampled_inputs
true vs false) and the upsampling step that grew the field-storage grid.
Drop the duplicate files and expose the choice as:

  - ChargedParticlesPIF constructor parameter useUpsampledInputs (default
    true, matching the original ChargedParticlesPIF behaviour). The flag
    drives the NUFFT plan's use_upsampled_inputs setting.

  - LandauDampingPIF optional 10th positional argv: "pruned" selects the
    pruned pipeline (no 2x grid upsample, original-resolution rho/Sk
    fields); anything else / absent keeps the original upsampled
    behaviour. Output CSV filename gets the matching "Pruned" suffix in
    that mode.

Net: -996 LOC across two deleted files; both variants reachable from one
source.
…build

transformFinufft and the cuFINUFFT/FINUFFT scratch buffers are hardcoded
for 3D (Rank<3> MDRangePolicy + tempField_m typed as Kokkos::View<***>).
Until those gain 2D variants the constexpr branch in FFT::transform() must
also check Dim == 3, otherwise FINUFFT-enabled builds eagerly instantiate
the 3D-only transformFinufft body for 2D FFT<NUFFTransform, ...> types
exercised by unit_tests/FFT/NUFFT.cpp and the CUDA TU fails to compile.

No behaviour change for 3D users (the runtime check on useFinufft_m is
preserved). 2D NUFFT continues to route through transformNative as before
on FINUFFT-disabled builds.
The earlier Dim==3 guard on FFT::transform() is necessary but not
sufficient: the FFT<NUFFTransform, ...> constructor unconditionally
calls allocateFinufftBuffers() (which reads lDom[2]) and initBackend()
calls initFinufft() when use_finufft is requested. For 2D test
instantiations of FFT<NUFFTransform, ...> on a FINUFFT-enabled build,
those paths read past the local NDIndex and segfault at construction.

Wrap both call sites in `if constexpr (Dim == 3)`; the native NUFFT
backend already handles 2D so transform()/initBackend() fall through to
it. Restores NUFFT unit-test pass (single-rank 60/60) on the CUDA
backend.
…udaMalloc in halo)

NSYS profile (cg_n2 chebyshev 14 outer * 31 inner = 434 inner ops) showed
~1 cudaMalloc per inner op in multi-rank runs, which the previous PCG/
Preconditioner hoist (40438d1) did not eliminate. Root cause: the
op_m wrapper macro

  #define IPPL_SOLVER_OPERATOR_WRAPPER(fun, type) [](type arg) { ... }

takes the Field by value and the OperatorF typedefs match
(std::function<OperatorRet(lhs_type)>). On every op_m(d) call the Field
is copy-constructed. BareField's Kokkos::View members are
reference-counted-shared so the data is not copied, but the embedded
HaloCells::haloData_m.buffer is a 0-initialised View on first use --
pack()'s Kokkos::realloc(buffer, size*overalloc) then grew the *copy*
and the grown buffer was destroyed when the lambda returned, so the
next op_m(d) call copied a 0-sized buffer again and reallocated.

Fix in three places, all converging on the same signature change:

  - LinearSolvers/Preconditioner.h: macro takes `type& arg`.
  - LinearSolvers/PCG.h: OperatorF/LowerF/UpperF/UpperLowerF/InverseDiagF/
    DiagF use std::function<...(lhs_type&)>. Also update
    is_same_v<InvDiagF, std::function<double(Field&)>> checks in
    Preconditioner.h to match.
  - PoissonSolvers/PoissonCG.h: drop the duplicated macro definition
    that silently shadowed the corrected one with the old by-value form.

NSYS numbers (n=2 chebyshev 14 outer * 31 inner, rank 1):
  cudaMalloc before: 505
  cudaMalloc after:   28

Same iteration counts and residue norms (verified across {noprecond,
jacobi, newton-3, chebyshev-31, gauss-seidel, ssor, richardson} *
{OpenMP, CUDA} * {n=1, 2, 4}).
@PaulFisch
Copy link
Copy Markdown
Collaborator Author

I have now fixed all the reviews and tried to address the PCG problem.
scaling
I measured now a speedup on PCG, let me know if there is still a regression observed. I just realised that I did not update the branch wrt to the master, will try to merge with the master.

@srikrrish
Copy link
Copy Markdown
Member

I have now fixed all the reviews and tried to address the PCG problem. scaling I measured now a speedup on PCG, let me know if there is still a regression observed. I just realised that I did not update the branch wrt to the master, will try to merge with the master.

This looks great! Thanks a lot @PaulFisch !

# Conflicts:
#	src/Communicate/Archive.hpp
#	src/Field/BareField.hpp
#	src/Field/BareFieldOperations.hpp
#	src/Field/FieldBufferOps.hpp
#	src/Particle/ParticleSpatialLayout.h
#	src/Particle/ParticleSpatialLayout.hpp
@PaulFisch
Copy link
Copy Markdown
Collaborator Author

Will see if CI passes now, otherwise will try to fix. The two problems discussed earlier with hangs and PCG slowdown seemed to be connected - I was not able to reproduce the deadlock without PCG anymore, and with my updates to PCG, it did not hang anymore. I now modified so that PCG does not do any rellocations within the CG loop, but I pushed that (i.e. asked an LLM to squash onto) a single commit, so it should be somewhat easily separable from the rest.

@aaadelmann
Copy link
Copy Markdown
Member

Awesome !!!!!

@srikrrish
Copy link
Copy Markdown
Member

cscs-ci run cscs-ci-gh200, cscs-ci-mi300, cscs-ci-openmp

@aaadelmann
Copy link
Copy Markdown
Member

@PaulFisch are you done with working on this PR?

@PaulFisch
Copy link
Copy Markdown
Collaborator Author

Yes, I think it should work now, I saw the mi300 CI failed due to some CI issues - let me know if there are any problems

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request gitlab-mirror

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants