Skip to content

Add sorting callback#1044

Draft
LasNikas wants to merge 6 commits intotrixi-framework:mainfrom
LasNikas:sorting-callback
Draft

Add sorting callback#1044
LasNikas wants to merge 6 commits intotrixi-framework:mainfrom
LasNikas:sorting-callback

Conversation

@LasNikas
Copy link
Collaborator

@LasNikas LasNikas commented Jan 13, 2026

not sorted sorted
image image

Background

I ran a simulation with approximately 1.8 million fluid particles.
After restarting the simulation at t = 5.0 s, I observed a severe performance regression, despite no code changes between the initial run and the restart.

For the same number of iterations, the wall time increased from ~4.4 h to ~20 h.
image

At first glance, this was surprising because:

  • Allocation behavior did not change noticeably.
  • The slowdown was almost entirely in the fluid–fluid interaction.

To isolate the issue, I ran the simulation for 4 iterations starting from:

  • t = 0.0 s, and
  • t = 5.0 s (restart state),

The performance regression:

  • Is fully reproducible
  • Occurs on different hardware (MI300A CPU and Apple M4)
  • Manifests primarily in fluid-fluid interactions

t = 0.0s

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi simulation finished.  Final time: 5.1580098904047174e-5  Time steps: 4 (accepted), 4 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

──────────────────────────────────────────────────────────────────────────────────────────────
             TrixiParticles.jl                       Time                    Allocations
                                            ───────────────────────   ────────────────────────
             Tot / % measured:                   13.3s /  74.7%            609MiB /  41.4%

Section                             ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────────────
kick!                                   26    7.62s   76.8%   293ms   59.4MiB   23.6%  2.28MiB
  system interaction                    26    5.23s   52.7%   201ms   55.6MiB   22.1%  2.14MiB
    fluid1-fluid1                       26    3.59s   36.2%   138ms   19.2KiB    0.0%     755B
    open_boundary2-open_boundary2       26    617ms    6.2%  23.7ms   16.7MiB    6.6%   657KiB
    fluid1-boundary3                    26    471ms    4.7%  18.1ms   17.5KiB    0.0%     691B
    fluid1-open_boundary2               26    231ms    2.3%  8.88ms   16.6MiB    6.6%   652KiB
    open_boundary2-fluid1               26    152ms    1.5%  5.86ms   13.9MiB    5.5%   549KiB
    open_boundary2-boundary3            26    150ms    1.5%  5.77ms   8.19MiB    3.3%   323KiB
    ~system interaction~                26   11.6ms    0.1%   445μs    153KiB    0.1%  5.87KiB
    boundary3-fluid1                    26   11.9μs    0.0%   458ns     0.00B    0.0%    0.00B
    boundary3-open_boundary2            26   11.5μs    0.0%   442ns     0.00B    0.0%    0.00B
    boundary3-boundary3                 26   9.29μs    0.0%   357ns     0.00B    0.0%    0.00B
  update systems and nhs                26    2.30s   23.2%  88.4ms    844KiB    0.3%  32.5KiB
    ~update systems and nhs~            26    1.73s   17.4%  66.5ms    517KiB    0.2%  19.9KiB
    compute boundary pressure           26    240ms    2.4%  9.24ms   79.2KiB    0.0%  3.05KiB
    update nhs                          26    160ms    1.6%  6.17ms   48.8KiB    0.0%  1.88KiB
    update density diffusion            52    151ms    1.5%  2.90ms    190KiB    0.1%  3.66KiB
    inverse state equation              26   16.4ms    0.2%   630μs   8.53KiB    0.0%     336B
  source terms                          26   79.7ms    0.8%  3.06ms   2.96MiB    1.2%   117KiB
  reset ∂v/∂t                           26   9.03ms    0.1%   347μs      832B    0.0%    32.0B
  ~kick!~                               26   3.72ms    0.0%   143μs   1.55KiB    0.0%    60.9B
update callback                          4    2.07s   20.8%   517ms    185MiB   73.4%  46.2MiB
  update systems and nhs                 4    1.27s   12.8%   317ms   48.8MiB   19.4%  12.2MiB
  update open boundary                   4    796ms    8.0%   199ms    136MiB   54.1%  34.0MiB
    check domain                         4    571ms    5.8%   143ms    127MiB   50.5%  31.8MiB
    update boundary quantities           4    142ms    1.4%  35.5ms   3.75MiB    1.5%   960KiB
    flow rate calculation                4   82.8ms    0.8%  20.7ms   5.32MiB    2.1%  1.33MiB
    ~update open boundary~               4    607μs    0.0%   152μs   1.55KiB    0.0%     396B
    update pressure model                3   11.9μs    0.0%  3.98μs     0.00B    0.0%    0.00B
  ~update callback~                      4    513μs    0.0%   128μs      976B    0.0%     244B
drift!                                  26    235ms    2.4%  9.05ms   7.57MiB    3.0%   298KiB
  velocity                              26    227ms    2.3%  8.72ms   7.57MiB    3.0%   298KiB
  reset ∂u/∂t                           26   8.33ms    0.1%   321μs      832B    0.0%    32.0B
  ~drift!~                              26    463μs    0.0%  17.8μs      976B    0.0%    37.5B
──────────────────────────────────────────────────────────────────────────────────────────────
julia> using BenchmarkTools

julia> dv_ode, du_ode = copy(sol.u[end]).x; v_ode, u_ode = copy(sol.u[end]).x;

julia> @benchmark TrixiParticles.interact!($dv_ode, $v_ode, $u_ode, $fluid_system, $fluid_system, $semi)
BenchmarkTools.Trial: 40 samples with 1 evaluation per sample.
 Range (min  max):  124.624 ms  132.314 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     126.240 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   126.689 ms ±   2.207 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▂
  ██▃▁▁▁▅▁▃▁▁▁▁▁▁▁▁▃▅▁▃▁█▆▁▁▁▃▁▃▁▃▁▁▁▁▁▁▁▁▁▅▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▃▃ ▁
  125 ms           Histogram: frequency by time          132 ms <

 Memory estimate: 736 bytes, allocs estimate: 1.

t = 5.0

Trixi simulation finished.  Final time: 5.000173614232843  Time steps: 4 (accepted), 4 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

──────────────────────────────────────────────────────────────────────────────────────────────
             TrixiParticles.jl                       Time                    Allocations
                                            ───────────────────────   ────────────────────────
             Tot / % measured:                   62.9s /  94.5%            606MiB /  41.1%

Section                             ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────────────
kick!                                   26    54.9s   92.4%   2.11s   54.8MiB   22.0%  2.11MiB
  system interaction                    26    38.6s   65.0%   1.48s   51.0MiB   20.5%  1.96MiB
    fluid1-fluid1                       26    36.1s   60.8%   1.39s   19.2KiB    0.0%     755B
    fluid1-boundary3                    26    933ms    1.6%  35.9ms   17.5KiB    0.0%     691B
    open_boundary2-open_boundary2       26    665ms    1.1%  25.6ms   16.7MiB    6.7%   657KiB
    open_boundary2-fluid1               26    445ms    0.7%  17.1ms   9.37MiB    3.8%   369KiB
    fluid1-open_boundary2               26    286ms    0.5%  11.0ms   16.6MiB    6.7%   652KiB
    open_boundary2-boundary3            26    148ms    0.2%  5.69ms   8.18MiB    3.3%   322KiB
    ~system interaction~                26   10.3ms    0.0%   398μs    153KiB    0.1%  5.87KiB
    boundary3-fluid1                    26   10.7μs    0.0%   413ns     0.00B    0.0%    0.00B
    boundary3-open_boundary2            26   4.91μs    0.0%   189ns     0.00B    0.0%    0.00B
    boundary3-boundary3                 26   2.58μs    0.0%  99.2ns     0.00B    0.0%    0.00B
  update systems and nhs                26    16.2s   27.2%   622ms    844KiB    0.3%  32.5KiB
    ~update systems and nhs~            26    15.3s   25.8%   589ms    517KiB    0.2%  19.9KiB
    update nhs                          26    353ms    0.6%  13.6ms   48.8KiB    0.0%  1.88KiB
    compute boundary pressure           26    310ms    0.5%  11.9ms   79.2KiB    0.0%  3.05KiB
    update density diffusion            52    155ms    0.3%  2.99ms    190KiB    0.1%  3.66KiB
    inverse state equation              26   18.8ms    0.0%   721μs   8.53KiB    0.0%     336B
  source terms                          26   96.0ms    0.2%  3.69ms   2.99MiB    1.2%   118KiB
  reset ∂v/∂t                           26   13.6ms    0.0%   524μs      832B    0.0%    32.0B
  ~kick!~                               26   3.75ms    0.0%   144μs   1.55KiB    0.0%    60.9B
update callback                          4    4.29s    7.2%   1.07s    188MiB   75.6%  47.0MiB
  update systems and nhs                 4    3.43s    5.8%   857ms   51.1MiB   20.5%  12.8MiB
  update open boundary                   4    858ms    1.4%   214ms    137MiB   55.1%  34.2MiB
    check domain                         4    626ms    1.1%   156ms    128MiB   51.4%  32.0MiB
    update boundary quantities           4    140ms    0.2%  35.0ms   3.75MiB    1.5%   960KiB
    flow rate calculation                4   91.2ms    0.2%  22.8ms   5.31MiB    2.1%  1.33MiB
    ~update open boundary~               4    678μs    0.0%   169μs   1.55KiB    0.0%     396B
    update pressure model                3   13.5μs    0.0%  4.50μs     0.00B    0.0%    0.00B
  ~update callback~                      4    522μs    0.0%   130μs      976B    0.0%     244B
drift!                                  26    245ms    0.4%  9.41ms   5.93MiB    2.4%   234KiB
  velocity                              26    236ms    0.4%  9.07ms   5.93MiB    2.4%   234KiB
  reset ∂u/∂t                           26   8.51ms    0.0%   327μs      832B    0.0%    32.0B
  ~drift!~                              26    480μs    0.0%  18.5μs      976B    0.0%    37.5B
──────────────────────────────────────────────────────────────────────────────────────────────
julia> dv_ode, du_ode = copy(sol_restart.u[end]).x; v_ode, u_ode = copy(sol_restart.u[end]).x;

julia> @benchmark TrixiParticles.interact!($dv_ode, $v_ode, $u_ode, $fluid_system, $fluid_system, $semi)
BenchmarkTools.Trial: 4 samples with 1 evaluation per sample.
 Range (min  max):  1.352 s     1.353 s  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.353 s               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.353 s ± 466.380 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                          ▁             ▁
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  1.35 s         Histogram: frequency by time         1.35 s <

 Memory estimate: 736 bytes, allocs estimate: 1.

Benchmarking a single interact! call showed a slowdown from roughly 10x with

  • Identical allocation counts
  • No additional GC activity

This strongly suggested that the issue was not allocation-related, but rather due to memory access patterns.

After an extended debugging session, @efaulhaber suggested that the slowdown might be caused by cache misses.

Initially, we were skeptical, but to test this hypothesis I created this PR.
The figures below show the particle indexing at t=0 (left) and t=5 (right).

t = 0 t = 5
image image

Results after sorting by cell ID for t=5

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi simulation finished.  Final time: 5.000173614232843  Time steps: 4 (accepted), 4 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

──────────────────────────────────────────────────────────────────────────────────────────────
             TrixiParticles.jl                       Time                    Allocations
                                            ───────────────────────   ────────────────────────
             Tot / % measured:                   13.9s /  75.4%            813MiB /  55.8%

Section                             ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────────────
kick!                                   26    7.73s   74.0%   297ms   63.4MiB   14.0%  2.44MiB
  system interaction                    26    5.27s   50.5%   203ms   59.6MiB   13.1%  2.29MiB
    fluid1-fluid1                       26    3.55s   34.0%   137ms   19.2KiB    0.0%     755B
    open_boundary2-open_boundary2       26    693ms    6.6%  26.6ms   16.7MiB    3.7%   657KiB
    fluid1-boundary3                    26    455ms    4.4%  17.5ms   17.5KiB    0.0%     691B
    fluid1-open_boundary2               26    258ms    2.5%  9.93ms   16.6MiB    3.7%   652KiB
    open_boundary2-boundary3            26    153ms    1.5%  5.88ms   12.2MiB    2.7%   479KiB
    open_boundary2-fluid1               26    150ms    1.4%  5.78ms   13.9MiB    3.1%   549KiB
    ~system interaction~                26   11.5ms    0.1%   442μs    153KiB    0.0%  5.87KiB
    boundary3-boundary3                 26   8.42μs    0.0%   324ns     0.00B    0.0%    0.00B
    boundary3-fluid1                    26   6.67μs    0.0%   257ns     0.00B    0.0%    0.00B
    boundary3-open_boundary2            26   5.58μs    0.0%   215ns     0.00B    0.0%    0.00B
  update systems and nhs                26    2.36s   22.6%  90.8ms    844KiB    0.2%  32.5KiB
    ~update systems and nhs~            26    1.85s   17.7%  71.0ms    517KiB    0.1%  19.9KiB
    compute boundary pressure           26    205ms    2.0%  7.87ms   79.2KiB    0.0%  3.05KiB
    update density diffusion            52    168ms    1.6%  3.22ms    190KiB    0.0%  3.66KiB
    update nhs                          26    128ms    1.2%  4.93ms   48.8KiB    0.0%  1.88KiB
    inverse state equation              26   14.3ms    0.1%   548μs   8.53KiB    0.0%     336B
  source terms                          26   79.1ms    0.8%  3.04ms   2.97MiB    0.7%   117KiB
  reset ∂v/∂t                           26   16.2ms    0.2%   624μs      832B    0.0%    32.0B
  ~kick!~                               26   3.39ms    0.0%   130μs   1.55KiB    0.0%    60.9B
update callback                          4    2.08s   20.0%   521ms    194MiB   42.8%  48.5MiB
  update systems and nhs                 4    1.26s   12.1%   316ms   54.6MiB   12.0%  13.6MiB
  update open boundary                   4    821ms    7.9%   205ms    140MiB   30.8%  34.9MiB
    check domain                         4    591ms    5.7%   148ms    128MiB   28.2%  32.0MiB
    update boundary quantities           4    145ms    1.4%  36.2ms   3.76MiB    0.8%   961KiB
    flow rate calculation                4   83.7ms    0.8%  20.9ms   7.87MiB    1.7%  1.97MiB
    ~update open boundary~               4    809μs    0.0%   202μs   1.55KiB    0.0%     396B
    update pressure model                3   13.5μs    0.0%  4.51μs     0.00B    0.0%    0.00B
  ~update callback~                      4    446μs    0.0%   112μs      976B    0.0%     244B
sorting callback                         1    399ms    3.8%   399ms    190MiB   41.9%   190MiB
drift!                                  26    232ms    2.2%  8.92ms   5.93MiB    1.3%   234KiB
  velocity                              26    223ms    2.1%  8.58ms   5.93MiB    1.3%   234KiB
  reset ∂u/∂t                           26   8.30ms    0.1%   319μs      832B    0.0%    32.0B
  ~drift!~                              26    419μs    0.0%  16.1μs      976B    0.0%    37.5B
──────────────────────────────────────────────────────────────────────────────────────────────
julia> dv_ode, du_ode = copy(sol_restart.u[end]).x; v_ode, u_ode = copy(sol_restart.u[end]).x;

julia> @benchmark TrixiParticles.interact!($dv_ode, $v_ode, $u_ode, $fluid_system, $fluid_system, $semi)
BenchmarkTools.Trial: 38 samples with 1 evaluation per sample.
 Range (min  max):  132.136 ms  134.641 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     132.750 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   132.838 ms ± 526.554 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      █▂        ▅         ▂
  ▅▁▁▅██▅▅▁▁▅█▅▅██▁▁█▁▅▁▁▅█▅▁▅█▁▁▁▁▁▁▁▅▁▅▁▁▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅ ▁
  132 ms           Histogram: frequency by time          135 ms <

 Memory estimate: 736 bytes, allocs estimate: 1.

t=0 sorted

julia> dv_ode, du_ode = copy(sol.u[end]).x; v_ode, u_ode = copy(sol.u[end]).x;

julia> @benchmark TrixiParticles.interact!($dv_ode, $v_ode, $u_ode, $fluid_system, $fluid_system, $semi)
BenchmarkTools.Trial: 41 samples with 1 evaluation per sample.
 Range (min  max):  121.612 ms  122.807 ms  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     121.922 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   121.942 ms ± 212.599 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▃▃ █▃█   ██▃▃          █                                 
  ▇▇▁▇▁▇██▁███▇▇▁████▇▇▇▁▇▇▇▇▁▁█▇▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  122 ms           Histogram: frequency by time          123 ms <

 Memory estimate: 736 bytes, allocs estimate: 1.

Note that there are different number of active particles and the location is also not exactly identical.

The following plot shows a 24-hour simulation run comparing the sorted (orange, sorted every 1000 step) and unsorted (blue) particle layouts. Due to buffered stdout, the unsorted run did not flush all output before the job was terminated, so the plotted data for this case is incomplete. The figure therefore illustrates only a trend. More detailed benchmarks should still be carried out.

image

@LasNikas LasNikas self-assigned this Jan 13, 2026
@codecov
Copy link

codecov bot commented Jan 13, 2026

Codecov Report

❌ Patch coverage is 0% with 89 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.73%. Comparing base (89c85b4) to head (398daf0).
⚠️ Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
src/callbacks/sorting.jl 0.00% 72 Missing ⚠️
src/general/buffer.jl 0.00% 17 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1044      +/-   ##
==========================================
- Coverage   89.63%   88.73%   -0.91%     
==========================================
  Files         121      122       +1     
  Lines        8763     8852      +89     
==========================================
  Hits         7855     7855              
- Misses        908      997      +89     
Flag Coverage Δ
total 88.73% <0.00%> (-0.91%) ⬇️
unit 65.11% <0.00%> (-0.67%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@LasNikas LasNikas mentioned this pull request Jan 13, 2026
19 tasks
Comment on lines +102 to +107
# Apply to all data
active_particle .= active_particle_sorted[perm2]
system_coords .= system_coords[:, combined_perm]
system_velocity .= system_velocity[:, combined_perm]
system_pressure .= system_pressure[combined_perm]
system_density .= system_density[combined_perm]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't make sense. It's system-specific (e.g. TLSPH doesn't have pressure). You need to pass the combined_perm to sort_system!.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I just assumed that only fluid systems are sorted.
It doesn't make much sense to sort a TLSPH, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we might want to sort DEM (not a priority though IMO, so we could skip this if it makes things easier). But what about IISPH? I'll have to check again if we need to sort anything else there.

If this actually works for all fluid systems, I'd skip DEM.

@@ -0,0 +1,165 @@
struct SortingCallback{I}
Copy link
Member

@efaulhaber efaulhaber Jan 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
struct SortingCallback{I}
# These are the systems that require sorting.
# Boundary particles always stay fixed relative to each other, TLSPH computes in the initial configuration.
const SortingSystem = Union{AbstractFluidSystem, DEMSystem}
struct SortingCallback{I}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the name SortingSystem, but I don't have a better idea.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm how about DynamicNeighborhoodSystem? Or a name that indicates, that the trajectory of single particles change a lot during simulation.

end

# TODO: Sort also masses and particle spacings for variable smoothing lengths.
function sort_particles!(system::AbstractFluidSystem, v, u, nhs,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function sort_particles!(system::AbstractFluidSystem, v, u, nhs,
function sort_particles!(system::SortingSystem, v, u, nhs,

@LasNikas LasNikas added the enhancement New feature or request label Jan 14, 2026
Comment on lines +97 to +99
cell_ids[particle] = PointNeighbors.cell_index(cell_list,
PointNeighbors.cell_coords(point_coords,
nhs))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cell_ids[particle] = PointNeighbors.cell_index(cell_list,
PointNeighbors.cell_coords(point_coords,
nhs))
cell_coords[particle] = PointNeighbors.cell_coords(point_coords, nhs)

This works for all cell lists, and sorting by cell index doesn't make sense for anything other than the cell-based GridNeighborhoodSearch, so I think this is fine and we don't need any new API.

# TODO: Sort also masses and particle spacings for variable smoothing lengths.
function sort_particles!(system::AbstractFluidSystem, v, u, nhs,
cell_list::FullGridCellList, semi)
cell_ids = zero(allocate(semi.parallelization_backend, Int, nparticles(system)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cell_ids = zero(allocate(semi.parallelization_backend, Int, nparticles(system)))
cell_coords = allocate(semi.parallelization_backend, SVector{ndims(system), Int}, nparticles(system))

nhs))
end

perm = sortperm(transfer2cpu(cell_ids))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
perm = sortperm(transfer2cpu(cell_ids))
perm = sortperm(transfer2cpu(cell_coords))

Comment on lines +85 to +86
nhs = get_neighborhood_search(system, semi)
cell_list = nhs.cell_list
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
nhs = get_neighborhood_search(system, semi)
cell_list = nhs.cell_list
nhs = get_neighborhood_search(system, semi)
if !(nhs isa GridNeighborhoodSearch)
throw(ArgumentError("`SortingCallback` can only be used with a `GridNeighborhoodSearch`")
end

Or do we need this for the trivial NHS as well?

@efaulhaber
Copy link
Member

grafik It looks like the GPU gets a constant overhead per particle from randomly shuffling the particle index. For large problems, sorting is 3-4x faster than a fully random index. On the CPU, there is no difference for small problems, but the runtime per particle grows linearly with the problem size, as opposed to being constant as expected. This is a 3x speedup from sorting at 1M particles, 4.5x at 4M, 8.3x at 16M, 9.4x at 65M.

Here are some more details for 4M particles (2D):

Ordering Perfect grid CPU (ms) Realistic perturbation CPU (ms) Perfect grid H100 (ms) Realistic perturbation H100 (ms)
Baseline 73 88 6.4 9.9
Sorted by linear cell index 73 88 8.4 9.9
Sorted by cell Z index 73 90 8.3 9.7
Shuffled index 374 368 32.4 31.9

And 3D with 1M particles:

Ordering Perfect grid CPU (ms) Realistic perturbation CPU (ms) Perfect grid H100 (ms) Realistic perturbation H100 (ms)
Baseline 100 122 18 32.9
Sorted by linear cell index 111 126 22.3 30.5
Sorted by cell Z index 129 149 21.4 30.1
Shuffled index 495 495 66 71

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

discussion enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants