bench: fix GPU power throttling in benchmark utilities #2899
Edenzzzz wants to merge 5 commits into flashinfer-ai:main
Conversation
When bench_gpu_time_with_cudagraph runs many kernels back-to-back within a single graph replay, sustained peak power draw forces the GPU to throttle clock frequency (up to 20% on B200), producing artificially lower benchmark numbers.

Fix: (1) Cap num_iters_within_graph so total graph duration stays under 5 ms (the empirical power-throttling threshold). (2) Insert sync+sleep gaps between graph replays when cumulative compute would exceed the threshold.

Before: b=4 s=8192 h=16 d=128 BF16 FA4: graph=1268 vs event=1455 TFLOPS (-13%)
After: graph=1423 vs event=1446 TFLOPS (-1.6%); xlarge shapes now +10% faster

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Sustained back-to-back kernel execution causes GPU clock throttling (up to 20% on B200), producing artificially lower benchmark numbers. This affects both bench_gpu_time_with_cuda_event (at high repeat_iters) and bench_gpu_time_with_cudagraph (within and between graph replays).

Fix: Insert sync+sleep cooldown gaps every ~5 ms of sustained compute to let GPU clocks recover. For cudagraph, also cap num_iters_within_graph so a single graph replay doesn't exceed the throttling threshold.

Before (b=4 s=8192 h=16 d=128, 60 iters):
- Events iter 0-9: 1450 TFLOPS → iter 70-79: 1237 TFLOPS (-15%)
- Graph (n=10): 1268 TFLOPS vs events: 1455 (-13%)

After (same config, 60 iters):
- Events: stable 1399 TFLOPS throughout
- Graph: 1427 TFLOPS (+2% over events, from reduced launch overhead)

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
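The capping rule described in the commit messages above can be sketched as a small standalone helper (hypothetical; the function name is illustrative, and only the 5.0 ms B200 value comes from the PR):

```python
GPU_POWER_THROTTLE_THRESHOLD_MS = 5.0  # empirical threshold, measured on B200


def cap_graph_iters(single_kernel_ms, requested_iters):
    """Cap the number of kernels captured in one CUDA graph so a single
    replay stays under the power-throttling threshold.

    Mirrors the PR's rule: if the requested iterations would exceed the
    threshold, shrink to however many kernels fit in ~5 ms, floored at 1.
    """
    if single_kernel_ms * requested_iters > GPU_POWER_THROTTLE_THRESHOLD_MS:
        return max(1, int(GPU_POWER_THROTTLE_THRESHOLD_MS / single_kernel_ms))
    return requested_iters
```

For example, a 1 ms kernel gets capped from the default 10 iterations down to 5, while a kernel longer than the threshold still runs once per replay.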
📝 Walkthrough

Added a GPU power-throttling mitigation mechanism to the benchmarking utilities: introduced a module-level threshold and inserted periodic synchronization/sleep cooldowns in GPU timing loops; the CUDA-graph path gains a pre-probe to estimate kernel/graph duration and adjust replay counts accordingly.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Code Review
This pull request introduces a mechanism to prevent GPU power throttling during benchmarking by inserting synchronization and sleep intervals when sustained compute exceeds a 5ms threshold. It includes logic to estimate kernel execution times and dynamically adjust iteration counts within CUDA graphs to maintain stable clock frequencies. Feedback was provided to cap the sleep duration at 10ms to prevent benchmarks from becoming excessively slow when executing long-running kernels.
```python
    sleep_after_kernel_run(estimated_kernel_execution_time)
elif (iter_idx + 1) % iters_per_burst == 0:
    torch.cuda.synchronize()
    time.sleep(estimated_kernel_execution_time / 1000)
```
The sleep duration here is unbounded. If estimated_kernel_execution_time is very large (e.g., >1s), this will cause the benchmark to sleep for a long time after each burst, significantly slowing it down.
Consider capping the sleep duration to a reasonable value, for instance 10ms, to prevent unexpectedly long benchmark runs while still providing an effective cooldown period.
```diff
- time.sleep(estimated_kernel_execution_time / 1000)
+ time.sleep(min(estimated_kernel_execution_time / 1000, 0.01))
```
Actionable comments posted: 2
🧹 Nitpick comments (2)
flashinfer/testing/utils.py (2)
33-43: Consider making the throttle threshold configurable or architecture-aware.

The 5.0 ms threshold is empirically measured on B200. Per the PR description, testing on H100 and A100 is pending. Different GPU architectures may have different power-throttling characteristics, and a hardcoded value may not be optimal across all hardware.

Consider either:
- Adding an optional parameter to the benchmarking functions to override this threshold
- Using `torch.cuda.get_device_properties()` to detect architecture and adjust accordingly
- At minimum, documenting that this value is B200-specific
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/testing/utils.py` around lines 33 - 43, The constant GPU_POWER_THROTTLE_THRESHOLD_MS is hardcoded to a B200-specific value; make the threshold configurable and architecture-aware by replacing direct uses of GPU_POWER_THROTTLE_THRESHOLD_MS with a resolved value that can be overridden: add an optional parameter (e.g., throttle_threshold_ms) to the benchmarking/measurement entrypoints that currently rely on GPU_POWER_THROTTLE_THRESHOLD_MS, implement a helper (e.g., resolve_throttle_threshold) that uses torch.cuda.get_device_properties() to return a sensible default per GPU family (B200 vs H100/A100) and falls back to the original 5.0ms, and update callers to pass through the new parameter or call the helper; also update the module docstring to note the default is B200-derived and refer to the new override.
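As a rough illustration of this nitpick (not code from the PR; the helper name and the per-family table are assumptions, and only the B200-derived 5.0 ms default is from the source), a resolver might look like:

```python
GPU_POWER_THROTTLE_THRESHOLD_MS = 5.0  # empirical default, measured on B200


def resolve_throttle_threshold(device_name, override_ms=None):
    """Return a throttle threshold in ms for the given GPU name.

    An explicit override always wins; otherwise fall back to a per-family
    guess, and finally to the B200-derived module default.
    """
    if override_ms is not None:
        return override_ms
    name = device_name.upper()
    # Placeholder per-family values: only the B200 entry is measured;
    # the H100/A100 numbers here are made-up illustrations.
    family_defaults = {"B200": 5.0, "H100": 8.0, "A100": 10.0}
    for family, threshold_ms in family_defaults.items():
        if family in name:
            return threshold_ms
    return GPU_POWER_THROTTLE_THRESHOLD_MS
```

In the real utilities the name would come from `torch.cuda.get_device_properties(0).name`; it is passed in as a string here so the sketch stays framework-independent.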
1489-1493: Add a warning when `num_iters_within_graph` is silently reduced.

When the caller's requested `num_iters_within_graph` would cause throttling, the code silently reduces it. Callers (e.g., those relying on the default of 10) might be surprised when their measurement granularity changes unexpectedly.

🔧 Suggested fix

```diff
 max_sustained_ms = GPU_POWER_THROTTLE_THRESHOLD_MS
 if single_kernel_ms * num_iters_within_graph > max_sustained_ms:
+    original_iters = num_iters_within_graph
     num_iters_within_graph = max(1, int(max_sustained_ms / single_kernel_ms))
+    warnings.warn(
+        f"num_iters_within_graph reduced from {original_iters} to "
+        f"{num_iters_within_graph} to avoid GPU power throttling "
+        f"(single kernel: {single_kernel_ms:.2f}ms, threshold: {max_sustained_ms}ms).",
+        category=UserWarning,
+        stacklevel=2,
+    )
```
Verify each finding against the current code and only fix it if needed. In `@flashinfer/testing/utils.py` around lines 1489 - 1493, The code silently reduces num_iters_within_graph when single_kernel_ms * num_iters_within_graph exceeds GPU_POWER_THROTTLE_THRESHOLD_MS; change this to log or warn the caller before mutating the value by emitting a clear message that includes the original requested num_iters_within_graph, the computed reduced value, single_kernel_ms, and GPU_POWER_THROTTLE_THRESHOLD_MS so callers are aware the granularity changed; locate the logic around num_iters_within_graph and single_kernel_ms and add a warnings.warn or logger.warning call just prior to assigning the reduced num_iters_within_graph.
ℹ️ Review info

⚙️ Run configuration: defaults · Review profile: CHILL · Plan: Pro · Run ID: ac4aba7a-2ca1-40a3-b4ae-54df3bcfd0af

📒 Files selected for processing (1): flashinfer/testing/utils.py
```python
if sleep_after_run:
    sleep_after_kernel_run(estimated_kernel_execution_time)
elif (iter_idx + 1) % iters_per_burst == 0:
    torch.cuda.synchronize()
    time.sleep(estimated_kernel_execution_time / 1000)
```
Semantic change: sleep_after_run=False now injects cooldown sleep.
Previously, sleep_after_run=False meant no sleeping between iterations. Now it triggers automatic cooldown gaps. This silently changes behavior for existing callers (e.g., benchmarks/routines/moe.py, benchmarks/routines/attention.py) that explicitly pass sleep_after_run=False expecting no additional delays.
Consider one of:
- Add a new parameter like `disable_throttle_mitigation=False` to let callers opt out
- Rename `sleep_after_run` to better reflect the new behavior (breaking change)
- Document this behavioral change prominently in the docstring

The measurement timing is correct since sleep occurs after `end_events[iter_idx].record()`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/testing/utils.py` around lines 919 - 923, The current change makes
sleep_after_run=False accidentally trigger a cooldown sleep; restore opt-out by
adding a new boolean parameter (e.g., disable_throttle_mitigation=False) to the
function that contains sleep_after_run and use it in the branch logic so: if
sleep_after_run: call sleep_after_kernel_run(...); elif not
disable_throttle_mitigation and (iter_idx + 1) % iters_per_burst == 0:
torch.cuda.synchronize(); time.sleep(estimated_kernel_execution_time / 1000).
Update the function signature and any callers that need the new behavior and
adjust docstring to describe disable_throttle_mitigation, referencing
sleep_after_run, sleep_after_kernel_run, estimated_kernel_execution_time,
iter_idx, and iters_per_burst.
```python
if sleep_after_run:
    sleep_after_kernel_run(estimated_kernel_execution_time)
elif (iter_idx + 1) % replays_per_burst == 0:
    # Cooldown gap to prevent clock throttling from sustained compute.
    torch.cuda.synchronize()
    time.sleep(graph_duration_ms / 1000)
```
Same semantic change concern as noted for bench_gpu_time_with_cuda_event.
The sleep_after_run=False branch now injects cooldown gaps, changing the behavior for existing callers. See the earlier comment on lines 919-923 for suggested mitigations.
The cooldown implementation itself is correct—sleep occurs after end_events[iter_idx].record() so it doesn't affect measurement accuracy.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/testing/utils.py` around lines 1546 - 1551, The new branch that
injects cooldown gaps when sleep_after_run is False changes behavior for
existing callers; restore previous semantics by making cooldown injection
opt-in: add a boolean flag (e.g., inject_cooldown_when_no_sleep default False)
to the function that contains this block (and propagate it through
bench_gpu_time_with_cuda_event callers), then change the branch to only perform
the torch.cuda.synchronize()/time.sleep(...) cooldown when
inject_cooldown_when_no_sleep is True and (iter_idx + 1) % replays_per_burst ==
0; keep the existing sleep_after_kernel_run(estimated_kernel_execution_time)
path unchanged and ensure the cooldown still happens after
end_events[iter_idx].record() so measurements remain correct.
CUPTI measures pure GPU kernel time, but throttled clocks still produce longer kernel durations. Apply the same cooldown gap logic.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
cc @yzh119 should be ready
Summary
- Fix GPU power throttling in `bench_gpu_time_with_cuda_event`, `bench_gpu_time_with_cudagraph`, and `bench_gpu_time_with_cupti` that causes artificially lower benchmark numbers
- Cap `num_iters_within_graph` to keep single-replay duration under the threshold
When setting `repeat` instead of `repeat_time_ms`, running kernels back-to-back causes sustained peak power draw on modern GPUs, forcing clock frequency throttling. On B200:

Before fix — per-iteration TFLOPS degradation over 100 event-timed iterations (b=4 s=8192 h=16 d=128, BF16 FA4 on B200):
Fix
Introduce `GPU_POWER_THROTTLE_THRESHOLD_MS = 5.0` — the empirical threshold (on B200) before sustained compute triggers clock throttling. All three timing paths now insert `sync+sleep` cooldown gaps at this interval.

After fix — Event vs Graph comparison (60 iters each, BF16 FA4 on B200):
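The cooldown scheme above can be sketched as follows (a hypothetical standalone version; names are illustrative, and the kernel launch is stubbed out so the burst scheduling can run without a GPU):

```python
import time

GPU_POWER_THROTTLE_THRESHOLD_MS = 5.0  # empirical threshold measured on B200


def run_with_cooldowns(launch_kernel, num_iters, kernel_ms):
    """Run launch_kernel num_iters times, pausing every ~5 ms of cumulative
    compute so GPU clocks can recover from power throttling.

    Returns the iteration indices after which a cooldown was inserted.
    """
    # How many back-to-back launches fit under the throttling threshold.
    iters_per_burst = max(1, int(GPU_POWER_THROTTLE_THRESHOLD_MS / kernel_ms))
    cooldown_points = []
    for iter_idx in range(num_iters):
        launch_kernel()
        if (iter_idx + 1) % iters_per_burst == 0:
            # In the real utilities: torch.cuda.synchronize() before sleeping,
            # and the pause happens after the end event is recorded.
            time.sleep(kernel_ms / 1000)  # let clocks recover
            cooldown_points.append(iter_idx)
    return cooldown_points
```

With a 1 ms kernel a burst is 5 iterations, so 20 iterations get cooldowns after indices 4, 9, 14, and 19; a kernel longer than the threshold cools down after every launch.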
Graph ≥ events for 8/10 shapes.
Reproducer
Test plan
🤖 Generated with Claude Code