Add prefetch example #526

Open

Officeyutong wants to merge 11 commits into eunomia-bpf:master from Officeyutong:add-two-new-helpers

Conversation

@Officeyutong
Contributor

Please try to use Copilot to summarize your PR. You don't need to fill in all the info below; it's just there to give you a checklist.

Description

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Test A
  • Test B

Test Configuration:

  • Firmware version:
  • Hardware:
  • Toolchain:
  • SDK:

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

Copilot AI left a comment

Pull request overview

This PR adds a new GPU prefetch example demonstrating CUDA unified memory (UVM) prefetching capabilities with eBPF tracing support. However, the implementation appears incomplete with several critical issues including incorrect documentation (copied from threadhist), kernel tracing mismatches, and unused code artifacts.

Key changes:

  • Adds seq_prefetch_kernel CUDA kernel showcasing batch-based memory prefetching with prefetch.global.L2 instructions
  • Implements BPF helper function 0x509 (bpf_prefetch_l2) in the GPU trampoline infrastructure
  • Includes BPF program for kernel execution tracing (though targeting wrong kernel)
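For context, a device-side L2 prefetch helper of this kind can be sketched as follows. This is a hypothetical reconstruction, not the PR's actual trampoline code; the helper name and signature are assumptions based on the review summary, while `prefetch.global.L2` is a real PTX instruction:

```cuda
#include <cstdint>

// Hypothetical sketch of what a bpf_prefetch_l2-style helper (0x509) might
// look like on the GPU side; illustrative only.
extern "C" __noinline__ __device__ uint64_t _bpf_helper_ext_0509(
    uint64_t addr, uint64_t, uint64_t, uint64_t, uint64_t)
{
    // prefetch.global.L2 hints the memory subsystem to pull the cache line
    // containing `addr` into L2 without blocking the issuing warp.
    asm volatile("prefetch.global.L2 [%0];" ::"l"(addr));
    return 0;
}
```

Because `prefetch.global.L2` is only a hint, the helper has no functional effect on correctness; it can only influence memory latency.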

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 12 comments.

Summary per file:
example/gpu/prefetch/prefetch_example.cu CUDA application demonstrating UVM prefetching with batch-based memory access patterns; contains Chinese comments and unused parameters
example/gpu/prefetch/prefetch.c Userspace BPF loader for tracing GPU kernel executions; appears to be boilerplate from threadhist example
example/gpu/prefetch/prefetch.bpf.c eBPF program for GPU kernel tracing; incorrectly targets vectorAdd instead of seq_prefetch_kernel
example/gpu/prefetch/README.md Documentation entirely copied from threadhist without updates; completely incorrect for this example
example/gpu/prefetch/Makefile Build configuration for prefetch example; references non-existent vec_add target
example/gpu/prefetch/.gitignore Git ignore patterns; includes incorrect references to threadhist and vec_add
attach/nv_attach_impl/trampoline_ptx.h PTX assembly for prefetch helper function 0x509
attach/nv_attach_impl/trampoline/default_trampoline.cu Implements GPU-side prefetch helper and adds unused GPU_HASH_MAP constant


Comment on lines 1 to 220
Copilot AI Dec 5, 2025

The entire README.md file appears to be copied from the threadhist example without being updated for the prefetch example. The documentation describes threadhist functionality (thread execution histograms) but this directory contains a prefetch example. The README should be rewritten to describe:

  • What the prefetch example demonstrates (GPU memory prefetching with CUDA unified memory)
  • How the seq_prefetch_kernel works
  • The purpose of the prefetch.bpf.c eBPF program
  • How to build and run the prefetch_example
  • The relationship between the BPF tracing and the CUDA prefetch operations

All references to "threadhist", thread execution counts, and load balancing should be replaced with content relevant to the prefetch functionality.
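For reference, the host-side UVM prefetching that such a README would describe typically looks like the following minimal sketch (variable names are illustrative, not taken from the PR; `cudaMemPrefetchAsync` is the standard CUDA runtime call for migrating unified-memory pages ahead of use):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1 << 20;
    float *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));   // unified (UVM) allocation

    int device = 0;
    cudaGetDevice(&device);

    // Ask the driver to migrate the pages to the GPU before the kernel runs,
    // avoiding on-demand page faults during execution.
    cudaMemPrefetchAsync(data, n * sizeof(float), device, /*stream=*/0);

    // ... launch a kernel that reads `data` ...

    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}
```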

Copilot uses AI. Check for mistakes.
Comment on lines 98 to 99
Copilot AI Dec 5, 2025

The parameter prefetch_distance_pages is declared in the function signature and passed to the kernel, but it's never actually used inside the seq_prefetch_kernel. The kernel uses hard-coded values BATCH_SIZE = 4 and PREFETCH_BATCHES = 2 instead. Either:

  1. Remove the unused parameter from the function signature and kernel call, or
  2. Use the parameter to configure the prefetch distance as originally intended

This appears to be dead code left over from an earlier implementation.

Suggested change
-    size_t stride_elems,
-    int prefetch_distance_pages)
+    size_t stride_elems)

.PHONY: clean
clean:
$(call msg,CLEAN)
$(Q)rm -rf $(OUTPUT) $(APPS) vec_add
Copilot AI Dec 5, 2025

On line 81, the Makefile clean target removes vec_add, but this file is never built by this Makefile. Looking at the all target (line 69), only prefetch and prefetch_example are built, so the reference to vec_add should be removed from the clean target.

Suggested change
-	$(Q)rm -rf $(OUTPUT) $(APPS) vec_add
+	$(Q)rm -rf $(OUTPUT) $(APPS)

Comment on lines +454 to 460
// extern "C" __noinline__ __device__ uint64_t _bpf_helper_ext_0510(
// uint64_t time_to_sleep, uint64_t, uint64_t, uint64_t, uint64_t)
// {
// asm volatile("nanosleep.u32 %0;" ::"r"((uint32_t)time_to_sleep));
// return 0;
// }

Copilot AI Dec 5, 2025

The commented-out code for _bpf_helper_ext_0510 (nanosleep helper) should either be:

  1. Removed if it's not needed for this PR, or
  2. Uncommented and properly implemented if it's intended to be part of the prefetch functionality

Dead/commented code that isn't actively being used should be removed to maintain code cleanliness unless there's a specific reason to keep it (which should be documented with a comment explaining why).

Suggested change
-// extern "C" __noinline__ __device__ uint64_t _bpf_helper_ext_0510(
-//     uint64_t time_to_sleep, uint64_t, uint64_t, uint64_t, uint64_t)
-// {
-//     asm volatile("nanosleep.u32 %0;" ::"r"((uint32_t)time_to_sleep));
-//     return 0;
-// }

Comment on lines 86 to 87
Copilot AI Dec 5, 2025

The constant BPF_MAP_TYPE_GPU_HASH_MAP is defined but never used in this file or the changes shown. If this is part of the prefetch functionality, it should be used; otherwise, it should be removed. If it's being added for future use, that should be documented with a comment explaining its purpose.

Suggested change
-const int BPF_MAP_TYPE_GPU_HASH_MAP = 1501; // non-per-thread, single-copy
-                                            // shared hashmap
+// Reserved for future use: GPU hash map type for non-per-thread, single-copy shared hashmap.
+const int BPF_MAP_TYPE_GPU_HASH_MAP = 1501;

Comment on lines 236 to 238
Copilot AI Dec 5, 2025

The variable prefetch_distance_pages is initialized to 4 but never used, as noted earlier in the kernel. This initialization on lines 236-238 along with the comment "Start with 4 pages ahead" suggests it was intended to be configurable, but the functionality was never completed. Either remove this unused variable or complete the implementation to actually use it.

Comment on lines +1 to +7
/threadhist
/.output
/victim*
/vec_add.cpp
/vec_add
/vec_add-new.cpp
/prefetch
Copilot AI Dec 5, 2025

The .gitignore file references files that don't match the prefetch example. Specifically:

  • Line 1: /threadhist should likely be /prefetch (the actual binary name per Makefile line 24)
  • Lines 4-5: vec_add.cpp and vec_add don't appear to be built by this example

The .gitignore should be updated to reflect the actual build artifacts of the prefetch example.

Suggested change
-/threadhist
-/.output
-/victim*
-/vec_add.cpp
-/vec_add
-/vec_add-new.cpp
-/prefetch
+/prefetch
+/.output
+/victim*

Comment on lines 104 to 166
Copilot AI Dec 5, 2025

These comments are in Chinese and should be translated to English for consistency with the rest of the codebase. The comments say:

  • Line 104: "每次处理的页面批次大小" (Page batch size processed each time)
  • Line 106: "预取窗口大小(批次数)" (Prefetch window size (number of batches))
  • Line 142: "按批次处理" (Process by batch)
  • Line 145: "预取未来批次" (Prefetch future batches)
  • Line 166: "处理当前批次的所有页面" (Process all pages in the current batch)
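Based on those comments, the kernel's batch structure presumably resembles the sketch below. This is a hypothetical reconstruction, not the PR's code; the BATCH_SIZE and PREFETCH_BATCHES values mirror the ones the review mentions, and everything else (names, signature) is illustrative:

```cuda
// Illustrative reconstruction of a batch-based prefetch loop.
__global__ void seq_prefetch_sketch(const char *buf, size_t n_pages,
                                    size_t page_size, float *out)
{
    const int BATCH_SIZE = 4;       // page batch size processed each iteration
    const int PREFETCH_BATCHES = 2; // prefetch window size (in batches)

    float sum = 0.0f;
    for (size_t batch = 0; batch * BATCH_SIZE < n_pages; ++batch) {
        // Prefetch a future batch into L2 before it is needed.
        size_t pf_batch = batch + PREFETCH_BATCHES;
        for (int p = 0; p < BATCH_SIZE; ++p) {
            size_t page = pf_batch * BATCH_SIZE + p;
            if (page < n_pages)
                asm volatile("prefetch.global.L2 [%0];"
                             ::"l"(buf + page * page_size));
        }
        // Process all pages in the current batch.
        for (int p = 0; p < BATCH_SIZE; ++p) {
            size_t page = batch * BATCH_SIZE + p;
            if (page < n_pages)
                sum += buf[page * page_size];
        }
    }
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *out = sum;
}
```

Note that in a structure like this, the hard-coded constants are exactly where a `prefetch_distance_pages` parameter would plug in, which is consistent with the reviewer's dead-parameter finding.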

Comment on lines 191 to 195
Copilot AI Dec 5, 2025

The function name run_seq_device_prefetch is misleading. The function explicitly checks that mode is NOT "device" (lines 203-206) and only works with UVM (Unified Virtual Memory) modes. It should be renamed to something like run_seq_uvm_prefetch or run_uvm_device_prefetch to accurately reflect that it only supports UVM modes, not regular device memory.

Comment on lines +1549 to +1550
}
// .globl _bpf_helper_ext_0509 // -- Begin function _bpf_helper_ext_0509
Copilot AI Dec 5, 2025

There's an inconsistency in the PTX code structure. Line 1549 has a closing brace } that appears to close a function that wasn't opened in the visible diff. This suggests the PTX assembly might have formatting issues. The comment on line 1550 indicates this is the beginning of _bpf_helper_ext_0509, but it's placed after what looks like a closing brace from a previous function. While this may be intentional based on the full file context, it's worth verifying that the PTX assembly structure is correct and properly formatted.

Officeyutong marked this pull request as ready for review on December 6, 2025, 12:16.