Fix GPU-CPU memory synchronization and illegal memory access on NVIDI…#503
Fix GPU-CPU memory synchronization and illegal memory access on NVIDI…#503dr-xjx wants to merge 2 commits intoeunomia-bpf:masterfrom
Conversation
e144630 to
a79c35e
Compare
|
@yunwei37 please review this,tks~ |
|
@copilot |
There was a problem hiding this comment.
Pull Request Overview
This PR addresses critical GPU-CPU memory synchronization issues on NVIDIA Tegra platforms (Orin/Thor) by implementing proper memory management and synchronization primitives to fix illegal memory access errors and data consistency problems.
Key Changes:
- Migrated from standard memory allocation to CUDA pinned memory (
cudaHostAlloc) to resolve illegal memory access on integrated GPU systems - Added
volatilequalifiers to shared memory synchronization flags to prevent compiler optimizations - Used
ld.volatile.global.u32PTX instruction to ensure cache-bypassing memory reads in GPU spin-wait loops
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.
Show a summary per file
| File | Description |
|---|---|
| runtime/src/attach/bpf_attach_ctx_cuda.cpp | Implements CUDAContext with cudaHostAlloc-based memory allocation and worker thread management |
| runtime/src/attach/bpf_attach_ctx.cpp | Removes explicit thread lifecycle management (now handled by CUDAContext) |
| runtime/include/bpf_attach_ctx.hpp | Adds volatile qualifiers to CommSharedMem and refactors CUDAContext class structure |
| attach/nv_attach_impl/trampoline/default_trampoline.cu | Adds volatile qualifiers and uses ld.volatile.global.u32 for cache-bypassing reads |
| attach/nv_attach_impl/nv_attach_impl_frida_setup.cpp | Adds additional debug logging for kernel launch parameters |
| attach/nv_attach_impl/nv_attach_fatbin_record.cpp | Corrects error message text for clarity |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Thanks a lot for the PR! It seems like it fix many bugs in one PR, would you mind seperate them into mmultiple?
Also, does it has performance impact? If yes, how can we improve it while keep it correct? @Officeyutong @Sy0307 What do you think? |
I believe this PR should have no major impact, but there are minor considerations:
|
|
Please fix the CI |
1d1cef8 to
0bd768e
Compare
9019573 to
581f0e8
Compare
|
01f71c3 to
3771b8e
Compare
d0eceed to
ea901ea
Compare
…A Orin X This commit addresses critical memory consistency and access issues when running on NVIDIA Tegra platforms (Orin/Thor). Changes: 1. Add volatile qualifier to CommSharedMem members - Mark flag1, flag2, request_id, and map_id as volatile - Prevents compiler optimizations that could cause data inconsistency - Ensures each access reads from/writes to actual memory 2. Use ld.volatile.global.u32 for flag2 in make_helper_call - Forces memory load on each iteration in spin-wait loop - Bypasses L1 cache to ensure visibility of CPU writes - Critical for cross-device synchronization 3. Migrate to cudaHostAlloc for CommSharedMem allocation - Replaces standard library memory allocation (malloc/new) - Uses pinned memory to resolve CUDA_ERROR_ILLEGAL_ADDRESS - Follows NVIDIA Memory Management guidelines for Tegra platforms - Reference: https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management 4. Align CommSharedMem lifetime with bpftime_attach_ctx - Implement RAII pattern using std::unique_ptr with custom deleter - Ensures proper resource cleanup on context destruction Root Cause: On NVIDIA Orin/Thor, GPU access to constData allocated via standard library functions triggers illegal memory access errors. The error occurs in _bpf_helper_ext_0006 when executing: ... ld.const.u64 %rd1, [constData]; ... add.s64 %rd27, %rd1, %rd43; st.u8 [%rd27+24], %rs1; Error: CUDA_ERROR_ILLEGAL_ADDRESS (error code: 700) Solution: Using cudaHostAlloc with pinned memory ensures proper memory mapping between CPU and GPU address spaces on integrated GPU systems. Testing: - Platform: NVIDIA Orin X - CPU: ARM Cortex-A78AE (12-core) - CUDA Driver/Runtime: 12.1 / 11.4 - Compute Capability: 8.7 - Memory: 25767 MBytes - Test case: example/gpu/cuda-counter - Status: Verified working
ea901ea to
0d9dc86
Compare
0d9dc86 to
7fb6f5f
Compare
|
@yunwei37 please review again,tks~ |
|
Thanks! We are testing it locally these days |
…A Orin X
This commit addresses critical memory consistency and access issues when running on NVIDIA Tegra platforms (Orin/Thor).
Changes:
Add volatile qualifier to CommSharedMem members
Use ld.volatile.global.u32 for flag2 in make_helper_call
Migrate to cudaHostAlloc for CommSharedMem allocation
Align CommSharedMem lifetime with bpftime_attach_ctx
Root Cause:
On NVIDIA Orin/Thor, GPU access to constData allocated via standard library functions triggers illegal memory access errors. The error occurs in _bpf_helper_ext_0006 when executing:
...
ld.const.u64 %rd1, [constData];
...
add.s64 %rd27, %rd1, %rd43;
st.u8 [%rd27+24], %rs1;
Error: CUDA_ERROR_ILLEGAL_ADDRESS (error code: 700)
Solution:
Using cudaHostAlloc with pinned memory ensures proper memory mapping between CPU and GPU address spaces on integrated GPU systems.
Testing:
Please try to use the copilot to summary your PR. You don't need to fill all info below, just it can help giving your a checklist.
Description
Fixes # (issue)
Type of change
How Has This Been Tested?
example/gpu/cuda-counter
Test Configuration:
Checklist