
Fix GPU-CPU memory synchronization and illegal memory access on NVIDIA Orin X #503

Open
dr-xjx wants to merge 2 commits into eunomia-bpf:master from dr-xjx:xjx/fix-tegra-memory-sync-orin-x

Conversation

@dr-xjx
Contributor

@dr-xjx dr-xjx commented Nov 17, 2025

Fix GPU-CPU memory synchronization and illegal memory access on NVIDIA Orin X

This commit addresses critical memory consistency and access issues when running on NVIDIA Tegra platforms (Orin/Thor).

Changes:

  1. Add volatile qualifier to CommSharedMem members

    • Mark flag1, flag2, request_id, and map_id as volatile
    • Prevents compiler optimizations that could cause data inconsistency
    • Ensures each access reads from/writes to actual memory
  2. Use ld.volatile.global.u32 for flag2 in make_helper_call

    • Forces memory load on each iteration in spin-wait loop
    • Bypasses L1 cache to ensure visibility of CPU writes
    • Critical for cross-device synchronization
  3. Migrate to cudaHostAlloc for CommSharedMem allocation

    • Replaces standard library memory allocation (malloc/new)
    • Uses pinned memory to resolve CUDA_ERROR_ILLEGAL_ADDRESS
    • Follows the NVIDIA memory-management guidelines for Tegra platforms
    • Reference: https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management
  4. Align CommSharedMem lifetime with bpftime_attach_ctx

    • Implement RAII pattern using std::unique_ptr with custom deleter
    • Ensures proper resource cleanup on context destruction
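
A minimal sketch of items 1, 3, and 4 combined (CommSharedMem and its field names come from this description; the field types, payload layout, and deleter shape are illustrative assumptions, not the PR's exact code):

  #include <cuda_runtime.h>
  #include <cstdint>
  #include <cstring>
  #include <memory>

  // Shared between CPU and GPU. volatile forces every access to go to
  // memory instead of being cached in a register by the compiler.
  struct CommSharedMem {
      volatile uint32_t flag1;      // host -> device handshake
      volatile uint32_t flag2;      // device -> host handshake
      volatile uint32_t request_id; // which helper call is pending
      volatile uint32_t map_id;     // which map the request targets
      // ... payload fields elided ...
  };

  // Custom deleter ties the pinned allocation's lifetime to the owning
  // attach context (RAII): destroying the unique_ptr frees the memory.
  struct CommSharedMemDeleter {
      void operator()(CommSharedMem *p) const { cudaFreeHost(p); }
  };
  using CommSharedMemPtr =
      std::unique_ptr<CommSharedMem, CommSharedMemDeleter>;

  CommSharedMemPtr alloc_comm_shared_mem()
  {
      void *raw = nullptr;
      // Pinned, device-mapped host memory: both sides share a coherent
      // view of the same physical pages on integrated GPUs.
      if (cudaHostAlloc(&raw, sizeof(CommSharedMem),
                        cudaHostAllocMapped) != cudaSuccess)
          return nullptr;
      std::memset(raw, 0, sizeof(CommSharedMem));
      return CommSharedMemPtr(static_cast<CommSharedMem *>(raw));
  }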

Root Cause:
On NVIDIA Orin/Thor, GPU access to constData allocated via standard library functions triggers illegal memory access errors. The error occurs in _bpf_helper_ext_0006 when executing:
...
ld.const.u64 %rd1, [constData];
...
add.s64 %rd27, %rd1, %rd43;
st.u8 [%rd27+24], %rs1;

Error: CUDA_ERROR_ILLEGAL_ADDRESS (error code: 700)

Solution:
Using cudaHostAlloc with pinned memory ensures proper memory mapping between CPU and GPU address spaces on integrated GPU systems.
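
For illustration, resolving the GPU-visible alias of such an allocation could look like this (a sketch: host_mem stands for a hypothetical pointer obtained from cudaHostAlloc with the cudaHostAllocMapped flag; on systems with unified addressing the host pointer itself is typically usable on the device):

  void *dev_ptr = nullptr;
  // flags must be 0; returns the device-side address of the pinned pages.
  cudaError_t err = cudaHostGetDevicePointer(&dev_ptr, host_mem, 0);
  if (err != cudaSuccess) {
      // The mapping is unavailable on this platform; report and bail out.
  }
  // dev_ptr (or host_mem itself under unified addressing) is what the
  // GPU can safely dereference, e.g. via the module's constData slot.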

Testing:

  • Platform: NVIDIA Orin X
  • CPU: ARM Cortex-A78AE (12-core)
  • CUDA Driver/Runtime: 12.1 / 11.4
  • Compute Capability: 8.7
  • Memory: 25767 MBytes
  • Test case: example/gpu/cuda-counter
  • Status: Verified working

Please try to use Copilot to summarize your PR. You don't need to fill in all the info below; it just helps give you a checklist.

Description

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Test A
    example/gpu/cuda-counter
  • Test B

Test Configuration:

  • Platform: NVIDIA Orin X
  • CPU: ARM Cortex-A78AE (12-core)
  • CUDA Driver/Runtime: 12.1 / 11.4
  • Compute Capability: 8.7
  • Memory: 25767 MBytes
  • Test case: example/gpu/cuda-counter
  • Status: Verified working
  • Firmware version:
  • Hardware:
  • Toolchain:
  • SDK:

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch from e144630 to a79c35e on November 17, 2025 12:49
@dr-xjx
Contributor Author

dr-xjx commented Nov 18, 2025

@yunwei37 please review this, thanks~

@dr-xjx
Contributor Author

dr-xjx commented Nov 18, 2025

@copilot

Contributor

Copilot AI left a comment


Pull Request Overview

This PR addresses critical GPU-CPU memory synchronization issues on NVIDIA Tegra platforms (Orin/Thor) by implementing proper memory management and synchronization primitives to fix illegal memory access errors and data consistency problems.

Key Changes:

  • Migrated from standard memory allocation to CUDA pinned memory (cudaHostAlloc) to resolve illegal memory access on integrated GPU systems
  • Added volatile qualifiers to shared memory synchronization flags to prevent compiler optimizations
  • Used ld.volatile.global.u32 PTX instruction to ensure cache-bypassing memory reads in GPU spin-wait loops
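
As a concrete illustration of the last point, a device-side spin-wait built on that instruction could look like the following (a sketch under the assumption that flag2 lives in mapped pinned memory; this is not the PR's trampoline code verbatim):

  __device__ void wait_for_host(const volatile unsigned int *flag2)
  {
      unsigned int v = 0;
      do {
          // ld.volatile.global.u32 reads straight from global memory,
          // bypassing L1, so each iteration observes the CPU's latest
          // write rather than a stale cached value.
          asm volatile("ld.volatile.global.u32 %0, [%1];"
                       : "=r"(v)
                       : "l"(flag2)
                       : "memory");
      } while (v == 0);
  }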

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.

Summary per file:

  • runtime/src/attach/bpf_attach_ctx_cuda.cpp: Implements CUDAContext with cudaHostAlloc-based memory allocation and worker thread management
  • runtime/src/attach/bpf_attach_ctx.cpp: Removes explicit thread lifecycle management (now handled by CUDAContext)
  • runtime/include/bpf_attach_ctx.hpp: Adds volatile qualifiers to CommSharedMem and refactors the CUDAContext class structure
  • attach/nv_attach_impl/trampoline/default_trampoline.cu: Adds volatile qualifiers and uses ld.volatile.global.u32 for cache-bypassing reads
  • attach/nv_attach_impl/nv_attach_impl_frida_setup.cpp: Adds additional debug logging for kernel launch parameters
  • attach/nv_attach_impl/nv_attach_fatbin_record.cpp: Corrects error message text for clarity


@yunwei37
Member

yunwei37 commented Nov 18, 2025

Thanks a lot for the PR!

It seems like it fixes many bugs in one PR; would you mind separating them into multiple PRs?

Use ld.volatile.global.u32 for flag2 in make_helper_call

Also, does it have a performance impact? If so, how can we improve it while keeping it correct?

@Officeyutong @Sy0307 What do you think?

@dr-xjx
Contributor Author

dr-xjx commented Nov 19, 2025

Thanks a lot for the PR!

It seems like it fixes many bugs in one PR; would you mind separating them into multiple PRs?

Use ld.volatile.global.u32 for flag2 in make_helper_call

Also, does it have a performance impact? If so, how can we improve it while keeping it correct?

@Officeyutong @Sy0307 What do you think?

These changes should be kept as a cohesive whole, and splitting them isn't really necessary.

I believe this PR should have no major impact, but there are minor considerations:

  1. Memory Allocation: cudaHostAlloc requires contiguous physical memory allocation, which may fail when system memory is fragmented or under pressure

  2. Instruction Overhead: ld.volatile is unnecessary for uncached memory access scenarios, but the performance impact is negligible
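
On the first point, the allocation can at least fail loudly instead of crashing later (a sketch; the spdlog-style logging macro is an assumption about the surrounding code, not taken from this PR):

  CommSharedMem *mem = nullptr;
  cudaError_t err = cudaHostAlloc((void **)&mem, sizeof(*mem),
                                  cudaHostAllocMapped);
  if (err != cudaSuccess) {
      // cudaHostAlloc needs contiguous pinned pages and can fail under
      // fragmentation or memory pressure; report rather than dereference.
      SPDLOG_ERROR("cudaHostAlloc failed: {}", cudaGetErrorString(err));
      return nullptr;
  }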

@dr-xjx dr-xjx requested a review from yunwei37 November 19, 2025 06:24
@Officeyutong
Contributor

Please fix the CI

@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch 7 times, most recently from 1d1cef8 to 0bd768e on November 21, 2025 03:55
@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch 3 times, most recently from 9019573 to 581f0e8 on November 21, 2025 06:31
@dr-xjx
Contributor Author

dr-xjx commented Nov 21, 2025

Please fix the CI
@Officeyutong @yunwei37
How should I debug the CI? In my environment, trampoline_ptx.h compiles normally.
I have tried many times but still haven't been able to fix the CI.

@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch from 01f71c3 to 3771b8e on November 21, 2025 06:37
@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch from d0eceed to ea901ea on November 24, 2025 03:30
@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch from ea901ea to 0d9dc86 on November 24, 2025 04:43
@dr-xjx dr-xjx requested a review from Officeyutong November 24, 2025 05:36
@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch from 0d9dc86 to 7fb6f5f on November 24, 2025 05:44
@dr-xjx
Contributor Author

dr-xjx commented Nov 24, 2025

@yunwei37 please review again, thanks~

@yunwei37
Member

Thanks! We are testing it locally these days
