
Fix GPU-CPU memory synchronization and illegal memory access on NVIDIA Orin X #503

Open
dr-xjx wants to merge 2 commits into eunomia-bpf:master from dr-xjx:xjx/fix-tegra-memory-sync-orin-x

Conversation

@dr-xjx
Contributor

@dr-xjx dr-xjx commented Nov 17, 2025

Fix GPU-CPU memory synchronization and illegal memory access on NVIDIA Orin X

This commit addresses critical memory consistency and access issues when running on NVIDIA Tegra platforms (Orin/Thor).

Changes:

  1. Add volatile qualifier to CommSharedMem members

    • Mark flag1, flag2, request_id, and map_id as volatile
    • Prevents compiler optimizations that could cause data inconsistency
    • Ensures each access reads from/writes to actual memory
  2. Use ld.volatile.global.u32 for flag2 in make_helper_call

    • Forces memory load on each iteration in spin-wait loop
    • Bypasses L1 cache to ensure visibility of CPU writes
    • Critical for cross-device synchronization
  3. Migrate to cudaHostAlloc for CommSharedMem allocation

    • Replaces standard library memory allocation (malloc/new)
    • Uses pinned memory to resolve CUDA_ERROR_ILLEGAL_ADDRESS
    • Follows the NVIDIA memory-management guidelines for Tegra platforms
    • Reference: https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#memory-management
  4. Align CommSharedMem lifetime with bpftime_attach_ctx

    • Implement RAII pattern using std::unique_ptr with custom deleter
    • Ensures proper resource cleanup on context destruction
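
A minimal sketch of items 1, 3, and 4 combined (CommSharedMem and its field names come from this description; the field types, payload layout, and deleter shape are illustrative assumptions, not the PR's exact code):

  #include <cuda_runtime.h>
  #include <cstdint>
  #include <cstring>
  #include <memory>

  // Shared between CPU and GPU. volatile forces every access to go to
  // memory instead of being cached in a register by the compiler.
  struct CommSharedMem {
      volatile uint32_t flag1;      // host -> device handshake
      volatile uint32_t flag2;      // device -> host handshake
      volatile uint32_t request_id; // which helper call is pending
      volatile uint32_t map_id;     // which map the request targets
      // ... payload fields elided ...
  };

  // Custom deleter ties the pinned allocation's lifetime to the owning
  // attach context (RAII): destroying the unique_ptr frees the memory.
  struct CommSharedMemDeleter {
      void operator()(CommSharedMem *p) const { cudaFreeHost(p); }
  };
  using CommSharedMemPtr =
      std::unique_ptr<CommSharedMem, CommSharedMemDeleter>;

  CommSharedMemPtr alloc_comm_shared_mem()
  {
      void *raw = nullptr;
      // Pinned, device-mapped host memory: both sides share a coherent
      // view of the same physical pages on integrated GPUs.
      if (cudaHostAlloc(&raw, sizeof(CommSharedMem),
                        cudaHostAllocMapped) != cudaSuccess)
          return nullptr;
      std::memset(raw, 0, sizeof(CommSharedMem));
      return CommSharedMemPtr(static_cast<CommSharedMem *>(raw));
  }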

Root Cause:
On NVIDIA Orin/Thor, GPU access to constData allocated via standard library functions triggers illegal memory access errors. The error occurs in _bpf_helper_ext_0006 when executing:
...
ld.const.u64 %rd1, [constData];
...
add.s64 %rd27, %rd1, %rd43;
st.u8 [%rd27+24], %rs1;

Error: CUDA_ERROR_ILLEGAL_ADDRESS (error code: 700)

Solution:
Using cudaHostAlloc with pinned memory ensures proper memory mapping between CPU and GPU address spaces on integrated GPU systems.
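
For illustration, resolving the GPU-visible alias of such an allocation could look like this (a sketch: host_mem stands for a hypothetical pointer obtained from cudaHostAlloc with the cudaHostAllocMapped flag; on systems with unified addressing the host pointer itself is typically usable on the device):

  void *dev_ptr = nullptr;
  // flags must be 0; returns the device-side address of the pinned pages.
  cudaError_t err = cudaHostGetDevicePointer(&dev_ptr, host_mem, 0);
  if (err != cudaSuccess) {
      // The mapping is unavailable on this platform; report and bail out.
  }
  // dev_ptr (or host_mem itself under unified addressing) is what the
  // GPU can safely dereference, e.g. via the module's constData slot.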

Testing:

  • Platform: NVIDIA Orin X
  • CPU: ARM Cortex-A78AE (12-core)
  • CUDA Driver/Runtime: 12.1 / 11.4
  • Compute Capability: 8.7
  • Memory: 25767 MBytes
  • Test case: example/gpu/cuda-counter
  • Status: Verified working

Please try to use Copilot to summarize your PR. You don't need to fill in all the info below; it just helps give you a checklist.

Description

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

  • Test A
    example/gpu/cuda-counter
  • Test B

Test Configuration:

  • Platform: NVIDIA Orin X
  • CPU: ARM Cortex-A78AE (12-core)
  • CUDA Driver/Runtime: 12.1 / 11.4
  • Compute Capability: 8.7
  • Memory: 25767 MBytes
  • Test case: example/gpu/cuda-counter
  • Status: Verified working
  • Firmware version:
  • Hardware:
  • Toolchain:
  • SDK:

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch from e144630 to a79c35e on November 17, 2025 12:49
@dr-xjx
Contributor Author

dr-xjx commented Nov 18, 2025

@yunwei37 please review this, thanks~

@dr-xjx
Contributor Author

dr-xjx commented Nov 18, 2025

@copilot

Contributor

Copilot AI left a comment


Pull Request Overview

This PR addresses critical GPU-CPU memory synchronization issues on NVIDIA Tegra platforms (Orin/Thor) by implementing proper memory management and synchronization primitives to fix illegal memory access errors and data consistency problems.

Key Changes:

  • Migrated from standard memory allocation to CUDA pinned memory (cudaHostAlloc) to resolve illegal memory access on integrated GPU systems
  • Added volatile qualifiers to shared memory synchronization flags to prevent compiler optimizations
  • Used ld.volatile.global.u32 PTX instruction to ensure cache-bypassing memory reads in GPU spin-wait loops
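
As a concrete illustration of the last point, a device-side spin-wait built on that instruction could look like the following (a sketch under the assumption that flag2 lives in mapped pinned memory; this is not the PR's trampoline code verbatim):

  __device__ void wait_for_host(const volatile unsigned int *flag2)
  {
      unsigned int v = 0;
      do {
          // ld.volatile.global.u32 reads straight from global memory,
          // bypassing L1, so each iteration observes the CPU's latest
          // write rather than a stale cached value.
          asm volatile("ld.volatile.global.u32 %0, [%1];"
                       : "=r"(v)
                       : "l"(flag2)
                       : "memory");
      } while (v == 0);
  }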

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 12 comments.

Summary per file:

  • runtime/src/attach/bpf_attach_ctx_cuda.cpp: Implements CUDAContext with cudaHostAlloc-based memory allocation and worker thread management
  • runtime/src/attach/bpf_attach_ctx.cpp: Removes explicit thread lifecycle management (now handled by CUDAContext)
  • runtime/include/bpf_attach_ctx.hpp: Adds volatile qualifiers to CommSharedMem and refactors the CUDAContext class structure
  • attach/nv_attach_impl/trampoline/default_trampoline.cu: Adds volatile qualifiers and uses ld.volatile.global.u32 for cache-bypassing reads
  • attach/nv_attach_impl/nv_attach_impl_frida_setup.cpp: Adds additional debug logging for kernel launch parameters
  • attach/nv_attach_impl/nv_attach_fatbin_record.cpp: Corrects error message text for clarity


@yunwei37
Member

yunwei37 commented Nov 18, 2025

Thanks a lot for the PR!

It seems like it fixes many bugs in one PR; would you mind separating them into multiple PRs?

Use ld.volatile.global.u32 for flag2 in make_helper_call

Also, does it have a performance impact? If so, how can we improve it while keeping it correct?

@Officeyutong @Sy0307 What do you think?

@dr-xjx
Contributor Author

dr-xjx commented Nov 19, 2025

Thanks a lot for the PR!

It seems like it fixes many bugs in one PR; would you mind separating them into multiple PRs?

Use ld.volatile.global.u32 for flag2 in make_helper_call

Also, does it have a performance impact? If so, how can we improve it while keeping it correct?

@Officeyutong @Sy0307 What do you think?

These changes should be kept as a cohesive whole, and splitting them isn't really necessary.

I believe this PR should have no major impact, but there are minor considerations:

  1. Memory Allocation: cudaHostAlloc requires contiguous physical memory allocation, which may fail when system memory is fragmented or under pressure

  2. Instruction Overhead: ld.volatile is unnecessary for uncached memory access scenarios, but the performance impact is negligible
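
On the first point, the allocation can at least fail loudly instead of crashing later (a sketch; the spdlog-style logging macro is an assumption about the surrounding code, not taken from this PR):

  CommSharedMem *mem = nullptr;
  cudaError_t err = cudaHostAlloc((void **)&mem, sizeof(*mem),
                                  cudaHostAllocMapped);
  if (err != cudaSuccess) {
      // cudaHostAlloc needs contiguous pinned pages and can fail under
      // fragmentation or memory pressure; report rather than dereference.
      SPDLOG_ERROR("cudaHostAlloc failed: {}", cudaGetErrorString(err));
      return nullptr;
  }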

@dr-xjx dr-xjx requested a review from yunwei37 November 19, 2025 06:24
@Officeyutong
Contributor

Please fix the CI

@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch 7 times, most recently from 1d1cef8 to 0bd768e on November 21, 2025 03:55
@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch 3 times, most recently from 9019573 to 581f0e8 on November 21, 2025 06:31
@dr-xjx
Contributor Author

dr-xjx commented Nov 21, 2025

Please fix the CI
@Officeyutong @yunwei37
How should I debug the CI? In my environment, trampoline_ptx.h compiles normally.
I have tried many times but still haven't been able to fix the CI.

@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch from 01f71c3 to 3771b8e on November 21, 2025 06:37
@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch from d0eceed to ea901ea on November 24, 2025 03:30
@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch from ea901ea to 0d9dc86 on November 24, 2025 04:43
@dr-xjx dr-xjx requested a review from Officeyutong November 24, 2025 05:36
@dr-xjx dr-xjx force-pushed the xjx/fix-tegra-memory-sync-orin-x branch from 0d9dc86 to 7fb6f5f on November 24, 2025 05:44
@dr-xjx
Contributor Author

dr-xjx commented Nov 24, 2025

@yunwei37 please review again, thanks~

@yunwei37
Member

Thanks! We are testing it locally these days
