Conversation

@jayhawk-commits
Collaborator

No description provided.

@jayhawk-commits jayhawk-commits merged commit 48213c1 into develop Apr 7, 2025
systems-assistant bot pushed a commit that referenced this pull request Jul 22, 2025
Issues include:

Update ROCm SMI to display N/A instead of None or Not Supported
Update ROCm SMI to log error messages instead of displaying them
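
The two changes listed above could look roughly like the sketch below. This is an illustration only, not the code from the commit: `normalize_value`, `read_metric`, and the logger name are hypothetical, and the real ROCm SMI CLI may structure this differently.

```python
import logging

# Hypothetical logger name; the real CLI may configure logging differently.
logger = logging.getLogger("smi_display_sketch")

# Values that are shown to the user as "N/A" (compared case-insensitively).
_UNSUPPORTED = {"none", "not supported"}


def normalize_value(value):
    """Return "N/A" for None or "Not Supported"-style values, else the value as text."""
    if value is None:
        return "N/A"
    text = str(value).strip()
    if text.lower() in _UNSUPPORTED:
        return "N/A"
    return text


def read_metric(getter):
    """Call a metric getter; on failure, log the error instead of printing it."""
    try:
        return normalize_value(getter())
    except Exception as exc:  # broad on purpose: any failure becomes "N/A" in the output
        logger.debug("metric read failed: %s", exc)
        return "N/A"
```

With this shape, the table output only ever contains a value or `N/A`, and the error detail is available in the log when debugging is enabled.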

Signed-off-by: Juan Castillo [email protected]
Change-Id: I1a2ce6e4f329666b5666664a7d7b4475d6c1cbc7
systems-assistant bot pushed a commit that referenced this pull request Aug 5, 2025
jayhawk-commits pushed a commit that referenced this pull request Aug 8, 2025
[ROCm/rocm_smi_lib commit: 55ee3cc]
jayhawk-commits pushed a commit that referenced this pull request Aug 8, 2025
[ROCm/rocm_smi_lib commit: 898ae4f]
systems-assistant bot pushed a commit that referenced this pull request Aug 10, 2025
jayhawk-commits pushed a commit that referenced this pull request Aug 11, 2025
jayhawk-commits pushed a commit that referenced this pull request Aug 18, 2025
[ROCm/hipother commit: a00c818]
kcossett-amd added a commit to kcossett-amd/rocm-systems that referenced this pull request Oct 16, 2025
Co-authored-by: Pratik Basyal <[email protected]>
dgaliffiAMD added a commit that referenced this pull request Oct 21, 2025
…ument to avoid instrumenting around C "main" wrapper (#1322)

* Add check for Fortran main

* Comment change

* MAIN__ -> Fortran main

* Cray Compiler comment change

* Add changelog and troubleshooting comments

* Improve CHANGELOG.md message

* Change CHANGELOG msg to be in 7.2.0

* Apply review change #1

Co-authored-by: Pratik Basyal <[email protected]>

* Apply review change #2

Co-authored-by: Pratik Basyal <[email protected]>

* Apply review change #3

Co-authored-by: Pratik Basyal <[email protected]>

---------

Co-authored-by: Pratik Basyal <[email protected]>
Co-authored-by: David Galiffi <[email protected]>
ggottipa-amd pushed a commit that referenced this pull request Oct 31, 2025
ammallya pushed a commit that referenced this pull request Nov 17, 2025
The bug can be reproduced as follows.

In terminal #1, run command:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow 

In terminal #2, inject errors:
while true; do sudo amdgpuras -b 7 -s 1 -m 6 -t 2; sleep 2; done

Terminal #1 starts dumping the CPER entries it captures. After 20 entries have been captured, open terminal #3 and run the same command as in terminal #1:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow 

Terminal #3 produces no output, even while terminal #1 continues to capture and print entries.

The fix:

More than 20 CPER entries are already available in the GPU buffer. When the command in terminal #3 starts capturing from the beginning and passes 20 buffers to copy entries into, the C++ API returns a code indicating that more data is available.

The Python CLI should not treat this code as an error; it should continue and print what the API returned.
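
A minimal sketch of that handling, assuming a three-way status. The `Status` enum and `handle_cper_batch` are hypothetical names made up for illustration; they are not the amdsmi API or the actual CLI code.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical stand-ins for the codes the C++ API can return.
    SUCCESS = 0
    MORE_DATA = 1  # buffers were filled, and further entries are still pending
    ERROR = 2


def handle_cper_batch(status, entries, emit=print):
    """Emit every returned entry unless the call really failed.

    Returns True when the caller should fetch again right away because
    the API reported that more data is available.
    """
    if status is Status.ERROR:
        raise RuntimeError("CPER read failed")
    # MORE_DATA still carries valid entries, so print them like SUCCESS.
    for entry in entries:
        emit(entry)
    return status is Status.MORE_DATA
```

In a `--follow` loop, a `True` return would mean "call the API again without waiting", so a late-started terminal catches up on the backlog instead of printing nothing.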

---------

Signed-off-by: Oosman Saeed <[email protected]>
ammallya pushed a commit that referenced this pull request Nov 18, 2025
[ROCm/amdsmi commit: 5b95d22]
ammallya pushed a commit that referenced this pull request Nov 21, 2025
[ROCm/amdsmi commit: 5b95d22]
2 participants