cherry-pick support for SN7170_LD #2386

Open
EliasA5 wants to merge 2 commits into Mellanox:V.7.0060.1000_BR from EliasA5:V.7.0060.1000_BR_eassaf_cherrypick

EliasA5 commented Mar 22, 2026

Cherry-pick commits needed to support SN7170_LD device from https://github.com/Mellanox/hw-mgmt/tree/V.7.0050.3081

Signed-off-by: Ciju Rajan K <crajank@nvidia.com>
Bug #4871051

Signed-off-by: Ciju Rajan K <crajank@nvidia.com>
@sw-r2d2-bot (Collaborator)

Can one of the admins verify this patch?

greptile-apps Bot (Contributor) commented Mar 22, 2026

Greptile Summary

This PR cherry-picks support for the SN7170_LD device (SKU HI194) into the V.7.0060.1000_BR branch by adding SimX emulation scaffolding: a new HI194 virtual environment directory with mockup sensor/thermal data, the corresponding .tgz bundle, and guards in the three main shell scripts to short-circuit real hardware initialization when running in SimX.

Key changes:

  • hw-management-helpers.sh: HI194 added to check_if_simx_supported_platform allowlist — correct.
  • hw-management-ready.sh: Early-exit block added for HI194 in SimX — correct, consistent with prior HI-series entries.
  • hw-management.sh (start action): Mock-tree extraction guard updated to include HI194 — correct. However, the identical guard in the restart|force-reload action (line 4171) was not updated, meaning hw-management restart on a SimX HI194 platform will not extract the mock tree and will likely fail.
  • HI194/environment: New mockup data file; the voltmon16 block contains values that appear systematically rotated by one column position (crit/input/max) relative to the consistent voltmon2 through voltmon15 pattern, causing simulated threshold violations for curr3 and curr4 that may not be intentional.

Confidence Score: 3/5

  • PR needs two targeted fixes before merge: the missing HI194 in the restart|force-reload guard and a verification of the voltmon16 environment values.
  • The restart|force-reload omission in hw-management.sh is a clear functional gap — service restarts on SimX HI194 will not behave like start, breaking the symmetry that all other SimX SKUs rely on. The voltmon16 data anomaly is lower-severity (simulation only) but also needs author confirmation. Both issues are straightforward to fix, but they should be verified before merging.
  • usr/usr/bin/hw-management.sh (restart|force-reload guard) and usr/etc/hw-management-virtual/HI194/environment (voltmon16 values)
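The start and restart|force-reload guards diverged because each action carries its own copy of the SKU list. A minimal sketch of one way to keep them in sync, assuming a shared predicate is acceptable in this code base; `is_simx_mock_sku` is a hypothetical name and is not part of hw-management.sh:

```shell
# Hypothetical helper: a single list of the SKUs that are served by the
# SimX mock tree, so every action consults the same allowlist.
is_simx_mock_sku()
{
    case "$1" in
        HI180|HI181|HI185|HI193|HI194)
            return 0
            ;;
        *)
            return 1
            ;;
    esac
}

# Illustrative call site (check_simx is the existing predicate used by
# the guards quoted in this review):
#
#   if check_simx && is_simx_mock_sku "$sku"; then
#       ... extract the mock tree and exit 0 ...
#   fi
```

With a helper like this, adding a future SKU would touch one line instead of one line per action.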

Important Files Changed

| Filename | Overview |
| --- | --- |
| usr/usr/bin/hw-management.sh | Added HI194 to the SimX mock-data shortcut in the start action, but the identical guard in `restart\|force-reload` (line 4171) was not updated. |
| usr/usr/bin/hw-management-ready.sh | Adds an early-exit block for HI194 in SimX, matching the identical pattern already present for HI180, HI181, HI185, and HI193. Change is consistent and correct. |
| usr/usr/bin/hw-management-helpers.sh | Appends HI194 to the check_if_simx_supported_platform allowlist, enabling the platform to pass the SimX guard check. Change is minimal and correct. |
| usr/etc/hw-management-virtual/HI194/environment | New SimX mock environment data for SN7170_LD; voltmon16 block has systematically shifted crit/input/max values (appear rotated by one column position) compared to voltmon2 through voltmon15, causing simulated threshold violations for curr3 and curr4 that may be unintentional. |
| usr/etc/hw-management-virtual/HI194/thermal | New SimX thermal mockup data for SN7170_LD; values follow the same structure as other HI-series thermal files and look correct. |
| usr/etc/hw-management-virtual/hwmgmt_HI194.tgz | New binary tarball containing the complete SimX mock hw-management tree for HI194; referenced by hw-management.sh at startup. Binary cannot be inspected here but the filename follows the established naming convention. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[hw-management ACTION] --> B{check_simx?}
    B -- No --> C[Normal hardware path]
    B -- Yes --> D{check_if_simx_supported_platform?\nhelpers.sh allowlist}
    D -- Not supported --> E[exit 0]
    D -- Supported: HI194 added --> F{ACTION?}

    F -- start --> G{SKU == HI180/81/85/93/HI194?\nhw-management.sh line 4073}
    G -- Yes --> H[tar -xzf hwmgmt_HI194.tgz\nexit 0]
    G -- No --> I[do_start]

    F -- restart/force-reload --> J{SKU == HI180/81/85/93?\nhw-management.sh line 4171}
    J -- Yes --> K[tar -xzf hwmgmt_SKU.tgz\nexit 0]
    J -- No --> L[do_start]
    J -. HI194 NOT handled .-> L

    F -- stop --> M[do_stop]

    style J fill:#ffcccc,stroke:#cc0000
    style L fill:#ffcccc,stroke:#cc0000

Reviews (1): Last reviewed commit: "Add SN7170_LD to list of virtual platfor..."

 fi
 # TEMPORARY hw-management mockup values for HI180/HI181/HI185/HI193 in simx
-if check_simx && [ "$sku" == "HI180" -o "$sku" == "HI181" -o "$sku" == "HI185" -o "$sku" == "HI193" ]; then
+if check_simx && [ "$sku" == "HI180" -o "$sku" == "HI181" -o "$sku" == "HI185" -o "$sku" == "HI193" -o "$sku" == "HI194" ]; then

P1 HI194 missing from restart|force-reload action

The start action (this line) was correctly updated to include HI194, but the restart|force-reload action at line 4171 was not updated and still only handles HI180, HI181, HI185, and HI193:

# line 4171 (unchanged)
if check_simx && [ "$sku" == "HI180" -o "$sku" == "HI181" -o "$sku" == "HI185" -o "$sku" == "HI193" ]; then

When hw-management restart is called on a SimX HI194 platform, this condition will not match. The mock hw-management tree will not be extracted, do_start will be called instead, and the service restart will likely fail on the emulated platform. Line 4171 needs HI194 added:

if check_simx && [ "$sku" == "HI180" -o "$sku" == "HI181" -o "$sku" == "HI185" -o "$sku" == "HI193" -o "$sku" == "HI194" ]; then

The comment on line 4072 (# TEMPORARY hw-management mockup values for HI180/HI181/HI185/HI193 in simx) also still omits HI194 and should be updated for consistency.
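To find every remaining guard and stale comment in one pass, it may help to list the lines that mention HI193 but not HI194. A minimal sketch; `find_stale_sku_guards` is a hypothetical helper name, not something that exists in hw-mgmt:

```shell
# Print "LINENO:text" for every line that mentions the previous SKU but
# not the newly added one. Arguments: FILE OLD_SKU NEW_SKU.
# Returns non-zero when no such line is found (grep's usual convention).
find_stale_sku_guards()
{
    local file="$1" old_sku="$2" new_sku="$3"

    # -F treats the SKU names as fixed strings, -n prefixes line numbers.
    grep -n -F -- "$old_sku" "$file" | grep -v -F -- "$new_sku"
}
```

For example, `find_stale_sku_guards usr/usr/bin/hw-management.sh HI193 HI194` would be expected to report both the unchanged restart|force-reload guard and the TEMPORARY comment, since neither names HI194.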

Comment on lines +378 to +440
voltmon14_in3_input 765
voltmon14_in3_lcrit 470
voltmon14_power1_input 0
voltmon14_power2_input 0
voltmon14_power2_max 510000000
voltmon14_power3_input 0
voltmon14_power3_max 510000000
voltmon14_power4_input 0
voltmon14_power4_max 510000000
voltmon15_curr1_input 0
voltmon15_curr2_crit 64000
voltmon15_curr2_input 0
voltmon15_curr2_max 56000
voltmon15_curr3_crit 80000
voltmon15_curr3_input 1000
voltmon15_curr3_max 72000
voltmon15_curr4_crit 80000
voltmon15_curr4_input 1000
voltmon15_curr4_max 72000
voltmon15_in1_crit 16000
voltmon15_in1_input 12125
voltmon15_in1_lcrit 8000
voltmon15_in1_min 0
voltmon15_in2_crit 1360
voltmon15_in2_input 1200
voltmon15_in2_lcrit 870
voltmon15_in3_crit 900
voltmon15_in3_input 765
voltmon15_in3_lcrit 470
voltmon15_power1_input 0
voltmon15_power2_input 0
voltmon15_power2_max 510000000
voltmon15_power3_input 0
voltmon15_power3_max 510000000
voltmon15_power4_input 0
voltmon15_power4_max 510000000
voltmon16_curr1_input 0
voltmon16_curr2_crit 64000
voltmon16_curr2_input 0
voltmon16_curr2_max 56000
voltmon16_curr3_crit 56000
voltmon16_curr3_input 80000
voltmon16_curr3_max 1000
voltmon16_curr4_crit 72000
voltmon16_curr4_input 80000
voltmon16_curr4_max 1000
voltmon16_in1_crit 72000
voltmon16_in1_input 16000
voltmon16_in1_lcrit 12125
voltmon16_in1_min 8000
voltmon16_in2_crit 0
voltmon16_in2_input 1360
voltmon16_in2_lcrit 1200
voltmon16_in3_crit 870
voltmon16_in3_input 900
voltmon16_in3_lcrit 765
voltmon16_power1_input 470
voltmon16_power2_input 0
voltmon16_power2_max 0
voltmon16_power3_input 510000000
voltmon16_power3_max 0
voltmon16_power4_input 510000000
voltmon16_power4_max 0

P1 voltmon16 values appear systematically shifted vs all other entries

The voltmon16 block has values that look like a one-position rotation relative to the consistent pattern seen in voltmon2 through voltmon15. Comparing voltmon16 against, e.g., voltmon2:

| Field | voltmon2–15 pattern | voltmon16 actual |
| --- | --- | --- |
| curr3_crit | 80000 | 56000 (= voltmon2 curr2_max) |
| curr3_input | 1000 | 80000 (= voltmon2 curr3_crit) |
| curr3_max | 72000 | 1000 (= voltmon2 curr3_input) |
| in2_crit | 1360 | 0 |
| in2_input | 1200 | 1360 |
| power2_max | 510000000 | 0 |
| power3_input | 0 | 510000000 |
| power3_max | 510000000 | 0 |
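The comparison above can be reproduced mechanically by stripping the voltmonN_ prefix and comparing two blocks field by field. A minimal sketch, assuming the environment file keeps the `name value` layout shown in the diff; `compare_voltmon_blocks` is a hypothetical helper name:

```shell
# Print "FIELD REF_VALUE OTHER_VALUE" for every field present in both
# voltmon blocks whose values differ. Arguments: FILE REF_INDEX OTHER_INDEX.
compare_voltmon_blocks()
{
    local file="$1" ref="$2" other="$3"

    awk -v ref="voltmon${ref}_" -v other="voltmon${other}_" '
        # Key each value by its field name with the block prefix removed.
        index($1, ref) == 1   { ref_val[substr($1, length(ref) + 1)] = $2 }
        index($1, other) == 1 { other_val[substr($1, length(other) + 1)] = $2 }
        END {
            for (field in ref_val)
                if ((field in other_val) && ref_val[field] != other_val[field])
                    print field, ref_val[field], other_val[field]
        }
    ' "$file" | sort
}
```

Running `compare_voltmon_blocks usr/etc/hw-management-virtual/HI194/environment 15 16` should list the fields where voltmon16 departs from voltmon15.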

As a result, several threshold checks will fire during simulation: voltmon16_curr3_input (80000) exceeds voltmon16_curr3_crit (56000), and voltmon16_curr4_input (80000) exceeds voltmon16_curr4_crit (72000). If this is intentional (e.g., to exercise alarm paths), please add a comment to make this explicit. Otherwise the values appear to be a copy-paste error where the column order was rotated by one position.
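A quick sanity check over the whole mock file can flag every reading that sits outside its own limits. This is a sketch under assumptions: it treats `_crit` and `_max` as upper bounds and `_lcrit` as a lower bound for the matching `_input`, which may not be exactly how the monitoring consumers evaluate them, and `check_mock_thresholds` is a hypothetical helper name:

```shell
# Report every *_input value that exceeds its *_crit or *_max, or falls
# below its *_lcrit, in a "name value" mock environment file.
check_mock_thresholds()
{
    awk '
        { val[$1] = $2 }
        END {
            for (key in val) {
                if (key !~ /_input$/)
                    continue
                base  = substr(key, 1, length(key) - length("_input"))
                crit  = base "_crit"
                max   = base "_max"
                lcrit = base "_lcrit"
                # "+ 0" forces a numeric comparison.
                if ((crit in val) && val[key] + 0 > val[crit] + 0)
                    print key, val[key], "> crit", val[crit]
                if ((max in val) && val[key] + 0 > val[max] + 0)
                    print key, val[key], "> max", val[max]
                if ((lcrit in val) && val[key] + 0 < val[lcrit] + 0)
                    print key, val[key], "< lcrit", val[lcrit]
            }
        }
    ' "$1" | sort
}
```

If the voltmon16 values are intentional, the lines this prints for that block would be the alarm cases worth documenting in a comment.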

acoifmannvidia (Collaborator) left a comment

60.1000 is frozen,
