Skip to content

Replace docker exec gnoi_client with native Python gRPC calls#367

Open
sigabrtv1-ui wants to merge 13 commits intosonic-net:masterfrom
sigabrtv1-ui:feature/gnoi-native-grpc
Open

Replace docker exec gnoi_client with native Python gRPC calls#367
sigabrtv1-ui wants to merge 13 commits intosonic-net:masterfrom
sigabrtv1-ui:feature/gnoi-native-grpc

Conversation

@sigabrtv1-ui
Copy link
Copy Markdown

What I did

Refactored gnoi_shutdown_daemon to use direct Python gRPC calls instead of shelling out to docker exec gnmi gnoi_client.

Why I did it

Three issues in the current subprocess-based approach:

  1. Broken RebootStatus detection — the poll checks "reboot complete" in out_s.lower(), but gnoi_client outputs JSON ({"active":false,"status":{"status":"STATUS_SUCCESS"}}). The substring never matches, so the poll always times out regardless of DPU state.

  2. Suppressed error context_send_reboot_command() uses suppress_stderr=True, discarding Go panic traces from gnoi_client. The daemon logs only "Reboot command failed" with no actionable detail.

  3. Unnecessary NPU gnmi container dependency — the current path shells into the NPU gnmi container to run gnoi_client, but the DPU's own gnmi server is the actual RPC endpoint. Direct gRPC connects without this intermediary.

How I did it

  • Vendored gNOI System proto Python stubs (from openconfig/gnoi) into host_modules/gnoi/
  • Added a GnoiClient wrapper (client.py) with reboot(), reboot_status(), cancel_reboot() methods
  • Refactored _send_reboot_commandGnoiClient.reboot() with proper grpc.RpcError handling (status codes + details)
  • Refactored _poll_reboot_status → checks resp.active and resp.status.status protobuf fields directly
  • Added grpcio and protobuf to setup.py dependencies
  • Updated all unit tests + added new coverage for gRPC error paths

How to verify it

On a SmartSwitch with DPU:

  1. Set DPU admin_status to down in CONFIG_DB
  2. Verify the daemon sends gNOI Reboot HALT via direct gRPC (no docker exec in logs)
  3. Verify RebootStatus polling correctly detects completion (should NOT time out when DPU reports success)
  4. Check syslog for structured error messages on failures (status code + details)

Design doc: #361

hdwhdw added 10 commits March 21, 2026 03:44
Proposes replacing subprocess-based gnoi_client invocations in
gnoi_shutdown_daemon with direct Python gRPC calls.

Documents the gnoi_client output format, a parsing bug in RebootStatus
polling, and the difficulty of diagnosing RPC failures through Go panic
stack traces.

Ref: sonic-net#360

Signed-off-by: Dawei Huang <[email protected]>
- Add types_pb2.py to vendored stubs file list (types.proto dependency)
- Fix RebootStatus enum access pattern for protobuf Python codegen
- Fix test mock to implement grpc.Call interface (code()/details())

Signed-off-by: Dawei Huang <[email protected]>
- Correct stderr statement: _send_reboot_command uses suppress_stderr=True,
  so panic output is suppressed, not captured
- Add packaging note: setup.py must include host_modules.gnoi in packages
  list for vendored stubs to be installed

Signed-off-by: Dawei Huang <[email protected]>
Section 8 now links to gnoi_client, reboot.go, and system.proto
instead of inlining full source code.

Signed-off-by: Dawei Huang <[email protected]>
Refactor gnoi_shutdown_daemon to use direct gRPC instead of shelling
out to 'docker exec gnmi gnoi_client'. This addresses three issues:

1. Broken RebootStatus completion detection - the old code checked for
   'reboot complete' in stdout, but gnoi_client outputs JSON with
   'STATUS_SUCCESS', so the poll always timed out regardless of DPU state.

2. Suppressed error context - _send_reboot_command used suppress_stderr=True,
   discarding Go panic traces. Now gRPC errors carry status codes and details
   natively (e.g. UNAVAILABLE: connection refused).

3. Unnecessary NPU gnmi container dependency - the DPU's own gnmi server is
   the actual RPC endpoint; direct gRPC connects without involving the NPU
   gnmi container as an intermediary.

Changes:
- Vendor gNOI System proto Python stubs (from openconfig/gnoi) into
  host_modules/gnoi/ (system_pb2, types_pb2, common_pb2)
- Add GnoiClient wrapper (client.py) with reboot(), reboot_status(),
  cancel_reboot() methods and context manager support
- Refactor _send_reboot_command to use GnoiClient.reboot() with proper
  grpc.RpcError handling
- Refactor _poll_reboot_status to check resp.active and resp.status.status
  protobuf fields directly instead of string matching
- Add grpcio and protobuf to setup.py install_requires
- Add host_modules.gnoi to setup.py packages list
- Update all unit tests to mock GnoiClient instead of subprocess calls
- Add new tests: gRPC error handling, transient error recovery,
  failure status detection

Signed-off-by: sigabrtv1-ui <[email protected]>

Signed-off-by: Dawei Huang <[email protected]>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@hdwhdw hdwhdw requested review from hdwhdw, judyjoseph and vvolam March 30, 2026 16:20
Integrate upstream sonic-net#352 (_should_skip_gnoi_shutdown for offline/powered-down
DPUs) with native gRPC refactor. All 37 tests pass.

Signed-off-by: sigabrtv1-ui <[email protected]>

Signed-off-by: Dawei Huang <[email protected]>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

from host_modules.gnoi import system_pb2_grpc


class GnoiClient:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might have other gnoi services. Should not pin it to System.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — refactored in 5b12fba. GnoiClient now manages just the gRPC channel and exposes service stubs via properties (client.system). Future gNOI services (Healthz, Cert, etc.) can add their own @property or use client.channel directly for custom stubs.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call should be routed from through service. client.system.reboot instead of client.reboot.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 9fa8000 — removed all convenience methods. Callers now go through client.system.Reboot() / client.system.RebootStatus() directly. GnoiClient just manages the channel + exposes service stubs via properties.

GnoiClient now manages the gRPC channel and exposes service stubs via
properties (e.g. client.system). The System convenience methods (reboot,
reboot_status, cancel_reboot) remain for backward compat, but future
gNOI services (Healthz, Cert, etc.) can add their own stubs without
modifying the client class.

Also exposes client.channel for direct stub construction when needed.

Signed-off-by: sigabrtv1-ui <[email protected]>

Signed-off-by: Dawei Huang <[email protected]>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Remove convenience methods from GnoiClient. All RPCs now go through
service stub properties: client.system.Reboot(), client.system.RebootStatus().

This keeps the client clean for future gNOI services — each service
gets its own stub property, callers use the standard protobuf API.

Signed-off-by: sigabrtv1-ui <[email protected]>

Signed-off-by: Dawei Huang <[email protected]>
@mssonicbld
Copy link
Copy Markdown

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants