Skip to content

Fix too_many_pings during DPU SetPackage by coordinating keepalive settings#620

Open
sigabrtv1-ui wants to merge 9 commits intosonic-net:masterfrom
sigabrtv1-ui:fix/keepalive-too-many-pings
Open

Fix too_many_pings during DPU SetPackage by coordinating keepalive settings#620
sigabrtv1-ui wants to merge 9 commits intosonic-net:masterfrom
sigabrtv1-ui:fix/keepalive-too-many-pings

Conversation

@sigabrtv1-ui
Copy link
Copy Markdown
Contributor

What I did

Fixed a keepalive mismatch between the DPU proxy client and gNMI server that causes ENHANCE_YOUR_CALM / "too_many_pings" disconnections during long-running SetPackage operations on smartswitch DPUs.

How I did it

  1. Proxy client (pkg/interceptors/dpuproxy/proxy.go): Increased keepalive ping interval from 10s → 30s and timeout from 3s → 10s.
  2. gNMI server (telemetry/telemetry.go): Added explicit KeepaliveEnforcementPolicy with MinTime: 20s so the server tolerates the proxy ping frequency instead of using the default 5-minute minimum.

Why I did it

During DPU image installation via SetPackage, sonic-installer install can run for several minutes. The proxy client was pinging every 10s, but the DPU server's default enforcement policy (MinTime=5m) treated this as a flood, terminating the connection with too_many_pings.

The 30s client interval with 20s server minimum provides a safe margin while still detecting dead connections reasonably fast.

How to verify it

  1. Deploy on a smartswitch with DPU
  2. Run SetPackage gNOI RPC targeting a DPU with a valid image
  3. Verify the installation completes without too_many_pings disconnection
  4. Verify keepalive still detects dead connections (kill the DPU gNMI server mid-operation)

Fixes: #619
Related: sonic-net/sonic-mgmt#23054

@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

ryanzhu706
ryanzhu706 previously approved these changes Mar 18, 2026
@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

grpc.WithKeepaliveParams(keepalive.ClientParameters{
Time: 10 * time.Second, // Send keepalive ping every 10s
Timeout: 3 * time.Second, // Wait 3s for ping ack before considering connection dead
Time: 30 * time.Second, // Send keepalive ping every 30s
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a configurable knob

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment below. Seems the bot can't comment since it is not in the org.

Comment thread telemetry/telemetry.go
// frequently during long-running operations like SetPackage.
// See: https://github.com/sonic-net/sonic-gnmi/issues/619
keep_alive_policy := keepalive.EnforcementPolicy{
MinTime: 20 * time.Second, // Allow pings as frequent as every 20s
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above - should this be a configurable knob?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment below. Seems the bot can't comment since it is not in the org.

@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

Copy link
Copy Markdown
Contributor Author

@sigabrtv1-ui sigabrtv1-ui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed review feedback: keepalive values are now configurable via environment variables (GNMI_CLIENT_KEEPALIVE_INTERVAL, GNMI_CLIENT_KEEPALIVE_TIMEOUT, GNMI_SERVER_KEEPALIVE_MIN_TIME). Defaults remain the same so behavior is unchanged out of the box.

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sigabrtv1-ui
Copy link
Copy Markdown
Contributor Author

@vaibhavhd Good call — addressed in the latest commit (c77261e). All three keepalive values are now configurable via environment variables:

  • GNMI_CLIENT_KEEPALIVE_INTERVAL — client ping interval (default: 30s)
  • GNMI_CLIENT_KEEPALIVE_TIMEOUT — client ping ack timeout (default: 10s)
  • GNMI_SERVER_KEEPALIVE_MIN_TIME — server minimum allowed ping interval (default: 20s)

Defaults remain the same so behavior is unchanged out of the box.

@sigabrtv1-ui sigabrtv1-ui force-pushed the fix/keepalive-too-many-pings branch from c77261e to ae20fed Compare March 19, 2026 02:55
@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sigabrtv1-ui sigabrtv1-ui force-pushed the fix/keepalive-too-many-pings branch from ae20fed to 16b2092 Compare March 19, 2026 14:40
@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

hdwhdw and others added 4 commits March 19, 2026 15:50
SmartSwitch DPU runs on ARM64, so we need an ARM64 deb artifact.

- Add BuildArm64 stage that runs after Build succeeds
- Uses sonicso1ES-arm64 pool with sonic-slave-bookworm arm64 container
- New install-dependencies-arm64.yml template for arm64 deps
- Publishes sonic-gnmi.arm64 artifact

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Download libnl debs (libnl-3-200, libnl-genl-3-200, libnl-route-3-200,
libnl-nf-3-200) from common-lib.arm64 artifact alongside libyang, matching
the pattern used by the amd64 install-dependencies template.

Signed-off-by: sigabrtv1-ui <sig.abrt.v1@gmail.com>

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
…ttings

The DPU proxy client was sending keepalive pings every 10s, but the gRPC
server uses the default EnforcementPolicy with MinTime=5m. During long-running
SetPackage operations (sonic-installer can take minutes), this mismatch
causes the server to terminate the connection with GOAWAY/ENHANCE_YOUR_CALM
and debug data 'too_many_pings'.

Fix:
1. Increase proxy client keepalive interval from 10s to 30s and timeout
   from 3s to 10s for better resilience during long operations.
2. Add explicit KeepaliveEnforcementPolicy to the gNMI server allowing
   pings as frequent as every 20s, with a safety margin above the client's
   30s interval.

Fixes: sonic-net#619
Related: sonic-net/sonic-mgmt#23054

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@sigabrtv1-ui sigabrtv1-ui force-pushed the fix/keepalive-too-many-pings branch from 16b2092 to 883512e Compare March 19, 2026 15:50
@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@hdwhdw hdwhdw requested a review from prsunny March 20, 2026 14:58
Comment thread pkg/interceptors/dpuproxy/proxy.go Outdated
Time: 10 * time.Second, // Send keepalive ping every 10s
Timeout: 3 * time.Second, // Wait 3s for ping ack before considering connection dead
PermitWithoutStream: true, // Send pings even when no active RPCs
Time: getEnvDuration("GNMI_CLIENT_KEEPALIVE_INTERVAL", 30*time.Second),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like getting configurable values from OS environment variables. We have tools (config db?) that we should leverage. May be the agent isn't aware of it, and needs some SONiC specific training to understand how we expect these knobs to be implemented.

@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sigabrtv1-ui sigabrtv1-ui force-pushed the fix/keepalive-too-many-pings branch from 447676e to 1cfdf4b Compare March 20, 2026 16:31
@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

vaibhavhd
vaibhavhd previously approved these changes Mar 20, 2026
Copy link
Copy Markdown

@vaibhavhd vaibhavhd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving based on offline discussion about a follow up PR where the timeouts/settings will be introduced as config.

hdwhdw
hdwhdw previously approved these changes Mar 20, 2026
Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@sigabrtv1-ui sigabrtv1-ui dismissed stale reviews from hdwhdw and vaibhavhd via b29bd22 March 25, 2026 17:29
@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

…v vars

Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

- Unify install-dependencies.yml with arch parameter (amd64/arm64)
  instead of maintaining separate install-dependencies-arm64.yml
- Add installTestDeps parameter to skip pytest/redis/.NET for build-only jobs
- Extract build-deb.yml template for shared checkout + build + publish flow
- ARM64 stage now uses build-deb.yml template instead of inline steps
- Remove install-dependencies-arm64.yml (merged into unified template)

Signed-off-by: sigabrtv1-ui <sig.abrt.v1@gmail.com>
@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines failed to run 1 pipeline(s).

Switch amd64 from downloading the full sonic-buildimage.vs artifact to
using Azure.sonic-buildimage.common_libs (artifact: common-lib), matching
the pattern used by sonic-sairedis and other SONiC repos.

- Both architectures now use the same download step with arch-aware artifact name
- Yang wheels still downloaded separately from pipeline 142 (arch-independent)
- Eliminates arch-specific branching for libyang/libnl download and install
- Smaller artifact downloads for amd64 (common-lib vs full sonic-buildimage.vs)

Signed-off-by: sigabrtv1-ui <sig.abrt.v1@gmail.com>
@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines failed to run 1 pipeline(s).

Azure Pipelines doesn't allow ${{ if }} directives embedded in strings.
Replace with explicit parameters (commonLibArtifact, swssCommonArtifact,
publishArtifact) and separate conditional blocks for arch-specific steps.

Fixes validation errors:
- 'The directive if is not allowed in this context'
- Directives not supported for expressions embedded within a string

Signed-off-by: sigabrtv1-ui <sig.abrt.v1@gmail.com>
@mssonicbld
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DPU SetPackage fails with too_many_pings during image installation

6 participants