Fix too_many_pings during DPU SetPackage by coordinating keepalive settings#620
Fix too_many_pings during DPU SetPackage by coordinating keepalive settings#620sigabrtv1-ui wants to merge 9 commits intosonic-net:masterfrom
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
5a3d4dd to
436961b
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| grpc.WithKeepaliveParams(keepalive.ClientParameters{ | ||
| Time: 10 * time.Second, // Send keepalive ping every 10s | ||
| Timeout: 3 * time.Second, // Wait 3s for ping ack before considering connection dead | ||
| Time: 30 * time.Second, // Send keepalive ping every 30s |
There was a problem hiding this comment.
See the comment below. Seems the bot can't comment since it is not in the org.
| // frequently during long-running operations like SetPackage. | ||
| // See: https://github.com/sonic-net/sonic-gnmi/issues/619 | ||
| keep_alive_policy := keepalive.EnforcementPolicy{ | ||
| MinTime: 20 * time.Second, // Allow pings as frequent as every 20s |
There was a problem hiding this comment.
Same as above - should this be a configurable knob?
There was a problem hiding this comment.
See the comment below. Seems the bot can't comment since it is not in the org.
|
/azp run |
sigabrtv1-ui
left a comment
There was a problem hiding this comment.
Addressed review feedback: keepalive values are now configurable via environment variables (GNMI_CLIENT_KEEPALIVE_INTERVAL, GNMI_CLIENT_KEEPALIVE_TIMEOUT, GNMI_SERVER_KEEPALIVE_MIN_TIME). Defaults remain the same so behavior is unchanged out of the box.
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@vaibhavhd Good call — addressed in the latest commit (c77261e). All three keepalive values are now configurable via environment variables:
Defaults remain the same so behavior is unchanged out of the box. |
c77261e to
ae20fed
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
ae20fed to
16b2092
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
SmartSwitch DPU runs on ARM64, so we need an ARM64 deb artifact. - Add BuildArm64 stage that runs after Build succeeds - Uses sonicso1ES-arm64 pool with sonic-slave-bookworm arm64 container - New install-dependencies-arm64.yml template for arm64 deps - Publishes sonic-gnmi.arm64 artifact Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
Download libnl debs (libnl-3-200, libnl-genl-3-200, libnl-route-3-200, libnl-nf-3-200) from common-lib.arm64 artifact alongside libyang, matching the pattern used by the amd64 install-dependencies template. Signed-off-by: sigabrtv1-ui <sig.abrt.v1@gmail.com> Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
…ttings The DPU proxy client was sending keepalive pings every 10s, but the gRPC server uses the default EnforcementPolicy with MinTime=5m. During long-running SetPackage operations (sonic-installer can take minutes), this mismatch causes the server to terminate the connection with GOAWAY/ENHANCE_YOUR_CALM and debug data 'too_many_pings'. Fix: 1. Increase proxy client keepalive interval from 10s to 30s and timeout from 3s to 10s for better resilience during long operations. 2. Add explicit KeepaliveEnforcementPolicy to the gNMI server allowing pings as frequent as every 20s, with a safety margin above the client's 30s interval. Fixes: sonic-net#619 Related: sonic-net/sonic-mgmt#23054 Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
16b2092 to
883512e
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| Time: 10 * time.Second, // Send keepalive ping every 10s | ||
| Timeout: 3 * time.Second, // Wait 3s for ping ack before considering connection dead | ||
| PermitWithoutStream: true, // Send pings even when no active RPCs | ||
| Time: getEnvDuration("GNMI_CLIENT_KEEPALIVE_INTERVAL", 30*time.Second), |
There was a problem hiding this comment.
I don't like getting configurable values from OS environment variables. We have tools (config db?) that we should leverage. May be the agent isn't aware of it, and needs some SONiC specific training to understand how we expect these knobs to be implemented.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
447676e to
1cfdf4b
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
vaibhavhd
left a comment
There was a problem hiding this comment.
Approving based on offline discussion about a follow up PR where the timeouts/settings will be introduced as config.
Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
…v vars Signed-off-by: Dawei Huang <daweihuang@microsoft.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
- Unify install-dependencies.yml with arch parameter (amd64/arm64) instead of maintaining separate install-dependencies-arm64.yml - Add installTestDeps parameter to skip pytest/redis/.NET for build-only jobs - Extract build-deb.yml template for shared checkout + build + publish flow - ARM64 stage now uses build-deb.yml template instead of inline steps - Remove install-dependencies-arm64.yml (merged into unified template) Signed-off-by: sigabrtv1-ui <sig.abrt.v1@gmail.com>
|
/azp run |
|
Azure Pipelines failed to run 1 pipeline(s). |
Switch amd64 from downloading the full sonic-buildimage.vs artifact to using Azure.sonic-buildimage.common_libs (artifact: common-lib), matching the pattern used by sonic-sairedis and other SONiC repos. - Both architectures now use the same download step with arch-aware artifact name - Yang wheels still downloaded separately from pipeline 142 (arch-independent) - Eliminates arch-specific branching for libyang/libnl download and install - Smaller artifact downloads for amd64 (common-lib vs full sonic-buildimage.vs) Signed-off-by: sigabrtv1-ui <sig.abrt.v1@gmail.com>
|
/azp run |
|
Azure Pipelines failed to run 1 pipeline(s). |
Azure Pipelines doesn't allow ${{ if }} directives embedded in strings.
Replace with explicit parameters (commonLibArtifact, swssCommonArtifact,
publishArtifact) and separate conditional blocks for arch-specific steps.
Fixes validation errors:
- 'The directive if is not allowed in this context'
- Directives not supported for expressions embedded within a string
Signed-off-by: sigabrtv1-ui <sig.abrt.v1@gmail.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
What I did
Fixed a keepalive mismatch between the DPU proxy client and gNMI server that causes
ENHANCE_YOUR_CALM/"too_many_pings"disconnections during long-runningSetPackageoperations on smartswitch DPUs.How I did it
pkg/interceptors/dpuproxy/proxy.go): Increased keepalive ping interval from 10s → 30s and timeout from 3s → 10s.telemetry/telemetry.go): Added explicitKeepaliveEnforcementPolicywithMinTime: 20sso the server tolerates the proxy ping frequency instead of using the default 5-minute minimum.Why I did it
During DPU image installation via
SetPackage,sonic-installer installcan run for several minutes. The proxy client was pinging every 10s, but the DPU server's default enforcement policy (MinTime=5m) treated this as a flood, terminating the connection withtoo_many_pings.The 30s client interval with 20s server minimum provides a safe margin while still detecting dead connections reasonably fast.
How to verify it
SetPackagegNOI RPC targeting a DPU with a valid imagetoo_many_pingsdisconnectionFixes: #619
Related: sonic-net/sonic-mgmt#23054