sky.serve.status() fails in Kubernetes pods due to check_network_connection() TLS handshake failure #8865

@lingniao2000

Description

Bug

sky.serve.status() (and all sky.serve.* operations) always fails with RuntimeError: Failed to refresh services status due to network error when the SkyPilot API Server runs inside a Kubernetes pod.

Root Cause

check_network_connection() in sky/backends/backend_utils.py is called before any serve operation:

_TEST_IP_LIST = ['https://8.8.8.8', 'https://1.1.1.1']

def check_network_connection():
    # Simplified excerpt; the retry count, timeout value, and the `http`
    # session object are elided from the real code.
    for _ in range(max_retries):
        for ip in _TEST_IP_LIST:
            try:
                http.head(ip, timeout=timeout)
                return
            except (requests.Timeout, requests.exceptions.ConnectionError):
                continue
    raise exceptions.NetworkError(...)

In K8s pods, this check fails for two reasons:

  1. TLS handshake failure: The container's OpenSSL stack cannot complete a TLS handshake with external HTTPS endpoints (SSLV3_ALERT_HANDSHAKE_FAILURE), even though raw TCP connectivity is fine: socket.create_connection(('1.1.1.1', 443)) succeeds, while requests.head('https://1.1.1.1') fails with SSLError. Because requests.exceptions.SSLError is a subclass of requests.exceptions.ConnectionError, the except clause above silently swallows the handshake failure and retries until the check gives up.

  2. Air-gapped / restricted networks: Many production K8s clusters have no public internet egress by policy. The SkyPilot Helm chart is designed to run in K8s, but the network check assumes public internet access.
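The SSLError-swallowing behavior in reason 1 can be confirmed without any network access, since it follows purely from the exception hierarchy that the requests library ships:

```python
import requests

# requests defines SSLError as a subclass of its own ConnectionError, so a
# TLS handshake failure is caught by
# `except (requests.Timeout, requests.exceptions.ConnectionError)` exactly
# like an ordinary connection failure -- the check cannot tell them apart.
print(issubclass(requests.exceptions.SSLError,
                 requests.exceptions.ConnectionError))  # True
```

This is why the check reports a generic "network error" even though the pod's TCP networking is healthy.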

The actual serve operations only need K8s-internal connectivity (API Server → Controller pod → Replica pods), which works perfectly fine.
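The connectivity that actually matters in K8s can be probed at the plain TCP level, with no TLS and no public endpoints. A minimal sketch (the controller address in the comment is a placeholder, not SkyPilot's real service name):

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False

# e.g. probe the controller pod's SSH port (placeholder address):
# tcp_reachable('skypilot-controller.svc.cluster.local', 22)
```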

Environment

  • SkyPilot: nightly (installed via Helm chart)
  • K8s: v1.28+
  • CNI: Cilium (but reproducible with any CNI)
  • K8s pods can reach each other and the K8s API server; ssh from API Server to Controller works fine
  • https://8.8.8.8 and https://1.1.1.1 fail due to TLS handshake failure from within the container

Steps to Reproduce

  1. Deploy SkyPilot API Server in K8s using the official Helm chart
  2. Start a service with sky serve up
  3. From within the API Server pod: sky.serve.status() → fails with "network error"
  4. Verify manually: SSH from API Server to Controller pod works; the service is actually READY and serving traffic

Expected Behavior

sky.serve.status() should succeed when the API Server can reach the Controller pod, regardless of public internet availability.

Suggested Fix

Skip the public internet check when running inside Kubernetes:

def check_network_connection():
    # In K8s, pod networking is managed by CNI; public internet may not be available.
    # All SkyPilot serve operations in K8s only need cluster-internal connectivity.
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        return
    # ... existing logic ...

KUBERNETES_SERVICE_HOST is set automatically by Kubernetes in every pod, so the guard needs no external dependency and leaves non-K8s environments unaffected.
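A self-contained sketch of the guard's behavior (this is a stand-in function, not the full SkyPilot implementation; the RuntimeError stands in for the existing probe logic):

```python
import os

def check_network_connection() -> None:
    # Sketch of the proposed guard only.
    if os.environ.get("KUBERNETES_SERVICE_HOST"):
        return  # in-cluster: cluster-internal networking is all we need
    raise RuntimeError("would fall through to the public-internet probe here")

# The kubelet injects this variable into every pod automatically; simulate it:
os.environ["KUBERNETES_SERVICE_HOST"] = "10.96.0.1"
check_network_connection()  # returns immediately, no external traffic
```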

Current Workaround

We inject a sed patch into the Deployment startup script (before exec sky api start) that adds the early return when KUBERNETES_SERVICE_HOST is detected.
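The actual patch is done with sed in the startup script; the Python equivalent below shows the text transformation against a stand-in snippet (not the installed backend_utils.py, whose exact surrounding lines may differ):

```python
# Stand-in for the top of check_network_connection() in backend_utils.py.
original = (
    "def check_network_connection():\n"
    "    for _ in range(max_retries):\n"
)

# The early return the workaround injects at the top of the function:
early_return = (
    "def check_network_connection():\n"
    "    import os\n"
    "    if os.environ.get('KUBERNETES_SERVICE_HOST'):\n"
    "        return\n"
)

patched = original.replace("def check_network_connection():\n",
                           early_return, 1)
print(patched)
```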
