## Bug

`sky.serve.status()` (and all `sky.serve.*` operations) always fails with `RuntimeError: Failed to refresh services status due to network error` when the SkyPilot API Server runs inside a Kubernetes pod.
## Root Cause

`check_network_connection()` in `sky/backends/backend_utils.py` is called before any serve operation:

```python
_TEST_IP_LIST = ['https://8.8.8.8', 'https://1.1.1.1']

def check_network_connection():
    for _ in range(max_retries):
        for ip in _TEST_IP_LIST:
            try:
                http.head(ip, timeout=timeout)
                return
            except (requests.Timeout, requests.exceptions.ConnectionError):
                continue
    raise exceptions.NetworkError(...)
```
In K8s pods, this check fails for two reasons:

1. **TLS handshake failure:** The container's OpenSSL cannot complete a TLS handshake with external HTTPS endpoints (`SSLV3_ALERT_HANDSHAKE_FAILURE`), even when TCP connectivity works fine (e.g., `socket.create_connection(('1.1.1.1', 443))` succeeds but `requests.head('https://1.1.1.1')` fails with `SSLError`, which is caught as `ConnectionError`).
2. **Air-gapped / restricted networks:** Many production K8s clusters have no public internet egress by policy. The SkyPilot Helm chart is designed to run in K8s, but the network check assumes public internet access.
The actual serve operations only need K8s-internal connectivity (API Server → Controller pod → Replica pods), which works perfectly fine.
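The discrepancy is easy to demonstrate: the connectivity serve actually needs can be probed at the TCP layer, with no TLS involved. A minimal sketch (the helper name and signature are ours, not SkyPilot's):

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a plain TCP connection to (host, port) succeeds.

    Unlike requests.head('https://...'), this performs no TLS handshake,
    so it matches the cluster-internal connectivity that serve operations
    actually rely on.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Inside the affected pod, `tcp_reachable('1.1.1.1', 443)` returns `True` while `requests.head('https://1.1.1.1')` raises `SSLError`: exactly the mismatch described above.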
## Environment

- SkyPilot: nightly (installed via Helm chart)
- K8s: v1.28+
- CNI: Cilium (but reproducible with any CNI)
- K8s pods can reach each other and the K8s API server; `ssh` from the API Server to the Controller works fine
- `https://8.8.8.8` and `https://1.1.1.1` fail due to TLS handshake failure from within the container
## Steps to Reproduce

1. Deploy the SkyPilot API Server in K8s using the official Helm chart
2. Start a service with `sky serve up`
3. From within the API Server pod, call `sky.serve.status()` → fails with "network error"
4. Verify manually: SSH from the API Server to the Controller pod works; the service is actually READY and serving traffic
## Expected Behavior

`sky.serve.status()` should succeed when the API Server can reach the Controller pod, regardless of public internet availability.
## Suggested Fix

Skip the public internet check when running inside Kubernetes:

```python
def check_network_connection():
    # In K8s, pod networking is managed by the CNI; public internet may
    # not be available. All SkyPilot serve operations in K8s only need
    # cluster-internal connectivity.
    if os.environ.get('KUBERNETES_SERVICE_HOST'):
        return
    # ... existing logic ...
```
`KUBERNETES_SERVICE_HOST` is automatically set by Kubernetes in every pod: zero external dependencies, and non-K8s environments are unaffected.
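The guard is also trivially unit-testable if the environment mapping is injectable. A hedged sketch (the function name and parameter are ours, not proposed SkyPilot API):

```python
import os


def running_in_kubernetes(environ=None) -> bool:
    """Return True iff we appear to be running inside a K8s pod.

    KUBERNETES_SERVICE_HOST is injected by the kubelet into every pod,
    so the check needs no extra dependency and no network round-trip.
    """
    if environ is None:
        environ = os.environ
    return bool(environ.get('KUBERNETES_SERVICE_HOST'))
```

With this helper, `check_network_connection()` would simply begin with `if running_in_kubernetes(): return`.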
## Current Workaround

We inject a `sed` patch into the Deployment startup script (before `exec sky api start`) to add the early return when `KUBERNETES_SERVICE_HOST` is detected.
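For anyone who would rather not rewrite files on disk, the same early return can be applied in-process at startup. A hypothetical sketch (the decorator name is ours; the target function is the one quoted in this report):

```python
import functools
import os


def skip_in_kubernetes(func):
    """Wrap a connectivity check so it becomes a no-op inside a K8s pod."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if os.environ.get('KUBERNETES_SERVICE_HOST'):
            return None  # cluster-internal connectivity is sufficient
        return func(*args, **kwargs)
    return wrapper


# Applied once, before the API server starts serving, e.g.:
# backend_utils.check_network_connection = skip_in_kubernetes(
#     backend_utils.check_network_connection)
```

This is only a stopgap with the same effect as the `sed` patch; the real fix belongs upstream in `check_network_connection()` itself.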