Commit 93170fa

t0mdavid-m and claude authored
Claude/kubernetes migration plan kq jw d (#358)
* Add Kubernetes manifests and CI workflows for de.NBI migration

  Decompose the monolithic Docker container into Kubernetes workloads:
  - Streamlit Deployment with health probes and session affinity
  - Redis Deployment + Service for job queue
  - RQ Worker Deployment for background workflows
  - CronJob for workspace cleanup
  - Ingress with WebSocket support and cookie-based sticky sessions
  - Shared PVC (ReadWriteMany) for workspace data
  - ConfigMap for runtime configuration (replaces build-time settings)
  - Kustomize base + template-app overlay for multi-app deployment

  Code changes:
  - Remove unsafe enableCORS=false and enableXsrfProtection=false from config.toml
  - Make workspace path configurable via WORKSPACES_DIR env var in clean-up-workspaces.py

  CI/CD:
  - Add build-and-push-image.yml to push Docker images to ghcr.io
  - Add k8s-manifests-ci.yml for manifest validation and kind integration tests

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix kubeconform validation to skip kustomization.yaml

  kustomization.yaml is a Kustomize config file, not a standard K8s resource, so kubeconform has no schema for it. Exclude it via -ignore-filename-pattern.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add matrix strategy to test both Dockerfiles in integration tests

  The integration-test job now uses a matrix with Dockerfile_simple and Dockerfile. Each matrix entry checks if its Dockerfile exists before running; all steps are guarded with an `if` condition so they skip gracefully when a Dockerfile is absent. This allows downstream forks that only have one Dockerfile to pass CI without errors.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Adapt K8s base manifests for de.NBI Cinder CSI storage

  - Switch workspace PVC from ReadWriteMany to ReadWriteOnce with cinder-csi storage class (required by de.NBI KKP cluster)
  - Increase PVC storage to 500Gi
  - Add namespace: openms to kustomization.yaml
  - Reduce pod resource requests (1Gi/500m) and limits (8Gi/4 CPU) so all workspace-mounting pods fit on a single node

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add pod affinity rules to co-locate all workspace pods on same node

  The workspaces PVC uses ReadWriteOnce (Cinder CSI block storage) which requires all pods mounting it to run on the same node. Without explicit affinity rules, the scheduler was failing silently, leaving pods in Pending state with no events.

  Adds a `volume-group: workspaces` label and podAffinity with requiredDuringSchedulingIgnoredDuringExecution to streamlit deployment, rq-worker deployment, and cleanup cronjob. This ensures the scheduler explicitly co-locates all workspace-consuming pods on the same node.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: wait for ingress-nginx admission webhook before deploying

  The controller pod being Ready doesn't guarantee the admission webhook service is accepting connections. Add a polling loop that waits for the webhook endpoint to have an IP assigned before applying the Ingress resource, preventing "connection refused" errors during kustomize apply.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: add -n openms namespace to integration test steps

  The kustomize overlay deploys into the openms namespace, but the verification steps (Redis wait, Redis ping, deployment checks) were querying the default namespace, causing "no matching resources found".

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: retry kustomize deploy for webhook readiness

  Replace the unreliable endpoint-IP polling with a retry loop on kubectl apply (up to 5 attempts with backoff). This handles the race where the ingress-nginx admission webhook has an endpoint IP but isn't yet accepting TCP connections.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix REDIS_URL to use prefixed service name in overlay

  Kustomize namePrefix renames the Redis service to template-app-redis, but the REDIS_URL env var in streamlit and rq-worker deployments still referenced the unprefixed name "redis", causing the rq-worker to CrashLoopBackOff with "Name or service not known". Add JSON patches in the overlay to set the correct prefixed hostname.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Add Traefik IngressRoute for direct LB IP access

  The cluster uses Traefik, not nginx, so the nginx Ingress annotations are ignored. Add a Traefik IngressRoute with PathPrefix(/) catch-all routing and sticky session cookie for Streamlit session affinity.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix CI: skip Traefik IngressRoute CRD in validation and integration tests

  kubeconform doesn't know the Traefik IngressRoute CRD schema, and the kind cluster in integration tests doesn't have Traefik installed. Skip the IngressRoute in kubeconform validation and filter it out with yq before applying to the kind cluster.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* Fix IngressRoute service name for kustomize namePrefix

  Kustomize namePrefix doesn't rewrite service references inside CRDs, so the IngressRoute was pointing to 'streamlit' instead of 'template-app-streamlit', causing Traefik to return 404.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: use ConfigMap as settings override instead of full replacement

  The ConfigMap was replacing the entire settings.json, losing keys like "version" and "repository-name" that the app expects (causing KeyError). Now the ConfigMap only contains deployment-specific overrides, which are merged into the Docker image's base settings.json at container startup using jq.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

* fix: add set -euo pipefail to fail fast on settings merge error

  Addresses CodeRabbit review: if jq merge fails, the container should not start with unmerged settings.

  https://claude.ai/code/session_01RNJ3dVjV1VTHcC9ugE3FQJ

---------

Co-authored-by: Claude <[email protected]>
1 parent 6a2bc03 commit 93170fa

7 files changed

Lines changed: 57 additions & 38 deletions

.github/workflows/k8s-manifests-ci.yml

Lines changed: 7 additions & 2 deletions
@@ -21,7 +21,10 @@ jobs:

       - name: Validate K8s manifests (base)
         run: |
-          kubeconform -summary -strict -kubernetes-version 1.28.0 -ignore-filename-pattern 'kustomization.yaml' k8s/base/*.yaml
+          kubeconform -summary -strict -kubernetes-version 1.28.0 \
+            -ignore-filename-pattern 'kustomization.yaml' \
+            -ignore-filename-pattern 'traefik-ingressroute.yaml' \
+            k8s/base/*.yaml

       - name: Install kubectl
         uses: azure/setup-kubectl@v3
@@ -33,7 +36,7 @@ jobs:

       - name: Validate kustomized output
         run: |
-          kubectl kustomize k8s/overlays/template-app/ | kubeconform -summary -strict -kubernetes-version 1.28.0
+          kubectl kustomize k8s/overlays/template-app/ | kubeconform -summary -strict -kubernetes-version 1.28.0 -skip IngressRoute

   integration-test:
     runs-on: ubuntu-latest
@@ -83,7 +86,9 @@ jobs:
       - name: Deploy with Kustomize
         if: steps.check.outputs.exists == 'true'
         run: |
+          # Filter out Traefik CRDs (kind cluster uses nginx, not Traefik)
           kubectl kustomize k8s/overlays/template-app/ | \
+            yq 'select(.kind != "IngressRoute")' | \
             sed 's|imagePullPolicy: IfNotPresent|imagePullPolicy: Never|g' > /tmp/manifests.yaml
           for i in 1 2 3 4 5; do
             if kubectl apply -f /tmp/manifests.yaml; then
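
The hunk above ends at the first line of the retry loop, so its body is not visible here. The commit message describes it as "up to 5 attempts with backoff". As a rough sketch of that pattern only, not the workflow's actual code, the helper below retries any command with a linearly growing delay. The function name `retry_with_backoff`, its arguments, and the delay schedule are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the workflow): run a command up to
# MAX_ATTEMPTS times, sleeping attempt * BASE_DELAY seconds after each failure.
retry_with_backoff() {
  max_attempts=$1
  base_delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Success on any attempt ends the loop immediately.
    if "$@"; then
      return 0
    fi
    # Only sleep if another attempt is still to come.
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "attempt ${attempt}/${max_attempts} failed; retrying" >&2
      sleep "$((attempt * base_delay))"
    fi
    attempt=$((attempt + 1))
  done
  echo "giving up after ${max_attempts} attempts" >&2
  return 1
}
```

Under these assumptions, the deploy step could call it as `retry_with_backoff 5 5 kubectl apply -f /tmp/manifests.yaml`. Retrying the apply itself, rather than polling for an endpoint IP, matches the reasoning in the commit message: the webhook can have an IP before it accepts TCP connections.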

k8s/base/configmap.yaml

Lines changed: 2 additions & 32 deletions
@@ -3,37 +3,7 @@ kind: ConfigMap
 metadata:
   name: streamlit-config
 data:
-  settings.json: |
+  settings-overrides.json: |
     {
-      "app-name": "OpenMS WebApp Template",
-      "online_deployment": true,
-      "enable_workspaces": true,
-      "workspaces_dir": "..",
-      "queue_settings": {
-        "default_timeout": 7200,
-        "result_ttl": 86400
-      },
-      "demo_workspaces": {
-        "enabled": true,
-        "source_dirs": ["example-data/workspaces"]
-      },
-      "max_threads": {
-        "local": 4,
-        "online": 2
-      },
-      "analytics": {
-        "matomo": {
-          "enabled": true,
-          "url": "https://cdn.matomo.cloud/openms.matomo.cloud",
-          "tag": "yDGK8bfY"
-        },
-        "google-analytics": {
-          "enabled": false,
-          "tag": ""
-        },
-        "piwik-pro": {
-          "enabled": false,
-          "tag": ""
-        }
-      }
+      "online_deployment": true
     }

k8s/base/kustomization.yaml

Lines changed: 1 addition & 0 deletions
@@ -12,4 +12,5 @@ resources:
   - streamlit-service.yaml
   - rq-worker-deployment.yaml
   - ingress.yaml
+  - traefik-ingressroute.yaml
   - cleanup-cronjob.yaml

k8s/base/rq-worker-deployment.yaml

Lines changed: 4 additions & 2 deletions
@@ -32,7 +32,9 @@ spec:
           command: ["/bin/bash", "-c"]
           args:
             - |
+              set -euo pipefail
               source /root/miniforge3/bin/activate streamlit-env
+              jq -s '.[0] * .[1]' /app/settings.json /app/settings-overrides.json > /tmp/settings-merged.json && mv /tmp/settings-merged.json /app/settings.json
               exec rq worker openms-workflows --url $REDIS_URL
           env:
             - name: REDIS_URL
@@ -41,8 +43,8 @@ spec:
             - name: workspaces
               mountPath: /workspaces-streamlit-template
             - name: config
-              mountPath: /app/settings.json
-              subPath: settings.json
+              mountPath: /app/settings-overrides.json
+              subPath: settings-overrides.json
               readOnly: true
           resources:
             requests:
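
The `jq -s '.[0] * .[1]'` call in the startup script reads both files into one array (`-s`) and multiplies the two objects, which in jq is a recursive merge where values from the right-hand object win. The sketch below reproduces that merge on throwaway files so the effect on nested keys is visible. It assumes `jq` is installed; the function name `merge_settings`, the temporary paths, and the sample values are illustrative and not taken from the image's real settings.json.

```shell
#!/bin/sh
# Hypothetical wrapper around the merge used in the startup script:
# deep-merge OVERRIDES into BASE and replace BASE with the result.
merge_settings() {
  base=$1
  overrides=$2
  merged=$(mktemp)
  # -s slurps both inputs into one array; `*` merges objects recursively.
  jq -s '.[0] * .[1]' "$base" "$overrides" > "$merged" && mv "$merged" "$base"
}

# Sample data (made up for this sketch).
workdir=$(mktemp -d)
cat > "$workdir/settings.json" <<'EOF'
{"version": "0.0.0", "online_deployment": false,
 "queue_settings": {"default_timeout": 7200, "result_ttl": 86400}}
EOF
cat > "$workdir/settings-overrides.json" <<'EOF'
{"online_deployment": true, "queue_settings": {"result_ttl": 3600}}
EOF

merge_settings "$workdir/settings.json" "$workdir/settings-overrides.json"
jq . "$workdir/settings.json"
```

In this sample, `version` and `queue_settings.default_timeout` survive from the base file while `online_deployment` and `queue_settings.result_ttl` take the override values. A plain top-level replacement would instead have dropped `default_timeout`.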

k8s/base/streamlit-deployment.yaml

Lines changed: 4 additions & 2 deletions
@@ -32,7 +32,9 @@ spec:
           command: ["/bin/bash", "-c"]
           args:
             - |
+              set -euo pipefail
               source /root/miniforge3/bin/activate streamlit-env
+              jq -s '.[0] * .[1]' /app/settings.json /app/settings-overrides.json > /tmp/settings-merged.json && mv /tmp/settings-merged.json /app/settings.json
               exec streamlit run app.py --server.address 0.0.0.0
           ports:
             - containerPort: 8501
@@ -43,8 +45,8 @@ spec:
             - name: workspaces
               mountPath: /workspaces-streamlit-template
             - name: config
-              mountPath: /app/settings.json
-              subPath: settings.json
+              mountPath: /app/settings-overrides.json
+              subPath: settings-overrides.json
               readOnly: true
           readinessProbe:
             httpGet:
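
One detail of the startup script may deserve a second look. With `set -e`, POSIX shells do not exit when the failing command is on the left-hand side of an `&&` list, and the `jq ... && mv ...` line has that shape. The sketch below illustrates the general shell behaviour. It is not a test of the actual container: `false` stands in for a failing `jq`, `true` for `mv`, and the `echo` for the final `exec`; the function names are hypothetical.

```shell
#!/bin/sh
# Same shape as the startup script: the failing step is the left side of `&&`.
chained() {
  sh -c 'set -eu; false && true; echo "app started"'
}

# The same steps written as separate commands.
separate() {
  sh -c 'set -eu; false; true; echo "app started"'
}

# Capture both the output and the exit status of each variant.
chained_out=$(chained) && chained_status=0 || chained_status=$?
separate_out=$(separate) && separate_status=0 || separate_status=$?

echo "chained:  status=${chained_status} output='${chained_out}'"
echo "separate: status=${separate_status} output='${separate_out}'"
```

The chained variant exits with status 0 and still prints `app started`, while the separate variant stops at the failing command with a non-zero status and prints nothing. If the intent is for the container to refuse to start when the merge fails, putting the `jq` redirect and the `mv` on separate lines would be one way to get that behaviour.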

k8s/base/traefik-ingressroute.yaml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+apiVersion: traefik.io/v1alpha1
+kind: IngressRoute
+metadata:
+  name: streamlit-traefik
+spec:
+  entryPoints:
+    - web
+  routes:
+    - match: PathPrefix(`/`)
+      kind: Rule
+      services:
+        - name: streamlit
+          port: 8501
+          sticky:
+            cookie:
+              name: stroute
+              httpOnly: true
+              sameSite: lax

k8s/overlays/template-app/kustomization.yaml

Lines changed: 21 additions & 0 deletions
@@ -22,3 +22,24 @@ patches:
       - op: replace
         path: /spec/rules/0/host
         value: template.openms.example.de
+  - target:
+      kind: Deployment
+      name: streamlit
+    patch: |
+      - op: replace
+        path: /spec/template/spec/containers/0/env/0/value
+        value: "redis://template-app-redis:6379/0"
+  - target:
+      kind: Deployment
+      name: rq-worker
+    patch: |
+      - op: replace
+        path: /spec/template/spec/containers/0/env/0/value
+        value: "redis://template-app-redis:6379/0"
+  - target:
+      kind: IngressRoute
+      name: streamlit-traefik
+    patch: |
+      - op: replace
+        path: /spec/routes/0/services/0/name
+        value: "template-app-streamlit"
