27 changes: 15 additions & 12 deletions README.md
@@ -316,22 +316,25 @@ You can set those environment variables by `--set oap.env.<ENV_NAME>=<ENV_VALUE>

> The environment variables take priority over the overridden configuration files.

## Rerun OAP init job
## OAP init job

Kubernetes Job cannot be rerun by default, if you want to rerun the OAP init
job, you need to delete the Job and recreate it.
The OAP storage schema (Elasticsearch indices / SQL tables / BanyanDB groups) is created by a
one-shot `*-oap-init-*` Job that runs OAP in `-Dmode=init`. The main OAP Deployment runs in
`-Dmode=no-init` and blocks (its `12800` port stays closed, so it is not Ready) until that schema
exists. The init Job is a **normal release resource** that runs in the main install/upgrade phase,
so `helm upgrade --install --wait` works: the Job creates the schema while OAP waits for it. To get
Helm to surface init-Job failures directly (instead of only seeing OAP fail to become Ready), add
`--wait-for-jobs` alongside `--wait`.

The Job name carries a hash of the chart values, so any `helm upgrade` that changes a value
re-creates the Job and re-runs init automatically (Helm prunes the previous one).

To **force a rerun** without changing any value — delete the Job and re-run `helm upgrade`; Helm
recreates the (now missing) Job and init runs again:

```shell
# Make sure to export the Job manifest to a file before deleting it.
kubectl get job -n "${SKYWALKING_RELEASE_NAMESPACE}" -l release=$SKYWALKING_RELEASE_NAME -o yaml > oap-init.job.yaml
# Trim the Job manifest to keep only the Job part, you can either download yq from https://github.com/mikefarah/yq or
# manually remove the fields that are not needed.
yq 'del(.items[0].metadata.creationTimestamp,.items[0].metadata.resourceVersion,.items[0].metadata.uid,.items[0].status,.items[0].spec.template.metadata.labels."batch.kubernetes.io/controller-uid",.items[0].spec.template.metadata.labels."controller-uid",.items[0].spec.selector.matchLabels."batch.kubernetes.io/controller-uid")' oap-init.job.yaml > oap-init.job.trimmed.yaml
# Check the file oap-init.job.trimmed.yaml to make sure it has correct content
# Delete the Job
kubectl delete job -n "${SKYWALKING_RELEASE_NAMESPACE}" -l release=$SKYWALKING_RELEASE_NAME
# Create the Job
kubectl -n "${SKYWALKING_RELEASE_NAMESPACE}" apply -f oap-init.job.trimmed.yaml
helm upgrade "$SKYWALKING_RELEASE_NAME" <chart> -n "${SKYWALKING_RELEASE_NAMESPACE}" --reuse-values
```
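
For reference, the flow described above could be scripted roughly as follows. This is a minimal sketch, not part of the chart: the `run` dry-run wrapper, the `CHART` variable, the fallback values, and the `10m` timeout are assumptions for illustration. Only `--wait`, `--wait-for-jobs`, and the `release=` label selector come from the text above.

```shell
# Sketch only: the fallback values below are placeholders, adjust them to your release.
RELEASE="${SKYWALKING_RELEASE_NAME:-skywalking}"
NAMESPACE="${SKYWALKING_RELEASE_NAMESPACE:-default}"
CHART="${CHART:-<chart>}"

# Hypothetical dry-run wrapper: prints the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

# Install or upgrade; --wait-for-jobs makes Helm report init-Job failures directly.
run helm upgrade --install "$RELEASE" "$CHART" -n "$NAMESPACE" \
  --wait --wait-for-jobs --timeout 10m

# Inspect the init Job(s) belonging to the release afterwards.
run kubectl get job -n "$NAMESPACE" -l "release=$RELEASE"
```

By default the script only prints the commands so they can be reviewed; set `RUN=1` to execute them.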

# Contact Us
3 changes: 2 additions & 1 deletion chart/skywalking/README.md
@@ -68,7 +68,7 @@ The following table lists the configurable parameters of the Skywalking chart an
| `oap.nodeSelector` | OAP labels for master pod assignment | `{}` |
| `oap.tolerations` | OAP tolerations | `[]` |
| `oap.resources` | OAP node resources requests & limits | `{} - cpu limit must be an integer` |
| `oap.startupProbe` | Configuration fields for the [startupProbe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) | `tcpSocket.port: 12800` <br> `failureThreshold: 9` <br> `periodSeconds: 10`
| `oap.startupProbe` | Configuration fields for the [startupProbe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/). The default budget (`failureThreshold` * `periodSeconds` = 300s) is large enough for OAP to wait in no-init mode while the OAP init Job creates the storage schema. | `tcpSocket.port: 12800` <br> `failureThreshold: 30` <br> `periodSeconds: 10`
| `oap.livenessProbe` | Configuration fields for the [livenessProbe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) | `tcpSocket.port: 12800` <br> `initialDelaySeconds: 5` <br> `periodSeconds: 10`
| `oap.readinessProbe` | Configuration fields for the [readinessProbe](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) | `tcpSocket.port: 12800` <br> `initialDelaySeconds: 5` <br> `periodSeconds: 10`
| `oap.env` | OAP environment variables | `[]` |
@@ -109,6 +109,7 @@ The following table lists the configurable parameters of the Skywalking chart an
| `oapInit.nodeSelector` | OAP init job labels for master pod assignment | `{}` |
| `oapInit.tolerations` | OAP init job tolerations | `[]` |
| `oapInit.extraPodLabels` | OAP init job metadata labels | `[]` |
| `oapInit.ttlSecondsAfterFinished` | Seconds after which the finished OAP init Job (and its Pod) is auto-deleted by the Kubernetes TTL-after-finished controller. Empty keeps the Job. Leave empty with GitOps tools (Argo CD/Flux), which would recreate it after deletion. | `""` |
| `satellite.name` | Satellite deployment name | `satellite` |
| `satellite.replicas` | Satellite k8s deployment replicas | `1` |
| `satellite.enabled` | Whether to enable Satellite | `false` |
5 changes: 4 additions & 1 deletion chart/skywalking/templates/oap-deployment.yaml
@@ -105,7 +105,10 @@ spec:
{{ else }}
tcpSocket:
port: 12800
failureThreshold: 9
# In no-init mode OAP blocks (port 12800 stays closed) until the init Job has created
# the storage schema. Give it a generous budget (30 * 10s = 300s) so the pod waits for
# the init Job instead of being restarted during a cold start.
failureThreshold: 30
periodSeconds: 10
{{- end }}
readinessProbe:
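
The budget arithmetic behind the new default can be checked with a few lines of shell. This is a minimal sketch; `startup_budget_seconds` is a hypothetical helper, not something the chart provides.

```shell
# Hypothetical helper: startup budget in seconds = failureThreshold * periodSeconds.
startup_budget_seconds() {
  echo $(( $1 * $2 ))
}

startup_budget_seconds 9 10    # previous default: prints 90
startup_budget_seconds 30 10   # new default: prints 300
```

If storage startup plus schema creation regularly takes longer than 300 seconds in a given environment, `oap.startupProbe.failureThreshold` is the value to raise.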
19 changes: 15 additions & 4 deletions chart/skywalking/templates/oap-init.job.yaml
@@ -18,17 +18,28 @@
apiVersion: batch/v1
kind: Job
metadata:
name: "{{ template "skywalking.oap.fullname" . }}-init"
# NOTE: This Job is intentionally a normal release resource, NOT a Helm hook.
# Running it as a post-install/post-upgrade hook deadlocks `helm upgrade --install --wait`:
# Helm waits for every release resource to become Ready before it runs post-* hooks, but the
# OAP Deployment runs in `-Dmode=no-init` and never becomes Ready until this Job has created
# the storage schema -- so the hook (and therefore the schema) would never run. As a main-phase
# resource the Job runs alongside the OAP Deployment, which blocks in no-init mode until the
# schema appears, so `--wait` resolves instead of deadlocking.
#
# The name carries a hash of the chart values: a Job's `spec.template` is immutable, so a stable
# name would make `helm upgrade` fail with "field is immutable" whenever the pod template changes.
# Hashing yields a fresh Job whenever a relevant value changes; Helm prunes the previous one.
name: "{{ printf "%s-init-%s" (include "skywalking.oap.fullname" . | trunc 40 | trimSuffix "-") (.Values | toYaml | sha256sum | trunc 8) }}"
labels:
app: {{ template "skywalking.name" . }}
chart: {{ .Chart.Name }}-{{ .Chart.Version }}
component: "{{ template "skywalking.fullname" . }}-job"
heritage: {{ .Release.Service }}
release: {{ .Release.Name }}
annotations:
"helm.sh/hook": post-install,post-upgrade,post-rollback
"helm.sh/hook-weight": "1"
spec:
{{- if .Values.oapInit.ttlSecondsAfterFinished }}
ttlSecondsAfterFinished: {{ .Values.oapInit.ttlSecondsAfterFinished }}
{{- end }}
template:
metadata:
name: "{{ .Release.Name }}-oap-init"
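
The naming scheme in the `name:` template above can be approximated in shell to see why a value change yields a new Job name. This is an illustrative sketch only: `init_job_name` is a hypothetical helper, and the hash it prints will not match what Helm renders, because Helm hashes its own `toYaml` serialization of the merged values. It assumes `sha256sum` (GNU coreutils) is available.

```shell
# Hypothetical helper mirroring the template:
#   <fullname truncated to 40 chars, one trailing "-" trimmed>-init-<first 8 hex chars of sha256(values)>
# $1 = OAP fullname, $2 = values rendered as text (stand-in for `.Values | toYaml`)
init_job_name() {
  fullname=$(printf '%s' "$1" | cut -c1-40)
  fullname=${fullname%-}
  hash=$(printf '%s' "$2" | sha256sum | cut -c1-8)
  printf '%s-init-%s\n' "$fullname" "$hash"
}

# Two different values documents give two different Job names.
init_job_name "skywalking-oap" "oap: {replicas: 1}"
init_job_name "skywalking-oap" "oap: {replicas: 2}"
```

Because the hash covers all of `.Values`, any value change produces a new name, including values that have nothing to do with the init Job.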
14 changes: 11 additions & 3 deletions chart/skywalking/values.yaml
@@ -75,11 +75,13 @@ oap:
# initialDelaySeconds: 5
# periodSeconds: 20
startupProbe: {}
# Time to boot the application is set to:
# 9 (failureThreshold) * 10 (periodSeconds) = 90 seconds in this case.
# Boot budget defaults to 30 (failureThreshold) * 10 (periodSeconds) = 300 seconds.
# In no-init mode OAP keeps port 12800 closed until the OAP init Job has created the storage
# schema, so the budget must be large enough to cover storage startup + schema creation;
# otherwise the pod is restarted while it is legitimately waiting for the init Job.
# tcpSocket:
# port: 12800
# failureThreshold: 9
# failureThreshold: 30
# periodSeconds: 10
readinessProbe: {}
# tcpSocket:
@@ -301,6 +303,12 @@ oapInit:
tolerations: []
extraPodLabels: {}
# sidecar.istio.io/inject: false
# Auto-delete the completed init Job (and its Pod) this many seconds after it finishes, via the
# Kubernetes TTL-after-finished controller. Leave empty to keep the completed Job around.
# NOTE: leave this empty when using GitOps tools (e.g. Argo CD, Flux) -- they would recreate the
# Job after the TTL controller deletes it, re-running init on every reconcile. The Job name is
# value-hashed, so upgrades already work without TTL; this is only for tidying finished Jobs.
ttlSecondsAfterFinished: ""

# Elasticsearch managed by ECK (eck-elasticsearch chart)
# When enabled, the ECK operator is also installed as a dependency.