feat: add retryInterval and static retry timing for SparkApplication#2851
feat: add retryInterval and static retry timing for SparkApplication#2851weisscorp wants to merge 6 commits intokubeflow:masterfrom
Conversation
|
🎉 Welcome to the Kubeflow Spark Operator! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Signed-off-by: weisscorp <49550074+weisscorp@users.noreply.github.com>
Signed-off-by: weisscorp <49550074+weisscorp@users.noreply.github.com>
Signed-off-by: weisscorp <49550074+weisscorp@users.noreply.github.com>
Signed-off-by: weisscorp <49550074+weisscorp@users.noreply.github.com>
b7da26f to
101a2fc
Compare
There was a problem hiding this comment.
Pull request overview
Adds configurable retry interval behavior for SparkApplication restart policy, including a new static retry mode that schedules retries relative to the most recent failure time rather than submission time.
Changes:
- Add
restartPolicy.retryIntervalandrestartPolicy.retryIntervalMethod(lineardefault,staticoptional) to the API/CRDs/docs. - Update retry timing calculation to support static mode and explicit retry interval precedence.
- Update controller status reset behavior and add unit tests for retry timing.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
pkg/util/sparkapplication.go |
Implements retryInterval precedence and retryIntervalMethod-based timing calculation. |
pkg/util/sparkapplication_test.go |
Adds unit tests for linear vs static retry timing and explicit interval precedence. |
internal/controller/sparkapplication/controller.go |
Extends status reset behavior to clear termination time on rerun transitions. |
docs/api-docs.md |
Regenerates API docs to include the new restart policy fields. |
config/crd/bases/sparkoperator.k8s.io_sparkapplications.yaml |
Regenerates SparkApplication CRD schema with new fields/enums/default. |
config/crd/bases/sparkoperator.k8s.io_scheduledsparkapplications.yaml |
Regenerates ScheduledSparkApplication CRD schema with new fields/enums/default. |
charts/spark-operator-chart/crds/sparkoperator.k8s.io_sparkapplications.yaml |
Updates Helm-packaged SparkApplication CRD with new fields/enums/default. |
charts/spark-operator-chart/crds/sparkoperator.k8s.io_scheduledsparkapplications.yaml |
Updates Helm-packaged ScheduledSparkApplication CRD with new fields/enums/default. |
api/v1beta2/zz_generated.deepcopy.go |
Regenerates deep-copies for the new RetryInterval pointer field. |
api/v1beta2/sparkapplication_types.go |
Adds API fields/types/constants for retry interval + method. |
api/v1beta2/defaults.go |
Defaults retryIntervalMethod to linear when unset. |
api/v1beta2/defaults_test.go |
Adds/updates tests for defaulting behavior of retryIntervalMethod. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: weisscorp <49550074+weisscorp@users.noreply.github.com>
Keep static retry interval behavior stable across reruns by clearing previous run termination time before each submit, and add a regression test.
Purpose of this PR
This PR adds explicit retry-interval control for
SparkApplicationrestart policy and fixes static retry timing behavior.It addresses the case where long-running apps should be restarted with a fixed delay after each failure (for example, 5s / 5m / 15m), without linear backoff growth.
Proposed changes:
restartPolicy.retryInterval(seconds) to API/CRD.restartPolicy.retryIntervalMethodsupport withlinear(default) andstatic.linearbehavior unchanged (existing attempts-based backoff).static, compute next retry from failure time (status.terminationTime) instead of original submission time.terminationTimewhen app moves to pending rerun, so each new failure uses a fresh static interval.Change Category
Rationale
Previously, retry delay could effectively depend on submission timing and attempts history, which is not suitable for fixed restart cadence scenarios.
This change introduces an explicit fixed-interval mode and keeps current default behavior intact for backward compatibility.
Checklist
Additional Notes
retryIntervaltakes precedence overonFailureRetryInterval/onSubmissionFailureRetryIntervalwhen set.linearto preserve existing behavior.go test ./api/v1beta2 ./pkg/utilmake manifests,make update-crd,make build-api-docs.