feat: add retryInterval and static retry timing for SparkApplication #2851

Open
weisscorp wants to merge 6 commits into kubeflow:master from weisscorp:feature/retry-interval-upstream-master

@weisscorp weisscorp commented Feb 26, 2026

Purpose of this PR

This PR adds explicit retry-interval control for SparkApplication restart policy and fixes static retry timing behavior.

It addresses the case where long-running apps should be restarted with a fixed delay after each failure (for example, 5s / 5m / 15m), without linear backoff growth.

Proposed changes:

  • Add restartPolicy.retryInterval (seconds) to API/CRD.
  • Add restartPolicy.retryIntervalMethod support with linear (default) and static.
  • Keep linear behavior unchanged (existing attempts-based backoff).
  • For static, compute next retry from failure time (status.terminationTime) instead of original submission time.
  • Reset terminationTime when app moves to pending rerun, so each new failure uses a fresh static interval.
  • Regenerate CRDs / API docs / manifests and update unit tests in retry timing logic.
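
The timing difference between the two methods can be sketched as follows. This is a minimal illustration, not the code in this PR: the names `RetryIntervalMethod`, `retryDelay`, and `retryDue` are hypothetical, and the linear formula (interval multiplied by attempts, counted from submission time) is an assumption about the existing attempts-based backoff.

```go
package main

import (
	"fmt"
	"time"
)

// RetryIntervalMethod names the two modes from this PR. The Go identifiers
// are illustrative assumptions, not the operator's actual API.
type RetryIntervalMethod string

const (
	Linear RetryIntervalMethod = "linear"
	Static RetryIntervalMethod = "static"
)

// Sample timestamps for the demo: an app that fails two hours after submission.
var (
	demoSubmitted = time.Date(2026, 2, 26, 6, 0, 0, 0, time.UTC)
	demoFailed    = demoSubmitted.Add(2 * time.Hour)
)

// retryDelay returns how long to wait before the next retry.
// Assumption: linear multiplies the interval by the attempts made so far,
// while static always waits exactly one interval.
func retryDelay(method RetryIntervalMethod, intervalSeconds int64, attemptsDone int32) time.Duration {
	interval := time.Duration(intervalSeconds) * time.Second
	if method == Static {
		return interval
	}
	return interval * time.Duration(attemptsDone)
}

// retryDue returns the earliest time the next retry may start. Static counts
// from the last failure (termination) time; linear counts from submission.
func retryDue(method RetryIntervalMethod, intervalSeconds int64, attemptsDone int32,
	submitted, terminated time.Time) time.Time {
	delay := retryDelay(method, intervalSeconds, attemptsDone)
	if method == Static {
		return terminated.Add(delay)
	}
	return submitted.Add(delay)
}

func main() {
	for _, m := range []RetryIntervalMethod{Linear, Static} {
		due := retryDue(m, 300, 3, demoSubmitted, demoFailed)
		fmt.Printf("%s: retry due at %s\n", m, due.Format(time.RFC3339))
	}
}
```

In this sketch, with a 300-second interval and three attempts done, the linear deadline lands 15 minutes after submission, which is long before the app failed, while the static deadline lands 5 minutes after the failure.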

Change Category

  • Bugfix (non-breaking change which fixes an issue)
  • Feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that could affect existing functionality)
  • Documentation update

Rationale

Previously, the retry delay could effectively depend on submission timing and attempt history, which does not suit scenarios that need a fixed restart cadence.
This change introduces an explicit fixed-interval mode and keeps current default behavior intact for backward compatibility.

Checklist

  • I have conducted a self-review of my own code.
  • I have updated documentation accordingly.
  • I have added tests that prove my changes are effective or that my feature works.
  • Existing unit tests pass locally with my changes.

Additional Notes

  • retryInterval takes precedence over onFailureRetryInterval / onSubmissionFailureRetryInterval when set.
  • Default method remains linear to preserve existing behavior.
  • Local validation used:
    • Ran go test ./api/v1beta2 ./pkg/util.
    • Regenerated artifacts via make manifests, make update-crd, and make build-api-docs.
    • Validated behavior in a production-like cluster.
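
The precedence rule in the first note can be sketched like this. It is an illustration under assumptions: `RestartPolicy` here is a trimmed stand-in whose Go field names are inferred from the JSON field names, and `effectiveInterval` and `ptr` are hypothetical helpers, not functions from the operator.

```go
package main

import "fmt"

// RestartPolicy is a trimmed, illustrative stand-in that models only the
// interval fields discussed in this PR (all values are in seconds).
type RestartPolicy struct {
	RetryInterval                    *int64
	OnFailureRetryInterval           *int64
	OnSubmissionFailureRetryInterval *int64
}

// effectiveInterval picks the interval to use for the next retry.
// RetryInterval wins whenever it is set; otherwise the field matching the
// failure kind applies. The boolean is false when nothing is configured.
func effectiveInterval(p RestartPolicy, submissionFailure bool) (int64, bool) {
	if p.RetryInterval != nil {
		return *p.RetryInterval, true
	}
	specific := p.OnFailureRetryInterval
	if submissionFailure {
		specific = p.OnSubmissionFailureRetryInterval
	}
	if specific == nil {
		return 0, false
	}
	return *specific, true
}

// ptr is a small helper for building pointer-valued fields.
func ptr(v int64) *int64 { return &v }

func main() {
	p := RestartPolicy{RetryInterval: ptr(300), OnFailureRetryInterval: ptr(10)}
	seconds, ok := effectiveInterval(p, false)
	fmt.Println(seconds, ok) // prints "300 true"
}
```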

Copilot AI review requested due to automatic review settings February 26, 2026 06:28
@github-actions

🎉 Welcome to the Kubeflow Spark Operator! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@google-oss-prow google-oss-prow bot requested a review from ImpSy February 26, 2026 06:28
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chenyi015 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: weisscorp <49550074+weisscorp@users.noreply.github.com>
@weisscorp weisscorp force-pushed the feature/retry-interval-upstream-master branch from b7da26f to 101a2fc Compare February 26, 2026 06:34

Copilot AI left a comment

Pull request overview

Adds configurable retry interval behavior for SparkApplication restart policy, including a new static retry mode that schedules retries relative to the most recent failure time rather than submission time.

Changes:

  • Add restartPolicy.retryInterval and restartPolicy.retryIntervalMethod (linear default, static optional) to the API/CRDs/docs.
  • Update retry timing calculation to support static mode and explicit retry interval precedence.
  • Update controller status reset behavior and add unit tests for retry timing.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| pkg/util/sparkapplication.go | Implements retryInterval precedence and retryIntervalMethod-based timing calculation. |
| pkg/util/sparkapplication_test.go | Adds unit tests for linear vs static retry timing and explicit interval precedence. |
| internal/controller/sparkapplication/controller.go | Extends status reset behavior to clear termination time on rerun transitions. |
| docs/api-docs.md | Regenerates API docs to include the new restart policy fields. |
| config/crd/bases/sparkoperator.k8s.io_sparkapplications.yaml | Regenerates SparkApplication CRD schema with new fields/enums/default. |
| config/crd/bases/sparkoperator.k8s.io_scheduledsparkapplications.yaml | Regenerates ScheduledSparkApplication CRD schema with new fields/enums/default. |
| charts/spark-operator-chart/crds/sparkoperator.k8s.io_sparkapplications.yaml | Updates Helm-packaged SparkApplication CRD with new fields/enums/default. |
| charts/spark-operator-chart/crds/sparkoperator.k8s.io_scheduledsparkapplications.yaml | Updates Helm-packaged ScheduledSparkApplication CRD with new fields/enums/default. |
| api/v1beta2/zz_generated.deepcopy.go | Regenerates deep-copies for the new RetryInterval pointer field. |
| api/v1beta2/sparkapplication_types.go | Adds API fields/types/constants for retry interval + method. |
| api/v1beta2/defaults.go | Defaults retryIntervalMethod to linear when unset. |
| api/v1beta2/defaults_test.go | Adds/updates tests for defaulting behavior of retryIntervalMethod. |
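
The defaulting behavior listed for api/v1beta2/defaults.go might look roughly like the sketch below. The type, constant, and function names are assumptions chosen for illustration; the actual identifiers in the PR may differ.

```go
package main

import "fmt"

// RetryIntervalMethod and its constants are hypothetical names for the
// "linear" and "static" values described in this PR.
type RetryIntervalMethod string

const (
	RetryIntervalMethodLinear RetryIntervalMethod = "linear"
	RetryIntervalMethodStatic RetryIntervalMethod = "static"
)

// RestartPolicy is trimmed to the single field relevant to defaulting.
type RestartPolicy struct {
	RetryIntervalMethod RetryIntervalMethod
}

// setDefaultRetryIntervalMethod fills in linear only when the field is unset,
// so an explicit choice is never overwritten.
func setDefaultRetryIntervalMethod(p *RestartPolicy) {
	if p.RetryIntervalMethod == "" {
		p.RetryIntervalMethod = RetryIntervalMethodLinear
	}
}

func main() {
	unset := RestartPolicy{}
	explicit := RestartPolicy{RetryIntervalMethod: RetryIntervalMethodStatic}
	setDefaultRetryIntervalMethod(&unset)
	setDefaultRetryIntervalMethod(&explicit)
	fmt.Println(unset.RetryIntervalMethod, explicit.RetryIntervalMethod) // prints "linear static"
}
```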


Signed-off-by: weisscorp <49550074+weisscorp@users.noreply.github.com>
Keep static retry interval behavior stable across reruns by clearing the previous run's termination time before each submit, and add a regression test.
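
A rough sketch of why the reset matters, under stated assumptions: `AppStatus` is a hypothetical, trimmed status struct, and `resetForResubmit` and `staticRetryDue` are illustrative helpers rather than the controller's functions.

```go
package main

import (
	"fmt"
	"time"
)

// AppStatus is a hypothetical stand-in that keeps only the termination time.
type AppStatus struct {
	TerminationTime time.Time
}

// Sample failure time of the first run, used by the demo below.
var demoFirstFailure = time.Date(2026, 2, 26, 8, 0, 0, 0, time.UTC)

// resetForResubmit clears the previous run's termination time so that the
// next failure records a fresh one.
func resetForResubmit(s *AppStatus) {
	s.TerminationTime = time.Time{}
}

// staticRetryDue returns the static-mode deadline, or false when the current
// run has not recorded a failure yet.
func staticRetryDue(s AppStatus, interval time.Duration) (time.Time, bool) {
	if s.TerminationTime.IsZero() {
		return time.Time{}, false
	}
	return s.TerminationTime.Add(interval), true
}

func main() {
	status := AppStatus{TerminationTime: demoFirstFailure}

	// Resubmitting clears the stale timestamp, so no deadline carries over.
	resetForResubmit(&status)
	_, stale := staticRetryDue(status, 5*time.Minute)
	fmt.Println("deadline carried over from previous run:", stale)

	// The rerun fails three hours later; the deadline follows the new failure.
	status.TerminationTime = demoFirstFailure.Add(3 * time.Hour)
	due, _ := staticRetryDue(status, 5*time.Minute)
	fmt.Println("next retry due at:", due.Format(time.RFC3339))
}
```

Without the reset, a rerun that has not yet failed would still carry the earlier timestamp, and a deadline computed from it would reflect the previous run rather than the current one.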