(fix) Add subflow infrastructure failure state handling to Kubernetes observer #21097
Open
haziqishere wants to merge 8 commits into PrefectHQ:main
…fy dynamic subflow pod
desertaxle (Member) requested changes on Mar 13, 2026
Thanks for opening a PR, @haziqishere, but we need to spend some time discussing #21022 to ensure the root cause is well understood, and we are aligned on a solution.
We don't want to update flow run state on pod events alone because we don't want to conflict with any retries that are configured on the Kubernetes job. In Prefect, we don't consider flow runs failed until all retries (both ours and Kubernetes') have been exhausted.
closes #21022
This PR fixes subflow runs getting permanently stuck in `Running` when their pod dies due to an infrastructure failure in the `prefect_kubernetes` library. @desertaxle helped by adding `InfrastructureDiagnosis` to monitor status at the pod level in PR #21050. This PR builds on that by gracefully handling the Prefect run state of a subflow when a terminal failure is detected.

The new `handle_subflow_failure_state` setting is opt-in (default=`None`). No behaviour changes unless explicitly configured.

- Add `DiagnosisCode` enum to `InfrastructureDiagnosis` for machine-readable failure identification, replacing string-based discrimination
- Add `prefect.io/parent-task-run-id` label to Kubernetes pods for subflows, enabling the observer to distinguish subflow pods from top-level flow run pods

Why only subflow pods?
Top-level flow run pods are managed by the worker, which detects job failure directly. Subflow pods have no active watcher after the worker returns. The observer is the only component that can detect and act on their failures; without that handling, subflow runs are prone to getting stuck in `Running` indefinitely.

The `prefect.io/parent-task-run-id` label (set by the worker for all subflow pods) is used to distinguish subflow pods from top-level flow run pods. It acts as an additional tag that makes it easy to tell whether a pod is running a normal flow or is dynamic infrastructure for a subflow run. This ensures the Prefect run state mutation only applies where it's needed.
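The label check and the opt-in gate described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the helper names `is_subflow_pod` and `target_state_name`, the `SubflowFailureState` alias, and the `terminal_failure` flag are hypothetical. Only the label key and the `failed` / `crashed` / `None` setting values come from the description.

```python
from typing import Literal, Mapping, Optional

# Label key named in this PR; the worker sets it on subflow pods.
PARENT_TASK_RUN_LABEL = "prefect.io/parent-task-run-id"

# Hypothetical alias for the opt-in setting value (None means "do nothing").
SubflowFailureState = Optional[Literal["failed", "crashed"]]


def is_subflow_pod(labels: Optional[Mapping[str, str]]) -> bool:
    """Return True when the pod carries the parent-task-run label."""
    return bool(labels) and PARENT_TASK_RUN_LABEL in labels


def target_state_name(
    labels: Optional[Mapping[str, str]],
    setting: SubflowFailureState,
    terminal_failure: bool,
) -> Optional[str]:
    """Decide which state name, if any, a subflow run should be forced into.

    Returns None when the feature is not configured, when the failure is not
    terminal, or when the pod is not a subflow pod.
    """
    if setting is None or not terminal_failure:
        return None
    if not is_subflow_pod(labels):
        return None
    return "Failed" if setting == "failed" else "Crashed"
```

With the setting left at `None`, `target_state_name` always returns `None`, which mirrors the "no behaviour changes unless explicitly configured" guarantee. Top-level flow run pods are skipped because they lack the label.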
- Add `handle_subflow_failure_state` setting to force subflow run state to `Failed` or `Crashed` when a terminal infrastructure failure is detected (OOMKilled, ImagePullBackOff, CrashLoopBackOff, Eviction, Unschedulable), which fixes subflow runs getting permanently stuck in `Running` when their pod dies unexpectedly
- The setting is part of `KubernetesObserverSettings` and defaults to `None`

How to change this setting?
Specify it in the Prefect config: `PREFECT_INTEGRATIONS_KUBERNETES_OBSERVER_HANDLE_SUBFLOW_FAILURE_STATE:crashed` to mark the run as `Crashed`, or `PREFECT_INTEGRATIONS_KUBERNETES_OBSERVER_HANDLE_SUBFLOW_FAILURE_STATE:failed` to mark it as `Failed`.

I deliberately set the default to `None` so this change is not enforced on all users. I recommend `crashed`, as that will be the intended behaviour. The `failed` setting is quite opinionated; for my team's current use case, we benefit from having subflow runs marked as `Failed`.

- Add a `[code]` prefix to log messages for easier filtering. Example of log output: `[oom_killed] Container 'main' was killed due to out-of-memory (OOMKilled): ... Resolution: ...`

Why is this important?
Before the changes, the log looked like: `Container 'main' was killed due to out-of-memory (OOMKilled): ... Resolution: ...`

After the changes: `[oom_killed] Container 'main' was killed due to out-of-memory (OOMKilled): ... Resolution: ...`

The `[oom_killed]` prefix lets operators filter logs by failure type easily, e.g.: `kubectl logs <observer-pod> | grep "\[oom_killed\]"`

Without it, you'd have to grep for human-readable strings like "OOMKilled", which is fragile if the message wording ever changes. The code is stable.
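The same filtering can be done programmatically. The sketch below assumes each message starts with a bracketed lowercase snake_case code, as in the `[oom_killed]` example. Only that one code appears in this description, so the pattern is a guess at the general shape, and the helper names `diagnosis_code` and `filter_by_code` are hypothetical.

```python
import re
from typing import Iterable, List, Optional

# Assumed prefix shape: "[some_code] " at the very start of the message.
_CODE_PREFIX = re.compile(r"^\[([a-z_]+)\]\s")


def diagnosis_code(message: str) -> Optional[str]:
    """Extract the bracketed code from a message, or None if it has no prefix."""
    match = _CODE_PREFIX.match(message)
    return match.group(1) if match else None


def filter_by_code(messages: Iterable[str], code: str) -> List[str]:
    """Keep only the messages whose prefix equals the given code."""
    return [m for m in messages if diagnosis_code(m) == code]


if __name__ == "__main__":
    logs = [
        "[oom_killed] Container 'main' was killed due to out-of-memory (OOMKilled): ...",
        "Container 'main' was killed due to out-of-memory (OOMKilled): ...",
    ]
    for line in filter_by_code(logs, "oom_killed"):
        print(line)
```

Matching on the code rather than on the human-readable text means the filter keeps working if the message wording changes.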
Unit Test:
Checklist
mint.json.