(fix) Add subflow infrastructure failure state handling to Kubernetes observer#21097

Open
haziqishere wants to merge 8 commits intoPrefectHQ:mainfrom
haziqishere:main

Conversation


@haziqishere haziqishere commented Mar 12, 2026

Closes #21022

This PR fixes subflow runs getting permanently stuck in Running when their pod dies due to an infrastructure failure in the prefect_kubernetes library. @desertaxle helped by adding InfrastructureDiagnosis to monitor status at the pod level in #21050. This PR builds on that by gracefully handling the subflow's Prefect run state when a terminal failure is detected.

The new handle_subflow_failure_state setting is opt-in (default: None). Nothing changes in behaviour unless it is explicitly configured.

  • Adds a DiagnosisCode enum to InfrastructureDiagnosis for machine-readable failure identification, replacing string-based discrimination
  • Adds a prefect.io/parent-task-run-id label to Kubernetes pods for subflows, enabling the observer to distinguish subflow pods from top-level flow run pods
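To make the machine-readable discrimination concrete, here is a minimal sketch of what such an enum could look like. The member names below are illustrative guesses based on the failure modes and log examples in this PR; the actual DiagnosisCode members in prefect_kubernetes may be named or spelled differently.

```python
from enum import Enum


class DiagnosisCode(str, Enum):
    # Illustrative members inferred from the failure modes listed in this PR;
    # the real enum may differ.
    OOM_KILLED = "oom_killed"
    IMAGE_PULL_BACKOFF = "image_pull_backoff"
    CRASH_LOOP_BACKOFF = "crash_loop_backoff"
    EVICTED = "evicted"
    UNSCHEDULABLE = "unschedulable"


def is_terminal_failure(code: DiagnosisCode) -> bool:
    # Consumers branch on enum members instead of parsing message text.
    return code in {
        DiagnosisCode.OOM_KILLED,
        DiagnosisCode.IMAGE_PULL_BACKOFF,
        DiagnosisCode.CRASH_LOOP_BACKOFF,
        DiagnosisCode.EVICTED,
        DiagnosisCode.UNSCHEDULABLE,
    }
```

Because the enum mixes in str, each member's value doubles as the stable `[code]` prefix used in log output.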
Why only subflow pods?

Top-level flow run pods are managed by the worker, which detects job failure directly.
Subflow pods have no active watcher after the worker returns. The observer is the only
component that can detect and act on their failures, making them prone to getting stuck
in Running indefinitely.

The prefect.io/parent-task-run-id label (set by the worker for all subflow pods) is
used to distinguish subflow pods from top-level flow run pods. It acts as a tag that lets us easily tell whether a pod is running a top-level flow or is dynamic infrastructure for a subflow run. This ensures the run state mutation only applies where it's needed.
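As a sketch (not the observer's actual code), the label-based discrimination boils down to a single membership check on pod metadata; the helper name here is hypothetical:

```python
# Label the worker stamps onto subflow pods, per this PR.
PARENT_TASK_RUN_ID_LABEL = "prefect.io/parent-task-run-id"


def is_subflow_pod(pod_labels: dict) -> bool:
    """True iff the pod carries a parent task run id, which the worker
    sets only for subflow pods (hypothetical helper for illustration)."""
    return PARENT_TASK_RUN_ID_LABEL in (pod_labels or {})
```

A top-level flow run pod carries flow-run labels but no parent task run id, so it is left alone; only pods with the label are candidates for the new state handling.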

  • Adds handle_subflow_failure_state setting to force subflow run state to Failed or Crashed when a terminal infrastructure failure is detected (OOMKilled, ImagePullBackOff,
    CrashLoopBackOff, Eviction, Unschedulable) which fixes subflow runs getting permanently stuck in Running when their pod dies unexpectedly
  • This setting is added under KubernetesObserverSettings and defaults to None
How to change this setting?

In your Prefect configuration, set PREFECT_INTEGRATIONS_KUBERNETES_OBSERVER_HANDLE_SUBFLOW_FAILURE_STATE=crashed to mark the run as Crashed, or PREFECT_INTEGRATIONS_KUBERNETES_OBSERVER_HANDLE_SUBFLOW_FAILURE_STATE=failed to mark it as Failed.

I deliberately defaulted this to None so the change is not forced on all users. I recommend crashed, since that matches the intended semantics of an infrastructure failure. The failed option is more opinionated, but my team's current use case benefits from having subflow runs marked as Failed.
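For illustration, parsing such an opt-in setting from the environment might look like the following sketch; the helper name and validation are mine, not the PR's, which uses Prefect's settings machinery:

```python
from typing import Optional

_ENV_VAR = "PREFECT_INTEGRATIONS_KUBERNETES_OBSERVER_HANDLE_SUBFLOW_FAILURE_STATE"
_VALID = {"failed", "crashed"}


def read_handle_subflow_failure_state(env: dict) -> Optional[str]:
    """Return the configured terminal state, or None when unset (opt-in default)."""
    raw = env.get(_ENV_VAR)
    if raw is None:
        # Default: the observer leaves subflow run state untouched.
        return None
    value = raw.strip().lower()
    if value not in _VALID:
        raise ValueError(f"{_ENV_VAR} must be one of {sorted(_VALID)}, got {raw!r}")
    return value
```

The None default means existing deployments see no behaviour change until an operator explicitly opts in.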

  • Improves diagnosis log output with [code] prefix for easier filtering. Example of log output:
    [oom_killed] Container 'main' was killed due to out-of-memory (OOMKilled): ... Resolution: ...
Why is this important?

Before the changes, the log looked like:
Container 'main' was killed due to out-of-memory (OOMKilled): ... Resolution: ...

After the changes:
[oom_killed] Container 'main' was killed due to out-of-memory (OOMKilled): ... Resolution: ...

The [oom_killed] prefix lets operators filter logs by failure type easily, e.g.:
kubectl logs <observer-pod> | grep "\[oom_killed\]"

Without it you'd have to grep for human-readable strings like "OOMKilled", which is fragile if the message wording ever changes. The diagnosis code, by contrast, is stable.
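The prefixing itself is trivial; a minimal sketch (the function name is hypothetical, the format matches the log examples above):

```python
def format_diagnosis(code: str, message: str) -> str:
    # Prepend the stable machine code so operators can grep "[oom_killed]"
    # instead of matching brittle human-readable wording.
    return f"[{code}] {message}"
```

For example, `format_diagnosis("oom_killed", "Container 'main' was killed due to out-of-memory (OOMKilled)")` produces the bracketed form shown in the log samples above.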

Unit tests: [screenshots of test runs]

Checklist

  • This pull request references any related issue by including "closes 21022"
  • If this pull request adds new functionality, it includes unit tests that cover the changes
  • If this pull request removes docs files, it includes redirect settings in mint.json.
  • If this pull request adds functions or classes, it includes helpful docstrings.

@github-actions bot added the bug (Something isn't working) label on Mar 12, 2026
@desertaxle (Member) commented:

Thanks for opening a PR, @haziqishere, but we need to spend some time discussing #21022 to ensure the root cause is well understood, and we are aligned on a solution.

We don't want to update flow run state on pod events alone because we don't want to conflict with any retries that are configured on the Kubernetes job. In Prefect, we don't consider flow runs failed until all retries (both ours and Kubernetes') have been exhausted.


Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

Subflow remain in Running state even after pod status was OOMKilled which causes the main flow left in Running state indefinitely
