(fix) Add subflow infrastructure failure state handling to Kubernetes observer#21097

Open
haziqishere wants to merge 8 commits intoPrefectHQ:mainfrom
haziqishere:main

Conversation


@haziqishere haziqishere commented Mar 12, 2026

Closes #21022

This PR fixes subflow runs getting permanently stuck in Running when their pod dies due to an infrastructure failure in the prefect_kubernetes library. @desertaxle helped by adding InfrastructureDiagnosis to monitor status at the pod level in #21050. This PR builds on that by gracefully handling the subflow's Prefect run state when a terminal failure is detected.

The new handle_subflow_failure_state setting is opt-in (default: None). Nothing changes in behaviour unless it is explicitly configured.

  • Adds a DiagnosisCode enum to InfrastructureDiagnosis for machine-readable failure identification, replacing string-based discrimination
  • Adds a prefect.io/parent-task-run-id label to Kubernetes pods for subflows, enabling the observer to distinguish subflow pods from top-level flow run pods
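To make the machine-readable discrimination concrete, here is a minimal sketch of what such an enum could look like. The member names below are illustrative guesses based on the failure modes and log examples in this PR; the actual DiagnosisCode members in prefect_kubernetes may be named or spelled differently.

```python
from enum import Enum


class DiagnosisCode(str, Enum):
    # Illustrative members inferred from the failure modes listed in this PR;
    # the real enum may differ.
    OOM_KILLED = "oom_killed"
    IMAGE_PULL_BACKOFF = "image_pull_backoff"
    CRASH_LOOP_BACKOFF = "crash_loop_backoff"
    EVICTED = "evicted"
    UNSCHEDULABLE = "unschedulable"


def is_terminal_failure(code: DiagnosisCode) -> bool:
    # Consumers branch on enum members instead of parsing message text.
    return code in {
        DiagnosisCode.OOM_KILLED,
        DiagnosisCode.IMAGE_PULL_BACKOFF,
        DiagnosisCode.CRASH_LOOP_BACKOFF,
        DiagnosisCode.EVICTED,
        DiagnosisCode.UNSCHEDULABLE,
    }
```

Because the enum mixes in str, each member's value doubles as the stable `[code]` prefix used in log output.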
Why only subflow pods?

Top-level flow run pods are managed by the worker, which detects job failure directly.
Subflow pods have no active watcher after the worker returns. The observer is the only
component that can detect and act on their failures, making them prone to getting stuck
in Running indefinitely.

The prefect.io/parent-task-run-id label (set by the worker for all subflow pods) is
used to distinguish subflow pods from top-level flow run pods. It acts as a tag that lets us easily tell whether a pod is running a top-level flow or is dynamic infrastructure for a subflow run. This ensures the run state mutation only applies where it's needed.
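As a sketch (not the observer's actual code), the label-based discrimination boils down to a single membership check on pod metadata; the helper name here is hypothetical:

```python
# Label the worker stamps onto subflow pods, per this PR.
PARENT_TASK_RUN_ID_LABEL = "prefect.io/parent-task-run-id"


def is_subflow_pod(pod_labels: dict) -> bool:
    """True iff the pod carries a parent task run id, which the worker
    sets only for subflow pods (hypothetical helper for illustration)."""
    return PARENT_TASK_RUN_ID_LABEL in (pod_labels or {})
```

A top-level flow run pod carries flow-run labels but no parent task run id, so it is left alone; only pods with the label are candidates for the new state handling.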

  • Adds handle_subflow_failure_state setting to force subflow run state to Failed or Crashed when a terminal infrastructure failure is detected (OOMKilled, ImagePullBackOff,
    CrashLoopBackOff, Eviction, Unschedulable) which fixes subflow runs getting permanently stuck in Running when their pod dies unexpectedly
  • This setting is added under KubernetesObserverSettings and defaults to None
How to change this setting?

In your Prefect configuration, set PREFECT_INTEGRATIONS_KUBERNETES_OBSERVER_HANDLE_SUBFLOW_FAILURE_STATE=crashed to mark the run as Crashed, or PREFECT_INTEGRATIONS_KUBERNETES_OBSERVER_HANDLE_SUBFLOW_FAILURE_STATE=failed to mark it as Failed.

I deliberately defaulted this to None so the change is not forced on all users. I recommend crashed, since that matches the intended semantics of an infrastructure failure. The failed option is more opinionated, but my team's current use case benefits from having subflow runs marked as Failed.
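For illustration, parsing such an opt-in setting from the environment might look like the following sketch; the helper name and validation are mine, not the PR's, which uses Prefect's settings machinery:

```python
from typing import Optional

_ENV_VAR = "PREFECT_INTEGRATIONS_KUBERNETES_OBSERVER_HANDLE_SUBFLOW_FAILURE_STATE"
_VALID = {"failed", "crashed"}


def read_handle_subflow_failure_state(env: dict) -> Optional[str]:
    """Return the configured terminal state, or None when unset (opt-in default)."""
    raw = env.get(_ENV_VAR)
    if raw is None:
        # Default: the observer leaves subflow run state untouched.
        return None
    value = raw.strip().lower()
    if value not in _VALID:
        raise ValueError(f"{_ENV_VAR} must be one of {sorted(_VALID)}, got {raw!r}")
    return value
```

The None default means existing deployments see no behaviour change until an operator explicitly opts in.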

  • Improves diagnosis log output with [code] prefix for easier filtering. Example of log output:
    [oom_killed] Container 'main' was killed due to out-of-memory (OOMKilled): ... Resolution: ...
Why is this important?

Before the changes, the log looked like:
Container 'main' was killed due to out-of-memory (OOMKilled): ... Resolution: ...

After the changes:
[oom_killed] Container 'main' was killed due to out-of-memory (OOMKilled): ... Resolution: ...

The [oom_killed] prefix lets operators filter logs by failure type easily, e.g.:
kubectl logs <observer-pod> | grep "\[oom_killed\]"

Without it you'd have to grep for human-readable strings like "OOMKilled", which is fragile if the message wording ever changes. The diagnosis code, by contrast, is stable.
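The prefixing itself is trivial; a minimal sketch (the function name is hypothetical, the format matches the log examples above):

```python
def format_diagnosis(code: str, message: str) -> str:
    # Prepend the stable machine code so operators can grep "[oom_killed]"
    # instead of matching brittle human-readable wording.
    return f"[{code}] {message}"
```

For example, `format_diagnosis("oom_killed", "Container 'main' was killed due to out-of-memory (OOMKilled)")` produces the bracketed form shown in the log samples above.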

Unit tests: [screenshots of test runs]

Checklist

  • This pull request references any related issue by including "closes 21022"
  • If this pull request adds new functionality, it includes unit tests that cover the changes
  • If this pull request removes docs files, it includes redirect settings in mint.json.
  • If this pull request adds functions or classes, it includes helpful docstrings.

@github-actions bot added the bug (Something isn't working) label on Mar 12, 2026
@desertaxle (Member) commented:

Thanks for opening a PR, @haziqishere, but we need to spend some time discussing #21022 to ensure the root cause is well understood, and we are aligned on a solution.

We don't want to update flow run state on pod events alone because we don't want to conflict with any retries that are configured on the Kubernetes job. In Prefect, we don't consider flow runs failed until all retries (both ours and Kubernetes') have been exhausted.


Labels

bug Something isn't working

Development

Successfully merging this pull request may close these issues.

Subflow remain in Running state even after pod status was OOMKilled which causes the main flow left in Running state indefinitely
