[BUG] CoPilot sets terminationGracePeriodSeconds in nanoseconds instead of seconds, causing pods to hang in Terminating state #7097

@mitja-kleider

Describe the bug

In flyteplugins/go/tasks/pluginmachinery/flytek8s/copilot.go, the CoPilot timeout is assigned directly to TerminationGracePeriodSeconds without converting from nanoseconds to seconds:

coPilotPod.TerminationGracePeriodSeconds = (*int64)(&cfg.Timeout.Duration)

cfg.Timeout.Duration is a time.Duration, which stores nanoseconds as int64. Kubernetes expects terminationGracePeriodSeconds in seconds.

For example, a 1-hour copilot timeout produces terminationGracePeriodSeconds: 3600000000000.

Expected behavior

The duration should be converted to seconds before assignment:

seconds := int64(cfg.Timeout.Duration.Seconds())
coPilotPod.TerminationGracePeriodSeconds = &seconds

How to reproduce

  1. Run a Flyte task that ignores SIGTERM, with CoPilot enabled (i.e., with output handling)
  2. Cancel the execution or let it time out
  3. Observe that the pod enters the Terminating state and stays there until it is manually force-deleted
  4. Inspect the pod: kubectl get pod <name> -o jsonpath='{.spec.terminationGracePeriodSeconds}' returns 3600000000000
  5. Inspect the deletion timestamp: it will be decades/centuries in the future

Environment

  • FlytePropeller version: v1.16.3 (also present on master/v2.0.9)
