
Airflow 3.2.0 scheduler/triggerer deadlock on task_instance due to concurrent updates of deferrable tasks #65818

@jinshen-cn

Under which category would you file this issue?

Airflow Core

Apache Airflow version

3.2.0

What happened and how to reproduce it?

Description

After upgrading from Airflow 3.1.7 → 3.2.0, we are consistently observing MySQL deadlocks between the scheduler and triggerer when processing deferrable tasks.

This did not occur in 3.1.7 under the same workload.

Environment

  • Airflow version: 3.2.0
  • Previous version (no issue): 3.1.7
  • Executor: CeleryExecutor
  • DB: MySQL
  • Scheduler replicas: 3
  • Triggerer: 1 instance
  • Workload: heavy use of deferrable operators (sensors / async tasks)

Symptoms

  • Scheduler and triggerer crash or restart due to DB deadlocks
  • Deadlocks consistently involve the task_instance table
  • System becomes unstable under load

Example deadlock pattern:

UPDATE task_instance
SET updated_at=..., trigger_id=NULL
WHERE task_instance.state != 'deferred'
  AND task_instance.trigger_id IS NOT NULL

conflicting with:

UPDATE task_instance
SET state='scheduled', trigger_id=NULL,
    next_method='__fail__', next_kwargs=...
WHERE task_instance.state = 'deferred'
  AND task_instance.trigger_timeout < now()

Root Cause Analysis

Key observation

In Airflow 3.2.0, both scheduler and triggerer mutate task_instance rows for deferrable tasks:

Triggerer (set-based update)

  • Performs a bulk UPDATE on deferred tasks that time out
  • Updates:
    • state
    • trigger_id
    • next_method
    • next_kwargs

Scheduler (callback-driven updates)

  • Processes executor callbacks via:
      callback = session.get(Callback, callback_id)
      callback.run(session=session)
  • Inside callback.run():
    • Loads TaskInstance
    • Mutates:
      • state
      • trigger_id
      • other fields

Result

Two independent writers:

  • Triggerer → bulk UPDATE (set-based)
  • Scheduler → row-by-row ORM UPDATE

Both target overlapping task_instance rows.

Why this causes deadlocks

  • Both queries scan overlapping row sets even though their predicates are logically disjoint; InnoDB locks the index records a statement scans, not only the rows it ends up modifying
  • Lock acquisition order differs:
    • Triggerer: index scan order
    • Scheduler: callback / primary key order
  • With multiple scheduler replicas, contention increases significantly
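To illustrate the first point, here is a minimal sketch in plain Python (no database involved). The rows and the `timed_out` flag are hypothetical stand-ins for the real task_instance columns, and the sketch assumes neither statement has a selective index, so each one reads every row. It only models which rows are matched versus scanned; it does not reproduce InnoDB locking.

```python
# Hypothetical, simplified rows; the real task_instance table has many more columns.
ROWS = [
    {"id": 1, "state": "deferred", "trigger_id": 7, "timed_out": True},
    {"id": 2, "state": "running", "trigger_id": 8, "timed_out": False},
    {"id": 3, "state": "deferred", "trigger_id": 9, "timed_out": False},
    {"id": 4, "state": "success", "trigger_id": None, "timed_out": False},
]


def cleanup_matches(row):
    # Mirrors: state != 'deferred' AND trigger_id IS NOT NULL
    return row["state"] != "deferred" and row["trigger_id"] is not None


def timeout_matches(row):
    # Mirrors: state = 'deferred' AND trigger_timeout < now()
    return row["state"] == "deferred" and row["timed_out"]


def matched_ids(predicate):
    # Rows the statement would actually modify.
    return {row["id"] for row in ROWS if predicate(row)}


def scanned_ids():
    # Assumption: no selective index, so the statement reads every row.
    return {row["id"] for row in ROWS}


cleanup_matched = matched_ids(cleanup_matches)
timeout_matched = matched_ids(timeout_matches)
both_scanned = scanned_ids() & scanned_ids()

print("modified by both:", cleanup_matched & timeout_matched)
print("scanned by both:", both_scanned)
```

No row is modified by both statements, yet under the full-scan assumption every row is read by both, which is where the lock overlap would come from.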

Typical pattern:

Scheduler: locks row A → waits for row B
Triggerer: locks row B → waits for row A
→ DEADLOCK
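The pattern above can be reproduced with a small deterministic simulation. This is a sketch, not Airflow code: `simulate` is a hypothetical helper that alternates two transactions one lock request at a time, holds locks until the transaction finishes, and reports whether both end up blocked. It explores only this one strict-alternation interleaving.

```python
def simulate(order_a, order_b):
    """Return True if transactions A and B deadlock, False if both finish.

    Each transaction requests row locks in the given order and keeps every
    lock until it has acquired all of them (then it "commits" and releases).
    """
    held = {}  # row -> name of the transaction currently holding its lock
    pending = {"A": list(order_a), "B": list(order_b)}

    while pending["A"] or pending["B"]:
        progressed = False
        for name in ("A", "B"):
            if not pending[name]:
                continue
            row = pending[name][0]
            owner = held.get(row)
            if owner is None or owner == name:
                held[row] = name
                pending[name].pop(0)
                progressed = True
                if not pending[name]:
                    # Commit: release every lock this transaction holds.
                    for locked in [r for r, o in held.items() if o == name]:
                        del held[locked]
        if not progressed:
            # Every unfinished transaction waits on a row the other holds.
            return True
    return False


# Scheduler takes A then B, triggerer takes B then A: opposite orders.
opposite_orders = simulate(["row_a", "row_b"], ["row_b", "row_a"])

# Both take rows in the same sorted order.
same_order = simulate(sorted(["row_a", "row_b"]), sorted(["row_b", "row_a"]))

print("opposite orders deadlock:", opposite_orders)
print("same order deadlock:", same_order)
```

With opposite orders each transaction grabs its first row and then blocks on the other's; with a shared order the second transaction simply waits until the first one commits.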

What you think should happen instead?

  1. Avoid concurrent writes:
     • Scheduler should not mutate task_instance fields owned by the triggerer
  2. Enforce consistent ordering:
     • Ensure both components lock rows in a deterministic order
  3. Batch updates:
     • Avoid large scans or uncontrolled ORM flushes
  4. Ownership separation:
     • Triggerer handles the deferred lifecycle exclusively
     • Scheduler only consumes results
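As a rough illustration of points 2 and 3, the sketch below clears stale trigger_id values in small batches, always in ascending primary-key order. It is not the Airflow implementation: the table is a hypothetical three-column stand-in for task_instance with an integer id, `make_demo_db` and `clear_stale_trigger_ids` are made-up helpers, and SQLite (from the standard library) replaces MySQL, so it shows only the statement shape and not InnoDB lock behaviour.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for task_instance (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance ("
        "id INTEGER PRIMARY KEY, state TEXT, trigger_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO task_instance (id, state, trigger_id) VALUES (?, ?, ?)",
        [
            (1, "running", 10),
            (2, "deferred", 11),
            (3, "success", 12),
            (4, "failed", 13),
            (5, "deferred", None),
            (6, "running", None),
        ],
    )
    conn.commit()
    return conn


def clear_stale_trigger_ids(conn, batch_size=2):
    """Clear trigger_id on non-deferred rows, one id-ordered batch at a time."""
    total = 0
    last_id = 0
    while True:
        # Pick the next batch of candidate ids in ascending primary-key order.
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM task_instance "
                "WHERE id > ? AND state != 'deferred' "
                "AND trigger_id IS NOT NULL "
                "ORDER BY id LIMIT ?",
                (last_id, batch_size),
            )
        ]
        if not ids:
            return total
        placeholders = ",".join("?" * len(ids))
        # Touch only the chosen rows, then commit so locks are held briefly.
        conn.execute(
            f"UPDATE task_instance SET trigger_id = NULL "
            f"WHERE id IN ({placeholders})",
            ids,
        )
        conn.commit()
        total += len(ids)
        last_id = ids[-1]


conn = make_demo_db()
cleared = clear_stale_trigger_ids(conn, batch_size=2)
print("rows cleared:", cleared)
```

Selecting ids first and updating by primary key keeps each write statement narrow, and committing per batch bounds how long any row stays locked. On MySQL the equivalent would still need both writers to agree on the same ordering for it to help.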

Operating System

Ubuntu 22.04.5 LTS

Deployment

None

Apache Airflow Provider(s)

No response

Versions of Apache Airflow Providers

No response

Official Helm Chart version

Not Applicable

Kubernetes Version

No response

Helm Chart configuration

No response

Docker Image customizations

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!


Labels

  • area:Scheduler
  • area:Triggerer
  • area:core
  • kind:bug
  • needs-triage
  • priority:critical
