Skip to content

Reuse an operator's output storage across region re-executions #5709

@aglinxinyuan

Description

@aglinxinyuan

Feature Summary

When a region re-executes, RegionExecutionCoordinator recreates each operator's iceberg output and state documents. An operator whose output accumulates across runs — e.g. a loop body that re-executes once per iteration — needs its existing storage preserved instead of clobbered on every region invocation. There is currently no way for an operator to opt into reusing its output storage across re-executions.

Proposed Solution or Design

  • Add a reusesOutputStorageOnReExecution: Boolean = false field to PhysicalOp plus a withReusesOutputStorageOnReExecution builder.
  • Add a pure provisionOutputDocument(uri, reuseExistingStorage, documentExists, createDocument) decision function in RegionExecutionCoordinator: when the owning operator sets the flag and the document already exists, reuse it; otherwise (re)create. Unit-test the reuse×exists matrix.
  • Name the flag for the behavior the scheduler checks, not the operator that sets it, so any future operator needing the same treatment can opt in.

Default false keeps every existing operator recreating its storage as before (no behavior change). The for-loop feature sets the flag on Loop End.

Affected Area

  • Workflow Engine (Amber)
  • Storage / Metadata

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No fields configured for Task.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions