Skip to content

Fix leaf expression reconciliation#22971

Merged
adriangb merged 4 commits into
apache:mainfrom
pydantic:fix_leaf_expression_projection_reconciliatin
Jun 16, 2026
Merged

Fix leaf expression reconciliation#22971
adriangb merged 4 commits into
apache:mainfrom
pydantic:fix_leaf_expression_projection_reconciliatin

Conversation

@cetra3

@cetra3 cetra3 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

This fixes a bug with the extract leaf expressions

What changes are included in this PR?

This is a one liner that sanity checks the schema is the same length when we are doing expression pushdown

Are these changes tested?

Yes, a couple of tests have been added.

Are there any user-facing changes?

Nope!

@github-actions github-actions Bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jun 16, 2026
Simplify recovery-projection conditions in split_and_push_projection

The recovery decision had four overlapping triggers. Two of them are
fully subsumed by the field-count check (any schema widening also changes
the field count), so remove them along with the now-dead
`standalone_columns` tracking:

- the `columns_needed`-widening heuristic
- the `transformed || !is_column` heuristic

What remains are the two genuine, complementary cases: the `needs_alias`
branch catches an equal-arity rename (which the count check cannot see),
and the field-count check catches a leaked-column widening (which the
rename check cannot see). Add comments explaining why a count comparison
is used deliberately rather than a stricter ordered/qualified schema
equality, which spuriously fires on benign reorders and qualifier
differences.

No behavior change: all extract_leaf tests and struct.slt pass unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cetra3 cetra3 force-pushed the fix_leaf_expression_projection_reconciliatin branch from ed934da to c2ee386 Compare June 16, 2026 10:26
adriangb and others added 3 commits June 16, 2026 13:04
My earlier simplification removed the `columns_needed`-widening heuristic
(and its `standalone_columns` tracking) on the theory that the field-count
check subsumed it. It does not: when a merge widens an *inner* projection
with column refs that came from inside extracted aliases (e.g. CSE-generated
`__common_expr_*` refs), the immediate `base_plan` field count can match the
original while the schema has still drifted. Dropping the heuristic
reintroduces `Schema error: No field named __common_expr_1` on the extended
sqllogictest suite (expr.slt / projection_pushdown.slt), which the unit
tests and struct.slt did not exercise.

Revert to the original, verified logic. A simplification can be revisited
later only if validated against the full extended slt suite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Of the two recovery heuristics I previously tried to remove together, only
one is actually redundant. Isolating them one at a time against the full
sqllogictest suite shows:

- the `transformed || !is_column` heuristic is LOAD-BEARING: removing it
  reintroduces `Schema error: No field named __common_expr_1` on expr.slt
  (a transformed/CSE expression whose recovery the field-count check cannot
  see). It is kept.
- the `columns_needed`-widening heuristic (and its `standalone_columns`
  tracking) is genuinely redundant given the above plus the field-count
  check: any widening it would catch is already covered.

Remove only the redundant one. Validated with no behavior change: all 49
extract_leaf unit tests and the full 485-file sqllogictest suite pass,
including expr.slt and projection_pushdown.slt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The recovery decision was spread across three conditions: a field-count
delta, a `transformed || !is_column` heuristic, and a `needs_alias` flag.
Together they were a proxy for the actual invariant: a recovery projection
is needed iff `base_plan` no longer exposes the original projection's set of
output columns. That happens two ways — a column is renamed (a transformed
expr now surfaces as its internal `__datafusion_extracted_*` alias) or a
column is leaked (pushdown widens base_plan with another extraction's
aliases). Both are exactly a change in the set of output names.

Collapse the three into one comparison of the set of unqualified field
names. Unqualified is deliberate: extracted aliases are globally unique so
name-only comparison is unambiguous for them, while it ignores benign column
reordering and SubqueryAlias re-qualification that a qualified/ordered
comparison spuriously treats as drift (stacking redundant recovery
projections — confirmed when I first tried qualified equality).

Validated as a pure refactor: all 49 extract_leaf unit tests pass with
byte-identical inline snapshots, and the full 485-file sqllogictest suite
passes (incl. expr.slt / projection_pushdown.slt).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@adriangb adriangb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cetra3

@adriangb adriangb added this pull request to the merge queue Jun 16, 2026
Merged via the queue into apache:main with commit d5f03d9 Jun 16, 2026
38 checks passed
@adriangb adriangb deleted the fix_leaf_expression_projection_reconciliatin branch June 16, 2026 15:46
cetra3 added a commit to pydantic/datafusion that referenced this pull request Jun 16, 2026
- Closes apache#22955

This fixes a bug with the extract leaf expressions

This is a one liner that sanity checks the schema is the same length
when we are doing expression pushdown

Yes, a couple of tests have been added.

Nope!

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

optimize_projections fails on leaf-expression pushdown over a view with an unconsumed column

2 participants