[SPARK-56395][SQL][FOLLOWUP] Fix RewriteNearestByJoin nullability for LEFT OUTER#56484
Closed
cloud-fan wants to merge 3 commits into
Closed
[SPARK-56395][SQL][FOLLOWUP] Fix RewriteNearestByJoin nullability for LEFT OUTER#56484cloud-fan wants to merge 3 commits into
cloud-fan wants to merge 3 commits into
Conversation
yadavay-amzn
approved these changes
Jun 13, 2026
yadavay-amzn
left a comment
Contributor
There was a problem hiding this comment.
LGTM with one optional follow-up
fix is minimal, idiomatic, and well-tested. No blocking issues.
One non-blocking observation:
The new regression test uses a deterministic ranking. The fix also covers the non-deterministic ranking branch (which routes through Alias(rankingInJoin, "__ranking__")), but no test fails if a future change accidentally regresses that branch. A non-deterministic-ranking variant with non-nullable right inputs would round out coverage. Optional.
What I verified
- Premise is real:
Join.computeOutputwidens right-side outputs forLeftOuter. The pre-fix code captures unwidened right.output references in places that sit on top of the widened join, creating a real internal inconsistency. - Bug repro and fix confirmed locally: restored the production file to pre-fix while keeping the new regression test, ran it, and got the exact error message documented in the PR body (LeftOuter: x#2 declared nullable=false but its child produces nullable=true).
- Why existing tests missed the bug: the DSL
$"x".intconstructsAttributeReferencewithnullable=trueby
default, makingwithNullability(true)a no-op. The new test uses non-nullable inputs explicitly, which is what
the PR body calls out and is the cleanest possible fixture.
… LEFT OUTER Followup to apache#55682. In RewriteNearestByJoin, when the NEAREST BY join type is LEFT OUTER, the synthesized Join widens the right-side columns to nullable. However, the synthesized Aggregate (and the optional __ranking__ Project) built on top of that join still referenced the right-side columns via right.output and rankingExpression with their original (non-nullable) nullability. As a result the rewritten plan can declare a right-side column as non-nullable while its child -- the join -- produces it as nullable. This commit maps the right-side attributes to their widened (nullable) form for LEFT OUTER and rewrites both the CreateStruct(right.*) and the ranking expression to use that widened nullability, so the rewritten plan's schema is consistent with its child. For INNER joins the right side is not widened, so this is a no-op.
…bility Update RewriteNearestByJoinSuite's expected-plan helper to mirror the widened right-side nullability for LEFT OUTER, and add a focused regression test that asserts every right-side attribute the synthesized Aggregate references agrees on nullability with its join child. The test uses non-nullable right columns so the LEFT OUTER widening is observable -- it fails without the production fix and passes with it. INNER stays a no-op.
… nondeterministic ranking path Generalize the LEFT OUTER nullability regression test in RewriteNearestByJoinSuite from an Aggregate-only assertion to a whole-plan integrity walk, and add a nondeterministic-ranking case alongside the deterministic one across both Inner and LeftOuter. The widened ranking reference lands in the Aggregate on the deterministic path but in a `__ranking__` Project (above the Join) on the nondeterministic path. The old assertion only inspected Aggregate expressions, so a regression isolated to the nondeterministic branch would not have been caught. No framework check guards this class of bug either -- LogicalPlanIntegrity compares types `asNullable` and schemas via `equalsIgnoreNullability` -- so this test is the only guard. Verified the new coverage has teeth: reverting just the nondeterministic alias to the un-widened expression fails the test specifically on LeftOuter/nondeterministic while the deterministic case still passes. Co-authored-by: Isaac
589beef to
ce977d2
Compare
gengliangwang
approved these changes
Jun 15, 2026
Contributor
Author
|
thanks for review, merging to master/4.x/4.2 (not a user-facing bug, won't fail RC) |
cloud-fan
added a commit
that referenced
this pull request
Jun 15, 2026
… LEFT OUTER ### What changes were proposed in this pull request? Followup to #55682. In `RewriteNearestByJoin`, when the `NEAREST BY` join type is `LEFT OUTER`, the synthesized `Join` widens the right-side columns to nullable. However, the synthesized `Aggregate` (and the optional `__ranking__` `Project`) built on top of that join still referenced the right-side columns via `right.output` and `rankingExpression` with their original (non-nullable) nullability. As a result the rewritten plan can declare a right-side column as non-nullable while its child -- the join -- produces it as nullable. This PR maps the right-side attributes to their widened (nullable) form for `LEFT OUTER` and rewrites both the `CreateStruct(right.*)` and the ranking expression to use that widened nullability, so the rewritten plan's schema is consistent with its child. For `INNER` joins the right side is not widened, so this is a no-op. ### Why are the changes needed? Without this fix the rewritten plan for a `LEFT OUTER NEAREST BY` declares right-side columns non-nullable while its join child produces them nullable -- an inconsistency that nullability/plan-integrity validation flags as a regression. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a regression test to `RewriteNearestByJoinSuite` that, for both `INNER` and `LEFT OUTER`, asserts every right-side attribute the synthesized `Aggregate` references agrees on nullability with its join child. The test uses **non-nullable** right-side columns so that `LEFT OUTER`'s widening is observable -- it fails without this fix (`x#.. declared nullable=false but its child produces nullable=true`) and passes with it, while `INNER` stays a no-op. The suite's expected-plan helper was also updated to mirror the widened nullability. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #56484 from cloud-fan/SPARK-56395-followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 62e4e16) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan
added a commit
that referenced
this pull request
Jun 15, 2026
… LEFT OUTER ### What changes were proposed in this pull request? Followup to #55682. In `RewriteNearestByJoin`, when the `NEAREST BY` join type is `LEFT OUTER`, the synthesized `Join` widens the right-side columns to nullable. However, the synthesized `Aggregate` (and the optional `__ranking__` `Project`) built on top of that join still referenced the right-side columns via `right.output` and `rankingExpression` with their original (non-nullable) nullability. As a result the rewritten plan can declare a right-side column as non-nullable while its child -- the join -- produces it as nullable. This PR maps the right-side attributes to their widened (nullable) form for `LEFT OUTER` and rewrites both the `CreateStruct(right.*)` and the ranking expression to use that widened nullability, so the rewritten plan's schema is consistent with its child. For `INNER` joins the right side is not widened, so this is a no-op. ### Why are the changes needed? Without this fix the rewritten plan for a `LEFT OUTER NEAREST BY` declares right-side columns non-nullable while its join child produces them nullable -- an inconsistency that nullability/plan-integrity validation flags as a regression. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a regression test to `RewriteNearestByJoinSuite` that, for both `INNER` and `LEFT OUTER`, asserts every right-side attribute the synthesized `Aggregate` references agrees on nullability with its join child. The test uses **non-nullable** right-side columns so that `LEFT OUTER`'s widening is observable -- it fails without this fix (`x#.. declared nullable=false but its child produces nullable=true`) and passes with it, while `INNER` stays a no-op. The suite's expected-plan helper was also updated to mirror the widened nullability. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #56484 from cloud-fan/SPARK-56395-followup. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 62e4e16) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Followup to #55682.
In
RewriteNearestByJoin, when theNEAREST BYjoin type isLEFT OUTER, the synthesizedJoinwidens the right-side columns to nullable. However, the synthesizedAggregate(and the optional__ranking__Project) built on top of that join still referenced the right-side columns viaright.outputandrankingExpressionwith their original (non-nullable) nullability. As a result the rewritten plan can declare a right-side column as non-nullable while its child -- the join -- produces it as nullable.This PR maps the right-side attributes to their widened (nullable) form for
LEFT OUTERand rewrites both theCreateStruct(right.*)and the ranking expression to use that widened nullability, so the rewritten plan's schema is consistent with its child. ForINNERjoins the right side is not widened, so this is a no-op.Why are the changes needed?
Without this fix the rewritten plan for a
LEFT OUTER NEAREST BYdeclares right-side columns non-nullable while its join child produces them nullable -- an inconsistency that nullability/plan-integrity validation flags as a regression.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a regression test to
RewriteNearestByJoinSuitethat, for bothINNERandLEFT OUTER, asserts every right-side attribute the synthesizedAggregatereferences agrees on nullability with its join child. The test uses non-nullable right-side columns so thatLEFT OUTER's widening is observable -- it fails without this fix (x#.. declared nullable=false but its child produces nullable=true) and passes with it, whileINNERstays a no-op. The suite's expected-plan helper was also updated to mirror the widened nullability.Was this patch authored or co-authored using generative AI tooling?
No.