[SPARK-56504][SQL] Extend join pushdown to accept pushed samples #56486
Open
XdithyX wants to merge 1 commit into
Hi @stanyao, can you please have a look when you get a chance?
What changes were proposed in this pull request?
This PR extends the DataSource V2 join pushdown API so that Spark can pass already-pushed table sample information to connectors when attempting join pushdown (SPARK-56504).
Before this change, Spark blocked join pushdown when either side of the join had a real pushed sample, because the merged join scan builder had no way to know about that sample. Allowing join pushdown in that state would have silently dropped the sample and changed query results.
This PR adds:
- A new sample-aware SupportsPushDownJoin.pushDownJoin(...) overload that receives:
  - leftSample
  - rightSample
- A public SupportsPushDownJoin.TableSample record containing:
  - lowerBound
  - upperBound
  - withReplacement
  - seed
  - sampleMethod
- Default compatibility behavior for existing connectors:
  - The default implementation returns false when either side has a real pushed sample, so existing connectors do not accidentally drop the sample.
- Optimizer wiring in V2ScanRelationPushDown to pass pushed sample information from the left and right scan builders into the new sample-aware join pushdown API.
- JDBC join pushdown support for pushed samples:
  - JDBCScanBuilder now preserves table sample clauses in the left and right pushed join subqueries.
- Test coverage for:
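The shape of the change listed above can be illustrated with a small, self-contained sketch. It is not the Spark API: the interface name JoinPushDownSketch, the string-based parameters, the field types, and the use of Optional are all assumptions made for illustration. Only the record's field names and the "default rejects real samples" behavior come from this description; delegating to the older method when no sample is present is also an assumption.

```java
import java.util.Optional;

// Hypothetical mirror of the sample description; field names follow the PR text,
// field types are assumed.
record TableSample(double lowerBound, double upperBound, boolean withReplacement,
                   long seed, String sampleMethod) {}

// Hypothetical, simplified stand-in for a join pushdown interface.
interface JoinPushDownSketch {
    // Older entry point that receives no sample information.
    boolean pushDownJoin(String leftTable, String rightTable, String condition);

    // Sample-aware overload. The default refuses whenever a sample is present,
    // so a connector that only implements the older method cannot drop it silently.
    default boolean pushDownJoin(String leftTable, String rightTable, String condition,
                                 Optional<TableSample> leftSample,
                                 Optional<TableSample> rightSample) {
        if (leftSample.isPresent() || rightSample.isPresent()) {
            return false;
        }
        // Assumption: with no sample involved, behave exactly like the older method.
        return pushDownJoin(leftTable, rightTable, condition);
    }
}

// A connector written before the overload existed: it overrides only the older method.
class LegacyConnector implements JoinPushDownSketch {
    @Override
    public boolean pushDownJoin(String leftTable, String rightTable, String condition) {
        return true;
    }
}
```

In this sketch, a LegacyConnector still accepts an unsampled join, but the inherited default rejects the same join as soon as either side carries a TableSample.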
Why are the changes needed?
SPARK-55978 added table sample pushdown support and intentionally blocked composing pushed samples with join pushdown. That block was necessary because SupportsPushDownJoin did not receive sample information. If Spark pushed the join after pushing a sample, the connector could build a joined scan without knowing that one side had already been sampled.
This PR fixes that limitation.
With this change, connectors that support both sample pushdown and join pushdown can make an explicit decision:
- true when they can correctly execute the sampled join
- false when they cannot
Existing connectors remain safe because the new default method rejects real pushed samples unless the connector explicitly opts in by overriding the new overload.
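As a rough sketch of the opt-in decision described above, the following hypothetical connector accepts a sampled join only when it can express the sample in its generated SQL, and keeps the sample clause inside the corresponding subquery. The class SamplingJoinConnector, the record PushedSample, the rule "only BERNOULLI without replacement is supported", and the TABLESAMPLE ... REPEATABLE syntax (loosely modeled on PostgreSQL-style SQL) are all assumptions; the actual JDBCScanBuilder logic and dialect handling are not shown in this description.

```java
import java.util.Locale;
import java.util.Optional;

// Hypothetical sample description; field names follow the PR text, types are assumed.
record PushedSample(double lowerBound, double upperBound, boolean withReplacement,
                    long seed, String sampleMethod) {}

// Hypothetical connector that opts in to sampled joins for one sample shape only.
class SamplingJoinConnector {
    private String sql; // stays null unless a join was pushed down

    boolean pushDownJoin(String leftTable, String rightTable, String condition,
                         Optional<PushedSample> leftSample,
                         Optional<PushedSample> rightSample) {
        // Explicit decision: return false rather than drop a sample we cannot express.
        if (!supported(leftSample) || !supported(rightSample)) {
            return false;
        }
        sql = "SELECT * FROM " + subquery(leftTable, leftSample, "l")
            + " JOIN " + subquery(rightTable, rightSample, "r")
            + " ON " + condition;
        return true;
    }

    // Assumed rule: no sample is always fine; otherwise only BERNOULLI without replacement.
    private static boolean supported(Optional<PushedSample> sample) {
        return sample
            .map(s -> !s.withReplacement() && "BERNOULLI".equals(s.sampleMethod()))
            .orElse(true);
    }

    // Keeps the sample clause inside the side's own subquery so it is applied before the join.
    private static String subquery(String table, Optional<PushedSample> sample, String alias) {
        String clause = sample
            .map(s -> String.format(Locale.ROOT, " TABLESAMPLE %s (%.2f) REPEATABLE (%d)",
                s.sampleMethod(), (s.upperBound() - s.lowerBound()) * 100.0, s.seed()))
            .orElse("");
        return "(SELECT * FROM " + table + clause + ") " + alias;
    }

    String sql() {
        return sql;
    }
}
```

The point of the design is that the sample stays attached to the side it was pushed on: the join is built over already-sampled subqueries, and any sample the connector cannot translate causes a false return so that Spark keeps the join on its own side.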
Does this PR introduce any user-facing change?
Yes, for DataSource V2 connectors that implement the new sample-aware join pushdown API.
Previously, Spark did not push down joins when either side had a real pushed sample. After this change, Spark can push down such joins if the connector explicitly supports the sample and join combination.
For existing connectors that only implement the old join pushdown API, behavior remains safe: Spark will not push down a join with a real pushed sample through the default implementation.
How was this patch tested?
Added and updated tests in:
- DataSourceV2TableSampleSuite
- JDBCSuite
Ran:
Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenAI Codex GPT 5.5