
[SPARK-56504][SQL] Extend join pushdown to accept pushed samples #56486

Open
XdithyX wants to merge 1 commit into apache:master from XdithyX:SPARK-56504

XdithyX commented Jun 13, 2026

Contributor

What changes were proposed in this pull request?

This PR extends the DataSource V2 join pushdown API so that Spark can pass already-pushed table sample information to connectors when attempting join pushdown (SPARK-56504).

Before this change, Spark blocked join pushdown when either side of the join had a real pushed sample, because the merged join scan builder had no way to know about that sample. Allowing join pushdown in that state would have silently dropped the sample and changed query results.

This PR adds:

  1. A new sample-aware SupportsPushDownJoin.pushDownJoin(...) overload that receives:

    • leftSample
    • rightSample

  2. A public SupportsPushDownJoin.TableSample record containing:

    • lowerBound
    • upperBound
    • withReplacement
    • seed
    • sampleMethod

  3. Default compatibility behavior for existing connectors:

    • If there is no sample, or only a no-op sample, Spark delegates to the existing join pushdown method.
    • If there is a real pushed sample, the default implementation returns false, so existing connectors do not accidentally drop the sample.

  4. Optimizer wiring in V2ScanRelationPushDown to pass pushed sample information from the left and right scan builders into the new sample-aware join pushdown API.

  5. JDBC join pushdown support for pushed samples:

    • JDBCScanBuilder now preserves table sample clauses in the left and right pushed join subqueries.
    • If a pushed sample cannot be represented as a JDBC table sample clause, join pushdown is rejected.

  6. Test coverage for:

    • left-side pushed sample with join pushdown
    • right-side pushed sample with join pushdown
    • both-side pushed samples with join pushdown
    • old join pushdown API compatibility
    • rejection of real pushed samples by connectors that do not override the new API
    • JDBC SQL generation preserving table sample clauses in pushed join sides
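
For reviewers, the compatibility behavior in items 1 to 3 can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `JoinPushdownSketch`, `LegacyBuilder`, `isNoOp`, and `isReal` are hypothetical names, the real `pushDownJoin` takes additional arguments (join type, columns, condition) that are omitted here, the `String` type for `sampleMethod` is a guess, using `null` for "no sample" is a simplification, and the definition of a no-op sample is an assumption.

```java
/** Hypothetical mirror of the fields listed for SupportsPushDownJoin.TableSample. */
record TableSample(double lowerBound, double upperBound,
                   boolean withReplacement, long seed, String sampleMethod) {
    /** Assumption: a sample over the full [0, 1] range without replacement keeps every row. */
    boolean isNoOp() {
        return !withReplacement && lowerBound <= 0.0 && upperBound >= 1.0;
    }
}

/** Simplified stand-in for the join pushdown interface. */
interface JoinPushdownSketch {
    /** Legacy entry point: receives no sample information. */
    boolean pushDownJoin(JoinPushdownSketch other);

    /** A sample is "real" if it was pushed (non-null here) and is not a no-op. */
    static boolean isReal(TableSample sample) {
        return sample != null && !sample.isNoOp();
    }

    /** Sample-aware overload with the compatibility default described above. */
    default boolean pushDownJoin(JoinPushdownSketch other,
                                 TableSample leftSample,
                                 TableSample rightSample) {
        if (isReal(leftSample) || isReal(rightSample)) {
            // Reject the join rather than silently drop a real sample.
            return false;
        }
        // No sample, or only no-op samples: fall back to the legacy method.
        return pushDownJoin(other);
    }
}

/** A legacy connector that implements only the old method. */
final class LegacyBuilder implements JoinPushdownSketch {
    @Override
    public boolean pushDownJoin(JoinPushdownSketch other) {
        return true;
    }
}
```

Under these assumptions, a `LegacyBuilder` still accepts joins with no sample or a no-op sample, and rejects any join where either side carries a real sample.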

Why are the changes needed?

SPARK-55978 added table sample pushdown support and intentionally blocked composing pushed samples with join pushdown. That block was necessary because SupportsPushDownJoin did not receive sample information. If Spark pushed the join after pushing a sample, the connector could build a joined scan without knowing that one side had already been sampled.

This PR fixes that limitation.

With this change, connectors that support both sample pushdown and join pushdown can make an explicit decision:

  • return true when they can correctly execute the sampled join
  • return false when they cannot

Existing connectors remain safe because the new default method rejects real pushed samples unless the connector explicitly opts in by overriding the new overload.
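
As a rough illustration of that accept-or-reject decision for a SQL-backed connector, the sketch below keeps each side's sample clause inside its join subquery and rejects the join when a sample cannot be expressed. It is not the `JDBCScanBuilder` code from this PR: `SampledJoinSql`, `Sample`, `sampleClause`, and `joinSql` are hypothetical names, the `TABLESAMPLE BERNOULLI (...) REPEATABLE (...)` text is PostgreSQL-style syntax chosen only for illustration, and the rule for which samples are representable is an assumption.

```java
/** Hypothetical helper; not the JDBCScanBuilder implementation. */
final class SampledJoinSql {
    /** Illustrative sample descriptor; null means no sample was pushed for that side. */
    record Sample(double lowerBound, double upperBound, boolean withReplacement, long seed) {}

    /**
     * Returns the sample clause for one join side: "" when there is no sample,
     * or null when the sample cannot be expressed in this made-up dialect.
     */
    static String sampleClause(Sample sample) {
        if (sample == null) {
            return "";
        }
        // Assumption: only a percentage starting at 0, without replacement, is supported.
        if (sample.withReplacement() || sample.lowerBound() != 0.0) {
            return null;
        }
        double percent = sample.upperBound() * 100.0;
        return " TABLESAMPLE BERNOULLI (" + percent + ") REPEATABLE (" + sample.seed() + ")";
    }

    /** Builds the joined query, or returns null to signal that join pushdown is rejected. */
    static String joinSql(String leftTable, Sample leftSample,
                          String rightTable, Sample rightSample,
                          String condition) {
        String left = sampleClause(leftSample);
        String right = sampleClause(rightSample);
        if (left == null || right == null) {
            // Equivalent to returning false from the sample-aware overload.
            return null;
        }
        // Each sample clause stays attached to its own side's subquery.
        return "SELECT * FROM (SELECT * FROM " + leftTable + left + ") a"
            + " JOIN (SELECT * FROM " + rightTable + right + ") b"
            + " ON " + condition;
    }
}
```

Returning `null` here plays the role of returning false from the overload: Spark would then keep the sample and evaluate the join itself instead of pushing it down.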

Does this PR introduce any user-facing change?

Yes, for DataSource V2 connectors that implement the new sample-aware join pushdown API.

Previously, Spark did not push down joins when either side had a real pushed sample. After this change, Spark can push down such joins if the connector explicitly supports the sample and join combination.

For existing connectors that only implement the old join pushdown API, behavior remains safe: Spark will not push down a join with a real pushed sample through the default implementation.

How was this patch tested?

Added and updated tests in:

  • DataSourceV2TableSampleSuite
  • JDBCSuite
  • in-memory test connector/catalog fixtures for sample-aware and legacy join pushdown behavior

Ran:

build/sbt 'sql/testOnly *DataSourceV2TableSampleSuite' 'sql/testOnly org.apache.spark.sql.jdbc.JDBCSuite -- -z SPARK-56504' 

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex GPT 5.5

XdithyX marked this pull request as ready for review June 13, 2026 18:14

XdithyX commented Jun 13, 2026

Contributor Author

Hi @stanyao, can you please have a look when you get a chance?
Thank you!
