[SPARK-56660][SQL] Decompose struct equality into field-level predicates for filter pushdown #56244

Open
yadavay-amzn wants to merge 2 commits into apache:master from yadavay-amzn:fix/SPARK-56660-struct-predicate-decompose

yadavay-amzn commented Jun 1, 2026

Contributor

What changes were proposed in this pull request?

Add optimizer rule DecomposeStructComparison that rewrites struct-level equality (= and <=>) into a conjunction of field-level equalities. This enables filter pushdown for individual struct fields.

For example, struct_col = struct(1, 'a') becomes struct_col.field1 = 1 AND struct_col.field2 = 'a'.
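The rewrite in the example above can be sketched outside Spark. The following is a minimal illustration in plain Python, not the Catalyst rule itself: the function names `decompose_struct_equality` and `render` are hypothetical, a Python `dict` stands in for a struct literal, and nested dicts stand in for nested structs.

```python
from typing import Any, Dict, List, Tuple


def decompose_struct_equality(column: str, literal: Dict[str, Any]) -> List[Tuple[str, Any]]:
    """Flatten `column = <struct literal>` into (field_path, value) pairs.

    Hypothetical helper: a dict models a struct literal, and a nested dict
    models a nested struct, which is flattened recursively.
    """
    predicates: List[Tuple[str, Any]] = []
    for name, value in literal.items():
        path = f"{column}.{name}"
        if isinstance(value, dict):
            predicates.extend(decompose_struct_equality(path, value))
        else:
            predicates.append((path, value))
    return predicates


def render(predicates: List[Tuple[str, Any]]) -> str:
    """Render the pairs as a SQL-like conjunction (for illustration only)."""
    return " AND ".join(f"{path} = {value!r}" for path, value in predicates)


flat = decompose_struct_equality("struct_col", {"field1": 1, "field2": "a"})
print(render(flat))
# struct_col.field1 = 1 AND struct_col.field2 = 'a'

nested = decompose_struct_equality("s", {"id": 7, "addr": {"zip": "98101"}})
print(render(nested))
# s.id = 7 AND s.addr.zip = '98101'
```

Each resulting pair is a scalar predicate on a single field path, which is the shape that data source filter pushdown is said above to understand.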

Why are the changes needed?

Struct literal comparisons and tuple comparisons are treated as opaque predicates by the optimizer. Data source filter pushdown only understands scalar predicates, so struct equality cannot be pushed down for file pruning (Parquet row group skipping, partition pruning, etc.), even though the equivalent scalar predicates would be pushed.

Does this PR introduce any user-facing change?

Yes. Queries filtering on struct equality will now benefit from file pruning and filter pushdown, improving performance on large tables.

How was this patch tested?

Added StructPredicateDecomposeSuite with tests covering EqualTo, EqualNullSafe, nested structs, single-field structs, empty structs, tuple comparisons, non-deterministic guard, and GreaterThan exclusion.

Was this patch authored or co-authored using generative AI tooling?

Yes.

yadavay-amzn force-pushed the fix/SPARK-56660-struct-predicate-decompose branch 3 times, most recently from 857e4be to a9a74c4 on June 3, 2026 01:15

yyanyy left a comment

Contributor

Thanks for making this change!

yadavay-amzn force-pushed the fix/SPARK-56660-struct-predicate-decompose branch from a9a74c4 to 76dca41 on June 5, 2026 01:53

yadavay-amzn commented Jun 5, 2026

Contributor Author

@yyanyy Thanks for reviewing, and great catch on the NULL semantics. You're right!

Spark's struct equality uses InterpretedOrdering which treats null=null within fields as equal (returns TRUE), while EqualTo(null, null) returns NULL.

Fixed: the decomposition now uses EqualNullSafe (<=>) for per-field comparisons, which matches the struct equality semantics exactly:

  • null <=> null → true (matches struct behavior)
  • null <=> 2 → false (matches struct behavior)

The only remaining discrepancy is when the entire struct itself is null: the original returns NULL, while the decomposed form returns FALSE. Because our rule only fires in Filter context, this is harmless, since both NULL and FALSE exclude the row from WHERE.
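The NULL behavior described in this comment can be checked with a small three-valued-logic model. This is a sketch in plain Python, not Spark code: `None` models SQL NULL, tuples model structs, and every function name (`equal_to`, `equal_null_safe`, `and3`, `get_field`, `struct_equal_to`, `decomposed`, `filter_keeps`) is hypothetical. It also assumes that reading a field from a NULL struct yields NULL.

```python
from typing import Callable, Optional, Sequence

Struct = Optional[Sequence[object]]  # None models a NULL struct; a None element models a NULL field


def equal_to(a: object, b: object) -> Optional[bool]:
    """Scalar `=`: NULL if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def equal_null_safe(a: object, b: object) -> bool:
    """Scalar `<=>`: never NULL; NULL <=> NULL is true."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def and3(values) -> Optional[bool]:
    """Three-valued AND: FALSE dominates, then NULL, otherwise TRUE."""
    values = list(values)
    if any(v is False for v in values):
        return False
    if any(v is None for v in values):
        return None
    return True


def get_field(s: Struct, i: int) -> object:
    """Assumption: a field read from a NULL struct is NULL."""
    return None if s is None else s[i]


def struct_equal_to(left: Struct, right: Struct) -> Optional[bool]:
    """Whole-struct `=` as described in the thread: NULL fields compare equal."""
    if left is None or right is None:
        return None
    return all(equal_null_safe(l, r) for l, r in zip(left, right))


def decomposed(left: Struct, right: Struct, n: int, field_cmp: Callable) -> Optional[bool]:
    """Conjunction of per-field comparisons over n fields."""
    return and3(field_cmp(get_field(left, i), get_field(right, i)) for i in range(n))


def filter_keeps(predicate: Optional[bool]) -> bool:
    """A WHERE clause keeps a row only when the predicate is TRUE."""
    return predicate is True


a = (1, None)
print(struct_equal_to(a, a))                          # True
print(decomposed(a, a, 2, equal_to))                  # None: per-field `=` changes the result
print(decomposed(a, a, 2, equal_null_safe))           # True: per-field `<=>` matches
print(struct_equal_to(None, (1, 2)))                  # None
print(decomposed(None, (1, 2), 2, equal_null_safe))   # False, also dropped by a filter
```

In this model NULL and FALSE are interchangeable only when the predicate is used directly as the filter condition. Under a negation they differ in three-valued logic, since NOT FALSE is TRUE while NOT NULL stays NULL.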

Also added a width guard (max 100 fields) to prevent stack overflow on very wide/deeply nested structs, per your second concern.
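To see what a width guard does and does not bound, the sketch below compares the widest single struct level with the total number of scalar leaf fields, which is the number of comparisons a full decomposition would emit. It is a hypothetical model in plain Python: schemas are nested dicts, leaf values are type names, and `max_level_width` and `total_leaf_fields` are illustrative names, not functions from the PR.

```python
from typing import Any, Dict

Schema = Dict[str, Any]  # field name -> type name (leaf) or nested Schema (struct)


def max_level_width(schema: Schema) -> int:
    """Largest number of direct fields found at any single struct level."""
    widths = [len(schema)]
    for child in schema.values():
        if isinstance(child, dict):
            widths.append(max_level_width(child))
    return max(widths)


def total_leaf_fields(schema: Schema) -> int:
    """Number of scalar leaf fields across all nesting levels."""
    return sum(
        total_leaf_fields(child) if isinstance(child, dict) else 1
        for child in schema.values()
    )


# 50 struct fields, each holding 50 int fields.
inner = {f"c{j}": "int" for j in range(50)}
wide_nested = {f"s{i}": dict(inner) for i in range(50)}

print(max_level_width(wide_nested))    # 50
print(total_leaf_fields(wide_nested))  # 2500
```

A schema like this stays under a per-level limit of 100 while still expanding into 2500 field comparisons, so a per-level check and a total count can give very different answers for nested structs.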

yadavay-amzn force-pushed the fix/SPARK-56660-struct-predicate-decompose branch from 76dca41 to 707a859 on June 9, 2026 00:05
yadavay-amzn added a commit to yadavay-amzn/spark that referenced this pull request Jun 9, 2026
…Conf; rework tests

Addresses review feedback on PR apache#56244:

1. Correctness fix for NULL handling. The original decomposition rewrote
   EqualTo(struct, struct) into a plain conjunction of per-field EqualTo
   comparisons, which silently changed semantics for non-null structs that
   contained NULL fields:
     - Before this PR: struct(1, null) = struct(1, null) returned TRUE
       (Spark's whole-struct EqualTo evaluates ordering.equiv on the row,
       which treats per-field NULL == NULL as equal).
     - With original PR apache#56244: returned NULL.

   The fix wraps the conjunction with a null-check that mirrors the
   original outer null behavior:
     - EqualTo(L, R) over nullable structs: IF (L IS NULL OR R IS NULL)
       THEN NULL ELSE And(EqualNullSafe(L.fi, R.fi)).
     - EqualNullSafe(L, R): IF (L IS NULL AND R IS NULL) THEN TRUE
       ELSE IF (L IS NULL OR R IS NULL) THEN FALSE
       ELSE And(EqualNullSafe(L.fi, R.fi)).
   The wrappers fold out cleanly when either operand is non-nullable,
   leaving the simple conjunction in the common
   `CreateNamedStruct = column` pushdown case.

2. SQLConf gate. Add `spark.sql.optimizer.decomposeStructComparison.enabled`
   (default false) so users opt in once the behavior has soaked. Add
   `spark.sql.optimizer.decomposeStructComparison.maxFields` (default 1000)
   that bounds total decomposed predicates including recursively nested
   struct fields, replacing the unprincipled per-level field cap of 100.

3. Scaladoc explaining Filter scope. Document why join conditions and
   aggregate grouping keys are deliberately not rewritten.

4. Tests reworked as oracle tests. The original suite asserted post-rewrite
   NULL behavior directly, which codified the regression as expected. The
   rewritten suite uses two patterns:
     - Catalyst-level: build expressions and assert eval result of original
       expression equals eval result of rewritten expression on representative
       inputs (struct(1, null), whole-struct null, Not wrapper, etc.).
     - End-to-end: run each query with the rule enabled and with the conf
       disabled; assert row sets are identical.
   Added tests for: Not(struct = struct) with NULL fields, whole-struct null
   on one side, conf gating. Removed: 3 wrong-oracle NULL tests, structural-
   only "nullable fields decomposes" test, duplicate LessThan, duplicate
   3-level nested, single-field, duplicate join test in catalyst suite.
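
The null-check wrappers from item 1 and the oracle pattern from item 4 can be sketched together. This is a plain-Python model under stated assumptions, not the Catalyst code or the PR's test suite: `None` models SQL NULL, tuples model structs, Python tuple equality stands in for the whole-struct comparison (it treats `None` fields as equal), and all function names are hypothetical.

```python
from itertools import product
from typing import Optional, Tuple

Struct = Optional[Tuple[Optional[int], ...]]  # None models a NULL struct


def null_safe(a: Optional[int], b: Optional[int]) -> bool:
    """Scalar `<=>`: never NULL; NULL <=> NULL is true."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def not3(v: Optional[bool]) -> Optional[bool]:
    """Three-valued NOT: NULL stays NULL."""
    return None if v is None else (not v)


# Oracle: the original whole-struct comparisons, modeled with tuple equality.
def original_equal_to(left: Struct, right: Struct) -> Optional[bool]:
    if left is None or right is None:
        return None
    return left == right


def original_equal_null_safe(left: Struct, right: Struct) -> bool:
    return left == right


# Rewrites: the wrapped per-field conjunctions from item 1.
def rewritten_equal_to(left: Struct, right: Struct) -> Optional[bool]:
    # IF (L IS NULL OR R IS NULL) THEN NULL ELSE And(EqualNullSafe(L.fi, R.fi))
    if left is None or right is None:
        return None
    return all(null_safe(l, r) for l, r in zip(left, right))


def rewritten_equal_null_safe(left: Struct, right: Struct) -> bool:
    # IF both NULL THEN TRUE, ELSE IF either NULL THEN FALSE, ELSE the conjunction
    if left is None and right is None:
        return True
    if left is None or right is None:
        return False
    return all(null_safe(l, r) for l, r in zip(left, right))


SAMPLES = [None, (1, None), (1, 2), (None, None), (2, None)]


def check_oracle() -> int:
    """Assert original and rewritten agree on every pair, bare and under NOT."""
    checked = 0
    for left, right in product(SAMPLES, repeat=2):
        for original, rewritten in (
            (original_equal_to, rewritten_equal_to),
            (original_equal_null_safe, rewritten_equal_null_safe),
        ):
            expected = original(left, right)
            actual = rewritten(left, right)
            assert actual is expected, (original.__name__, left, right)
            assert not3(actual) is not3(expected), (original.__name__, left, right)
            checked += 1
    return checked


print(check_oracle())  # 50
```

Because the expected value comes from evaluating the original expression rather than from a hand-written constant, a rewrite that returned NULL for `struct(1, null) = struct(1, null)` would fail the comparison instead of being recorded as the expected result.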
yadavay-amzn force-pushed the fix/SPARK-56660-struct-predicate-decompose branch from 7de21ca to 8941106 on June 16, 2026 19:02
2 participants