perf(spark): fold concat_ws plans with literal-NULL separator via simplify#22959
perf(spark): fold concat_ws plans with literal-NULL separator via simplify#22959davidlghellin wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds plan-time simplification for Spark concat_ws and extends sqllogictest coverage to validate constant folding behavior and runtime results.
Changes:
- Implement
ScalarUDFImpl::simplifyforSparkConcatWsto fold specific constant cases early. - Add
EXPLAIN-based sqllogictest assertions to validate folding behavior (logical/physical plans) plus runtime sanity checks.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| datafusion/spark/src/function/string/concat_ws.rs | Adds simplify() rules to fold concat_ws to constant NULL / empty string in specific cases. |
| datafusion/sqllogictest/test_files/spark/string/concat_ws.slt | Adds plan-time folding EXPLAIN assertions and runtime sanity queries for new simplifications. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| fn simplify( | ||
| &self, | ||
| args: Vec<Expr>, | ||
| _info: &SimplifyContext, | ||
| ) -> Result<ExprSimplifyResult> { | ||
| if let Some(Expr::Literal(scalar, _)) = args.first() | ||
| && scalar.is_null() | ||
| { | ||
| return Ok(ExprSimplifyResult::Simplified(Expr::Literal( | ||
| ScalarValue::Utf8(None), | ||
| None, | ||
| ))); | ||
| } |
There was a problem hiding this comment.
Empirically not an issue today — the SLT EXPLAIN SELECT concat_ws(CAST(NULL AS STRING), a, b) ... folds to Projection: Utf8(NULL), which means ConstEvaluator collapses the CAST to a typed NULL literal before simplify() runs. If the optimizer pass order ever shifts and the wrapper survives, the SLT will fail visibly. Happy to add peeling if you have a reproducer where it doesn't fold today.
| return Ok(ExprSimplifyResult::Simplified(Expr::Literal( | ||
| ScalarValue::Utf8(None), | ||
| None, | ||
| ))); | ||
| } | ||
|
|
||
| if matches!(args.as_slice(), [Expr::Literal(_, _)]) { | ||
| return Ok(ExprSimplifyResult::Simplified(Expr::Literal( | ||
| ScalarValue::Utf8(Some(String::new())), | ||
| None, | ||
| ))); | ||
| } |
There was a problem hiding this comment.
The second arg to Expr::Literal is Option<FieldMetadata>, not Option<DataType>. The type is already encoded in the ScalarValue variant (ScalarValue::Utf8(None) IS a typed Utf8 NULL). None for metadata is the standard pattern across datafusion-functions.
| query TT | ||
| EXPLAIN SELECT concat_ws(CAST(NULL AS STRING), a, b) AS r FROM VALUES ('x', 'y'), ('p', 'q') AS t(a, b); | ||
| ---- | ||
| logical_plan | ||
| 01)Projection: Utf8(NULL) AS r | ||
| 02)--SubqueryAlias: t | ||
| 03)----Projection: | ||
| 04)------Values: (Utf8("x"), Utf8("y")), (Utf8("p"), Utf8("q")) | ||
| physical_plan | ||
| 01)ProjectionExec: expr=[NULL as r] | ||
| 02)--DataSourceExec: partitions=1, partition_sizes=[1] |
There was a problem hiding this comment.
Acknowledged. Kept the physical plan details because they confirm the actual optimization win (no concat_ws in ProjectionExec, no columns scanned in DataSourceExec) — ... wildcards would let a regression through silently. Matches the repo convention for EXPLAIN cases (see simplify_expressions.slt, aggregate.slt). When optimizer formatting shifts, the expected output gets updated as part of that PR.
Which issue does this PR close?
Rationale for this change
concat_wsplans containing a literal-NULL separator can be folded at plan time, but DataFusion's existingConstEvaluatoronly fires when all arguments are literals. When the separator is a literal NULL but the value arguments are columns, the call survives until execution and runs per row to produce NULL each time — pure overhead that downstream projection pushdown can't reclaim because the columns are still referenced.What changes are included in this PR?
A new
ScalarUDFImpl::simplifyimplementation onSparkConcatWswith two rules:concat_ws(NULL_literal, …)→NULL::Utf8. A literal-NULL separator yields NULL regardless of the value args, including when they are columns. This is the caseConstEvaluatorcan't reach.concat_ws(sep_literal)(only the separator, no value args) →''. Mostly redundant withConstEvaluatorfor the all-literal case; kept for symmetry.The runtime physical execution path (
invoke_with_args,only_separator,spark_concat_ws, theStringView/ArgViewenums, and thewrite_list_row/push_parthelpers) is unchanged.Are these changes tested?
Yes. New SLT cases in
datafusion/sqllogictest/test_files/spark/string/concat_ws.sltare grouped into four labelled categories:Are there any user-facing changes?
No API or behaviour changes.