Skip to content

perf(spark): fold concat_ws plans with literal-NULL separator via simplify#22959

Draft
davidlghellin wants to merge 2 commits into
apache:mainfrom
davidlghellin:perf/concat_ws_simplify
Draft

perf(spark): fold concat_ws plans with literal-NULL separator via simplify#22959
davidlghellin wants to merge 2 commits into
apache:mainfrom
davidlghellin:perf/concat_ws_simplify

Conversation

@davidlghellin

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

  • NA

Rationale for this change

concat_ws plans containing a literal-NULL separator can be folded at plan time, but DataFusion's existing ConstEvaluator only fires when all arguments are literals. When the separator is a literal NULL but the value arguments are columns, the call survives until execution and runs per row to produce NULL each time — pure overhead that downstream projection pushdown can't reclaim because the columns are still referenced.

What changes are included in this PR?

A new ScalarUDFImpl::simplify implementation on SparkConcatWs with two rules:

  1. concat_ws(NULL_literal, …)NULL::Utf8. A literal-NULL separator yields NULL regardless of the value args, including when they are columns. This is the case ConstEvaluator can't reach.
  2. concat_ws(sep_literal) (only the separator, no value args) → ''. Mostly redundant with ConstEvaluator for the all-literal case; kept for symmetry.

The runtime physical execution path (invoke_with_args, only_separator, spark_concat_ws, the StringView/ArgView enums, and the write_list_row/push_part helpers) is unchanged.

Are these changes tested?

Yes. New SLT cases in datafusion/sqllogictest/test_files/spark/string/concat_ws.slt are grouped into four labelled categories:

Are there any user-facing changes?

No API or behaviour changes.

Copilot AI review requested due to automatic review settings June 15, 2026 17:33
@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) spark labels Jun 15, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds plan-time simplification for Spark concat_ws and extends sqllogictest coverage to validate constant folding behavior and runtime results.

Changes:

  • Implement ScalarUDFImpl::simplify for SparkConcatWs to fold specific constant cases early.
  • Add EXPLAIN-based sqllogictest assertions to validate folding behavior (logical/physical plans) plus runtime sanity checks.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
datafusion/spark/src/function/string/concat_ws.rs Adds simplify() rules to fold concat_ws to constant NULL / empty string in specific cases.
datafusion/sqllogictest/test_files/spark/string/concat_ws.slt Adds plan-time folding EXPLAIN assertions and runtime sanity queries for new simplifications.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +128 to +140
fn simplify(
&self,
args: Vec<Expr>,
_info: &SimplifyContext,
) -> Result<ExprSimplifyResult> {
if let Some(Expr::Literal(scalar, _)) = args.first()
&& scalar.is_null()
{
return Ok(ExprSimplifyResult::Simplified(Expr::Literal(
ScalarValue::Utf8(None),
None,
)));
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empirically not an issue today — the SLT EXPLAIN SELECT concat_ws(CAST(NULL AS STRING), a, b) ... folds to Projection: Utf8(NULL), which means ConstEvaluator collapses the CAST to a typed NULL literal before simplify() runs. If the optimizer pass order ever shifts and the wrapper survives, the SLT will fail visibly. Happy to add peeling if you have a reproducer where it doesn't fold today.

Comment on lines +136 to +147
return Ok(ExprSimplifyResult::Simplified(Expr::Literal(
ScalarValue::Utf8(None),
None,
)));
}

if matches!(args.as_slice(), [Expr::Literal(_, _)]) {
return Ok(ExprSimplifyResult::Simplified(Expr::Literal(
ScalarValue::Utf8(Some(String::new())),
None,
)));
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second arg to Expr::Literal is Option<FieldMetadata>, not Option<DataType>. The type is already encoded in the ScalarValue variant (ScalarValue::Utf8(None) IS a typed Utf8 NULL). None for metadata is the standard pattern across datafusion-functions.

Comment on lines +392 to +402
query TT
EXPLAIN SELECT concat_ws(CAST(NULL AS STRING), a, b) AS r FROM VALUES ('x', 'y'), ('p', 'q') AS t(a, b);
----
logical_plan
01)Projection: Utf8(NULL) AS r
02)--SubqueryAlias: t
03)----Projection:
04)------Values: (Utf8("x"), Utf8("y")), (Utf8("p"), Utf8("q"))
physical_plan
01)ProjectionExec: expr=[NULL as r]
02)--DataSourceExec: partitions=1, partition_sizes=[1]

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. Kept the physical plan details because they confirm the actual optimization win (no concat_ws in ProjectionExec, no columns scanned in DataSourceExec) — ... wildcards would let a regression through silently. Matches the repo convention for EXPLAIN cases (see simplify_expressions.slt, aggregate.slt). When optimizer formatting shifts, the expected output gets updated as part of that PR.

@davidlghellin davidlghellin marked this pull request as draft June 15, 2026 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

spark sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants