[SPARK-47670][SQL] Share repeated top-level JSON path parsing by sunchao · Pull Request #56547 · apache/spark

sunchao · 2026-06-16T16:18:39Z

Why are the changes needed?

Each get_json_object expression currently parses its JSON input independently. A projection that extracts several fields from the same JSON string therefore repeats the scan and parse work for every path, so its CPU cost grows roughly with the number of extracted fields.

SPARK-47670 proposes sharing that parsing work. This PR implements a conservative first scope for deterministic, simple top-level paths. Nested paths, wildcards, array subscripts, computed JSON inputs, and evaluation-order-sensitive expression shapes remain unchanged so that Hive-compatible behavior is preserved.

What changes were proposed in this PR?

Add an internal MultiGetJsonObject expression and evaluator that extracts several top-level fields with one Jackson parse and returns them as a struct.
Extend OptimizeCsvJsonExprs to group eligible sibling get_json_object expressions over the same input attribute and rewrite them to fields of one shared parse.
Preserve legacy behavior for malformed JSON, null and duplicate keys, generator-side rendering failures, and parser-side failures. The shared evaluator falls back to the existing per-path evaluator when sharing cannot safely preserve independent results.
Keep nested paths, wildcards, array subscripts, conditional or guarded evaluation, unsupported arithmetic order, and separately projected from_json expressions out of the rewrite.
Gate the optimization behind the internal, default-disabled spark.sql.optimizer.getJsonObjectSharedParsing.enabled configuration. The existing JSON expression optimization must also be enabled.
Add optimizer, runtime, code-generation, malformed-input, deep-nesting, and microbenchmark coverage.

Does this PR introduce any user-facing change?

Yes. It adds an internal, default-disabled configuration. With the flag disabled, existing analyzed and optimized plans are unchanged. When enabled, eligible repeated top-level get_json_object calls use a shared parse; their result semantics remain unchanged.

How was this PR tested?

The following suites and style checks passed on JDK 17:

build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite"
build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite"
build/sbt "catalyst/scalastyle; catalyst/Test/scalastyle; sql/scalastyle; sql/Test/scalastyle"

The complete optimizer suite passed 20 tests, and the complete JsonFunctionsSuite passed 104 tests.

The added benchmark was generated with:

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain org.apache.spark.sql.execution.datasources.json.SharedJsonParseBenchmark"

Best times for 200,000 rows containing 32 JSON fields were:

Extracted paths	Shared parsing off	Shared parsing on	Relative speedup
2	740 ms	390 ms	1.9x
4	1,387 ms	480 ms	2.9x
8	3,154 ms	727 ms	4.3x
16	5,109 ms	709 ms	7.2x

The benchmark result file records the full JDK 17 environment and output.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5)

sunchao · 2026-06-16T17:31:48Z

cc @cloud-fan @viirya @dongjoon-hyun @peter-toth could you take a look? Thanks!

viirya

Nicely scoped change — defaulting off, restricting to deterministic simple top-level fields, and the care taken to preserve the legacy per-path semantics (first-non-null-on-duplicate-keys, all-null on a syntax error, raw vs quoted string rendering, generator-failure isolation) all hold up under tracing. A few non-blocking items before merge:

1. Lock down the parser-side fallback assumption with a legacy comparison.
shared get_json_object falls back after parser-side rendering failure only does checkAnswer(query, Row(null, "2")). For that result to come out of fallback, the legacy extraction of $['b.c'] must scan past the oversized a value without tripping StreamConstraintsException — i.e. Jackson must not materialize the non-matching string during skipChildren/nextToken. That's a real dependency on Jackson internals, and it's the one place the test doesn't pin behavior against the legacy path. Could you add a shared == legacy comparison there, the way the other runtime tests do, so a future Jackson bump can't silently change this?

2. The collect/rewrite allow/deny lists must stay in lockstep.
collectGetJsonObjectFields and rewriteGetJsonObjectFields carry the same exclusion set (ConditionalExpression | And | Or | In | TryEval | LambdaFunction | CreateNamedStruct) and the same recursion set (Alias/GetStructField/Cast/left-first BinaryArithmetic), maintained independently. If they ever drift — collect treats something as shareable that rewrite doesn't descend into (or vice versa) — you get an un-rewritten GetJsonObject left behind. It's not a correctness hazard today (rewrite just leaves the original in place), but it's an easy footgun when someone adds a branch later. Worth extracting the shared classification into one helper both sides consume.

3. nullIntolerant = true diverges from GetJsonObject.
GetJsonObject doesn't declare nullIntolerant (defaults to false); MultiGetJsonObject does. It's correct here — null json yields a null struct — but the divergence is non-obvious. A one-line comment on why would save a future reader the double-take.

4. Benchmark result file.
SharedJsonParseBenchmark-results.txt was generated locally on macOS / Apple M5 Max / JDK 17. Committed benchmark results are conventionally regenerated by the CI bot on the Linux runners (and across JDK 17/21/25). The bot will regenerate on merge, so this is cosmetic, but the macOS/M5 header is out of line with the rest of sql/core/benchmarks.

Optional: a test with a non-UTF8_BINARY collation on the json input (since inputTypes uses StringTypeWithCollation) would close a small coverage gap, and a fuzz/property comparison (random JSON x random simple paths, shared vs legacy) would be the strongest guard for a change whose whole job is to replicate existing semantics.

dongjoon-hyun

Thank you, @sunchao . Looks promising to me and agree with the above mentioned review comment too. And, additional comments are the following.

dongjoon-hyun · 2026-06-16T18:37:51Z

+ * expression used to share sibling [[GetJsonObject]] expressions; unsupported JSON paths remain
+ * as independent GetJsonObject expressions.
+ */
+case class MultiGetJsonObject(


Like other classes, shall we define prettyName to be more complete? multi_get_json_object?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala

Line 64 in a6e3fdd

override def prettyName: String = "get_json_object"

dongjoon-hyun · 2026-06-16T18:45:39Z

+  }
+
+  private def copyCurrentStructure(parser: JsonParser): Option[UTF8String] = {
+    val output = new ByteArrayOutputStream()


Do you think we can reuse output buffer via reset like GetJsonObjectEvaluator?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/json/JsonExpressionEvalUtils.scala

Line 397 in a6e3fdd

outputBuffer.reset()

yes! updated

sunchao · 2026-06-16T21:04:48Z

Thanks @viirya @dongjoon-hyun ! Addressed the comments and updated the PR

dongjoon-hyun · 2026-06-16T21:13:42Z

Thank you, @sunchao . BTW, as mentioned by Liang-Chi, you need to generate the benchmark results still~ Please trigger it at your fork and on your branch .

https://github.com/apache/spark/actions/workflows/benchmark.yml

….SharedJsonParseBenchmark (JDK 17, Scala 2.13, split 1 of 1)

dongjoon-hyun · 2026-06-16T21:49:34Z

Like the following, we need all Java combinations. I'm waiting for 21 and 25.

$ ls core/benchmarks/PlatformBenchmark-*              
core/benchmarks/PlatformBenchmark-jdk21-results.txt
core/benchmarks/PlatformBenchmark-jdk25-results.txt
core/benchmarks/PlatformBenchmark-results.txt

….SharedJsonParseBenchmark (JDK 21, Scala 2.13, split 1 of 1)

….SharedJsonParseBenchmark (JDK 25, Scala 2.13, split 1 of 1)

dongjoon-hyun

+1, LGTM. Thank you, @sunchao .

viirya

Thanks for the quick turnaround — all four points are addressed, and I traced the refactor to confirm it's behavior-preserving.

The big one was extracting getJsonObjectTraversalChild as the single source of truth for both collect and rewrite — that removes the drift risk I was worried about. I checked the rewrite reconstruction: other.withNewChildren(rewrite(child) +: other.children.tail) is equivalent to the old per-case version only because the traversal child is always children.head — which holds, since Alias/Cast/GetStructField are all UnaryExpression (sole child, empty tail) and BinaryArithmetic.children.head == left with tail == Seq(right). So it rebuilds exactly Seq(rewrite(left), right) as before. Collect's case other likewise reduces to Nil for anything off the whitelist, matching the old case _ => Nil. Good.

The fallback test now pinning shared == legacy == Row(null, "2") is exactly what closes the Jackson-internals dependency. And the benchmark files are regenerated on the Linux runners across JDK 17/21/25.

One thing I noticed while re-reading, not a request: reusing outputBuffer with reset() instead of allocating per field is a nice touch and matches the existing GetJsonObjectEvaluator pattern — safe here since the calls are sequential and per-task.

LGTM.

[SPARK-47670][SQL] Share repeated top-level JSON path parsing

08550ca

sunchao marked this pull request as ready for review June 16, 2026 17:30

viirya reviewed Jun 16, 2026

View reviewed changes

dongjoon-hyun reviewed Jun 16, 2026

View reviewed changes

[SPARK-47670][SQL] Address shared JSON parsing review feedback

58a6c53

Benchmark results for org.apache.spark.sql.execution.datasources.json…

8fb127d

….SharedJsonParseBenchmark (JDK 17, Scala 2.13, split 1 of 1)

sunchao added 2 commits June 16, 2026 23:01

Benchmark results for org.apache.spark.sql.execution.datasources.json…

c330295

….SharedJsonParseBenchmark (JDK 21, Scala 2.13, split 1 of 1)

Benchmark results for org.apache.spark.sql.execution.datasources.json…

2d9d2d2

….SharedJsonParseBenchmark (JDK 25, Scala 2.13, split 1 of 1)

dongjoon-hyun approved these changes Jun 17, 2026

View reviewed changes

viirya approved these changes Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-47670][SQL] Share repeated top-level JSON path parsing#56547

[SPARK-47670][SQL] Share repeated top-level JSON path parsing#56547
sunchao wants to merge 5 commits into
apache:masterfrom
sunchao:codex/SPARK-47670-shared-get-json-object-parsing

sunchao commented Jun 16, 2026

Uh oh!

sunchao commented Jun 16, 2026

Uh oh!

viirya left a comment

Uh oh!

dongjoon-hyun left a comment •

edited

Loading

Uh oh!

dongjoon-hyun Jun 16, 2026 •

edited

Loading

Uh oh!

sunchao Jun 16, 2026

Uh oh!

dongjoon-hyun Jun 16, 2026 •

edited

Loading

Uh oh!

sunchao Jun 16, 2026

Uh oh!

sunchao commented Jun 16, 2026

Uh oh!

dongjoon-hyun commented Jun 16, 2026 •

edited

Loading

Uh oh!

dongjoon-hyun commented Jun 16, 2026

Uh oh!

dongjoon-hyun left a comment

Uh oh!

viirya left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

sunchao commented Jun 16, 2026

Why are the changes needed?

What changes were proposed in this PR?

Does this PR introduce any user-facing change?

How was this PR tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

sunchao commented Jun 16, 2026

Uh oh!

viirya left a comment

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sunchao Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

sunchao Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

sunchao commented Jun 16, 2026

Uh oh!

dongjoon-hyun commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dongjoon-hyun commented Jun 16, 2026

Uh oh!

dongjoon-hyun left a comment

Choose a reason for hiding this comment

Uh oh!

viirya left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dongjoon-hyun left a comment •

edited

Loading

dongjoon-hyun Jun 16, 2026 •

edited

Loading

dongjoon-hyun Jun 16, 2026 •

edited

Loading

dongjoon-hyun commented Jun 16, 2026 •

edited

Loading