Skip to content

[SPARK-47670][SQL] Share repeated top-level JSON path parsing#56547

Open
sunchao wants to merge 5 commits into
apache:masterfrom
sunchao:codex/SPARK-47670-shared-get-json-object-parsing
Open

[SPARK-47670][SQL] Share repeated top-level JSON path parsing#56547
sunchao wants to merge 5 commits into
apache:masterfrom
sunchao:codex/SPARK-47670-shared-get-json-object-parsing

Conversation

@sunchao

@sunchao sunchao commented Jun 16, 2026

Copy link
Copy Markdown
Member

Why are the changes needed?

Each get_json_object expression currently parses its JSON input independently. A projection that extracts several fields from the same JSON string therefore repeats the scan and parse work for every path, so its CPU cost grows roughly with the number of extracted fields.

SPARK-47670 proposes sharing that parsing work. This PR implements a conservative first scope for deterministic, simple top-level paths. Nested paths, wildcards, array subscripts, computed JSON inputs, and evaluation-order-sensitive expression shapes remain unchanged so that Hive-compatible behavior is preserved.

What changes were proposed in this PR?

  • Add an internal MultiGetJsonObject expression and evaluator that extracts several top-level fields with one Jackson parse and returns them as a struct.
  • Extend OptimizeCsvJsonExprs to group eligible sibling get_json_object expressions over the same input attribute and rewrite them to fields of one shared parse.
  • Preserve legacy behavior for malformed JSON, null and duplicate keys, generator-side rendering failures, and parser-side failures. The shared evaluator falls back to the existing per-path evaluator when sharing cannot safely preserve independent results.
  • Keep nested paths, wildcards, array subscripts, conditional or guarded evaluation, unsupported arithmetic order, and separately projected from_json expressions out of the rewrite.
  • Gate the optimization behind the internal, default-disabled spark.sql.optimizer.getJsonObjectSharedParsing.enabled configuration. The existing JSON expression optimization must also be enabled.
  • Add optimizer, runtime, code-generation, malformed-input, deep-nesting, and microbenchmark coverage.

Does this PR introduce any user-facing change?

Yes. It adds an internal, default-disabled configuration. With the flag disabled, existing analyzed and optimized plans are unchanged. When enabled, eligible repeated top-level get_json_object calls use a shared parse; their result semantics remain unchanged.

How was this PR tested?

The following suites and style checks passed on JDK 17:

build/sbt "catalyst/testOnly org.apache.spark.sql.catalyst.optimizer.OptimizeJsonExprsSuite"
build/sbt "sql/testOnly org.apache.spark.sql.JsonFunctionsSuite"
build/sbt "catalyst/scalastyle; catalyst/Test/scalastyle; sql/scalastyle; sql/Test/scalastyle"

The complete optimizer suite passed 20 tests, and the complete JsonFunctionsSuite passed 104 tests.

The added benchmark was generated with:

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/Test/runMain org.apache.spark.sql.execution.datasources.json.SharedJsonParseBenchmark"

Best times for 200,000 rows containing 32 JSON fields were:

Extracted paths Shared parsing off Shared parsing on Relative speedup
2 740 ms 390 ms 1.9x
4 1,387 ms 480 ms 2.9x
8 3,154 ms 727 ms 4.3x
16 5,109 ms 709 ms 7.2x

The benchmark result file records the full JDK 17 environment and output.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5)

@sunchao sunchao marked this pull request as ready for review June 16, 2026 17:30
@sunchao

sunchao commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

cc @cloud-fan @viirya @dongjoon-hyun @peter-toth could you take a look? Thanks!

@viirya viirya left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nicely scoped change — defaulting off, restricting to deterministic simple top-level fields, and the care taken to preserve the legacy per-path semantics (first-non-null-on-duplicate-keys, all-null on a syntax error, raw vs quoted string rendering, generator-failure isolation) all hold up under tracing. A few non-blocking items before merge:

1. Lock down the parser-side fallback assumption with a legacy comparison.
shared get_json_object falls back after parser-side rendering failure only does checkAnswer(query, Row(null, "2")). For that result to come out of fallback, the legacy extraction of $['b.c'] must scan past the oversized a value without tripping StreamConstraintsException — i.e. Jackson must not materialize the non-matching string during skipChildren/nextToken. That's a real dependency on Jackson internals, and it's the one place the test doesn't pin behavior against the legacy path. Could you add a shared == legacy comparison there, the way the other runtime tests do, so a future Jackson bump can't silently change this?

2. The collect/rewrite allow/deny lists must stay in lockstep.
collectGetJsonObjectFields and rewriteGetJsonObjectFields carry the same exclusion set (ConditionalExpression | And | Or | In | TryEval | LambdaFunction | CreateNamedStruct) and the same recursion set (Alias/GetStructField/Cast/left-first BinaryArithmetic), maintained independently. If they ever drift — collect treats something as shareable that rewrite doesn't descend into (or vice versa) — you get an un-rewritten GetJsonObject left behind. It's not a correctness hazard today (rewrite just leaves the original in place), but it's an easy footgun when someone adds a branch later. Worth extracting the shared classification into one helper both sides consume.

3. nullIntolerant = true diverges from GetJsonObject.
GetJsonObject doesn't declare nullIntolerant (defaults to false); MultiGetJsonObject does. It's correct here — null json yields a null struct — but the divergence is non-obvious. A one-line comment on why would save a future reader the double-take.

4. Benchmark result file.
SharedJsonParseBenchmark-results.txt was generated locally on macOS / Apple M5 Max / JDK 17. Committed benchmark results are conventionally regenerated by the CI bot on the Linux runners (and across JDK 17/21/25). The bot will regenerate on merge, so this is cosmetic, but the macOS/M5 header is out of line with the rest of sql/core/benchmarks.

Optional: a test with a non-UTF8_BINARY collation on the json input (since inputTypes uses StringTypeWithCollation) would close a small coverage gap, and a fuzz/property comparison (random JSON x random simple paths, shared vs legacy) would be the strongest guard for a change whose whole job is to replicate existing semantics.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @sunchao . Looks promising to me and agree with the above mentioned review comment too. And, additional comments are the following.

* expression used to share sibling [[GetJsonObject]] expressions; unsupported JSON paths remain
* as independent GetJsonObject expressions.
*/
case class MultiGetJsonObject(

@dongjoon-hyun dongjoon-hyun Jun 16, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like other classes, shall we define prettyName to be more complete? multi_get_json_object?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

private def copyCurrentStructure(parser: JsonParser): Option[UTF8String] = {
val output = new ByteArrayOutputStream()

@dongjoon-hyun dongjoon-hyun Jun 16, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we can reuse output buffer via reset like GetJsonObjectEvaluator?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! updated

@sunchao

sunchao commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

Thanks @viirya @dongjoon-hyun ! Addressed the comments and updated the PR

@dongjoon-hyun

dongjoon-hyun commented Jun 16, 2026

Copy link
Copy Markdown
Member

Thank you, @sunchao . BTW, as mentioned by Liang-Chi, you need to generate the benchmark results still~ Please trigger it at your fork and on your branch .

….SharedJsonParseBenchmark (JDK 17, Scala 2.13, split 1 of 1)
@dongjoon-hyun

Copy link
Copy Markdown
Member

Like the following, we need all Java combinations. I'm waiting for 21 and 25.

$ ls core/benchmarks/PlatformBenchmark-*              
core/benchmarks/PlatformBenchmark-jdk21-results.txt
core/benchmarks/PlatformBenchmark-jdk25-results.txt
core/benchmarks/PlatformBenchmark-results.txt

sunchao added 2 commits June 16, 2026 23:01
….SharedJsonParseBenchmark (JDK 21, Scala 2.13, split 1 of 1)
….SharedJsonParseBenchmark (JDK 25, Scala 2.13, split 1 of 1)

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you, @sunchao .

@viirya viirya left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick turnaround — all four points are addressed, and I traced the refactor to confirm it's behavior-preserving.

The big one was extracting getJsonObjectTraversalChild as the single source of truth for both collect and rewrite — that removes the drift risk I was worried about. I checked the rewrite reconstruction: other.withNewChildren(rewrite(child) +: other.children.tail) is equivalent to the old per-case version only because the traversal child is always children.head — which holds, since Alias/Cast/GetStructField are all UnaryExpression (sole child, empty tail) and BinaryArithmetic.children.head == left with tail == Seq(right). So it rebuilds exactly Seq(rewrite(left), right) as before. Collect's case other likewise reduces to Nil for anything off the whitelist, matching the old case _ => Nil. Good.

The fallback test now pinning shared == legacy == Row(null, "2") is exactly what closes the Jackson-internals dependency. And the benchmark files are regenerated on the Linux runners across JDK 17/21/25.

One thing I noticed while re-reading, not a request: reusing outputBuffer with reset() instead of allocating per field is a nice touch and matches the existing GetJsonObjectEvaluator pattern — safe here since the calls are sequential and per-task.

LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants