[SPARK-47670][SQL] Share repeated top-level JSON path parsing#56547
[SPARK-47670][SQL] Share repeated top-level JSON path parsing#56547sunchao wants to merge 5 commits into
Conversation
|
cc @cloud-fan @viirya @dongjoon-hyun @peter-toth could you take a look? Thanks! |
viirya
left a comment
There was a problem hiding this comment.
Nicely scoped change — defaulting off, restricting to deterministic simple top-level fields, and the care taken to preserve the legacy per-path semantics (first-non-null-on-duplicate-keys, all-null on a syntax error, raw vs quoted string rendering, generator-failure isolation) all hold up under tracing. A few non-blocking items before merge:
1. Lock down the parser-side fallback assumption with a legacy comparison.
shared get_json_object falls back after parser-side rendering failure only does checkAnswer(query, Row(null, "2")). For that result to come out of fallback, the legacy extraction of $['b.c'] must scan past the oversized a value without tripping StreamConstraintsException — i.e. Jackson must not materialize the non-matching string during skipChildren/nextToken. That's a real dependency on Jackson internals, and it's the one place the test doesn't pin behavior against the legacy path. Could you add a shared == legacy comparison there, the way the other runtime tests do, so a future Jackson bump can't silently change this?
2. The collect/rewrite allow/deny lists must stay in lockstep.
collectGetJsonObjectFields and rewriteGetJsonObjectFields carry the same exclusion set (ConditionalExpression | And | Or | In | TryEval | LambdaFunction | CreateNamedStruct) and the same recursion set (Alias/GetStructField/Cast/left-first BinaryArithmetic), maintained independently. If they ever drift — collect treats something as shareable that rewrite doesn't descend into (or vice versa) — you get an un-rewritten GetJsonObject left behind. It's not a correctness hazard today (rewrite just leaves the original in place), but it's an easy footgun when someone adds a branch later. Worth extracting the shared classification into one helper both sides consume.
3. nullIntolerant = true diverges from GetJsonObject.
GetJsonObject doesn't declare nullIntolerant (defaults to false); MultiGetJsonObject does. It's correct here — null json yields a null struct — but the divergence is non-obvious. A one-line comment on why would save a future reader the double-take.
4. Benchmark result file.
SharedJsonParseBenchmark-results.txt was generated locally on macOS / Apple M5 Max / JDK 17. Committed benchmark results are conventionally regenerated by the CI bot on the Linux runners (and across JDK 17/21/25). The bot will regenerate on merge, so this is cosmetic, but the macOS/M5 header is out of line with the rest of sql/core/benchmarks.
Optional: a test with a non-UTF8_BINARY collation on the json input (since inputTypes uses StringTypeWithCollation) would close a small coverage gap, and a fuzz/property comparison (random JSON x random simple paths, shared vs legacy) would be the strongest guard for a change whose whole job is to replicate existing semantics.
There was a problem hiding this comment.
Thank you, @sunchao . Looks promising to me and agree with the above mentioned review comment too. And, additional comments are the following.
| * expression used to share sibling [[GetJsonObject]] expressions; unsupported JSON paths remain | ||
| * as independent GetJsonObject expressions. | ||
| */ | ||
| case class MultiGetJsonObject( |
There was a problem hiding this comment.
Like other classes, shall we define prettyName to be more complete? multi_get_json_object?
| } | ||
|
|
||
| private def copyCurrentStructure(parser: JsonParser): Option[UTF8String] = { | ||
| val output = new ByteArrayOutputStream() |
There was a problem hiding this comment.
Do you think we can reuse output buffer via reset like GetJsonObjectEvaluator?
|
Thanks @viirya @dongjoon-hyun ! Addressed the comments and updated the PR |
|
Thank you, @sunchao . BTW, as mentioned by Liang-Chi, you need to generate the benchmark results still~ Please trigger it at your fork and on your branch . |
….SharedJsonParseBenchmark (JDK 17, Scala 2.13, split 1 of 1)
|
Like the following, we need all Java combinations. I'm waiting for 21 and 25. |
….SharedJsonParseBenchmark (JDK 21, Scala 2.13, split 1 of 1)
….SharedJsonParseBenchmark (JDK 25, Scala 2.13, split 1 of 1)
dongjoon-hyun
left a comment
There was a problem hiding this comment.
+1, LGTM. Thank you, @sunchao .
viirya
left a comment
There was a problem hiding this comment.
Thanks for the quick turnaround — all four points are addressed, and I traced the refactor to confirm it's behavior-preserving.
The big one was extracting getJsonObjectTraversalChild as the single source of truth for both collect and rewrite — that removes the drift risk I was worried about. I checked the rewrite reconstruction: other.withNewChildren(rewrite(child) +: other.children.tail) is equivalent to the old per-case version only because the traversal child is always children.head — which holds, since Alias/Cast/GetStructField are all UnaryExpression (sole child, empty tail) and BinaryArithmetic.children.head == left with tail == Seq(right). So it rebuilds exactly Seq(rewrite(left), right) as before. Collect's case other likewise reduces to Nil for anything off the whitelist, matching the old case _ => Nil. Good.
The fallback test now pinning shared == legacy == Row(null, "2") is exactly what closes the Jackson-internals dependency. And the benchmark files are regenerated on the Linux runners across JDK 17/21/25.
One thing I noticed while re-reading, not a request: reusing outputBuffer with reset() instead of allocating per field is a nice touch and matches the existing GetJsonObjectEvaluator pattern — safe here since the calls are sequential and per-task.
LGTM.
Why are the changes needed?
Each
get_json_objectexpression currently parses its JSON input independently. A projection that extracts several fields from the same JSON string therefore repeats the scan and parse work for every path, so its CPU cost grows roughly with the number of extracted fields.SPARK-47670 proposes sharing that parsing work. This PR implements a conservative first scope for deterministic, simple top-level paths. Nested paths, wildcards, array subscripts, computed JSON inputs, and evaluation-order-sensitive expression shapes remain unchanged so that Hive-compatible behavior is preserved.
What changes were proposed in this PR?
MultiGetJsonObjectexpression and evaluator that extracts several top-level fields with one Jackson parse and returns them as a struct.OptimizeCsvJsonExprsto group eligible siblingget_json_objectexpressions over the same input attribute and rewrite them to fields of one shared parse.from_jsonexpressions out of the rewrite.spark.sql.optimizer.getJsonObjectSharedParsing.enabledconfiguration. The existing JSON expression optimization must also be enabled.Does this PR introduce any user-facing change?
Yes. It adds an internal, default-disabled configuration. With the flag disabled, existing analyzed and optimized plans are unchanged. When enabled, eligible repeated top-level
get_json_objectcalls use a shared parse; their result semantics remain unchanged.How was this PR tested?
The following suites and style checks passed on JDK 17:
The complete optimizer suite passed 20 tests, and the complete
JsonFunctionsSuitepassed 104 tests.The added benchmark was generated with:
Best times for 200,000 rows containing 32 JSON fields were:
The benchmark result file records the full JDK 17 environment and output.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenAI Codex (GPT-5)