[SPARK-55444][SQL] Types Framework - Phase 3a - Storage Formats (Parquet)#55326
[SPARK-55444][SQL] Types Framework - Phase 3a - Storage Formats (Parquet)#55326davidm-db wants to merge 11 commits into
Conversation
2d1d82b to
e788aef
Compare
e788aef to
eb12469
Compare
|
cc @MaxGekk please review. |
|
cc @dejankrak-db please review. |
…rces/parquet/types/ops/ParquetTypeOps.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
…rces/parquet/types/ops/TimeTypeParquetOps.scala Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
|
Read-path guard is stricter than the original → ON/OFF behavior difference
This produces a difference between framework ON and OFF when reading a
This contradicts the "behavior is identical in both cases" statement in the PR description. Since this is an edge case (only via explicit read schema) and orthogonal to the framework wiring in this PR, I'd suggest handling it in a separate follow-up rather than blocking this one. The follow-up should decide intent and either:
|
| def newConverter( | ||
| parquetType: org.apache.parquet.schema.Type, | ||
| updater: ParentContainerUpdater): org.apache.parquet.io.api.Converter | ||
| with HasParentContainerUpdater |
There was a problem hiding this comment.
Minor design suggestion: the abstract method here is the simple overload, while the extended overload (below) has a default that delegates to it. As the scaladoc warns, struct-backed types must override the extended overload, yet they are still forced to provide an implementation of this simple one that the docs say is wrong for them ("will compile but produce incorrect behavior at runtime").
Consider making the extended overload the abstract one (and giving the simple overload a default that throws / delegates), so future struct-backed implementations aren't required to ship a misleading simple impl. Not blocking for this PR since the only reference type (TimeType) is primitive and correctly uses the simple version - just something to tidy before a struct-backed type lands.
There was a problem hiding this comment.
Will followup with the change.
| def newConverter( | ||
| parquetType: org.apache.parquet.schema.Type, | ||
| updater: ParentContainerUpdater, | ||
| schemaConverter: org.apache.spark.sql.execution.datasources.parquet | ||
| .ParquetToSparkSchemaConverter, | ||
| convertTz: Option[java.time.ZoneId], | ||
| datetimeRebaseSpec: RebaseSpec, | ||
| int96RebaseSpec: RebaseSpec): org.apache.parquet.io.api.Converter | ||
| with HasParentContainerUpdater = |
There was a problem hiding this comment.
Nit (style): org.apache.parquet.schema.Type is already imported at the top of the file (and used unqualified in convertToParquetType), but here both newConverter overloads spell out fully-qualified names: org.apache.parquet.schema.Type, org.apache.parquet.io.api.Converter, org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter, and java.time.ZoneId. Mixing imported and fully-qualified references for the same Type is inconsistent and a bit noisy.
Suggest adding imports for Converter, ParquetToSparkSchemaConverter, and ZoneId (the latter two are in this package / java.time) and using the unqualified names throughout for readability.
|
PR description is out of sync with the diff (doc-only) A few things in the description don't match the actual changes; worth trimming so reviewers aren't looking for code that isn't there:
None of these affect correctness; just aligning the narrative with the code. |
MaxGekk
left a comment
There was a problem hiding this comment.
0 blocking, 2 non-blocking, 0 nits.
Clean wiring of the framework dispatch into the six integration sites. Two non-blocking notes on API extensibility and test coverage.
Design / architecture (1)
types/ops/ParquetTypeOps.scala:78:convertToParquetTypedropsinShredded— see inline
Suggestions (1)
types/ops/TimeTypeParquetOps.scala:105: no unit test forrequireCompatibleParquetTypereject paths — see inline
| * @param repetition REQUIRED, OPTIONAL, or REPEATED | ||
| * @return the Parquet Type for this DataType | ||
| */ | ||
| def convertToParquetType(fieldName: String, repetition: Repetition): Type |
There was a problem hiding this comment.
SparkToParquetSchemaConverter.convertField calls convertToParquetType(field.name, repetition) but drops inShredded. convertFieldDefault uses that flag at line 726 for timestamp types inside shredded Variant schemas. No current framework type needs it, but a future struct-backed type might differ in Parquet schema when written inside a shredded context.
Adding a default parameter now is backward-compatible — existing impls ignore it, future ones opt in:
| def convertToParquetType(fieldName: String, repetition: Repetition): Type | |
| def convertToParquetType(fieldName: String, repetition: Repetition, inShredded: Boolean = false): Type |
And the call site in convertField would forward it: .map(_.convertToParquetType(field.name, repetition, inShredded)). Avoids a breaking API change when the first struct-backed type lands.
There was a problem hiding this comment.
Done, added inShredded: Boolean = false
| * the legacy ParquetRowConverter path so reads fail loudly instead of | ||
| * silently misinterpreting bytes. | ||
| */ | ||
| private[ops] def requireCompatibleParquetType( |
There was a problem hiding this comment.
requireCompatibleParquetType is only exercised through integration tests for the accepted encoding. The reject paths — raw INT64 (no annotation), INT64 TIME(NANOS), INT32 TIME(MILLIS), INT64 TIME(MICROS, isAdjustedToUTC=true) — are not directly tested. A small unit test (e.g. a TimeTypeParquetOpsTest, or cases in ParquetSchemaSuite) would:
- Document exactly which encodings are rejected and why.
- Provide a regression hook for the
isAdjustedToUTC=trueON/OFF behavior difference flagged in the conversation thread — whichever resolution is chosen (mirror the original guard, or tighten both paths), the test pins the intended behavior.
There was a problem hiding this comment.
added TimeTypeParquetOpsSuite covering all four reject paths
- Add inShredded: Boolean = false default param to ParquetTypeOps.convertToParquetType and forward from SparkToParquetSchemaConverter.convertField call site, so future struct-backed types can branch on shredded Variant context without an API break. Existing TimeTypeParquetOps override accepts the param (ignored). - Replace fully-qualified type names in newConverter overloads with imports for ZoneId, Converter, and ParquetToSparkSchemaConverter so the trait surface reads consistently. - Add TimeTypeParquetOpsSuite with 9 unit tests pinning the requireCompatibleParquetType reject set (raw INT64, INT64 TIME(NANOS), INT32 TIME(MILLIS), INT64 TIME(MICROS, isAdjustedToUTC=true), plus TIMESTAMP/DECIMAL/group/BINARY safety cases). The isAdjustedToUTC=true case serves as the ON/OFF divergence regression hook. Co-authored-by: Isaac
|
Will update description once we reach the final version. cc @MaxGekk |
The parquet-mr Types builder itself rejects this combination with
IllegalStateException("TIME(MICROS,false) can only annotate INT64") at
schema-construction time, so the wrong-primitive branch of
requireCompatibleParquetType is unreachable for the TIME annotation. The
raw-INT64 / TIMESTAMP / DECIMAL / group tests already cover the
!isPrimitive and "non-TIME annotation" branches.
Verified locally: TimeTypeParquetOpsSuite now 8/8 pass (sql/testOnly).
Co-authored-by: Isaac
| // ==================== Row-Based Read ==================== | ||
|
|
||
| override def newConverter( | ||
| parquetType: org.apache.parquet.schema.Type, |
There was a problem hiding this comment.
The same FQN cleanup applied to ParquetTypeOps after Max's note was missed here: Type is already imported, and Converter could be.
Import Converter and drop the org.apache.parquet... prefixes on both, for consistency with the rest of the file.
There was a problem hiding this comment.
@stevomitric Could you address this one as well? Same cleanup as in ParquetTypeOps after my earlier note — import Converter and use the already-imported Type unqualified.
| * by 1000x). | ||
| * | ||
| * These tests document the exact reject set and pin the intended behavior of | ||
| * the read-path guard. They also serve as a regression hook for the |
There was a problem hiding this comment.
This scaladoc (and the test comment at line 80) describe the behavior in terms of the PR review history - "whichever resolution lands ... update this test". Once merged, that framing reads as provisional and leaves a future reader only a PR link for context.
Suggest stating the current invariant as intended behavior, and pointing to a durable follow-up reference (a JIRA ticket) for the pending isAdjustedToUTC resolution.
There was a problem hiding this comment.
@stevomitric Please address this one too: reword the scaladoc and the comment at line 80 to state the current invariant as the intended behavior, and point to a JIRA ticket for the pending isAdjustedToUTC resolution instead of the PR-review framing.
There was a problem hiding this comment.
Max please create the jira.
There was a problem hiding this comment.
|
LGTM, left minor comments. |
MaxGekk
left a comment
There was a problem hiding this comment.
2 addressed, 0 remaining, 1 new. (newly introduced)
Both prior findings addressed cleanly — inShredded is forwarded at the convertField call site, and the new TimeTypeParquetOpsSuite pins the guard's accept/reject set. I also verified the removed BINARY+TIME(MICROS) test's rationale: parquet-column's Types builder does reject that combination ("%s can only annotate INT64").
Nits: 1 minor item (see inline comment).
| * For struct-backed types: returns a GroupType with sub-fields and a logical type annotation. | ||
| * | ||
| * @param fieldName the column/field name in the Parquet schema | ||
| * @param repetition REQUIRED, OPTIONAL, or REPEATED |
There was a problem hiding this comment.
The new inShredded parameter isn't documented, and its meaning (shredded Variant context) isn't obvious from the name:
| * @param repetition REQUIRED, OPTIONAL, or REPEATED | |
| * @param repetition REQUIRED, OPTIONAL, or REPEATED | |
| * @param inShredded whether the field is nested within a shredded Variant schema |
MaxGekk
left a comment
There was a problem hiding this comment.
1 addressed, 0 remaining, 0 new.
Clean polish round — the round-2 nit (@param inShredded doc) and both external reviewer comments (FQN cleanup in TimeTypeParquetOps; durable-reference rewording → SPARK-57416) are resolved with no logic change. No new findings.
Two items remain open but are already on the record and deferred by the author, so I'm not re-posting them: the ParquetTypeOps trait scaladoc still lists "Vectorized read" / "Filter pushdown" among concerns it covers, but the trait implements neither yet; and a couple of PR-description bullets (ParquetWriteSupport companion-utility extraction, ParquetSchemaConverter read-path reverse lookup) don't match the diff. Both are from the 2026-06-09 thread — worth folding into the description pass before merge.
cloud-fan
left a comment
There was a problem hiding this comment.
0 blocking, 2 non-blocking, 1 nit.
Clean, well-scoped refactoring that faithfully follows the established TypeOps framework-first pattern; flag ON/OFF behavioral equivalence verified at all 6 wired integration sites. The non-blocking findings are all extension-surface ergonomics that affect the next (struct-backed) type rather than this PR — a natural fit for follow-up #3.
Design / architecture (1)
- ParquetTypeOps.scala:81: struct-backed extension surface is asymmetric —
convertToParquetTypelacks the sub-field recursion hook thatmakeWriter(callback) and the extendednewConverter(schemaConverter) carry, and the abstractnewConverteris the unsafe simple overload — see inline
Suggestions (1)
- ParquetTypeOps.scala:43: trait scaladoc overstates the implemented surface (lists vectorized/filter methods that don't exist);
isBatchReadSupportedprecondition — see inline
Nits: 1 minor item (see inline comments).
Verification
Verified flag ON vs OFF behavioral equivalence for TimeType at all 6 integration sites by tracing the framework path against each extracted *Default: schema conversion, value write, row read, supportDataType, isBatchReadSupported, and schema clipping all produce identical results, and each *Default recurses for nested fields through the framework-first public method so nested dispatch stays consistent. The one intended difference is the read-path guard (INT64 TIME(MICROS, isAdjustedToUTC=true) rejected ON, accepted OFF) — documented and tracked by SPARK-57416.
| * @param inShredded whether the field is nested within a shredded Variant schema | ||
| * @return the Parquet Type for this DataType | ||
| */ | ||
| def convertToParquetType( |
There was a problem hiding this comment.
The struct-backed paths pass framework-recursion context inconsistently: makeWriter gets a makeFieldWriter: DataType => ValueWriter callback and the extended newConverter gets schemaConverter, but convertToParquetType gets only (fieldName, repetition, inShredded). The legacy struct case recurses via convertField(f, ...), which carries SparkToParquetSchemaConverter config (timestamp output type, legacy format, field-id) — but this method is called on the ops instance, which has no converter reference, so a struct-backed type would have to hand-build its sub-schema and lose that config. Suggest a minimal StructField => Type callback here (matching makeWriter's style) for symmetry.
Relatedly, since the abstract newConverter is the simple 2-arg overload that the scaladoc says is wrong for struct-backed types, making the extended overload the sole abstract method (per @MaxGekk's thread above) is the same shape of fix — both worth settling alongside the struct-backed reference type in follow-up #3.
There was a problem hiding this comment.
Agreed on both - the StructField => Type recursion callback on convertToParquetType (to carry SparkToParquetSchemaConverter config for sub-fields, matching makeWriter's style) and making the extended newConverter the sole abstract method. As you noted, both are best settled against a real struct-backed reference type so the callback signature is validated by an actual consumer rather than added speculatively, so I'm deferring them to follow-up #3 (the struct-backed reference type) alongside Max's same point. Left the current primitive surface as-is for this PR.
| * - Schema conversion: Spark DataType <-> Parquet schema type | ||
| * - Value write: writing values to Parquet RecordConsumer | ||
| * - Row-based read: creating Parquet converters for reading into InternalRow | ||
| * - Vectorized read: creating batch updaters for columnar reading |
There was a problem hiding this comment.
The class scaladoc says the trait "covers all Parquet concerns" and lists "Vectorized read: creating batch updaters" and "Filter pushdown: creating Parquet filter predicates", but the trait exposes neither — only the isBatchReadSupported capability gate. Since the PR description is explicit that those are deferred, consider rewording to mark them as not-yet-on-the-trait.
Relatedly, isBatchReadSupported returning true only works for types the legacy Java vectorized path already handles (there is no framework vectorized hook yet) — worth a one-line note in its scaladoc so a future type doesn't return true and route into a factory that doesn't know it. TimeType is safe for exactly that reason.
There was a problem hiding this comment.
Done. Reworded the class scaladoc to drop the "covers all Parquet concerns" framing and the Vectorized-read Filter-pushdown bullets
| private[parquet] trait ParquetTypeOps extends Serializable { | ||
|
|
||
| /** The DataType this Ops instance handles. */ | ||
| def dataType: DataType |
There was a problem hiding this comment.
This abstract member doesn't appear to have a consumer — dispatch keys off ParquetTypeOps(field.dataType), not the ops instance's dataType. It's mirrored from TypeOps, but there the reference impl gets it from a base class; here every implementer must write override def dataType for no reader. Consider dropping it (or giving it a default) to keep the required surface minimal.
There was a problem hiding this comment.
Done - removed the abstract dataType from the trait and the override def dataType from TimeTypeParquetOps.
Resolves a silent semantic conflict from SPARK-57372 (removal of the Types Framework feature flag): master deleted spark.sql.types.framework.enabled, so the gate in ParquetTypeOps.apply (if (!SQLConf.get.typesFrameworkEnabled) return None) is removed and the framework now dispatches unconditionally, matching how SPARK-57372 adapted TypeOps/TypeApiOps/ConnectTypeOps. SQLConf import retained (used as a parameter type). Co-authored-by: Isaac
What changes were proposed in this pull request?
This PR implements Phase 3a (Storage Formats - Parquet) of the Spark Types Framework (SPARK-55444, parent: SPARK-53504). It adds a new optional
ParquetTypeOpstrait that enables framework-managed types to participate in Parquet read/write paths with zero per-type changes to Parquet infrastructure files.Scope of this PR: the schema-conversion, write-path, and row-based read-path integration sites — 6 Scala files modified plus 2 added. Vectorized-read (
ParquetVectorUpdaterFactory,VectorizedColumnReader) and filter-pushdown (ParquetFilters) integration are deferred — see "Follow-ups" below.New trait:
ParquetTypeOpsinsql/core(packageo.a.s.sql.execution.datasources.parquet.types.ops) following the Phase 1c pattern (ConnectArrowTypeOps — separate module, separate factory). The trait surface implemented in this PR:RecordConsumer)InternalRow)supportDataType, plus theisBatchReadSupportedcapability flag)parquetStructSchemafor column pruning of struct-backed types)Not yet on the trait (deferred to Follow-ups): vectorized-read batch updaters and filter-pushdown predicates. Today the trait only exposes the
isBatchReadSupportedgate for the vectorized path and has no filter surface.Reference implementation:
TimeTypeParquetOpsvalidates the schema, write, row-read, and type-gate paths for a primitiveINT64-backed type.Dispatch pattern: Framework FIRST at all integration sites, with the entire original code extracted to
*Defaultmethods as fallback — the sameOps(dt).map(_.method).getOrElse(methodDefault(dt))pattern established in Phase 1a (PR #54223) and Phase 1c (PR #54905).TimeTypeis always tested through the framework path when the feature flag is ON.Integration sites (in this PR):
ParquetSchemaConverterconvertField→convertToParquetType, original code extracted toconvertFieldDefault. (The Parquet→Spark read/inference direction is unchanged apart from a cosmeticcase _ => illegalType()line merge;TimeTypeinference still uses the hardcodedTimeLogicalTypeAnnotationcase.)ParquetWriteSupportmakeWriterintomakeWriter+makeWriterDefault. (No companion-utility extraction —consumeGroup/writeFieldsare untouched.)ParquetRowConverternewConverterwith method overloading (simple for primitive, extended for struct-backed)ParquetFileFormatsupportDataTypeParquetUtilsisBatchReadSupportedParquetReadSupportclipParquetTypefor struct-backed types viaparquetStructSchemaIntegration sites (follow-up, not in this PR):
ParquetVectorUpdaterFactory(Java)getUpdatervia Java-friendlygetVectorUpdaterOrNullVectorizedColumnReader(Java)isLazyDecodingSupportedvia Java-friendlyisLazyDecodingSupportedForParquetFiltersFrameworkFilterOpscustom extractor +orElseon 7 PartialFunctions + framework-firstvalueCanMakeFilterOnDesign decisions:
ParquetTypeOpsis a separate trait insql/core(not onTypeOpsinsql/catalyst) because Parquet types (RecordConsumer,ParquetVectorUpdater, etc.) live insql/coreand catalyst cannot reference them.rowRepresentationType(Phase 1b) is NOT used for Parquet — it is scoped to row infrastructure only. Using it would erase type identity in Parquet value paths, create dispatch asymmetry between struct-backed and primitive types, and extend it beyond its designed scope.parquetStructSchemais independent ofPhysicalDataType— Parquet storage representation may differ from internal row representation.recordConsumeris passed as() => RecordConsumer(lazy supplier) becausemakeWriteris called duringinit()whenrecordConsumeris still null (set later inprepareForWrite()).FrameworkFilterOpscustom extractor insideParquetFiltersbecauseParquetSchemaTypeis a private inner class that cannot be referenced from outside.Follow-ups (not in this PR):
ParquetVectorUpdaterFactoryandVectorizedColumnReader(both Java). Will use Java-friendly methods (getVectorUpdaterOrNull,isLazyDecodingSupportedFor) to dispatch from Java code paths into the ScalaParquetTypeOps.ParquetFiltersvia aFrameworkFilterOpscustom extractor added to 7PartialFunctions plus framework-firstvalueCanMakeFilterOn. Requires a custom extractor becauseParquetSchemaTypeis a private inner class that cannot be referenced from outsideParquetFilters.TimeTypeisINT64-backed and exercises only the primitive paths. A struct-backed reference type would hardenparquetStructSchema,clipParquetTypeframework-first dispatch, and the extendednewConverteroverload before downstream consumers (e.g., extended-range nanosecond timestamps, DECFLOAT) land.TimeTyperead-path guard divergence described under "user-facing change" below: either relax the framework guard (TimeTypeParquetOps.requireCompatibleParquetType) to mirror the legacyParquetRowConverterguard, or apply the same tightening to the*Defaultpath and add a test documenting the intended change.Why are the changes needed?
Adding a new data type to Spark currently requires modifying 8+ Parquet files with scattered, type-specific logic. This PR enables the framework to handle all Parquet concerns for new types — a new type implements
ParquetTypeOpsand registers in the companion'sapply(), and the Parquet infrastructure files dispatch through it automatically.This is the first storage format integration (Phase 3a).
TimeTypeserves as the reference implementation validating the paths wired in this PR.Does this PR introduce any user-facing change?
No. This is a refactoring that adds framework dispatch to Parquet infrastructure. When the framework flag (
spark.sql.types.framework.enabled) is ON (default in tests),TimeType's Parquet handling goes through the framework; when OFF, the original*Defaultcode paths execute unchanged.Behavior is identical between ON and OFF for the schema-conversion, write, and row-based read paths wired here, with one known, documented exception: the framework read-path guard
TimeTypeParquetOps.requireCompatibleParquetTypeis stricter than the legacyParquetRowConverterguard it replaces. For anINT64 TIME(MICROS, isAdjustedToUTC=true)column read asTimeType, framework ON rejects it (cannotCreateParquetConverterForDataTypeError/FAILED_READ_FILE) while OFF succeeds. This is reachable only via an explicit read schema (schema inference already mapsisAdjustedToUTC=truetoillegalType()), is orthogonal to the framework wiring in this PR, and its reconciliation is tracked as a follow-up in SPARK-57416.How was this patch tested?
All existing Parquet test suites pass with the framework enabled (default in tests):
ParquetSchemaSuite: 131 tests passedParquetIOSuite: 88 tests passed (including "Read TimeType for the logical TIME type")ParquetVectorizedSuite: 25 tests passedParquetV1FilterSuite+ParquetV2FilterSuite: 101 tests passed (including "SPARK-51687: filter pushdown - time")Framework ON/OFF equivalence: all tests pass identically with
spark.sql.types.framework.enabled = trueandfalse.New unit test
TimeTypeParquetOpsSuitepins the accept/reject set ofrequireCompatibleParquetType, including the intentionalisAdjustedToUTC=truerejection noted above (cross-referenced to SPARK-57416).Note:
ParquetVectorizedSuiteand the V1/V2 filter suites exercise the existing (non-framework) code paths forTimeType, since vectorized-read and filter-pushdown integration are in the follow-ups above. Framework ON/OFF equivalence is verified for the schema-conversion, write, and row-based read paths actually wired in this PR.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.6)