Skip to content

[SPARK-55444][SQL] Types Framework - Phase 3a - Storage Formats (Parquet)#55326

Open
davidm-db wants to merge 11 commits into
apache:masterfrom
davidm-db:davidm-db/types_framework_3a
Open

[SPARK-55444][SQL] Types Framework - Phase 3a - Storage Formats (Parquet)#55326
davidm-db wants to merge 11 commits into
apache:masterfrom
davidm-db:davidm-db/types_framework_3a

Conversation

@davidm-db

@davidm-db davidm-db commented Apr 13, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

This PR implements Phase 3a (Storage Formats - Parquet) of the Spark Types Framework (SPARK-55444, parent: SPARK-53504). It adds a new optional ParquetTypeOps trait that enables framework-managed types to participate in Parquet read/write paths with zero per-type changes to Parquet infrastructure files.

Scope of this PR: the schema-conversion, write-path, and row-based read-path integration sites — 6 Scala files modified plus 2 added. Vectorized-read (ParquetVectorUpdaterFactory, VectorizedColumnReader) and filter-pushdown (ParquetFilters) integration are deferred — see "Follow-ups" below.

New trait: ParquetTypeOps in sql/core (package o.a.s.sql.execution.datasources.parquet.types.ops) following the Phase 1c pattern (ConnectArrowTypeOps — separate module, separate factory). The trait surface implemented in this PR:

  • Schema conversion (Spark DataType → Parquet schema type, write path)
  • Value write (writing values to the Parquet RecordConsumer)
  • Row-based read (creating Parquet converters for reading into InternalRow)
  • Type gates (supportDataType, plus the isBatchReadSupported capability flag)
  • Schema clipping (parquetStructSchema for column pruning of struct-backed types)

Not yet on the trait (deferred to Follow-ups): vectorized-read batch updaters and filter-pushdown predicates. Today the trait only exposes the isBatchReadSupported gate for the vectorized path and has no filter surface.

Reference implementation: TimeTypeParquetOps validates the schema, write, row-read, and type-gate paths for a primitive INT64-backed type.

Dispatch pattern: Framework FIRST at all integration sites, with the entire original code extracted to *Default methods as fallback — the same Ops(dt).map(_.method).getOrElse(methodDefault(dt)) pattern established in Phase 1a (PR #54223) and Phase 1c (PR #54905). TimeType is always tested through the framework path when the feature flag is ON.

Integration sites (in this PR):

File Dispatch
ParquetSchemaConverter Write-path schema: framework-first convertFieldconvertToParquetType, original code extracted to convertFieldDefault. (The Parquet→Spark read/inference direction is unchanged apart from a cosmetic case _ => illegalType() line merge; TimeType inference still uses the hardcoded TimeLogicalTypeAnnotation case.)
ParquetWriteSupport Framework-first split of makeWriter into makeWriter + makeWriterDefault. (No companion-utility extraction — consumeGroup/writeFields are untouched.)
ParquetRowConverter Framework-first newConverter with method overloading (simple for primitive, extended for struct-backed)
ParquetFileFormat Framework-first supportDataType
ParquetUtils Framework-first isBatchReadSupported
ParquetReadSupport Framework-first clipParquetType for struct-backed types via parquetStructSchema

Integration sites (follow-up, not in this PR):

File Dispatch
ParquetVectorUpdaterFactory (Java) Framework-first getUpdater via Java-friendly getVectorUpdaterOrNull
VectorizedColumnReader (Java) Framework-first isLazyDecodingSupported via Java-friendly isLazyDecodingSupportedFor
ParquetFilters FrameworkFilterOps custom extractor + orElse on 7 PartialFunctions + framework-first valueCanMakeFilterOn

Design decisions:

  • ParquetTypeOps is a separate trait in sql/core (not on TypeOps in sql/catalyst) because Parquet types (RecordConsumer, ParquetVectorUpdater, etc.) live in sql/core and catalyst cannot reference them.
  • rowRepresentationType (Phase 1b) is NOT used for Parquet — it is scoped to row infrastructure only. Using it would erase type identity in Parquet value paths, create dispatch asymmetry between struct-backed and primitive types, and extend it beyond its designed scope.
  • parquetStructSchema is independent of PhysicalDataType — Parquet storage representation may differ from internal row representation.
  • recordConsumer is passed as () => RecordConsumer (lazy supplier) because makeWriter is called during init() when recordConsumer is still null (set later in prepareForWrite()).
  • Filter dispatch (deferred to the follow-up) will use a FrameworkFilterOps custom extractor inside ParquetFilters because ParquetSchemaType is a private inner class that cannot be referenced from outside.

Follow-ups (not in this PR):

  1. Vectorized-read integrationParquetVectorUpdaterFactory and VectorizedColumnReader (both Java). Will use Java-friendly methods (getVectorUpdaterOrNull, isLazyDecodingSupportedFor) to dispatch from Java code paths into the Scala ParquetTypeOps.
  2. Filter-pushdown integrationParquetFilters via a FrameworkFilterOps custom extractor added to 7 PartialFunctions plus framework-first valueCanMakeFilterOn. Requires a custom extractor because ParquetSchemaType is a private inner class that cannot be referenced from outside ParquetFilters.
  3. Second reference implementation (struct-backed)TimeType is INT64-backed and exercises only the primitive paths. A struct-backed reference type would harden parquetStructSchema, clipParquetType framework-first dispatch, and the extended newConverter overload before downstream consumers (e.g., extended-range nanosecond timestamps, DECFLOAT) land.
  4. Read-path guard ON/OFF reconciliation (SPARK-57416) — reconcile the TimeType read-path guard divergence described under "user-facing change" below: either relax the framework guard (TimeTypeParquetOps.requireCompatibleParquetType) to mirror the legacy ParquetRowConverter guard, or apply the same tightening to the *Default path and add a test documenting the intended change.

Why are the changes needed?

Adding a new data type to Spark currently requires modifying 8+ Parquet files with scattered, type-specific logic. This PR enables the framework to handle all Parquet concerns for new types — a new type implements ParquetTypeOps and registers in the companion's apply(), and the Parquet infrastructure files dispatch through it automatically.

This is the first storage format integration (Phase 3a). TimeType serves as the reference implementation validating the paths wired in this PR.

Does this PR introduce any user-facing change?

No. This is a refactoring that adds framework dispatch to Parquet infrastructure. When the framework flag (spark.sql.types.framework.enabled) is ON (default in tests), TimeType's Parquet handling goes through the framework; when OFF, the original *Default code paths execute unchanged.

Behavior is identical between ON and OFF for the schema-conversion, write, and row-based read paths wired here, with one known, documented exception: the framework read-path guard TimeTypeParquetOps.requireCompatibleParquetType is stricter than the legacy ParquetRowConverter guard it replaces. For an INT64 TIME(MICROS, isAdjustedToUTC=true) column read as TimeType, framework ON rejects it (cannotCreateParquetConverterForDataTypeError / FAILED_READ_FILE) while OFF succeeds. This is reachable only via an explicit read schema (schema inference already maps isAdjustedToUTC=true to illegalType()), is orthogonal to the framework wiring in this PR, and its reconciliation is tracked as a follow-up in SPARK-57416.

How was this patch tested?

All existing Parquet test suites pass with the framework enabled (default in tests):

  • ParquetSchemaSuite: 131 tests passed
  • ParquetIOSuite: 88 tests passed (including "Read TimeType for the logical TIME type")
  • ParquetVectorizedSuite: 25 tests passed
  • ParquetV1FilterSuite + ParquetV2FilterSuite: 101 tests passed (including "SPARK-51687: filter pushdown - time")

Framework ON/OFF equivalence: all tests pass identically with spark.sql.types.framework.enabled = true and false.

New unit test TimeTypeParquetOpsSuite pins the accept/reject set of requireCompatibleParquetType, including the intentional isAdjustedToUTC=true rejection noted above (cross-referenced to SPARK-57416).

Note: ParquetVectorizedSuite and the V1/V2 filter suites exercise the existing (non-framework) code paths for TimeType, since vectorized-read and filter-pushdown integration are in the follow-ups above. Framework ON/OFF equivalence is verified for the schema-conversion, write, and row-based read paths actually wired in this PR.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Claude Opus 4.6)

@davidm-db davidm-db force-pushed the davidm-db/types_framework_3a branch from 2d1d82b to e788aef Compare April 13, 2026 17:39
@davidm-db davidm-db force-pushed the davidm-db/types_framework_3a branch from e788aef to eb12469 Compare April 13, 2026 18:40
@davidm-db davidm-db marked this pull request as ready for review June 9, 2026 10:45
@stevomitric

Copy link
Copy Markdown
Contributor

cc @MaxGekk please review.

@stevomitric

Copy link
Copy Markdown
Contributor

cc @dejankrak-db please review.

stevomitric and others added 2 commits June 9, 2026 16:38
…rces/parquet/types/ops/ParquetTypeOps.scala

Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
…rces/parquet/types/ops/TimeTypeParquetOps.scala

Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
@MaxGekk

MaxGekk commented Jun 9, 2026

Copy link
Copy Markdown
Member

Read-path guard is stricter than the original → ON/OFF behavior difference

TimeTypeParquetOps.requireCompatibleParquetType requires INT64 && unit == MICROS && !isAdjustedToUTC. The original ParquetRowConverter guard it replaces only checked TimeLogicalTypeAnnotation && unit == MICROS — it did not check isAdjustedToUTC. So the new guard is strictly tighter, even though the code comment states it "Mirrors the inline guard that existed in ParquetRowConverter before the framework dispatch".

This produces a difference between framework ON and OFF when reading a TIME(MICROS, isAdjustedToUTC=true) column as TimeType (reachable via an explicit read schema, since schema inference already maps isAdjustedToUTC=true to illegalType()). I verified it empirically:

  • framework OFF → succeeds (returns 23:59:59.123456)
  • framework ON → fails with FAILED_READ_FILE / cannotCreateParquetConverterForDataTypeError

This contradicts the "behavior is identical in both cases" statement in the PR description.

Since this is an edge case (only via explicit read schema) and orthogonal to the framework wiring in this PR, I'd suggest handling it in a separate follow-up rather than blocking this one. The follow-up should decide intent and either:

  • mirror the original guard exactly (drop the !isAdjustedToUTC / INT64 checks) to preserve ON/OFF equivalence, or
  • if the stricter check is intentional, apply the same tightening to the *Default path too and add a test documenting the (now intended) change, plus update the comment.

Comment on lines +126 to +129
def newConverter(
parquetType: org.apache.parquet.schema.Type,
updater: ParentContainerUpdater): org.apache.parquet.io.api.Converter
with HasParentContainerUpdater

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor design suggestion: the abstract method here is the simple overload, while the extended overload (below) has a default that delegates to it. As the scaladoc warns, struct-backed types must override the extended overload, yet they are still forced to provide an implementation of this simple one that the docs say is wrong for them ("will compile but produce incorrect behavior at runtime").

Consider making the extended overload the abstract one (and giving the simple overload a default that throws / delegates), so future struct-backed implementations aren't required to ship a misleading simple impl. Not blocking for this PR since the only reference type (TimeType) is primitive and correctly uses the simple version - just something to tidy before a struct-backed type lands.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will followup with the change.

Comment on lines +149 to +157
def newConverter(
parquetType: org.apache.parquet.schema.Type,
updater: ParentContainerUpdater,
schemaConverter: org.apache.spark.sql.execution.datasources.parquet
.ParquetToSparkSchemaConverter,
convertTz: Option[java.time.ZoneId],
datetimeRebaseSpec: RebaseSpec,
int96RebaseSpec: RebaseSpec): org.apache.parquet.io.api.Converter
with HasParentContainerUpdater =

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (style): org.apache.parquet.schema.Type is already imported at the top of the file (and used unqualified in convertToParquetType), but here both newConverter overloads spell out fully-qualified names: org.apache.parquet.schema.Type, org.apache.parquet.io.api.Converter, org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter, and java.time.ZoneId. Mixing imported and fully-qualified references for the same Type is inconsistent and a bit noisy.

Suggest adding imports for Converter, ParquetToSparkSchemaConverter, and ZoneId (the latter two are in this package / java.time) and using the unqualified names throughout for readability.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

imported.

@MaxGekk

MaxGekk commented Jun 9, 2026

Copy link
Copy Markdown
Member

PR description is out of sync with the diff (doc-only)

A few things in the description don't match the actual changes; worth trimming so reviewers aren't looking for code that isn't there:

  • "ParquetWriteSupport | Companion utility extraction (consumeGroup, writeFields)" - not in the diff. The only change in ParquetWriteSupport is the framework-first split of makeWriter into makeWriter + makeWriterDefault; consumeGroup/writeFields are untouched.
  • "ParquetSchemaConverter | ... + read-path reverse lookup" - the only change in the Parquet->Spark direction is a cosmetic line merge (case _ => illegalType()). No framework dispatch is added on the read/inference path; TimeType inference still uses the hardcoded TimeLogicalTypeAnnotation case.
  • The ParquetTypeOps scaladoc says the trait "covers all Parquet concerns: ... Vectorized read: creating batch updaters ... Filter pushdown: creating Parquet filter predicates". The trait actually has neither (they're correctly deferred to the follow-ups) - it only exposes the isBatchReadSupported gate. Suggest rewording the scaladoc and the description's trait-surface bullet list to reflect what's implemented now vs. deferred.

None of these affect correctness; just aligning the narrative with the code.

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 blocking, 2 non-blocking, 0 nits.
Clean wiring of the framework dispatch into the six integration sites. Two non-blocking notes on API extensibility and test coverage.

Design / architecture (1)

  • types/ops/ParquetTypeOps.scala:78: convertToParquetType drops inShredded — see inline

Suggestions (1)

  • types/ops/TimeTypeParquetOps.scala:105: no unit test for requireCompatibleParquetType reject paths — see inline

* @param repetition REQUIRED, OPTIONAL, or REPEATED
* @return the Parquet Type for this DataType
*/
def convertToParquetType(fieldName: String, repetition: Repetition): Type

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SparkToParquetSchemaConverter.convertField calls convertToParquetType(field.name, repetition) but drops inShredded. convertFieldDefault uses that flag at line 726 for timestamp types inside shredded Variant schemas. No current framework type needs it, but a future struct-backed type might differ in Parquet schema when written inside a shredded context.

Adding a default parameter now is backward-compatible — existing impls ignore it, future ones opt in:

Suggested change
def convertToParquetType(fieldName: String, repetition: Repetition): Type
def convertToParquetType(fieldName: String, repetition: Repetition, inShredded: Boolean = false): Type

And the call site in convertField would forward it: .map(_.convertToParquetType(field.name, repetition, inShredded)). Avoids a breaking API change when the first struct-backed type lands.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added inShredded: Boolean = false

* the legacy ParquetRowConverter path so reads fail loudly instead of
* silently misinterpreting bytes.
*/
private[ops] def requireCompatibleParquetType(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requireCompatibleParquetType is only exercised through integration tests for the accepted encoding. The reject paths — raw INT64 (no annotation), INT64 TIME(NANOS), INT32 TIME(MILLIS), INT64 TIME(MICROS, isAdjustedToUTC=true) — are not directly tested. A small unit test (e.g. a TimeTypeParquetOpsTest, or cases in ParquetSchemaSuite) would:

  1. Document exactly which encodings are rejected and why.
  2. Provide a regression hook for the isAdjustedToUTC=true ON/OFF behavior difference flagged in the conversation thread — whichever resolution is chosen (mirror the original guard, or tighten both paths), the test pins the intended behavior.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added TimeTypeParquetOpsSuite covering all four reject paths

- Add inShredded: Boolean = false default param to
  ParquetTypeOps.convertToParquetType and forward from
  SparkToParquetSchemaConverter.convertField call site, so future
  struct-backed types can branch on shredded Variant context without
  an API break. Existing TimeTypeParquetOps override accepts the param
  (ignored).

- Replace fully-qualified type names in newConverter overloads with
  imports for ZoneId, Converter, and ParquetToSparkSchemaConverter so
  the trait surface reads consistently.

- Add TimeTypeParquetOpsSuite with 9 unit tests pinning the
  requireCompatibleParquetType reject set (raw INT64, INT64 TIME(NANOS),
  INT32 TIME(MILLIS), INT64 TIME(MICROS, isAdjustedToUTC=true), plus
  TIMESTAMP/DECIMAL/group/BINARY safety cases). The isAdjustedToUTC=true
  case serves as the ON/OFF divergence regression hook.

Co-authored-by: Isaac
@stevomitric

Copy link
Copy Markdown
Contributor

Will update description once we reach the final version. cc @MaxGekk

The parquet-mr Types builder itself rejects this combination with
IllegalStateException("TIME(MICROS,false) can only annotate INT64") at
schema-construction time, so the wrong-primitive branch of
requireCompatibleParquetType is unreachable for the TIME annotation. The
raw-INT64 / TIMESTAMP / DECIMAL / group tests already cover the
!isPrimitive and "non-TIME annotation" branches.

Verified locally: TimeTypeParquetOpsSuite now 8/8 pass (sql/testOnly).

Co-authored-by: Isaac
// ==================== Row-Based Read ====================

override def newConverter(
parquetType: org.apache.parquet.schema.Type,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same FQN cleanup applied to ParquetTypeOps after Max's note was missed here: Type is already imported, and Converter could be.
Import Converter and drop the org.apache.parquet... prefixes on both, for consistency with the rest of the file.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevomitric Could you address this one as well? Same cleanup as in ParquetTypeOps after my earlier note — import Converter and use the already-imported Type unqualified.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* by 1000x).
*
* These tests document the exact reject set and pin the intended behavior of
* the read-path guard. They also serve as a regression hook for the

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This scaladoc (and the test comment at line 80) describe the behavior in terms of the PR review history - "whichever resolution lands ... update this test". Once merged, that framing reads as provisional and leaves a future reader only a PR link for context.
Suggest stating the current invariant as intended behavior, and pointing to a durable follow-up reference (a JIRA ticket) for the pending isAdjustedToUTC resolution.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevomitric Please address this one too: reword the scaladoc and the comment at line 80 to state the current invariant as the intended behavior, and point to a JIRA ticket for the pending isAdjustedToUTC resolution instead of the PR-review framing.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Max please create the jira.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@dejankrak-db

Copy link
Copy Markdown
Contributor

LGTM, left minor comments.

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 addressed, 0 remaining, 1 new. (newly introduced)
Both prior findings addressed cleanly — inShredded is forwarded at the convertField call site, and the new TimeTypeParquetOpsSuite pins the guard's accept/reject set. I also verified the removed BINARY+TIME(MICROS) test's rationale: parquet-column's Types builder does reject that combination ("%s can only annotate INT64").

Nits: 1 minor item (see inline comment).

* For struct-backed types: returns a GroupType with sub-fields and a logical type annotation.
*
* @param fieldName the column/field name in the Parquet schema
* @param repetition REQUIRED, OPTIONAL, or REPEATED

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new inShredded parameter isn't documented, and its meaning (shredded Variant context) isn't obvious from the name:

Suggested change
* @param repetition REQUIRED, OPTIONAL, or REPEATED
* @param repetition REQUIRED, OPTIONAL, or REPEATED
* @param inShredded whether the field is nested within a shredded Variant schema

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added docs.

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 addressed, 0 remaining, 0 new.
Clean polish round — the round-2 nit (@param inShredded doc) and both external reviewer comments (FQN cleanup in TimeTypeParquetOps; durable-reference rewording → SPARK-57416) are resolved with no logic change. No new findings.

Two items remain open but are already on the record and deferred by the author, so I'm not re-posting them: the ParquetTypeOps trait scaladoc still lists "Vectorized read" / "Filter pushdown" among concerns it covers, but the trait implements neither yet; and a couple of PR-description bullets (ParquetWriteSupport companion-utility extraction, ParquetSchemaConverter read-path reverse lookup) don't match the diff. Both are from the 2026-06-09 thread — worth folding into the description pass before merge.

@cloud-fan cloud-fan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 blocking, 2 non-blocking, 1 nit.
Clean, well-scoped refactoring that faithfully follows the established TypeOps framework-first pattern; flag ON/OFF behavioral equivalence verified at all 6 wired integration sites. The non-blocking findings are all extension-surface ergonomics that affect the next (struct-backed) type rather than this PR — a natural fit for follow-up #3.

Design / architecture (1)

  • ParquetTypeOps.scala:81: struct-backed extension surface is asymmetric — convertToParquetType lacks the sub-field recursion hook that makeWriter (callback) and the extended newConverter (schemaConverter) carry, and the abstract newConverter is the unsafe simple overload — see inline

Suggestions (1)

  • ParquetTypeOps.scala:43: trait scaladoc overstates the implemented surface (lists vectorized/filter methods that don't exist); isBatchReadSupported precondition — see inline

Nits: 1 minor item (see inline comments).

Verification

Verified flag ON vs OFF behavioral equivalence for TimeType at all 6 integration sites by tracing the framework path against each extracted *Default: schema conversion, value write, row read, supportDataType, isBatchReadSupported, and schema clipping all produce identical results, and each *Default recurses for nested fields through the framework-first public method so nested dispatch stays consistent. The one intended difference is the read-path guard (INT64 TIME(MICROS, isAdjustedToUTC=true) rejected ON, accepted OFF) — documented and tracked by SPARK-57416.

* @param inShredded whether the field is nested within a shredded Variant schema
* @return the Parquet Type for this DataType
*/
def convertToParquetType(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The struct-backed paths pass framework-recursion context inconsistently: makeWriter gets a makeFieldWriter: DataType => ValueWriter callback and the extended newConverter gets schemaConverter, but convertToParquetType gets only (fieldName, repetition, inShredded). The legacy struct case recurses via convertField(f, ...), which carries SparkToParquetSchemaConverter config (timestamp output type, legacy format, field-id) — but this method is called on the ops instance, which has no converter reference, so a struct-backed type would have to hand-build its sub-schema and lose that config. Suggest a minimal StructField => Type callback here (matching makeWriter's style) for symmetry.

Relatedly, since the abstract newConverter is the simple 2-arg overload that the scaladoc says is wrong for struct-backed types, making the extended overload the sole abstract method (per @MaxGekk's thread above) is the same shape of fix — both worth settling alongside the struct-backed reference type in follow-up #3.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on both - the StructField => Type recursion callback on convertToParquetType (to carry SparkToParquetSchemaConverter config for sub-fields, matching makeWriter's style) and making the extended newConverter the sole abstract method. As you noted, both are best settled against a real struct-backed reference type so the callback signature is validated by an actual consumer rather than added speculatively, so I'm deferring them to follow-up #3 (the struct-backed reference type) alongside Max's same point. Left the current primitive surface as-is for this PR.

* - Schema conversion: Spark DataType <-> Parquet schema type
* - Value write: writing values to Parquet RecordConsumer
* - Row-based read: creating Parquet converters for reading into InternalRow
* - Vectorized read: creating batch updaters for columnar reading

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class scaladoc says the trait "covers all Parquet concerns" and lists "Vectorized read: creating batch updaters" and "Filter pushdown: creating Parquet filter predicates", but the trait exposes neither — only the isBatchReadSupported capability gate. Since the PR description is explicit that those are deferred, consider rewording to mark them as not-yet-on-the-trait.

Relatedly, isBatchReadSupported returning true only works for types the legacy Java vectorized path already handles (there is no framework vectorized hook yet) — worth a one-line note in its scaladoc so a future type doesn't return true and route into a factory that doesn't know it. TimeType is safe for exactly that reason.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Reworded the class scaladoc to drop the "covers all Parquet concerns" framing and the Vectorized-read Filter-pushdown bullets

private[parquet] trait ParquetTypeOps extends Serializable {

/** The DataType this Ops instance handles. */
def dataType: DataType

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This abstract member doesn't appear to have a consumer — dispatch keys off ParquetTypeOps(field.dataType), not the ops instance's dataType. It's mirrored from TypeOps, but there the reference impl gets it from a base class; here every implementer must write override def dataType for no reader. Consider dropping it (or giving it a default) to keep the required surface minimal.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done - removed the abstract dataType from the trait and the override def dataType from TimeTypeParquetOps.

Resolves a silent semantic conflict from SPARK-57372 (removal of the Types Framework feature flag): master deleted spark.sql.types.framework.enabled, so the gate in ParquetTypeOps.apply (if (!SQLConf.get.typesFrameworkEnabled) return None) is removed and the framework now dispatches unconditionally, matching how SPARK-57372 adapted TypeOps/TypeApiOps/ConnectTypeOps. SQLConf import retained (used as a parameter type).

Co-authored-by: Isaac
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants