Support native scan for Paimon DSv2 (BatchScanExec) COW tables #2316

@zhuxiangyi

Background

Auron currently accelerates only Paimon tables that are registered via Hive Metastore and surfaced as HiveTableScanExec. When users configure a Paimon DSv2 catalog (e.g. spark.sql.catalog.paimon = org.apache.paimon.spark.SparkCatalog), the Paimon table is planned as BatchScanExec, so native acceleration does not apply and the query falls back to Spark's row-based execution.

The on-disk layout for the two paths is identical (COW Parquet/ORC files); only the metadata entry point and split planning differ.
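
For reference, the DSv2 setup described above can be sketched as a set of Spark conf entries. This is only an illustration: the catalog key and class come from this issue, spark.auron.enable.paimon.scan is the existing flag, the warehouse path is a made-up example value, and PaimonDsv2Conf and asSubmitArgs are hypothetical helper names. Enabling Auron itself is assumed to be configured separately.

```scala
object PaimonDsv2Conf {
  // Conf entries for a Paimon DSv2 catalog named "paimon".
  val settings: Map[String, String] = Map(
    // From this issue: registers the Paimon DSv2 catalog.
    "spark.sql.catalog.paimon" -> "org.apache.paimon.spark.SparkCatalog",
    // Assumption: example warehouse location, replace with a real path.
    "spark.sql.catalog.paimon.warehouse" -> "file:/tmp/paimon-warehouse",
    // Existing Auron flag; set explicitly here because its default is not stated in this issue.
    "spark.auron.enable.paimon.scan" -> "true"
  )

  /** Renders the settings as spark-submit style arguments, sorted by key. */
  def asSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

With this configuration a table is addressed as paimon.<database>.<table>, which is the case that Spark plans as BatchScanExec.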

Proposal

Add a parallel DSv2 path under thirdparty/auron-paimon that:

  1. Detects BatchScanExec whose scan is a Paimon PaimonBaseScan (via reflection, so the module does not hard-depend on a specific paimon-spark-* artifact and stays cross-Spark-version compatible).
  2. Validates the splits: COW only — any non-rawConvertible split or any split with deletionFiles causes a fallback to Spark (MOR/MOW is not yet supported by Auron's native scan).
  3. Reconstructs partition values from DataSplit.partition() using Paimon's RowDataToObjectArrayConverter.
  4. Wraps the planned files in a new NativePaimonV2TableScanExec, mirroring NativeIcebergTableScanExec in structure (metrics, driver-side metric reporting, empty-RDD handling, pb.FileScanExecConf construction).
  5. Reuses the existing spark.auron.enable.paimon.scan flag — no new flag.
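
The reflection-based detection in step 1 could look roughly like the sketch below. It is not the code from the PR: it only shows one way to test "is this object a PaimonBaseScan?" by class name, so nothing from paimon-spark-* is needed at compile time. The package org.apache.paimon.spark is an assumption, and PaimonScanDetection, isInstanceOfByName and isPaimonScan are hypothetical names.

```scala
object PaimonScanDetection {
  // "PaimonBaseScan" is named in the proposal; the package prefix is an assumption.
  val PaimonBaseScanClassName = "org.apache.paimon.spark.PaimonBaseScan"

  /**
   * True if the runtime class of `obj`, one of its superclasses, or one of the
   * interfaces they implement has the given fully-qualified name.
   * Only class names are compared, so the target class need not be on the
   * compile-time classpath.
   */
  def isInstanceOfByName(obj: AnyRef, className: String): Boolean = {
    def matches(cls: Class[_]): Boolean =
      cls != null && (
        cls.getName == className ||
          matches(cls.getSuperclass) ||
          cls.getInterfaces.exists(matches)
      )
    obj != null && matches(obj.getClass)
  }

  /** Hypothetical entry point: would be called with the scan held by a BatchScanExec. */
  def isPaimonScan(scan: AnyRef): Boolean =
    isInstanceOfByName(scan, PaimonBaseScanClassName)
}
```

Matching by name rather than with a typed pattern match is what keeps the module independent of a specific Paimon artifact. The trade-off is that a rename on the Paimon side is only noticed at runtime, as a silent fallback to Spark.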

Scope / Non-goals

  • Supports COW only. MOR tables (primary key, position/equality deletes) intentionally fall back to Spark.
  • The Hive-registered Paimon path is not touched.
  • No predicate / residual filter pushdown to the Paimon split planner yet. Filters still run at the Spark/Auron operator layer, so correctness is unaffected; a future improvement can mirror the Iceberg path.
  • Statistics are a placeholder for now (pb.Statistics.getDefaultInstance).
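
The COW-only rule can be illustrated with a small decision function. This is a sketch under stated assumptions, not the PR's implementation: SplitInfo is a hypothetical, simplified stand-in for a planned Paimon split that models only the two properties the proposal checks (rawConvertible and the presence of deletion files), and the decision types are invented for the example.

```scala
object CowSplitValidation {
  // Hypothetical, simplified view of one planned split.
  final case class SplitInfo(rawConvertible: Boolean, deletionFileCount: Int)

  sealed trait ScanDecision
  case object UseNativeScan extends ScanDecision
  final case class FallBackToSpark(reason: String) extends ScanDecision

  /**
   * Returns UseNativeScan only if every split is raw-convertible and carries
   * no deletion files; otherwise reports the first offending split.
   * An empty split list (empty table) is accepted.
   */
  def decide(splits: Seq[SplitInfo]): ScanDecision =
    splits.zipWithIndex.collectFirst {
      case (s, i) if !s.rawConvertible =>
        FallBackToSpark(s"split $i is not rawConvertible")
      case (s, i) if s.deletionFileCount > 0 =>
        FallBackToSpark(s"split $i has deletion files")
    }.getOrElse(UseNativeScan)
}
```

A single offending split rejects the whole scan, which matches the all-or-nothing fallback described in the proposal. Carrying a reason string is a design choice of this sketch that makes fallbacks easier to log.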

Tests

An end-to-end DSv2 integration suite at thirdparty/auron-paimon/src/test/scala/org/apache/auron/paimon/AuronPaimonV2IntegrationSuite.scala covers:

  • simple COW select / projection / partitioned table + predicate
  • ORC COW table (file.format=orc)
  • empty table
  • driver metrics (numPartitions, numFiles) via SparkListenerDriverAccumUpdates
  • fallback when spark.auron.enable.paimon.scan=false
  • fallback for MOR (primary-key) table

PR: #2313
