Background
Auron currently accelerates only Paimon tables that are registered via Hive Metastore and surfaced as HiveTableScanExec. When users configure a Paimon DSv2 catalog (e.g. spark.sql.catalog.paimon = org.apache.paimon.spark.SparkCatalog), the Paimon table is planned as BatchScanExec, which Auron does not convert, so the query falls back to Spark row-based execution.
The on-disk layout for the two paths is identical (COW Parquet/ORC files); only the metadata entry point and split planning differ.
Proposal
Add a parallel DSv2 path under thirdparty/auron-paimon that:
- Detects BatchScanExec whose scan is a Paimon PaimonBaseScan (via reflection, so the module does not hard-depend on a specific paimon-spark-* artifact and stays cross-Spark-version compatible).
- Validates the splits: COW only. Any non-rawConvertible split or any split with deletionFiles causes a fallback to Spark (MOR/MOW is not yet supported by Auron's native scan).
- Reconstructs partition values from DataSplit.partition() using Paimon's RowDataToObjectArrayConverter.
- Wraps the planned files in a new NativePaimonV2TableScanExec, mirroring NativeIcebergTableScanExec in structure (metrics, driver-side metric reporting, empty-RDD handling, pb.FileScanExecConf construction).
- Reuses the existing spark.auron.enable.paimon.scan flag; no new flag is added.
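The detection and validation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the PR: SplitInfo, PaimonV2ScanCheck, FakePaimonScan and the local PaimonBaseScan class are hypothetical stand-ins defined here so the snippet runs without Spark or Paimon on the classpath. The real implementation would inspect Paimon's DataSplit and the scan held by BatchScanExec.

```scala
// Hypothetical stand-in for the two DataSplit properties the proposal checks.
// It is NOT a Paimon class; it only models "rawConvertible" and "has deletion files".
final case class SplitInfo(rawConvertible: Boolean, deletionFileCount: Int)

// Hypothetical stand-ins so the name-based check has something to match against.
abstract class PaimonBaseScan
class FakePaimonScan extends PaimonBaseScan
class UnrelatedScan

object PaimonV2ScanCheck {
  // Last segment of a JVM class name, splitting on '.' and '$' so that nested or
  // script-wrapped classes still compare by their short name.
  private def shortName(cls: Class[_]): Option[String] =
    cls.getName.split(Array('.', '$')).lastOption

  // Walks superclasses and interfaces by name, so no paimon-spark-* class has to be
  // referenced at compile time.
  def isInstanceOfByName(obj: AnyRef, simpleName: String): Boolean = {
    def loop(cls: Class[_]): Boolean =
      cls != null && (
        shortName(cls).contains(simpleName) ||
          loop(cls.getSuperclass) ||
          cls.getInterfaces.exists(loop))
    obj != null && loop(obj.getClass)
  }

  // COW only: every split must be raw-convertible and carry no deletion files.
  // An empty split list passes trivially (forall on an empty Seq is true).
  def isCowOnly(splits: Seq[SplitInfo]): Boolean =
    splits.forall(s => s.rawConvertible && s.deletionFileCount == 0)

  // Convert only when the flag is on, the scan looks like a Paimon scan,
  // and all splits pass the COW check; otherwise leave the plan to Spark.
  def shouldConvert(scan: AnyRef, splits: Seq[SplitInfo], enabled: Boolean): Boolean =
    enabled && isInstanceOfByName(scan, "PaimonBaseScan") && isCowOnly(splits)
}
```

In this sketch a single split that is not raw-convertible, or that has deletion files, rejects the whole scan. That matches the all-or-nothing fallback described above, rather than accelerating a subset of splits.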
Scope / Non-goals
- Supports COW only. MOR (primary key, position/equality delete) is intentionally kept as a fallback.
- Hive-registered Paimon path is not touched.
- No predicate / residual filter pushdown to Paimon split planner yet (filters still run at the Spark/Auron operator layer; correctness is fine, but a future improvement can mirror the Iceberg path).
- Statistics (pb.Statistics.getDefaultInstance) are a placeholder for now.
Tests
End-to-end DSv2 integration suite under thirdparty/auron-paimon/src/test/scala/org/apache/auron/paimon/AuronPaimonV2IntegrationSuite.scala covering:
- simple COW select / projection / partitioned table + predicate
- ORC COW table (file.format=orc)
- empty table
- driver metrics (numPartitions, numFiles) via SparkListenerDriverAccumUpdates
- fallback when spark.auron.enable.paimon.scan=false
- fallback for MOR (primary-key) table
PR: #2313