Add footer cache for iceberg by liurenjie1024 · Pull Request #14645 · NVIDIA/spark-rapids

liurenjie1024 · 2026-04-22T08:01:13Z

Description

Allow iceberg's parquet reader to utilize footer cache. Since in older version of parquet, the ParquetFileReader constructor with ParquetMetadata is not available, we actually ignored the footer cache in iceberg 1.6.x and 1.9.x shim, and it's only available in iceberg 1.10 shim.

Checklists

Documentation

Updated for new or modified user-facing features or behaviors
No user-facing change

Testing

Added or modified tests to cover new code paths
Covered by existing tests
(Please provide the names of the existing tests in the PR description.)
Not required

Performance

Tests ran and results are added in the PR description
Issue filed with a link in the PR description
Not required

For the reason mentioned in pr description, we should only do perf test against iceberg 1.10 shim

greptile-apps · 2026-04-22T08:05:46Z

Greptile Summary

This PR introduces footer-cache support for the Iceberg Parquet reader by extracting shared footer-read/cache logic into a new ParquetFooterUtils object and adding per-version GpuParquetIOShim implementations. The 1.10.x shim injects a pre-parsed ParquetMetadata (read from cache or file) into ParquetFileReader via its 4-arg constructor, while the 1.6.x and 1.9.x shims fall back to uncached reads but correctly bump the miss counter. The GpuParquetScan inline cache logic is cleanly removed in favour of the new shared utility, and verifyParquetMagic / encrypted-parquet detection are likewise consolidated there.

Confidence Score: 5/5

Safe to merge — all resource-management patterns are correct and the shim split is complete across all three iceberg versions.

No P0/P1 findings. ARM patterns (withResource/closeOnExcept) are applied correctly throughout. By-name thunk in getFooterBuffer ensures readFooterBuffer is only evaluated on a cache miss. Shims added for all three iceberg versions (C4). The refactoring in GpuParquetScan preserves existing behaviour, including the encrypted-parquet detection path.

No files require special attention.

Important Files Changed

Filename	Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetFooterUtils.scala	New utility object factoring out shared footer read/cache logic; ARM patterns correct, by-name thunk used properly for lazy miss evaluation
iceberg/iceberg-1-10-x/src/main/scala/org/apache/iceberg/parquet/GpuParquetIOShim.scala	New 1.10.x shim: reads footer via FileCache using 4-arg ParquetFileReader constructor; ARM patterns (withResource/closeOnExcept) correctly applied
iceberg/iceberg-1-6-x/src/main/scala/org/apache/iceberg/parquet/GpuParquetIOShim.scala	New 1.6.x shim: caching unsupported, falls back to plain open() and bumps miss counter for metric visibility
iceberg/iceberg-1-9-x/src/main/scala/org/apache/iceberg/parquet/GpuParquetIOShim.scala	New 1.9.x shim: identical to 1.6.x shim — caching unsupported, plain open() with miss counter bump
iceberg/common/src/main/scala/com/nvidia/spark/rapids/iceberg/parquet/converter/ToIcebergShaded.scala	New adapter bridging non-shaded parquet InputFile/SeekableInputStream to iceberg-shaded equivalents; all stream delegates correctly forwarded
sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/GpuParquetScan.scala	Inline footer-cache logic removed and delegated to ParquetFooterUtils; verifyParquetMagic also extracted; behaviour preserved
iceberg/common/src/main/scala/org/apache/iceberg/parquet/GpuParquetIO.scala	Extended with openReader() that dispatches to per-version shim; thin delegation layer, no resource-management concerns
iceberg/common/src/main/scala/com/nvidia/spark/rapids/iceberg/parquet/reader.scala	newReader() updated to accept optional metrics map and delegate to GpuParquetIO.openReader for footer-cache awareness

Sequence Diagram

sequenceDiagram
    participant Reader as IcebergPartitionedFile.newReader()
    participant GpuParquetIO as GpuParquetIO.openReader()
    participant Shim110 as GpuParquetIOShim (1.10.x)
    participant Shim16_19 as GpuParquetIOShim (1.6/1.9.x)
    participant PFU as ParquetFooterUtils
    participant FC as FileCache
    participant PFR as ParquetFileReader (shaded)

    Reader->>GpuParquetIO: openReader(inputFile, path, opts, metrics)
    GpuParquetIO->>Shim110: openReader(...) [1.10.x]
    GpuParquetIO->>Shim16_19: openReader(...) [1.6/1.9.x]

    alt Iceberg 1.10.x
        Shim110->>PFU: getFooterBuffer(inputFile, metrics, readFooterBufferFromInputFile)
        PFU->>FC: getFooter(inputFile)
        alt Cache HIT
            FC-->>PFU: HMB (cached)
            PFU-->>Shim110: HMB (metrics: hit+size)
        else Cache MISS
            PFU->>PFU: readFooterBufferFromInputFile(inputFile, path)
            PFU->>FC: startFooterCache / complete
            PFU-->>Shim110: HMB (metrics: miss+size)
        end
        Shim110->>Shim110: ToIcebergShaded.shade(HMBInputFile)
        Shim110->>PFR: readFooter(shadedHmbFile, opts, hmbStream) → metadata
        Shim110->>PFR: new ParquetFileReader(realFile, metadata, opts, stream)
    else Iceberg 1.6.x / 1.9.x
        Shim16_19->>Shim16_19: metrics += FILECACHE_FOOTER_MISSES
        Shim16_19->>PFR: ParquetFileReader.open(file, opts)
    end

_{Reviews (5): Last reviewed commit: "Fix buid break" | Re-trigger Greptile}

Signed-off-by: Ray Liu <liurenjie2008@gmail.com>

gerashegalov

A few non-blocking review comments inline. One additional note on PR description scope (#7):

The title says "Add footer cache for iceberg" but only iceberg 1.10.x actually gets footer caching; the 1.6.x and 1.9.x shims always go through the plain ParquetFileReader.open(...) path because the shaded ParquetFileReader in those versions has no public API to inject pre-parsed metadata. Worth calling that limitation out explicitly in the PR description (and probably in the Performance section), so users on the older iceberg shims don't expect a perf win and reviewers don't go looking for tests asserting footer-cache hits there.

gerashegalov · 2026-04-23T23:32:08Z

+      throw new RuntimeException(s"$filePath is not a Parquet file (too small length: $fileLen )")
+    }
+    val footerLengthIndex = fileLen - FooterLengthSize - MAGIC.length
+    withResource(inputFile.open()) { inputStream =>


Lost NVTX range on the iceberg footer-read path.

The original readFooterBufUsingHadoop in GpuParquetScan.scala wraps the actual seek + copy loop in NvtxRegistry.PARQUET_READ_FOOTER_BYTES. This new helper drops it, so iceberg footer reads (which now flow through readFooterBufferFromInputFile from the 1.10.x shim) will silently disappear from NVTX traces and won't be comparable to non-iceberg footer reads.

Suggest wrapping the withResource(inputFile.open()) { ... } block (or just the seek/read loop inside it) in NvtxRegistry.PARQUET_READ_FOOTER_BYTES { ... } for parity.

gerashegalov · 2026-04-23T23:32:08Z

+      IOUtils.readFully(inputStream, magic, 0, magic.length)
+      verifyParquetMagic(filePath, magic)
+      val fIdx = footerIndex(filePath, fileLen, footerLength, FooterLengthSize)
+      val tailBytes = (fileLen - fIdx).toInt


Same Long -> Int truncation concern Greptile flagged on the cache-hit path applies here on the raw-read path as well: if fileLen - fIdx ever exceeds 2 GB this silently overflows to a negative value and HostMemoryBuffer.allocate(tailBytes + MAGIC.length, false) will fail with a confusing message (or worse).

Astronomically unlikely for footers, but Math.toIntExact(fileLen - fIdx) would at least give a deterministic, descriptive failure if it ever happens. Cheap to add.

gerashegalov · 2026-04-23T23:32:08Z

-        // buffer includes header and trailing length and magic, stripped here
-        hmb.slice(MAGIC.length, hmb.getLength - Integer.BYTES - MAGIC.length)
-      }
+    withResource(ParquetFooterUtils.getFooterBuffer(inputFile, metrics,


Nice consolidation here. Follow-up suggestion: the other footer-cache hit/miss block in this same file, readAndSimpleFilterFooter (around line 631), still has its own near-identical FileCache.getFooter / startFooterCache / try { ... token.complete } finally { token.cancel } ladder. That's exactly the pattern you just centralized in ParquetFooterUtils.getFooterBuffer.

It could be reduced to roughly:

withResource(ParquetFooterUtils.getFooterBuffer(inputFile, metrics, readFooterBuffer(fileIO, filePath, conf))) { hmb => ParquetFileReader.readFooter(new HMBInputFile(hmb), ParquetMetadataConverter.range(file.start, file.start + file.length)) }

Happy for it to be a separate PR if you want to keep this one minimal, but the duplication is exactly what motivated the extraction, and centralizing also means the resource-leak fix (the try/finally token.cancel() you correctly added in getFooterBuffer) lives in only one place.

liurenjie1024 · 2026-04-24T06:00:56Z

A few non-blocking review comments inline. One additional note on PR description scope (#7):

The title says "Add footer cache for iceberg" but only iceberg 1.10.x actually gets footer caching; the 1.6.x and 1.9.x shims always go through the plain ParquetFileReader.open(...) path because the shaded ParquetFileReader in those versions has no public API to inject pre-parsed metadata. Worth calling that limitation out explicitly in the PR description (and probably in the Performance section), so users on the older iceberg shims don't expect a perf win and reviewers don't go looking for tests asserting footer-cache hits there.

cc @gerashegalov I've updated the pr description part.

greptile-apps · 2026-04-24T06:01:55Z

Want your agent to iterate on Greptile's feedback? Try greploops.

liurenjie1024 · 2026-04-24T08:29:38Z

build

Add footer cache for iceberg

98644a2

greptile-apps Bot reviewed Apr 22, 2026

View reviewed changes

Comment thread sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetFooterUtils.scala

Comment thread iceberg/common/src/main/scala/org/apache/iceberg/parquet/GpuParquetIO.scala Outdated

liurenjie1024 marked this pull request as draft April 22, 2026 10:04

Refactoring

049e11e

liurenjie1024 marked this pull request as ready for review April 23, 2026 06:05

Address comments

4854473

Signed-off-by: Ray Liu <liurenjie2008@gmail.com>

liurenjie1024 requested review from abellina, gerashegalov and res-life April 23, 2026 06:19

sameerz added the performance A performance related task/issue label Apr 23, 2026

gerashegalov reviewed Apr 23, 2026

View reviewed changes

Address comments

8dcd60a

Fix buid break

0c24534

Conversation

liurenjie1024 commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Checklists

Uh oh!

greptile-apps Bot commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

gerashegalov left a comment

Choose a reason for hiding this comment

Uh oh!

gerashegalov Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

liurenjie1024 Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

gerashegalov Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

liurenjie1024 Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

gerashegalov Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

liurenjie1024 Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

liurenjie1024 commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

greptile-apps Bot commented Apr 24, 2026

Uh oh!

liurenjie1024 commented Apr 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

liurenjie1024 commented Apr 22, 2026 •

edited

Loading

greptile-apps Bot commented Apr 22, 2026 •

edited

Loading

liurenjie1024 commented Apr 24, 2026 •

edited

Loading