Conversation
Greptile SummaryThis PR introduces footer-cache support for the Iceberg Parquet reader by extracting shared footer-read/cache logic into a new Confidence Score: 5/5Safe to merge — all resource-management patterns are correct and the shim split is complete across all three iceberg versions. No P0/P1 findings. ARM patterns (withResource/closeOnExcept) are applied correctly throughout. By-name thunk in getFooterBuffer ensures readFooterBuffer is only evaluated on a cache miss. Shims added for all three iceberg versions (C4). The refactoring in GpuParquetScan preserves existing behaviour, including the encrypted-parquet detection path. No files require special attention. Important Files Changed
Sequence DiagramsequenceDiagram
participant Reader as IcebergPartitionedFile.newReader()
participant GpuParquetIO as GpuParquetIO.openReader()
participant Shim110 as GpuParquetIOShim (1.10.x)
participant Shim16_19 as GpuParquetIOShim (1.6/1.9.x)
participant PFU as ParquetFooterUtils
participant FC as FileCache
participant PFR as ParquetFileReader (shaded)
Reader->>GpuParquetIO: openReader(inputFile, path, opts, metrics)
GpuParquetIO->>Shim110: openReader(...) [1.10.x]
GpuParquetIO->>Shim16_19: openReader(...) [1.6/1.9.x]
alt Iceberg 1.10.x
Shim110->>PFU: getFooterBuffer(inputFile, metrics, readFooterBufferFromInputFile)
PFU->>FC: getFooter(inputFile)
alt Cache HIT
FC-->>PFU: HMB (cached)
PFU-->>Shim110: HMB (metrics: hit+size)
else Cache MISS
PFU->>PFU: readFooterBufferFromInputFile(inputFile, path)
PFU->>FC: startFooterCache / complete
PFU-->>Shim110: HMB (metrics: miss+size)
end
Shim110->>Shim110: ToIcebergShaded.shade(HMBInputFile)
Shim110->>PFR: readFooter(shadedHmbFile, opts, hmbStream) → metadata
Shim110->>PFR: new ParquetFileReader(realFile, metadata, opts, stream)
else Iceberg 1.6.x / 1.9.x
Shim16_19->>Shim16_19: metrics += FILECACHE_FOOTER_MISSES
Shim16_19->>PFR: ParquetFileReader.open(file, opts)
end
Reviews (5): Last reviewed commit: "Fix buid break" | Re-trigger Greptile |
Signed-off-by: Ray Liu <liurenjie2008@gmail.com>
gerashegalov
left a comment
There was a problem hiding this comment.
A few non-blocking review comments inline. One additional note on PR description scope (#7):
The title says "Add footer cache for iceberg" but only iceberg 1.10.x actually gets footer caching; the 1.6.x and 1.9.x shims always go through the plain ParquetFileReader.open(...) path because the shaded ParquetFileReader in those versions has no public API to inject pre-parsed metadata. Worth calling that limitation out explicitly in the PR description (and probably in the Performance section), so users on the older iceberg shims don't expect a perf win and reviewers don't go looking for tests asserting footer-cache hits there.
| throw new RuntimeException(s"$filePath is not a Parquet file (too small length: $fileLen )") | ||
| } | ||
| val footerLengthIndex = fileLen - FooterLengthSize - MAGIC.length | ||
| withResource(inputFile.open()) { inputStream => |
There was a problem hiding this comment.
Lost NVTX range on the iceberg footer-read path.
The original readFooterBufUsingHadoop in GpuParquetScan.scala wraps the actual seek + copy loop in NvtxRegistry.PARQUET_READ_FOOTER_BYTES. This new helper drops it, so iceberg footer reads (which now flow through readFooterBufferFromInputFile from the 1.10.x shim) will silently disappear from NVTX traces and won't be comparable to non-iceberg footer reads.
Suggest wrapping the withResource(inputFile.open()) { ... } block (or just the seek/read loop inside it) in NvtxRegistry.PARQUET_READ_FOOTER_BYTES { ... } for parity.
| IOUtils.readFully(inputStream, magic, 0, magic.length) | ||
| verifyParquetMagic(filePath, magic) | ||
| val fIdx = footerIndex(filePath, fileLen, footerLength, FooterLengthSize) | ||
| val tailBytes = (fileLen - fIdx).toInt |
There was a problem hiding this comment.
Same Long -> Int truncation concern Greptile flagged on the cache-hit path applies here on the raw-read path as well: if fileLen - fIdx ever exceeds 2 GB this silently overflows to a negative value and HostMemoryBuffer.allocate(tailBytes + MAGIC.length, false) will fail with a confusing message (or worse).
Astronomically unlikely for footers, but Math.toIntExact(fileLen - fIdx) would at least give a deterministic, descriptive failure if it ever happens. Cheap to add.
| // buffer includes header and trailing length and magic, stripped here | ||
| hmb.slice(MAGIC.length, hmb.getLength - Integer.BYTES - MAGIC.length) | ||
| } | ||
| withResource(ParquetFooterUtils.getFooterBuffer(inputFile, metrics, |
There was a problem hiding this comment.
Nice consolidation here. Follow-up suggestion: the other footer-cache hit/miss block in this same file, readAndSimpleFilterFooter (around line 631), still has its own near-identical FileCache.getFooter / startFooterCache / try { ... token.complete } finally { token.cancel } ladder. That's exactly the pattern you just centralized in ParquetFooterUtils.getFooterBuffer.
It could be reduced to roughly:
withResource(ParquetFooterUtils.getFooterBuffer(inputFile, metrics,
readFooterBuffer(fileIO, filePath, conf))) { hmb =>
ParquetFileReader.readFooter(new HMBInputFile(hmb),
ParquetMetadataConverter.range(file.start, file.start + file.length))
}Happy for it to be a separate PR if you want to keep this one minimal, but the duplication is exactly what motivated the extraction, and centralizing also means the resource-leak fix (the try/finally token.cancel() you correctly added in getFooterBuffer) lives in only one place.
cc @gerashegalov I've updated the pr description part. |
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
|
build |
Fixes #14064
Description
Allow iceberg's parquet reader to utilize footer cache. Since in older version of parquet, the
ParquetFileReaderconstructor withParquetMetadatais not available, we actually ignored the footer cache in iceberg 1.6.x and 1.9.x shim, and it's only available in iceberg 1.10 shim.Checklists
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance
For the reason mentioned in pr description, we should only do perf test against iceberg 1.10 shim