Skip to content

StrictMetricsEvaluator treats unknown record_count as rows must match #732

@kevinjqliu

Description

@kevinjqliu

StrictMetricsEvaluator::Evaluate currently returns ROWS_MUST_MATCH when data_file.record_count <= 0:

Result<bool> StrictMetricsEvaluator::Evaluate(const DataFile& data_file) const {
if (data_file.record_count <= 0) {
return kRowsMustMatch;
}

This is correct for record_count == 0, but not for record_count == -1. iceberg-cpp uses -1 as an unknown row-count sentinel when writer metrics do not include row_count:

  • data writer:
    auto data_file = std::make_shared<DataFile>(DataFile{
    .content = DataFile::Content::kData,
    .file_path = options_.path,
    .file_format = options_.format,
    .partition = options_.partition,
    .record_count = metrics.row_count.value_or(-1),
    .file_size_in_bytes = length,
  • equality delete writer:
    auto data_file = std::make_shared<DataFile>(DataFile{
    .content = DataFile::Content::kEqualityDeletes,
    .file_path = options_.path,
    .file_format = options_.format,
    .partition = options_.partition,
    .record_count = metrics.row_count.value_or(-1),
    .file_size_in_bytes = length,
  • position delete writer:
    auto data_file = std::make_shared<DataFile>(DataFile{
    .content = DataFile::Content::kPositionDeletes,
    .file_path = options_.path,
    .file_format = options_.format,
    .partition = options_.partition,
    .record_count = metrics.row_count.value_or(-1),
    .file_size_in_bytes = length,

Other code already treats negative record count as unknown/missing:

  • inclusive metrics:
    Result<bool> InclusiveMetricsEvaluator::Evaluate(const DataFile& data_file) const {
    if (data_file.record_count == 0) {
    return kRowCannotMatch;
    }
    if (data_file.record_count < 0) {
    // we haven't implemented parsing record count from avro file and thus set record
    // count -1 when importing avro tables to iceberg tables. This should be updated once
    // we implemented and set correct record count.
    return kRowsMightMatch;
    }
  • count aggregate:
    Result<int64_t> CountStarAggregate::CountFor(const DataFile& file) const {
    if (!HasValue(file)) {
    return NotFound("Record count is missing");
    }
    return file.record_count;
    }
    bool CountStarAggregate::HasValue(const DataFile& file) const {
    return file.record_count >= 0;
    }

Impact: a file with unknown row count can be treated as if every row must match, even for predicates like AlwaysFalse.

Suggested fix: only special-case record_count == 0; let negative/unknown counts fall through to normal strict metrics evaluation.

A focused regression test with record_count = -1 and AlwaysFalse fails before changing <= 0 to == 0, and passes after.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions