
Add support for the timestamp type in statistics for Hive 3 and 4 versions #28296

Open
anton-kutuzov wants to merge 1 commit into trinodb:master from anton-kutuzov:add_suporting_tymestamp_in_statistics

Conversation

@anton-kutuzov
Contributor

@anton-kutuzov anton-kutuzov commented Feb 14, 2026

Description

Added support for the timestamp type in statistics calculation.
To maintain compatibility with Hive statistics, the computed min and max values are truncated to seconds on write by dividing by MICROSECONDS_PER_SECOND, using floor for min and ceil for max so that values with sub-second precision still fall within the stored range. On read, values are converted back to Trino's representation by multiplying by MICROSECONDS_PER_SECOND.
This solution supports both Hive Metastore version 3 and version 4.
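The rounding scheme described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the class and method names are hypothetical, and `Math.ceilDiv` assumes Java 18 or later.

```java
// Hypothetical sketch of the min/max truncation described in the PR:
// min is floored and max is ceiled to whole seconds, so converting the
// stored bounds back to microseconds can only widen the range and never
// excludes an original value.
public class TimestampStatsRounding
{
    static final long MICROSECONDS_PER_SECOND = 1_000_000L;

    // Written to the metastore: round min down to whole seconds
    static long truncateMin(long epochMicros)
    {
        return Math.floorDiv(epochMicros, MICROSECONDS_PER_SECOND);
    }

    // Written to the metastore: round max up to whole seconds (Java 18+)
    static long truncateMax(long epochMicros)
    {
        return Math.ceilDiv(epochMicros, MICROSECONDS_PER_SECOND);
    }
}
```

On read, multiplying the stored second values back by MICROSECONDS_PER_SECOND yields `truncateMin(min) * MICROSECONDS_PER_SECOND <= min` and `truncateMax(max) * MICROSECONDS_PER_SECOND >= max`, which is why the round-trip loses precision but never narrows the range.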

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

Member

@ebyhr ebyhr left a comment

Just skimmed.

.addCoordinatorProperty("http-server.http.port", "8080")
.addHiveProperty("hive.metastore", "glue")
.addHiveProperty("hive.metastore.glue.default-warehouse-dir", "local:///glue")
.addHiveProperty("hive.metastore.uri", "thrift://hive-metastore:9083")
Member

Revert unrelated change. This is HiveGlueQueryRunnerMain.

Contributor Author

Thanks. Fixed


queryRunner.installPlugin(new HivePlugin());
queryRunner.createCatalog("hive", "hive");

Member

Revert. When a new plugin is required in the test class, please install it via AbstractTestQueryFramework#createQueryRunner method.

Contributor Author

Thanks. Fixed

.setCatalogSessionProperty("hive", "insert_existing_partitions_behavior", "APPEND")
.setCatalogSessionProperty("hive", "collect_column_statistics_on_write", "true")
.build();
String testTable = HIVE_TEST_SCHEMA + ".test_tabble_" + randomNameSuffix();
Member

Fix typo > tabble

Contributor Author

Thanks. Fixed

@anton-kutuzov anton-kutuzov force-pushed the add_suporting_tymestamp_in_statistics branch 2 times, most recently from 3789d9e to eebfb46 on February 17, 2026 at 21:34
@findepi
Member

findepi commented Feb 19, 2026

Related issues: #5860, #26214, #26338

Is this PR "fixes #26214"?

if (timestamp == null) {
return OptionalLong.empty();
}
return OptionalLong.of(LocalDateTime.ofEpochSecond(timestamp.getSecondsSinceEpoch(), 0, ZoneOffset.UTC).toInstant(ZoneOffset.UTC).toEpochMilli() * MICROSECONDS_PER_MILLISECOND);
Member

Could this be done simpler?

Suggested change
return OptionalLong.of(LocalDateTime.ofEpochSecond(timestamp.getSecondsSinceEpoch(), 0, ZoneOffset.UTC).toInstant(ZoneOffset.UTC).toEpochMilli() * MICROSECONDS_PER_MILLISECOND);
return OptionalLong.of(timestamp.getSecondsSinceEpoch() * MICROSECONDS_PER_SECOND);

Contributor Author

Thanks. Fixed

LongColumnStatsData data = new LongColumnStatsData();
statistics.getIntegerStatistics().ifPresent(timestampStatistics -> {
timestampStatistics.getMin().ifPresent(value -> data.setLowValue(floorDiv(value, MICROSECONDS_PER_SECOND)));
timestampStatistics.getMax().ifPresent(value -> data.setHighValue(ceilDiv(value, MICROSECONDS_PER_SECOND)));
Member

❤️

Although, since stats are only estimates, the ceil div is not strictly necessary.
In fact, it may cause some planning issues (a partition with a single value will have different min and max).
My intuition is that the code as currently written is the better way, so no changes requested.

cc @raunaqmorarka @sopel39

Contributor Author

I used floorDiv and ceilDiv because I needed them to implement partition filtering based on statistics in the Hive catalog. With this implementation, I may include two extra partitions, which is not critical, but I won’t miss any.
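The safety argument behind this reply can be illustrated with a small sketch. The helper below is hypothetical (not part of the PR); it only shows why bounds stored as floor(min)/ceil(max) seconds give a conservative pruning check, one that may keep an extra partition but never drops a matching one.

```java
// Hypothetical pruning check against stats stored as whole seconds.
// Because min was floored and max was ceiled on write, converting the
// stored bounds back to microseconds can only widen the true range:
// the check may return true for a partition with no matching rows
// (a false positive), but never false for one that has them.
public class TimestampPartitionPruning
{
    static final long MICROSECONDS_PER_SECOND = 1_000_000L;

    static boolean partitionMayMatch(long predicateEpochMicros, long statsMinSeconds, long statsMaxSeconds)
    {
        long lowerBoundMicros = statsMinSeconds * MICROSECONDS_PER_SECOND; // floored min, widened down
        long upperBoundMicros = statsMaxSeconds * MICROSECONDS_PER_SECOND; // ceiled max, widened up
        return predicateEpochMicros >= lowerBoundMicros
                && predicateEpochMicros <= upperBoundMicros;
    }
}
```

For example, a partition whose true range is [1.2s, 1.8s] is stored as [1s, 2s]; any predicate value inside the true range still falls inside the stored range, while clearly distant values are still pruned.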

return ImmutableSet.of(MIN_VALUE, MAX_VALUE, NUMBER_OF_DISTINCT_VALUES, NUMBER_OF_NON_NULL_VALUES);
}
if (type instanceof TimestampType || type instanceof TimestampWithTimeZoneType) {
// TODO (https://github.com/trinodb/trino/issues/5859) Add support for timestamp MIN_VALUE, MAX_VALUE
Member

Please add "fixes #5859" to the PR description.

Contributor Author

Done

private static Optional<DoubleRange> createTimestampRange(IntegerStatistics statistics)
{
if (statistics.getMin().isPresent() && statistics.getMax().isPresent()) {
return Optional.of(new DoubleRange(statistics.getMin().getAsLong(), statistics.getMax().getAsLong()));
Member

Shouldn't this multiply by MICROSECONDS_PER_SECOND?

Contributor Author

No, we multiply only once, when reading from the metastore.

if (block.isNull(0)) {
return OptionalLong.empty();
}
if (block instanceof Fixed12Block) {
Member

Don't inspect the block type. Use the type API to get the value:

if (((TimestampType) type).isShort()) {
    return OptionalLong.of(type.getLong(block, 0));
}
else {
    return OptionalLong.of(((LongTimestamp) type.getObject(block, 0)).getEpochMicros());
}

Contributor Author

Yes, I had that option, but I was getting the error "class io.trino.spi.block.Fixed12Block cannot be cast to class io.trino.spi.block.LongArrayBlock", so I added a check for Fixed12Block.

return Optional.of(LocalDate.ofEpochDay(days));
}

private static OptionalLong getTimestampValue(Type type, Block block)
Member

It doesn't read the timestamp value, only part of it. Call it getTimestampEpochMicro

result.setDateStatistics(new DateStatistics(getDateValue(type, min), getDateValue(type, max)));
}
else if (type instanceof TimestampType) {
result.setIntegerStatistics(new IntegerStatistics(getTimestampValue(type, min), getTimestampValue(type, max)));
Member

Don't you need to divide by MICROSECONDS_PER_SECOND?

Contributor Author

if (type instanceof TimestampType || type instanceof TimestampWithTimeZoneType) {
// TODO (https://github.com/trinodb/trino/issues/5859) Add support for timestamp MIN_VALUE, MAX_VALUE
return ImmutableSet.of(NUMBER_OF_DISTINCT_VALUES, NUMBER_OF_NON_NULL_VALUES);
return ImmutableSet.of(MIN_VALUE, MAX_VALUE, NUMBER_OF_DISTINCT_VALUES, NUMBER_OF_NON_NULL_VALUES);
Member

This PR should change behavior for type instanceof TimestampType case, but not the TimestampWithTimeZoneType case. Split the if

Contributor Author

Done

row("c_decimal", null, 2.0, 0.0, null, "345.0", "346.0"),
row("c_decimal_w_params", null, 2.0, 0.0, null, "345.671", "345.678"),
row("c_timestamp", null, 2.0, 0.0, null, null, null),
row("c_timestamp", null, 2.0, 0.0, null, "2015-05-10 12:15:31.000", "2015-05-10 12:15:36.000"),
Member

Was this table analyzed by hive?

(How do we verify hive timestamp stats being expressed in seconds in epoch?)

Contributor Author

If I understood the question correctly (does Hive compute timestamp statistics in seconds?), then yes, it also computes them in seconds: https://github.com/apache/hive/blob/46f27831/ql/src/java/org/apache/hadoop/hive/ql/stats/ColumnStatisticsObjTranslator.java#L303-L310

@anton-kutuzov
Contributor Author

Related issues: #5860, #26214, #26338

Is this PR "fixes #26214"?

Yes, the test was added for this case

@anton-kutuzov anton-kutuzov force-pushed the add_suporting_tymestamp_in_statistics branch from eebfb46 to 613db4e on February 23, 2026 at 10:28
@ebyhr ebyhr requested a review from chenjian2664 March 13, 2026 02:36
Contributor

@chenjian2664 chenjian2664 left a comment

Skimmed the test part.

")");
assertUpdate(session, "INSERT INTO " + testTable + " VALUES(1, TIMESTAMP '2025-09-19 10:00:00', TIMESTAMP '2025-09-19 10:10:00', 2, 3.5, 1, 10.0, 2.0, 12.0)", 1);
assertUpdate(session, "ANALYZE " + testTable, 1);
assertUpdate(session, "ANALYZE " + testTable, 1);
Contributor

Is it necessary to analyze twice?

}

@Test
public void testSupportTimestampStatisticsForHive3andHive4()
Contributor

"ForHive3andHive4" suffix looks redundant

" Tip_amount DOUBLE,\n" +
" Total_amount DOUBLE\n" +
")");
assertUpdate(session, "INSERT INTO " + testTable + " VALUES(1, TIMESTAMP '2025-09-19 10:00:00', TIMESTAMP '2025-09-19 10:10:00', 2, 3.5, 1, 10.0, 2.0, 12.0)", 1);
Contributor

nit: make the two timestamp values different (though there is only 1 row)

.setCatalogSessionProperty("hive", "insert_existing_partitions_behavior", "APPEND")
.setCatalogSessionProperty("hive", "collect_column_statistics_on_write", "true")
.build();
String testTable = HIVE_TEST_SCHEMA + ".test_table_" + randomNameSuffix();
Contributor

test_table_ -> test_timestamp_statistics

"WITH (partitioned_by=ARRAY['dt'])");

assertUpdate(session, "INSERT INTO " + testTable + "(history_batch_timestamp, dt) VALUES (CAST(CURRENT_TIMESTAMP AS TIMESTAMP), '2026-02-03')", 1);
assertUpdate(session, "INSERT INTO " + testTable + "(history_batch_timestamp, dt) VALUES (CAST(CURRENT_TIMESTAMP AS TIMESTAMP), '2026-02-03')", 1);
Contributor

Is the second insert operation for testing the overwrite behavior? If so, can we add a test for it, e.g. for the value count?


Labels

cla-signed, hive (Hive connector)

4 participants