Add support for timestamp type in statistics for Hive 3 and 4 versions #28296
anton-kutuzov wants to merge 1 commit into trinodb:master
Conversation
```java
.addCoordinatorProperty("http-server.http.port", "8080")
.addHiveProperty("hive.metastore", "glue")
.addHiveProperty("hive.metastore.glue.default-warehouse-dir", "local:///glue")
.addHiveProperty("hive.metastore.uri", "thrift://hive-metastore:9083")
```
Revert unrelated change. This is HiveGlueQueryRunnerMain.
Thanks. Fixed
```java
queryRunner.installPlugin(new HivePlugin());
queryRunner.createCatalog("hive", "hive");
```
Revert. When a new plugin is required in the test class, please install it via AbstractTestQueryFramework#createQueryRunner method.
Thanks. Fixed
```java
.setCatalogSessionProperty("hive", "insert_existing_partitions_behavior", "APPEND")
.setCatalogSessionProperty("hive", "collect_column_statistics_on_write", "true")
.build();
String testTable = HIVE_TEST_SCHEMA + ".test_tabble_" + randomNameSuffix();
```
Thanks. Fixed
Force-pushed from 3789d9e to eebfb46
```java
if (timestamp == null) {
    return OptionalLong.empty();
}
return OptionalLong.of(LocalDateTime.ofEpochSecond(timestamp.getSecondsSinceEpoch(), 0, ZoneOffset.UTC).toInstant(ZoneOffset.UTC).toEpochMilli() * MICROSECONDS_PER_MILLISECOND);
```
Could this be done simpler?
```diff
-return OptionalLong.of(LocalDateTime.ofEpochSecond(timestamp.getSecondsSinceEpoch(), 0, ZoneOffset.UTC).toInstant(ZoneOffset.UTC).toEpochMilli() * MICROSECONDS_PER_MILLISECOND);
+return OptionalLong.of(timestamp.getSecondsSinceEpoch() * MICROSECONDS_PER_SECOND);
```
Thanks. Fixed
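The two expressions are equivalent because metastore timestamp statistics carry whole seconds (no sub-second component), so the detour through `LocalDateTime`/`Instant` reduces to multiplying by a constant. A standalone sketch of that equivalence, with the constants inlined (names mirror those referenced in the PR; this is not the PR's actual code):

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;

public class TimestampConversionCheck
{
    // Constant names mirror those referenced in the PR; values assumed.
    static final long MICROSECONDS_PER_MILLISECOND = 1_000;
    static final long MICROSECONDS_PER_SECOND = 1_000_000;

    // Original expression: seconds -> LocalDateTime -> Instant -> millis -> micros
    static long viaLocalDateTime(long secondsSinceEpoch)
    {
        return LocalDateTime.ofEpochSecond(secondsSinceEpoch, 0, ZoneOffset.UTC)
                .toInstant(ZoneOffset.UTC)
                .toEpochMilli() * MICROSECONDS_PER_MILLISECOND;
    }

    // Suggested simplification: multiply the seconds directly
    static long direct(long secondsSinceEpoch)
    {
        return secondsSinceEpoch * MICROSECONDS_PER_SECOND;
    }

    public static void main(String[] args)
    {
        long seconds = 1_431_260_131L; // 2015-05-10 12:15:31 UTC
        if (viaLocalDateTime(seconds) != direct(seconds)) {
            throw new AssertionError("conversions disagree");
        }
    }
}
```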
```java
LongColumnStatsData data = new LongColumnStatsData();
statistics.getIntegerStatistics().ifPresent(timestampStatistics -> {
    timestampStatistics.getMin().ifPresent(value -> data.setLowValue(floorDiv(value, MICROSECONDS_PER_SECOND)));
    timestampStatistics.getMax().ifPresent(value -> data.setHighValue(ceilDiv(value, MICROSECONDS_PER_SECOND)));
```
❤️
Although, stats are only estimates, so ceil div is not strictly necessary.
In fact, it may cause some planning issues (a partition with single value will have min and max different).
My intuition is that the code as currently written is the better way, so no changes requested.
I used floorDiv and ceilDiv because I needed them to implement partition filtering based on statistics in the Hive catalog. With this implementation, I may include two extra partitions, which is not critical, but I won’t miss any.
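The superset property this relies on can be checked in isolation: rounding the minimum down and the maximum up means every original microsecond value still falls inside the second-granularity range, so pruning may keep an extra partition but never drops a matching one. A minimal sketch under assumed names, with `ceilDiv` written out since `Math.ceilDiv` only exists since Java 18:

```java
public class TimestampStatsBounds
{
    static final long MICROSECONDS_PER_SECOND = 1_000_000;

    // Ceiling division for longs; Math.ceilDiv is only available since Java 18.
    static long ceilDiv(long x, long y)
    {
        return -Math.floorDiv(-x, y);
    }

    // Returns true when the second-granularity [low, high] range still
    // covers every microsecond value in [minMicros, maxMicros].
    static boolean covers(long minMicros, long maxMicros)
    {
        long lowSeconds = Math.floorDiv(minMicros, MICROSECONDS_PER_SECOND); // round min down
        long highSeconds = ceilDiv(maxMicros, MICROSECONDS_PER_SECOND);      // round max up
        return lowSeconds * MICROSECONDS_PER_SECOND <= minMicros
                && highSeconds * MICROSECONDS_PER_SECOND >= maxMicros;
    }

    public static void main(String[] args)
    {
        // 12:15:31.5 .. 12:15:36.25 on 2015-05-10 UTC, as epoch micros
        if (!covers(1_431_260_131_500_000L, 1_431_260_136_250_000L)) {
            throw new AssertionError("widened range must be a superset");
        }
    }
}
```

Using `floorDiv` rather than plain `/` also keeps the rounding direction correct for pre-epoch (negative) timestamps.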
```java
    return ImmutableSet.of(MIN_VALUE, MAX_VALUE, NUMBER_OF_DISTINCT_VALUES, NUMBER_OF_NON_NULL_VALUES);
}
if (type instanceof TimestampType || type instanceof TimestampWithTimeZoneType) {
    // TODO (https://github.com/trinodb/trino/issues/5859) Add support for timestamp MIN_VALUE, MAX_VALUE
```
```java
private static Optional<DoubleRange> createTimestampRange(IntegerStatistics statistics)
{
    if (statistics.getMin().isPresent() && statistics.getMax().isPresent()) {
        return Optional.of(new DoubleRange(statistics.getMin().getAsLong(), statistics.getMax().getAsLong()));
```
Shouldn't this multiply by MICROSECONDS_PER_SECOND?
No, we multiply only once, when reading from the metastore.
```java
if (block.isNull(0)) {
    return OptionalLong.empty();
}
if (block instanceof Fixed12Block) {
```
Don't inspect the block type. Use the type API to get the value:
```java
if (((TimestampType) type).isShort()) {
    return OptionalLong.of(type.getLong(block, 0));
}
else {
    return OptionalLong.of(((LongTimestamp) type.getObject(block, 0)).getEpochMicros());
}
```
Yes, I had that option, but I was getting an error: `class io.trino.spi.block.Fixed12Block cannot be cast to class io.trino.spi.block.LongArrayBlock`, so I added a check on Fixed12Block.
```java
    return Optional.of(LocalDate.ofEpochDay(days));
}

private static OptionalLong getTimestampValue(Type type, Block block)
```
It doesn't read the timestamp value, only part of it. Call it getTimestampEpochMicro
```java
    result.setDateStatistics(new DateStatistics(getDateValue(type, min), getDateValue(type, max)));
}
else if (type instanceof TimestampType) {
    result.setIntegerStatistics(new IntegerStatistics(getTimestampValue(type, min), getTimestampValue(type, max)));
```
Don't you need to divide by MICROSECONDS_PER_SECOND?
No, because we do it only before saving to the metastore:
https://github.com/trinodb/trino/blob/master/plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/thrift/ThriftHiveMetastore.java#L570
```diff
 if (type instanceof TimestampType || type instanceof TimestampWithTimeZoneType) {
     // TODO (https://github.com/trinodb/trino/issues/5859) Add support for timestamp MIN_VALUE, MAX_VALUE
-    return ImmutableSet.of(NUMBER_OF_DISTINCT_VALUES, NUMBER_OF_NON_NULL_VALUES);
+    return ImmutableSet.of(MIN_VALUE, MAX_VALUE, NUMBER_OF_DISTINCT_VALUES, NUMBER_OF_NON_NULL_VALUES);
```
This PR should change behavior for type instanceof TimestampType case, but not the TimestampWithTimeZoneType case. Split the if
```diff
 row("c_decimal", null, 2.0, 0.0, null, "345.0", "346.0"),
 row("c_decimal_w_params", null, 2.0, 0.0, null, "345.671", "345.678"),
-row("c_timestamp", null, 2.0, 0.0, null, null, null),
+row("c_timestamp", null, 2.0, 0.0, null, "2015-05-10 12:15:31.000", "2015-05-10 12:15:36.000"),
```
Was this table analyzed by hive?
(How do we verify hive timestamp stats being expressed in seconds in epoch?)
If I understood the question correctly (does Hive compute statistics for timestamp in seconds?), then yes, it also computes them in seconds: https://github.com/apache/hive/blob/46f27831/ql/src/java/org/apache/hadoop/hive/ql/stats/ColumnStatisticsObjTranslator.java#L303-L310
Force-pushed from eebfb46 to 613db4e
chenjian2664 left a comment:

Skimmed the test part.
```java
")");
assertUpdate(session, "INSERT INTO " + testTable + " VALUES(1, TIMESTAMP '2025-09-19 10:00:00', TIMESTAMP '2025-09-19 10:10:00', 2, 3.5, 1, 10.0, 2.0, 12.0)", 1);
assertUpdate(session, "ANALYZE " + testTable, 1);
assertUpdate(session, "ANALYZE " + testTable, 1);
```
Is it necessary to analyze twice?
```java
}

@Test
public void testSupportTimestampStatisticsForHive3andHive4()
```
"ForHive3andHive4" suffix looks redundant
```java
    " Tip_amount DOUBLE,\n" +
    " Total_amount DOUBLE\n" +
    ")");
assertUpdate(session, "INSERT INTO " + testTable + " VALUES(1, TIMESTAMP '2025-09-19 10:00:00', TIMESTAMP '2025-09-19 10:10:00', 2, 3.5, 1, 10.0, 2.0, 12.0)", 1);
```
nit: make the two timestamp values different (though there is only 1 row)
```java
.setCatalogSessionProperty("hive", "insert_existing_partitions_behavior", "APPEND")
.setCatalogSessionProperty("hive", "collect_column_statistics_on_write", "true")
.build();
String testTable = HIVE_TEST_SCHEMA + ".test_table_" + randomNameSuffix();
```
`test_table_` -> `test_timestamp_statistics`
```java
"WITH (partitioned_by=ARRAY['dt'])");

assertUpdate(session, "INSERT INTO " + testTable + "(history_batch_timestamp, dt) VALUES (CAST(CURRENT_TIMESTAMP AS TIMESTAMP), '2026-02-03')", 1);
assertUpdate(session, "INSERT INTO " + testTable + "(history_batch_timestamp, dt) VALUES (CAST(CURRENT_TIMESTAMP AS TIMESTAMP), '2026-02-03')", 1);
```
Is the second insert operation for testing the overwrite behavior? If so, can we add an assertion for it, e.g. on the value count?
Description

Added support for the `timestamp` type in statistics calculation.

To maintain compatibility with Hive statistics, the computed `min` and `max` values are truncated to seconds during write by dividing by `MICROSECONDS_PER_SECOND`, using floor for `min` and ceil for `max` to avoid losing millisecond precision. When reading, values are converted back to Trino format by multiplying by `MICROSECONDS_PER_SECOND`. This solution supports both Hive Metastore version 3 and version 4.
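The write/read round trip described above can be sketched in isolation (the names below are illustrative, not the PR's actual code): truncate epoch micros to seconds for the metastore, scale back to micros on read, and the recovered range still contains the original value, widened by less than one second on each side.

```java
public class TimestampStatsRoundTrip
{
    static final long MICROSECONDS_PER_SECOND = 1_000_000;

    // Ceiling division for longs; Math.ceilDiv is only available since Java 18.
    static long ceilDiv(long x, long y)
    {
        return -Math.floorDiv(-x, y);
    }

    // Write path: seconds stored in the metastore for a micro-precision min.
    static long storedLowSeconds(long micros)
    {
        return Math.floorDiv(micros, MICROSECONDS_PER_SECOND);
    }

    // Write path: seconds stored in the metastore for a micro-precision max.
    static long storedHighSeconds(long micros)
    {
        return ceilDiv(micros, MICROSECONDS_PER_SECOND);
    }

    // Read path: convert metastore seconds back to Trino epoch micros.
    static long toMicros(long seconds)
    {
        return seconds * MICROSECONDS_PER_SECOND;
    }

    public static void main(String[] args)
    {
        long micros = 1_758_276_600_123_456L; // hypothetical sample value
        long low = toMicros(storedLowSeconds(micros));
        long high = toMicros(storedHighSeconds(micros));
        // The round trip must bracket the original value.
        if (!(low <= micros && micros <= high)) {
            throw new AssertionError("round trip must cover the original value");
        }
    }
}
```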
`TIMESTAMP` statistics in Hive #5859

Release notes
(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text: