Skip to content

branch-4.0: [fix](cloud) checkpoint save cloud tablet stats to image (#60705)#62514

Open
mymeiyi wants to merge 1 commit intoapache:branch-4.0from
mymeiyi:branch-4.0-pick-60705
Open

branch-4.0: [fix](cloud) checkpoint save cloud tablet stats to image (#60705)#62514
mymeiyi wants to merge 1 commit intoapache:branch-4.0from
mymeiyi:branch-4.0-pick-60705

Conversation

@mymeiyi
Copy link
Copy Markdown
Contributor

@mymeiyi mymeiyi commented Apr 15, 2026

pick #60705

Copilot AI review requested due to automatic review settings April 15, 2026 06:10
@Thearas
Copy link
Copy Markdown
Contributor

Thearas commented Apr 15, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mymeiyi
Copy link
Copy Markdown
Contributor Author

mymeiyi commented Apr 15, 2026

run buildall

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR backports a cloud-mode fix to ensure checkpoint images remain fresh enough to carry up-to-date cloud tablet/replica statistics, and adds logic to post-process checkpoint metadata using the serving FE’s latest stats.

Changes:

  • Track latest image file timestamp and trigger checkpoint in cloud mode when the image is older than a configurable threshold.
  • Post-process checkpoint metadata in cloud mode to copy tablet/replica stats from the serving Env into the checkpoint Env before saving the image.
  • Persist additional cloud/replica stat fields into the image via @SerializedName, and initialize cloud table stats on FE start when empty.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
fe/fe-core/src/main/java/org/apache/doris/persist/Storage.java Track latest image “create time” (via file mtime) to support staleness-based checkpoint triggering.
fe/fe-core/src/main/java/org/apache/doris/master/Checkpoint.java Add cloud staleness trigger and post-process cloud metadata to copy stats into the checkpoint image.
fe/fe-core/src/main/java/org/apache/doris/cloud/catalog/CloudReplica.java Persist cloud replica tablet-stat timing fields into the image using short serialized names.
fe/fe-core/src/main/java/org/apache/doris/catalog/Replica.java Persist segmentCount / rowsetCount into the image via @SerializedName.
fe/fe-core/src/main/java/org/apache/doris/catalog/CloudTabletStatMgr.java On startup, initialize table stats from image-loaded tablet stats when current stats list is empty.
fe/fe-common/src/main/java/org/apache/doris/common/Config.java Add cloud checkpoint image staleness threshold config.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +455 to +465
for (Partition partition : partitions) {
Partition servingPartition = servingOlapTable.getPartition(partition.getId());
if (servingPartition == null) {
if (LOG.isDebugEnabled()) {
LOG.debug("serving partition is null. tableId: {}, partitionId: {}", table.getId(),
partition.getId());
}
continue;
}
// set tablet stats
setTabletStats(table.getId(), partition, servingPartition);
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postProcessCloudMetadata() traverses servingEnv metadata (tables/partitions/indexes/tablets) without taking the usual read locks. Partition’s index maps are plain HashMaps, so concurrent DDL/rollup updates on the serving catalog can lead to inconsistent reads or ConcurrentModificationException during checkpoint. Consider acquiring servingOlapTable.readLock() (and releasing in finally) while you resolve partitions/index/tablet/replica lookups (or otherwise snapshot the needed structures under lock).

Suggested change
for (Partition partition : partitions) {
Partition servingPartition = servingOlapTable.getPartition(partition.getId());
if (servingPartition == null) {
if (LOG.isDebugEnabled()) {
LOG.debug("serving partition is null. tableId: {}, partitionId: {}", table.getId(),
partition.getId());
}
continue;
}
// set tablet stats
setTabletStats(table.getId(), partition, servingPartition);
servingOlapTable.readLock();
try {
for (Partition partition : partitions) {
Partition servingPartition = servingOlapTable.getPartition(partition.getId());
if (servingPartition == null) {
if (LOG.isDebugEnabled()) {
LOG.debug("serving partition is null. tableId: {}, partitionId: {}", table.getId(),
partition.getId());
}
continue;
}
// set tablet stats
setTabletStats(table.getId(), partition, servingPartition);
}
} finally {
servingOlapTable.readUnlock();

Copilot uses AI. Check for mistakes.
Comment on lines +508 to +511
CloudReplica cloudReplica = (CloudReplica) replica;
CloudReplica servingCloudReplica = (CloudReplica) servingReplica;
cloudReplica.setStatsIntervalIndex(servingCloudReplica.getStatsIntervalIndex());
cloudReplica.setLastGetTabletStatsTime(servingCloudReplica.getLastGetTabletStatsTime());
Copy link

Copilot AI Apr 15, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setTabletStats() unconditionally casts both replica and servingReplica to CloudReplica. In cloud mode, the image loader can still deserialize replicas as base Replica (e.g., compatibility subtype LocalReplica or default subtype), so this can throw ClassCastException and fail the whole checkpoint. Please guard with instanceof (and only copy the cloud-specific fields when both sides are CloudReplica), so checkpoint remains best-effort for stats copying.

Suggested change
CloudReplica cloudReplica = (CloudReplica) replica;
CloudReplica servingCloudReplica = (CloudReplica) servingReplica;
cloudReplica.setStatsIntervalIndex(servingCloudReplica.getStatsIntervalIndex());
cloudReplica.setLastGetTabletStatsTime(servingCloudReplica.getLastGetTabletStatsTime());
if (replica instanceof CloudReplica && servingReplica instanceof CloudReplica) {
CloudReplica cloudReplica = (CloudReplica) replica;
CloudReplica servingCloudReplica = (CloudReplica) servingReplica;
cloudReplica.setStatsIntervalIndex(servingCloudReplica.getStatsIntervalIndex());
cloudReplica.setLastGetTabletStatsTime(servingCloudReplica.getLastGetTabletStatsTime());
}

Copilot uses AI. Check for mistakes.
@hello-stephen
Copy link
Copy Markdown
Contributor

FE UT Coverage Report

Increment line coverage 6.59% (6/91) 🎉
Increment coverage report
Complete coverage report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants