
[improve](compaction) Use segment footer raw_data_bytes for first-time batch size estimation#62263

Open
Yukang-Lian wants to merge 6 commits into apache:master from Yukang-Lian:fix/compaction-batch-size-adaptive

Conversation


@Yukang-Lian Yukang-Lian commented Apr 9, 2026

Summary

  • When vertical compaction runs for the first time on a tablet (no historical sampling data), estimate_batch_size() previously returned a hardcoded value of 992, which could cause OOM for wide tables or be too conservative for narrow tables
  • This change uses ColumnMetaPB.raw_data_bytes from the segment footer to compute a per-row size estimate for the first compaction. raw_data_bytes records the original data size before encoding, which closely approximates the runtime Block::bytes()
  • Historical sampling now uses Block::allocated_bytes() instead of bytes() for more accurate memory estimation (analogous to size() vs capacity())
  • Subsequent compactions with historical sampling data are completely unchanged

Key design decisions

| Column type | Estimation strategy |
| --- | --- |
| Scalar (INT/VARCHAR etc.) | raw_data_bytes / rows_with_data + structural compensation (+1 null map, +8 offset) |
| Complex (ARRAY/MAP/STRUCT) | raw_data_bytes / rows_with_data, no compensation (already includes recursive sub-writer data) |
| VARIANT (root/subcolumn) | Fallback to 992 (raw_data_bytes is currently 0; TODO in writer) |

Performance safeguards

  • Footer collection only runs on first compaction (no historical sampling data)
  • Skipped entirely when compaction_batch_size is manually set
  • OOM backoff and sparse optimization paths are untouched

Test plan

  • Wide table (200+ columns) first compaction does not OOM
  • Narrow table first compaction batch_size is close to upper limit
  • Multi-round compaction: first round uses footer, subsequent rounds use historical sampling
  • Variant columns fallback to 992
  • Sparse optimization is not affected
  • TestFirstCompactionUsesFooterEstimation unit test passes

…e batch size estimation

When vertical compaction runs for the first time on a tablet (no historical
sampling data), estimate_batch_size() previously returned a hardcoded value
of 992, which could cause OOM for wide tables or be too conservative for
narrow tables.

This change uses ColumnMetaPB.raw_data_bytes from the segment footer to compute
a per-row size estimate for the first compaction. raw_data_bytes records the
original data size before encoding, which closely approximates runtime
Block::bytes(). Subsequent compactions continue to use the existing
historical sampling mechanism unchanged.

Key design decisions:
- Footer collection only runs when needed (no manual override, and at least
  one column group lacks historical sampling data)
- Variant columns (raw_data_bytes=0 TODO) trigger fallback to 992
- Structural overhead (+1 null map, +8 offset) only added for scalar columns
  with actual footer data
- Complex types (ARRAY/MAP/STRUCT) use raw_data_bytes directly without
  structural compensation as it already includes recursive sub-writer data
- Historical sampling now uses Block::allocated_bytes() instead of bytes()
  for more accurate memory estimation
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Yukang-Lian
Collaborator Author

run buildall

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 88.54% (85/96) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 73.63% (27379/37187) |
| Line Coverage | 57.27% (295605/516179) |
| Region Coverage | 54.53% (246468/451954) |
| Branch Coverage | 56.21% (106878/190125) |

…ion init

Log per_row, sample_bytes, sample_rows immediately after all merge inputs
finish loading their first block, before the actual merge starts. This helps
diagnose memory issues by showing the actual per-row memory size at init time.
The log was added to help diagnose vertical compaction memory issues.
Investigation is complete; the existing 'estimate batch size' log in
merger.cpp already provides per-group batch_size and per_row info for
daily monitoring.
@Yukang-Lian
Collaborator Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 86.46% (83/96) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 53.34% (20373/38195) |
| Line Coverage | 36.88% (192015/520641) |
| Region Coverage | 33.19% (149367/450086) |
| Branch Coverage | 34.30% (65335/190467) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 88.54% (85/96) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 73.48% (27488/37408) |
| Line Coverage | 57.29% (297376/519044) |
| Region Coverage | 54.47% (247437/454243) |
| Branch Coverage | 56.08% (107127/191040) |
