[Enhancement](ms) Add sharded LRU cache for tablet index metadata to reduce FDB IO#61666
[Enhancement](ms) Add sharded LRU cache for tablet index metadata to reduce FDB IO#61666wyxxxcat wants to merge 3 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
4311778 to
6b47a05
Compare
|
run buildall |
3065073 to
e15e9bc
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
Code Review Summary
This PR adds a sharded LRU cache for meta_tablet_idx_key lookups in both MetaService and Recycler, and also refactors the LRU cache from be/src/util/ to common/cpp/ to share it with the Cloud module. The refactoring to separate metrics from the core cache is well-structured.
However, I found several issues ranging from correctness to performance concerns.
Critical Checkpoint Conclusions
1. Goal and Correctness: The goal is to cache immutable tablet index metadata to reduce FDB reads. The core caching logic works, but there are missing cache invalidation paths in the MetaService (see below), and the LOG(INFO) on every cache hit is a performance anti-pattern that will flood logs.
2. Minimal and Focused: The PR mixes two independent concerns: (a) LRU cache refactoring from BE to common/cpp with metrics extraction, and (b) adding KvCache for cloud. These should ideally be separate PRs, but this is a style concern.
3. Concurrency: The ShardedLRUCache handles concurrency via per-shard mutexes. The g_ms_cache_manager and g_recycler_cache_manager globals are initialized once and read concurrently thereafter (the cache methods are thread-safe). No concurrency issues found.
4. Lifecycle Management: Both g_ms_cache_manager and g_recycler_cache_manager are allocated with new but never deleted. This is an intentional process-lifetime pattern but creates issues: (a) If MetaServiceImpl is constructed multiple times (e.g., in tests), the old cache pointer leaks. (b) ASAN/LeakSanitizer will flag these.
5. Configuration: New configs (enable_tablet_index_cache, ms_tablet_index_cache_capacity, etc.) are CONF_mBool/CONF_mInt64 (mutable), but the cache is only created once at startup. Changing these configs at runtime has no effect - the cache won't be resized or recreated. Either make them non-mutable or implement runtime reconfiguration.
6. Incompatible Changes: The LRU cache insert() now requires an explicit CacheValueDeleter parameter where it previously defaulted to nullptr. All existing callers are updated. No incompatibility concern.
7. Parallel Code Paths: The MS cache lacks invalidation in repair_tablet_index (meta_service_txn.cpp:2150) and fix_tablet_db_id (meta_service.cpp:5719) which modify TabletIndexPB.db_id via txn->put() without invalidating the cache.
8. Test Coverage: Unit tests cover basic get/put, eviction, invalidation, clear, and concurrent access. Missing: TTL expiration test, test for the key encoding edge cases, integration test verifying cache correctness with actual MS operations.
9. Observability: Cache hit/miss metrics are not added for the cloud KvCache (only for BE LRU caches via BELRUCacheMetricsRecorder). The LOG(INFO) on every cache hit is not a substitute for proper metrics.
10. Performance: The LOG(INFO) on every cache hit is the most significant performance issue - this function is called on hot paths like commit_txn, and the whole point is to reduce overhead.
11. Missing blank line between init_recycler_cache() closing brace and namespace config { in recycler/util.cpp.
|
run buildall |
TPC-H: Total hot run time: 26844 ms |
TPC-DS: Total hot run time: 169470 ms |
|
run p0 |
d70b6c8 to
4c0db13
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 26581 ms |
f56e0b0 to
25f5ef3
Compare
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
|
run buildall |
|
run cloudut |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
| if (cache == nullptr) { | ||
| return false; | ||
| } | ||
| LOG(INFO) << "get from cache"; |
There was a problem hiding this comment.
debug, should remove
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Findings
- The new tablet-index caches are never invalidated when
meta_tablet_idx_keyis rewritten or removed. That makes the cache diverge from TxnKv on existing repair/delete paths such asfix_tablet_db_id()and recycler cleanup. With the defaulttablet_index_cache_ttl_seconds = 0, those staleTabletIndexPBentries can survive until process restart. - The recycler cache can therefore make
check_lazy_txn_finished()consult the wrong partition-version key after a tablet-index repair. If the cacheddb_idis stale, the current code can seeTXN_KEY_NOT_FOUNDand treat a still-pending lazy txn as finished. - The new cache configs are declared mutable, but the caches are constructed as function-local statics. After the first lookup, runtime changes to
ms_tablet_index_cache_capacity,recycler_tablet_index_cache_capacity, ortablet_index_cache_ttl_secondshave no effect until restart.
Critical Checkpoints
- Goal / correctness / tests: The PR goal is clear: reuse the shared LRU cache and reduce repeated TxnKv reads for tablet-index metadata. The basic plumbing and unit tests are in place, but the cache is not yet correct under existing tablet-index mutation paths, and the added tests do not cover those stale-cache cases.
- Scope / focus: The change is reasonably focused on cache reuse and metrics, but the Cloud cache integration is not fully wired into all existing write/delete paths for the same metadata.
- Concurrency: The underlying cache implementation is thread-safe, but concurrent readers and writers are not coherent because tablet-index writers do not invalidate cached entries. With TTL 0, stale metadata can persist indefinitely.
- Lifecycle / static initialization: Function-local statics avoid cross-TU initialization hazards, but they also freeze cache parameters after first construction.
- Config items:
enable_tablet_index_cache,ms_tablet_index_cache_capacity,recycler_tablet_index_cache_capacity, andtablet_index_cache_ttl_secondswere added. Only the boolean is effectively re-read dynamically; capacity and TTL are not. - Parallel code paths: The non-versioned MetaService and Recycler read paths were updated to use the cache, but existing repair/delete paths that mutate
meta_tablet_idx_keywere not updated to invalidate it. - Conditional checks: The compatibility handling for historical
TabletIndexPBwithoutdb_idis understandable, but stale cached routing info now undermines that upgrade path. - Test coverage: Added tests cover basic cache operations and some lazy-commit timing behavior, but there is no test for cache invalidation after tablet-index repair/delete or for runtime config changes after first cache creation.
- Test results changed: No generated regression outputs are involved. The new unit tests do not prove correctness of the mutation/invalidation cases above.
- Observability: BE cache metrics wiring looks fine from review. Cloud cache paths still lack observability around invalidation/staleness; not blocking on its own.
- Transaction / persistence: TxnKv writes remain transactional, but cached tablet-index state can diverge from persisted state after rewrites/removals.
- Data writes / modifications: Underlying writes/removals are atomic; the problem is stale cached metadata driving later read/recycle decisions.
- FE/BE variable passing: Not applicable.
- Performance: The steady-state read optimization is valuable, but the zero-TTL default makes missing invalidation especially risky because stale entries do not self-heal.
- Other issues: I did not find another blocking issue in the BE metrics refactor itself from code inspection.
Tests reviewed but not executed in this review.
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
Summary
This commit mainly does three things:
lru_cachefrombe/src/util/tocommon/cpp/so Cloud components can reuse the sameShardedLRUCacheimplementation.LRUCachePolicyinto a unified cache metrics recorder, including capacity, usage, hit ratio, lookup/hit/miss, and stampede metrics.MetaServiceandRecycler, introduce capacity and TTL configs, invalidate cache entries when tablet index records are removed, and add Cloud-side unit tests.What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary:
Cloud tablet index lookups in meta service and recycler repeatedly read from txn kv, while
the LRU cache implementation was only available under BE utility code and could not be reused
directly by cloud components. This change moves the shared LRU cache into common code, adds
a cloud KV cache wrapper for tablet index metadata with configurable capacity and TTL, wires
cache invalidation into recycler cleanup, and hooks BE LRU caches into unified cache metrics.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)