Skip to content

[Enhancement](ms) Add sharded LRU cache for tablet index metadata to reduce FDB IO#61666

Open
wyxxxcat wants to merge 3 commits intoapache:masterfrom
wyxxxcat:ms_lru_cache
Open

[Enhancement](ms) Add sharded LRU cache for tablet index metadata to reduce FDB IO#61666
wyxxxcat wants to merge 3 commits intoapache:masterfrom
wyxxxcat:ms_lru_cache

Conversation

@wyxxxcat
Copy link
Copy Markdown
Collaborator

@wyxxxcat wyxxxcat commented Mar 24, 2026

Summary

This commit mainly does three things:

  1. Move lru_cache from be/src/util/ to common/cpp/ so Cloud components can reuse the same ShardedLRUCache implementation.
  2. Hook BE LRUCachePolicy into a unified cache metrics recorder, including capacity, usage, hit ratio, lookup/hit/miss, and stampede metrics.
  3. Add a tablet index KV cache to Cloud MetaService and Recycler, introduce capacity and TTL configs, invalidate cache entries when tablet index records are removed, and add Cloud-side unit tests.

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary:
Cloud tablet index lookups in meta service and recycler repeatedly read from txn kv, while
the LRU cache implementation was only available under BE utility code and could not be reused
directly by cloud components. This change moves the shared LRU cache into common code, adds
a cloud KV cache wrapper for tablet index metadata with configurable capacity and TTL, wires
cache invalidation into recycler cleanup, and hooks BE LRU caches into unified cache metrics.

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@wyxxxcat wyxxxcat force-pushed the ms_lru_cache branch 2 times, most recently from 4311778 to 6b47a05 Compare March 24, 2026 09:26
@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run buildall

Comment thread cloud/src/common/kv_cache.h
@wyxxxcat wyxxxcat force-pushed the ms_lru_cache branch 3 times, most recently from 3065073 to e15e9bc Compare March 30, 2026 06:15
@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run buildall

@wyxxxcat wyxxxcat marked this pull request as ready for review March 30, 2026 07:04
@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review Summary

This PR adds a sharded LRU cache for meta_tablet_idx_key lookups in both MetaService and Recycler, and also refactors the LRU cache from be/src/util/ to common/cpp/ to share it with the Cloud module. The refactoring to separate metrics from the core cache is well-structured.

However, I found several issues ranging from correctness to performance concerns.

Critical Checkpoint Conclusions

1. Goal and Correctness: The goal is to cache immutable tablet index metadata to reduce FDB reads. The core caching logic works, but there are missing cache invalidation paths in the MetaService (see below), and the LOG(INFO) on every cache hit is a performance anti-pattern that will flood logs.

2. Minimal and Focused: The PR mixes two independent concerns: (a) LRU cache refactoring from BE to common/cpp with metrics extraction, and (b) adding KvCache for cloud. These should ideally be separate PRs, but this is a style concern.

3. Concurrency: The ShardedLRUCache handles concurrency via per-shard mutexes. The g_ms_cache_manager and g_recycler_cache_manager globals are initialized once and read concurrently thereafter (the cache methods are thread-safe). No concurrency issues found.

4. Lifecycle Management: Both g_ms_cache_manager and g_recycler_cache_manager are allocated with new but never deleted. This is an intentional process-lifetime pattern but creates issues: (a) If MetaServiceImpl is constructed multiple times (e.g., in tests), the old cache pointer leaks. (b) ASAN/LeakSanitizer will flag these.

5. Configuration: New configs (enable_tablet_index_cache, ms_tablet_index_cache_capacity, etc.) are CONF_mBool/CONF_mInt64 (mutable), but the cache is only created once at startup. Changing these configs at runtime has no effect - the cache won't be resized or recreated. Either make them non-mutable or implement runtime reconfiguration.

6. Incompatible Changes: The LRU cache insert() now requires an explicit CacheValueDeleter parameter where it previously defaulted to nullptr. All existing callers are updated. No incompatibility concern.

7. Parallel Code Paths: The MS cache lacks invalidation in repair_tablet_index (meta_service_txn.cpp:2150) and fix_tablet_db_id (meta_service.cpp:5719) which modify TabletIndexPB.db_id via txn->put() without invalidating the cache.

8. Test Coverage: Unit tests cover basic get/put, eviction, invalidation, clear, and concurrent access. Missing: TTL expiration test, test for the key encoding edge cases, integration test verifying cache correctness with actual MS operations.

9. Observability: Cache hit/miss metrics are not added for the cloud KvCache (only for BE LRU caches via BELRUCacheMetricsRecorder). The LOG(INFO) on every cache hit is not a substitute for proper metrics.

10. Performance: The LOG(INFO) on every cache hit is the most significant performance issue - this function is called on hot paths like commit_txn, and the whole point is to reduce overhead.

11. Missing blank line between init_recycler_cache() closing brace and namespace config { in recycler/util.cpp.

Comment thread cloud/src/meta-service/meta_service.cpp Outdated
Comment thread cloud/src/meta-service/meta_service.cpp
Comment thread cloud/src/recycler/util.cpp Outdated
Comment thread cloud/src/recycler/util.cpp
Comment thread cloud/src/common/config.h
Comment thread cloud/src/recycler/util.cpp
@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run buildall

@doris-robot
Copy link
Copy Markdown

TPC-H: Total hot run time: 26844 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8d0e9b8f0db6b371de2f4bfe8e28a4fe03fc6eca, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17630	4487	4311	4311
q2	q3	10638	799	524	524
q4	4684	381	251	251
q5	7558	1221	1043	1043
q6	174	173	145	145
q7	796	839	683	683
q8	9848	1516	1329	1329
q9	5050	4986	4728	4728
q10	6331	1950	1669	1669
q11	489	245	258	245
q12	746	575	469	469
q13	18091	2744	1930	1930
q14	225	236	212	212
q15	q16	729	738	671	671
q17	773	866	460	460
q18	5993	5458	5293	5293
q19	1413	967	610	610
q20	542	496	393	393
q21	4505	1841	1604	1604
q22	488	366	274	274
Total cold run time: 96703 ms
Total hot run time: 26844 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4905	4688	4551	4551
q2	q3	3881	4363	3833	3833
q4	873	1200	801	801
q5	4150	4485	4369	4369
q6	182	177	140	140
q7	1767	1685	1525	1525
q8	2533	2703	2621	2621
q9	7689	7450	7422	7422
q10	3860	4016	3604	3604
q11	528	432	420	420
q12	494	617	465	465
q13	2596	3008	2056	2056
q14	281	318	302	302
q15	q16	724	758	704	704
q17	1213	1365	1377	1365
q18	7229	6896	6589	6589
q19	899	941	917	917
q20	2088	2162	1997	1997
q21	4067	3548	3329	3329
q22	461	445	379	379
Total cold run time: 50420 ms
Total hot run time: 47389 ms

@doris-robot
Copy link
Copy Markdown

TPC-DS: Total hot run time: 169470 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8d0e9b8f0db6b371de2f4bfe8e28a4fe03fc6eca, data reload: false

query5	4357	647	508	508
query6	345	227	201	201
query7	4223	466	264	264
query8	347	238	227	227
query9	8710	2732	2733	2732
query10	553	406	326	326
query11	6989	5099	4891	4891
query12	181	132	123	123
query13	1274	483	358	358
query14	5665	3704	3463	3463
query14_1	2876	2900	2844	2844
query15	209	201	179	179
query16	1001	469	476	469
query17	1018	702	588	588
query18	2438	441	350	350
query19	218	206	181	181
query20	130	125	125	125
query21	213	136	112	112
query22	13481	14336	14611	14336
query23	16779	16298	15988	15988
query23_1	16167	16059	15576	15576
query24	7200	1626	1233	1233
query24_1	1230	1234	1224	1224
query25	533	464	417	417
query26	1238	258	151	151
query27	2783	492	298	298
query28	4502	1853	1862	1853
query29	877	559	469	469
query30	302	223	190	190
query31	993	968	868	868
query32	89	68	69	68
query33	508	337	288	288
query34	904	857	521	521
query35	635	690	612	612
query36	1038	1119	996	996
query37	130	94	82	82
query38	2946	2899	2886	2886
query39	857	845	826	826
query39_1	795	805	796	796
query40	237	151	135	135
query41	66	62	59	59
query42	257	254	258	254
query43	247	244	223	223
query44	
query45	231	189	187	187
query46	881	985	608	608
query47	2173	2115	2064	2064
query48	311	323	227	227
query49	630	461	398	398
query50	680	292	205	205
query51	4068	4055	4009	4009
query52	268	269	257	257
query53	291	346	289	289
query54	316	265	265	265
query55	92	85	91	85
query56	325	313	302	302
query57	1835	1700	1675	1675
query58	276	284	267	267
query59	2783	2973	2753	2753
query60	334	335	316	316
query61	158	148	153	148
query62	632	583	536	536
query63	307	283	276	276
query64	5133	1271	1023	1023
query65	
query66	1473	454	370	370
query67	24476	24323	24222	24222
query68	
query69	414	322	307	307
query70	1006	940	939	939
query71	346	321	313	313
query72	3120	2886	2648	2648
query73	545	553	317	317
query74	9652	9648	9473	9473
query75	2900	2777	2507	2507
query76	2275	1044	698	698
query77	369	409	312	312
query78	10917	11050	10501	10501
query79	3180	778	560	560
query80	1731	613	527	527
query81	574	262	222	222
query82	1019	154	121	121
query83	334	267	248	248
query84	281	126	97	97
query85	917	483	466	466
query86	496	293	302	293
query87	3158	3139	2994	2994
query88	3587	2671	2656	2656
query89	431	369	344	344
query90	2073	192	177	177
query91	175	164	137	137
query92	83	78	74	74
query93	1606	839	506	506
query94	653	311	296	296
query95	574	338	324	324
query96	641	517	228	228
query97	2452	2512	2386	2386
query98	238	224	238	224
query99	1017	1001	913	913
Total cold run time: 254485 ms
Total hot run time: 169470 ms

@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run p0

@wyxxxcat wyxxxcat force-pushed the ms_lru_cache branch 4 times, most recently from d70b6c8 to 4c0db13 Compare April 1, 2026 02:51
@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

wyxxxcat commented Apr 1, 2026

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 68.35% (108/158) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 78.09% (1846/2364)
Line Coverage 64.04% (32715/51083)
Region Coverage 64.90% (16378/25235)
Branch Coverage 55.33% (8724/15766)

@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

wyxxxcat commented Apr 1, 2026

run buildall

@doris-robot
Copy link
Copy Markdown

TPC-H: Total hot run time: 26581 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3d6117ddfee4422530926cef13fcfd1162c1a1ba, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17634	4440	4318	4318
q2	q3	10665	785	524	524
q4	4681	348	256	256
q5	7565	1210	1014	1014
q6	170	172	143	143
q7	771	843	666	666
q8	9304	1481	1355	1355
q9	4786	4790	4727	4727
q10	6251	1933	1655	1655
q11	456	269	247	247
q12	699	580	465	465
q13	18037	2686	1939	1939
q14	223	229	205	205
q15	q16	719	743	664	664
q17	748	840	445	445
q18	6030	5377	5291	5291
q19	1119	985	616	616
q20	534	488	377	377
q21	4377	1866	1422	1422
q22	347	297	252	252
Total cold run time: 95116 ms
Total hot run time: 26581 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4877	4711	4614	4614
q2	q3	3909	4352	3856	3856
q4	864	1233	795	795
q5	4113	4402	4340	4340
q6	186	180	145	145
q7	1781	1607	1548	1548
q8	2507	2756	2561	2561
q9	7617	7468	7493	7468
q10	3788	4025	3571	3571
q11	526	460	416	416
q12	482	597	444	444
q13	2496	2906	2103	2103
q14	279	297	289	289
q15	q16	711	753	713	713
q17	1141	1319	1321	1319
q18	7337	6881	6884	6881
q19	908	919	886	886
q20	2154	2210	2028	2028
q21	3930	3641	3282	3282
q22	463	419	384	384
Total cold run time: 50069 ms
Total hot run time: 47643 ms

@wyxxxcat wyxxxcat force-pushed the ms_lru_cache branch 3 times, most recently from f56e0b0 to 25f5ef3 Compare April 15, 2026 02:54
@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (54/54) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.56% (27352/37181)
Line Coverage 57.22% (295517/516435)
Region Coverage 54.41% (246034/452167)
Branch Coverage 56.06% (106613/190184)

@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run buildall

@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run buildall

@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run cloudut

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 66.67% (36/54) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.22% (20307/38158)
Line Coverage 36.77% (191310/520265)
Region Coverage 33.09% (148755/449584)
Branch Coverage 34.17% (64996/190216)

@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 66.67% (36/54) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.25% (20307/38135)
Line Coverage 36.78% (191282/520056)
Region Coverage 33.11% (148726/449246)
Branch Coverage 34.20% (65013/190122)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (54/54) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.81% (27575/37359)
Line Coverage 57.52% (298258/518512)
Region Coverage 54.64% (247744/453451)
Branch Coverage 56.24% (107269/190718)

@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run buildall

Comment thread cloud/src/meta-service/meta_service.cpp Outdated
if (cache == nullptr) {
return false;
}
LOG(INFO) << "get from cache";
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug, should remove

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 66.67% (36/54) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.30% (20335/38153)
Line Coverage 36.85% (191734/520341)
Region Coverage 33.14% (149015/449651)
Branch Coverage 34.26% (65206/190326)

@w41ter
Copy link
Copy Markdown
Contributor

w41ter commented Apr 22, 2026

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  1. The new tablet-index caches are never invalidated when meta_tablet_idx_key is rewritten or removed. That makes the cache diverge from TxnKv on existing repair/delete paths such as fix_tablet_db_id() and recycler cleanup. With the default tablet_index_cache_ttl_seconds = 0, those stale TabletIndexPB entries can survive until process restart.
  2. The recycler cache can therefore make check_lazy_txn_finished() consult the wrong partition-version key after a tablet-index repair. If the cached db_id is stale, the current code can see TXN_KEY_NOT_FOUND and treat a still-pending lazy txn as finished.
  3. The new cache configs are declared mutable, but the caches are constructed as function-local statics. After the first lookup, runtime changes to ms_tablet_index_cache_capacity, recycler_tablet_index_cache_capacity, or tablet_index_cache_ttl_seconds have no effect until restart.

Critical Checkpoints

  • Goal / correctness / tests: The PR goal is clear: reuse the shared LRU cache and reduce repeated TxnKv reads for tablet-index metadata. The basic plumbing and unit tests are in place, but the cache is not yet correct under existing tablet-index mutation paths, and the added tests do not cover those stale-cache cases.
  • Scope / focus: The change is reasonably focused on cache reuse and metrics, but the Cloud cache integration is not fully wired into all existing write/delete paths for the same metadata.
  • Concurrency: The underlying cache implementation is thread-safe, but concurrent readers and writers are not coherent because tablet-index writers do not invalidate cached entries. With TTL 0, stale metadata can persist indefinitely.
  • Lifecycle / static initialization: Function-local statics avoid cross-TU initialization hazards, but they also freeze cache parameters after first construction.
  • Config items: enable_tablet_index_cache, ms_tablet_index_cache_capacity, recycler_tablet_index_cache_capacity, and tablet_index_cache_ttl_seconds were added. Only the boolean is effectively re-read dynamically; capacity and TTL are not.
  • Parallel code paths: The non-versioned MetaService and Recycler read paths were updated to use the cache, but existing repair/delete paths that mutate meta_tablet_idx_key were not updated to invalidate it.
  • Conditional checks: The compatibility handling for historical TabletIndexPB without db_id is understandable, but stale cached routing info now undermines that upgrade path.
  • Test coverage: Added tests cover basic cache operations and some lazy-commit timing behavior, but there is no test for cache invalidation after tablet-index repair/delete or for runtime config changes after first cache creation.
  • Test results changed: No generated regression outputs are involved. The new unit tests do not prove correctness of the mutation/invalidation cases above.
  • Observability: BE cache metrics wiring looks fine from review. Cloud cache paths still lack observability around invalidation/staleness; not blocking on its own.
  • Transaction / persistence: TxnKv writes remain transactional, but cached tablet-index state can diverge from persisted state after rewrites/removals.
  • Data writes / modifications: Underlying writes/removals are atomic; the problem is stale cached metadata driving later read/recycle decisions.
  • FE/BE variable passing: Not applicable.
  • Performance: The steady-state read optimization is valuable, but the zero-TTL default makes missing invalidation especially risky because stale entries do not self-heal.
  • Other issues: I did not find another blocking issue in the BE metrics refactor itself from code inspection.

Tests reviewed but not executed in this review.

Comment thread cloud/src/meta-service/meta_service.cpp
Comment thread cloud/src/recycler/util.cpp
Comment thread cloud/src/common/config.h Outdated
@hello-stephen
Copy link
Copy Markdown
Contributor

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

Cloud UT Coverage Report

Increment line coverage 83.51% (157/188) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 77.72% (1894/2437)
Line Coverage 64.70% (33493/51765)
Region Coverage 65.16% (16586/25454)
Branch Coverage 55.72% (8852/15888)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (54/54) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.55% (27501/37390)
Line Coverage 57.30% (297481/519158)
Region Coverage 54.49% (247507/454183)
Branch Coverage 56.07% (107142/191096)

@wyxxxcat
Copy link
Copy Markdown
Collaborator Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (54/54) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.83% (26857/37390)
Line Coverage 55.01% (285597/519164)
Region Coverage 51.95% (235926/454183)
Branch Coverage 53.37% (101997/191096)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants