Conversation

@uchenily
Contributor

@uchenily uchenily commented Jan 12, 2026

What problem does this PR solve?

This PR introduces asynchronous processing for edit log operations in the TabletScheduler to reduce blocking and improve overall system responsiveness.

Previously, synchronous EditLog operations in the TabletScheduler slowed the dispatch of clone tasks. As a result, raising settings such as schedule_batch_size and schedule_slot_num_per_hdd_path/schedule_slot_num_per_ssd_path to large values did not effectively increase the replica repair rate across the cluster. In large-scale clusters in particular, the overall replica repair rate is constrained by the FE TabletScheduler and does not increase as BE nodes are added. This PR therefore makes the following improvements:

  • Move EditLog operations to a dedicated thread pool so that they do not block the scheduler thread
  • Remove unnecessary write locks to reduce lock contention
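
The first improvement can be illustrated with a minimal sketch. This is not the code in this PR: `EditLogWriter` and `AsyncEditLogDispatcher` are hypothetical names used only for illustration, and the sketch assumes the edit log write can be modeled as a blocking call that takes a single entry.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a slow, blocking edit log write. */
interface EditLogWriter {
    void write(String entry);
}

/** Hands edit log writes to a dedicated pool so the calling thread never waits on them. */
class AsyncEditLogDispatcher {
    private final ExecutorService pool;
    private final EditLogWriter writer;

    AsyncEditLogDispatcher(int threads, EditLogWriter writer) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.writer = writer;
    }

    /** Returns immediately; the write itself runs on a pool thread. */
    void submit(String entry) {
        pool.execute(() -> writer.write(entry));
    }

    /** Stops accepting new work and waits for queued writes; returns false on timeout or interrupt. */
    boolean shutdownAndWait(long timeoutMs) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With more than one pool thread, entries may be written in a different order than they were submitted, so a sketch like this only fits log entries that are independent of each other; a single-thread pool would preserve submission order.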

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Jan 12, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@uchenily
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32086 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 0a1d6fb5dd6a44bf0b6ee6c5d93256cc9cdb0156, data reload: false

------ Round 1 ----------------------------------
q1	17669	4376	4057	4057
q2	2021	357	235	235
q3	10167	1262	730	730
q4	10220	886	318	318
q5	7528	2027	1929	1929
q6	190	169	141	141
q7	938	781	662	662
q8	9260	1371	1137	1137
q9	4904	4650	4544	4544
q10	6802	1825	1399	1399
q11	519	297	288	288
q12	687	738	597	597
q13	17769	3827	3101	3101
q14	296	303	276	276
q15	598	527	516	516
q16	707	711	630	630
q17	672	795	519	519
q18	6990	6487	6862	6487
q19	1136	1077	626	626
q20	401	404	268	268
q21	3272	2686	2595	2595
q22	1116	1136	1031	1031
Total cold run time: 103862 ms
Total hot run time: 32086 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4338	4211	4298	4211
q2	350	408	331	331
q3	2211	2821	2369	2369
q4	1443	1926	1426	1426
q5	4560	4391	4262	4262
q6	220	164	125	125
q7	1991	1883	1835	1835
q8	2564	2342	2334	2334
q9	7041	7058	7016	7016
q10	2547	2736	2170	2170
q11	529	467	439	439
q12	656	690	585	585
q13	3325	3801	3067	3067
q14	260	294	276	276
q15	516	487	487	487
q16	648	657	628	628
q17	1087	1307	1328	1307
q18	7457	7259	6991	6991
q19	830	762	773	762
q20	1864	1950	1783	1783
q21	4546	4252	4095	4095
q22	1059	1005	1006	1005
Total cold run time: 50042 ms
Total hot run time: 47504 ms

@doris-robot

TPC-DS: Total hot run time: 172712 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0a1d6fb5dd6a44bf0b6ee6c5d93256cc9cdb0156, data reload: false

query5	4813	586	433	433
query6	326	219	208	208
query7	4220	452	255	255
query8	333	243	234	234
query9	8781	2662	2646	2646
query10	500	378	318	318
query11	15354	15096	14982	14982
query12	172	116	117	116
query13	1277	494	379	379
query14	6407	2981	2752	2752
query14_1	2654	2642	2658	2642
query15	206	201	178	178
query16	995	478	458	458
query17	1113	684	585	585
query18	2551	445	351	351
query19	234	228	196	196
query20	121	116	115	115
query21	213	142	118	118
query22	4000	3989	3940	3940
query23	16061	15588	15260	15260
query23_1	15406	15408	15616	15408
query24	7398	1538	1153	1153
query24_1	1163	1165	1181	1165
query25	522	448	395	395
query26	1241	262	153	153
query27	2768	445	285	285
query28	4537	2131	2126	2126
query29	773	531	430	430
query30	306	239	215	215
query31	820	637	529	529
query32	76	70	65	65
query33	534	331	318	318
query34	896	878	515	515
query35	738	755	668	668
query36	882	889	834	834
query37	133	91	81	81
query38	2665	2701	2651	2651
query39	788	763	740	740
query39_1	713	706	711	706
query40	219	132	115	115
query41	66	61	61	61
query42	107	103	102	102
query43	431	425	424	424
query44	1294	720	714	714
query45	183	181	177	177
query46	856	958	585	585
query47	1375	1459	1317	1317
query48	309	322	231	231
query49	600	419	328	328
query50	632	279	196	196
query51	3739	3757	3799	3757
query52	108	107	94	94
query53	292	324	270	270
query54	278	255	255	255
query55	79	75	67	67
query56	282	279	277	277
query57	1037	1034	939	939
query58	269	249	267	249
query59	1974	2046	2095	2046
query60	321	313	303	303
query61	156	157	157	157
query62	383	352	325	325
query63	297	272	262	262
query64	4963	1274	986	986
query65	3856	3689	3743	3689
query66	1422	419	299	299
query67	15553	14850	15114	14850
query68	2745	1008	743	743
query69	446	348	366	348
query70	1023	947	909	909
query71	309	296	272	272
query72	5894	3459	3442	3442
query73	589	715	293	293
query74	8872	8763	8599	8599
query75	2748	2789	2467	2467
query76	2459	1030	635	635
query77	355	347	273	273
query78	9681	9882	9168	9168
query79	945	887	581	581
query80	605	569	491	491
query81	500	256	222	222
query82	650	140	111	111
query83	255	253	236	236
query84	257	116	103	103
query85	835	522	446	446
query86	327	304	318	304
query87	2846	2840	2816	2816
query88	3150	2217	2193	2193
query89	391	340	330	330
query90	2080	154	147	147
query91	172	161	144	144
query92	69	66	60	60
query93	920	904	530	530
query94	457	330	291	291
query95	552	325	347	325
query96	576	450	202	202
query97	2362	2359	2285	2285
query98	213	200	197	197
query99	560	602	517	517
Total cold run time: 246920 ms
Total hot run time: 172712 ms

@uchenily
Contributor Author

run feut

@uchenily
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 31981 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 1c712bced61124d3f2169ad328da9096cc348ebd, data reload: false

------ Round 1 ----------------------------------
q1	17629	4246	4036	4036
q2	2051	364	238	238
q3	10161	1321	723	723
q4	10204	829	320	320
q5	7505	2061	1847	1847
q6	192	172	139	139
q7	933	796	654	654
q8	9262	1432	1197	1197
q9	4772	4679	4533	4533
q10	6763	1820	1385	1385
q11	527	290	288	288
q12	692	767	561	561
q13	17763	3781	3065	3065
q14	284	308	268	268
q15	575	512	508	508
q16	701	673	631	631
q17	656	766	541	541
q18	6589	6495	6827	6495
q19	1109	1059	685	685
q20	437	416	267	267
q21	3291	2616	2562	2562
q22	1167	1098	1038	1038
Total cold run time: 103263 ms
Total hot run time: 31981 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4436	4169	4279	4169
q2	334	400	310	310
q3	2324	2824	2393	2393
q4	1393	1913	1404	1404
q5	4397	4205	4376	4205
q6	217	172	132	132
q7	2001	1934	1752	1752
q8	2519	2591	2437	2437
q9	7122	7054	7049	7049
q10	2428	2651	2334	2334
q11	553	461	441	441
q12	728	793	640	640
q13	3592	4051	3307	3307
q14	263	308	253	253
q15	534	486	472	472
q16	634	627	625	625
q17	1097	1340	1378	1340
q18	7416	7263	7374	7263
q19	836	801	807	801
q20	1898	1965	1802	1802
q21	4524	4289	4108	4108
q22	1054	1040	968	968
Total cold run time: 50300 ms
Total hot run time: 48205 ms

@doris-robot

TPC-DS: Total hot run time: 173128 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 1c712bced61124d3f2169ad328da9096cc348ebd, data reload: false

query5	5226	615	441	441
query6	340	225	215	215
query7	4222	467	294	294
query8	335	247	232	232
query9	8762	2656	2652	2652
query10	544	376	320	320
query11	15242	15193	14899	14899
query12	177	113	114	113
query13	1243	483	384	384
query14	7141	2995	2780	2780
query14_1	2700	2677	2676	2676
query15	202	196	177	177
query16	987	481	464	464
query17	1137	699	581	581
query18	2698	451	346	346
query19	235	229	194	194
query20	123	119	112	112
query21	216	145	124	124
query22	3724	4055	3881	3881
query23	16159	15533	15255	15255
query23_1	15408	15367	15379	15367
query24	7195	1564	1201	1201
query24_1	1214	1197	1214	1197
query25	576	483	436	436
query26	1241	263	160	160
query27	2739	474	295	295
query28	4452	2158	2149	2149
query29	790	574	459	459
query30	311	250	215	215
query31	815	639	541	541
query32	78	71	70	70
query33	561	346	306	306
query34	916	918	538	538
query35	733	777	686	686
query36	885	874	848	848
query37	128	95	84	84
query38	2770	2754	2647	2647
query39	772	747	735	735
query39_1	725	728	712	712
query40	222	143	122	122
query41	74	73	68	68
query42	110	109	106	106
query43	478	472	427	427
query44	1371	740	735	735
query45	188	189	181	181
query46	901	996	605	605
query47	1404	1479	1384	1384
query48	315	348	273	273
query49	692	423	323	323
query50	664	296	210	210
query51	3806	3745	3716	3716
query52	106	109	100	100
query53	298	319	270	270
query54	283	256	248	248
query55	80	79	70	70
query56	298	296	294	294
query57	989	1013	1035	1013
query58	262	243	250	243
query59	2201	2267	2187	2187
query60	321	319	298	298
query61	162	163	159	159
query62	381	341	320	320
query63	308	268	271	268
query64	4845	1287	1018	1018
query65	3811	3792	3688	3688
query66	1374	411	307	307
query67	15639	15537	15178	15178
query68	6111	1017	716	716
query69	489	353	314	314
query70	999	980	872	872
query71	377	309	301	301
query72	6080	3507	3411	3411
query73	780	757	299	299
query74	8771	8793	8481	8481
query75	2825	2812	2425	2425
query76	3442	1078	652	652
query77	524	401	286	286
query78	9618	9921	9130	9130
query79	1231	928	592	592
query80	666	562	466	466
query81	487	261	229	229
query82	209	152	111	111
query83	273	263	259	259
query84	255	130	105	105
query85	872	517	447	447
query86	340	333	282	282
query87	2843	2922	2749	2749
query88	3137	2222	2213	2213
query89	393	361	334	334
query90	2046	154	152	152
query91	163	182	145	145
query92	71	71	61	61
query93	973	934	534	534
query94	576	331	286	286
query95	569	328	312	312
query96	608	466	209	209
query97	2328	2378	2282	2282
query98	229	209	197	197
query99	585	572	490	490
Total cold run time: 253129 ms
Total hot run time: 173128 ms

@dataroaring
Contributor

There is a sync edit log api, just use it.

Contributor

@dataroaring dataroaring left a comment

Use async editlog api.

@uchenily
Contributor Author

uchenily commented Jan 13, 2026

> Use async editlog api.

I noticed that there is a logEditWithQueue method, but it is actually a synchronous API and blocks the current thread, so it does not seem to meet our needs. We need to perform some updates after the edit log is written, such as finalizeTabletCtx.

Besides, logEditWithQueue seems to have been introduced in a recent version and cannot be used in earlier versions.
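
As a rough illustration of the requirement above (run a follow-up step only after the edit log write has finished, without blocking the caller), here is a minimal sketch using the JDK's `CompletableFuture`. `CloneTaskFinisher`, `logThenFinalize`, and both `Runnable` parameters are hypothetical names for illustration only; they are not Doris APIs and not the code in this PR.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

/** Chains a follow-up step after an asynchronous edit log write (illustrative only). */
class CloneTaskFinisher {
    private final Executor editLogPool;

    CloneTaskFinisher(Executor editLogPool) {
        this.editLogPool = editLogPool;
    }

    /**
     * Runs writeEditLog on the pool and returns at once. finalizeStep runs only
     * after the write completes normally; it is skipped if the write throws.
     */
    CompletableFuture<Void> logThenFinalize(Runnable writeEditLog, Runnable finalizeStep) {
        return CompletableFuture.runAsync(writeEditLog, editLogPool)
                .thenRun(finalizeStep);
    }
}
```

The caller can ignore the returned future to stay non-blocking, or attach error handling to it (for example with `exceptionally`) if a failed write needs to be reported.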

@uchenily
Contributor Author

run p0

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 7.89% (6/76) 🎉
Increment coverage report
Complete coverage report
