Skip to content

perf(amber): replace SyncExecutionResource sleeps with bounded result poll#5714

Open
Ma77Ball wants to merge 14 commits into
apache:mainfrom
Ma77Ball:perf/sync-execution-result-readiness-poll
Open

perf(amber): replace SyncExecutionResource sleeps with bounded result poll#5714
Ma77Ball wants to merge 14 commits into
apache:mainfrom
Ma77Ball:perf/sync-execution-result-readiness-poll

Conversation

@Ma77Ball

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • Replaces the two unconditional Thread.sleep(500) calls in SyncExecutionResource::executeWorkflowSync with a bounded readiness-poll (awaitResultsPersisted) that, per target operator, compares the expected output count from the in-memory stats store against the count committed to result storage (document.getCount), polling at 25ms with a 2s cap and early-exit.
  • Wires the poll into the two success paths only (TargetResultsReady before client shutdown, TerminalStateReached(COMPLETED) before reading storage); error, console-error, and timeout paths no longer wait.
  • Extracts the poll core into a package-private, injectable awaitUntil(targetIds, expectedCountOf, committedCountOf, timeout, interval, now, sleep) so the timeout and early-exit behavior is unit-testable with a fake clock.
  • Removes ~1s of latency from every synchronous run: the engine commits the result writer before COMPLETED is observable, so the poll returns on its first check in the normal case, and the cap is defensive insurance for storage lag.

Any related issues, documentation, discussions?

Closes: #5713

How was this PR tested?

  • Run the new unit spec: sbt "WorkflowExecutionService/testOnly org.apache.texera.web.resource.SyncExecutionResourceSpec", expecting succeeded 7, failed 0 (run under JDK 21 as CI does; the repo's JaCoCo 0.8.11 cannot instrument JDK 25 bytecode).
  • The 7 cases cover: empty target list (no wait), non-positive expected count (ready), operator with no result storage (ready), already-committed counts (returns on first check), counts that land mid-poll (waits then returns), all-targets-must-be-ready, and never-landing counts (gives up exactly at the 2s cap).
  • Manual end-to-end check: run a workflow through the synchronous endpoint (agent "run workflow" or "Execute To") and confirm results return with the same rows as before, now without the fixed ~1s delay.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions github-actions Bot added engine ci changes related to CI labels Jun 14, 2026
@codecov-commenter

codecov-commenter commented Jun 14, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 19.23077% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.05%. Comparing base (5869492) to head (9d6a21f).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...he/texera/web/resource/SyncExecutionResource.scala 19.23% 21 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5714      +/-   ##
============================================
- Coverage     53.17%   53.05%   -0.13%     
  Complexity     2660     2660              
============================================
  Files          1094     1094              
  Lines         42363    42350      -13     
  Branches       4556     4554       -2     
============================================
- Hits          22528    22469      -59     
- Misses        18507    18554      +47     
+ Partials       1328     1327       -1     
Flag Coverage Δ *Carryforward flag
access-control-service 70.31% <ø> (-0.14%) ⬇️ Carriedforward from 83df2b5
agent-service 34.36% <ø> (ø) Carriedforward from 83df2b5
amber 53.45% <19.23%> (-0.09%) ⬇️
computing-unit-managing-service 1.65% <ø> (+<0.01%) ⬆️ Carriedforward from 83df2b5
config-service 56.71% <ø> (ø) Carriedforward from 83df2b5
file-service 57.06% <ø> (ø) Carriedforward from 83df2b5
frontend 47.86% <ø> (-0.10%) ⬇️ Carriedforward from 83df2b5
pyamber 89.84% <ø> (-0.29%) ⬇️ Carriedforward from 83df2b5
python 90.80% <ø> (ø) Carriedforward from 83df2b5
workflow-compiling-service 58.69% <ø> (ø) Carriedforward from 83df2b5

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 2 worse · ⚪ 11 noise (<±5%) · 0 without baseline

Compared against main 5869492 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 437 0.267 21,821/30,886/30,886 us 🔴 -6.6% / 🟢 -11.7%
🟢 bs=100 sw=10 sl=64 938 0.572 103,720/126,469/126,469 us 🟢 -16.7% / 🟢 -9.5%
bs=1000 sw=10 sl=64 1,095 0.669 907,927/970,259/970,259 us ⚪ within ±5% / 🟢 -6.7%
Baseline details

Latest main 5869492 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 437 tuples/sec 468 tuples/sec 410.82 tuples/sec -6.6% +6.4%
bs=10 sw=10 sl=64 MB/s 0.267 MB/s 0.285 MB/s 0.251 MB/s -6.3% +6.5%
bs=10 sw=10 sl=64 p50 21,821 us 21,332 us 23,785 us +2.3% -8.3%
bs=10 sw=10 sl=64 p95 30,886 us 30,180 us 34,980 us +2.3% -11.7%
bs=10 sw=10 sl=64 p99 30,886 us 30,180 us 34,980 us +2.3% -11.7%
bs=100 sw=10 sl=64 throughput 938 tuples/sec 947 tuples/sec 891.94 tuples/sec -1.0% +5.2%
bs=100 sw=10 sl=64 MB/s 0.572 MB/s 0.578 MB/s 0.544 MB/s -1.0% +5.1%
bs=100 sw=10 sl=64 p50 103,720 us 105,540 us 112,277 us -1.7% -7.6%
bs=100 sw=10 sl=64 p95 126,469 us 151,831 us 139,802 us -16.7% -9.5%
bs=100 sw=10 sl=64 p99 126,469 us 151,831 us 139,802 us -16.7% -9.5%
bs=1000 sw=10 sl=64 throughput 1,095 tuples/sec 1,090 tuples/sec 1,041 tuples/sec +0.5% +5.2%
bs=1000 sw=10 sl=64 MB/s 0.669 MB/s 0.665 MB/s 0.635 MB/s +0.6% +5.3%
bs=1000 sw=10 sl=64 p50 907,927 us 920,944 us 972,714 us -1.4% -6.7%
bs=1000 sw=10 sl=64 p95 970,259 us 979,192 us 1,023,057 us -0.9% -5.2%
bs=1000 sw=10 sl=64 p99 970,259 us 979,192 us 1,023,057 us -0.9% -5.2%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,457.55,200,128000,437,0.267,21820.91,30886.37,30886.37
1,100,10,64,20,2132.83,2000,1280000,938,0.572,103720.49,126468.68,126468.68
2,1000,10,64,20,18258.70,20000,12800000,1095,0.669,907926.61,970258.95,970258.95

@github-actions github-actions Bot removed the ci changes related to CI label Jun 14, 2026
@Ma77Ball Ma77Ball marked this pull request as ready for review June 14, 2026 23:30
@Ma77Ball

Copy link
Copy Markdown
Contributor Author

/request-review @Yicong-Huang

@github-actions github-actions Bot requested a review from Yicong-Huang June 14, 2026 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Replace fixed Thread.sleep calls in SyncExecutionResource with a bounded result-readiness poll

2 participants