Skip to content

fix(amber): add timeout and retry to dataset file-service requests#5667

Draft
Ma77Ball wants to merge 3 commits into
apache:mainfrom
Ma77Ball:fix/dataset-file-document-request-timeout
Draft

fix(amber): add timeout and retry to dataset file-service requests#5667
Ma77Ball wants to merge 3 commits into
apache:mainfrom
Ma77Ball:fix/dataset-file-document-request-timeout

Conversation

@Ma77Ball

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • Route DatasetFileDocument's presigned-URL fetch and file download through a requests.Session with a (10s connect, 60s read) timeout, so a hung or unreachable file-service fails in bounded time instead of blocking the worker thread forever.
  • Mount a urllib3 Retry policy on the session (3 retries, exponential backoff, retrying on connection errors and 5xx); both calls are idempotent GETs, so retrying is safe.

Any related issues, documentation, discussions?

Closes: #5666

How was this PR tested?

  • Hardening change with no existing spec for this path; verified manually by loading the module and asserting the retry policy is wired (total=3, backoff=0.5, status_forcelist={500,502,503,504}, GET-only).
  • Confirmed a request to a non-routable host now fails in bounded time (ConnectTimeout) rather than hanging, where the old no-timeout call would block indefinitely.
  • ruff check and ruff format --check pass on the modified file.

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

  file-service requests

  DatasetFileDocument made two requests.get() calls
  (presigned-URL fetch
  and file download) with no timeout, so a hung or
  unreachable
  file-service would block the worker thread
  indefinitely.

  Route both calls through a Session with a (10s
  connect, 60s read)
  timeout and a urllib3 Retry policy (3 retries,
  exponential backoff,
  retrying on connection errors and 5xx). Both calls
  are idempotent GETs,
  so retrying is safe.
@github-actions

github-actions Bot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 0 better · 🔴 9 worse · ⚪ 6 noise (<±5%) · 0 without baseline

Compared against main aca41f3 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 359 0.219 26,562/39,797/39,797 us 🔴 +23.3% / 🔴 +13.8%
🔴 bs=100 sw=10 sl=64 827 0.505 119,868/143,561/143,561 us 🔴 +10.2% / 🔴 -7.3%
🔴 bs=1000 sw=10 sl=64 934 0.57 1,063,354/1,158,802/1,158,802 us 🔴 +5.1% / 🔴 +13.3%
Baseline details

Latest main aca41f3 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 359 tuples/sec 442 tuples/sec 410.82 tuples/sec -18.8% -12.6%
bs=10 sw=10 sl=64 MB/s 0.219 MB/s 0.27 MB/s 0.251 MB/s -18.9% -12.7%
bs=10 sw=10 sl=64 p50 26,562 us 21,549 us 23,785 us +23.3% +11.7%
bs=10 sw=10 sl=64 p95 39,797 us 33,026 us 34,980 us +20.5% +13.8%
bs=10 sw=10 sl=64 p99 39,797 us 33,026 us 34,980 us +20.5% +13.8%
bs=100 sw=10 sl=64 throughput 827 tuples/sec 858 tuples/sec 891.94 tuples/sec -3.6% -7.3%
bs=100 sw=10 sl=64 MB/s 0.505 MB/s 0.524 MB/s 0.544 MB/s -3.6% -7.2%
bs=100 sw=10 sl=64 p50 119,868 us 115,929 us 112,277 us +3.4% +6.8%
bs=100 sw=10 sl=64 p95 143,561 us 130,299 us 139,802 us +10.2% +2.7%
bs=100 sw=10 sl=64 p99 143,561 us 130,299 us 139,802 us +10.2% +2.7%
bs=1000 sw=10 sl=64 throughput 934 tuples/sec 940 tuples/sec 1,041 tuples/sec -0.6% -10.3%
bs=1000 sw=10 sl=64 MB/s 0.57 MB/s 0.574 MB/s 0.635 MB/s -0.7% -10.3%
bs=1000 sw=10 sl=64 p50 1,063,354 us 1,065,197 us 972,714 us -0.2% +9.3%
bs=1000 sw=10 sl=64 p95 1,158,802 us 1,102,635 us 1,023,057 us +5.1% +13.3%
bs=1000 sw=10 sl=64 p99 1,158,802 us 1,102,635 us 1,023,057 us +5.1% +13.3%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,557.33,200,128000,359,0.219,26562.45,39797.14,39797.14
1,100,10,64,20,2417.30,2000,1280000,827,0.505,119868.26,143561.40,143561.40
2,1000,10,64,20,21405.25,20000,12800000,934,0.570,1063354.13,1158801.97,1158801.97

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.76%. Comparing base (aca41f3) to head (df4a363).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5667      +/-   ##
============================================
- Coverage     52.85%   52.76%   -0.09%     
+ Complexity     2620     2506     -114     
============================================
  Files          1090     1090              
  Lines         42176    42098      -78     
  Branches       4531     4525       -6     
============================================
- Hits          22293    22214      -79     
+ Misses        18573    18572       -1     
- Partials       1310     1312       +2     
Flag Coverage Δ *Carryforward flag
access-control-service 71.42% <ø> (ø) Carriedforward from a044287
agent-service 34.36% <ø> (ø) Carriedforward from a044287
amber 52.59% <ø> (-0.28%) ⬇️ Carriedforward from a044287
computing-unit-managing-service 1.65% <ø> (ø) Carriedforward from a044287
config-service 56.71% <ø> (ø) Carriedforward from a044287
file-service 57.06% <ø> (ø) Carriedforward from a044287
frontend 47.93% <ø> (ø) Carriedforward from a044287
pyamber 90.76% <100.00%> (+0.03%) ⬆️
python 90.74% <ø> (-0.01%) ⬇️ Carriedforward from a044287
workflow-compiling-service 58.69% <ø> (ø) Carriedforward from a044287

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add timeout and retry to DatasetFileDocument file-service requests

2 participants