fix(file-service): retry LakeFS health check on startup#5647
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #5647 +/- ##
============================================
- Coverage 52.91% 52.69% -0.22%
Complexity 2626 2626
============================================
Files 1090 1089 -1
Lines 42188 42180 -8
Branches 4531 4534 +3
============================================
- Hits 22324 22228 -96
- Misses 18555 18653 +98
+ Partials 1309 1299 -10
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
| config | throughput | MB/s | latency | max Δ latest / 7d | |
|---|---|---|---|---|---|
| 🔴 | bs=10 sw=10 sl=64 | 378 | 0.231 | 26,225/35,569/35,569 us | 🟢 -8.4% / 🔴 +10.3% |
| ⚪ | bs=100 sw=10 sl=64 | 809 | 0.494 | 124,661/141,076/141,076 us | ⚪ within ±5% / 🔴 +11.0% |
| ⚪ | bs=1000 sw=10 sl=64 | 935 | 0.571 | 1,062,066/1,163,386/1,163,386 us | ⚪ within ±5% / 🔴 +13.7% |
Baseline details
Latest main dfa0434 from same runner
| config | metric | PR | latest main | 7d avg | Δ latest | Δ 7d |
|---|---|---|---|---|---|---|
| bs=10 sw=10 sl=64 | throughput | 378 tuples/sec | 383 tuples/sec | 410.82 tuples/sec | -1.3% | -8.0% |
| bs=10 sw=10 sl=64 | MB/s | 0.231 MB/s | 0.234 MB/s | 0.251 MB/s | -1.3% | -7.9% |
| bs=10 sw=10 sl=64 | p50 | 26,225 us | 24,598 us | 23,785 us | +6.6% | +10.3% |
| bs=10 sw=10 sl=64 | p95 | 35,569 us | 38,825 us | 34,980 us | -8.4% | +1.7% |
| bs=10 sw=10 sl=64 | p99 | 35,569 us | 38,825 us | 34,980 us | -8.4% | +1.7% |
| bs=100 sw=10 sl=64 | throughput | 809 tuples/sec | 823 tuples/sec | 891.94 tuples/sec | -1.7% | -9.3% |
| bs=100 sw=10 sl=64 | MB/s | 0.494 MB/s | 0.502 MB/s | 0.544 MB/s | -1.6% | -9.3% |
| bs=100 sw=10 sl=64 | p50 | 124,661 us | 119,075 us | 112,277 us | +4.7% | +11.0% |
| bs=100 sw=10 sl=64 | p95 | 141,076 us | 145,527 us | 139,802 us | -3.1% | +0.9% |
| bs=100 sw=10 sl=64 | p99 | 141,076 us | 145,527 us | 139,802 us | -3.1% | +0.9% |
| bs=1000 sw=10 sl=64 | throughput | 935 tuples/sec | 927 tuples/sec | 1,041 tuples/sec | +0.9% | -10.2% |
| bs=1000 sw=10 sl=64 | MB/s | 0.571 MB/s | 0.566 MB/s | 0.635 MB/s | +0.9% | -10.1% |
| bs=1000 sw=10 sl=64 | p50 | 1,062,066 us | 1,080,005 us | 972,714 us | -1.7% | +9.2% |
| bs=1000 sw=10 sl=64 | p95 | 1,163,386 us | 1,128,997 us | 1,023,057 us | +3.0% | +13.7% |
| bs=1000 sw=10 sl=64 | p99 | 1,163,386 us | 1,128,997 us | 1,023,057 us | +3.0% | +13.7% |
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,529.59,200,128000,378,0.231,26224.59,35569.18,35569.18
1,100,10,64,20,2471.79,2000,1280000,809,0.494,124661.22,141075.60,141075.60
2,1000,10,64,20,21391.48,20000,12800000,935,0.571,1062065.91,1163385.91,1163385.91There was a problem hiding this comment.
Pull request overview
This PR improves file-service startup robustness by retrying the LakeFS health check instead of failing immediately on transient connection errors, addressing the reported startup race with LakeFS (issue #5646). It also introduces a new GitHub Actions workflow related to PR/issue template compliance, which expands the PR’s scope beyond the stated goal.
Changes:
- Add retry loop + warning logs to
LakeFSStorageClient.healthCheck()to tolerate transient LakeFS unavailability on startup. - Add
.github/workflows/template-compliance-warning.ymlto warn on non-compliant PR/issue templates (and clear the warning once fixed).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| common/workflow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala | Adds retry-on-failure logic for LakeFS health checks during startup. |
| .github/workflows/template-compliance-warning.yml | Adds an Actions workflow that posts/removes a sticky template-compliance warning comment. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@Ma77Ball could you please add some tests? Currently the coverage is 0.. |
Yicong-Huang
left a comment
There was a problem hiding this comment.
mainly need to add tests. see inline comment as well
|
Added |
LakeFS health check + add tests Replace the constant 3s × 10-attempt retry with exponential backoff: 5 attempts starting at 200ms (200, 400, 800, 1600ms), capping total wait at ~3s instead of ~30s. Extract the retry loop into a testable retryWithBackoff helper with an injectable sleep, and add LakeFSStorageClientSpec covering immediate success, retry-then-success, attempt exhaustion (preserving the original cause), and interrupt fail-fast.
|
Please also update Pr description, thanks |
|
The description is now updated. |
What changes were proposed in this PR?
LakeFSStorageClient.healthCheck()now retries the LakeFS health check with exponential backoff before failing, so a transientConnection resetduring concurrent startup no longer crashes the file-service.retryWithBackoffhelper with an injectablesleepfunction so the backoff can be unit-tested without real waiting.InterruptedExceptionwhile waiting restores the thread's interrupt status and fails fast instead of retrying.LakeFSStorageClientSpeccovering immediate success, retry-then-success with doubling delays, give-up-after-max-attempts (with cause preserved), and interrupt handling.Any related issues, documentation, discussions?
Closes: #5646
How was this PR tested?
LakeFSStorageClientSpecexercises the backoff behavior (delay sequence, max-attempt failure, interrupt handling) using an injectedsleep.:9092andcurl http://127.0.0.1:9092/api/healthcheckreturns HTTP 200.sbt "FileService/compile"and expect[success].Was this PR authored or co-authored using generative AI tooling?
Co-authored with Claude Opus 4.8 in compliance with ASF