Skip to content

fix(file-service): retry LakeFS health check on startup#5647

Merged
Yicong-Huang merged 12 commits into
apache:mainfrom
Ma77Ball:fix/file-service-lakefs-startup-race
Jun 15, 2026
Merged

fix(file-service): retry LakeFS health check on startup#5647
Yicong-Huang merged 12 commits into
apache:mainfrom
Ma77Ball:fix/file-service-lakefs-startup-race

Conversation

@Ma77Ball

@Ma77Ball Ma77Ball commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this PR?

  • LakeFSStorageClient.healthCheck() now retries the LakeFS health check with exponential backoff before failing, so a transient Connection reset during concurrent startup no longer crashes the file-service.
  • Retry policy: 5 attempts with the delay doubling each time, starting at 200ms (200, 400, 800, 1600ms), capping total wait at ~3s.
  • The retry logic is factored into a small, reusable retryWithBackoff helper with an injectable sleep function so the backoff can be unit-tested without real waiting.
  • Each failed attempt logs a warning; if LakeFS is still unreachable after all attempts, startup fails with a clear aggregated error (preserving the last exception as the cause) so a genuine outage is still surfaced.
  • An InterruptedException while waiting restores the thread's interrupt status and fails fast instead of retrying.
  • Added LakeFSStorageClientSpec covering immediate success, retry-then-success with doubling delays, give-up-after-max-attempts (with cause preserved), and interrupt handling.

Any related issues, documentation, discussions?

Closes: #5646

How was this PR tested?

  • Unit tests: LakeFSStorageClientSpec exercises the backoff behavior (delay sequence, max-attempt failure, interrupt handling) using an injected sleep.
  • Reproduce the race: stop LakeFS, start file-service, then start LakeFS within a few seconds; confirm file-service logs retry warnings and then starts successfully instead of exiting.
  • Regression with LakeFS already up: start file-service, confirm it binds :9092 and curl http://127.0.0.1:9092/api/healthcheck returns HTTP 200.
  • Compile: run sbt "FileService/compile" and expect [success].

Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.8 in compliance with ASF

@github-actions github-actions Bot added fix ci changes related to CI common labels Jun 12, 2026
@codecov-commenter

codecov-commenter commented Jun 12, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 72.72727% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.69%. Comparing base (dfa0434) to head (3b8a7de).

Files with missing lines Patch % Lines
.../amber/core/storage/util/LakeFSStorageClient.scala 72.72% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5647      +/-   ##
============================================
- Coverage     52.91%   52.69%   -0.22%     
  Complexity     2626     2626              
============================================
  Files          1090     1089       -1     
  Lines         42188    42180       -8     
  Branches       4531     4534       +3     
============================================
- Hits          22324    22228      -96     
- Misses        18555    18653      +98     
+ Partials       1309     1299      -10     
Flag Coverage Δ *Carryforward flag
access-control-service 70.91% <ø> (ø)
agent-service 34.36% <ø> (ø) Carriedforward from 6032413
amber 53.07% <72.72%> (+0.05%) ⬆️
computing-unit-managing-service 1.65% <ø> (ø)
config-service 56.71% <ø> (ø)
file-service 57.06% <ø> (ø)
frontend 47.38% <ø> (-0.56%) ⬇️ Carriedforward from 6032413
pyamber 90.71% <ø> (-0.03%) ⬇️ Carriedforward from 6032413
python 90.73% <ø> (ø) Carriedforward from 6032413
workflow-compiling-service 58.69% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

⚠️ Benchmark changes need a look

🟢 2 better · 🔴 1 worse · ⚪ 12 noise (<±5%) · 0 without baseline

Compared against main dfa0434 benchmarked on this same runner, so the delta is largely free of cross-runner hardware noise. The "7d avg" column still reflects the gh-pages dashboard. Treat <±5% as noise unless repeated.

Dashboard · Run

config throughput MB/s latency max Δ latest / 7d
🔴 bs=10 sw=10 sl=64 378 0.231 26,225/35,569/35,569 us 🟢 -8.4% / 🔴 +10.3%
bs=100 sw=10 sl=64 809 0.494 124,661/141,076/141,076 us ⚪ within ±5% / 🔴 +11.0%
bs=1000 sw=10 sl=64 935 0.571 1,062,066/1,163,386/1,163,386 us ⚪ within ±5% / 🔴 +13.7%
Baseline details

Latest main dfa0434 from same runner

config metric PR latest main 7d avg Δ latest Δ 7d
bs=10 sw=10 sl=64 throughput 378 tuples/sec 383 tuples/sec 410.82 tuples/sec -1.3% -8.0%
bs=10 sw=10 sl=64 MB/s 0.231 MB/s 0.234 MB/s 0.251 MB/s -1.3% -7.9%
bs=10 sw=10 sl=64 p50 26,225 us 24,598 us 23,785 us +6.6% +10.3%
bs=10 sw=10 sl=64 p95 35,569 us 38,825 us 34,980 us -8.4% +1.7%
bs=10 sw=10 sl=64 p99 35,569 us 38,825 us 34,980 us -8.4% +1.7%
bs=100 sw=10 sl=64 throughput 809 tuples/sec 823 tuples/sec 891.94 tuples/sec -1.7% -9.3%
bs=100 sw=10 sl=64 MB/s 0.494 MB/s 0.502 MB/s 0.544 MB/s -1.6% -9.3%
bs=100 sw=10 sl=64 p50 124,661 us 119,075 us 112,277 us +4.7% +11.0%
bs=100 sw=10 sl=64 p95 141,076 us 145,527 us 139,802 us -3.1% +0.9%
bs=100 sw=10 sl=64 p99 141,076 us 145,527 us 139,802 us -3.1% +0.9%
bs=1000 sw=10 sl=64 throughput 935 tuples/sec 927 tuples/sec 1,041 tuples/sec +0.9% -10.2%
bs=1000 sw=10 sl=64 MB/s 0.571 MB/s 0.566 MB/s 0.635 MB/s +0.9% -10.1%
bs=1000 sw=10 sl=64 p50 1,062,066 us 1,080,005 us 972,714 us -1.7% +9.2%
bs=1000 sw=10 sl=64 p95 1,163,386 us 1,128,997 us 1,023,057 us +3.0% +13.7%
bs=1000 sw=10 sl=64 p99 1,163,386 us 1,128,997 us 1,023,057 us +3.0% +13.7%
Raw CSV
config_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,529.59,200,128000,378,0.231,26224.59,35569.18,35569.18
1,100,10,64,20,2471.79,2000,1280000,809,0.494,124661.22,141075.60,141075.60
2,1000,10,64,20,21391.48,20000,12800000,935,0.571,1062065.91,1163385.91,1163385.91

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves file-service startup robustness by retrying the LakeFS health check instead of failing immediately on transient connection errors, addressing the reported startup race with LakeFS (issue #5646). It also introduces a new GitHub Actions workflow related to PR/issue template compliance, which expands the PR’s scope beyond the stated goal.

Changes:

  • Add retry loop + warning logs to LakeFSStorageClient.healthCheck() to tolerate transient LakeFS unavailability on startup.
  • Add .github/workflows/template-compliance-warning.yml to warn on non-compliant PR/issue templates (and clear the warning once fixed).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
common/workflow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala Adds retry-on-failure logic for LakeFS health checks during startup.
.github/workflows/template-compliance-warning.yml Adds an Actions workflow that posts/removes a sticky template-compliance warning comment.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread .github/workflows/template-compliance-warning.yml Outdated
Comment thread .github/workflows/template-compliance-warning.yml Outdated
Comment thread .github/workflows/template-compliance-warning.yml Outdated
@github-actions github-actions Bot removed the ci changes related to CI label Jun 13, 2026
@Ma77Ball Ma77Ball requested a review from Yicong-Huang June 13, 2026 03:07
@Yicong-Huang

Copy link
Copy Markdown
Contributor

@Ma77Ball could you please add some tests? Currently the coverage is 0..

@Yicong-Huang Yicong-Huang left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mainly need to add tests. see inline comment as well

@Ma77Ball

Copy link
Copy Markdown
Contributor Author

Added LakeFSStorageClientSpec covering the retry helper: immediate success, retry-then-success with the expected backoff delays, exhaustion preserving the original exception as cause, and the interrupt fail-fast path. To test without a live LakeFS server, I extracted the retry loop into retryWithBackoff(maxAttempts, initialDelayMillis, sleep)(operation) with an injectable sleep. All 4 tests pass locally.

Ma77Ball added 2 commits June 13, 2026 19:17
  LakeFS health check + add tests

  Replace the constant 3s × 10-attempt retry with
  exponential backoff: 5
  attempts starting at 200ms (200, 400, 800, 1600ms),
  capping total wait at
  ~3s instead of ~30s. Extract the retry loop into a
  testable
  retryWithBackoff helper with an injectable sleep,
  and add
  LakeFSStorageClientSpec covering immediate success,
  retry-then-success,
  attempt exhaustion (preserving the original cause),
  and interrupt
  fail-fast.
@Ma77Ball Ma77Ball requested a review from Yicong-Huang June 14, 2026 02:22
@Yicong-Huang

Copy link
Copy Markdown
Contributor

Please also update Pr description, thanks

@Ma77Ball

Copy link
Copy Markdown
Contributor Author

The description is now updated.

@Yicong-Huang Yicong-Huang enabled auto-merge June 14, 2026 05:20
@Yicong-Huang Yicong-Huang added this pull request to the merge queue Jun 15, 2026
Merged via the queue into apache:main with commit 263093a Jun 15, 2026
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

file-service crashes on startup when LakeFS health check races LakeFS boot

4 participants