feat(file-service): cleanup stale uncommitted dataset uploads#5643
feat(file-service): cleanup stale uncommitted dataset uploads#5643eugenegujing wants to merge 4 commits into
Conversation
Add a configurable scheduled job that reclaims storage left behind by uploads that were never committed (issue apache#3681): - StagedFileCleanupJob, registered as a Dropwizard Managed lifecycle in FileService, runs every check interval on a single daemon thread - path 1: DATASET_UPLOAD_SESSION rows older than the retention period are removed after aborting their LakeFS multipart upload (abort before delete, so a crash between the two steps is retried next round; HTTP 404 on abort is treated as already-aborted) - path 2: LakeFS staged (uncommitted) objects whose mtime exceeds the retention period are reset via the existing branch-reset API; objects belonging to a non-expired upload session are excluded, staged deletions and orphaned repos are skipped - config: storage.cleanup.enabled / retention-hours (default 72, above the 24h presigned-URL expiry so only non-resumable sessions are removed) / interval-minutes (default 60), all env-overridable - every action is idempotent; per-item failures are logged and counted without aborting the batch; each round logs an audit summary - file-service test suites now fork one JVM per suite: each testcontainers suite boots its own LakeFS/MinIO/Postgres stack and mutates JVM-wide singletons, which broke whichever suite ran second Closes apache#3681
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #5643 +/- ##
============================================
- Coverage 52.51% 52.50% -0.01%
- Complexity 2484 2550 +66
============================================
Files 1071 1086 +15
Lines 41363 41884 +521
Branches 4441 4482 +41
============================================
+ Hits 21720 21990 +270
- Misses 18371 18596 +225
- Partials 1272 1298 +26
*This pull request uses carry forward flags. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
✅ No material benchmark regressions detected🟢 15 better · 🔴 0 worse · ⚪ 0 noise (<±5%) · 0 without baseline
Baseline detailsLatest main
Raw CSVconfig_idx,batch_size,schema_width,string_len,num_batches,total_ms,total_tuples,total_bytes,tuples_per_sec,mb_per_sec,lat_p50_us,lat_p95_us,lat_p99_us
0,10,10,64,20,372.41,200,128000,537,0.328,16994.67,28291.91,28291.91
1,100,10,64,20,1626.38,2000,1280000,1230,0.751,81615.51,93374.03,93374.03
2,1000,10,64,20,13948.42,20000,12800000,1434,0.875,699209.92,730445.62,730445.62 |
|
@eugenegujing thanks for the PR. Please add more tests to guard the behavior. |
…type branches Address review feedback (add more tests to guard the behavior) and the Codecov gaps in StagedFileCleanupJob. Adds 7 cases to StagedFileCleanupJobSpec covering branches the first 11 tests missed: - start()/stop() lifecycle, including stop() before start() (the executor-null guard); the started daemon executor is always torn down in a finally - session-cleanup path: an orphan session whose dataset has a null repository_name is deleted with no multipart abort attempted and no error - staged-object path: a staged deletion (a REMOVED diff from deleting a committed object on the branch without committing) is skipped, not reset, with no error - staged-object path: a dataset pointing at a non-existent LakeFS repo makes retrieveUncommittedObjects throw 404, which is caught and skipped without counting an error - staged-object path: a staged content change (a CHANGED diff from re-uploading to a previously committed path) is reset to the committed version, exercising the CHANGED half of the object-write check that ADDED-only tests missed - staged-object path across multiple datasets in one round: expired staged objects in two separate repos are both reset, and an active session in one dataset does not protect a same-named path in another, proving per-dataset keying Test-only; extra DATASET rows and repos are removed in a finally so the suite's single-dataset assumptions and per-test counts are unaffected.
4021691 to
820a0c0
Compare
|
@Yicong-Huang I have added 7 more tests for this. Could you please review it again? |
There was a problem hiding this comment.
Pull request overview
Adds a scheduled, configurable cleanup job in file-service to reclaim storage used by abandoned (staged but uncommitted) dataset uploads, plus supporting config, LakeFS client helper, and isolation for testcontainers-based test suites.
Changes:
- Introduce
StagedFileCleanupJobto delete expiredDATASET_UPLOAD_SESSIONrows (after multipart abort) and reset expired staged LakeFS objects while protecting active sessions. - Wire the job into
FileServicebehind newstorage.cleanup.*config (with env overrides), and add a LakeFS helper to read staged object mtime. - Add an extensive
StagedFileCleanupJobSpecand fork FileService test suites into separate JVMs to avoid shared-singleton/testcontainers interference.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| file-service/src/main/scala/org/apache/texera/service/util/StagedFileCleanupJob.scala | New managed scheduled job implementing staged-upload cleanup logic and reporting. |
| file-service/src/main/scala/org/apache/texera/service/FileService.scala | Registers the cleanup job with Dropwizard lifecycle when enabled. |
| common/workflow-core/src/main/scala/org/apache/texera/amber/core/storage/util/LakeFSStorageClient.scala | Adds getStagedObjectMtime helper for cleanup’s age-based filtering. |
| common/config/src/main/scala/org/apache/texera/amber/config/StorageConfig.scala | Exposes cleanup enable/retention/interval config values. |
| common/config/src/main/resources/storage.conf | Adds storage.cleanup configuration block with env var overrides and defaults. |
| file-service/src/test/scala/org/apache/texera/service/util/StagedFileCleanupJobSpec.scala | New integration-style spec covering session deletion, staged resets, idempotence, and failure handling. |
| build.sbt | Forks FileService test suites into separate JVMs and groups them to run serially. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Signed-off-by: Eugene Gu <eugenegujing@outlook.com>
Yicong-Huang
left a comment
There was a problem hiding this comment.
I do have some concern on the atomicity of the operation. Please see my inline comments for details.
…r-object 404 as a no-op Address review feedback on the staged-file cleanup job. - Path 1 (abandoned upload sessions): the session-row delete and the LakeFS multipart abort now run inside one DB transaction (SqlServer.withTransaction), deleting FIRST. If the abort then fails with a non-404 error, the transaction rolls back and the session row survives, so the next round retries instead of leaving an orphaned multipart with no tracking record. LakeFS is external and cannot truly enroll in a DB transaction, so a *successful* abort is not undone by a later failure -- but the abort is idempotent (re-aborting an already-aborted upload returns 404, treated as success), so partial state is never permanent. The transaction is per session (not per round) to avoid holding a DB connection open across many LakeFS HTTP calls. - Path 2 (staged objects): an ApiException 404 from the per-object stat/reset is now treated as a successful no-op (a concurrent commit/reset, or another cleanup round, already removed it), matching the job's idempotent design instead of logging a warning and counting an error. Adds failure-mode tests for cleaning a single item: - a non-404 abort failure rolls back the row delete (the row survives) and is counted, so it is retried; - a transiently-failing session is cleaned on a later round once the failure clears (self-heals, not stuck); - a failing item does not prevent a healthy item in the same round from being cleaned (the healthy row is deleted, the failing row is kept for retry). Together with existing tests these cover the named failure cases: no DB record (orphan/null-repo session), no file found (already-aborted / repo 404), and a generic/timeout LakeFS error (rolled back, retried). Also renames the test section comments from F1/F2 to "Path 1 (session cleanup)" / "Path 2 (staged objects)" to match the production-code naming.
|
@Yicong-Huang Thank you for the detailed review. It was really helpful. I've tried to address all three points in the latest commit. Here's what I did, please let me know if I've misunderstood anything. On combining the steps in a transaction (comments 1 & 3): I've wrapped Path 1 so that, for each session, the row delete and the multipart abort run in one One thing I wasn't sure how to fully solve: since LakeFS is an external service, I don't think it can really take part in a Postgres transaction (a DB rollback can't undo an abort that already went through), so I couldn't get true all-or-nothing across both systems. What I did instead was lean on ordering + idempotency. The DB write is gated on the abort, and the abort seems to be idempotent (re-aborting an already-aborted upload returns 404, which we treat as success), so in my understanding a successful abort followed by a later failure should self-heal on the next round. I also kept the transaction per-session rather than per-round, mainly so we don't hold a DB connection open across many LakeFS calls, but I'm happy to change the granularity if you'd prefer. For Path 2 (resetting uncommitted objects) there's no DB write, so I didn't see anything to wrap there; it also runs as an independent sweep that doesn't rely on the session records, which I think gives us a second chance to clean any orphaned object even if its row is already gone. On testing the different step failures (comment 2): I added tests for each case you mentioned, checking the result is either rolled back or cleaned next round: no DB record (orphan / null- Thanks again for that and I'm happy to adjust any of this. |
What changes were proposed in this PR?
Adds a configurable scheduled cleanup job to file-service that reclaims the storage held by uploads that were started (or staged) but never committed.
Problem. Files uploaded to a dataset sit in LakeFS staging until the user commits a version. LakeFS GC only manages committed data, and Texera has no background cleanup, so abandoned uploads accumulate forever in MinIO/S3, along with their
DATASET_UPLOAD_SESSIONrows. The only existing mitigations are the 24h presigned-URL expiry and the lazy re-init cleanup ininitMultipartUpload, neither of which removes staged data for datasets nobody touches again.Design.
Safety properties (each independently enforced):
diffBranch, branch reset, multipart abort) — committed objects never appear in its inputsPHYSICAL_ADDRESS_EXPIRATION_TIME_HRS, so deleted sessions are already non-resumablestorage.cleanup.enabled/retention-hours/interval-minutes, each overridable viaSTORAGE_CLEANUP_*env vars; the whole feature can be disabledAlso in this PR: file-service test suites now fork one JVM per suite (
build.sbt). Each testcontainers-based suite boots its own LakeFS/MinIO/Postgres stack and mutates JVM-wide singletons (StorageConfigendpoints), so sharing one JVM broke whichever container suite ran second; forking isolates the Docker client and singleton state.LakeFSStorageClientgains one helper (getStagedObjectMtime) because the diff API carries no timestamps.Feedback welcome on the default retention (72h ≈ covers a weekend; it is a config default, not a hard-coded value).
Any related issues, documentation, discussions?
Closes #3681.
How was this PR tested?
New
StagedFileCleanupJobSpec(11 tests, real LakeFS/MinIO/Postgres via testcontainers + embedded Postgres for the DAO): expired session deleted with part-row cascade; fresh session survives; expired staged object reset while a committed object survives; fresh staged object survives; idempotent second run reports zeros; active upload and its staged file untouched; mixed-batch exact counts; ±1min boundary around the cutoff; out-of-band-aborted multipart treated as already-aborted (404 rule); per-item failure counted without aborting the batch.111 tests, 4 suites, all passed (includes the pre-existing
DatasetResourceSpec93 tests — no regression). Also ransbt scalafixAllandsbt scalafmtAll(clean). Requires Docker for the testcontainers suites.Was this PR authored or co-authored using generative AI tooling?
Co-authored by: Claude Code (Claude Fable 5)