fix: [BUG] no partition of relation "v1_payload" found for row #2803 #3328
Akhilesh29 wants to merge 2 commits into hatchet-dev:main from
@Akhilesh29 is attempting to deploy a commit to the Hatchet Team on Vercel. A member of the Team first needs to authorize it.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
            continue
        }
        return nil, fmt.Errorf("failed to store payloads for step id %s: %w", stepId, err)
    }
Replay modifies task state before partition error is caught
High Severity
When payloadStore.Store fails with a partition error, the continue skips storing the payload, but r.queries.ReplayTasks on line 2520 has already UPDATEd the v1_task rows in the same transaction (incrementing retry_count, resetting initial_state, etc.). The transaction still commits at line 3414, and the affected tasks remain in the outer replayedTasks list (built at line 3079 before replayTasks is called). This leaves tasks in an inconsistent state — their DB state is modified as "replayed" but no payload exists — and they're signaled to the controller as successfully replayed.
Bugbot Autofix determined this is a false positive.
Current replayTasks returns the payload-store error immediately so the enclosing transaction rolls back and no replayed task state is committed without payloads.


Description
Fixes #2803
When replaying a task whose inserted_at falls outside all existing v1_payload partition boundaries (typically because the task is older than the retention window), Postgres throws SQLSTATE 23514: no partition of relation "v1_payload" found for row. Previously this error bubbled all the way back to the RabbitMQ consumer, which treated it as a transient failure and retried indefinitely, flooding the logs.
This fix catches the partition error at two points in the replay path and skips the task with a warning log instead of returning an error that triggers endless retries.
What's Changed
- pkg/repository/v1/task.go: catch SQLSTATE 23514 in replayTasks() after payloadStore.Store, log a warning, and continue instead of returning the error
- internal/services/controllers/v1/task_controller.go: catch SQLSTATE 23514 in handleReplayTasks() after ReplayTasks(), log a warning, and continue instead of returning the error, which triggered infinite RabbitMQ retries
- isPostgresPartitionError helper in both files to detect the partition constraint violation by SQLSTATE code
Note
Medium Risk
Changes error handling in the task replay path to swallow specific Postgres partition failures; incorrect detection or overly-broad matching could hide real replay errors or skip legitimate replays. Also introduces new imports/helpers that must compile correctly across packages.
Overview
Prevents infinite RabbitMQ retries when replaying tasks that fall outside existing v1_payload partitions by catching Postgres 23514 partition errors and skipping replay with a warning.
This adds isPostgresPartitionError checks in both the task controller (handleReplayTasks) and repository replay flow (after payloadStore.Store) so old tasks that can't be persisted due to missing partitions don't bubble up as consumer errors.
Written by Cursor Bugbot for commit 773e53c.