[BugFix] Fix UpdateTabletSchemaTask signature collision across alter jobs#71242
[BugFix] Fix UpdateTabletSchemaTask signature collision across alter jobs#71242wxl24life wants to merge 1 commit intoStarRocks:mainfrom
Conversation
…jobs UpdateTabletSchemaTask used tablets.hashCode() as its AgentTaskQueue signature. When multiple ALTER TABLE jobs operate on the same tablet set (same table/partition), they produce identical signatures, causing: 1. AgentTaskQueue rejects the new task while old task is queued 2. After old task removal, new task enters with same signature 3. BE reports old task completion, FE matches it to new task (wrong job) 4. New job publishes with wrong txnId -> txn log 404 Fix: override setTxnId() in UpdateTabletSchemaTask to update signature to Objects.hash(tablets, txnId), ensuring different alter jobs produce different signatures even for the same tablet set. This is safe because setTxnId() is always called before AgentTaskQueue.addBatchTask() in the LakeTableAlterMetaJobBase.sendAgentTask() call chain.
03292ff to
a22c2cc
Compare
|
@codex review |
|
Codex Review: Didn't find any major issues. You're on a roll. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[FE Incremental Coverage Report]✅ pass : 3 / 3 (100.00%) file detail
|
[BE Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
|
There was a problem hiding this comment.
Pull request overview
Fixes a task-signature collision in FE’s UpdateTabletSchemaTask that can occur when multiple ALTER jobs act on the same tablet set, by incorporating txnId into the AgentTaskQueue signature to prevent cross-job task dedup/mismatch.
Changes:
- Override
UpdateTabletSchemaTask.setTxnId()to recomputesignatureasObjects.hash(tablets, txnId). - Add a new FE unit test class covering signature divergence/coexistence and the previously observed race scenario.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
fe/fe-core/src/main/java/com/starrocks/task/TabletMetadataUpdateAgentTaskFactory.java |
Updates UpdateTabletSchemaTask signature derivation to include txnId, avoiding signature collisions across alter jobs on the same tablets. |
fe/fe-core/src/test/java/com/starrocks/task/AgentTaskQueueSignatureCollisionTest.java |
Adds unit tests validating signature uniqueness across txnIds and queue behavior under the collision/race scenario. |
| @Test | ||
| public void testSameTxnIdProducesSameSignature() { | ||
| TabletMetadataUpdateAgentTask task1 = TabletMetadataUpdateAgentTaskFactory | ||
| .createTabletSchemaUpdateTask(BACKEND_ID, TABLETS, new TTabletSchema(), true); | ||
| task1.setTxnId(1000L); | ||
|
|
||
| TabletMetadataUpdateAgentTask task2 = TabletMetadataUpdateAgentTaskFactory | ||
| .createTabletSchemaUpdateTask(BACKEND_ID, TABLETS, new TTabletSchema(), true); | ||
| task2.setTxnId(1000L); | ||
|
|
||
| assertEquals(task1.getSignature(), task2.getSignature(), | ||
| "Same tablets + same txnId should produce same signature (dedup still works)"); | ||
|
|
||
| // AgentTaskQueue correctly deduplicates retries | ||
| assertTrue(AgentTaskQueue.addTask(task1)); | ||
| // task2 has the same signature, so it should be rejected (correct dedup) | ||
| // This is intentional: if the same job retries, the old task must be removed first | ||
| } |
There was a problem hiding this comment.
testSameTxnIdProducesSameSignature claims to verify AgentTaskQueue deduplication for retries, but it never actually attempts to enqueue task2 (and therefore doesn’t assert the expected rejection). As written, the test would pass even if deduplication were broken.
Add an explicit AgentTaskQueue.addTask(task2) call and assert it returns false (and optionally assert the queue still returns task1 for that signature).
| } | ||
|
|
||
| @Override | ||
| public void setTxnId(long txnId) { |
There was a problem hiding this comment.
why only override UpdateTabletSchemaTask, how about other tasks that can also set transaction id?
| // table/partition produce identical signatures (tablets.hashCode()), causing | ||
| // AgentTaskQueue to reject the second task or FE to match BE's old task | ||
| // completion report to the wrong job. | ||
| this.signature = Objects.hash(tablets, txnId); |
There was a problem hiding this comment.
the other question is the backwards compatibility. Will an upgrade cause all agentTasks for this type invalid due to signature changed.



Why I'm doing:
When multiple ALTER TABLE jobs operate on the same table/partition in quick succession,
UpdateTabletSchemaTaskproduces identicalAgentTaskQueuesignatures because the signature is computed astablets.hashCode(). Since different alter jobs on the same table operate on the same tablet set, they get the same signature, leading to a race condition:(backendId, taskType, signature)— matching the wrong jobThis was observed in production where two consecutive
ALTER TABLE ADD COLUMNstatements on the same table caused the second job to get stuck inFINISHED_REWRITINGstate, continuously failing to publish with "404 Not Found" on txn log.What I'm doing:
Override
setTxnId()inUpdateTabletSchemaTaskto update the signature toObjects.hash(tablets, txnId), ensuring different alter jobs produce different signatures even when operating on the same tablet set.This is safe because in the
LakeTableAlterMetaJobBase.sendAgentTask()call chain,setTxnId(watershedTxnId)is always called beforeAgentTaskQueue.addBatchTask(), so the signature includes the txnId before the task enters the queue.Changed files:
TabletMetadataUpdateAgentTaskFactory.java: AddedsetTxnId()override inUpdateTabletSchemaTaskinner classAgentTaskQueueSignatureCollisionTest.java: Added 4 test cases covering signature divergence, queue coexistence, race condition prevention, and dedup preservationWhat type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: