Skip to content

[BugFix] Fix UpdateTabletSchemaTask signature collision across alter jobs#71242

Open
wxl24life wants to merge 1 commit intoStarRocks:mainfrom
wxl24life:fix/agent-task-signature-collision
Open

[BugFix] Fix UpdateTabletSchemaTask signature collision across alter jobs#71242
wxl24life wants to merge 1 commit intoStarRocks:mainfrom
wxl24life:fix/agent-task-signature-collision

Conversation

@wxl24life
Copy link
Copy Markdown
Contributor

@wxl24life wxl24life commented Apr 2, 2026

Why I'm doing:

When multiple ALTER TABLE jobs operate on the same table/partition in quick succession, UpdateTabletSchemaTask produces identical AgentTaskQueue signatures because the signature is computed as tablets.hashCode(). Since different alter jobs on the same table operate on the same tablet set, they get the same signature, leading to a race condition:

  1. Task rejection: AgentTaskQueue rejects the new task while the old task with the same signature is still queued
  2. Task mismatch: After the old task is removed from FE queue (job timeout/cancel), the new task enters the queue with the same signature. When BE finishes executing the old task and reports completion, FE matches it to the new job's task by (backendId, taskType, signature) — matching the wrong job
  3. Publish failure: The new job proceeds to publish with a txnId whose txn log was never written by BE, resulting in a 404 error on object storage

This was observed in production where two consecutive ALTER TABLE ADD COLUMN statements on the same table caused the second job to get stuck in FINISHED_REWRITING state, continuously failing to publish with "404 Not Found" on txn log.

What I'm doing:

Override setTxnId() in UpdateTabletSchemaTask to update the signature to Objects.hash(tablets, txnId), ensuring different alter jobs produce different signatures even when operating on the same tablet set.

This is safe because in the LakeTableAlterMetaJobBase.sendAgentTask() call chain, setTxnId(watershedTxnId) is always called before AgentTaskQueue.addBatchTask(), so the signature includes the txnId before the task enters the queue.

Changed files:

  • TabletMetadataUpdateAgentTaskFactory.java: Added setTxnId() override in UpdateTabletSchemaTask inner class
  • AgentTaskQueueSignatureCollisionTest.java: Added 4 test cases covering signature divergence, queue coexistence, race condition prevention, and dedup preservation

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5
    • 3.4

…jobs

UpdateTabletSchemaTask used tablets.hashCode() as its AgentTaskQueue
signature. When multiple ALTER TABLE jobs operate on the same tablet set
(same table/partition), they produce identical signatures, causing:

1. AgentTaskQueue rejects the new task while old task is queued
2. After old task removal, new task enters with same signature
3. BE reports old task completion, FE matches it to new task (wrong job)
4. New job publishes with wrong txnId -> txn log 404

Fix: override setTxnId() in UpdateTabletSchemaTask to update signature
to Objects.hash(tablets, txnId), ensuring different alter jobs produce
different signatures even for the same tablet set. This is safe because
setTxnId() is always called before AgentTaskQueue.addBatchTask() in the
LakeTableAlterMetaJobBase.sendAgentTask() call chain.
@wxl24life wxl24life force-pushed the fix/agent-task-signature-collision branch from 03292ff to a22c2cc Compare April 3, 2026 02:26
@CelerData-Reviewer
Copy link
Copy Markdown

@codex review

@chatgpt-codex-connector
Copy link
Copy Markdown

Codex Review: Didn't find any major issues. You're on a roll.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 3, 2026

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 3, 2026

[FE Incremental Coverage Report]

pass : 3 / 3 (100.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 com/starrocks/task/TabletMetadataUpdateAgentTaskFactory.java 3 3 100.00% []

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 3, 2026

[BE Incremental Coverage Report]

pass : 0 / 0 (0%)

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Apr 3, 2026

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a task-signature collision in FE’s UpdateTabletSchemaTask that can occur when multiple ALTER jobs act on the same tablet set, by incorporating txnId into the AgentTaskQueue signature to prevent cross-job task dedup/mismatch.

Changes:

  • Override UpdateTabletSchemaTask.setTxnId() to recompute signature as Objects.hash(tablets, txnId).
  • Add a new FE unit test class covering signature divergence/coexistence and the previously observed race scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
fe/fe-core/src/main/java/com/starrocks/task/TabletMetadataUpdateAgentTaskFactory.java Updates UpdateTabletSchemaTask signature derivation to include txnId, avoiding signature collisions across alter jobs on the same tablets.
fe/fe-core/src/test/java/com/starrocks/task/AgentTaskQueueSignatureCollisionTest.java Adds unit tests validating signature uniqueness across txnIds and queue behavior under the collision/race scenario.

Comment on lines +173 to +190
@Test
public void testSameTxnIdProducesSameSignature() {
TabletMetadataUpdateAgentTask task1 = TabletMetadataUpdateAgentTaskFactory
.createTabletSchemaUpdateTask(BACKEND_ID, TABLETS, new TTabletSchema(), true);
task1.setTxnId(1000L);

TabletMetadataUpdateAgentTask task2 = TabletMetadataUpdateAgentTaskFactory
.createTabletSchemaUpdateTask(BACKEND_ID, TABLETS, new TTabletSchema(), true);
task2.setTxnId(1000L);

assertEquals(task1.getSignature(), task2.getSignature(),
"Same tablets + same txnId should produce same signature (dedup still works)");

// AgentTaskQueue correctly deduplicates retries
assertTrue(AgentTaskQueue.addTask(task1));
// task2 has the same signature, so it should be rejected (correct dedup)
// This is intentional: if the same job retries, the old task must be removed first
}
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testSameTxnIdProducesSameSignature claims to verify AgentTaskQueue deduplication for retries, but it never actually attempts to enqueue task2 (and therefore doesn’t assert the expected rejection). As written, the test would pass even if deduplication were broken.

Add an explicit AgentTaskQueue.addTask(task2) call and assert it returns false (and optionally assert the queue still returns task1 for that signature).

Copilot uses AI. Check for mistakes.
}

@Override
public void setTxnId(long txnId) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why only override UpdateTabletSchemaTask, how about other tasks that can also set transaction id?

// table/partition produce identical signatures (tablets.hashCode()), causing
// AgentTaskQueue to reject the second task or FE to match BE's old task
// completion report to the wrong job.
this.signature = Objects.hash(tablets, txnId);
Copy link
Copy Markdown
Contributor

@kevincai kevincai Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other question is the backwards compatibility. Will an upgrade cause all agentTasks for this type invalid due to signature changed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants