Skip to content

[BugFix] Fix use-after-free in ThreadPool::do_submit when thread creation fails#71276

Open
sevev wants to merge 2 commits intoStarRocks:mainfrom
sevev:fix/threadpool-submit-use-after-free
Open

[BugFix] Fix use-after-free in ThreadPool::do_submit when thread creation fails#71276
sevev wants to merge 2 commits intoStarRocks:mainfrom
sevev:fix/threadpool-submit-use-after-free

Conversation

@sevev
Copy link
Copy Markdown
Contributor

@sevev sevev commented Apr 3, 2026

Why I'm doing:

When ThreadPool has min_threads=0, all worker threads can idle-exit, leaving _num_threads == 0. A new submit() would:

  1. Enqueue the task into the token's priority queue
  2. Attempt create_thread() outside the lock
  3. If creation fails and _num_threads + _num_threads_pending_start == 0, return error

The caller receives the error and performs cleanup (e.g. counting down a latch). But the task remains in the queue. A subsequent successful submit() creates a thread that picks up the orphaned task, executing it against already-destroyed state — causing a SIGSEGV crash in butil::LinkNode::InsertBefore due to use-after-free on a stack-allocated BThreadCountDownLatch.

Key Logs

W20260403 06:43:42.119449 47099786790656 lake_service.cpp:1217] Fail to get tablet stats task: Runtime error: Could not create thread: Resource temporarily unavailable
W20260403 06:43:42.119535 47099786790656 lake_service.cpp:1217] Fail to get tablet stats task: Runtime error: Could not create thread: Resource temporarily unavailable
W20260403 06:43:42.119611 47099786790656 lake_service.cpp:1217] Fail to get tablet stats task: Runtime error: Could not create thread: Resource temporarily unavailable
W20260403 06:43:42.121087 47099653371648 lake_service.cpp:1217] Fail to get tablet stats task: Runtime error: Could not create thread: Resource temporarily unavailable
W20260403 06:43:42.121156 47099653371648 lake_service.cpp:1217] Fail to get tablet stats task: Runtime error: Could not create thread: Resource temporarily unavailable

What I'm doing:

Introduce a sole_thread flag: when _num_threads + _num_threads_pending_start == 1 after incrementing (meaning we're creating the pool's only thread), attempt thread creation before enqueuing. If it fails, return error immediately — the task was never enqueued.

When the pool already has other threads, the original behavior is preserved (enqueue first, then create thread), since that path always returns OK.

Fixes https://github.com/StarRocks/StarRocksTest/issues/11218

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
    • This pr needs auto generate documentation
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 4.1
    • 4.0
    • 3.5
    • 3.4

@sevev sevev requested a review from a team as a code owner April 3, 2026 10:52
@mergify mergify bot assigned sevev Apr 3, 2026
@CelerData-Reviewer
Copy link
Copy Markdown

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e9c7dfe97c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a use-after-free scenario in the BE ThreadPool::do_submit() path when min_threads=0 and thread creation fails, which could leave an orphaned task in the queue and later execute it against already-destroyed state.

Changes:

  • Modify ThreadPool::do_submit() to attempt thread creation before enqueuing when creating the pool’s first worker thread.
  • Add a SyncPoint hook in ThreadPool::create_thread() to deterministically inject thread-creation failures in tests.
  • Add regression tests covering failed submit when no threads exist and verifying no UAF/crash scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
be/src/common/thread/threadpool.cpp Adds “sole thread” pre-create logic in do_submit() and a SyncPoint injection point in create_thread().
be/test/common/thread/threadpool_test.cpp Adds regression tests using SyncPoint to simulate thread creation failures and validate post-fix behavior.

…tion fails

When a ThreadPool has min_threads=0 and all worker threads have exited,
a new submit() must create a thread. The old code enqueued the task first,
then attempted thread creation. If creation failed (e.g. EAGAIN due to
system thread exhaustion), it returned an error WITHOUT removing the task
from the queue.

The caller, seeing the error, would perform its own cleanup (e.g. counting
down a latch). A later successful submit could then create a thread that
picks up the orphaned task, executing it against already-destroyed state,
causing a use-after-free crash (SIGSEGV in butil::LinkNode::InsertBefore).

The fix creates the thread BEFORE enqueuing the task when the pool has
zero existing threads (sole_thread path). If thread creation fails, the
task is never enqueued and the caller can safely handle the error. When
the pool already has threads, the original enqueue-first behavior is
preserved since submit always returns OK in that path.

Signed-off-by: sevev <qiangzh95@gmail.com>
@sevev sevev force-pushed the fix/threadpool-submit-use-after-free branch from e9c7dfe to 5e67b46 Compare April 3, 2026 11:24
@sevev sevev requested a review from Copilot April 3, 2026 11:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Copy link
Copy Markdown
Contributor

@luohaha luohaha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The fix correctly addresses the use-after-free by creating the thread before enqueuing the task when the pool has zero threads (sole_thread path).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 3, 2026

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 3, 2026

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 3, 2026

[BE Incremental Coverage Report]

pass : 12 / 12 (100.00%)

file detail

path covered_line new_line coverage not_covered_line_detail
🔵 be/src/common/thread/threadpool.cpp 12 12 100.00% []

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants