
Adding a yield in getTeamByServers to prevent CPU monopolization #12860

Open
neethuhaneesha wants to merge 1 commit into apple:release-7.3 from neethuhaneesha:ddissue-7.3

Conversation

@neethuhaneesha
Contributor


Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@neethuhaneesha neethuhaneesha changed the title Ddissue 7.3 Adding a yeild in getTeamByServers to prevent CPU monopolization Mar 26, 2026
@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 8c0f52d
  • Duration 0:03:58
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 393da65
  • Duration 0:03:30
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 8c0f52d
  • Duration 0:08:16
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 8c0f52d
  • Duration 0:08:25
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 8c0f52d
  • Duration 0:10:01
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 393da65
  • Duration 0:08:02
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 393da65
  • Duration 0:08:09
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 393da65
  • Duration 0:08:09
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

```cpp
	break;
}
teamCount++;
if (teamCount % 200 == 0) {
```
Collaborator

If there is important work to run at a flow priority below the priority of whatever callback is running this, does that work get starved? I can't remember.

In any case I'd be in favor of a hard cap on requests per second through this code path given its problematic nature thus far. Like say

```cpp
if (++teamCount % 1000 == 0) {
	wait(delay(SERVER_KNOBS->DD_GET_TEAM_BY_SERVERS_DELAY_PER_1K_REQUESTS));
}
```

and define that knob as 0.01 to start with. 1 second / (0.01 seconds per 1000 requests) => 100,000 requests/second, which I'm guessing is a bit less than one core can do. But we can lower the knob if needed.

Collaborator

The AI is telling me that this should give lower-priority work a chance within the same outer run loop (I haven't actually looked at the code to confirm). If you're sure this will work, I'm OK with it.

Collaborator

Separately, I think the yield check should go before checking any team, because otherwise if there are a bazillion requests all handled by self->teams.begin() then we will never yield. Unlikely, maybe, but why have to think about it.

```cpp
state const std::string servers = TCTeamInfo::serversToString(req.src);
state Optional<Reference<IDataDistributionTeam>> res;
state int teamCount = 0;
state std::vector<Reference<TCTeamInfo>>::iterator teamIt = self->teams.begin();
```
Collaborator

Since there is a wait(yield()) at the bottom now, it's possible other actors mutate self->teams in between yield points of this actor. Will that invalidate the iterator? Previously this was a blocking/sync call, so the iterator never got invalidated, since only this function could mutate self->teams while it was running.

Collaborator

Oh, good point. Given that the number of teams is on the order of hundreds, would it be simpler to just put a single yield() at the top of this function?

Alternatively, put the yield at the top of the loop in serverGetTeamRequests(). I feel yields at the tops of potentially tight loops are a bit more idiomatic.

Collaborator

Yes, assuming O(100) teams and therefore a relatively low cost for a single invocation of getTeamByServers, having a single yield at the top would help. It does smell like the outer loop (serverGetTeamRequests) is not giving time to other tasks and is invoking getTeamByServers back to back, so an explicit yield within getTeamByServers would help break that cycle. This can help the immediate problem of txns timing out in other actors.

Perhaps we can add some counters here to validate the above assumptions?

Collaborator

@spraza Mar 26, 2026

(thinking more)

The above may solve the txn-timeout problem, but perhaps the next problem is that, as a result of a mass storage server migration, the queue in serverGetTeamRequests is quite large (again, we need a counter here). If that were the case, then we have no option but to make the overall processing more efficient. One implementation I can think of:

  • have a yield in the outer loop (in serverGetTeamRequests); maybe not add a big delay, but at least give other tasks in DD time on the run loop in between calls to getTeamByServers
  • keep getTeamByServers sync/blocking
  • but make getTeamByServers O(1) by using a hash map

Collaborator

Untested implementation of the latter approach above: #12861

Contributor Author

@neethuhaneesha Mar 26, 2026

Just an FYI: looks like we have ~3500 teams, and ~5400 getTeamByServers calls need to be done until all the data moves are recovered.

Collaborator

@spraza Mar 26, 2026

Interesting. If I'm reading this correctly, 5400 calls × 3500 teams per call = ~18.9 million getServerIDsStr() calls. That's a lot, and can definitely consume most of the CPU.

With this PR, we'll still be doing 18.9M getServerIDsStr calls, just spread over time, so we'll slowly process the queue. Is the impact there just slow data movement, or do we think it's more than that?

Given the numbers, I think making getTeamByServers O(1) and therefore doing 5400 getServerIDsStr calls would be ideal. Let me start simulation testing against #12861.

Contributor Author

#12861 looks good to me. But we need to test more thoroughly.

Collaborator

@spraza Mar 26, 2026

Yeah, that PR needs more testing. There's the teams vector and then the newly added hash map; they must always be consistent. The vector is only mutated in addTeam or removeTeam as far as I can tell, so it does not seem too risky.

so far the results on that PR are: 20260326-072449-praza-e8c2a8291ccb74a8 compressed=True data_size=35199917 duration=1719043 ended=38366 fail_fast=10 max_runs=500000 pass=38366 priority=100 remaining=3:27:09 runtime=0:17:13 sanity=False started=40128 submitted=20260326-072449 timeout=5400 username=praza

Having said that, maybe we can try this change first (after addressing feedback), which feels less invasive and hopefully addresses the txn timeouts. If data movement is then slow due to large queues and slow processing, we'll have #12861 as a solution.
