Adding a delay in getTeamByServers to prevent CPU monopolization#12865

Open
neethuhaneesha wants to merge 1 commit into apple:release-7.3 from neethuhaneesha:noPushDiff-7.3

Conversation

@neethuhaneesha
Contributor

Replace this text with your description here...

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: d6fec05
  • Duration 0:04:09
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: d6fec05
  • Duration 0:08:49
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: d6fec05
  • Duration 0:08:51
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: d6fec05
  • Duration 0:09:00
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

state Optional<Reference<IDataDistributionTeam>> res;
state std::vector<Reference<TCTeamInfo>>::iterator teamIt;

TraceEvent("GetTeamByServersStart");
Collaborator

How frequently is this being called today? I have the impression it's being called from within an outer loop and possibly hundreds to thousands of times per second. Changing that to once every 30s seems pretty radical.

If we want a lightweight metric to count this, something like this would work:

    static SimpleCounter<int64_t>* counter =
        SimpleCounter<int64_t>::makeCounter("/DD/getTeamByServers");
    counter->increment(1);

(gets logged under SimpleCounters every few minutes and is way less expensive than a TraceEvent per invocation)
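To illustrate why a counter is cheap on a hot path, here is a minimal standalone sketch of the same pattern in plain C++. `NamedCounter` and `getTeamByServersStub` are hypothetical names invented for this example; this is not FoundationDB's `SimpleCounter` implementation, and the sketch is not thread-safe.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a named counter registry (not FoundationDB code).
class NamedCounter {
public:
    // Looks up (or creates) the counter registered under `name`.
    // std::map never relocates its elements, so the returned pointer stays valid.
    static NamedCounter* get(const std::string& name) {
        static std::map<std::string, NamedCounter> registry;
        return &registry[name];
    }

    void increment(int64_t delta) { value_ += delta; }
    int64_t value() const { return value_; }

private:
    int64_t value_ = 0;
};

// Hypothetical hot-path function: the registry lookup happens once, when the
// function-local static is initialized; every later call is one integer add.
void getTeamByServersStub() {
    static NamedCounter* counter = NamedCounter::get("/DD/getTeamByServers");
    counter->increment(1);
}
```

A separate periodic task would read `value()` and log it, so the per-call cost stays constant regardless of how often the function runs.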

Contributor Author

I changed the code to log this event under suppression, so it will print at most once every second.
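The effect of suppression can be sketched in plain C++ as follows. This is only an illustration under assumptions: `EventSuppressor` is a hypothetical helper, not the mechanism FoundationDB's tracing actually uses, and the caller supplies the current time in seconds.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical helper (not FoundationDB code): remembers when each event type
// was last emitted and suppresses repeats that arrive within the interval.
class EventSuppressor {
public:
    explicit EventSuppressor(double intervalSeconds) : interval_(intervalSeconds) {}

    // Returns true if an event of this type should be emitted at time `now`
    // (seconds on any monotonic clock), false if it should be suppressed.
    bool shouldEmit(const std::string& type, double now) {
        auto it = lastEmitted_.find(type);
        if (it != lastEmitted_.end() && now - it->second < interval_) {
            return false; // still inside the suppression window
        }
        lastEmitted_[type] = now;
        return true;
    }

private:
    double interval_;
    std::unordered_map<std::string, double> lastEmitted_;
};
```

With a 1.0 second interval, a function called thousands of times per second emits at most one event per second per event type; the remaining calls pay only for a hash lookup and a comparison.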

teamIt = self->teams.begin();
for (; teamIt != self->teams.end(); ++teamIt) {
if ((*teamIt)->getServerIDsStr() == servers) {
res = *teamIt;
Contributor

Sorry if this is just adding noise, but should there be a yield somewhere in this loop? I'm asking because, if the problem is that this task can monopolize the CPU, it would make sense to let other tasks interrupt execution so that they can make progress.

Contributor Author

If we yield in the middle of the loop, other code could change or update teams while we are suspended. I'm not sure whether that behavior is acceptable.
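One way to reason about this concern is a plain C++ sketch that scans by index instead of by iterator and re-reads the container size on every iteration. Everything here is hypothetical: `Team`, `findTeam`, `batch`, and `yieldFn` are names invented for this example, and `yieldFn` only models a point where other code may run; it is not Flow's yield. The sketch avoids dereferencing a stale iterator, but it does not make the scan consistent: if teams are inserted or removed during a yield, a match can still be skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a team record.
struct Team {
    std::string serverIDs;
};

// Scans `teams` for a matching server-ID string, calling `yieldFn` after every
// `batch` elements. The loop indexes into the vector and re-checks size() each
// iteration, so it stays within bounds even if `yieldFn` mutates `teams`.
std::shared_ptr<Team> findTeam(std::vector<std::shared_ptr<Team>>& teams,
                               const std::string& servers,
                               std::size_t batch,
                               const std::function<void()>& yieldFn) {
    for (std::size_t i = 0; i < teams.size(); ++i) {
        if (teams[i]->serverIDs == servers) {
            return teams[i]; // the shared_ptr copy keeps the team alive
        }
        if (batch != 0 && (i + 1) % batch == 0) {
            yieldFn(); // other code may change `teams` here
        }
    }
    return nullptr;
}
```

Whether a possibly stale or missed result is acceptable depends on what the callers expect, which is the open question in this thread.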

4 participants