Fix hot polling bug and add graceful AGENT_TIMEOUT support#124
Merged
saintstack merged 2 commits into FoundationDB:main on Jan 8, 2026
Conversation
- Fix should_run_ensemble to check completed runs instead of started count
- Add AGENT_TIMEOUT environment variable support with graceful shutdown
- Add timestamped logging for better debugging
johscheuer (Member) reviewed on Jan 7, 2026 and left a comment:
One comment about the redundant import.
```python
def log(outputText, newline=True):
    import datetime
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    message = f"[{timestamp}] {outputText}"
```
Member
Is there a reason not to use the logging package? That would add the timestamp. Probably more refactoring?
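For context, a minimal sketch of the approach the reviewer is suggesting: the standard `logging` package can prepend the timestamp itself via its format string, so a manual `log()` helper would not need to build one (the format shown matches the strftime pattern used in the diff; the call site is illustrative, not actual joshua code):

```python
import logging

# logging adds the timestamp via asctime/datefmt, replacing the
# manual datetime.now().strftime(...) call in log().
logging.basicConfig(
    format="[%(asctime)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)

# emits a line like: [2026-01-07 12:00:00] agent started
logging.info("agent started")
```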
Contributor
Author
Yeah, that would be better but was thinking minimal change. Should I go for it?
Member
We could do that in another PR if we think it's worth to refactor :)
Contributor
Author
Addressed review comment (and added a few timestamps to agent-scaler.sh outputs)
johscheuer approved these changes on Jan 7, 2026
This was an interesting one. Some of the nightly jobs were failing: they were 'timing out' running joshua tests. In the nightly report, the jobs would usually show as 'in progress' and then, much later in the morning, report that the joshua test runs had timed out with NO results. Looking at the joshua cluster that runs the nightlies, joshua-agent pod counts had us pegged near the 10k limit. Poking around, it seemed odd... the pods didn't seem to be doing anything.
Turns out ensembles stick around in the database for a while after they finish -- 7 days -- but some of the ensembles, even though they were 'done', would report that they were still 'alive' ("Can run: True"). This happened when the started count fell below the max_runs count, which can occur when an agent dies for whatever reason (there are many): when an agent dies, the started count is decremented during cleanup, and the should_run_ensemble method in joshua_model.py would then return 'True'.
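A minimal sketch of the failure mode described above and of the fix in this PR (the function bodies and parameter names here are assumptions distilled from the description, not the actual joshua_model.py code):

```python
def should_run_ensemble_old(started: int, max_runs: int) -> bool:
    # Buggy check: compares runs *started* against the cap. When an agent
    # dies, cleanup decrements `started`, so a finished ensemble can fall
    # back under the cap and report it still has work.
    return started < max_runs


def should_run_ensemble_fixed(completed: int, max_runs: int) -> bool:
    # Fixed check: compare runs *completed* against the cap. Completed
    # runs are never decremented, so a 'done' ensemble stays done.
    return completed < max_runs


# The failure mode: 10 runs requested and all completed, but two agents
# died during cleanup and the started count was decremented to 8.
print(should_run_ensemble_old(started=8, max_runs=10))      # True  (the bug)
print(should_run_ensemble_fixed(completed=10, max_runs=10)) # False (done stays done)
```

With the old check, every 'done' ensemble in this state kept answering "Can run: True", so agents and the scheduler hot-polled it for work that did not exist.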
The cluster had 70-odd ensembles when I went to look at it, mostly left over from December 30/31st. These were reporting they still had work to be done, so agents and the scheduler would keep asking the ensembles for work... but there was none. This happened thousands of times, and the behavior kept the pod count elevated.
joshua_model.py is used by two images, joshua-agent and agent-scaler. I deployed both with the fixes here to see how they do. With the fixes deployed, the problem ensembles are no longer reporting 'True' out of should_run_ensemble for the old ensembles queued last year.
While in here, updated logging to include a timestamp and now read a timeout environment variable (AGENT_TIMEOUT) that was previously ignored (nit).
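A sketch of how an agent loop might honor that timeout, assuming AGENT_TIMEOUT holds a duration in seconds (the loop structure, the "0 means no timeout" default, and the `poll_fn` helper are hypothetical, not the actual joshua-agent code):

```python
import os
import time


def agent_loop(poll_fn, poll_interval: float = 1.0) -> None:
    """Poll for work until AGENT_TIMEOUT (seconds) elapses, then exit
    gracefully between runs instead of being killed mid-test."""
    # Assumption for this sketch: unset or "0" means "run forever".
    timeout = float(os.environ.get("AGENT_TIMEOUT", "0"))
    deadline = time.monotonic() + timeout if timeout > 0 else None

    while deadline is None or time.monotonic() < deadline:
        work = poll_fn()
        if work is None:
            time.sleep(poll_interval)  # back off instead of hot polling
            continue
        work()  # run one unit of work, then re-check the deadline
    print("AGENT_TIMEOUT reached; shutting down gracefully")
```

Checking the deadline only between work items is what makes the shutdown graceful: a test run in progress finishes before the agent exits.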