Fix hot polling bug and add graceful AGENT_TIMEOUT support#124
Merged
saintstack merged 2 commits into FoundationDB:main on Jan 8, 2026
Conversation
- Fix should_run_ensemble to check completed runs instead of started count
- Add AGENT_TIMEOUT environment variable support with graceful shutdown
- Add timestamped logging for better debugging
johscheuer (Member) reviewed on Jan 7, 2026 and left a comment:
One comment about the redundant import.
```python
def log(outputText, newline=True):
    import datetime
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    message = f"[{timestamp}] {outputText}"
```
Member
Is there a reason not to use the logging package? That would add the timestamp. Probably more refactoring?
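For context, a minimal sketch of the approach the reviewer is suggesting: the standard `logging` package can prepend the timestamp itself via its format string, so a manual `log()` helper would not need to build one (the format shown matches the strftime pattern used in the diff; the call site is illustrative, not actual joshua code):

```python
import logging

# logging adds the timestamp via asctime/datefmt, replacing the
# manual datetime.now().strftime(...) call in log().
logging.basicConfig(
    format="[%(asctime)s] %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)

# emits a line like: [2026-01-07 12:00:00] agent started
logging.info("agent started")
```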
Contributor
Author
Yeah, that would be better but was thinking minimal change. Should I go for it?
Member
We could do that in another PR if we think it's worth to refactor :)
Contributor
Author
Addressed review comment (and added a few timestamps to agent-scaler.sh outputs)
johscheuer approved these changes on Jan 7, 2026
This was an interesting one. Some of the nightly jobs were failing: they were 'timing out' running joshua tests. In the nightly report, the jobs would usually show as 'in progress' and then, much later in the morning, report that the joshua test runs had timed out with NO results. Looking at the joshua cluster that runs the nightlies, joshua-agent pod counts had us pegged near the 10k limit. Poking around, it seemed odd... the pods didn't seem to be doing anything.
Turns out ensembles stick around in the database for a while after they finish -- 7 days -- but some of the ensembles, even though they were 'done', would report that they were still 'alive' ("Can run: True"). This happened when the started count fell below the max_runs count, which can occur when an agent dies for whatever reason (there are many): when an agent dies, the started count is decremented during cleanup, and the should_run_ensemble method in joshua_model.py would then return 'True'.
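A minimal sketch of the failure mode described above and of the fix in this PR (the function bodies and parameter names here are assumptions distilled from the description, not the actual joshua_model.py code):

```python
def should_run_ensemble_old(started: int, max_runs: int) -> bool:
    # Buggy check: compares runs *started* against the cap. When an agent
    # dies, cleanup decrements `started`, so a finished ensemble can fall
    # back under the cap and report it still has work.
    return started < max_runs


def should_run_ensemble_fixed(completed: int, max_runs: int) -> bool:
    # Fixed check: compare runs *completed* against the cap. Completed
    # runs are never decremented, so a 'done' ensemble stays done.
    return completed < max_runs


# The failure mode: 10 runs requested and all completed, but two agents
# died during cleanup and the started count was decremented to 8.
print(should_run_ensemble_old(started=8, max_runs=10))      # True  (the bug)
print(should_run_ensemble_fixed(completed=10, max_runs=10)) # False (done stays done)
```

With the old check, every 'done' ensemble in this state kept answering "Can run: True", so agents and the scheduler hot-polled it for work that did not exist.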
The cluster had 70-odd ensembles when I went to look at it, mostly left over from December 30/31st. These were reporting they still had work to be done, so agents and the scheduler would keep asking the ensembles for work... but there was none. This happened thousands of times, and the behavior kept the pod count elevated.
joshua_model.py is used by two images, joshua-agent and agent-scaler. I deployed both with the fixes here to see how they do. With the fixes deployed, the problem ensembles are no longer reporting 'True' out of should_run_ensemble for the old ensembles queued last year.
While in here, updated logging to include a timestamp and now read a timeout environment variable (AGENT_TIMEOUT) that was previously ignored (nit).
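A sketch of how an agent loop might honor that timeout, assuming AGENT_TIMEOUT holds a duration in seconds (the loop structure, the "0 means no timeout" default, and the `poll_fn` helper are hypothetical, not the actual joshua-agent code):

```python
import os
import time


def agent_loop(poll_fn, poll_interval: float = 1.0) -> None:
    """Poll for work until AGENT_TIMEOUT (seconds) elapses, then exit
    gracefully between runs instead of being killed mid-test."""
    # Assumption for this sketch: unset or "0" means "run forever".
    timeout = float(os.environ.get("AGENT_TIMEOUT", "0"))
    deadline = time.monotonic() + timeout if timeout > 0 else None

    while deadline is None or time.monotonic() < deadline:
        work = poll_fn()
        if work is None:
            time.sleep(poll_interval)  # back off instead of hot polling
            continue
        work()  # run one unit of work, then re-check the deadline
    print("AGENT_TIMEOUT reached; shutting down gracefully")
```

Checking the deadline only between work items is what makes the shutdown graceful: a test run in progress finishes before the agent exits.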