Fix inaccessible job store errors for grid batch systems (#5007) by annagiroti · Pull Request #5520 · DataBiosphere/toil

annagiroti · 2026-05-19T05:25:50Z

Grid batch systems (Slurm, LSF, etc.) now raise an error upfront when no --jobStore is specified, directing users to provide a shared filesystem path. Workers that cannot access the job store now exit with a sentinel code, prompting a clear warning from the leader instead of a confusing traceback.

Resolves #5007

Changelog Entry

To be copied to the draft changelog by merger:

PR submitter writes their recommendation for a changelog entry here

Reviewer Checklist

Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
- If it is coming from an external repo, make sure to pull it in for CI with:
```
contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
```
- If there is no associated issue, create one.
Read through the code changes. Make sure that it doesn't have:
- Addition of trailing whitespace.
- New variable or member names in camelCase that want to be in snake_case.
- New functions without type hints.
- New functions or classes without informative docstrings.
- Changes to semantics not reflected in the relevant docstrings.
- New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
- New features without tests.
Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
Finish the review with an overall description of your opinion.

Merger Checklist

Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
Merge with the Github "Squash and merge" feature.
- If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
Copy its recommended changelog entry to the Draft Changelog.
Append the issue number in parentheses to the changelog entry.

…stems

…messaging

adamnovak

This looks pretty good, but I think the tests need some rationalization.

adamnovak · 2026-05-19T20:16:47Z

+    def test_decoration_included_when_no_local_suggestion(self):
+        """Decoration string should appear in the auto-generated fallback path when no local suggestion is given."""
+        from toil.jobStores.utils import generate_locator
+        result = generate_locator("file", local_suggestion=None, decoration="wdl")
+        self.assertIn("wdl", result)


This is not a test for generate_default_job_store, but the class claims to be all tests for generate_default_job_store.

This could be bumped up a level and named test_generate_locator_includes_decoration_when_no_local_suggestion. Or maybe test_generate_locator_includes_decoration_without_local_suggestion. Or if we're only going to have the one it could just be test_generate_locator_includes_decoration.

adamnovak · 2026-05-19T20:25:51Z

+class GenerateDefaultJobStoreTest(ToilTest):
+    """Tests for generate_default_job_store in toil.jobStores.utils."""
+
+    def setUp(self):
+        super().setUp()
+        from toil.lib.io import mkdtemp
+        self.local_dir = mkdtemp()
+
+    def tearDown(self):
+        super().tearDown()
+        shutil.rmtree(self.local_dir, ignore_errors=True)


Probably we want this to be a new-style test that uses tmp_path: Path as an argument to get a temp directory fixture, rather than an old-style test extending ToilTest.

If we do extend ToilTest, we'd want to use its methods for getting temp directories in tests.

But in neither case do we want to add new code to set up and tear down temp directories.

adamnovak · 2026-05-19T20:34:27Z

+    def test_slurm_raises_no_available_job_store(self):
+        """Slurm batch system should raise NoAvailableJobStoreException."""
+        from toil.jobStores.utils import generate_default_job_store, NoAvailableJobStoreException
+        with self.assertRaises(NoAvailableJobStoreException):
+            generate_default_job_store("slurm", None, self.local_dir)
+
+    def test_slurm_error_mentions_shared_filesystem(self):
+        """Slurm error message should tell the user to use a shared filesystem."""
+        from toil.jobStores.utils import generate_default_job_store, NoAvailableJobStoreException
+        with self.assertRaises(NoAvailableJobStoreException) as cm:
+            generate_default_job_store("slurm", None, self.local_dir)
+        self.assertIn("shared", str(cm.exception).lower())
+        self.assertIn("--jobStore", str(cm.exception))


Does it really make sense to have one test to check that the error happens and a different test to make sure that the error happens with the right message text? It's not possible for the second test to pass if the first test fails.

adamnovak · 2026-05-19T20:58:16Z

+        self.assertIn("shared", str(cm.exception).lower())
+        self.assertIn("--jobStore", str(cm.exception))
+
+    def test_grid_engine_batch_systems_raise_no_available_job_store(self):


I don't think the test function names here are right.

Mostly in Toil we use short and not-so-descriptive test names; the test class might be JobStoreGenerationTest and we'd have things like test_local(), test_slurm(), etc.

But that's not really unit testing; that's writing one test per situation when really you want one test per thing that needs to be true about the unit under test. See for example this discussion on whether tests should always contain exactly one assert.

But then if you're taking a more proper unit testing approach, you want your functions to be named something like test_<function>_<scenario>_<outcome>. (You also might not put them in class if the thing they are testing is not itself a class? But also it makes sense to put them in a class to group them.)

The names here are sort of test_<scenario>_<outcome>, I think, except when they aren't. test_decoration_included_when_no_local_suggestion is test_<outcome>_<scenario>, for example. I think that might actually be a more sensible order when the <function> has to be implied by being in the test class, because you are not awkwardly missing your subject noun phrase in the name.

The test_<scenario>_<outcome> names can kind of read like they might be test_<unit>_<outcome>, which I find confusing: test_single_machine_returns_local_path parses to me like "This is a test that makes sure that single machine (whatever that is) returns a local path". And test_grid_engine_batch_systems_raise_no_available_job_store has "raise" instead of "raises", which also makes it sound like the gird engine batch systems, and not the generate_default_job_store unit from the enclosing class, is what's meant to be doing the raising.

I think the test functions should be harmonized to use one naming convention.

adamnovak · 2026-05-19T21:09:27Z

+        jobstore = mkdtemp(prefix="toil-wdl-", dir=os.getcwd())
+        os.rmdir(jobstore)


I think the file job store logic insists on actually creating the directory, which means that if someone malicious can write to the directory we're standing in, they can cause our run to fail by creating the directory themselves between when we delete it and when we go to re-create it, but they can't force us to use a directory they provide full of malicious files.

But it still would be good to eventually replace the notion of passing a suggested directory to use as the job store with instead an approach where we pass a suggested directory to put the job store in. We could also use a high-entropy UUID that is not created as a directory until we actually go to use it for a job store.

adamnovak · 2026-05-19T22:20:04Z

@annagiroti Can you also fill in the proposed changelog entry for this?

annagiroti added 3 commits May 14, 2026 10:56

Fix default job store selection and error messaging for grid batch sy…

c959267

…stems

Add tests for grid batch system job store selection and worker error …

085cf68

…messaging

Add sentinel exit code for inaccessible job store and leader warning

bb70d14

annagiroti requested a review from adamnovak May 19, 2026 05:25

adamnovak requested changes May 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix inaccessible job store errors for grid batch systems (#5007)#5520

Fix inaccessible job store errors for grid batch systems (#5007)#5520
annagiroti wants to merge 3 commits into
masterfrom
issues/5007-slurm-default-jobstore-fix

annagiroti commented May 19, 2026

Uh oh!

adamnovak left a comment

Uh oh!

adamnovak May 19, 2026

Uh oh!

adamnovak May 19, 2026

Uh oh!

adamnovak May 19, 2026

Uh oh!

adamnovak May 19, 2026

Uh oh!

adamnovak May 19, 2026

Uh oh!

adamnovak commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		jobstore = mkdtemp(prefix="toil-wdl-", dir=os.getcwd())
		os.rmdir(jobstore)

Conversation

annagiroti commented May 19, 2026

Changelog Entry

Reviewer Checklist

Merger Checklist

Uh oh!

adamnovak left a comment

Choose a reason for hiding this comment

Uh oh!

adamnovak May 19, 2026

Choose a reason for hiding this comment

Uh oh!

adamnovak May 19, 2026

Choose a reason for hiding this comment

Uh oh!

adamnovak May 19, 2026

Choose a reason for hiding this comment

Uh oh!

adamnovak May 19, 2026

Choose a reason for hiding this comment

Uh oh!

adamnovak May 19, 2026

Choose a reason for hiding this comment

Uh oh!

adamnovak commented May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants