Skip to content

Fix inaccessible job store errors for grid batch systems (#5007)#5520

Open
annagiroti wants to merge 3 commits into
masterfrom
issues/5007-slurm-default-jobstore-fix
Open

Fix inaccessible job store errors for grid batch systems (#5007)#5520
annagiroti wants to merge 3 commits into
masterfrom
issues/5007-slurm-default-jobstore-fix

Conversation

@annagiroti
Copy link
Copy Markdown
Collaborator

Grid batch systems (Slurm, LSF, etc.) now raise an error upfront when no --jobStore is specified, directing users to provide a shared filesystem path. Workers that cannot access the job store now exit with a sentinel code, prompting a clear warning from the leader instead of a confusing traceback.

Resolves #5007

Changelog Entry

To be copied to the draft changelog by merger:

  • PR submitter writes their recommendation for a changelog entry here

Reviewer Checklist

  • Make sure it is coming from issues/XXXX-fix-the-thing in the Toil repo, or from an external repo.
    • If it is coming from an external repo, make sure to pull it in for CI with:
      contrib/admin/test-pr otheruser theirbranchname issues/XXXX-fix-the-thing
      
    • If there is no associated issue, create one.
  • Read through the code changes. Make sure that it doesn't have:
    • Addition of trailing whitespace.
    • New variable or member names in camelCase that want to be in snake_case.
    • New functions without type hints.
    • New functions or classes without informative docstrings.
    • Changes to semantics not reflected in the relevant docstrings.
    • New or changed command line options for Toil workflows that are not reflected in docs/running/{cliOptions,cwl,wdl}.rst
    • New features without tests.
  • Comment on the lines of code where problems exist with a review comment. You can shift-click the line numbers in the diff to select multiple lines.
  • Finish the review with an overall description of your opinion.

Merger Checklist

  • Make sure the PR passed tests, including the Gitlab tests, for the most recent commit in its branch.
  • Make sure the PR has been reviewed. If not, review it. If it has been reviewed and any requested changes seem to have been addressed, proceed.
  • Merge with the Github "Squash and merge" feature.
    • If there are multiple authors' commits, add Co-authored-by to give credit to all contributing authors.
  • Copy its recommended changelog entry to the Draft Changelog.
  • Append the issue number in parentheses to the changelog entry.

@annagiroti annagiroti requested a review from adamnovak May 19, 2026 05:25
Copy link
Copy Markdown
Member

@adamnovak adamnovak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good, but I think the tests need some rationalization.

Comment on lines +1885 to +1889
def test_decoration_included_when_no_local_suggestion(self):
"""Decoration string should appear in the auto-generated fallback path when no local suggestion is given."""
from toil.jobStores.utils import generate_locator
result = generate_locator("file", local_suggestion=None, decoration="wdl")
self.assertIn("wdl", result)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a test for generate_default_job_store, but the class claims to be all tests for generate_default_job_store.

This could be bumped up a level and named test_generate_locator_includes_decoration_when_no_local_suggestion. Or maybe test_generate_locator_includes_decoration_without_local_suggestion. Or if we're only going to have the one it could just be test_generate_locator_includes_decoration.

Comment on lines +1838 to +1848
class GenerateDefaultJobStoreTest(ToilTest):
"""Tests for generate_default_job_store in toil.jobStores.utils."""

def setUp(self):
super().setUp()
from toil.lib.io import mkdtemp
self.local_dir = mkdtemp()

def tearDown(self):
super().tearDown()
shutil.rmtree(self.local_dir, ignore_errors=True)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we want this to be a new-style test that uses tmp_path: Path as an argument to get a temp directory fixture, rather than an old-style test extending ToilTest.

If we do extend ToilTest, we'd want to use its methods for getting temp directories in tests.

But in neither case do we want to add new code to set up and tear down temp directories.

Comment on lines +1862 to +1874
def test_slurm_raises_no_available_job_store(self):
"""Slurm batch system should raise NoAvailableJobStoreException."""
from toil.jobStores.utils import generate_default_job_store, NoAvailableJobStoreException
with self.assertRaises(NoAvailableJobStoreException):
generate_default_job_store("slurm", None, self.local_dir)

def test_slurm_error_mentions_shared_filesystem(self):
"""Slurm error message should tell the user to use a shared filesystem."""
from toil.jobStores.utils import generate_default_job_store, NoAvailableJobStoreException
with self.assertRaises(NoAvailableJobStoreException) as cm:
generate_default_job_store("slurm", None, self.local_dir)
self.assertIn("shared", str(cm.exception).lower())
self.assertIn("--jobStore", str(cm.exception))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it really make sense to have one test to check that the error happens and a different test to make sure that the error happens with the right message text? It's not possible for the second test to pass if the first test fails.

self.assertIn("shared", str(cm.exception).lower())
self.assertIn("--jobStore", str(cm.exception))

def test_grid_engine_batch_systems_raise_no_available_job_store(self):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the test function names here are right.

Mostly in Toil we use short and not-so-descriptive test names; the test class might be JobStoreGenerationTest and we'd have things like test_local(), test_slurm(), etc.

But that's not really unit testing; that's writing one test per situation when really you want one test per thing that needs to be true about the unit under test. See for example this discussion on whether tests should always contain exactly one assert.

But then if you're taking a more proper unit testing approach, you want your functions to be named something like test_<function>_<scenario>_<outcome>. (You also might not put them in class if the thing they are testing is not itself a class? But also it makes sense to put them in a class to group them.)

The names here are sort of test_<scenario>_<outcome>, I think, except when they aren't. test_decoration_included_when_no_local_suggestion is test_<outcome>_<scenario>, for example. I think that might actually be a more sensible order when the <function> has to be implied by being in the test class, because you are not awkwardly missing your subject noun phrase in the name.

The test_<scenario>_<outcome> names can kind of read like they might be test_<unit>_<outcome>, which I find confusing: test_single_machine_returns_local_path parses to me like "This is a test that makes sure that single machine (whatever that is) returns a local path". And test_grid_engine_batch_systems_raise_no_available_job_store has "raise" instead of "raises", which also makes it sound like the gird engine batch systems, and not the generate_default_job_store unit from the enclosing class, is what's meant to be doing the raising.

I think the test functions should be harmonized to use one naming convention.

Comment thread src/toil/wdl/wdltoil.py
Comment on lines +6100 to +6101
jobstore = mkdtemp(prefix="toil-wdl-", dir=os.getcwd())
os.rmdir(jobstore)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the file job store logic insists on actually creating the directory, which means that if someone malicious can write to the directory we're standing in, they can cause our run to fail by creating the directory themselves between when we delete it and when we go to re-create it, but they can't force us to use a directory they provide full of malicious files.

But it still would be good to eventually replace the notion of passing a suggested directory to use as the job store with instead an approach where we pass a suggested directory to put the job store in. We could also use a high-entropy UUID that is not created as a directory until we actually go to use it for a job store.

@adamnovak
Copy link
Copy Markdown
Member

@annagiroti Can you also fill in the proposed changelog entry for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

It's hard to figure out that a Slurm WDL workflow is failing because the default job store isn't on a shared filesystem

2 participants