Fix inaccessible job store errors for grid batch systems (#5007)#5520
Fix inaccessible job store errors for grid batch systems (#5007)#5520annagiroti wants to merge 3 commits into
Conversation
adamnovak
left a comment
There was a problem hiding this comment.
This looks pretty good, but I think the tests need some rationalization.
| def test_decoration_included_when_no_local_suggestion(self): | ||
| """Decoration string should appear in the auto-generated fallback path when no local suggestion is given.""" | ||
| from toil.jobStores.utils import generate_locator | ||
| result = generate_locator("file", local_suggestion=None, decoration="wdl") | ||
| self.assertIn("wdl", result) |
There was a problem hiding this comment.
This is not a test for generate_default_job_store, but the class claims to be all tests for generate_default_job_store.
This could be bumped up a level and named test_generate_locator_includes_decoration_when_no_local_suggestion. Or maybe test_generate_locator_includes_decoration_without_local_suggestion. Or if we're only going to have the one it could just be test_generate_locator_includes_decoration.
| class GenerateDefaultJobStoreTest(ToilTest): | ||
| """Tests for generate_default_job_store in toil.jobStores.utils.""" | ||
|
|
||
| def setUp(self): | ||
| super().setUp() | ||
| from toil.lib.io import mkdtemp | ||
| self.local_dir = mkdtemp() | ||
|
|
||
| def tearDown(self): | ||
| super().tearDown() | ||
| shutil.rmtree(self.local_dir, ignore_errors=True) |
There was a problem hiding this comment.
Probably we want this to be a new-style test that uses tmp_path: Path as an argument to get a temp directory fixture, rather than an old-style test extending ToilTest.
If we do extend ToilTest, we'd want to use its methods for getting temp directories in tests.
But in neither case do we want to add new code to set up and tear down temp directories.
| def test_slurm_raises_no_available_job_store(self): | ||
| """Slurm batch system should raise NoAvailableJobStoreException.""" | ||
| from toil.jobStores.utils import generate_default_job_store, NoAvailableJobStoreException | ||
| with self.assertRaises(NoAvailableJobStoreException): | ||
| generate_default_job_store("slurm", None, self.local_dir) | ||
|
|
||
| def test_slurm_error_mentions_shared_filesystem(self): | ||
| """Slurm error message should tell the user to use a shared filesystem.""" | ||
| from toil.jobStores.utils import generate_default_job_store, NoAvailableJobStoreException | ||
| with self.assertRaises(NoAvailableJobStoreException) as cm: | ||
| generate_default_job_store("slurm", None, self.local_dir) | ||
| self.assertIn("shared", str(cm.exception).lower()) | ||
| self.assertIn("--jobStore", str(cm.exception)) |
There was a problem hiding this comment.
Does it really make sense to have one test to check that the error happens and a different test to make sure that the error happens with the right message text? It's not possible for the second test to pass if the first test fails.
| self.assertIn("shared", str(cm.exception).lower()) | ||
| self.assertIn("--jobStore", str(cm.exception)) | ||
|
|
||
| def test_grid_engine_batch_systems_raise_no_available_job_store(self): |
There was a problem hiding this comment.
I don't think the test function names here are right.
Mostly in Toil we use short and not-so-descriptive test names; the test class might be JobStoreGenerationTest and we'd have things like test_local(), test_slurm(), etc.
But that's not really unit testing; that's writing one test per situation when really you want one test per thing that needs to be true about the unit under test. See for example this discussion on whether tests should always contain exactly one assert.
But then if you're taking a more proper unit testing approach, you want your functions to be named something like test_<function>_<scenario>_<outcome>. (You also might not put them in class if the thing they are testing is not itself a class? But also it makes sense to put them in a class to group them.)
The names here are sort of test_<scenario>_<outcome>, I think, except when they aren't. test_decoration_included_when_no_local_suggestion is test_<outcome>_<scenario>, for example. I think that might actually be a more sensible order when the <function> has to be implied by being in the test class, because you are not awkwardly missing your subject noun phrase in the name.
The test_<scenario>_<outcome> names can kind of read like they might be test_<unit>_<outcome>, which I find confusing: test_single_machine_returns_local_path parses to me like "This is a test that makes sure that single machine (whatever that is) returns a local path". And test_grid_engine_batch_systems_raise_no_available_job_store has "raise" instead of "raises", which also makes it sound like the gird engine batch systems, and not the generate_default_job_store unit from the enclosing class, is what's meant to be doing the raising.
I think the test functions should be harmonized to use one naming convention.
| jobstore = mkdtemp(prefix="toil-wdl-", dir=os.getcwd()) | ||
| os.rmdir(jobstore) |
There was a problem hiding this comment.
I think the file job store logic insists on actually creating the directory, which means that if someone malicious can write to the directory we're standing in, they can cause our run to fail by creating the directory themselves between when we delete it and when we go to re-create it, but they can't force us to use a directory they provide full of malicious files.
But it still would be good to eventually replace the notion of passing a suggested directory to use as the job store with instead an approach where we pass a suggested directory to put the job store in. We could also use a high-entropy UUID that is not created as a directory until we actually go to use it for a job store.
|
@annagiroti Can you also fill in the proposed changelog entry for this? |
Grid batch systems (Slurm, LSF, etc.) now raise an error upfront when no
--jobStoreis specified, directing users to provide a shared filesystem path. Workers that cannot access the job store now exit with a sentinel code, prompting a clear warning from the leader instead of a confusing traceback.Resolves #5007
Changelog Entry
To be copied to the draft changelog by merger:
Reviewer Checklist
issues/XXXX-fix-the-thingin the Toil repo, or from an external repo.camelCasethat want to be insnake_case.docs/running/{cliOptions,cwl,wdl}.rstMerger Checklist