Heavily refactor test suite (with Claude) by Alasdair · Pull Request #1640 · rems-project/sail

Alasdair · 2026-03-09T18:54:55Z

This PR was mostly a weekend learning project to try to figure out for myself the capabilities of AI tools (Claude Code in this case), seeing as everyone is talking/hyped about them and it seems foolish to have a strong opinion without first hand experience.

Here's my (entirely non-AI summary):

I wanted to remove the duplication of the batch tests/fork loop in each test script. To do this I created a SailTest class in sailtest.py. This class has a run_tests method that handles the main loop. Subclasses override the abstract run method which and can then invoke run_tests as many times as they need. Each time they pass a function to run_tests that is run in the child subprocess and calls step to execute the test actions as before. For example a simplified version of the type check pass script:

@suite("typecheck.pass", work_dir="tcpass")
class TypecheckPassTests(SailTest):
    def run(self):
        self.banner("Testing passing programs")
        self.run_tests(
            "pass",
            Batcher(_PASS_DIR),
            self._test,
            skip_set=skip_pass,
        )

    def _test(self, test):
        test.copy_filename()
        step(
            f"'{self.sail}' --no-memo-z3 --just-check --strict-bitvector"
            f" --ddump-tc-ast {test.filename} 1> _rtpass/{test.filename}"
        )
        # more steps...

This is the end result after many steps (as you can see from the number of commits in this PR)

I created a single 'entry' point for the tests runner.py, so rather than running test/typecheck/run_tests.py. You now do test/suites/runner.py -s typecheck. There is now some hierarchy, so you can also do -s typecheck.pass. All the python code was moved into a single directory test/suites. Each 'suite' now registers itself using a python decorator. You can use inheritance to share behaviour. The exec tests (renamed from c) demonstrate this quite well, as there is the root SailTest class, then an _ExecCBase class, before the various C and C++ options. Using the decorators was my idea, Claude started doing something really ugly, scanning the file system for python files, loading them and looking for SailTest children dynamically - I had to tell it to use the decorator and registry pattern.
Separating the code that runs the tests from the test data itself was a nice start, but I also wanted to separate away the output too. Claude actually didn't want to do this - and it turned out it mostly couldn't without making a huge mess of things. Each time runner.py is invoked it generates a new temporary directory and makes a symlink to that in test/_runs. All test output is captured in that directory, with a separate folder for each 'suite'. The thing Claude wasn't able to do was update each script to actually work in that directory, even though it set up most of the infrastructure. I had to port each test suite by hand. Perhaps after doing it myself I could have prompted Claude better, but this was one point where I tried to make it do a 'big step' in the refactoring and it couldn't handle it. The reason this is nice is it makes it much easier to inspect the intermediate outputs of each test, whereas previously because the tests and build artefacts were all mixed we needed to do things like 'rm' intermediate steps to clean up between runs.

I might see if rebasing this PR into a more logical series of commits is something the AI can do. I haven't looked at updating the GitHub actions set so this needs to be a draft PR until that is fixed.

Add SailTest to sailtest.py with a run_tests() method that encapsulates the fork/collect loop, skip handling, expected failures, and XML writing. Each run_tests.py now subclasses SailTest and overrides run(). Also removes unused hashlib imports, dead return statements, and inconsistent print_ok usage throughout. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

@AbstractMethod

- Use os.environ.get() instead of try/except for env vars - Use subprocess.run() with capture_output/text instead of Popen - Use list accumulation instead of string concatenation for XML - Replace re.match with str.endswith() - Use _make_chunks() factory to eliminate three near-identical functions - Compute _parallel_count once at module level - Make SailTest(ABC) with @AbstractMethod - Use f-strings consistently Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

- banner() and get_targets() become public methods (called from subclass run()) - _print_ok() and _print_skip() become private methods (used only by run_tests()) - _get_sail_dir() moves into the class; get_sail() is inlined in __init__ - Remove unused module-level is_compact() - Update all run_tests.py subclasses to use self.banner() and self.get_targets() Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Move banner() into _run_c_tests() in c/run_tests.py so it reflects the actual c_opts, sail_opts and valgrind parameters. Use shared variables for opts in sv/run_tests.py and ocaml/run_tests.py. Loop over (name, sail_opts) pairs in builtins/run_tests.py. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

- Replace all .format() calls with f-strings - Remove import re from smt and typecheck; use str.endswith() instead - Inline libpaths computation in mono/_test() using f-strings, eliminating the deferred-format template string pattern - Also fix lem banner to interpolate opts like other test scripts - Also refactor coq run() to loop over libs, extracting _have_bbv() Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

The exec tests are generic end-to-end tests used by multiple backends (C, C++, SystemVerilog, Lean, Coq, Lem, OCaml interpreter), not just the C/C++ targets. Renaming to test/exec better reflects their purpose. Also updates the Makefile target (c-tests -> exec-tests), .gitignore, and all references in shell scripts and GitHub Actions workflows. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

run_tests() now takes an explicit testdir keyword argument. Before calling the test function in each forked child, the child os.chdir()s to testdir. The parent also temporarily chdirs to testdir while running the chunks_fn, so predicates like os.path.isdir() resolve correctly. Each suite file now: - Removes the os.chdir(mydir) startup coupling - Defines _SUITE_DIR via os.path.abspath(__file__) as its own location - Uses absolute paths for cross-suite directory references (lem, coq, sv, lean all read test data from other directories) - Passes testdir= to every run_tests() call - Passes xml_dir=_SUITE_DIR to main() to preserve per-suite tests.xml location The lean suite additionally removes the os.chdir(f"../{subdir}") inside _make_test() (redundant since the child is already at testdir) and updates _get_support_lib() to use absolute paths. No behaviour change: all suite results are identical to before. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

All run_tests.py scripts are moved from their individual test data directories (test/<name>/run_tests.py) to a dedicated test/suites/ directory (test/suites/<name>.py). Each suite file uses absolute path constants derived from __file__ so it works from any invocation location. Shell scripts, Makefile, CI config, and CLAUDE.md updated to reference the new paths. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Replace the module-level chunks/directory_chunks/project_chunks functions and the filenames + chunks_fn arguments to run_tests() with a single Batcher class. Batcher encapsulates a directory and a predicate, and exposes a batch(cores) method that lists the directory, filters by the predicate, and splits the results into parallel batches. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

The repo root sail script invokes the dev build with plugins loaded and SAIL_DIR set automatically, so SAIL_DIR no longer needs to be specified explicitly. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

- Rename color -> Color (PEP 8) - Fix Batcher.batch() always appending an empty trailing batch - Replace deprecated datetime.utcnow() with datetime.now(timezone.utc) Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Batcher.batch() now calls predicates with os.path.join(self.directory, filename) rather than bare filenames, so predicates like os.path.isdir work without relying on the current working directory. The saved_cwd / os.chdir dance inside batch() is removed entirely. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

The new flag accepts a number and overrides TEST_PAR, subsuming --seq (use -j 1 to run sequentially). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

All Python files are now in one directory. The sys.path.insert lines in each suite that added the parent directory to the path are no longer needed and have been removed. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

runner.py dispatches to a named suite via -s/--suite, e.g. `test/suites/runner.py -s lexing`. It owns argument parsing (deferred from sailtest.py) and dynamically imports the suite module, finds the SailTest subclass, and calls main(). Suite files are now pure modules: the top-level main() call is removed, the execute bit is cleared, and the shebang lines are dropped. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

@suite

…e vars sailtest.py gains a _suite_registry dict and a suite(name, xml_dir) decorator. Each suite class is decorated with @suite("name", _SUITE_DIR). runner.py auto-imports all suite modules by file path (avoiding Python built-in name collisions like builtins) so they self-register, then dispatches by name with a clear error listing available suites. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

…m to step - Split TypecheckTests into TypecheckPassTests, TypecheckProjectTests, TypecheckFailTests registered as typecheck.pass, typecheck.project, typecheck.fail respectively - Add dot-segment prefix matching to runner.py so -s typecheck runs all three sub-suites and -s typecheck.fail runs just that one - Named XML output: main() now writes <name>.xml instead of tests.xml - Add env= parameter to step/step_with_status for per-step environment overrides (inherits os.environ, then merges given vars) - Fix project.py SAIL_NEW_CLI env pollution: remove module-level os.environ mutation and pass env={"SAIL_NEW_CLI": "true"} per-step Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Replace the single ExecTests class (which used -t/--targets to select targets at runtime) with six independently registered sub-suites: exec.c, exec.cpp, exec.interpreter, exec.ocaml, exec.lem, exec.coq. Shared helpers (_run_c_tests, _test_interpreter, etc.) live in an unregistered _ExecBase class that all six inherit from. Users can now run e.g. `runner.py -s exec.cpp` or `runner.py -s exec` to run all exec sub-suites via the dot-segment prefix matching added in the previous commit. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

ocaml.py: Split into ocaml (default) and ocaml.trace using a shared _OcamlBase with a _opts class attribute overridden by each sub-class. builtins.py: Split into builtins.c, builtins.ocaml, builtins.lem, builtins.coq, builtins.isla. The C sub-suite runs three opt variants in sequence via a _run_c_tests helper method. sailtest.py: Remove get_targets() and the -t/--targets CLI flag, which are now superseded by the dot-segment sub-suite selection in runner.py. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Prints all registered suites as a pretty tree using box-drawing characters. Leaf nodes are the directly runnable suites; parent nodes (e.g. exec, typecheck) can also be run via prefix matching. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

sailtest.py now only creates the bare ArgumentParser and leaves args and parallelism as None with a comment that runner.py fills them in. All add_argument calls live in runner.py alongside -s/--suite and --list-suites, keeping the full CLI surface in one place. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

sailtest.TEST_DIR points to the test/ directory (one level up from test/suites/). All 17 suite files previously defined an identical _TEST_DIR local variable; they now use the shared TEST_DIR exported by sailtest via 'from sailtest import *'. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Each run of runner.py creates test/_runs/<timestamp>-<adj>-<colour>-<animal>/ where the timestamp is ISO 8601 with hyphens (YYYY-MM-DDTHH-MM-SS) so directories sort chronologically. The three random words make each run name memorable. The path is stored as sailtest.run_dir for use by suites. test/_runs/ is added to .gitignore. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Add a Test class to sailtest.py with .path (absolute), .filename (relative), and .basename (no extension) attributes. Batcher.batch() now yields Test objects and run_tests() passes a single test to fn instead of separate filename/basename arguments. All suite _test methods and fn closures updated accordingly. skip_fn signature changes from skip_fn(filename, basename) to skip_fn(test). smt.py uses a local `basename` variable for the dots-replaced form rather than mutating test.basename. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

@suite

The @suite decorator no longer takes an xml_dir argument — the registry now maps names directly to classes. SailTest.main() writes its XML file to sailtest.run_dir (the timestamped directory created by runner.py) using the suite name as the filename. runner.py updated accordingly. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

SailTest.main() now takes run_dir explicitly and stores it as self.run_dir, avoiding direct access to the sailtest module global. runner.py uses a local _run_dir variable instead of sailtest.run_dir. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

@suite

@suite() gains an optional work_dir parameter (defaults to the suite name). The registry stores (cls, work_dir) and raises ValueError at registration time if two suites would share the same work_dir. SailTest.main() creates os.path.join(run_dir, work_dir) and stores it as self.work_dir, making it available to the suite during run(). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

typecheck.pass -> tcpass, typecheck.project -> tcproj, typecheck.fail -> tcfail, ocaml.default -> ocaml, ocaml.trace -> ocaml_trace Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

By default, create the run directory in the system temp dir (typically tmpfs on Linux) and place a named symlink in test/_runs/ pointing to it. Pass --no-tmpdir to create the directory directly in _runs as before. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Document the Test object attributes, prepare() hook, work_dir isolation, tmpfs run directories, and the full list of available suite names. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Alasdair and others added 30 commits March 9, 2026 07:42

Apply black formatting to test Python scripts

5cdacb2

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Update CLAUDE.md to use the sail wrapper script for tests

5d24b4f

The repo root sail script invokes the dev build with plugins loaded and SAIL_DIR set automatically, so SAIL_DIR no longer needs to be specified explicitly. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Minor cleanups to sailtest.py

821a047

- Rename color -> Color (PEP 8) - Fix Batcher.batch() always appending an empty trailing batch - Replace deprecated datetime.utcnow() with datetime.now(timezone.utc) Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Replace parallel() function with a module-level parallelism constant

23e6658

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Replace --seq flag with -j/--parallelism

a82e97a

The new flag accepts a number and overrides TEST_PAR, subsuming --seq (use -j 1 to run sequentially). Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Allow -s to be passed multiple times to run several suites in sequence

092c72b

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Fix config paths in exec sail files after test/c -> test/exec rename

7f207cc

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Add -t short option for --targets

3b64edd

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Add --list-suites flag to runner.py

a199e0f

Prints all registered suites as a pretty tree using box-drawing characters. Leaf nodes are the directly runnable suites; parent nodes (e.g. exec, typecheck) can also be run via prefix matching. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Alasdair and others added 9 commits March 9, 2026 07:43

Use short work_dir names for typecheck and ocaml sub-suites

123eacd

typecheck.pass -> tcpass, typecheck.project -> tcproj, typecheck.fail -> tcfail, ocaml.default -> ocaml, ocaml.trace -> ocaml_trace Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Modify test scripts to operate in clean working directory

00e53bd

Update CLAUDE.md with test infrastructure changes

c2a07c0

Document the Test object attributes, prepare() hook, work_dir isolation, tmpfs run directories, and the full list of available suite names. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Run black formatter on test/suites/*.py

f14c49a

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

Just display the run name

df18d1c

Fix project suite

2335678

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Heavily refactor test suite (with Claude)#1640

Heavily refactor test suite (with Claude)#1640
Alasdair wants to merge 39 commits intosail2from
test_vibe

Alasdair commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Alasdair commented Mar 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant