Skip to content

Heavily refactor test suite (with Claude)#1640

Draft
Alasdair wants to merge 39 commits intosail2from
test_vibe
Draft

Heavily refactor test suite (with Claude)#1640
Alasdair wants to merge 39 commits intosail2from
test_vibe

Conversation

@Alasdair
Copy link
Collaborator

@Alasdair Alasdair commented Mar 9, 2026

This PR was mostly a weekend learning project to try to figure out for myself the capabilities of AI tools (Claude Code in this case), seeing as everyone is talking/hyped about them and it seems foolish to have a strong opinion without first hand experience.

Here's my (entirely non-AI summary):

  • I wanted to remove the duplication of the batch tests/fork loop in each test script. To do this I created a SailTest class in sailtest.py. This class has a run_tests method that handles the main loop. Subclasses override the abstract run method which and can then invoke run_tests as many times as they need. Each time they pass a function to run_tests that is run in the child subprocess and calls step to execute the test actions as before. For example a simplified version of the type check pass script:
@suite("typecheck.pass", work_dir="tcpass")
class TypecheckPassTests(SailTest):
    def run(self):
        self.banner("Testing passing programs")
        self.run_tests(
            "pass",
            Batcher(_PASS_DIR),
            self._test,
            skip_set=skip_pass,
        )

    def _test(self, test):
        test.copy_filename()
        step(
            f"'{self.sail}' --no-memo-z3 --just-check --strict-bitvector"
            f" --ddump-tc-ast {test.filename} 1> _rtpass/{test.filename}"
        )
        # more steps...

This is the end result after many steps (as you can see from the number of commits in this PR)

  • I created a single 'entry' point for the tests runner.py, so rather than running test/typecheck/run_tests.py. You now do test/suites/runner.py -s typecheck. There is now some hierarchy, so you can also do -s typecheck.pass. All the python code was moved into a single directory test/suites. Each 'suite' now registers itself using a python decorator. You can use inheritance to share behaviour. The exec tests (renamed from c) demonstrate this quite well, as there is the root SailTest class, then an _ExecCBase class, before the various C and C++ options. Using the decorators was my idea, Claude started doing something really ugly, scanning the file system for python files, loading them and looking for SailTest children dynamically - I had to tell it to use the decorator and registry pattern.

  • Separating the code that runs the tests from the test data itself was a nice start, but I also wanted to separate away the output too. Claude actually didn't want to do this - and it turned out it mostly couldn't without making a huge mess of things. Each time runner.py is invoked it generates a new temporary directory and makes a symlink to that in test/_runs. All test output is captured in that directory, with a separate folder for each 'suite'. The thing Claude wasn't able to do was update each script to actually work in that directory, even though it set up most of the infrastructure. I had to port each test suite by hand. Perhaps after doing it myself I could have prompted Claude better, but this was one point where I tried to make it do a 'big step' in the refactoring and it couldn't handle it. The reason this is nice is it makes it much easier to inspect the intermediate outputs of each test, whereas previously because the tests and build artefacts were all mixed we needed to do things like 'rm' intermediate steps to clean up between runs.

I might see if rebasing this PR into a more logical series of commits is something the AI can do. I haven't looked at updating the GitHub actions set so this needs to be a draft PR until that is fixed.

Alasdair and others added 30 commits March 9, 2026 07:42
Add SailTest to sailtest.py with a run_tests() method that encapsulates
the fork/collect loop, skip handling, expected failures, and XML writing.
Each run_tests.py now subclasses SailTest and overrides run(). Also
removes unused hashlib imports, dead return statements, and inconsistent
print_ok usage throughout.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Use os.environ.get() instead of try/except for env vars
- Use subprocess.run() with capture_output/text instead of Popen
- Use list accumulation instead of string concatenation for XML
- Replace re.match with str.endswith()
- Use _make_chunks() factory to eliminate three near-identical functions
- Compute _parallel_count once at module level
- Make SailTest(ABC) with @AbstractMethod
- Use f-strings consistently

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- banner() and get_targets() become public methods (called from subclass run())
- _print_ok() and _print_skip() become private methods (used only by run_tests())
- _get_sail_dir() moves into the class; get_sail() is inlined in __init__
- Remove unused module-level is_compact()
- Update all run_tests.py subclasses to use self.banner() and self.get_targets()

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Move banner() into _run_c_tests() in c/run_tests.py so it reflects
the actual c_opts, sail_opts and valgrind parameters. Use shared
variables for opts in sv/run_tests.py and ocaml/run_tests.py. Loop
over (name, sail_opts) pairs in builtins/run_tests.py.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Replace all .format() calls with f-strings
- Remove import re from smt and typecheck; use str.endswith() instead
- Inline libpaths computation in mono/_test() using f-strings,
  eliminating the deferred-format template string pattern
- Also fix lem banner to interpolate opts like other test scripts
- Also refactor coq run() to loop over libs, extracting _have_bbv()

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The exec tests are generic end-to-end tests used by multiple backends
(C, C++, SystemVerilog, Lean, Coq, Lem, OCaml interpreter), not just
the C/C++ targets. Renaming to test/exec better reflects their purpose.

Also updates the Makefile target (c-tests -> exec-tests), .gitignore,
and all references in shell scripts and GitHub Actions workflows.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
run_tests() now takes an explicit testdir keyword argument. Before calling
the test function in each forked child, the child os.chdir()s to testdir.
The parent also temporarily chdirs to testdir while running the chunks_fn,
so predicates like os.path.isdir() resolve correctly.

Each suite file now:
- Removes the os.chdir(mydir) startup coupling
- Defines _SUITE_DIR via os.path.abspath(__file__) as its own location
- Uses absolute paths for cross-suite directory references (lem, coq, sv,
  lean all read test data from other directories)
- Passes testdir= to every run_tests() call
- Passes xml_dir=_SUITE_DIR to main() to preserve per-suite tests.xml location

The lean suite additionally removes the os.chdir(f"../{subdir}") inside
_make_test() (redundant since the child is already at testdir) and updates
_get_support_lib() to use absolute paths.

No behaviour change: all suite results are identical to before.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
All run_tests.py scripts are moved from their individual test data
directories (test/<name>/run_tests.py) to a dedicated test/suites/
directory (test/suites/<name>.py). Each suite file uses absolute path
constants derived from __file__ so it works from any invocation
location. Shell scripts, Makefile, CI config, and CLAUDE.md updated
to reference the new paths.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Replace the module-level chunks/directory_chunks/project_chunks functions
and the filenames + chunks_fn arguments to run_tests() with a single
Batcher class. Batcher encapsulates a directory and a predicate, and
exposes a batch(cores) method that lists the directory, filters by the
predicate, and splits the results into parallel batches.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The repo root sail script invokes the dev build with plugins loaded and
SAIL_DIR set automatically, so SAIL_DIR no longer needs to be specified
explicitly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
- Rename color -> Color (PEP 8)
- Fix Batcher.batch() always appending an empty trailing batch
- Replace deprecated datetime.utcnow() with datetime.now(timezone.utc)

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Batcher.batch() now calls predicates with os.path.join(self.directory,
filename) rather than bare filenames, so predicates like os.path.isdir
work without relying on the current working directory. The saved_cwd /
os.chdir dance inside batch() is removed entirely.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The new flag accepts a number and overrides TEST_PAR, subsuming --seq
(use -j 1 to run sequentially).

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
All Python files are now in one directory. The sys.path.insert lines in
each suite that added the parent directory to the path are no longer
needed and have been removed.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
runner.py dispatches to a named suite via -s/--suite, e.g.
`test/suites/runner.py -s lexing`. It owns argument parsing (deferred
from sailtest.py) and dynamically imports the suite module, finds the
SailTest subclass, and calls main().

Suite files are now pure modules: the top-level main() call is removed,
the execute bit is cleared, and the shebang lines are dropped.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…e vars

sailtest.py gains a _suite_registry dict and a suite(name, xml_dir)
decorator. Each suite class is decorated with @suite("name", _SUITE_DIR).
runner.py auto-imports all suite modules by file path (avoiding Python
built-in name collisions like builtins) so they self-register, then
dispatches by name with a clear error listing available suites.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
…m to step

- Split TypecheckTests into TypecheckPassTests, TypecheckProjectTests,
  TypecheckFailTests registered as typecheck.pass, typecheck.project,
  typecheck.fail respectively
- Add dot-segment prefix matching to runner.py so -s typecheck runs all
  three sub-suites and -s typecheck.fail runs just that one
- Named XML output: main() now writes <name>.xml instead of tests.xml
- Add env= parameter to step/step_with_status for per-step environment
  overrides (inherits os.environ, then merges given vars)
- Fix project.py SAIL_NEW_CLI env pollution: remove module-level
  os.environ mutation and pass env={"SAIL_NEW_CLI": "true"} per-step

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Replace the single ExecTests class (which used -t/--targets to select
targets at runtime) with six independently registered sub-suites:
exec.c, exec.cpp, exec.interpreter, exec.ocaml, exec.lem, exec.coq.

Shared helpers (_run_c_tests, _test_interpreter, etc.) live in an
unregistered _ExecBase class that all six inherit from.

Users can now run e.g. `runner.py -s exec.cpp` or `runner.py -s exec`
to run all exec sub-suites via the dot-segment prefix matching added
in the previous commit.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
ocaml.py: Split into ocaml (default) and ocaml.trace using a shared
_OcamlBase with a _opts class attribute overridden by each sub-class.

builtins.py: Split into builtins.c, builtins.ocaml, builtins.lem,
builtins.coq, builtins.isla. The C sub-suite runs three opt variants
in sequence via a _run_c_tests helper method.

sailtest.py: Remove get_targets() and the -t/--targets CLI flag, which
are now superseded by the dot-segment sub-suite selection in runner.py.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Prints all registered suites as a pretty tree using box-drawing
characters. Leaf nodes are the directly runnable suites; parent nodes
(e.g. exec, typecheck) can also be run via prefix matching.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
sailtest.py now only creates the bare ArgumentParser and leaves args
and parallelism as None with a comment that runner.py fills them in.
All add_argument calls live in runner.py alongside -s/--suite and
--list-suites, keeping the full CLI surface in one place.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
sailtest.TEST_DIR points to the test/ directory (one level up from
test/suites/). All 17 suite files previously defined an identical
_TEST_DIR local variable; they now use the shared TEST_DIR exported
by sailtest via 'from sailtest import *'.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Each run of runner.py creates test/_runs/<timestamp>-<adj>-<colour>-<animal>/
where the timestamp is ISO 8601 with hyphens (YYYY-MM-DDTHH-MM-SS) so
directories sort chronologically. The three random words make each run
name memorable. The path is stored as sailtest.run_dir for use by suites.

test/_runs/ is added to .gitignore.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Add a Test class to sailtest.py with .path (absolute), .filename
(relative), and .basename (no extension) attributes. Batcher.batch()
now yields Test objects and run_tests() passes a single test to fn
instead of separate filename/basename arguments.

All suite _test methods and fn closures updated accordingly.
skip_fn signature changes from skip_fn(filename, basename) to
skip_fn(test). smt.py uses a local `basename` variable for the
dots-replaced form rather than mutating test.basename.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
The @suite decorator no longer takes an xml_dir argument — the registry
now maps names directly to classes. SailTest.main() writes its XML file
to sailtest.run_dir (the timestamped directory created by runner.py)
using the suite name as the filename. runner.py updated accordingly.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Alasdair and others added 9 commits March 9, 2026 07:43
SailTest.main() now takes run_dir explicitly and stores it as
self.run_dir, avoiding direct access to the sailtest module global.
runner.py uses a local _run_dir variable instead of sailtest.run_dir.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@suite() gains an optional work_dir parameter (defaults to the suite
name). The registry stores (cls, work_dir) and raises ValueError at
registration time if two suites would share the same work_dir.

SailTest.main() creates os.path.join(run_dir, work_dir) and stores it
as self.work_dir, making it available to the suite during run().

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
typecheck.pass -> tcpass, typecheck.project -> tcproj,
typecheck.fail -> tcfail, ocaml.default -> ocaml,
ocaml.trace -> ocaml_trace

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
By default, create the run directory in the system temp dir (typically
tmpfs on Linux) and place a named symlink in test/_runs/ pointing to it.
Pass --no-tmpdir to create the directory directly in _runs as before.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Document the Test object attributes, prepare() hook, work_dir isolation,
tmpfs run directories, and the full list of available suite names.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant