It's possible to run the Coiled Runtime benchmarks as A/B comparisons, highlighting performance differences between different released versions of dask, distributed, or any of their dependencies, and/or between different dask configs.
To run an A/B test:
Branch from `main` on the Coiled repo itself. Preferably, give the branch a
meaningful name, e.g. `AB/jobstealing`.
You must create the branch on the Coiled repo (`coiled/benchmarks`); CI
workflows will not run on a fork (`yourname/benchmarks`).
Open the `AB_environments/` directory and rename/create files as needed.
Each A/B runtime is made of exactly three files:

- `AB_<name>.conda.yaml` (a conda environment file)
- `AB_<name>.dask.yaml` (a dask configuration file)
- `AB_<name>.cluster.yaml` (`coiled.Cluster` kwargs)
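For instance, a single A/B config named `no_steal`, alongside the mandatory baseline described below, would give a layout like this (illustrative):

```
AB_environments/
├── AB_baseline.conda.yaml
├── AB_baseline.dask.yaml
├── AB_baseline.cluster.yaml
├── AB_no_steal.conda.yaml
├── AB_no_steal.dask.yaml
├── AB_no_steal.cluster.yaml
└── config.yaml
```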
You may create as many A/B runtime configs as you want in a single coiled-runtime
branch.
You can use the utility `make_envs.py <name> [name] ...` to automate file creation.
The conda environment file can contain whatever you want, as long as it can run the tests; e.g.

```yaml
channels:
  - conda-forge
dependencies:
  - python =3.9
  - <copy-paste from ci/environment.yaml, minus bits you want to change>
  # Changes from the default environment start here
  - dask ==2023.4.1
  - distributed ==2023.4.1
```

Instead of published packages, you can also use arbitrary git hashes of arbitrary forks, e.g.
```yaml
  - pip:
    - git+https://github.com/dask/dask@b85bf5be72b02342222c8a0452596539fce19bce
    - git+https://github.com/yourname/distributed@803c624fcef99e3b6f3f1c5bce61a2fb4c9a1717
```

You may also ignore the default environment and go for a barebones one. The bare
minimum you need to install is dask, distributed, coiled, and s3fs.
This will, however, skip some tests (e.g. zarr and ML-related ones), and it will also
expose you to less controlled behaviour, e.g. depending on which versions of numpy and
pandas are pulled in:
```yaml
channels:
  - conda-forge
dependencies:
  - python =3.9
  - dask ==2023.4.1
  - distributed ==2023.4.1
  - coiled
  - s3fs
```

The second file in each triplet is a dask config file. If you don't want to change the
config, you must create an empty file; e.g.
```yaml
distributed:
  scheduler:
    work-stealing: False
```

The third and final file defines creation options for the dask cluster. It must be formatted as follows:

```yaml
default:
  <kwarg>: <value>
  ...
<cluster name>:
  <kwarg>: <value>
  ...
```

`<cluster name>` must be:
- `small_cluster`, for all tests decorated with `small_cluster`
- `parquet_cluster`, for all tests in `test_parquet.py`
- others: please refer to `cluster_kwargs.yaml`
The `default` section applies to all `<cluster name>` sections, unless explicitly
overridden.
Anything that's omitted defaults to the contents of `cluster_kwargs.yaml`. Leave this
file blank if you're happy with the defaults.
For example:
```yaml
small_cluster:
  n_workers: 10
  worker_vm_types: [m6i.large]  # 2 CPU, 8 GiB
```

If you create any files in `AB_environments/`, you must also create the baseline environment:

- `AB_baseline.conda.yaml`
- `AB_baseline.dask.yaml`
- `AB_baseline.cluster.yaml`
Open `AB_environments/config.yaml` and set the `repeat` setting to a number higher than 0.
This enables the A/B tests.
Setting a low number of repeated runs is faster and cheaper, but will result in higher
variance. Setting it to 5 is a good value to get statistically significant results.
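As a rough back-of-envelope intuition (not part of the benchmark suite), the standard error of a mean measurement shrinks with the square root of the number of repeats, so 5 runs cut the noise by more than half compared to a single run:

```python
import math

def stderr_of_mean(sigma: float, n_repeats: int) -> float:
    """Standard error of the mean of n_repeats i.i.d. runs with std dev sigma."""
    return sigma / math.sqrt(n_repeats)

# Going from 1 repeat to 5 shrinks the error by a factor of sqrt(5) ≈ 2.24
ratio = stderr_of_mean(1.0, 1) / stderr_of_mean(1.0, 5)
print(round(ratio, 2))  # → 2.24
```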
`repeat` must remain set to 0 in the `main` branch, thus completely disabling
A/B tests, in order to avoid unnecessary runs.
In the same file, you can also set the `test_null_hypothesis` flag to `true` to
automatically create a verbatim copy of AB_baseline and then compare the two in the A/B
tests. Set it to `false` to save some money if you are already confident that the
`repeat` setting is high enough.
The file offers a `targets` list: the test directories, individual test files,
or individual tests that you wish to run.
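For example (the paths and test name below are illustrative, not actual repo contents), a `targets` list may mix granularities, using the usual pytest node-id syntax for individual tests:

```yaml
targets:
  - tests/benchmarks                        # a whole directory
  - tests/benchmarks/test_parquet.py        # a single file
  - tests/benchmarks/test_h2o.py::test_q1   # a single test (hypothetical node id)
```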
The file offers a `markers` string expression, which is passed to pytest's `-m`
parameter if present. See `setup.cfg` for the available markers.
`h2o_datasets` is a list of datasets to run through in
`tests/benchmarks/test_h2o.py`. Refer to that file for the possible choices.
Finally, the `max_parallel` setting lets you tweak maximum test parallelism, both in
GitHub Actions and in pytest-xdist. Reducing parallelism is useful when testing on very
large clusters (e.g. to avoid running 20 clusters with 1000 workers each at the same
time).
Nothing prevents you from changing the tests themselves; for example, you may be interested in some specific test, but on double the regular size, half the chunk size, etc.
Say you want to test the impact of disabling work stealing on the latest version of dask. You'll create the following files:
`AB_environments/AB_baseline.conda.yaml`:

```yaml
channels:
  - conda-forge
dependencies:
  - python =3.9
  - coiled
  - dask
  - distributed
  - s3fs
```

`AB_environments/AB_baseline.dask.yaml`: (empty file)

`AB_environments/AB_baseline.cluster.yaml`: (empty file)

`AB_environments/AB_no_steal.conda.yaml`: (same as baseline)

`AB_environments/AB_no_steal.dask.yaml`:

```yaml
distributed:
  scheduler:
    work-stealing: False
```

`AB_environments/AB_no_steal.cluster.yaml`: (empty file)

`AB_environments/config.yaml`:

```yaml
repeat: 5
test_null_hypothesis: true
targets:
  - tests/benchmarks
h2o_datasets:
  - 5 GB (parquet+pyarrow)
max_parallel:
  ci_jobs: 5
  pytest_workers_per_job: 4
```

- `git push`. Note: you should not open a Pull Request.
- Open [the GitHub Actions tab](https://github.com/coiled/benchmarks/actions/workflows/ab_tests.yml) and wait for the run to complete.
- Open the run from the link above. In the Summary tab, scroll down and download the
  `static-dashboard` artifact. Note: artifacts will appear only after the run is complete.
- Decompress `static-dashboard.zip` and open `index.html` in your browser.
Remember to delete the branch once you're done.
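For example, assuming the branch from the first step was called `AB/jobstealing`, the cleanup is just:

```shell
# Delete the remote A/B branch and your local copy of it
git push origin --delete AB/jobstealing
git branch -D AB/jobstealing
```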
Environment build fails with a message such as:
```
coiled 0.2.27 requires distributed>=2.23.0, but you have distributed 2.8.0+1709.ge0932ec2 which is incompatible.
```
Your conda environment points to a fork of dask/dask or dask/distributed, but its owner did not synchronize the tags. To fix this, the owner of the fork must run:
```shell
$ git remote -v
origin   https://github.com/yourname/distributed.git (fetch)
origin   https://github.com/yourname/distributed.git (push)
upstream https://github.com/dask/distributed.git (fetch)
upstream https://github.com/dask/distributed.git (push)
$ git fetch upstream --tags  # Or whatever alias the dask org was added as above
$ git push origin --tags     # Or whatever alias the fork was added as above
```

As a handy copy-paste to run from the root dir of this repository:

```shell
pushd ../dask && git fetch upstream --tags && git push origin --tags && popd
pushd ../distributed && git fetch upstream --tags && git push origin --tags && popd
```

The conda environment fails to build, citing incompatibilities with openssl
Double check that you didn't accidentally type `- python ==3.9` (which means exactly
3.9.0) instead of `- python =3.9` (which means the latest available patch version of 3.9).
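In conda syntax the two spellings mean different things; a fragment to illustrate:

```yaml
dependencies:
  - python =3.9     # any 3.9.x — the latest patch release; almost always what you want
  # - python ==3.9  # exactly 3.9.0 — an old point release, prone to solver conflicts
```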
You get very obscure failures in the workflows, which you can't seem to replicate
Double check that you don't have the same packages listed both as conda packages and
under the special `- pip:` tag. Installing a package with conda and then upgrading it
with pip typically works, but it's been observed not to in some cases (e.g. xgboost).
Specifically, dask and distributed can be installed with conda and then upgraded
with pip, but they must both be upgraded. Note that specifying the same version of a
package with pip won't upgrade it.
This is bad:

```yaml
dependencies:
  - pip:
    - git+https://github.com/dask/distributed@803c624fcef99e3b6f3f1c5bce61a2fb4c9a1717
```

dask-2023.3.2 and distributed-2023.3.2 will be installed from conda anyway, e.g. pulled in by coiled.
This is bad:

```yaml
dependencies:
  - dask ==2023.3.2
  - distributed ==2023.3.2
  - pip:
    - dask ==2023.3.2
    - git+https://github.com/dask/distributed@803c624fcef99e3b6f3f1c5bce61a2fb4c9a1717
```

The dask version is the same in pip and conda, so conda wins.
This is good:

```yaml
dependencies:
  - dask ==2023.3.2
  - distributed ==2023.3.2
  - pip:
    - dask ==2023.4.1
    - git+https://github.com/dask/distributed@803c624fcef99e3b6f3f1c5bce61a2fb4c9a1717
```

The dask and distributed versions from conda are properly uninstalled after they have
served as dependencies for the other conda packages.