Run safety benchmarks against AI models and view detailed reports showing how well they performed.

## Background

This is an [MLCommons project](https://mlcommons.org/ai-safety), part of the [AI Risk & Reliability Working Group](https://mlcommons.org/ai-risk-and-reliability/). The project is at an early stage. You can see sample benchmarks [here](https://ailuminate.mlcommons.org/benchmarks/) and our 0.5 white paper [here](https://arxiv.org/abs/2404.12241).

This project now contains both ModelGauge and ModelBench. ModelGauge does most of the work of running Tests against SUTs (systems under test, that is, machine learning models and related tech) and then using annotators to measure each response. ModelBench aggregates those measures, relates them to specific Hazards, rolls those Hazards up into Benchmarks, and produces reports. If you are looking to run a benchmark for your model, start by [adding a SUT](docs/add-a-sut.md) that works with ModelGauge.

## Requirements

The current public practice benchmark uses LlamaGuard to evaluate the safety of responses. For now you will need a [Together AI](https://www.together.ai/) account to use it. For 1.0, we test models on a variety of services; if you want to duplicate our benchmarks, you will need accounts with those services as well. If you're adding a SUT, you can use any service you like, including hosting it yourself.

Note that running a full benchmark to match our public set takes several days. Depending on response time, running your own SUT may be faster. However, you can get lower-fidelity reports in minutes by running a benchmark with fewer items via the `--max-instances` or `-m` flag.

## Installation

Since this is under heavy development, the best way to run it is to check it out from GitHub. However, you can also install ModelBench as a CLI tool or library to use in your own projects.

### Install ModelBench with [Poetry](https://python-poetry.org/) for local development

```shell
cd modelbench
poetry install
```

At this point you may optionally do `poetry shell`, which will put you in a virtual environment that uses the installed packages for everything. If you do that, you don't have to explicitly say `poetry run` in the commands below.

### Install ModelBench from PyPI

```shell
poetry run pytest tests
```

## Trying It Out

We encourage interested parties to try it out and give us feedback. For now, ModelBench is mainly focused on us running our own benchmarks, but over time we would like others to be able both to test their own models and to create their own tests and benchmarks.

### Running Your First Benchmark

Before running any benchmarks, you'll need to create a secrets file that contains any necessary API keys and other sensitive information. Create a file at `config/secrets.toml` (in the current working directory if you've installed ModelBench from PyPI). You can use the following as a template.

```toml
[together]
api_key = "<your key here>"
```

Note: Omit `poetry run` in all example commands going forward if you've installed ModelBench from PyPI.

```shell
poetry run modelbench benchmark -m 10
```
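As a quick sanity check before your first run, you can verify that the secrets file is where ModelBench will look for it. This is just a sketch using the path given above; it doesn't validate the file's contents.

```shell
# Check that the secrets file exists in the current working directory.
if [ -f config/secrets.toml ]; then
  echo "secrets file found"
else
  echo "missing config/secrets.toml"
fi
```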

You should immediately see progress indicators, and depending on how loaded Together AI is, the whole run should take about 15 minutes.

> [!IMPORTANT]
> Sometimes a benchmark run will fail due to temporary errors caused by network issues, API outages, etc. While we are working toward handling these errors gracefully, the current best solution is simply to rerun the benchmark if it fails.

### Viewing the Scores

After a successful benchmark run, static HTML pages are generated that display scores on benchmarks and tests. These can be viewed by opening `web/index.html` in a web browser, e.g., `firefox web/index.html`.

Note that the HTML that ModelBench produces is an older version than what is available on [the website](https://ailuminate.mlcommons.org/). Over time we'll simplify the direct ModelBench output to make it more straightforward and more directly useful to people running ModelBench independently.

### Using the journal

As `modelbench` runs, it logs each important event to the journal, including every step of prompt processing. You can use the journal to extract most information you might want about a run. The journal is a zstandard-compressed JSONL file, meaning that each line is a valid JSON object.

There are many tools that can work with those files. In the examples below, we use [jq](https://jqlang.github.io/jq/), a JSON swiss army knife. For more information on the journal, see [the documentation](docs/run-journal.md).

To dump the raw scores, you could do something like this:

```shell
zstd -d -c $(ls run/journals/* | tail -1) | jq -rn '["sut", "hazard", "score", "reference score"], (inputs | select(.message=="hazard scored") | [.sut, .hazard, .score, .reference]) | @csv'
```

That will produce a CSV row for each hazard scored, along with the reference score for that hazard.

Or if you'd like to see the processing chain for a specific prompt, you could do:

```shell
zstd -d -c $(ls run/journals/* | tail -1) | jq -r 'select(.prompt_id=="airr_practice_1_0_41321")'
```

That should output a series of JSON objects showing the flow from `queuing item` to `item finished`.

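As a further sketch, you can tally journal events by their `message` field to get a quick overview of what happened during a run. This assumes `zstd` and `jq` are installed and that at least one journal exists under `run/journals/`:

```shell
# Tally journal events by type; a no-op if no journal exists yet.
journal="$(ls run/journals/* 2>/dev/null | tail -1)"
if [ -n "$journal" ]; then
  zstd -d -c "$journal" | jq -r '.message' | sort | uniq -c | sort -rn
fi
```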
**CAUTION**: Please note that many of the prompts may be uncomfortable or harmful to view, especially to people with a history of trauma related to one of the hazards that we test for. Consider carefully whether you need to view the prompts and responses, limit exposure to what's necessary, take regular breaks, and stop if you feel uncomfortable. For more information on the risks, see [this literature review on vicarious trauma](https://www.zevohealth.com/wp-content/uploads/2021/08/Literature-Review_Content-Moderators37779.pdf).

### Managing the Cache

To speed up runs, ModelBench caches calls to both SUTs and annotators. That's normally what a benchmark runner wants. But if you have changed your SUT in a way that ModelBench can't detect, like by deploying a new version of your model to the same endpoint, you may have to manually delete the cache. Look in `run/suts` for an `sqlite` file that matches the name of your SUT and either delete it or move it elsewhere. The cache will be created anew on the next run.
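For example, here is a minimal sketch of setting one SUT's cache aside from the shell. The SUT id `my_sut` is hypothetical; substitute the file name you actually find in `run/suts`:

```shell
# Hypothetical SUT id; substitute the name you see in run/suts.
sut=my_sut
cache="run/suts/${sut}.sqlite"
# Move the stale cache aside rather than deleting it, so it can be restored.
if [ -f "$cache" ]; then
  mv "$cache" "${cache}.bak"
fi
```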

### Running the benchmark on your SUT

ModelBench uses the ModelGauge library to discover and manage SUTs. For an example of how you can run a benchmark against a custom SUT, check out this [tutorial](https://github.com/mlcommons/modelbench/blob/main/docs/add-a-sut.md).

## Contributing
