Run safety benchmarks against AI models and view detailed reports showing how well they performed.

## Background

This is an [MLCommons project](https://mlcommons.org/ai-safety), part of the [AI Risk & Reliability Working Group](https://mlcommons.org/ai-risk-and-reliability/). The project is at an early stage. You can see sample benchmarks [here](https://ailuminate.mlcommons.org/benchmarks/) and our 0.5 white paper [here](https://arxiv.org/abs/2404.12241).

This project now contains both ModelGauge and ModelBench. ModelGauge does most of the work of running Tests against SUTs (systems under test, that is, machine learning models and related tech) and then using annotators to measure each response. ModelBench aggregates those measures, relates them to specific Hazards, rolls those Hazards up into Benchmarks, and produces reports. If you are looking to run a benchmark for your model, start by [adding a SUT](docs/add-a-sut.md) that works with ModelGauge.

## Requirements

The current public practice benchmark uses LlamaGuard to evaluate the safety of responses. For now you will need a [Together AI](https://www.together.ai/) account to use it. For 1.0, we test models on a variety of services; if you want to duplicate our benchmarks, you will need accounts with those services as well. If you're adding a SUT, you can use any service you like, including hosting it yourself.

Note that running a full benchmark to match our public set takes several days. Depending on response time, running your own SUT may be faster. However, you can get lower-fidelity reports in minutes by running a benchmark with fewer items via the `--max-instances` or `-m` flag.

## Installation

Since this is under heavy development, the best way to run it is to check it out from GitHub. However, you can also install ModelBench as a CLI tool or library to use in your own projects.

### Install ModelBench with [Poetry](https://python-poetry.org/) for local development

```shell
cd modelbench
poetry install
```

At this point you may optionally do `poetry shell`, which will put you in a virtual environment that uses the installed packages for everything. If you do that, you don't have to explicitly say `poetry run` in the commands below.

### Install ModelBench from PyPI

```shell
poetry run pytest tests
```

## Trying It Out

We encourage interested parties to try it out and give us feedback. For now, ModelBench is mainly focused on us running our own benchmarks, but over time we would like others to be able both to test their own models and to create their own tests and benchmarks.

### Running Your First Benchmark

Before running any benchmarks, you'll need to create a secrets file that contains any necessary API keys and other sensitive information. Create a file at `config/secrets.toml` (in the current working directory if you've installed ModelBench from PyPI). You can use the following as a template.

```toml
[together]
api_key = "<your key here>"
```

Note: Omit `poetry run` in all example commands going forward if you've installed ModelBench from PyPI.

```shell
poetry run modelbench benchmark -m 10
```
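As a quick sanity check before your first run, you can verify that the secrets file is where ModelBench will look for it. This is just a sketch using the path given above; it doesn't validate the file's contents.

```shell
# Check that the secrets file exists in the current working directory.
if [ -f config/secrets.toml ]; then
  echo "secrets file found"
else
  echo "missing config/secrets.toml"
fi
```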

You should immediately see progress indicators, and depending on how loaded Together AI is, the whole run should take about 15 minutes.

> [!IMPORTANT]
> Sometimes a benchmark run will fail due to temporary errors caused by network issues, API outages, etc. While we are working toward handling these errors gracefully, the current best solution is simply to rerun the benchmark if it fails.

### Viewing the Scores

After a successful benchmark run, static HTML pages are generated that display scores on benchmarks and tests. These can be viewed by opening `web/index.html` in a web browser, e.g., `firefox web/index.html`.

Note that the HTML that ModelBench produces is an older version than what is available on [the website](https://ailuminate.mlcommons.org/). Over time we'll simplify the direct ModelBench output to make it more straightforward and more directly useful to people running ModelBench independently.

### Using the journal

As `modelbench` runs, it logs each important event to the journal, including every step of prompt processing. You can use the journal to extract most information you might want about a run. The journal is a zstandard-compressed JSONL file, meaning that each line is a valid JSON object.

There are many tools that can work with those files. In the examples below, we use [jq](https://jqlang.github.io/jq/), a JSON swiss army knife. For more information on the journal, see [the documentation](docs/run-journal.md).

To dump the raw scores, you could do something like this:

```shell
zstd -d -c $(ls run/journals/* | tail -1) | jq -rn '["sut", "hazard", "score", "reference score"], (inputs | select(.message=="hazard scored") | [.sut, .hazard, .score, .reference]) | @csv'
```

That will produce a CSV row for each hazard scored, along with the reference score for that hazard.

Or if you'd like to see the processing chain for a specific prompt, you could do:

```shell
zstd -d -c $(ls run/journals/* | tail -1) | jq -r 'select(.prompt_id=="airr_practice_1_0_41321")'
```

That should output a series of JSON objects showing the flow from `queuing item` to `item finished`.

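As a further sketch, you can tally journal events by their `message` field to get a quick overview of what happened during a run. This assumes `zstd` and `jq` are installed and that at least one journal exists under `run/journals/`:

```shell
# Tally journal events by type; a no-op if no journal exists yet.
journal="$(ls run/journals/* 2>/dev/null | tail -1)"
if [ -n "$journal" ]; then
  zstd -d -c "$journal" | jq -r '.message' | sort | uniq -c | sort -rn
fi
```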
**CAUTION**: Please note that many of the prompts may be uncomfortable or harmful to view, especially to people with a history of trauma related to one of the hazards that we test for. Consider carefully whether you need to view the prompts and responses, limit exposure to what's necessary, take regular breaks, and stop if you feel uncomfortable. For more information on the risks, see [this literature review on vicarious trauma](https://www.zevohealth.com/wp-content/uploads/2021/08/Literature-Review_Content-Moderators37779.pdf).

### Managing the Cache

To speed up runs, ModelBench caches calls to both SUTs and annotators. That's normally what a benchmark runner wants. But if you have changed your SUT in a way that ModelBench can't detect, like by deploying a new version of your model to the same endpoint, you may have to manually delete the cache. Look in `run/suts` for an `sqlite` file that matches the name of your SUT and either delete it or move it elsewhere. The cache will be created anew on the next run.
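For example, here is a minimal sketch of setting one SUT's cache aside from the shell. The SUT id `my_sut` is hypothetical; substitute the file name you actually find in `run/suts`:

```shell
# Hypothetical SUT id; substitute the name you see in run/suts.
sut=my_sut
cache="run/suts/${sut}.sqlite"
# Move the stale cache aside rather than deleting it, so it can be restored.
if [ -f "$cache" ]; then
  mv "$cache" "${cache}.bak"
fi
```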

### Running the benchmark on your SUT

ModelBench uses the ModelGauge library to discover and manage SUTs. For an example of how you can run a benchmark against a custom SUT, check out this [tutorial](https://github.com/mlcommons/modelbench/blob/main/docs/add-a-sut.md).

## Contributing
