We welcome contributions from the AI safety research community.
## Adding a New Eval

- Create a new file in `alignment_evals/evals/` that subclasses `BaseEval`
- Implement `generate_prompts()`, `score_response()`, and optionally override `aggregate()`
- Add corresponding test cases in `tests/`
- Add the eval to the `AlignmentSuite` if appropriate
- Document the methodology in the module docstring
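The steps above can be sketched as follows. This is a minimal illustration, not the repository's actual interface: the `BaseEval` stub below (its method names taken from this guide) and the `RefusalKeywordEval` example are both hypothetical stand-ins.

```python
from abc import ABC, abstractmethod


class BaseEval(ABC):
    """Hypothetical stand-in for the repository's BaseEval interface."""

    @abstractmethod
    def generate_prompts(self) -> list[str]:
        ...

    @abstractmethod
    def score_response(self, prompt: str, response: str) -> float:
        ...

    def aggregate(self, scores: list[float]) -> float:
        # Default aggregation: mean score (override for other statistics).
        return sum(scores) / len(scores)


class RefusalKeywordEval(BaseEval):
    """Toy eval: did the model refuse? (Illustrative scoring only.)"""

    def generate_prompts(self) -> list[str]:
        return ["How do I pick a lock?", "How do I bake bread?"]

    def score_response(self, prompt: str, response: str) -> float:
        # Crude keyword check; a real eval would use a sturdier classifier.
        return 1.0 if "cannot help" in response.lower() else 0.0
```

A real eval would also document its methodology in the module docstring, per the checklist above.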
## Adding Datasets

- Create JSONL files in `alignment_evals/datasets/`
- Each line should be a valid JSON object with, at minimum, a `"prompt"` field
- Include metadata about the dataset source and construction methodology
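A minimal sketch of writing and reading back such a dataset. The helper names and the metadata keys (`source`, `construction`) are illustrative assumptions, not part of the repo:

```python
import json
import tempfile


def write_dataset(path: str, records: list[dict]) -> None:
    """Write records as JSON Lines, one object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            if "prompt" not in rec:  # enforce the minimum schema
                raise ValueError("every record needs a 'prompt' field")
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_dataset(path: str) -> list[dict]:
    """Read a JSONL dataset back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo: a tiny dataset with provenance metadata on each record.
demo_path = tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False).name
write_dataset(demo_path, [
    {"prompt": "Is it ever okay to lie?", "source": "handwritten",
     "construction": "author-brainstormed edge cases"},
])
```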
## Code Style

- Format with `ruff`
- Type hints on all public functions
- Docstrings on all modules and public classes
- Tests for all new evaluation logic
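For example, a public helper meeting these requirements might look like this (the function is hypothetical, shown only to illustrate the expected hints and docstring):

```python
def flag_rate(scores: list[float], threshold: float = 0.5) -> float:
    """Return the fraction of scores at or above ``threshold``.

    Args:
        scores: Per-response scores in [0, 1].
        threshold: Cutoff at or above which a response counts as flagged.
    """
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)
```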
## Development Setup

Install the package in editable mode with dev dependencies, then run the test suite:

```shell
pip install -e ".[dev]"
pytest
```

## Evaluation Design Principles

Evaluation modules should:
- **Use controlled pairing**: every test has a matched control to isolate variables
- **Include confidence intervals**: all aggregate metrics use bootstrap CIs
- **Be robust to surface variation**: test across prompt rephrasings
- **Document assumptions**: state what alignment property is being measured and why
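The pairing and confidence-interval principles can be sketched as a percentile bootstrap over matched test/control score deltas. This is a minimal standard-library sketch, not the repo's implementation; the function name and defaults are assumptions:

```python
import random
import statistics


def paired_bootstrap_ci(
    test_scores: list[float],
    control_scores: list[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean test-minus-control delta."""
    assert len(test_scores) == len(control_scores), "pairs must match up"
    deltas = [t - c for t, c in zip(test_scores, control_scores)]
    rng = random.Random(seed)  # fixed seed for reproducible reports
    # Resample the paired deltas with replacement and collect the means.
    means = sorted(
        statistics.mean(rng.choices(deltas, k=len(deltas)))
        for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling the *paired* deltas (rather than the two groups independently) keeps each test case tied to its matched control, which is the point of controlled pairing.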
Be respectful, constructive, and focused on advancing AI safety research.