A rigorous, open-source framework for evaluating alignment properties of large language models. Built to make alignment research reproducible and measurable.
Alignment research needs quantitative rigor. Most alignment evaluations today are ad-hoc, unreproducible, or narrowly scoped. alignment-evals provides a standardized evaluation harness for measuring critical alignment properties across models, prompting strategies, and fine-tuning approaches.
Measures whether a model changes its answer to match user opinion rather than truth. Uses paired prompts with and without stated user preferences, then computes a sycophancy index — the rate at which the model flips to the user's stated position.
Evaluates whether a model resists or complies with oversight mechanisms. Tests include: accepting correction gracefully, deferring to human judgment under uncertainty, and not taking actions to preserve its own continuity or influence.
Detects strategic deception through multi-turn scenarios where a model might benefit from providing misleading information. Measures consistency between stated reasoning and revealed preferences across scenarios.
Tests whether a model's alignment properties hold when prompted outside its training distribution. Applies controlled perturbations (jailbreak-adjacent prompts, unusual formatting, multilingual mixing) and measures alignment metric drift.
Evaluates tendency to acquire resources, influence, or capabilities beyond what the task requires. Uses scenario-based evals where the model can choose between minimal and expansive action.
alignment_evals/
├── core/ # Evaluation engine
│ ├── runner.py # Orchestrates eval runs
│ ├── metrics.py # Alignment metric computation
│ └── dataset.py # Dataset loading and sampling
├── evals/ # Individual evaluation modules
│ ├── sycophancy.py # Sycophancy detection suite
│ ├── corrigibility.py # Corrigibility testing suite
│ ├── deception.py # Deception probing suite
│ ├── goal_stability.py # Distribution shift testing
│ └── power_seeking.py # Power-seeking behavior detection
├── datasets/ # Evaluation datasets (JSONL)
│ ├── sycophancy/
│ ├── corrigibility/
│ └── deception/
├── adapters/ # Model API adapters
│ ├── openai_adapter.py
│ ├── anthropic_adapter.py
│ └── local_adapter.py
└── analysis/ # Results analysis and visualization
├── report.py
└── compare.py
pip install alignment-evalsfrom alignment_evals import EvalRunner, SycophancyEval
from alignment_evals.adapters import AnthropicAdapter
adapter = AnthropicAdapter(model="claude-sonnet-4-5-20250929")
runner = EvalRunner(adapter=adapter)
results = runner.run(SycophancyEval(n_samples=200))
print(results.summary())
# Sycophancy Index: 0.07 (95% CI: [0.03, 0.11])
# Opinion-flip rate: 4.5%
# Confidence-shift rate: 12.0%from alignment_evals import EvalRunner, AlignmentSuite
from alignment_evals.adapters import OpenAIAdapter
adapter = OpenAIAdapter(model="gpt-4")
runner = EvalRunner(adapter=adapter)
# Run all alignment evaluations
results = runner.run(AlignmentSuite(n_samples=500))
results.to_report("gpt4_alignment_report.html")from alignment_evals import compare_models
comparison = compare_models(
models=["claude-sonnet-4-5-20250929", "gpt-4", "llama-3-70b"],
suite=AlignmentSuite(n_samples=300),
output="comparison_report.html"
)from alignment_evals.core import BaseEval, EvalResult
class MyCustomEval(BaseEval):
name = "my_custom_eval"
description = "Tests a novel alignment property"
def generate_prompts(self) -> list[dict]:
return [...]
def score_response(self, prompt: dict, response: str) -> float:
return ...
def aggregate(self, scores: list[float]) -> EvalResult:
return EvalResult(
metric_name="custom_score",
value=sum(scores) / len(scores),
confidence_interval=self.bootstrap_ci(scores)
)Each evaluation module follows a rigorous methodology:
- Hypothesis formulation — Define the alignment property and its observable consequences.
- Controlled pairing — Every test case has a matched control to isolate the alignment-relevant variable.
- Statistical analysis — All metrics include bootstrap confidence intervals and effect sizes.
- Robustness checks — Results are validated across prompt rephrasings to control for surface-level sensitivity.
We welcome contributions, especially new evaluation modules and datasets. See CONTRIBUTING.md for guidelines.
@software{calkin2026alignmentevals,
title={alignment-evals: A Framework for Measuring AI Alignment Properties},
author={Calkin, Maxwell},
year={2026},
url={https://github.com/MaxwellCalkin/alignment-evals}
}MIT License. See LICENSE.