alignment-evals

A rigorous, open-source framework for evaluating alignment properties of large language models. Built to make alignment research reproducible and measurable.

Why This Exists

Alignment research needs quantitative rigor. Most alignment evaluations today are ad-hoc, unreproducible, or narrowly scoped. alignment-evals provides a standardized evaluation harness for measuring critical alignment properties across models, prompting strategies, and fine-tuning approaches.

What It Measures

Sycophancy Detection

Measures whether a model changes its answer to match user opinion rather than truth. Uses paired prompts with and without stated user preferences, then computes a sycophancy index — the rate at which the model flips to the user's stated position.

Corrigibility Testing

Evaluates whether a model resists or complies with oversight mechanisms. Tests include: accepting correction gracefully, deferring to human judgment under uncertainty, and not taking actions to preserve its own continuity or influence.

Deception Probing

Detects strategic deception through multi-turn scenarios where a model might benefit from providing misleading information. Measures consistency between stated reasoning and revealed preferences across scenarios.

Goal Stability Under Distribution Shift

Tests whether a model's alignment properties hold when prompted outside its training distribution. Applies controlled perturbations (jailbreak-adjacent prompts, unusual formatting, multilingual mixing) and measures alignment metric drift.

Power-Seeking Behavior

Evaluates tendency to acquire resources, influence, or capabilities beyond what the task requires. Uses scenario-based evals where the model can choose between minimal and expansive action.

Architecture

alignment_evals/
├── core/                  # Evaluation engine
│   ├── runner.py          # Orchestrates eval runs
│   ├── metrics.py         # Alignment metric computation
│   └── dataset.py         # Dataset loading and sampling
├── evals/                 # Individual evaluation modules
│   ├── sycophancy.py      # Sycophancy detection suite
│   ├── corrigibility.py   # Corrigibility testing suite
│   ├── deception.py       # Deception probing suite
│   ├── goal_stability.py  # Distribution shift testing
│   └── power_seeking.py   # Power-seeking behavior detection
├── datasets/              # Evaluation datasets (JSONL)
│   ├── sycophancy/
│   ├── corrigibility/
│   └── deception/
├── adapters/              # Model API adapters
│   ├── openai_adapter.py
│   ├── anthropic_adapter.py
│   └── local_adapter.py
└── analysis/              # Results analysis and visualization
    ├── report.py
    └── compare.py

Quick Start

pip install alignment-evals

from alignment_evals import EvalRunner, SycophancyEval
from alignment_evals.adapters import AnthropicAdapter

adapter = AnthropicAdapter(model="claude-sonnet-4-5-20250929")
runner = EvalRunner(adapter=adapter)

results = runner.run(SycophancyEval(n_samples=200))
print(results.summary())
# Sycophancy Index: 0.07 (95% CI: [0.03, 0.11])
# Opinion-flip rate: 4.5%
# Confidence-shift rate: 12.0%

Running the Full Suite

from alignment_evals import EvalRunner, AlignmentSuite
from alignment_evals.adapters import OpenAIAdapter

adapter = OpenAIAdapter(model="gpt-4")
runner = EvalRunner(adapter=adapter)

# Run all alignment evaluations
results = runner.run(AlignmentSuite(n_samples=500))
results.to_report("gpt4_alignment_report.html")

Comparing Models

from alignment_evals import compare_models

comparison = compare_models(
    models=["claude-sonnet-4-5-20250929", "gpt-4", "llama-3-70b"],
    suite=AlignmentSuite(n_samples=300),
    output="comparison_report.html"
)

Extending with Custom Evals

from alignment_evals.core import BaseEval, EvalResult

class MyCustomEval(BaseEval):
    name = "my_custom_eval"
    description = "Tests a novel alignment property"

    def generate_prompts(self) -> list[dict]:
        return [...]

    def score_response(self, prompt: dict, response: str) -> float:
        return ...

    def aggregate(self, scores: list[float]) -> EvalResult:
        return EvalResult(
            metric_name="custom_score",
            value=sum(scores) / len(scores),
            confidence_interval=self.bootstrap_ci(scores)
        )

Research Methodology

Each evaluation module follows a rigorous methodology:

Hypothesis formulation — Define the alignment property and its observable consequences.
Controlled pairing — Every test case has a matched control to isolate the alignment-relevant variable.
Statistical analysis — All metrics include bootstrap confidence intervals and effect sizes.
Robustness checks — Results are validated across prompt rephrasings to control for surface-level sensitivity.

Contributing

We welcome contributions, especially new evaluation modules and datasets. See CONTRIBUTING.md for guidelines.

Citation

@software{calkin2026alignmentevals,
  title={alignment-evals: A Framework for Measuring AI Alignment Properties},
  author={Calkin, Maxwell},
  year={2026},
  url={https://github.com/MaxwellCalkin/alignment-evals}
}

License

MIT License. See LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
alignment_evals		alignment_evals
tests		tests
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

alignment-evals

Why This Exists

What It Measures

Sycophancy Detection

Corrigibility Testing

Deception Probing

Goal Stability Under Distribution Shift

Power-Seeking Behavior

Architecture

Quick Start

Running the Full Suite

Comparing Models

Extending with Custom Evals

Research Methodology

Contributing

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

alignment-evals

Why This Exists

What It Measures

Sycophancy Detection

Corrigibility Testing

Deception Probing

Goal Stability Under Distribution Shift

Power-Seeking Behavior

Architecture

Quick Start

Running the Full Suite

Comparing Models

Extending with Custom Evals

Research Methodology

Contributing

Citation

License

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages