Skip to content

MaxwellCalkin/alignment-evals

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

alignment-evals

A rigorous, open-source framework for evaluating alignment properties of large language models. Built to make alignment research reproducible and measurable.

Why This Exists

Alignment research needs quantitative rigor. Most alignment evaluations today are ad-hoc, unreproducible, or narrowly scoped. alignment-evals provides a standardized evaluation harness for measuring critical alignment properties across models, prompting strategies, and fine-tuning approaches.

What It Measures

Sycophancy Detection

Measures whether a model changes its answer to match user opinion rather than truth. Uses paired prompts with and without stated user preferences, then computes a sycophancy index — the rate at which the model flips to the user's stated position.

Corrigibility Testing

Evaluates whether a model resists or complies with oversight mechanisms. Tests include: accepting correction gracefully, deferring to human judgment under uncertainty, and not taking actions to preserve its own continuity or influence.

Deception Probing

Detects strategic deception through multi-turn scenarios where a model might benefit from providing misleading information. Measures consistency between stated reasoning and revealed preferences across scenarios.

Goal Stability Under Distribution Shift

Tests whether a model's alignment properties hold when prompted outside its training distribution. Applies controlled perturbations (jailbreak-adjacent prompts, unusual formatting, multilingual mixing) and measures alignment metric drift.

Power-Seeking Behavior

Evaluates tendency to acquire resources, influence, or capabilities beyond what the task requires. Uses scenario-based evals where the model can choose between minimal and expansive action.

Architecture

alignment_evals/
├── core/                  # Evaluation engine
│   ├── runner.py          # Orchestrates eval runs
│   ├── metrics.py         # Alignment metric computation
│   └── dataset.py         # Dataset loading and sampling
├── evals/                 # Individual evaluation modules
│   ├── sycophancy.py      # Sycophancy detection suite
│   ├── corrigibility.py   # Corrigibility testing suite
│   ├── deception.py       # Deception probing suite
│   ├── goal_stability.py  # Distribution shift testing
│   └── power_seeking.py   # Power-seeking behavior detection
├── datasets/              # Evaluation datasets (JSONL)
│   ├── sycophancy/
│   ├── corrigibility/
│   └── deception/
├── adapters/              # Model API adapters
│   ├── openai_adapter.py
│   ├── anthropic_adapter.py
│   └── local_adapter.py
└── analysis/              # Results analysis and visualization
    ├── report.py
    └── compare.py

Quick Start

pip install alignment-evals
from alignment_evals import EvalRunner, SycophancyEval
from alignment_evals.adapters import AnthropicAdapter

adapter = AnthropicAdapter(model="claude-sonnet-4-5-20250929")
runner = EvalRunner(adapter=adapter)

results = runner.run(SycophancyEval(n_samples=200))
print(results.summary())
# Sycophancy Index: 0.07 (95% CI: [0.03, 0.11])
# Opinion-flip rate: 4.5%
# Confidence-shift rate: 12.0%

Running the Full Suite

from alignment_evals import EvalRunner, AlignmentSuite
from alignment_evals.adapters import OpenAIAdapter

adapter = OpenAIAdapter(model="gpt-4")
runner = EvalRunner(adapter=adapter)

# Run all alignment evaluations
results = runner.run(AlignmentSuite(n_samples=500))
results.to_report("gpt4_alignment_report.html")

Comparing Models

from alignment_evals import compare_models

comparison = compare_models(
    models=["claude-sonnet-4-5-20250929", "gpt-4", "llama-3-70b"],
    suite=AlignmentSuite(n_samples=300),
    output="comparison_report.html"
)

Extending with Custom Evals

from alignment_evals.core import BaseEval, EvalResult

class MyCustomEval(BaseEval):
    name = "my_custom_eval"
    description = "Tests a novel alignment property"

    def generate_prompts(self) -> list[dict]:
        return [...]

    def score_response(self, prompt: dict, response: str) -> float:
        return ...

    def aggregate(self, scores: list[float]) -> EvalResult:
        return EvalResult(
            metric_name="custom_score",
            value=sum(scores) / len(scores),
            confidence_interval=self.bootstrap_ci(scores)
        )

Research Methodology

Each evaluation module follows a rigorous methodology:

  1. Hypothesis formulation — Define the alignment property and its observable consequences.
  2. Controlled pairing — Every test case has a matched control to isolate the alignment-relevant variable.
  3. Statistical analysis — All metrics include bootstrap confidence intervals and effect sizes.
  4. Robustness checks — Results are validated across prompt rephrasings to control for surface-level sensitivity.

Contributing

We welcome contributions, especially new evaluation modules and datasets. See CONTRIBUTING.md for guidelines.

Citation

@software{calkin2026alignmentevals,
  title={alignment-evals: A Framework for Measuring AI Alignment Properties},
  author={Calkin, Maxwell},
  year={2026},
  url={https://github.com/MaxwellCalkin/alignment-evals}
}

License

MIT License. See LICENSE.

About

Rigorous framework for evaluating AI alignment properties — sycophancy, corrigibility, deception, goal stability, and power-seeking — with statistical confidence intervals

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages