CrucibleBench

Statistical Testing Framework for AI Research

A comprehensive statistical testing framework designed specifically for AI/ML research in Elixir. CrucibleBench provides rigorous statistical tests, effect size measures, power analysis, and publication-ready reporting.

Features

Parametric Tests: t-tests (independent, paired), ANOVA
Non-Parametric Tests: Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis
Effect Sizes: Cohen's d, Hedges' g, Glass's delta, eta-squared, omega-squared
Power Analysis: A priori and post-hoc power calculations
Confidence Intervals: Bootstrap and analytical methods
Experiment DSL: High-level API for A/B tests, ablation studies, hyperparameter sweeps
Export Formats: Markdown, LaTeX, HTML for publication

Design Principles

Statistical Rigor: All implementations follow established statistical methods
Interpretability: Every result includes effect sizes and practical significance
Reproducibility: Complete audit trails for research reproducibility
Peer-Review Ready: Publication-quality output suitable for academic papers

Installation

Add crucible_bench to your list of dependencies in mix.exs:

def deps do
  [
    {:crucible_bench, "~> 0.4.0"}
  ]
end

Or install from GitHub:

def deps do
  [
    {:crucible_bench, github: "North-Shore-AI/crucible_bench"}
  ]
end

Stage Contract

CrucibleBench.Stage implements the Crucible.Stage behaviour from crucible_framework.

Options

:tests - Statistical tests to run (default: [:ttest])
:alpha - Significance level (default: 0.05)
:confidence_level - Confidence level (default: 0.95)
:bootstrap_iterations - Bootstrap iterations (default: 1000)
:data_source - Data source (:outputs, :metrics, or {:custom, fn})

Schema Introspection

# Get stage schema
schema = CrucibleBench.Stage.describe(%{})
# => %{
#   __schema_version__: "1.0.0",
#   name: :bench,
#   description: "Statistical benchmarking and hypothesis testing",
#   required: [],
#   optional: [:tests, :alpha, :confidence_level, :bootstrap_iterations, :data_source],
#   types: %{...},
#   defaults: %{tests: [:ttest], alpha: 0.05, ...},
#   __extensions__: %{bench: %{...}}
# }

Pipeline Integration

CrucibleBench v0.4.0+ provides CrucibleBench.Stage for seamless integration with crucible_framework pipelines:

# In your pipeline configuration
context = %{
  experiment: %{
    reliability: %{
      stats: %CrucibleIR.Reliability.Stats{
        tests: [:ttest, :bootstrap],
        alpha: 0.05,
        confidence_level: 0.95,
        bootstrap_iterations: 2000
      }
    }
  },
  outputs: [0.85, 0.87, 0.84, 0.86, 0.88]
}

# Run statistical analysis
{:ok, updated_context} = CrucibleBench.Stage.run(context)

# Access results
updated_context.bench.tests
# => %{
#   ttest: %{test_type: :ttest, ...},
#   bootstrap: %{test_type: :bootstrap, confidence_interval: {0.84, 0.88}, ...}
# }

updated_context.bench.summary
# => %{n: 5, mean: 0.86, sd: 0.0141, median: 0.86}

Advanced Stage Configuration

The Stage supports multiple data layouts for different test types:

# Two-group comparison (t-test, Mann-Whitney)
context = %{
  experiment: %{reliability: %{stats: stats_config}},
  control: [0.72, 0.68, 0.75, 0.71, 0.69],
  treatment: [0.78, 0.73, 0.81, 0.76, 0.74]
}

{:ok, ctx} = CrucibleBench.Stage.run(context)
ctx.bench.tests.ttest
# => %{
#   test_type: :ttest,
#   statistic: -3.42,
#   p_value: 0.0089,
#   significant: true,
#   effect_size: %{cohens_d: -2.16, interpretation: "large"},
#   confidence_interval: {-0.095, -0.019}
# }

# Multi-group comparison (ANOVA, Kruskal-Wallis)
context = %{
  experiment: %{
    reliability: %{
      stats: %CrucibleIR.Reliability.Stats{
        tests: [:anova],
        alpha: 0.05
      }
    }
  },
  groups: [
    [0.89, 0.91, 0.88, 0.90, 0.92],  # Model A
    [0.87, 0.89, 0.86, 0.88, 0.90],  # Model B
    [0.84, 0.86, 0.83, 0.85, 0.87]   # Model C
  ]
}

{:ok, ctx} = CrucibleBench.Stage.run(context)
ctx.bench.tests.anova.effect_size.eta_squared
# => 0.72 (large effect)

# Paired comparison (paired t-test, Wilcoxon)
context = %{
  experiment: %{reliability: %{stats: stats_config}},
  before: [0.72, 0.68, 0.75, 0.71, 0.69],
  after: [0.78, 0.73, 0.81, 0.76, 0.74]
}

{:ok, ctx} = CrucibleBench.Stage.run(context)
# Automatically uses paired t-test

Metrics Merging

The Stage automatically merges statistical results into context.metrics:

{:ok, ctx} = CrucibleBench.Stage.run(context)

ctx.metrics.bench_n           # Sample size
ctx.metrics.bench_mean        # Mean value
ctx.metrics.bench_sd          # Standard deviation
ctx.metrics.bench_median      # Median value
ctx.metrics.bench_ttest_p_value  # P-value from t-test (if run)

This enables downstream pipeline stages to access statistical summaries directly.

Inspect-AI Eval Logs

CrucibleBench can adapt EvalEx results into inspect-ai-style eval logs for downstream analysis:

metrics = [
  %{accuracy: 1.0},
  %{accuracy: 0.0},
  %{accuracy: 1.0}
]

result = EvalEx.Result.new("inspect_evals/gsm8k", :testset, metrics, 3, 120)

log = CrucibleBench.EvalLog.from_eval_result(result, scorer_name: "llm_judge")

scores = CrucibleBench.EvalLog.Extract.eval_log_scores_dict(log)
stderr = CrucibleBench.EvalLog.Extract.eval_log_headline_stderr(log)

Using IR Configuration

You can also pass CrucibleIR.Reliability.Stats directly to comparison functions:

config = %CrucibleIR.Reliability.Stats{
  alpha: 0.01,
  confidence_level: 0.99,
  tests: [:ttest]
}

control = [0.72, 0.68, 0.75, 0.71, 0.69]
treatment = [0.78, 0.73, 0.81, 0.76, 0.74]

result = CrucibleBench.compare(control, treatment, config)
# Uses alpha=0.01 and 99% confidence interval

Quick Start

Compare Two Groups

# Compare control vs treatment groups
control = [0.72, 0.68, 0.75, 0.71, 0.69]
treatment = [0.78, 0.73, 0.81, 0.76, 0.74]

result = CrucibleBench.compare(control, treatment)
# => %CrucibleBench.Result{
#   test: :welch_t_test,
#   p_value: 0.0024,
#   effect_size: %{cohens_d: 1.25, interpretation: "large"},
#   confidence_interval: {0.02, 0.14}
# }

Paired Comparison

# Before/after measurements
before = [0.72, 0.68, 0.75, 0.71, 0.69]
after = [0.78, 0.73, 0.81, 0.76, 0.74]

result = CrucibleBench.compare_paired(before, after)

Compare Multiple Groups

# Compare 3+ groups with ANOVA
gpt4 = [0.89, 0.91, 0.88, 0.90, 0.92]
claude = [0.87, 0.89, 0.86, 0.88, 0.90]
gemini = [0.84, 0.86, 0.83, 0.85, 0.87]

result = CrucibleBench.compare_multiple([gpt4, claude, gemini])

Effect Size Analysis

# Calculate Cohen's d
effect = CrucibleBench.effect_size(control, treatment)
# => %{
#   cohens_d: 1.25,
#   interpretation: "large",
#   mean1: 0.71,
#   mean2: 0.764
# }

Confidence Intervals

# Calculate 95% CI for mean
data = [0.85, 0.87, 0.84, 0.86, 0.88]
ci = CrucibleBench.confidence_interval(data, :mean)
# => %{interval: {0.8432, 0.8768}, method: :analytical}

# Bootstrap CI for median
ci = CrucibleBench.confidence_interval(data, :median, method: :bootstrap)

Power Analysis

# A priori: Calculate required sample size
result = CrucibleBench.power_analysis(:t_test,
  analysis_type: :a_priori,
  effect_size: 0.5,    # Medium effect
  alpha: 0.05,
  power: 0.80          # 80% power
)
# => %{n_per_group: 64, recommendation: "Collect at least 64 samples per group..."}

# Post-hoc: Calculate achieved power
result = CrucibleBench.power_analysis(:t_test,
  analysis_type: :post_hoc,
  effect_size: 0.5,
  n_per_group: 30,
  alpha: 0.05
)
# => %{power: 0.548, recommendation: "Marginal power..."}

High-Level Experiment DSL

A/B Testing

result = CrucibleBench.experiment(:ab_test,
  control: control_scores,
  treatment: treatment_scores,
  name: "Prompt Engineering Test"
)

# Comprehensive output includes:
# - Statistical significance
# - Effect size with interpretation
# - Power analysis
# - Recommendations

Ablation Study

result = CrucibleBench.experiment(:ablation,
  baseline: [0.85, 0.87, 0.84, 0.86, 0.88],
  without_component: [0.78, 0.76, 0.79, 0.77, 0.75],
  component_name: "Ensemble Voting"
)

# Shows performance drop and component importance

Hyperparameter Sweep

result = CrucibleBench.experiment(:hyperparameter_sweep,
  configurations: [config_a, config_b, config_c],
  labels: ["Config A", "Config B", "Config C"],
  correction_method: :holm # or :bonferroni, :benjamini_hochberg
)

# Identifies best configuration with pairwise comparisons
# Pairwise p-values are adjusted using the chosen correction method

Assumption Checks (Normality & Variance)

# Normality
NormalityTests.quick_check(data)          # fast skew/kurtosis screen
NormalityTests.assess_normality(data)     # Shapiro-Wilk + skew/kurtosis with recommendation

# Variance equality
VarianceTests.levene_test([g1, g2, g3])   # robust Brown-Forsythe (median-centered)
VarianceTests.f_test(g1, g2)              # classic F-test (assumes normality)
VarianceTests.quick_check(g1, g2)         # fast variance ratio heuristic

Use normality/variance checks to choose between parametric and non-parametric tests.
Constant or near-constant data is handled safely (no crashes).

Multiple Comparison Control

p_values = [0.01, 0.03, 0.04, 0.20]

# Adjust p-values
MultipleComparisons.correct(p_values, method: :holm)
MultipleComparisons.correct(p_values, method: :benjamini_hochberg, fdr_level: 0.10)

# Boolean rejections (uses the same alpha/FDR level)
MultipleComparisons.reject(p_values, method: :bonferroni)

Hyperparameter sweeps automatically apply corrections (:holm default); set correction_method: and optional fdr_level: to change behavior.
Exports include original and adjusted p-values plus significance under the chosen correction.

Export Results

Markdown

markdown = CrucibleBench.Export.to_markdown(result)
IO.puts(markdown)

LaTeX

latex = CrucibleBench.Export.to_latex(result)
# Generates LaTeX table for academic papers

HTML

html = CrucibleBench.Export.to_html(result)
# Generates styled HTML report

Experiment Reports

report = CrucibleBench.Export.experiment_to_markdown(ab_result)
# Comprehensive markdown report with interpretations

Statistical Tests Reference

Parametric Tests

Test	Function	Use Case
Independent t-test	`CrucibleBench.Stats.TTest.test/3`	Compare 2 independent groups
Welch's t-test	`CrucibleBench.Stats.TTest.test/3`	Compare 2 groups (unequal variance)
Paired t-test	`CrucibleBench.Stats.PairedTTest.test/3`	Compare 2 related groups
One-way ANOVA	`CrucibleBench.Stats.ANOVA.one_way/2`	Compare 3+ independent groups

Non-Parametric Tests

Test	Function	Use Case
Mann-Whitney U	`CrucibleBench.Stats.MannWhitney.test/3`	Non-parametric alternative to t-test
Wilcoxon signed-rank	`CrucibleBench.Stats.Wilcoxon.test/3`	Non-parametric alternative to paired t-test
Kruskal-Wallis	`CrucibleBench.Stats.KruskalWallis.test/2`	Non-parametric alternative to ANOVA

Effect Sizes

Measure	Function	Interpretation
Cohen's d	`CrucibleBench.Stats.EffectSize.cohens_d/2`	Standardized mean difference
Hedges' g	`CrucibleBench.Stats.EffectSize.hedges_g/2`	Bias-corrected Cohen's d
Glass's delta	`CrucibleBench.Stats.EffectSize.glass_delta/2`	Using control SD only
Eta-squared	Included in ANOVA results	Proportion of variance explained

Effect Size Interpretation

Based on Cohen (1988):

Cohen's d	Interpretation
< 0.2	Negligible
0.2 - 0.5	Small
0.5 - 0.8	Medium
> 0.8	Large

Eta-squared (η²)	Interpretation
< 0.01	Negligible
0.01 - 0.06	Small
0.06 - 0.14	Medium
> 0.14	Large

Module Structure

lib/crucible_bench/
├── bench.ex                          # Main API
├── result.ex                         # Result struct
├── stats.ex                          # Core statistics
├── analysis.ex                       # High-level analysis
├── experiment.ex                     # Experiment DSL
├── export.ex                         # Export/reporting
└── stats/
    ├── t_test.ex                     # Independent t-test
    ├── paired_t_test.ex              # Paired t-test
    ├── anova.ex                      # ANOVA
    ├── mann_whitney.ex               # Mann-Whitney U
    ├── wilcoxon.ex                   # Wilcoxon signed-rank
    ├── kruskal_wallis.ex             # Kruskal-Wallis
    ├── effect_size.ex                # Effect size measures
    ├── confidence_interval.ex        # CI calculations
    ├── power.ex                      # Power analysis
    ├── multiple_comparisons.ex       # p-value corrections (FWER/FDR)
    ├── normality_tests.ex            # Shapiro-Wilk + diagnostics
    ├── variance_tests.ex             # Levene, F-test, variance heuristics
    └── distributions.ex              # Probability distributions

Examples

See examples/basic_usage.exs for comprehensive examples covering:

Independent samples t-test
Paired t-test
One-way ANOVA
Effect size analysis
Confidence intervals
Power analysis
A/B test experiment
Ablation study
Hyperparameter sweep
Result export

Run examples:

mix run examples/basic_usage.exs

Testing

Run the test suite:

mix test

Run specific tests:

mix test test/bench_test.exs
mix test test/stats_test.exs
mix test test/effect_size_test.exs

Best Practices for AI Research

1. Always Report Effect Sizes

P-values alone don't tell the full story. Always include effect sizes:

result = CrucibleBench.compare(control, treatment)
IO.puts("P-value: #{result.p_value}")
IO.puts("Effect size: #{result.effect_size.cohens_d} (#{result.effect_size.interpretation})")

2. Check Statistical Power

Ensure your study has adequate power:

power = CrucibleBench.power_analysis(:t_test,
  analysis_type: :post_hoc,
  effect_size: observed_effect,
  n_per_group: n,
  alpha: 0.05
)

if power.power < 0.8 do
  IO.puts("Warning: Underpowered study! #{power.recommendation}")
end

3. Use Confidence Intervals

CIs provide more information than p-values:

result = CrucibleBench.compare(group1, group2)
{lower, upper} = result.confidence_interval
IO.puts("95% CI: [#{lower}, #{upper}]")

4. Consider Practical Significance

Statistical significance ≠ practical significance:

if result.p_value < 0.05 and abs(effect.cohens_d) < 0.2 do
  IO.puts("Statistically significant but negligible effect size")
end

5. Use Experiment DSL for Complex Studies

The experiment DSL automates best practices:

result = CrucibleBench.experiment(:ab_test,
  control: control,
  treatment: treatment,
  name: "My Experiment"
)

# Automatically includes:
# - Appropriate test selection
# - Effect size calculation
# - Power analysis
# - Recommendations

Common Use Cases in AI Research

Compare Model Performance

model_a_scores = [0.85, 0.87, 0.84, 0.86, 0.88]
model_b_scores = [0.88, 0.90, 0.89, 0.91, 0.87]

result = CrucibleBench.compare(model_a_scores, model_b_scores)
effect = CrucibleBench.effect_size(model_a_scores, model_b_scores)

Test Prompt Engineering

baseline_prompt = [0.72, 0.68, 0.75, 0.71, 0.69]
optimized_prompt = [0.78, 0.73, 0.81, 0.76, 0.74]

result = CrucibleBench.experiment(:ab_test,
  control: baseline_prompt,
  treatment: optimized_prompt,
  name: "Prompt Optimization"
)

Evaluate Architecture Changes

baseline = [0.85, 0.87, 0.84, 0.86, 0.88]
new_arch = [0.88, 0.90, 0.89, 0.91, 0.87]

result = CrucibleBench.compare(baseline, new_arch)
markdown = CrucibleBench.Export.to_markdown(result)
File.write!("results.md", markdown)

Ablation Studies

full_system = [0.85, 0.87, 0.84, 0.86, 0.88]
without_cache = [0.78, 0.76, 0.79, 0.77, 0.75]

result = CrucibleBench.experiment(:ablation,
  baseline: full_system,
  without_component: without_cache,
  component_name: "Response Cache"
)

Limitations

Sample Size: Most tests assume n ≥ 30 for asymptotic properties. Use bootstrap methods for smaller samples.
Normality: Parametric tests assume normality. Bench automatically suggests non-parametric alternatives when assumptions are violated.
Independence: All tests assume independent observations. Use appropriate designs for repeated measures.

References

Statistical Methods

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Routledge.
Welch, B. L. (1947). The generalization of "Student's" problem when several different population variances are involved. Biometrika, 34(1-2), 28-35.
Kruskal, W. H., & Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260), 583-621.

AI Research Statistics

Dror, R., et al. (2018). The hitchhiker's guide to testing statistical significance in natural language processing. Proceedings of ACL.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1-30.

Advanced Features

Bootstrap Confidence Intervals

For small samples or non-normal data, use bootstrap methods:

# Bootstrap CI for median (robust to outliers)
data = [0.85, 0.87, 0.84, 0.86, 0.88, 0.83, 0.89, 0.85]
ci = CrucibleBench.confidence_interval(data, :median,
  method: :bootstrap,
  iterations: 10000
)
# => %{interval: {0.835, 0.875}, method: :bootstrap, bootstrap_distribution: %{...}}

Multiple Effect Size Measures

# Compare different effect size calculations
cohens_d = Stats.EffectSize.cohens_d(group1, group2)
hedges_g = Stats.EffectSize.hedges_g(group1, group2)  # Bias-corrected
glass_delta = Stats.EffectSize.glass_delta(group1, group2)  # Control SD only

IO.puts("Cohen's d: #{cohens_d.cohens_d}")
IO.puts("Hedges' g: #{hedges_g.hedges_g}")
IO.puts("Glass's Δ: #{glass_delta.glass_delta}")

Power Analysis Curves

Calculate power for different sample sizes:

effect_size = 0.5
for n <- [20, 30, 50, 100] do
  power = CrucibleBench.power_analysis(:t_test,
    analysis_type: :post_hoc,
    effect_size: effect_size,
    n_per_group: n,
    alpha: 0.05
  )
  IO.puts("n=#{n}: power=#{Float.round(power.power * 100, 1)}%")
end

Complete API Reference

Core Functions

`CrucibleBench.compare(group1, group2, opts \\\\ [])`

Compares two independent groups with automatic test selection.

Options:

:test - Force specific test (:t_test, :welch_t_test, :mann_whitney)
:confidence_level - CI level (default: 0.95)
:check_assumptions - Test normality (default: true)
:alternative - :two_sided, :less, :greater

Returns: CrucibleBench.Result struct

`CrucibleBench.compare_paired(group1, group2, opts \\\\ [])`

Compares paired/related groups.

Options: Same as compare/3

`CrucibleBench.compare_multiple(groups, opts \\\\ [])`

Compares 3+ groups with ANOVA or Kruskal-Wallis.

Options:

:test - Force :anova or :kruskal_wallis
:check_assumptions - Test normality (default: true)

`CrucibleBench.effect_size(group1, group2, opts \\\\ [])`

Calculates Cohen's d effect size.

`CrucibleBench.confidence_interval(data, statistic, opts \\\\ [])`

Calculates confidence intervals.

Statistics: :mean, :median, :variance, etc. Methods: :analytical, :bootstrap

`CrucibleBench.power_analysis(test_type, opts \\\\ [])`

Power analysis calculations.

Types: :a_priori, :post_hoc Required: :effect_size, :alpha, :power or :n_per_group

Experiment DSL

`CrucibleBench.experiment(:ab_test, opts)`

Options:

:control - Control group data
:treatment - Treatment group data
:name - Experiment name

`CrucibleBench.experiment(:ablation, opts)`

Options:

:baseline - Full system performance
:without_component - Performance without component
:component_name - Name of removed component

`CrucibleBench.experiment(:hyperparameter_sweep, opts)`

Options:

:configurations - List of performance arrays
:labels - Configuration names

Export Functions

`CrucibleBench.Export.to_markdown(result)`

`CrucibleBench.Export.to_latex(result)`

`CrucibleBench.Export.to_html(result)`

`CrucibleBench.Export.experiment_to_markdown(experiment_result)`

Integration Examples

With Phoenix LiveView

defmodule StatsLive do
  use Phoenix.LiveView

  def handle_event("run_test", %{"control" => control, "treatment" => treatment}, socket) do
    result = CrucibleBench.compare(control, treatment)
    markdown = CrucibleBench.Export.to_markdown(result)

    {:noreply, assign(socket, result: result, markdown: markdown)}
  end
end

Research Workflow Integration

defmodule ResearchPipeline do
  def run_experiment(control_data, treatment_data, metadata) do
    # 1. Run statistical test
    result = CrucibleBench.compare(control_data, treatment_data)

    # 2. Check power
    power_analysis = CrucibleBench.power_analysis(:t_test,
      analysis_type: :post_hoc,
      effect_size: abs(result.effect_size.cohens_d),
      n_per_group: length(control_data),
      alpha: 0.05
    )

    # 3. Generate report
    report = CrucibleBench.Export.experiment_to_markdown(%{
      experiment_type: :ab_test,
      name: metadata.name,
      significant?: result.p_value < 0.05,
      p_value: result.p_value,
      effect_size: result.effect_size,
      power: power_analysis.power,
      # ... other fields
    })

    # 4. Save results
    File.write!("results/#{metadata.name}.md", report)

    {:ok, result, power_analysis}
  end
end

Benchmark Integration

defmodule BenchmarkRunner do
  def run_benchmarks(models, dataset) do
    results = for {name, model} <- models do
      scores = Enum.map(dataset, &model.predict/1)
      {name, scores}
    end

    # Statistical comparison of all models
    score_lists = Enum.map(results, fn {_name, scores} -> scores end)
    comparison = CrucibleBench.compare_multiple(score_lists)

    # Pairwise comparisons
    pairwise = for i <- 0..(length(results)-2),
                   j <- (i+1)..(length(results)-1) do
      {name_i, scores_i} = Enum.at(results, i)
      {name_j, scores_j} = Enum.at(results, j)

      result = CrucibleBench.compare(scores_i, scores_j)
      %{comparison: "#{name_i} vs #{name_j}",
        p_value: result.p_value,
        effect_size: result.effect_size.cohens_d}
    end

    %{omnibus: comparison, pairwise: pairwise}
  end
end

Performance Considerations

Memory Usage

Bootstrap methods with high iteration counts (>10,000) may consume significant memory
For large datasets, consider using analytical methods when assumptions are met
Effect size calculations are O(n) in sample size

Computational Complexity

Operation	Complexity	Notes
t-test	O(n)	Fast for any n
ANOVA	O(k×n)	k = number of groups
Bootstrap CI	O(iterations × n)	Expensive for high precision
Mann-Whitney	O(n²)	Slow for large n (>1000)
Kruskal-Wallis	O(n log n)	Better scaling

Optimization Tips

# Use analytical methods when possible
ci = CrucibleBench.confidence_interval(data, :mean, method: :analytical)

# Reduce bootstrap iterations for faster results
ci = CrucibleBench.confidence_interval(data, :median,
  method: :bootstrap,
  iterations: 1000  # Instead of default 10000
)

# Cache results for repeated analyses
@cached_power_analysis Memoize.memoize fn params ->
  CrucibleBench.power_analysis(params)
end

Troubleshooting

Common Issues

Non-significant results despite large differences

# Check if you have enough power
result = CrucibleBench.compare(group1, group2)
power = CrucibleBench.power_analysis(:t_test,
  analysis_type: :post_hoc,
  effect_size: abs(result.effect_size.cohens_d),
  n_per_group: length(group1),
  alpha: 0.05
)

if power.power < 0.8 do
  IO.puts("Underpowered study! Need larger sample size.")
end

Assumption violations

# Check normality
result = CrucibleBench.compare(group1, group2, check_assumptions: true)
# If normality test fails, consider non-parametric tests

# Or manually check
skew1 = CrucibleBench.Stats.skewness(group1)
kurt1 = CrucibleBench.Stats.kurtosis(group1)

Outliers affecting results

# Use robust statistics
median_ci = CrucibleBench.confidence_interval(data, :median, method: :bootstrap)
# Compare with mean-based results

Error Messages

"Need at least 2 groups": compare_multiple/2 requires 2+ groups
"Unknown test: xyz": Invalid test type specified
"Sample size too small": Some tests require minimum n (e.g., normality tests)

Research Methodology

Best Practices Checklist

Power Analysis: Calculate required sample size before data collection
Effect Sizes: Always report alongside p-values
Assumptions: Test normality, homogeneity of variance
Multiple Testing: Apply corrections for multiple comparisons
Confidence Intervals: Report CIs, not just p-values
Replication: Design studies for reproducibility

Common Research Scenarios

Pre-registered Analysis Plan

# Define analysis plan before data collection
analysis_plan = %{
  primary_test: :welch_t_test,
  alpha: 0.05,
  power_target: 0.80,
  effect_size_estimate: 0.5,
  required_n: 64  # From a priori power analysis
}

# Execute plan
result = CrucibleBench.compare(group1, group2, test: analysis_plan.primary_test)

Exploratory Data Analysis

# Multiple effect sizes for robustness
effect_sizes = [
  CrucibleBench.effect_size(group1, group2),
  Stats.EffectSize.hedges_g(group1, group2),
  Stats.EffectSize.glass_delta(group1, group2)
]

# Sensitivity analysis with different tests
results = [
  CrucibleBench.compare(group1, group2, test: :welch_t_test),
  CrucibleBench.compare(group1, group2, test: :mann_whitney)
]

Meta-analysis Preparation

# Calculate effect sizes for meta-analysis
studies = [
  {study1_control, study1_treatment, "Study 1"},
  {study2_control, study2_treatment, "Study 2"}
]

meta_data = for {control, treatment, name} <- studies do
  effect = CrucibleBench.effect_size(control, treatment)
  result = CrucibleBench.compare(control, treatment)

  %{
    study: name,
    cohens_d: effect.cohens_d,
    variance: Stats.effect_size_variance(effect.cohens_d, length(control) + length(treatment)),
    n: length(control) + length(treatment)
  }
end

Contributing

Development Setup

# Clone and setup
git clone https://github.com/North-Shore-AI/crucible_bench.git
cd crucible_bench
mix deps.get

# Run tests
mix test

# Run examples
mix run examples/basic_usage.exs
mix run examples/advanced_usage.exs

# Generate docs
mix docs

Code Standards

Modules: Follow Elixir naming conventions
Functions: Clear, descriptive names with comprehensive documentation
Tests: Unit tests for all public functions, property-based tests where applicable
Documentation: Complete @doc and @moduledoc with examples

Adding New Tests

# 1. Implement test in appropriate stats module
defmodule CrucibleBench.Stats.NewTest do
  def test(group1, group2, opts \\ []) do
    # Implementation
    # Return CrucibleBench.Result struct
  end
end

# 2. Add to Analysis module
def compare_groups(group1, group2, opts) do
  # ... existing logic
  test_to_use = if new_condition, do: :new_test, else: existing_logic

  case test_to_use do
    :new_test -> NewTest.test(group1, group2, opts)
    # ... other cases
  end
end

# 3. Add comprehensive tests
test "new test handles various inputs" do
  # Test cases
end

Reporting Issues

Please include:

Elixir/Erlang versions
Sample data that reproduces the issue
Expected vs actual behavior
Full error messages and stack traces

License

MIT License - see LICENSE file for details

Changelog

v0.2.0 (Current)

Complete statistical testing framework with parametric and non-parametric coverage using accurate distribution functions
Expanded effect size suite with paired measures, eta/omega squared, and rank-biserial correlation plus interpretation guidance
Analytical and bootstrap confidence intervals and power analysis with actionable recommendations
High-level helpers for automatic test selection and experiment DSL for A/B tests, ablations, and hyperparameter sweeps
Publication-ready exports to Markdown, LaTeX, and HTML with standardized result metadata

v0.1.0

Initial release with comprehensive statistical testing framework
Support for parametric and non-parametric tests
Effect size calculations and power analysis
Bootstrap confidence intervals
Experiment DSL for common research patterns
Export to Markdown, LaTeX, and HTML formats
Complete documentation and examples

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
assets		assets
docs		docs
examples		examples
lib		lib
test		test
.formatter.exs		.formatter.exs
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
execute_rename.sh		execute_rename.sh
mix.exs		mix.exs
mix.lock		mix.lock
rename_to_crucible.sh		rename_to_crucible.sh

Folders and files

Latest commit

History

Repository files navigation

CrucibleBench

Features

Design Principles

Installation

Stage Contract

Options

Schema Introspection

Pipeline Integration

Advanced Stage Configuration

Metrics Merging

Inspect-AI Eval Logs

Using IR Configuration

Quick Start

Compare Two Groups

Paired Comparison

Compare Multiple Groups

Effect Size Analysis

Confidence Intervals

Power Analysis

High-Level Experiment DSL

A/B Testing

Ablation Study

Hyperparameter Sweep

Assumption Checks (Normality & Variance)

Multiple Comparison Control

Export Results

Markdown

LaTeX

HTML

Experiment Reports

Statistical Tests Reference

Parametric Tests

Non-Parametric Tests

Effect Sizes

Effect Size Interpretation

Module Structure

Examples

Testing

Best Practices for AI Research

1. Always Report Effect Sizes

2. Check Statistical Power

3. Use Confidence Intervals

4. Consider Practical Significance

5. Use Experiment DSL for Complex Studies

Common Use Cases in AI Research

Compare Model Performance

Test Prompt Engineering

Evaluate Architecture Changes

Ablation Studies

Limitations

References

Statistical Methods

AI Research Statistics

Advanced Features

Bootstrap Confidence Intervals

Multiple Effect Size Measures

Power Analysis Curves

Complete API Reference

Core Functions

CrucibleBench.compare(group1, group2, opts \\\\ [])

CrucibleBench.compare_paired(group1, group2, opts \\\\ [])

CrucibleBench.compare_multiple(groups, opts \\\\ [])

CrucibleBench.effect_size(group1, group2, opts \\\\ [])

CrucibleBench.confidence_interval(data, statistic, opts \\\\ [])

CrucibleBench.power_analysis(test_type, opts \\\\ [])

Experiment DSL

CrucibleBench.experiment(:ab_test, opts)

CrucibleBench.experiment(:ablation, opts)

CrucibleBench.experiment(:hyperparameter_sweep, opts)

Export Functions

CrucibleBench.Export.to_markdown(result)

CrucibleBench.Export.to_latex(result)

CrucibleBench.Export.to_html(result)

CrucibleBench.Export.experiment_to_markdown(experiment_result)

Integration Examples

With Phoenix LiveView

`CrucibleBench.compare(group1, group2, opts \\\\ [])`

`CrucibleBench.compare_paired(group1, group2, opts \\\\ [])`

`CrucibleBench.compare_multiple(groups, opts \\\\ [])`

`CrucibleBench.effect_size(group1, group2, opts \\\\ [])`

`CrucibleBench.confidence_interval(data, statistic, opts \\\\ [])`

`CrucibleBench.power_analysis(test_type, opts \\\\ [])`

`CrucibleBench.experiment(:ab_test, opts)`

`CrucibleBench.experiment(:ablation, opts)`

`CrucibleBench.experiment(:hyperparameter_sweep, opts)`

`CrucibleBench.Export.to_markdown(result)`

`CrucibleBench.Export.to_latex(result)`

`CrucibleBench.Export.to_html(result)`

`CrucibleBench.Export.experiment_to_markdown(experiment_result)`

Packages