leanEthereum · tcoratger · Jan 16, 2026
diff --git a/.claude/agents/code-tester.md b/.claude/agents/code-tester.md
@@ -11,6 +11,18 @@ You are SpecForge, an elite Test Engineer specializing in the Lean Ethereum Cons
 
 Generate rigorous, comprehensive unit tests and spec test fillers for the leanSpec repository. Your tests verify spec compliance and ensure cross-client interoperability across all modules.
 
+## Auto-Invoke Skills
+
+### Consensus Testing
+
+When writing tests for consensus-related code, invoke the `/consensus-testing` skill first to load specialized multi-validator testing patterns.
+
+**Triggers to invoke the skill:**
+- Test file is in `tests/consensus/`
+- Testing functions like `process_block`, `on_block`, `on_attestation`
+- Code involves validators, attestations, or justification/finalization
+- Fork choice or state transition scenarios with multiple validators
+
 ## Workflow (Follow This Order)
 
 ### 1. Explore First
@@ -20,7 +32,8 @@ Generate rigorous, comprehensive unit tests and spec test fillers for the leanSp
 - Map out exception types and when they're raised
 
 ### 2. Check Existing Tests
-- Search `tests/lean_spec/` for related test files
+- Search `tests/lean_spec/` for related unit test files
+- Search `tests/consensus/` for related spec test filler files
 - Match the established style and naming conventions
 - Avoid duplicating existing test coverage
 - Identify gaps in current coverage
@@ -37,6 +50,7 @@ Generate rigorous, comprehensive unit tests and spec test fillers for the leanSp
 
 ### 5. Verify
 - Run `uv run pytest <test_file>` to ensure tests pass
+- Run `uv run fill --clean --fork=devnet <test_file>` to ensure test fillers pass
 - Run `uv run ruff check <test_file>` for linting
 - Run `uv run ruff format <test_file>` for formatting
 - Fix any issues before presenting results

diff --git a/.claude/skills/consensus-testing.md b/.claude/skills/consensus-testing.md
@@ -0,0 +1,152 @@
+---
+name: consensus-testing
+description: "Specialized patterns for testing consensus and fork choice code with multiple validators. Use when writing tests in tests/consensus/, or when testing functions involving validators, attestations, justification, or finalization."
+---
+
+# Consensus & Fork Choice Testing Patterns
+
+Testing consensus logic requires understanding how validators interact. Single-validator tests miss critical dynamics.
+
+## Multi-Validator Test Design
+
+**Minimum validator counts by scenario:**
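+- Basic consensus: 4 validators (n = 3f + 1 with f = 1: tolerates 1 byzantine, and 3 of 4 honest exceeds 2/3)
+- Justification threshold: 8+ validators (5 of 8 falls below 2/3, 6 of 8 clears it; use a multiple of 3 such as 9 to hit exactly 2/3)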
+
+**Always vary the validator set composition:**
+- All validators honest and online
+- Supermajority honest (one validator above the 2/3 threshold)
+- At justification threshold (exactly 2/3)
+- Below threshold (one validator short of 2/3, should fail to justify)
+- Mixed online/offline validators
+
+## Validator Relationship Scenarios
+
+Test how validators interact, not just individual behavior:
+
+**Attestation patterns:**
+- All validators attest to same head (happy path)
+- Validators split between two competing heads
+- Staggered attestations across slots
+- Late attestations arriving after new blocks
+- Missing attestations from subset of validators
+
+**Proposer/attester dynamics:**
+- Proposer includes own attestation
+- Proposer excludes valid attestations (censorship)
+- Attestations reference proposer's parent (not proposer's block)
+- Multiple blocks proposed for same slot (equivocation)
+
+**Committee behavior:**
+- Full committee participation
+- Partial committee (threshold edge cases)
+- Empty committee attestations
+- Cross-committee attestation conflicts
+
+## Fork Choice Scenarios
+
+Fork choice tests must exercise competing chain heads:
+
+**Branch competition:**
+```
+         +-- B2a <- B3a (3 attestations)
+genesis <- B1 -+
+         +-- B2b <- B3b (4 attestations)  <- winner
+```
+- Test that head follows attestation weight
+- Verify re-org when new attestations shift weight
+- Check tie-breaking rules when weights equal
+
+**Critical scenarios to cover:**
+1. **Weight transitions**: Head changes as attestations arrive
+2. **Deep re-orgs**: New branch overtakes after multiple slots
+3. **Equivocation handling**: Same validator attests to conflicting heads
+4. **Checkpoint boundaries**: Behavior at epoch transitions
+5. **Finalization effects**: Finalized blocks cannot be re-orged
+
+## Justification & Finalization
+
+The 2/3 supermajority threshold is critical:
+
+**Justification tests:**
+- Exactly 2/3 participation -> should justify
+- One validator short of 2/3 -> should NOT justify
+- Validators with different effective balances (weighted voting)
+- Justification with gaps (skip epochs)
+
+**Finalization tests:**
+- Two consecutive justified epochs -> finalization
+- Justified but not finalized (gap in justification)
+- Finalization with varying participation rates
+- Cannot finalize without prior justification
+
+## Timing & Ordering
+
+Consensus is sensitive to when events occur:
+
+**Test event orderings:**
+- Attestation before vs after block arrival
+- Multiple attestations in same slot vs spread across slots
+- Block arrives late (after attestation deadline)
+- Out-of-order block delivery (child before parent)
+
+**Slot boundary behavior:**
+- Actions at slot start vs slot end
+- Crossing epoch boundaries
+- Genesis slot special cases
+
+## Spec Filler Patterns for Fork Choice
+
+```python
+def test_competing_branches(fork_choice_test: ForkChoiceTestFiller) -> None:
+    """Fork choice selects branch with higher attestation weight."""
+    fork_choice_test(
+        anchor_state=genesis_state,
+        anchor_block=genesis_block,
+        steps=[
+            # Build competing branches
+            OnBlock(block=block_2a),
+            OnBlock(block=block_2b),
+            # Add attestations: two for branch b, one for branch a
+            OnAttestation(attestation=att_for_2b_validator_0),
+            OnAttestation(attestation=att_for_2b_validator_1),
+            OnAttestation(attestation=att_for_2a_validator_2),
+            # Verify head follows weight
+            Checks(head=block_2b.hash_tree_root()),
+        ],
+    )
+```
+
+## State Transition with Multiple Validators
+
+```python
+def test_justification_threshold(state_transition_test: StateTransitionTestFiller) -> None:
+    """State justifies checkpoint when at least 2/3 of validators attest."""
+    # Create state with 8 validators
+    state = create_state_with_validators(count=8)
+
+    # Block with attestations from exactly 6/8 validators (75% > 2/3)
+    block = create_block_with_attestations(
+        state=state,
+        attesting_validators=[0, 1, 2, 3, 4, 5],  # 6 of 8
+    )
+
+    state_transition_test(
+        pre=state,
+        blocks=[block],
+        post=StateExpectation(
+            current_justified_checkpoint=expected_checkpoint,
+        ),
+    )
+```
+
+## Common Pitfalls
+
+Avoid these testing mistakes:
+
+1. **Single validator tests** - Miss consensus dynamics entirely
+2. **Always-honest scenarios** - Never test byzantine behavior
+3. **Ignoring weights** - Validators may have different balances
+4. **Fixed ordering** - Real networks have non-deterministic message arrival
+5. **Skipping threshold edges** - The 2/3 boundary is where bugs hide
+6. **Testing implementation** - Test spec behavior, not internal state
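
The threshold cases listed in the skill (exactly 2/3, one validator short, one above) can be worked out with integer arithmetic before picking validator counts. The following is a minimal sketch, not part of the patch: it assumes the `>=` comparison implied by "Exactly 2/3 participation -> should justify" and equal validator weights, and the helper names `meets_two_thirds` and `min_votes_for_two_thirds` are hypothetical, not leanSpec functions.

```python
def meets_two_thirds(votes: int, total: int) -> bool:
    """Return True when votes / total >= 2/3, using integers only.

    Assumption: equal-weight validators and an inclusive threshold.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= votes <= total:
        raise ValueError("votes must be between 0 and total")
    # Cross-multiply to avoid floating point rounding at the boundary.
    return 3 * votes >= 2 * total


def min_votes_for_two_thirds(total: int) -> int:
    """Smallest vote count satisfying 3 * votes >= 2 * total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return -(-2 * total // 3)  # ceiling of 2 * total / 3


if __name__ == "__main__":
    for total in (4, 6, 8, 9):
        needed = min_votes_for_two_thirds(total)
        exact = 3 * needed == 2 * total
        print(f"{total} validators: need {needed}, exactly 2/3: {exact}")
```

With 8 validators the smallest passing count is 6 (75%), so an "exactly 2/3" case needs a count divisible by 3, such as 6 or 9.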
diff --git a/tests/consensus/devnet/fc/test_fork_choice_reorgs.py b/tests/consensus/devnet/fc/test_fork_choice_reorgs.py
@@ -226,7 +226,6 @@ def test_three_block_deep_reorg(
     Reorg Details:
         - **Depth**: 3 blocks (deepest in this test suite)
         - **Trigger**: Alternative fork becomes longer
-        - **Weight advantage**: 4 proposer attestations vs 3
 
     Why This Matters
     ----------------
@@ -245,6 +244,7 @@ def test_three_block_deep_reorg(
     about chain history, ensuring safety and liveness even in adversarial scenarios.
     """
     fork_choice_test(
+        anchor_state=generate_pre_state(num_validators=6),
         steps=[
             # Common base
             BlockStep(
@@ -656,13 +656,13 @@ def test_back_and_forth_reorg_oscillation(
     tests fork choice correctness under extreme conditions.
 
     Oscillation Pattern:
-        Slot 2: Fork A leads (1 block) ← head
-        Slot 3: Fork B catches up (1 block each) → tie
-        Slot 4: Fork B extends (2 vs 1) ← head switches to B
-        Slot 5: Fork A extends (2 vs 2) → tie
-        Slot 6: Fork A extends (3 vs 2) ← head switches to A
-        Slot 7: Fork B extends (3 vs 3) → tie
-        Slot 8: Fork B extends (4 vs 3) ← head switches to B
+        Slot 2: Fork A leads (1 vs 0) ← head
+        Slot 2: Fork B created (1 vs 1) → tie, A maintains
+        Slot 3: Fork B extends (2 vs 1) ← head switches to B (REORG #1)
+        Slot 3: Fork A extends (2 vs 2) → tie, B maintains
+        Slot 4: Fork A extends (3 vs 2) ← head switches to A (REORG #2)
+        Slot 4: Fork B extends (3 vs 3) → tie, A maintains
+        Slot 5: Fork B extends (4 vs 3) ← head switches to B (REORG #3)
 
     Expected Behavior
     -----------------
@@ -671,7 +671,7 @@ def test_back_and_forth_reorg_oscillation(
     3. All reorgs are 1-2 blocks deep
     4. Fork choice remains consistent and correct throughout
 
-    Reorg Count: 3 reorgs in 6 slots (very high rate)
+    Reorg Count: 3 reorgs in 4 slots (very high rate)
 
     Why This Matters
     ----------------
@@ -694,6 +694,7 @@ def test_back_and_forth_reorg_oscillation(
     convergence.
     """
     fork_choice_test(
+        anchor_state=generate_pre_state(num_validators=6),
         steps=[
             # Common base
             BlockStep(
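
The revised oscillation table in the `test_back_and_forth_reorg_oscillation` docstring can be sanity-checked by replaying it in a toy model. This sketch is not part of the patch and does not call the spec's fork choice: it assumes the head follows the fork with more blocks and that the current head is kept on a tie, which is how the docstring describes it. The names `EVENTS` and `replay` are hypothetical.

```python
from typing import List, Optional, Tuple

# (slot, fork extended by one block), transcribed from the docstring table.
EVENTS: List[Tuple[int, str]] = [
    (2, "A"), (2, "B"), (3, "B"), (3, "A"), (4, "A"), (4, "B"), (5, "B"),
]


def replay(events: List[Tuple[int, str]]) -> Tuple[Optional[str], int]:
    """Replay block arrivals and return (final head fork, reorg count).

    Assumption: longer fork wins, and the current head is kept on a tie.
    """
    lengths = {"A": 0, "B": 0}
    head: Optional[str] = None
    reorgs = 0
    for _slot, fork in events:
        lengths[fork] += 1
        if head is None:
            head = fork  # first block on either fork becomes the head
        elif fork != head and lengths[fork] > lengths[head]:
            head = fork  # strictly longer competing fork: reorg
            reorgs += 1
    return head, reorgs


if __name__ == "__main__":
    final_head, reorg_count = replay(EVENTS)
    slots = len({slot for slot, _ in EVENTS})
    print(final_head, reorg_count, slots)  # B 3 4
```

Under these assumptions the replay ends on fork B after 3 reorgs across 4 distinct slots, matching the updated "3 reorgs in 4 slots" line.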