val-window: skip DELETE when no rows exist below retention threshold

## Problem

The `val-window` module executes a `DELETE FROM t_validator_rewards_summary WHERE f_epoch <= {threshold}` every time a finalized checkpoint event arrives (~every 6.4 minutes), regardless of whether any rows actually match the condition.

When the minimum epoch in the table is already above the threshold (i.e., all old data has already been pruned), the DELETE becomes a **no-op mutation** that still forces ClickHouse to scan every active part in the table. On a table with 51 fragmented parts (~230 GiB), each no-op DELETE takes **30-82 seconds** of CPU time.

### The vicious cycle

This creates a feedback loop with partition fragmentation:

```
val-window no-op DELETEs (every ~6 min, 30-82 sec each)
       │
       ▼
CPU saturated scanning all parts
       │
       ▼
Background merge scheduler starved ◄──────┐
       │                                   │
       ▼                                   │
Parts remain fragmented (41+ parts)        │
       │                                   │
       ▼                                   │
Next DELETE must scan all 41+ parts ───────┘
```

**Observed impact (2026-03-04):** ClickHouse at 1,218% CPU, server load average 35+, only 0.5% CPU idle. Zero background merges running despite 41 fragmented parts. API returning HTTP 500 errors due to query timeouts.

### Root cause in code

In [`pkg/validator_window/window.go:95-99`](https://github.com/migalabs/goteth/blob/dev/pkg/validator_window/window.go#L95-L99):

```go
windowLowerEpochBoundary := dbHeadEpoch - phase0.Epoch(s.windowEpochSize)

log.Infof("database head epoch: %d", dbHeadEpoch)
log.Infof("deleting validator rewards from %d epoch backwards", windowLowerEpochBoundary)
err = s.dbClient.DeleteValidatorRewardsUntil(windowLowerEpochBoundary)
```

The DELETE is always executed without checking if `MIN(f_epoch)` in the table is already above `windowLowerEpochBoundary`.

## Proposed fix

Add a pre-check query before executing the DELETE. If the minimum epoch in the table is already above the threshold, skip the DELETE entirely:

```go
windowLowerEpochBoundary := dbHeadEpoch - phase0.Epoch(s.windowEpochSize)

log.Infof("database head epoch: %d", dbHeadEpoch)

// Check if there are actually rows to delete
minEpoch, err := s.dbClient.RetrieveMinValidatorRewardsEpoch()
if err != nil {
    log.Errorf("could not retrieve min epoch: %s", err)
    s.EndProcesses()
    return
}

if minEpoch > windowLowerEpochBoundary {
    log.Infof("min epoch %d already above threshold %d, skipping DELETE", minEpoch, windowLowerEpochBoundary)
    continue
}

log.Infof("deleting validator rewards from %d epoch backwards", windowLowerEpochBoundary)
err = s.dbClient.DeleteValidatorRewardsUntil(windowLowerEpochBoundary)
```

This requires adding a `RetrieveMinValidatorRewardsEpoch()` method to the DB service:

```go
// In pkg/db/validator_rewards.go
const selectMinEpochQuery = `SELECT min(f_epoch) AS f_epoch FROM %s`

func (p *DBService) RetrieveMinValidatorRewardsEpoch() (phase0.Epoch, error) {
    var dest []struct {
        F_epoch uint64 `ch:"f_epoch"`
    }
    err := p.highSelect(
        fmt.Sprintf(selectMinEpochQuery, valRewardsTable),
        &dest)
    if len(dest) > 0 {
        return phase0.Epoch(dest[0].F_epoch), err
    }
    return 0, err
}
```

The `SELECT min(f_epoch)` query is essentially free on a MergeTree table (ClickHouse reads it from partition metadata, no full scan needed).

## Workaround

Until the fix is deployed, the immediate workaround that resolved the incident was:

1. `docker stop goteth-val-window-1` — stop the no-op DELETEs
2. `KILL MUTATION` — cancel the in-flight no-op mutation
3. `OPTIMIZE TABLE t_validator_rewards_summary PARTITION {271,272,273,274} FINAL` — consolidate fragmented parts
4. `docker start goteth-val-window-1` — restart with merged partitions (DELETEs now take ~3-6 sec instead of 30-82 sec)

This workaround prevents the vicious cycle but doesn't eliminate the unnecessary DELETE mutations.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

val-window: skip DELETE when no rows exist below retention threshold #237

Problem

The vicious cycle

Root cause in code

Proposed fix

Workaround

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

val-window: skip DELETE when no rows exist below retention threshold #237

Description

Problem

The vicious cycle

Root cause in code

Proposed fix

Workaround

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions