-
Notifications
You must be signed in to change notification settings - Fork 13
val-window: skip DELETE when no rows exist below retention threshold #237
Description
Problem
The val-window module executes a DELETE FROM t_validator_rewards_summary WHERE f_epoch <= {threshold} every time a finalized checkpoint event arrives (~every 6.4 minutes), regardless of whether any rows actually match the condition.
When the minimum epoch in the table is already above the threshold (i.e., all old data has already been pruned), the DELETE becomes a no-op mutation that still forces ClickHouse to scan every active part in the table. On a table with 51 fragmented parts (~230 GiB), each no-op DELETE takes 30-82 seconds of CPU time.
The vicious cycle
This creates a feedback loop with partition fragmentation:
val-window no-op DELETEs (every ~6 min, 30-82 sec each)
│
▼
CPU saturated scanning all parts
│
▼
Background merge scheduler starved ◄──────┐
│ │
▼ │
Parts remain fragmented (41+ parts) │
│ │
▼ │
Next DELETE must scan all 41+ parts ───────┘
Observed impact (2026-03-04): ClickHouse at 1,218% CPU, server load average 35+, only 0.5% CPU idle. Zero background merges running despite 41 fragmented parts. API returning HTTP 500 errors due to query timeouts.
Root cause in code
In pkg/validator_window/window.go:95-99:
windowLowerEpochBoundary := dbHeadEpoch - phase0.Epoch(s.windowEpochSize)
log.Infof("database head epoch: %d", dbHeadEpoch)
log.Infof("deleting validator rewards from %d epoch backwards", windowLowerEpochBoundary)
err = s.dbClient.DeleteValidatorRewardsUntil(windowLowerEpochBoundary)The DELETE is always executed without checking if MIN(f_epoch) in the table is already above windowLowerEpochBoundary.
Proposed fix
Add a pre-check query before executing the DELETE. If the minimum epoch in the table is already above the threshold, skip the DELETE entirely:
windowLowerEpochBoundary := dbHeadEpoch - phase0.Epoch(s.windowEpochSize)
log.Infof("database head epoch: %d", dbHeadEpoch)
// Check if there are actually rows to delete
minEpoch, err := s.dbClient.RetrieveMinValidatorRewardsEpoch()
if err != nil {
log.Errorf("could not retrieve min epoch: %s", err)
s.EndProcesses()
return
}
if minEpoch > windowLowerEpochBoundary {
log.Infof("min epoch %d already above threshold %d, skipping DELETE", minEpoch, windowLowerEpochBoundary)
continue
}
log.Infof("deleting validator rewards from %d epoch backwards", windowLowerEpochBoundary)
err = s.dbClient.DeleteValidatorRewardsUntil(windowLowerEpochBoundary)This requires adding a RetrieveMinValidatorRewardsEpoch() method to the DB service:
// In pkg/db/validator_rewards.go
const selectMinEpochQuery = `SELECT min(f_epoch) AS f_epoch FROM %s`
func (p *DBService) RetrieveMinValidatorRewardsEpoch() (phase0.Epoch, error) {
var dest []struct {
F_epoch uint64 `ch:"f_epoch"`
}
err := p.highSelect(
fmt.Sprintf(selectMinEpochQuery, valRewardsTable),
&dest)
if len(dest) > 0 {
return phase0.Epoch(dest[0].F_epoch), err
}
return 0, err
}The SELECT min(f_epoch) query is essentially free on a MergeTree table (ClickHouse reads it from partition metadata, no full scan needed).
Workaround
Until the fix is deployed, the immediate workaround that resolved the incident was:
docker stop goteth-val-window-1— stop the no-op DELETEsKILL MUTATION— cancel the in-flight no-op mutationOPTIMIZE TABLE t_validator_rewards_summary PARTITION {271,272,273,274} FINAL— consolidate fragmented partsdocker start goteth-val-window-1— restart with merged partitions (DELETEs now take ~3-6 sec instead of 30-82 sec)
This workaround prevents the vicious cycle but doesn't eliminate the unnecessary DELETE mutations.