raft: clean unstable log early

### Background

The [unstable](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/raft/log_unstable.go#L36) log structure in `pkg/raft` holds log entries until they have been written to storage **and** fsync-ed.

After the [introduction](https://github.com/etcd-io/raft/pull/8) of async log writes, the flow of entries from memory to `Storage` is:

1. Entries are [appended](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/raft/log.go#L143) to `unstable`.
2. On `handleRaftReady`, the `unstable` entries are [extracted](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/kv/kvserver/replica_raft.go#L830) and [paired](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/raft/rawnode.go#L169-L170) with a `MsgStorageAppend` message.
3. The batch of entries is [written](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/kv/kvserver/replica_raft.go#L1065) to Pebble.
   - If async log writes are enabled, and the batch qualifies for an async write, the batch is written to Pebble, but [not synced](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/kv/kvserver/logstore/logstore.go#L258-L261).
   - If the write doesn't qualify for async write, the entries are written **and** synced.
4. When the entries have been synced, we (or Pebble) invoke a callback which [sends](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/kv/kvserver/replica_raft.go#L1613) a `MsgStorageAppendResp` response back to the raft instance.
5. When [handling](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/raft/raft.go#L1119) the append response, raft [removes](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/raft/log_unstable.go#L158-L162) entries from `unstable` [^1].
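
The steps above can be illustrated with a toy model. This is only a sketch: `entry`, `unstableLog`, `storage`, and this simplified `stableTo` are hypothetical stand-ins, not the actual types or signatures in `pkg/raft`, and the Pebble write and sync callback are reduced to plain assignments.

```go
package main

import "fmt"

// entry is a simplified stand-in for a raft log entry.
type entry struct {
	index, term uint64
}

// unstableLog is a toy model of raft's in-memory unstable log.
type unstableLog struct {
	entries []entry
}

// storage is a toy model of the log Storage backed by Pebble.
type storage struct {
	written []entry // readable, but possibly not yet synced
	synced  uint64  // highest index known to be durable
}

// stableTo drops entries up to and including index, but only if the entry
// at index is still present in unstable with the same term.
func (u *unstableLog) stableTo(index, term uint64) {
	if len(u.entries) == 0 {
		return
	}
	first := u.entries[0].index
	if index < first || index-first >= uint64(len(u.entries)) {
		return
	}
	if u.entries[index-first].term != term {
		return // e.g. the entry was overwritten after a leadership change
	}
	u.entries = u.entries[index-first+1:]
}

func main() {
	u := &unstableLog{}
	s := &storage{}

	// Step 1: entries are appended to unstable.
	u.entries = append(u.entries, entry{1, 1}, entry{2, 1}, entry{3, 1})

	// Steps 2-3: Ready exposes the entries, and they are written without sync.
	batch := append([]entry(nil), u.entries...)
	s.written = append(s.written, batch...)
	fmt.Println("after write:", len(u.entries), "entries still in unstable")

	// Steps 4-5: the sync callback delivers the append response to raft.
	last := batch[len(batch)-1]
	s.synced = last.index
	u.stableTo(last.index, last.term)
	fmt.Println("after sync ack:", len(u.entries), "entries in unstable")
}
```

In this model the entries stay in `unstable` between the write and the sync acknowledgement, even though `storage.written` already holds them.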

### Improvement

In this flow, there is a period of time (between steps 3 and 5) when an entry has already been written to Pebble and sits in memtables, but still resides in the `unstable` struct. When async writes are enabled, this period can span multiple `Ready` iterations. Holding these entries in `unstable` is not strictly necessary, because they are already readable from the log `Storage`. We should clear them in step (3), which effectively turns the write into a "transfer" of entries from `unstable` to `Storage`.

In Replication AC, entry tokens are [admitted](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/kv/kvserver/replica_raft.go#L1059-L1061) and returned to the leader in step (3), too. Clearing the `unstable` entries at this point effectively includes them in the replication token "lifetime", and protects the node from OOMs caused by `unstable` build-ups.

The modification will be along the lines of adding a new method or message that tells raft that some or all entries in `unstable` have been (non-durably) written, so that raft can clear them. There can be some complications in the interaction with the async writes [protocol](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/raft/rawnode.go#L277-L356).
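
One possible shape of such a signal, as a rough sketch only: `writtenTo` and `syncedTo` are hypothetical names that do not exist in `pkg/raft`, and this toy model ignores in-flight appends and the other complications of the async writes protocol mentioned above.

```go
package main

import "fmt"

// entry is a simplified stand-in for a raft log entry.
type entry struct{ index, term uint64 }

// raftLog is a toy model of the raft log state relevant to this proposal.
type raftLog struct {
	unstable []entry // entries not yet handed over to Storage
	durable  uint64  // highest index acknowledged as synced
}

// writtenTo is the hypothetical new signal: entries up to and including
// (index, term) have been written to Storage without a sync, so they are
// readable from Storage and can be dropped from memory.
func (l *raftLog) writtenTo(index, term uint64) {
	for i, e := range l.unstable {
		if e.index == index && e.term == term {
			l.unstable = l.unstable[i+1:]
			return
		}
	}
	// No matching entry: leave unstable untouched.
}

// syncedTo models the existing append response. In this sketch it only
// advances the durable index, because the memory was released earlier.
func (l *raftLog) syncedTo(index uint64) {
	if index > l.durable {
		l.durable = index
	}
}

func main() {
	l := &raftLog{unstable: []entry{{1, 1}, {2, 1}, {3, 1}}}

	l.writtenTo(3, 1) // step 3: batch written to Pebble, not yet synced
	fmt.Println("unstable:", len(l.unstable), "durable:", l.durable)

	l.syncedTo(3) // steps 4-5: sync callback fires
	fmt.Println("unstable:", len(l.unstable), "durable:", l.durable)
}
```

The point of the split is that memory is released at write time, while durability is still tracked separately and only advances on the sync acknowledgement.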

Alternatively, we can fully adopt the "transfer" semantics, and remove entries from `unstable` as soon as `Ready` returns them. We would still need to deliver "acks" to raft when entries are synced.
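
A minimal sketch of this alternative, under the same caveats: `takeUnstable`, `ackSynced`, and the `inFlight` field are hypothetical names, and the model does not handle log overwrites or term checks.

```go
package main

import "fmt"

// entry is a simplified stand-in for a raft log entry.
type entry struct{ index, term uint64 }

// raftLog is a toy model of the raft log under "transfer" semantics.
type raftLog struct {
	unstable []entry
	inFlight uint64 // highest index handed to Storage via Ready
	durable  uint64 // highest index acknowledged as synced
}

// takeUnstable models Ready under "transfer" semantics: the caller takes
// ownership of the entries, and raft forgets them immediately.
func (l *raftLog) takeUnstable() []entry {
	ents := l.unstable
	l.unstable = nil
	if n := len(ents); n > 0 {
		l.inFlight = ents[n-1].index
	}
	return ents
}

// ackSynced is still required so that raft learns what is durable.
// Acks beyond what was handed out are ignored.
func (l *raftLog) ackSynced(index uint64) {
	if index > l.durable && index <= l.inFlight {
		l.durable = index
	}
}

func main() {
	l := &raftLog{unstable: []entry{{1, 1}, {2, 1}}}

	batch := l.takeUnstable()
	fmt.Println("handed out:", len(batch), "left in unstable:", len(l.unstable))

	l.ackSynced(batch[len(batch)-1].index)
	fmt.Println("durable index:", l.durable)
}
```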

[^1]: Some entries may have already been cleared from `unstable` by this time, e.g. if the leadership changed and the new leader has overwritten some entries. We only remove the entries that are guaranteed to be matched by storage, and for which there are no in-flight appends overwriting them. See [this](https://github.com/cockroachdb/cockroach/blob/06c9608ec395d130c7f86d95c5b763d770b13a15/pkg/raft/rawnode.go#L277-L356) comment for some details.

Jira issue: CRDB-37890

Epic CRDB-37515


raft: clean unstable log early #122438
