Metrics

Prometheus metrics are exposed at http://node:9090/metrics.

Raft Consensus

| Metric | Type | Description |
| --- | --- | --- |
| tensor_chain_raft_state | Gauge | Current state (follower=0, candidate=1, leader=2) |
| tensor_chain_term | Gauge | Current Raft term |
| tensor_chain_commit_index | Gauge | Highest committed log index |
| tensor_chain_applied_index | Gauge | Highest applied log index |
| tensor_chain_elections_total | Counter | Total elections started |
| tensor_chain_append_entries_total | Counter | Total AppendEntries RPCs |
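
As a rough illustration of how these gauges might be consumed, the Python sketch below parses a small sample of Prometheus text exposition and decodes tensor_chain_raft_state using the mapping from the table. The sample payload and the names `parse_samples` and `RAFT_STATES` are invented for illustration. Real output from a node may carry labels or additional series, which this minimal parser skips.

```python
# Illustrative sketch: parse unlabeled samples from Prometheus text format.
# SAMPLE is invented example data, not captured from a real node.
SAMPLE = """\
# HELP tensor_chain_raft_state Current state
# TYPE tensor_chain_raft_state gauge
tensor_chain_raft_state 2
tensor_chain_term 42
tensor_chain_commit_index 12345
tensor_chain_applied_index 12340
"""

# Numeric encoding as documented for tensor_chain_raft_state.
RAFT_STATES = {0: "follower", 1: "candidate", 2: "leader"}


def parse_samples(text):
    """Return {metric_name: value} for simple unlabeled sample lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.split()
        if len(parts) < 2 or "{" in parts[0]:
            continue  # labeled series are out of scope for this sketch
        samples[parts[0]] = float(parts[1])
    return samples


samples = parse_samples(SAMPLE)
state = RAFT_STATES[int(samples["tensor_chain_raft_state"])]
# Entries committed but not yet applied to the state machine.
apply_lag = samples["tensor_chain_commit_index"] - samples["tensor_chain_applied_index"]
print(state, int(samples["tensor_chain_term"]), int(apply_lag))
```

With the sample above, the node reports itself as leader in term 42 with five committed entries still waiting to be applied.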

Transactions

| Metric | Type | Description |
| --- | --- | --- |
| tensor_chain_tx_active | Gauge | Currently active transactions |
| tensor_chain_tx_commits_total | Counter | Total committed transactions |
| tensor_chain_tx_aborts_total | Counter | Total aborted transactions |
| tensor_chain_tx_latency_seconds | Histogram | Transaction latency |
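
Because the commit and abort metrics are counters, they are usually read as increases between two scrapes, not as absolute values. The sketch below shows one possible way to derive an abort ratio from two readings. The helper names and the scrape values are hypothetical, and treating a decrease as a process restart is a common convention for counters, not something this document specifies.

```python
# Illustrative sketch: derive an abort ratio from two counter readings.
# The scrape values below are invented for the example.

def counter_delta(previous, current):
    """Increase of a counter between two readings, treating a drop as a restart."""
    # Counters reset to zero when the process restarts; in that case the
    # current value is taken as the whole increase since the reset.
    return current if current < previous else current - previous


def abort_ratio(prev, curr):
    """Fraction of finished transactions that aborted between two scrapes."""
    commits = counter_delta(prev["tensor_chain_tx_commits_total"],
                            curr["tensor_chain_tx_commits_total"])
    aborts = counter_delta(prev["tensor_chain_tx_aborts_total"],
                           curr["tensor_chain_tx_aborts_total"])
    finished = commits + aborts
    return aborts / finished if finished else 0.0


prev = {"tensor_chain_tx_commits_total": 900, "tensor_chain_tx_aborts_total": 40}
curr = {"tensor_chain_tx_commits_total": 990, "tensor_chain_tx_aborts_total": 50}
print(abort_ratio(prev, curr))  # 10 aborts out of 100 finished transactions
```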

Deadlock Detection

| Metric | Type | Description |
| --- | --- | --- |
| tensor_chain_deadlocks_total | Counter | Total deadlocks detected |
| tensor_chain_deadlock_victims_total | Counter | Transactions aborted as victims |
| tensor_chain_wait_graph_size | Gauge | Current wait-for graph size |

Gossip

| Metric | Type | Description |
| --- | --- | --- |
| tensor_chain_gossip_members | Gauge | Known cluster members |
| tensor_chain_gossip_healthy | Gauge | Healthy members |
| tensor_chain_gossip_suspect | Gauge | Suspect members |
| tensor_chain_gossip_failed | Gauge | Failed members |

Storage

| Metric | Type | Description |
| --- | --- | --- |
| tensor_chain_entries_total | Gauge | Total stored entries |
| tensor_chain_memory_bytes | Gauge | Memory usage |
| tensor_chain_disk_bytes | Gauge | Disk usage |
| tensor_chain_wal_size_bytes | Gauge | WAL file size |

Health Endpoint

curl http://node:9090/health

Response:

{
  "status": "healthy",
  "raft_state": "leader",
  "term": 42,
  "commit_index": 12345,
  "members": 3,
  "healthy_members": 3
}
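
A script that polls this endpoint might interpret the response roughly as follows. This is a minimal sketch under two assumptions that the document does not state: that "healthy" is the only passing value of `status`, and that the cluster needs a strict majority of `members` to be healthy, as is usual for Raft. The function names are hypothetical.

```python
import json

# Example payload with the same fields and values as the response shown above.
RESPONSE = """
{
  "status": "healthy",
  "raft_state": "leader",
  "term": 42,
  "commit_index": 12345,
  "members": 3,
  "healthy_members": 3
}
"""


def has_quorum(health):
    """True when a strict majority of known members is healthy (assumed rule)."""
    return health["healthy_members"] > health["members"] // 2


def summarize(body):
    """Return (ok, reason) for a /health response body."""
    health = json.loads(body)
    if health["status"] != "healthy":
        return False, "status is " + health["status"]
    if not has_quorum(health):
        return False, "no majority of healthy members"
    return True, "ok"


print(summarize(RESPONSE))
```

To check a live node, the body could be fetched with `urllib.request.urlopen("http://node:9090/health").read()` and passed to `summarize`.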

Alerting Rules

groups:
  - name: neumann
    rules:
      - alert: NoLeader
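        # tensor_chain_raft_state is a numeric gauge (leader=2), so no node
        # being leader means the maximum across nodes is below 2.
        expr: max(tensor_chain_raft_state) < 2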
        for: 30s
        labels:
          severity: critical

      - alert: HighReplicationLag
        expr: tensor_chain_commit_index - tensor_chain_applied_index > 1000
        for: 1m
        labels:
          severity: warning

      - alert: HighDeadlockRate
        expr: rate(tensor_chain_deadlocks_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
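
To make the thresholds concrete: the HighDeadlockRate expression compares a per-second rate, so it holds only when deadlocks average more than one per second over the 5-minute window, and the alert fires once that has stayed true for the `for` duration. The sketch below approximates both numeric conditions from raw samples. It is a simplification for illustration: Prometheus's `rate()` also handles counter resets and extrapolates to the window edges, which this code does not, and the sample values and function names are invented.

```python
# Illustrative sketch: approximate two of the alert conditions from raw samples.
# Sample values are invented; counter resets are not handled here.

def per_second_rate(samples):
    """Average per-second increase between the first and last (timestamp, value)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need two samples with increasing timestamps")
    return (v1 - v0) / (t1 - t0)


def apply_lag_exceeded(commit_index, applied_index, threshold=1000):
    """Mirror of the HighReplicationLag expression."""
    return commit_index - applied_index > threshold


# tensor_chain_deadlocks_total at the start and end of a 300-second window.
window = [(0, 100), (300, 460)]
rate = per_second_rate(window)
print(rate, rate > 1)
print(apply_lag_exceeded(12345, 12340))
```

Here 360 deadlocks in 300 seconds gives a rate above the threshold, while an apply lag of 5 entries is far below 1000.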