Tutorial: Monitoring & Alerting

This guide consolidates all pg_trickle monitoring capabilities into a single reference: built-in SQL views, NOTIFY-based alerts, and the Prometheus/Grafana observability stack.

Quick Health Check

The fastest way to verify pg_trickle is healthy:

SELECT * FROM pgtrickle.health_check() WHERE severity != 'OK';

If this returns no rows, everything is working. Any WARN or ERROR rows tell you where to investigate.

Built-in Monitoring Views

Stream table status

-- Overview: name, status, mode, staleness
SELECT name, status, refresh_mode, staleness, stale
FROM pgtrickle.stream_tables_info;

-- Detailed stats: refresh counts, duration, error streaks
SELECT pgt_name, total_refreshes, avg_duration_ms, consecutive_errors, stale
FROM pgtrickle.pg_stat_stream_tables;

-- Live status with error counts
SELECT * FROM pgtrickle.pgt_status();

Refresh history

-- Last 10 refreshes for a specific stream table
SELECT start_time, action, status, duration_ms, rows_inserted, rows_deleted, error_message
FROM pgtrickle.get_refresh_history('order_totals', 10);

-- Global refresh timeline (last 20 events across all stream tables)
SELECT start_time, stream_table, action, status, duration_ms, error_message
FROM pgtrickle.refresh_timeline(20);

-- Aggregate refresh statistics
SELECT * FROM pgtrickle.st_refresh_stats();

CDC pipeline health

-- Per-source CDC mode, WAL lag, and alerts
SELECT * FROM pgtrickle.check_cdc_health();

-- Change buffer sizes (pending changes not yet consumed)
SELECT stream_table, source_table, cdc_mode, pending_rows, buffer_bytes
FROM pgtrickle.change_buffer_sizes()
ORDER BY pending_rows DESC;

-- Verify all CDC triggers are installed and enabled
SELECT source_table, trigger_type, trigger_name
FROM pgtrickle.trigger_inventory()
WHERE NOT present OR NOT enabled;

Dependencies

-- ASCII tree view of the entire dependency graph
SELECT tree_line, status, refresh_mode
FROM pgtrickle.dependency_tree();

-- Diamond consistency groups
SELECT * FROM pgtrickle.diamond_groups();

Fuse circuit breaker

-- Check fuse state for all stream tables
SELECT name, fuse_mode, fuse_state, fuse_ceiling, blown_at
FROM pgtrickle.fuse_status();

Parallel workers

-- Worker pool status (when parallel_refresh_mode = 'on')
SELECT * FROM pgtrickle.worker_pool_status();

-- Recent parallel job history
SELECT job_id, unit_key, status, duration_ms
FROM pgtrickle.parallel_job_status(60);

NOTIFY-Based Alerting

pg_trickle emits real-time events via PostgreSQL's NOTIFY system:

LISTEN pg_trickle_alert;

Event Types

Event	Trigger	Severity
`stale_data`	Scheduler is also behind — view is genuinely out of date	Warning
`no_upstream_changes`	Scheduler is healthy but source tables have had no writes — view is correct	Info
`auto_suspended`	Stream table suspended after max consecutive errors	Critical
`resumed`	Stream table resumed after suspension	Info
`reinitialize_needed`	Upstream DDL change detected	Warning
`buffer_growth_warning`	Change buffer growing unexpectedly	Warning
`slot_lag_warning`	WAL replication slot retaining excessive data	Warning
`fuse_blown`	Circuit breaker tripped	Warning
`refresh_completed`	Refresh completed successfully	Info
`refresh_failed`	Refresh failed	Error
`diamond_partial_failure`	One member of an atomic diamond group failed	Warning
`scheduler_falling_behind`	Refresh duration approaching the schedule interval	Warning
`spill_threshold_exceeded`	Delta MERGE spilled to temp files for consecutive refreshes, forcing FULL	Warning

Notification Payload

Each notification carries a JSON payload:

{
  "event": "auto_suspended",
  "stream_table": "order_totals",
  "consecutive_errors": 3,
  "last_error": "column \"deleted_column\" does not exist",
  "timestamp": "2026-03-31T14:22:01.123Z"
}

Bridging to External Systems

To forward NOTIFY events to external alerting systems (PagerDuty, Slack, OpsGenie), use a listener process:

# Example: Python listener using psycopg
import psycopg
import json

conn = psycopg.connect("postgresql://user:pass@host/db", autocommit=True)
conn.execute("LISTEN pg_trickle_alert")

for notify in conn.notifies():
    payload = json.loads(notify.payload)
    event = payload["event"]
    # no_upstream_changes is informational — source tables are quiet but healthy.
    # Only page on actionable events.
    if event in ("auto_suspended", "fuse_blown", "refresh_failed"):
        send_to_pagerduty(payload)
    elif event == "stale_data":  # scheduler itself is falling behind
        send_to_pagerduty(payload)

Prometheus & Grafana Stack

For production deployments, use the pre-built observability stack in the monitoring/ directory:

cd monitoring/
docker compose up -d

This gives you:

Prometheus scraping pg_trickle metrics via postgres_exporter
Grafana with a pre-provisioned dashboard
Alerting rules for staleness, errors, CDC lag, and scheduler health

See Prometheus & Grafana Integration for full setup details.

Diagnostic Workflow

When something is wrong, follow this systematic workflow:

Step 1 — Global health

SELECT * FROM pgtrickle.health_check() WHERE severity != 'OK';

Step 2 — Status and staleness

SELECT name, status, consecutive_errors, staleness
FROM pgtrickle.pgt_status()
ORDER BY staleness DESC NULLS FIRST;

Step 3 — Recent refresh activity

SELECT start_time, stream_table, action, status, error_message
FROM pgtrickle.refresh_timeline(20);

Step 4 — Error details for a specific stream table

SELECT * FROM pgtrickle.diagnose_errors('my_stream_table');

Step 5 — CDC pipeline

SELECT stream_table, source_table, pending_rows, buffer_bytes
FROM pgtrickle.change_buffer_sizes()
ORDER BY pending_rows DESC;

Step 6 — Trigger verification

SELECT source_table, trigger_type, trigger_name
FROM pgtrickle.trigger_inventory()
WHERE NOT present OR NOT enabled;

Common Alert Responses

Alert	Likely Cause	Action
`stale_data`	Scheduler behind, long refresh, or lock contention	Check `pgt_status()` and `refresh_timeline()`
`auto_suspended`	Repeated refresh failures	Fix root cause, then `resume_stream_table()`
`fuse_blown`	Bulk load exceeded fuse ceiling	Investigate, then `reset_fuse()`
`buffer_growth_warning`	Scheduler not consuming buffers fast enough	Check scheduler status and refresh errors
`reinitialize_needed`	Source table DDL changed	Verify schema compatibility; scheduler handles automatically

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tutorial: Monitoring & Alerting

Quick Health Check

Built-in Monitoring Views

Stream table status

Refresh history

CDC pipeline health

Dependencies

Fuse circuit breaker

Parallel workers

NOTIFY-Based Alerting

Event Types

Notification Payload

Bridging to External Systems

Prometheus & Grafana Stack

Diagnostic Workflow

Step 1 — Global health

Step 2 — Status and staleness

Step 3 — Recent refresh activity

Step 4 — Error details for a specific stream table

Step 5 — CDC pipeline

Step 6 — Trigger verification

Common Alert Responses

Further Reading

FilesExpand file tree

MONITORING_AND_ALERTING.md

Latest commit

History

MONITORING_AND_ALERTING.md

File metadata and controls

Tutorial: Monitoring & Alerting

Quick Health Check

Built-in Monitoring Views

Stream table status

Refresh history

CDC pipeline health

Dependencies

Fuse circuit breaker

Parallel workers

NOTIFY-Based Alerting

Event Types

Notification Payload

Bridging to External Systems

Prometheus & Grafana Stack

Diagnostic Workflow

Step 1 — Global health

Step 2 — Status and staleness

Step 3 — Recent refresh activity

Step 4 — Error details for a specific stream table

Step 5 — CDC pipeline

Step 6 — Trigger verification

Common Alert Responses

Further Reading