2026-03-27 · 10 min read

The Complete Guide to system.replicas: What Every Column Actually Means

system.replicas is the single most important system table for ClickHouse operations. Every replicated table on every node has a row in this table. It tells you whether replication is healthy, whether the replica is falling behind, and whether ZooKeeper coordination is working. The official docs list what each column is. This guide explains what each column means operationally — and which ones actually matter.

The Table at a Glance

SELECT * FROM system.replicas FORMAT Vertical;

This returns 30+ columns per replicated table. Most teams monitor 2-3 of them. The rest contain operational intelligence that prevents incidents — if you know how to read them.

Column-by-Column Reference

Identity Columns

database / table

The replicated table this row describes. Every ReplicatedMergeTree table gets one row per node.

engine

Always a Replicated variant: ReplicatedMergeTree, ReplicatedReplacingMergeTree, etc. If you don't see a table here, it's not replicated — and has no replication health to monitor.

replica_name

The unique name of this replica, typically set in the table DDL or derived from {replica} macros in the ClickHouse config. If two nodes accidentally share the same replica_name, replication silently breaks. Check for duplicates:

-- system.replicas shows only this node's replicas, so run the
-- check cluster-wide (substitute your cluster name):
SELECT database, table, replica_name, count() AS cnt
FROM clusterAllReplicas('your_cluster', system.replicas)
GROUP BY database, table, replica_name
HAVING cnt > 1;

This should always return zero rows.

replica_path / zookeeper_path

The ZooKeeper path where this table's replication metadata lives. Essential for debugging — you can inspect ZooKeeper state directly:

SELECT name, value, ctime
FROM system.zookeeper
WHERE path = '/clickhouse/tables/01/events/replicas';

For ZooKeeper health monitoring, see the ZooKeeper dependency guide.

The Columns That Matter Most

is_leader

What it is: Whether this replica is the current leader for merge scheduling.

What it means operationally: Leaders schedule merges: they decide which parts to merge and when. In current ClickHouse versions multiple replicas can be leaders at the same time; older versions elected exactly one leader per table. Either way, if no replica is a leader (is_leader = 0 on all replicas for a table), merges stop. Parts accumulate. Eventually you hit "too many parts" errors.

-- Tables with no leader anywhere — merges are not happening
-- (run cluster-wide; substitute your cluster name)
SELECT database, table
FROM clusterAllReplicas('your_cluster', system.replicas)
GROUP BY database, table
HAVING sum(is_leader) = 0;

Alert threshold: Any table with no leader is a critical issue.

is_readonly

What it is: Whether the replica is in read-only mode.

What it means operationally: A read-only replica has lost its ZooKeeper session or encountered a critical error. It will serve stale data for reads but cannot process writes, replication, or merges. This is one of the most dangerous states because the replica looks "up" to load balancers but is actually frozen in time.

-- Read-only replicas — serving stale data
SELECT database, table, replica_name, absolute_delay
FROM system.replicas
WHERE is_readonly = 1;

Alert threshold: Any is_readonly = 1 requires immediate investigation. See the PULSE Check pre-upgrade guide for how this affects upgrades.

absolute_delay

What it is: The time difference (in seconds) between the latest replication log entry and the last entry this replica processed.

What it means operationally: This is the metric everyone monitors — and it lies. absolute_delay = 0 means the replica has processed all log entries, but it does NOT mean the replica has all the data. A failed part fetch advances the log pointer without delivering data. For the full story on why this metric is insufficient, see Your Replicas Are Lying.

SELECT database, table, replica_name, absolute_delay
FROM system.replicas
WHERE absolute_delay > 30
ORDER BY absolute_delay DESC;

Alert thresholds:

  • Green: < 30 seconds
  • Warning: 30–300 seconds
  • Critical: > 300 seconds

For detailed replication lag debugging, see the replication lag guide.

queue_size

What it is: The total number of operations waiting in the replication queue.

What it means operationally: This is a better real-time signal than absolute_delay. A nonzero queue during active ingestion is normal. A growing queue means replication is falling behind faster than it can catch up. Trend this value over 5-minute windows.

SELECT database, table, replica_name,
    queue_size,
    inserts_in_queue,
    merges_in_queue
FROM system.replicas
WHERE queue_size > 0
ORDER BY queue_size DESC;

Alert threshold: Queue size growing for more than 5 consecutive minutes.
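The "growing for 5 consecutive minutes" rule is straightforward to implement once you sample queue_size on a fixed interval. A minimal sketch in Python; the sampling scheme and 300-second window are illustrative assumptions, not ClickHouse defaults:

```python
from typing import List, Tuple

def queue_is_growing(samples: List[Tuple[float, int]], window_s: float = 300) -> bool:
    """samples: (unix_timestamp, queue_size) pairs, oldest first.
    Returns True if queue_size increased at every sample over the
    last `window_s` seconds -- the 5-minute alert rule above."""
    if not samples:
        return False
    cutoff = samples[-1][0] - window_s
    window = [size for ts, size in samples if ts >= cutoff]
    if len(window) < 2:
        return False  # not enough data points to call a trend
    return all(b > a for a, b in zip(window, window[1:]))
```

Feed it one sample per scrape interval; a single spike will not trip it, only sustained growth will.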

Queue Breakdown Columns

inserts_in_queue / merges_in_queue

What they are: The breakdown of queue_size by operation type.

What they mean operationally: If inserts_in_queue is high, the replica is falling behind on data ingestion — likely a network or disk I/O issue. If merges_in_queue is high, the replica can't keep up with merge scheduling — likely a CPU or disk contention issue. The fix is different depending on which is growing.

inserts_in_queue high → check network, disk write throughput

merges_in_queue high → check CPU, concurrent merge settings, disk I/O

For merge-specific troubleshooting, see the merge queue guide and merge internals. Per-entry details for either case live in system.replication_queue.
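The decision rule above fits in a few lines if you pull these columns into your own tooling. A sketch; the 70% dominance cutoff is an illustrative choice, not a ClickHouse default:

```python
def diagnose_queue(queue_size: int, inserts: int, merges: int) -> str:
    """Map the queue breakdown to the likely bottleneck, per the
    rules above: insert-heavy -> network / disk writes,
    merge-heavy -> CPU / merge settings / disk I/O."""
    if queue_size == 0:
        return "OK: queue empty"
    if inserts / queue_size > 0.7:
        return "insert-heavy: check network and disk write throughput"
    if merges / queue_size > 0.7:
        return "merge-heavy: check CPU, concurrent merge settings, disk I/O"
    return "mixed: inspect individual queue entries"
```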

Log Position Columns

log_pointer / log_max_index

What they are: log_max_index is the latest entry in the replication log. log_pointer is the latest entry this replica has processed.

What they mean operationally: The gap between them (log_max_index - log_pointer) is the raw count of unprocessed replication operations. This is more granular than absolute_delay — you can see exactly how many operations are pending, not just the time lag.

SELECT database, table, replica_name,
    log_pointer,
    log_max_index,
    (log_max_index - log_pointer) AS log_gap,
    absolute_delay
FROM system.replicas
WHERE (log_max_index - log_pointer) > 0
ORDER BY log_gap DESC;

Alert threshold: log_gap > 10 persisting for more than 2 minutes means the replica is stuck — even if absolute_delay is low.

Cluster Health Columns

total_replicas / active_replicas

What they are: How many replicas are configured vs. how many are currently online and communicating.

What they mean operationally: If active_replicas < total_replicas, at least one replica is down or unreachable. This directly impacts your fault tolerance — if you have 2 replicas and 1 is inactive, you have zero redundancy.

-- Tables with reduced replica count
SELECT database, table,
    total_replicas,
    active_replicas,
    (total_replicas - active_replicas) AS missing_replicas
FROM system.replicas
WHERE active_replicas < total_replicas
ORDER BY missing_replicas DESC;

Alert threshold: Any active_replicas < total_replicas needs investigation within 15 minutes.

Part Tracking Columns

parts_to_check / future_parts

What they are: parts_to_check is the number of parts queued for consistency verification. future_parts is the number of new parts that will appear as the result of inserts or merges that have not completed yet.

What they mean operationally: High parts_to_check means ClickHouse suspects data inconsistency and is verifying parts — this happens after crashes or ZooKeeper reconnections. High future_parts (> 20) means many operations are queued — if it keeps growing, merges can't keep up.

SELECT database, table,
    parts_to_check,
    future_parts,
    queue_size
FROM system.replicas
WHERE parts_to_check > 0
   OR future_parts > 20
ORDER BY parts_to_check DESC;

Columns You Can (Mostly) Ignore

These columns exist but rarely need operational monitoring:

  • is_session_expired — Transient ZooKeeper state; is_readonly is the persistent signal
  • last_queue_update — Timestamp of last queue change; useful for debugging, not monitoring
  • can_become_leader — Configuration flag, not a health signal
  • replica_is_active — Redundant with active_replicas count

The Five Queries That Replace Your Replication Dashboard

Instead of building a complex Grafana dashboard, run these five queries and you'll know more about your replication health than most teams ever will:

1. Overall Replication Status

SELECT
    database,
    table,
    replica_name,
    is_leader,
    is_readonly,
    absolute_delay,
    queue_size,
    log_max_index - log_pointer AS log_gap,
    active_replicas,
    total_replicas
FROM system.replicas
ORDER BY absolute_delay DESC, queue_size DESC;

2. Tables in Trouble

SELECT database, table, replica_name,
    multiIf(
        is_readonly = 1, 'CRITICAL: Read-only',
        absolute_delay > 300, 'CRITICAL: Lag > 5 min',
        absolute_delay > 30, 'WARNING: Lag > 30s',
        queue_size > 50, 'WARNING: Deep queue',
        active_replicas < total_replicas, 'WARNING: Missing replicas',
        'OK'
    ) AS status,
    absolute_delay,
    queue_size
FROM system.replicas
WHERE is_readonly = 1
   OR absolute_delay > 30
   OR queue_size > 50
   OR active_replicas < total_replicas
ORDER BY status ASC;
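If you consume these columns from your own tooling rather than SQL, the multiIf ladder in query 2 translates directly: first matching condition wins. A sketch using the same thresholds as above:

```python
def replica_status(is_readonly: int, absolute_delay: float,
                   queue_size: int, active: int, total: int) -> str:
    """Python equivalent of query 2's multiIf ladder:
    conditions are checked top-down, first match wins."""
    if is_readonly == 1:
        return "CRITICAL: Read-only"
    if absolute_delay > 300:
        return "CRITICAL: Lag > 5 min"
    if absolute_delay > 30:
        return "WARNING: Lag > 30s"
    if queue_size > 50:
        return "WARNING: Deep queue"
    if active < total:
        return "WARNING: Missing replicas"
    return "OK"
```

Note the ordering matters: a read-only replica with high lag reports the read-only state, which is the more urgent condition.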

3. Queue Composition

SELECT database, table, replica_name,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    round(inserts_in_queue / greatest(queue_size, 1) * 100, 1) AS insert_pct,
    round(merges_in_queue / greatest(queue_size, 1) * 100, 1) AS merge_pct
FROM system.replicas
WHERE queue_size > 0
ORDER BY queue_size DESC;

4. Log Pointer Divergence

SELECT database, table, replica_name,
    log_pointer,
    log_max_index,
    (log_max_index - log_pointer) AS log_gap,
    absolute_delay
FROM system.replicas
WHERE (log_max_index - log_pointer) > 5
ORDER BY log_gap DESC;

5. Leader Election Status

SELECT database, table,
    countIf(is_leader = 1) AS leaders,
    countIf(is_readonly = 1) AS readonly_replicas,
    min(active_replicas) AS min_active,
    max(total_replicas) AS total
FROM clusterAllReplicas('your_cluster', system.replicas) -- substitute your cluster name
GROUP BY database, table
HAVING leaders = 0
    OR readonly_replicas > 0
    OR min_active < total;

How ClusterSight Uses system.replicas

ClusterSight queries system.replicas on every node every 60 seconds and maps the results to the U (Uptime) dimension of the PULSE Framework. It tracks:

  • Trend data for queue_size and absolute_delay — alerting on growth rate, not just thresholds
  • Cross-replica comparison — detecting when replicas disagree on part counts or log positions
  • Leader election gaps — flagging tables with no active leader before merge backlogs form
  • Read-only state changes — alerting the moment a replica goes read-only, before stale queries accumulate

These are the signals that turn system.replicas from a diagnostic table into a predictive monitoring system.


This post is part of the ClickHouse at Scale series — authoritative deep dives on ClickHouse system tables. For the full monitoring framework, see How to Monitor ClickHouse in Production.

Frequently Asked Questions

What is system.replicas in ClickHouse?

system.replicas is a ClickHouse system table that exposes the replication status of every ReplicatedMergeTree table. It shows replication lag, queue depth, ZooKeeper coordination status, and leader election state for each replicated table on the current node.

What are the most important columns in system.replicas?

The most critical columns are: absolute_delay (replication lag in seconds), queue_size (pending replication operations), is_readonly (replica health), log_pointer vs log_max_index (log position gap), and active_replicas vs total_replicas (cluster availability).

How do I check if all ClickHouse replicas are in sync?

Query system.replicas and compare log_pointer values across replicas. Also verify absolute_delay is 0, queue_size is 0, and active_replicas equals total_replicas. For deeper consistency verification, compare part counts across replicas.
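Comparing log_pointer across replicas is just a max-and-diff over one integer per replica. A sketch of that comparison, assuming you have already collected log_pointer per replica for one table; the 5-entry tolerance is an illustrative choice:

```python
from typing import Dict, List

def out_of_sync(log_pointers: Dict[str, int], tolerance: int = 5) -> List[str]:
    """log_pointers: replica_name -> log_pointer for one table.
    Returns the replicas trailing the most advanced replica by
    more than `tolerance` replication log entries."""
    if not log_pointers:
        return []
    head = max(log_pointers.values())
    return sorted(name for name, ptr in log_pointers.items()
                  if head - ptr > tolerance)
```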

What does is_readonly mean in system.replicas?

is_readonly = 1 means the replica cannot process writes or replication operations. This usually indicates a lost ZooKeeper session. The replica will serve stale read queries but cannot receive new data until the ZooKeeper connection is restored.