2026-03-25 · 10 min read

Your Replicas Are Lying: How Silent Replication Drift Corrupts Analytics

Your ClickHouse dashboards are showing the wrong numbers. Not because the queries are wrong, but because the replicas serving those queries don't have the same data. This is silent replication drift — the most dangerous operational failure in ClickHouse because every monitoring tool you have says everything is fine.

This is the story of how a 45-second replication delay turned into a week of corrupted analytics, a stakeholder meeting with conflicting numbers, and a 3 AM incident that should have been caught on day one.

The Scenario Nobody Warns You About

Picture this. You have a three-node ClickHouse cluster. ReplicatedMergeTree tables. ZooKeeper coordination. Grafana dashboards with green lights across the board. Prometheus is scraping ClickHouseAsyncMetrics_ReplicasMaxAbsoluteDelay and it's sitting at zero. Your runbook says replication is healthy.

Except on Thursday, your analytics lead messages you: "The revenue dashboard showed $4.2M at 9 AM and $3.8M at 9:05 AM. Same filters. Same time range. Which number is real?"

Both numbers are real. They came from different replicas. One replica is missing 14 hours of data parts from a specific partition. And nothing in your monitoring noticed.

Why absolute_delay Lies

Most ClickHouse replication monitoring boils down to one metric:

SELECT database, table, absolute_delay
FROM system.replicas
WHERE absolute_delay > 0;

If this returns empty, you're healthy. Right?

Wrong. absolute_delay measures the timestamp difference between the last entry in the replication log and the last entry the replica has processed. It answers: "How far behind is this replica in the replication queue?"

It does not answer:

  • "Does this replica have the same data parts as the other replicas?"
  • "Did a fetch fail silently and leave a gap?"
  • "Are there parts that were supposed to merge but didn't?"

Here's the critical gap: a replica can process all replication log entries and still be missing data. A part fetch can fail, get retried, and eventually be marked as lost. The replication log pointer advances. absolute_delay drops to zero. But the data never arrived.
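You can see this gap directly by comparing a data-level count across replicas in a single query. This sketch assumes a cluster named 'default' is defined in remote_servers, and that the table in question is analytics.events — substitute your own names:

```sql
-- Fan out to every replica and compare per-host row totals.
-- If the totals differ while absolute_delay is zero everywhere,
-- the log pointer has advanced past data that never arrived.
SELECT
    hostName() AS replica,
    sum(rows) AS total_rows
FROM clusterAllReplicas('default', system.parts)
WHERE database = 'analytics' AND table = 'events' AND active
GROUP BY replica
ORDER BY replica;
```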

The Three Faces of Silent Drift

1. The Failed Fetch

When a replica needs a data part, it fetches it from another replica that has it. If the fetch fails (network timeout, disk full, source unavailable), ClickHouse retries. But after enough failures, the replica can give up on the entry entirely, mark the part as lost, and advance its log pointer past it. As far as the replication queue is concerned, the entry was handled. The data never arrived.

-- Find parts that exist on some replicas but not others.
-- system.parts is local to each replica: run this on every replica and diff.
SELECT
    database,
    table,
    partition_id,
    name,
    active,
    hostName() AS replica
FROM system.parts
WHERE table = 'events'
ORDER BY partition_id, name;

Run this and diff the results across replicas. If replica A has parts that replica B doesn't, you have drift. No alert told you.
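Instead of diffing result sets by hand, the comparison can be collapsed into one pass with the clusterAllReplicas table function. A sketch, assuming a cluster named 'default' in remote_servers and three replicas:

```sql
-- Parts that are active on some replicas but absent on others
SELECT
    database,
    table,
    partition_id,
    name,
    groupUniqArray(hostName()) AS present_on,
    length(present_on) AS replica_count
FROM clusterAllReplicas('default', system.parts)
WHERE table = 'events' AND active
GROUP BY database, table, partition_id, name
HAVING replica_count < 3  -- your replica count here
ORDER BY partition_id, name;
```

Merge timing makes some transient mismatches normal on hot partitions; a mismatch that persists on an old, cold partition is the drift signal.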

2. The Stuck Mutation

Mutations (ALTER TABLE ... UPDATE/DELETE) execute asynchronously. A mutation can complete on two replicas and get stuck on a third. The stuck replica continues serving reads — with stale, un-mutated data.

-- Incomplete mutations on this replica
-- (system.mutations is local; run on each replica and compare)
SELECT
    database,
    table,
    mutation_id,
    command,
    create_time,
    is_done,
    latest_fail_reason
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time ASC;

A mutation stuck for hours means one replica is serving data that should have been updated or deleted. If that mutation was removing PII or correcting a data pipeline error, you're serving wrong data to some percentage of queries — depending on which replica the load balancer picks.
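The per-replica comparison can also be done in one query. A sketch, again assuming a 'default' cluster name:

```sql
-- Mutations whose completion status differs between replicas
SELECT
    database,
    table,
    mutation_id,
    groupUniqArray((hostName(), is_done)) AS per_replica_status
FROM clusterAllReplicas('default', system.mutations)
GROUP BY database, table, mutation_id
HAVING uniqExact(is_done) > 1
ORDER BY mutation_id;
```

Any row returned here is a mutation that finished on some replicas and not others — exactly the stale-read hazard described above.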

3. The Partition Divergence

This is the most insidious. Two replicas process the same inserts but end up with different partition states because of merge timing differences. Replica A merged parts 1, 2, and 3 into a single part. Replica B merged parts 1 and 2 but part 3 is still separate — and it's a part that was later superseded by a newer insert.

The data is technically there on both replicas but the logical state diverges. Deduplication behaves differently. Queries that rely on FINAL or _part_offset return different results.
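One way to quantify this divergence is to fingerprint each replica's active part layout per partition and count distinct states. A sketch with the same 'default' cluster-name assumption:

```sql
-- Partitions whose active part layout differs between replicas.
-- Differing layouts with identical data are normal during merges;
-- persistent layout divergence on settled partitions is the signal.
SELECT
    database,
    table,
    partition_id,
    uniqExact(part_layout) AS distinct_layouts
FROM
(
    SELECT
        hostName() AS replica,
        database,
        table,
        partition_id,
        arraySort(groupArray(name)) AS part_layout
    FROM clusterAllReplicas('default', system.parts)
    WHERE active
    GROUP BY replica, database, table, partition_id
)
GROUP BY database, table, partition_id
HAVING distinct_layouts > 1;
```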

The Query That Catches Everything

Here's what you should be monitoring instead of (or in addition to) absolute_delay:

-- Replication health check, from a single replica's point of view
SELECT
    database,
    table,
    replica_name,
    log_pointer,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    absolute_delay,
    total_replicas,
    active_replicas
FROM system.replicas
WHERE active_replicas < total_replicas
   OR queue_size > 0
   OR inserts_in_queue > 0
   OR future_parts > 20
ORDER BY queue_size DESC;

But even this query runs on a single replica. It tells you what that replica knows. The real check requires comparing across replicas:

-- Run on each replica and compare results
SELECT
    database,
    table,
    partition_id,
    count() AS part_count,
    sum(rows) AS total_rows,
    max(modification_time) AS latest_modification
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY database, table, partition_id;

If total_rows for the same (database, table, partition_id) differs between replicas by more than a few percent, you have drift. This is the query that would have caught the Thursday revenue discrepancy — 14 hours before the analytics lead noticed.
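If a cluster is defined in remote_servers, the manual comparison collapses into a single query ('default' is a placeholder for your cluster name):

```sql
-- Partitions where replicas disagree on row counts
SELECT
    database,
    table,
    partition_id,
    min(total_rows) AS min_rows,
    max(total_rows) AS max_rows,
    groupArray((replica, total_rows)) AS per_replica
FROM
(
    SELECT
        hostName() AS replica,
        database,
        table,
        partition_id,
        sum(rows) AS total_rows
    FROM clusterAllReplicas('default', system.parts)
    WHERE active
    GROUP BY replica, database, table, partition_id
)
GROUP BY database, table, partition_id
HAVING min_rows != max_rows
ORDER BY max_rows - min_rows DESC;
```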

Why Standard Monitoring Misses This

The monitoring stack most teams run — Prometheus + Grafana with the ClickHouse exporter — captures:

  • absolute_delay (we've covered why this lies)
  • ReplicatedPartChecksFailed (only fires when a part check detects outright corruption)
  • Query latency and error rates (drift doesn't cause errors)
  • CPU, memory, disk (infrastructure metrics, not data integrity)

None of these answer the fundamental question: "If I run the same query on each replica, do I get the same answer?"

This is the monitoring gap that turns "everything is green" into wrong analytics. The gap exists because most monitoring tools treat ClickHouse like a stateless service — check if it's up, check if it's fast, move on. But ClickHouse is a stateful distributed system, and stateful systems can be running perfectly while serving inconsistent data.

The Real-World Damage

Silent replication drift doesn't cause outages. It causes something worse: wrong answers that look right.

  • Finance reports revenue at $4.2M. The board deck goes out. The real number is $3.8M. Correcting it next week erodes trust in the data team.
  • A feature experiment shows a 12% lift on Monday and a 4% lift on Tuesday. Same data, same query, different replica. The product team ships based on the 12%.
  • A compliance query for PII deletion reports complete. But one replica still has the un-mutated rows. The data is "deleted" in two-thirds of query results.

In each case, there was no alert. No error. No red dashboard. Just a quiet lie from a replica that looked healthy.

Building a Real Replication Health Check

Stop relying solely on absolute_delay. Here's a minimum viable replication health check:

1. Log Pointer Divergence

-- Check if replicas are at the same log position
SELECT
    database,
    table,
    replica_name,
    log_pointer,
    log_max_index,
    (log_max_index - log_pointer) AS log_gap
FROM system.replicas
WHERE log_max_index - log_pointer > 10
ORDER BY log_gap DESC;

A log_gap above 10 that persists for more than a few minutes means a replica is stuck processing log entries — even if absolute_delay says zero.
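The per-replica view above can miss a replica you never query. A cross-replica version of the same check, assuming a 'default' cluster name in remote_servers:

```sql
-- Spread of log pointers across replicas for each table
SELECT
    database,
    table,
    max(log_pointer) - min(log_pointer) AS pointer_spread
FROM clusterAllReplicas('default', system.replicas)
GROUP BY database, table
HAVING pointer_spread > 10
ORDER BY pointer_spread DESC;
```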

2. Queue Depth Trending

A nonzero queue_size isn't always a problem — replicas always have a small queue during active ingestion. The signal is in the trend. A queue that grows over 5 minutes means replication is falling behind faster than it can catch up.

-- Snapshot this every minute and alert on upward trends
SELECT
    database,
    table,
    replica_name,
    queue_size,
    inserts_in_queue,
    merges_in_queue,
    now() AS checked_at
FROM system.replicas
ORDER BY queue_size DESC;
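To detect a trend, the snapshots need somewhere to live. A minimal sketch — the monitoring database, table name, and retention are all illustrative choices, not fixed conventions:

```sql
-- Destination for one-minute snapshots (7-day retention)
CREATE TABLE IF NOT EXISTS monitoring.replica_queue_snapshots
(
    checked_at DateTime,
    database String,
    table String,
    replica_name String,
    queue_size UInt32
)
ENGINE = MergeTree
ORDER BY (database, table, replica_name, checked_at)
TTL checked_at + INTERVAL 7 DAY;

-- Run every minute from cron or a scheduler:
INSERT INTO monitoring.replica_queue_snapshots
SELECT now(), database, table, replica_name, queue_size
FROM system.replicas;

-- Alert condition: queue nonzero and larger now than five minutes ago
SELECT database, table, replica_name
FROM monitoring.replica_queue_snapshots
WHERE checked_at > now() - INTERVAL 5 MINUTE
GROUP BY database, table, replica_name
HAVING min(queue_size) > 0
   AND argMax(queue_size, checked_at) > argMin(queue_size, checked_at);
```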

3. Part Count Agreement

-- Run on all replicas, compare results
-- Divergence > 5% for any partition = investigation required
SELECT
    database,
    table,
    partition_id,
    count() AS part_count,
    sum(rows) AS total_rows,
    sum(bytes_on_disk) AS total_bytes
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY total_rows DESC;

4. Stale Mutation Detection

-- Mutations older than 10 minutes that haven't completed
SELECT
    database,
    table,
    mutation_id,
    command,
    create_time,
    dateDiff('second', create_time, now()) AS age_seconds,
    parts_to_do,
    latest_fail_reason
FROM system.mutations
WHERE is_done = 0
  AND create_time < now() - INTERVAL 10 MINUTE
ORDER BY create_time ASC;

Any mutation older than 10 minutes with parts_to_do > 0 means at least one replica is serving stale data.

What ClusterSight Monitors

ClusterSight runs these checks continuously across all replicas, not just the one your query happens to hit. It tracks:

  • Log pointer divergence across all replicas for every replicated table
  • Queue depth trends with 1-minute granularity — alerting on growing queues, not just nonzero queues
  • Part count agreement across replicas, flagging partition-level mismatches
  • Mutation completion status with per-replica tracking and age-based alerting
  • Data freshness verification — comparing max event timestamps across replicas to detect gaps

These are part of the U (Uptime) letter in the PULSE Framework — the replication health dimension that most monitoring setups miss entirely.

The difference between checking absolute_delay and running a full PULSE replication check is the difference between asking "Is the replica responding?" and asking "Is the replica telling the truth?"

Check Your Cluster's PULSE

Your replicas might be lying right now. Run the cross-replica part count query above on your cluster today. Compare the results. If the numbers match — you're one of the lucky ones. If they don't, you just found a bug in your analytics that no dashboard would have caught.

Get started with ClusterSight to automate these checks across every replica, every table, every minute.


This post is part of the Operator's Blind Spot series — deep dives into ClickHouse production failures that standard monitoring misses.

Frequently Asked Questions

What is silent replication drift in ClickHouse?

Silent replication drift occurs when a ClickHouse replica's data diverges from its peers without triggering alerts. The absolute_delay column in system.replicas can read 0 while a replica is missing data parts, leading to inconsistent query results depending on which replica handles the request.

How do I detect replication drift in ClickHouse?

Compare row counts and max event timestamps across replicas for the same table. Query system.replicas for queue_size, inserts_in_queue, and log_pointer discrepancies. A zero absolute_delay does not guarantee consistency — you must verify data-level agreement between replicas.

Can ClickHouse replicas return different query results?

Yes. If a replica is missing data parts due to replication drift, queries routed to that replica will return incomplete results. With load balancing across replicas, this means the same dashboard query can return different numbers on consecutive refreshes.

How does ClusterSight detect replication drift?

ClusterSight monitors 400+ system table metrics including replica log pointer divergence, queue depth trends, and data part count mismatches across replicas — catching drift that absolute_delay alone misses.