2026-05-04 · 9 min read
The Mutation That Ran for 72 Hours While Every Dashboard Said Green
For 72 hours, a ClickHouse cluster served stale data to every analytics dashboard. A mutation that should have taken 3 minutes got stuck on a single replica. The other two replicas completed the mutation and moved on. No alert fired. No error appeared. Every monitoring tool said the cluster was healthy.
The mutation was an ALTER TABLE UPDATE that corrected a currency conversion bug — a one-line fix to a single column. It completed on replicas 1 and 2 in under a minute. On replica 3, it hung indefinitely. For three days, one-third of queries returned the old, incorrect currency values. The finance team noticed when monthly aggregates didn't match between report runs.
This is the third post in the Operator's Blind Spot series. The first covered silent replication drift. The second covered merge backlog time bombs. This one covers the failure mode that combines the worst of both: stale data AND merge queue degradation.
How Mutations Work (And Don't)
ClickHouse mutations (ALTER TABLE UPDATE, ALTER TABLE DELETE) are fundamentally different from SQL UPDATE and DELETE in traditional databases:
- They're asynchronous. The ALTER statement returns immediately. The actual data rewriting happens in the background.
- They rewrite parts. Instead of modifying rows in place, ClickHouse creates new parts with the mutation applied and marks old parts for deletion.
- They're per-replica. Each replica executes the mutation independently. There's no distributed coordination for mutation progress.
- They block merges. While a mutation is processing a part, that part can't be merged. If the mutation stalls, the merge queue for that table freezes.
-- See the mutation lifecycle
SELECT
database,
table,
mutation_id,
command,
create_time,
is_done,
parts_to_do,
latest_fail_reason
FROM system.mutations
ORDER BY create_time DESC
LIMIT 20;

The Three Ways Mutations Get Stuck
1. The Failing Part
A mutation processes parts one at a time. If one part causes the mutation to fail (corrupt data, type mismatch, out of memory), the mutation retries that part indefinitely. The latest_fail_reason column tells you why — but only if you're looking.
-- Mutations that are failing silently
SELECT
database,
table,
mutation_id,
command,
create_time,
dateDiff('second', create_time, now()) AS age_seconds,
parts_to_do,
latest_fail_reason,
latest_fail_time
FROM system.mutations
WHERE is_done = 0
AND latest_fail_reason != ''
ORDER BY create_time ASC;

In the 72-hour incident, latest_fail_reason was Memory limit exceeded — the mutation tried to process a 15GB part with insufficient memory. It retried every 60 seconds for three days. The error was logged at DEBUG level, which nobody had enabled.
2. The Resource Starvation
Mutations compete with merges and queries for CPU, memory, and disk I/O. Under heavy query load, mutations can be starved of resources. The mutation appears "in progress" but parts_to_do never decreases.
-- Mutations that aren't making progress
-- Snapshot this every 5 minutes and compare parts_to_do
SELECT
database,
table,
mutation_id,
command,
create_time,
parts_to_do,
dateDiff('second', create_time, now()) AS age_seconds,
now() AS checked_at
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time ASC;

If parts_to_do hasn't changed between snapshots, the mutation is stalled.
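The snapshot-and-compare step can be sketched in a few lines of Python. The input shape is an assumption — here, each snapshot is a dict keyed by (database, table, mutation_id) with parts_to_do as the value, built from the query above however you schedule it (cron, a sidecar, etc.):

```python
# Sketch of the snapshot-comparison logic, assuming each snapshot is a
# dict mapping (database, table, mutation_id) -> parts_to_do, taken from
# the system.mutations query above every few minutes.

def stalled_mutations(prev: dict, curr: dict) -> list:
    """Return mutations whose parts_to_do did not decrease between snapshots."""
    stalled = []
    for key, parts_to_do in curr.items():
        # A mutation is stalled if it existed in the previous snapshot
        # and has made zero progress since.
        if key in prev and prev[key] <= parts_to_do:
            stalled.append(key)
    return stalled

# Example: mutation 0000000042 stays at 12 remaining parts -> stalled;
# mutation 0000000043 drops from 8 to 3 -> progressing.
prev = {("analytics", "events", "0000000042"): 12,
        ("analytics", "events", "0000000043"): 8}
curr = {("analytics", "events", "0000000042"): 12,
        ("analytics", "events", "0000000043"): 3}

print(stalled_mutations(prev, curr))
# -> [('analytics', 'events', '0000000042')]
```

A mutation that appears once and then disappears from the next snapshot has either finished or been killed, so only keys present in both snapshots are compared.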
3. The Replica Divergence
This is the 72-hour scenario. The mutation completes on some replicas but not others. Queries are load-balanced across replicas. Some queries hit the mutated data, some hit the stale data. Results vary on every refresh.
-- Check mutation status across replicas
-- Run on each replica and compare results
SELECT
database,
table,
mutation_id,
command,
is_done,
parts_to_do,
latest_fail_reason
FROM system.mutations
WHERE command LIKE '%UPDATE%'
OR command LIKE '%DELETE%'
ORDER BY create_time DESC
LIMIT 20;If is_done = 1 on some replicas and is_done = 0 on others for the same mutation_id, you have divergence. Queries are returning different results depending on which replica handles them — exactly like silent replication drift, but caused by mutations instead of part fetches.
The Cascade: How a Stuck Mutation Kills Your Merge Queue
A stuck mutation doesn't just serve stale data. It creates a chain reaction:
Stuck mutation on table X
→ Parts waiting for mutation can't be merged
→ Part count for table X grows
→ Query performance on table X degrades
→ At 300 parts/partition, inserts to table X fail
→ Pipeline backs up
→ Incident
This is the same merge backlog time bomb, but triggered by a mutation instead of high-frequency inserts. The root cause is different but the outcome is identical.
-- Is a stuck mutation causing part accumulation?
SELECT
m.database,
m.table,
m.mutation_id,
m.command,
dateDiff('second', m.create_time, now()) AS mutation_age_sec,
m.parts_to_do,
p.part_count,
p.max_partition_parts
FROM system.mutations m
INNER JOIN (
SELECT
database,
table,
count() AS part_count,
max(partition_parts) AS max_partition_parts
FROM (
SELECT database, table, partition_id, count() AS partition_parts
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
)
GROUP BY database, table
) p ON m.database = p.database AND m.table = p.table
WHERE m.is_done = 0
ORDER BY p.max_partition_parts DESC;

If mutation_age_sec is high AND max_partition_parts is above 100, the mutation is actively causing a merge backlog.
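That two-condition rule is simple enough to encode directly in whatever runs the query. The thresholds below are the ones used in this post (10-minute age, 100 parts per partition) — tune them for your cluster:

```python
# Sketch of the alert rule for the query above. Thresholds are the ones
# used in this post; adjust them for your cluster's normal behavior.

AGE_THRESHOLD_SEC = 600    # 10 minutes
PARTS_THRESHOLD = 100      # parts per partition

def is_causing_backlog(mutation_age_sec: int, max_partition_parts: int) -> bool:
    """True when a long-running mutation coincides with part accumulation."""
    return (mutation_age_sec > AGE_THRESHOLD_SEC
            and max_partition_parts > PARTS_THRESHOLD)

print(is_causing_backlog(259200, 312))   # 72 hours old, 312 parts -> True
print(is_causing_backlog(30, 40))        # fresh mutation, healthy table -> False
```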
The Compliance Nightmare
Stuck mutations aren't just a performance problem. They're a data integrity and compliance risk:
- PII deletion: ALTER TABLE DELETE WHERE user_id = 'X' completes on 2/3 replicas. One-third of queries still return the PII. You're not compliant.
- Data correction: ALTER TABLE UPDATE currency_rate = 1.08 WHERE ... completes partially. Finance reports use different rates depending on which replica they hit.
- Access revocation: ALTER TABLE DELETE WHERE org_id = 'cancelled_customer' gets stuck. The cancelled customer's data is still queryable on one replica.
In each case, the mutation looks done from the replica where you ran it (if it happened to be a replica that completed). You have to check system.mutations on every replica to know the true state.
Building a Mutation Health Monitor
Check 1: Age-Based Detection
-- Any mutation older than 10 minutes is suspicious
SELECT
database,
table,
mutation_id,
command,
create_time,
dateDiff('minute', create_time, now()) AS age_minutes,
parts_to_do,
latest_fail_reason
FROM system.mutations
WHERE is_done = 0
AND create_time < now() - INTERVAL 10 MINUTE
ORDER BY create_time ASC;

Alert: Any result = investigate.
Check 2: Failure Detection
-- Mutations that have failed at least once
SELECT
database,
table,
mutation_id,
command,
latest_fail_reason,
latest_fail_time,
parts_to_do
FROM system.mutations
WHERE is_done = 0
AND latest_fail_reason != ''
ORDER BY latest_fail_time DESC;

Alert: Any result = critical. The mutation will retry indefinitely but likely won't succeed without intervention.
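Checks 1 and 2 can be folded into a single classifier over rows from system.mutations. The row shape (a dict keyed by column name, as a typical ClickHouse client would return) and the severity labels are assumptions for illustration:

```python
# Sketch: classify one system.mutations row into an alert severity.
# Row shape is assumed to be a dict keyed by column name; the labels
# "critical"/"warning"/"ok" are illustrative, not a product API.

def classify(row: dict, age_sec: int) -> str:
    """'critical' if the mutation has recorded a failure,
    'warning' if it is merely old, 'ok' otherwise."""
    if row["is_done"]:
        return "ok"
    if row["latest_fail_reason"]:
        return "critical"   # failing and retrying forever: needs a human
    if age_sec > 600:
        return "warning"    # older than 10 minutes with no recorded failure
    return "ok"

row = {"is_done": 0, "latest_fail_reason": "Memory limit exceeded"}
print(classify(row, age_sec=259200))   # -> critical
```

The failure check comes before the age check on purpose: a mutation with a non-empty latest_fail_reason is critical no matter how young it is.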
Check 3: Merge Impact Assessment
-- Tables where mutations are blocking merges
SELECT
database,
table,
countIf(is_mutation = 1) AS mutation_merges,
countIf(is_mutation = 0) AS regular_merges,
count() AS total_active
FROM system.merges
GROUP BY database, table
HAVING mutation_merges > 0
ORDER BY mutation_merges DESC;See system.merges Decoded for the full breakdown of how mutations consume merge pool capacity.
Check 4: Kill and Retry
-- Kill a stuck mutation (does NOT roll back completed parts)
KILL MUTATION WHERE database = 'analytics'
AND table = 'events'
AND mutation_id = '0000000042';
-- Verify it's gone
SELECT * FROM system.mutations
WHERE database = 'analytics' AND table = 'events'
ORDER BY create_time DESC LIMIT 5;
-- Re-run the mutation if needed
ALTER TABLE analytics.events
UPDATE currency_rate = 1.08
WHERE event_date >= '2026-01-01'
SETTINGS mutations_sync = 1; -- Wait for completion this time

Important: KILL MUTATION leaves the table in a partially-mutated state. Parts that were already processed have the new data. Parts that weren't still have the old data. You need to either re-run the mutation or accept the partial state.
The mutations_sync setting makes the ALTER statement wait for the mutation to complete before returning: mutations_sync = 1 waits for the current server only, while mutations_sync = 2 waits for every replica. In a replicated setup — exactly the scenario from the 72-hour incident — use mutations_sync = 2, so you learn immediately if any replica fails instead of discovering it three days later.
How This Maps to PULSE
Mutation monitoring is the S (Stability) dimension of the PULSE Framework. It also impacts P (Parts) when stuck mutations cause merge backlogs.
ClusterSight monitors mutations across every replica:
- Per-replica mutation status — detects when a mutation completes on some replicas but not others
- Progress tracking — snapshots parts_to_do every 60 seconds and alerts on stalled mutations
- Failure detection — alerts immediately when latest_fail_reason is non-empty
- Merge impact assessment — shows when mutations are consuming merge pool capacity and causing part accumulation
The 72-hour mutation would have been caught in under 10 minutes — before a single stale query was served.
Check your cluster's PULSE.
This post is part of the Operator's Blind Spot series — deep dives into ClickHouse production failures that standard monitoring misses. Previously: Your Replicas Are Lying and 312 Parts Per Partition.
Frequently Asked Questions
What happens when a ClickHouse mutation gets stuck?
A stuck mutation means ALTER TABLE UPDATE or DELETE hasn't completed on all parts. Queries hitting un-mutated parts return stale data — rows that should have been updated or deleted are still served. Meanwhile, the mutation blocks future merges on the affected table, causing part count growth.
How do I check for stuck mutations in ClickHouse?
Query SELECT database, table, mutation_id, command, create_time, parts_to_do, latest_fail_reason FROM system.mutations WHERE is_done = 0 ORDER BY create_time ASC. Any mutation with parts_to_do > 0 and age over 10 minutes needs investigation.
How do I kill a stuck mutation in ClickHouse?
Run KILL MUTATION WHERE database = 'db' AND table = 'table' AND mutation_id = 'mutation_id'. This stops the mutation but does NOT roll back changes already applied to some parts — your table will have partially mutated data. You may need to re-run the mutation or restore from a replica.
Do ClickHouse mutations block merges?
Yes. Active mutations prevent merges on the parts they need to process. A long-running mutation can freeze the merge queue for that table, causing part count growth that eventually leads to 'too many parts' errors.