2026-03-26 · 6 min read
PULSE Check: 5 Queries to Run Before Every ClickHouse Upgrade
Upgrading ClickHouse without checking cluster health first is how a 10-minute maintenance window turns into a 4-hour incident. These five queries take 30 seconds to run and catch the problems that turn upgrades into rollbacks. Each maps to a dimension of the PULSE Framework.
Why Pre-Upgrade Checks Matter
ClickHouse upgrades restart the server process. During restart, replicated tables re-sync with ZooKeeper, pending mutations resume, and the merge queue restarts. If any of these subsystems are already unhealthy, the restart amplifies the problem:
- Broken parts → ClickHouse may refuse to start or drop the table
- Stuck mutations → The mutation retries on startup, potentially blocking the merge queue for hours
- Replication lag → A lagging replica that restarts may lose its position in the replication queue
- Merge backlog → Post-restart merge storm competes with replication for I/O
- ZooKeeper issues → Replicated tables fail to initialize on startup
Run these five queries on every node before you begin.
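If you would rather script the checks than paste them into clickhouse-client, a minimal sketch over ClickHouse's HTTP interface could look like the following. This is an assumption-laden sketch, not the article's tooling: it assumes the default HTTP port 8123 on localhost with no auth, and the helper names are my own.

```python
import urllib.parse
import urllib.request

# Assumption: default ClickHouse HTTP endpoint; adjust host, port, and auth as needed.
CLICKHOUSE_URL = "http://localhost:8123/"

def parse_tsv(body: str) -> list[list[str]]:
    """Split a TabSeparated response into rows; an empty body means zero rows,
    which for every check in this post is a pass."""
    return [line.split("\t") for line in body.splitlines() if line]

def run_query(sql: str, url: str = CLICKHOUSE_URL) -> list[list[str]]:
    """POST a read-only query and return its rows. Strips any trailing
    semicolon so the FORMAT clause can be appended safely."""
    q = sql.strip().rstrip(";") + " FORMAT TabSeparated"
    req = urllib.request.Request(url, data=q.encode())
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_tsv(resp.read().decode())
```

Each of the five queries below can then be fed to run_query, with "zero rows returned" as the pass condition.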
Query 1: Broken Parts (PULSE → P)
-- Must return 0 rows. Any result = do NOT upgrade.
SELECT
database,
table,
name AS part_name,
reason
FROM system.detached_parts
ORDER BY database, table;
Also check for inactive parts that still contain rows — leftovers that should have been cleaned up after merges:
SELECT database, table, name, active
FROM system.parts
WHERE active = 0 AND rows > 0
ORDER BY modification_time DESC
LIMIT 20;
Pass criteria: Zero detached parts and zero inactive parts with rows. If you find broken parts, fix them first — see the broken parts fix guide.
Query 2: Stuck Mutations (PULSE → S)
-- Mutations pending for more than 5 minutes
SELECT
database,
table,
mutation_id,
command,
create_time,
dateDiff('second', create_time, now()) AS age_seconds,
parts_to_do,
latest_fail_reason
FROM system.mutations
WHERE is_done = 0
AND create_time < now() - INTERVAL 5 MINUTE
ORDER BY create_time ASC;
Pass criteria: Zero results. A stuck mutation means background processes are blocked. Kill it with KILL MUTATION WHERE mutation_id = 'id' or wait for it to complete before upgrading. Stuck mutations can silently serve stale data — see Your Replicas Are Lying for why this matters.
Query 3: Replication Lag (PULSE → U)
-- All replicated tables with any lag
SELECT
database,
table,
replica_name,
absolute_delay AS lag_seconds,
queue_size,
inserts_in_queue,
merges_in_queue,
log_max_index - log_pointer AS log_gap
FROM system.replicas
WHERE absolute_delay > 10
OR queue_size > 20
OR (log_max_index - log_pointer) > 10
ORDER BY absolute_delay DESC;
Pass criteria: Zero results. Any replication lag should be resolved before upgrading. Force sync with SYSTEM SYNC REPLICA database.table if needed. For deeper replication issues, see the replication lag guide and how silent drift can corrupt analytics.
Query 4: Merge Queue Depth (PULSE → P)
-- Current merge activity and queue pressure
SELECT
database,
table,
count() AS active_merges,
sum(rows_read) AS total_rows_merging,
formatReadableSize(sum(bytes_read_uncompressed)) AS data_merging
FROM system.merges
GROUP BY database, table
ORDER BY active_merges DESC;
And check part count pressure:
-- Tables with concerning part counts
SELECT
database,
table,
count() AS part_count,
max(modification_time) AS latest_part
FROM system.parts
WHERE active = 1
GROUP BY database, table
HAVING part_count > 100
ORDER BY part_count DESC;
Pass criteria: No table with more than 300 active parts, and no merge queue with more than 50 pending operations. The query above surfaces tables past 100 parts as an early warning; treat anything past 300 as a hard fail. If the merge queue is deep, wait for it to drain — see the merge queue guide and merge internals deep dive.
Query 5: ZooKeeper/Keeper Health (PULSE → U)
-- Check ZooKeeper connectivity and session
SELECT
name,
value
FROM system.zookeeper
WHERE path = '/';
And verify no replicated tables are in read-only mode:
-- Read-only replicas cannot process writes or replication
SELECT
database,
table,
replica_name,
is_readonly,
absolute_delay
FROM system.replicas
WHERE is_readonly = 1;
Pass criteria: ZooKeeper responds to the path query. Zero read-only replicas. If ZooKeeper is degraded, do not upgrade — see the ZooKeeper dependency guide.
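Beyond the SQL-side view, you can probe the ZooKeeper process itself with its four-letter-word commands: a healthy server answers ruok with imok. A minimal sketch follows — note it assumes the default client port 2181, that ruok is whitelisted (4lw.commands.whitelist) on modern ZooKeeper releases, and that the function names are my own.

```python
import socket

def parse_ruok(reply: bytes) -> bool:
    """A healthy ZooKeeper/Keeper answers the ruok probe with exactly b'imok'."""
    return reply == b"imok"

def keeper_ok(host: str = "localhost", port: int = 2181, timeout: float = 2.0) -> bool:
    """Send ZooKeeper's 'ruok' four-letter command and check for 'imok'.
    Returns False on any connection failure, which should also block the upgrade."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as s:
            s.sendall(b"ruok")
            return parse_ruok(s.recv(4))
    except OSError:
        return False
```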
The 30-Second Pre-Upgrade Runbook
Run all five queries. Score each pass/fail:
| Check | PULSE Dimension | Pass | Fail Action |
|-------|----------------|------|-------------|
| Broken parts | P (Parts) | 0 detached/broken | Fix first |
| Stuck mutations | S (Stability) | 0 pending > 5 min | Kill or wait |
| Replication lag | U (Uptime) | < 30s, queue < 20 | Sync first |
| Merge queue | P (Parts) | < 300 parts/table | Wait for drain |
| ZooKeeper | U (Uptime) | Responds, 0 read-only | Investigate |
All five pass → safe to upgrade. Any fail → fix before proceeding.
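The pass/fail gate in the table above can be sketched as a small function. This is a hypothetical helper (the names and signature are my own), taking the summary numbers from the five queries and applying the runbook's thresholds:

```python
def pulse_gate(detached_parts: int, stuck_mutations: int,
               max_lag_seconds: float, max_parts_per_table: int,
               readonly_replicas: int) -> dict[str, bool]:
    """Map the five query results onto pass/fail per the runbook table."""
    return {
        "parts_broken":      detached_parts == 0,        # P: 0 detached/broken
        "stability":         stuck_mutations == 0,       # S: 0 pending > 5 min
        "uptime_replication": max_lag_seconds < 30,      # U: lag under 30s
        "parts_merges":      max_parts_per_table < 300,  # P: under 300 parts/table
        "uptime_keeper":     readonly_replicas == 0,     # U: 0 read-only replicas
    }

def safe_to_upgrade(results: dict[str, bool]) -> bool:
    """All five must pass; any single failure blocks the upgrade."""
    return all(results.values())
```

For example, a cluster with zero broken parts, zero stuck mutations, 5 seconds of lag, 120 parts on its busiest table, and no read-only replicas passes; flip any one input past its threshold and the gate fails.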
ClusterSight runs this PULSE check continuously and shows a real-time upgrade-readiness status. No manual queries needed.
This post is part of the PULSE Check series — tactical health checks for ClickHouse operators. For the full monitoring framework, see How to Monitor ClickHouse in Production.
Frequently Asked Questions
What should I check before upgrading ClickHouse?
Run five checks: broken parts count (must be 0), stuck mutations (must be 0), replication lag (must be under 30 seconds), merge queue depth (must be under 50 per table), and ZooKeeper session health (must be active). Any failures should be resolved before proceeding.
Can I upgrade ClickHouse with replication lag?
No. Upgrading a node while replicas are behind risks data loss if the lagging replica becomes the new leader after restart. Ensure absolute_delay is under 30 seconds on all replicas before upgrading.
How do I check if ClickHouse is safe to upgrade?
Query system.parts for broken parts, system.mutations for stuck mutations, system.replicas for replication lag, system.merges for queue depth, and system.zookeeper for session health. All five must be clean before proceeding.
What is a PULSE Check?
A PULSE Check is a structured health verification based on the PULSE Framework — Parts, Uptime, Latency, Stability, Efficiency. Each check maps to one or more PULSE dimensions and uses specific SQL queries against ClickHouse system tables.