2026-02-23 · 11 min read
How to Monitor ClickHouse in Production: The Complete Guide (2026)
The short answer: Monitor ClickHouse using its built-in system tables — system.metrics, system.replicas, system.parts, and system.merges. These expose 400+ operational metrics that standard monitoring tools miss. A purpose-built tool like ClusterSight surfaces all of them with automated fix commands.
Why ClickHouse Monitoring Is Different
ClickHouse is not a traditional RDBMS, and monitoring it like one is a common mistake that leads to missed failures. Unlike PostgreSQL or MySQL, ClickHouse exposes its health through a rich set of internal system tables that most general-purpose monitoring tools never query.
The result: teams using Grafana or DataDog for ClickHouse monitoring are typically blind to 70% of the metrics that actually predict and explain failures — broken parts, merge queue depth, ZooKeeper session health, mutation queue backlogs, and more.
The Three Tiers of ClickHouse Metrics
Tier 1: Standard Infrastructure Metrics
These are the metrics that Prometheus, Grafana, and any standard monitoring stack will capture:
- CPU usage — High CPU often indicates expensive queries or large merge operations
- Memory usage — ClickHouse is memory-hungry; tracking system.asynchronous_metrics for MemoryUsage is essential
- Disk usage — Track both raw disk and ClickHouse-specific compressed/uncompressed data ratios
- Network I/O — Important for replicated clusters
Query to check basic system metrics:
SELECT metric, value, description
FROM system.metrics
WHERE metric IN ('MemoryTracking', 'Query', 'Merge', 'ReplicatedChecks')
ORDER BY metric;
Tier 2: Hidden Operational Metrics
This is where standard tools fail you. ClickHouse's system.asynchronous_metrics exposes hundreds of values that are critical for production health:
- ReplicaQueueSize — How many operations are waiting to replicate
- MaxPartCountForPartition — High values indicate merge queue issues
- NumberOfDatabases / NumberOfTables — Schema bloat detection
Query to surface hidden metrics:
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE '%Replica%'
OR metric LIKE '%Queue%'
OR metric LIKE '%Part%'
ORDER BY metric;
Tier 3: Production Audit Metrics
These require deeper querying of operational tables and are almost never captured by standard tools:
- Broken parts — Parts that failed ClickHouse's integrity checks; the server detaches them, so they appear in system.detached_parts with a broken reason rather than in system.parts
- Stuck mutations — Mutations in system.mutations that haven't progressed
- Detached parts — Files in the detached/ folder that indicate past corruption
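Broken parts and stuck mutations get dedicated queries later in this guide (#1 and #4 below). Detached parts can be checked directly from system.detached_parts; the grouping below is a minimal sketch, one reasonable way to slice it rather than the only one:
SELECT database, table, reason, count() AS detached_parts
FROM system.detached_parts
GROUP BY database, table, reason
ORDER BY detached_parts DESC;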
The 10 Most Critical ClickHouse Metrics
1. Broken Parts Count
Broken parts are the most urgent signal in ClickHouse monitoring. A broken part means data corruption: ClickHouse detaches the part and stops serving its rows, so results from that table are silently incomplete until the part is repaired or restored. Because broken parts are detached, look for them in system.detached_parts rather than system.parts:
SELECT database, table, count() AS broken_parts
FROM system.detached_parts
WHERE reason LIKE 'broken%'
GROUP BY database, table
ORDER BY broken_parts DESC;
Healthy value: 0. Any broken parts require immediate attention. See our ClickHouse broken parts fix guide.
2. Replication Lag
For replicated tables, replication lag measures how far behind a replica is from the source.
SELECT database, table, replica_name,
absolute_delay AS lag_seconds
FROM system.replicas
WHERE absolute_delay > 0
ORDER BY absolute_delay DESC;
Healthy value: Under 30 seconds. Over 5 minutes indicates a serious problem. See our replication lag guide.
3. Merge Queue Depth
ClickHouse merges data parts in the background. A deep merge queue means inserts are creating parts faster than ClickHouse can merge them — a common cause of "too many parts" errors.
SELECT database, table, count() AS merges_in_queue
FROM system.merges
GROUP BY database, table
ORDER BY merges_in_queue DESC;
Healthy value: Under 10 per table. See our merge queue guide.
4. Mutation Queue Length
Mutations (ALTER TABLE UPDATE/DELETE) in ClickHouse are expensive background operations. Stuck mutations block future merges.
SELECT database, table, mutation_id, command,
parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time ASC;
Healthy value: 0 pending mutations. Any stuck mutation needs investigation.
5. ZooKeeper/Keeper Session Health
Replicated tables depend on ZooKeeper (or ClickHouse Keeper). A lost or degraded session cascades into replication failures.
SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse';
See our ZooKeeper dependency guide.
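If the session itself is suspect, system.replicas exposes per-table flags that usually show the fallout first. A minimal sketch, assuming a lost Keeper session surfaces as read-only or session-expired replicas:
SELECT database, table, replica_name, is_readonly, is_session_expired
FROM system.replicas
WHERE is_readonly OR is_session_expired;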
6. Query Error Rate
SELECT
toStartOfMinute(event_time) AS minute,
countIf(type = 'ExceptionWhileProcessing') AS errors,
count() AS total,
round(errors / total * 100, 2) AS error_pct
FROM system.query_log
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute DESC
LIMIT 30;
Healthy value: Under 1% error rate.
7. Disk Usage by Table
SELECT database, table,
formatReadableSize(sum(bytes_on_disk)) AS disk_usage,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
round(sum(bytes_on_disk) / sum(data_uncompressed_bytes) * 100, 1) AS compression_ratio_pct
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;
8. Slow Query Detection
SELECT query_id, user, query,
round(query_duration_ms / 1000, 2) AS duration_sec,
read_rows, read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
AND query_duration_ms > 5000
AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
See our slow queries guide.
9. Part Count Per Table
Too many parts in a table is a leading indicator of merge queue issues and will eventually cause insert errors.
SELECT database, table, count() AS part_count
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY part_count DESC
LIMIT 20;
Healthy value: Under 300 parts per table. Over 1000 is critical.
10. Memory Usage by Query
SELECT query_id, user,
formatReadableSize(memory_usage) AS memory,
query
FROM system.processes
ORDER BY memory_usage DESC
LIMIT 10;
Setting Up ClickHouse Monitoring: Three Approaches
Approach 1: Manual Grafana + Prometheus Setup
Time to set up: 4–8 hours
Metrics coverage: ~30% (Tier 1 only)
Fix guidance: None
You'll need to configure the ClickHouse Prometheus exporter, set up a Prometheus scrape job, build Grafana dashboards, and write alerting rules manually. This approach covers basic infrastructure metrics but misses all Tier 2 and Tier 3 operational metrics.
Approach 2: Custom Scripts + Alerting
Time to set up: 2–5 days
Metrics coverage: ~60% (Tiers 1–2)
Fix guidance: Manual
Write cron jobs or scripts that query system.parts, system.replicas, and system.merges and send alerts to Slack. Effective but requires ongoing maintenance as ClickHouse versions change.
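As a sketch of what such a script might run, a few of the checklist thresholds from the end of this guide can be folded into a single query that returns one row per breach. The check names and thresholds below are illustrative, not canonical:
SELECT 'replication_lag' AS check_name, database, table, toString(absolute_delay) AS value
FROM system.replicas
WHERE absolute_delay > 300
UNION ALL
SELECT 'stuck_mutation' AS check_name, database, table, mutation_id AS value
FROM system.mutations
WHERE is_done = 0
UNION ALL
SELECT 'too_many_parts' AS check_name, database, table, toString(count()) AS value
FROM system.parts
WHERE active = 1
GROUP BY database, table
HAVING count() > 300;
An empty result means all three checks pass; any rows can be forwarded to Slack or PagerDuty by the wrapping script.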
Approach 3: Purpose-Built Tool (Recommended)
Time to set up: Under 8 minutes
Metrics coverage: 100% (All 3 tiers, 400+ metrics)
Fix guidance: Every alert ships with a copy-pasteable SQL fix command
ClusterSight was built exclusively for ClickHouse. Deploy with Docker Compose, connect your cluster, and get a health score (0–100) with pre-built panels covering all 10 critical metrics above — plus 290 more.
The PULSE Framework: A Structured Approach to ClickHouse Health
The 10 metrics above are essential, but in production you need a mental model for organizing them — especially when triaging an incident at 3 AM. The PULSE Framework maps every ClickHouse operational metric to five dimensions:
P — Parts
Are merges keeping up? This dimension covers part counts per table, tiny parts accumulation, detached parts, and compression ratios. When the P score drops, you're heading toward "too many parts" errors and degraded query performance.
Key metrics: Part count per table (#9 above), broken parts (#1), merge queue depth (#3), parts per partition, detached part count.
Warning signs: Part count above 300 per table, any broken or detached parts, merge queue growing faster than it drains.
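One useful companion to metric #9: the "too many parts" limit (parts_to_throw_insert) is evaluated per partition rather than per table, and its default varies by ClickHouse version, so a per-partition breakdown is worth watching. A minimal sketch:
SELECT database, table, partition_id, count() AS parts_in_partition
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY parts_in_partition DESC
LIMIT 20;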
U — Uptime
Is replication healthy? This dimension goes beyond absolute_delay to check actual data consistency across replicas — log pointer divergence, queue depth trends, part count agreement, and ZooKeeper session health.
Key metrics: Replication lag (#2), ZooKeeper health (#5), replica queue size, log pointer gaps, cross-replica data consistency.
Warning signs: absolute_delay above 30 seconds, growing queue size, replicas with different part counts for the same partition.
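The raw fields behind these warning signs can be sampled straight from system.replicas; a healthy replica keeps log_pointer close to log_max_index and queue_size near zero. A minimal sketch:
SELECT database, table, replica_name,
       queue_size, log_max_index, log_pointer, absolute_delay
FROM system.replicas
ORDER BY queue_size DESC
LIMIT 10;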
L — Latency
Are queries performant? This dimension tracks slow queries, memory pressure, thread pool saturation, and error rates. A latency problem often signals an upstream issue in Parts or Uptime.
Key metrics: Slow query detection (#8), query error rate (#6), memory usage by query (#10), thread pool utilization, query queue depth.
Warning signs: Error rate above 1%, queries exceeding 10 seconds, memory usage spikes on specific queries.
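Beyond flagging individual slow queries (#8), a per-minute p95 makes latency drift visible before users complain. A minimal sketch over system.query_log:
SELECT
    toStartOfMinute(event_time) AS minute,
    count() AS queries,
    round(quantile(0.95)(query_duration_ms), 0) AS p95_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute DESC
LIMIT 30;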
S — Stability
Are mutations and background processes healthy? Stuck mutations silently block merges and leave stale data in place. Distributed DDL queue backlogs indicate cluster coordination issues. Error trending reveals chronic problems before they become acute.
Key metrics: Mutation queue (#4), distributed DDL queue, system.errors trending, background pool utilization, failed background operations.
Warning signs: Any mutation pending longer than 10 minutes, growing error counts in system.errors, background pool at capacity.
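system.errors keeps a counter per error type since server start, so sorting by count is a quick way to spot chronic problems. A minimal sketch (the last_error_* columns exist on recent ClickHouse versions):
SELECT name, code, value, last_error_time, last_error_message
FROM system.errors
WHERE value > 0
ORDER BY value DESC
LIMIT 20;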
E — Efficiency
Are resources used well? This dimension covers disk utilization, compression analysis, TTL compliance, and cost per query. Efficiency problems don't cause incidents — they cause bills.
Key metrics: Disk usage by table (#7), compression ratios, TTL compliance, unused tables and columns, read amplification per query.
Warning signs: Compression ratio above 60%, tables with expired TTL data still on disk, unused tables consuming significant storage.
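Compression analysis is most actionable at the column level, since a single poorly encoded column often dominates disk usage. A minimal sketch using system.parts_columns; the 60% threshold above applies the same way here:
SELECT database, table, column,
       formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
       formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
       round(sum(column_data_compressed_bytes) / sum(column_data_uncompressed_bytes) * 100, 1) AS ratio_pct
FROM system.parts_columns
WHERE active = 1
GROUP BY database, table, column
HAVING sum(column_data_uncompressed_bytes) > 0
ORDER BY sum(column_data_compressed_bytes) DESC
LIMIT 20;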
Every alert, every dashboard panel, and every health check in ClusterSight maps back to one or more PULSE letters. When your cluster's health score drops, PULSE tells you which dimension degraded — so you know whether to investigate parts, replication, queries, mutations, or resources.
Health Score: A Single Number for ClickHouse Health
The PULSE Framework feeds into a composite health score that weighs the most critical metrics across all five dimensions into a single 0–100 value. ClusterSight's health score works as follows:
- 90–100: Healthy — no action needed
- 70–89: Warning — investigate within 24 hours
- 50–69: Degraded — investigate within 1 hour
- Below 50: Critical — immediate action required
The score penalizes: broken parts (severe), stuck mutations (moderate), replication lag (moderate), merge queue depth (mild), and disk pressure (mild).
Monitoring Checklist
Before going to production with a ClickHouse cluster, verify:
- Broken parts alert configured (threshold: > 0)
- Replication lag alert configured (threshold: > 300 seconds)
- Merge queue alert configured (threshold: > 50 per table)
- Mutation queue alert configured (threshold: > 0 stuck)
- Disk usage alert configured (threshold: > 80%)
- Query error rate alert configured (threshold: > 5%)
- ZooKeeper session health monitored
- Slow query detection in place (threshold: > 10 seconds)
Next Steps
- Your Replicas Are Lying: How Silent Replication Drift Corrupts Analytics
- 312 Parts Per Partition: The Merge Backlog Time Bomb
- Grafana + Prometheus Is Not ClickHouse Monitoring
- PULSE Check: 5 Queries Before Every ClickHouse Upgrade
- PULSE Check: Is Your Compression Actually Working?
- The Complete Guide to system.replicas
- system.merges Decoded: Understanding ClickHouse's Background Engine
- The Mutation That Ran for 72 Hours
- PULSE Check: Detecting Read-Only Replicas
- system.query_log: The 15 Fields That Actually Matter
- Designing Health Scores: How ClusterSight Weights 400+ Metrics
- Why We're Building a ClickHouse Health Tool
- Why system.errors Has 40 Error Types You've Never Checked
- PULSE Check: The One Query That Shows If Merges Are Falling Behind
- system.parts_columns: Tables Destroying Your Disk Budget
- Stop Running OPTIMIZE TABLE in Production
- First 10 Users: What Operators Actually Want
- Month 3: Building ClusterSight
- ClickHouse Broken Parts: Causes, Detection & Fix
- ClickHouse Replication Lag: How to Diagnose & Fix
- ClickHouse Merge Queue: Why It Happens & How to Fix
- ClickHouse System Tables: The Complete Guide
- Try ClusterSight — deploy in under 8 minutes
Frequently Asked Questions
What is the best way to monitor ClickHouse?
The most effective approach combines ClickHouse system tables (system.metrics, system.asynchronous_metrics, system.query_log) with an observability tool purpose-built for ClickHouse. Standard tools like Grafana miss 70% of the operational metrics that predict failures.
How do I check ClickHouse cluster health?
Query system.replicas for replication status, system.merges for merge queue depth, system.parts for broken parts, and system.mutations for stuck mutations. A healthy cluster has 0 broken parts, replication lag under 30 seconds, and no stuck mutations.
What ClickHouse metrics should I monitor in production?
Critical metrics include: broken parts count, replication lag seconds, merge queue size, mutation queue length, ZooKeeper session status, disk usage percent, memory usage, and query error rate.
What is the PULSE Framework for ClickHouse?
PULSE is a structured approach to ClickHouse operational health. It organizes all monitoring metrics into five dimensions: Parts (merge health), Uptime (replication health), Latency (query performance), Stability (mutations and background processes), and Efficiency (resource utilization). ClusterSight uses PULSE to generate a 0–100 health score for your cluster.
Does Grafana work for ClickHouse monitoring?
Grafana can visualize ClickHouse metrics but requires significant manual setup and misses 70% of operational metrics hidden in ClickHouse system tables. Purpose-built tools like ClusterSight surface these automatically.