2026-02-23 · 8 min read
How to Monitor ClickHouse in Production: The Complete Guide (2026)
The short answer: Monitor ClickHouse using its built-in system tables — system.metrics, system.replicas, system.parts, and system.merges. These expose 300+ operational metrics that standard monitoring tools miss. A purpose-built tool like Clustersight surfaces all of them with automated fix commands.
Why ClickHouse Monitoring Is Different
ClickHouse is not a traditional RDBMS, and monitoring it like one is a common mistake that leads to missed failures. Unlike PostgreSQL or MySQL, ClickHouse exposes its health through a rich set of internal system tables that most general-purpose monitoring tools never query.
The result: teams using Grafana or DataDog for ClickHouse monitoring are typically blind to 70% of the metrics that actually predict and explain failures — broken parts, merge queue depth, ZooKeeper session health, mutation queue backlogs, and more.
The Three Tiers of ClickHouse Metrics
Tier 1: Standard Infrastructure Metrics
These are the metrics that Prometheus, Grafana, and any standard monitoring stack will capture:
- CPU usage — High CPU often indicates expensive queries or large merge operations
- Memory usage — ClickHouse is memory-hungry; tracking MemoryUsage in system.asynchronous_metrics is essential
- Disk usage — Track both raw disk and ClickHouse-specific compressed/uncompressed data ratios
- Network I/O — Important for replicated clusters
Query to check basic system metrics:
SELECT metric, value, description
FROM system.metrics
WHERE metric IN ('MemoryTracking', 'Query', 'Merge', 'ReplicatedChecks')
ORDER BY metric;

Tier 2: Hidden Operational Metrics
This is where standard tools fail you. ClickHouse's system.asynchronous_metrics exposes hundreds of values that are critical for production health:
- ReplicaQueueSize — How many operations are waiting to replicate
- MaxPartCountForPartition — High values indicate merge queue issues
- NumberOfDatabases / NumberOfTables — Schema bloat detection
Query to surface hidden metrics:
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE '%Replica%'
OR metric LIKE '%Queue%'
OR metric LIKE '%Part%'
ORDER BY metric;

Tier 3: Production Audit Metrics
These require deeper querying of operational tables and are almost never captured by standard tools:
- Broken parts — Corrupted parts that ClickHouse has detached; visible in system.detached_parts with a broken reason
- Stuck mutations — Mutations in system.mutations that haven't progressed
- Detached parts — Files in the detached/ folder that indicate past corruption
The 10 Most Critical ClickHouse Metrics
1. Broken Parts Count
Broken parts are the most urgent signal in ClickHouse monitoring. A broken part means on-disk corruption: ClickHouse detaches the part, its rows disappear from query results, and if too many parts are broken at startup the table refuses to load.
SELECT database, table, count() AS broken_parts
FROM system.detached_parts
WHERE reason LIKE 'broken%'
GROUP BY database, table
ORDER BY broken_parts DESC;

Healthy value: 0. Any broken parts require immediate attention. See our ClickHouse broken parts fix guide.
2. Replication Lag
For replicated tables, replication lag measures how far behind a replica is from the source.
SELECT database, table, replica_name,
absolute_delay AS lag_seconds
FROM system.replicas
WHERE absolute_delay > 0
ORDER BY absolute_delay DESC;

Healthy value: Under 30 seconds. Over 5 minutes indicates a serious problem. See our replication lag guide.
3. Merge Queue Depth
ClickHouse merges data parts in the background. A deep merge queue means inserts are creating parts faster than ClickHouse can merge them — a common cause of "too many parts" errors.
SELECT database, table, count() AS merges_in_queue
FROM system.merges
GROUP BY database, table
ORDER BY merges_in_queue DESC;

Healthy value: Under 10 per table. See our merge queue guide.
4. Mutation Queue Length
Mutations (ALTER TABLE UPDATE/DELETE) in ClickHouse are expensive background operations. Stuck mutations block future merges.
SELECT database, table, mutation_id, command,
parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time ASC;

Healthy value: 0 pending mutations. Any stuck mutation needs investigation.
5. ZooKeeper/Keeper Session Health
Replicated tables depend on ZooKeeper (or ClickHouse Keeper). A lost or degraded session cascades into replication failures.
SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse';

See our ZooKeeper dependency guide.
6. Query Error Rate
SELECT
toStartOfMinute(event_time) AS minute,
countIf(type = 'ExceptionWhileProcessing') AS errors,
count() AS total,
round(errors / total * 100, 2) AS error_pct
FROM system.query_log
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute DESC
LIMIT 30;

Healthy value: Under 1% error rate.
7. Disk Usage by Table
SELECT database, table,
formatReadableSize(sum(bytes_on_disk)) AS disk_usage,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
round(sum(bytes_on_disk) / sum(data_uncompressed_bytes) * 100, 1) AS compression_ratio_pct
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;

8. Slow Query Detection
SELECT query_id, user, query,
round(query_duration_ms / 1000, 2) AS duration_sec,
read_rows, read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
AND query_duration_ms > 5000
AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;

See our slow queries guide.
9. Part Count Per Table
Too many parts in a table is a leading indicator of merge queue issues and will eventually cause insert errors.
SELECT database, table, count() AS part_count
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY part_count DESC
LIMIT 20;

Healthy value: Under 300 parts per table. Over 1000 is critical.
10. Memory Usage by Query
SELECT query_id, user,
formatReadableSize(memory_usage) AS memory,
query
FROM system.processes
ORDER BY memory_usage DESC
LIMIT 10;

Setting Up ClickHouse Monitoring: Three Approaches
Approach 1: Manual Grafana + Prometheus Setup
Time to set up: 4–8 hours
Metrics coverage: ~30% (Tier 1 only)
Fix guidance: None
You'll need to configure the ClickHouse Prometheus exporter, set up a Prometheus scrape job, build Grafana dashboards, and write alerting rules manually. This approach covers basic infrastructure metrics but misses all Tier 2 and Tier 3 operational metrics.
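If you go this route, note that recent ClickHouse versions can expose a Prometheus endpoint natively, without a separate exporter. A minimal sketch of the server config, assuming the conventional port 9363 (adjust the port and file location to your deployment):

```xml
<!-- Drop into config.xml or a file under config.d/ to enable the
     built-in Prometheus endpoint. Port 9363 is the usual default. -->
<clickhouse>
    <prometheus>
        <endpoint>/metrics</endpoint>
        <port>9363</port>
        <metrics>true</metrics>                            <!-- system.metrics -->
        <events>true</events>                              <!-- system.events -->
        <asynchronous_metrics>true</asynchronous_metrics>  <!-- system.asynchronous_metrics -->
    </prometheus>
</clickhouse>
```

Point a Prometheus scrape job at host:9363/metrics and the Tier 1 values above become available, though the Tier 2/3 audit queries still have to be run separately.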
Approach 2: Custom Scripts + Alerting
Time to set up: 2–5 days
Metrics coverage: ~60% (Tiers 1–2)
Fix guidance: Manual
Write cron jobs or scripts that query system.parts, system.replicas, and system.merges and send alerts to Slack. Effective but requires ongoing maintenance as ClickHouse versions change.
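A sketch of such a script, using only the Python standard library and the ClickHouse HTTP interface (the host, port 8123, webhook URL, and the specific thresholds below are illustrative assumptions, not fixed requirements):

```python
import json
import urllib.parse
import urllib.request

# Checks mirroring the thresholds in this guide.
# Each entry: (name, scalar SQL query, alert threshold).
CHECKS = [
    ("broken_parts", "SELECT count() FROM system.detached_parts WHERE reason LIKE 'broken%'", 0),
    ("max_replication_lag_sec", "SELECT max(absolute_delay) FROM system.replicas", 300),
    ("running_merges", "SELECT count() FROM system.merges", 50),
    ("pending_mutations", "SELECT count() FROM system.mutations WHERE is_done = 0", 0),
]

def run_query(sql, host="http://localhost:8123"):
    """Run a scalar query over the ClickHouse HTTP interface."""
    with urllib.request.urlopen(f"{host}/?query={urllib.parse.quote(sql)}") as resp:
        return float(resp.read().decode().strip() or 0)

def evaluate(values):
    """Return one alert line per check whose value exceeds its threshold."""
    return [
        f":warning: {name} = {values[name]} (threshold {limit})"
        for name, _, limit in CHECKS
        if values.get(name, 0) > limit
    ]

def post_to_slack(lines, webhook_url):
    """Send alert lines to a Slack incoming webhook."""
    body = json.dumps({"text": "\n".join(lines)}).encode()
    urllib.request.urlopen(urllib.request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}))

if __name__ == "__main__":
    values = {name: run_query(sql) for name, sql, _ in CHECKS}
    alerts = evaluate(values)
    if alerts:
        post_to_slack(alerts, "https://hooks.slack.com/services/YOUR/WEBHOOK/URL")
```

Run it from cron every minute or two. The maintenance burden mentioned above shows up here: every threshold and system-table query is yours to keep current.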
Approach 3: Purpose-Built Tool (Recommended)
Time to set up: Under 8 minutes
Metrics coverage: 100% (All 3 tiers, 300+ metrics)
Fix guidance: Every alert ships with a copy-pasteable SQL fix command
Clustersight was built exclusively for ClickHouse. Deploy with Docker Compose, connect your cluster, and get a health score (0–100) with pre-built panels covering all 10 critical metrics above — plus 290 more.
Health Score: A Single Number for ClickHouse Health
One of the most useful concepts in ClickHouse monitoring is a composite health score that weighs the most critical metrics into a single 0–100 value. Clustersight's health score works as follows:
- 90–100: Healthy — no action needed
- 70–89: Warning — investigate within 24 hours
- 50–69: Degraded — investigate within 1 hour
- Below 50: Critical — immediate action required
The score penalizes: broken parts (severe), stuck mutations (moderate), replication lag (moderate), merge queue depth (mild), and disk pressure (mild).
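Clustersight's exact weights aren't published; as an illustration of the idea, a simplified score along the same lines (the penalty weights here are invented for the example) could look like:

```python
# Simplified composite health score (0-100). Penalty weights are
# illustrative only, not Clustersight's actual formula.
PENALTIES = {
    "broken_parts": 25,        # severe: per broken part
    "stuck_mutations": 10,     # moderate: per stuck mutation
    "replication_lag_min": 5,  # moderate: per minute of lag
    "merge_queue_depth": 1,    # mild: per queued merge
    "disk_pct_over_80": 2,     # mild: per % point of disk usage over 80%
}

def health_score(metrics):
    """Start at 100, subtract weighted penalties, clamp to [0, 100]."""
    penalty = sum(PENALTIES[k] * metrics.get(k, 0) for k in PENALTIES)
    return max(0, min(100, 100 - int(penalty)))

def severity(score):
    """Map a score to the action bands described above."""
    if score >= 90:
        return "Healthy"
    if score >= 70:
        return "Warning"
    if score >= 50:
        return "Degraded"
    return "Critical"
```

A cluster with one broken part and one stuck mutation would score 65 under these example weights, landing in the "Degraded" band.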
Monitoring Checklist
Before going to production with a ClickHouse cluster, verify:
- [ ] Broken parts alert configured (threshold: > 0)
- [ ] Replication lag alert configured (threshold: > 300 seconds)
- [ ] Merge queue alert configured (threshold: > 50 per table)
- [ ] Mutation queue alert configured (threshold: > 0 stuck)
- [ ] Disk usage alert configured (threshold: > 80%)
- [ ] Query error rate alert configured (threshold: > 5%)
- [ ] ZooKeeper session health monitored
- [ ] Slow query detection in place (threshold: > 10 seconds)
Frequently Asked Questions
What is the best way to monitor ClickHouse?
The most effective approach combines ClickHouse system tables (system.metrics, system.asynchronous_metrics, system.query_log) with an observability tool purpose-built for ClickHouse. Standard tools like Grafana miss 70% of the operational metrics that predict failures.
How do I check ClickHouse cluster health?
Query system.replicas for replication status, system.merges for merge queue depth, system.parts for broken parts, and system.mutations for stuck mutations. A healthy cluster has 0 broken parts, replication lag under 30 seconds, and no stuck mutations.
What ClickHouse metrics should I monitor in production?
Critical metrics include: broken parts count, replication lag seconds, merge queue size, mutation queue length, ZooKeeper session status, disk usage percent, memory usage, and query error rate.
Does Grafana work for ClickHouse monitoring?
Grafana can visualize ClickHouse metrics but requires significant manual setup and misses 70% of operational metrics hidden in ClickHouse system tables. Purpose-built tools like Clustersight surface these automatically.