2026-02-23 · 11 min read
How to Monitor ClickHouse in Production: The Complete Guide (2026)
The short answer: Monitor ClickHouse using its built-in system tables — system.metrics, system.replicas, system.parts, and system.merges. These expose 400+ operational metrics that standard monitoring tools miss. A purpose-built tool like ClusterSight surfaces all of them with automated fix commands.
Why ClickHouse Monitoring Is Different
ClickHouse is not a traditional RDBMS, and monitoring it like one is a common mistake that leads to missed failures. Unlike PostgreSQL or MySQL, ClickHouse exposes its health through a rich set of internal system tables that most general-purpose monitoring tools never query.
The result: teams using Grafana or DataDog for ClickHouse monitoring are typically blind to 70% of the metrics that actually predict and explain failures — broken parts, merge queue depth, ZooKeeper session health, mutation queue backlogs, and more.
The Three Tiers of ClickHouse Metrics
Tier 1: Standard Infrastructure Metrics
These are the metrics that Prometheus, Grafana, and any standard monitoring stack will capture:
- CPU usage — High CPU often indicates expensive queries or large merge operations
- Memory usage — ClickHouse is memory-hungry; tracking system.asynchronous_metrics for MemoryUsage is essential
- Disk usage — Track both raw disk and ClickHouse-specific compressed/uncompressed data ratios
- Network I/O — Important for replicated clusters
Query to check basic system metrics:
SELECT metric, value, description
FROM system.metrics
WHERE metric IN ('MemoryTracking', 'Query', 'Merge', 'ReplicatedChecks')
ORDER BY metric;
Tier 2: Hidden Operational Metrics
This is where standard tools fail you. ClickHouse's system.asynchronous_metrics exposes hundreds of values that are critical for production health:
- ReplicaQueueSize — How many operations are waiting to replicate
- MaxPartCountForPartition — High values indicate merge queue issues
- NumberOfDatabases / NumberOfTables — Schema bloat detection
Query to surface hidden metrics:
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric LIKE '%Replica%'
OR metric LIKE '%Queue%'
OR metric LIKE '%Part%'
ORDER BY metric;
Tier 3: Production Audit Metrics
These require deeper querying of operational tables and are almost never captured by standard tools:
- Broken parts — Parts that failed ClickHouse's integrity checks; the server detaches them, so they appear in system.detached_parts with a broken reason rather than in system.parts
- Stuck mutations — Mutations in system.mutations that haven't progressed
- Detached parts — Files in the detached/ folder that indicate past corruption
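Broken parts and stuck mutations get dedicated queries later in this guide (#1 and #4 below). Detached parts can be checked directly from system.detached_parts; the grouping below is a minimal sketch, one reasonable way to slice it rather than the only one:
SELECT database, table, reason, count() AS detached_parts
FROM system.detached_parts
GROUP BY database, table, reason
ORDER BY detached_parts DESC;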
The 10 Most Critical ClickHouse Metrics
1. Broken Parts Count
Broken parts are the most urgent signal in ClickHouse monitoring. A broken part means data corruption: ClickHouse detaches the part and stops serving its rows, so results from that table are silently incomplete until the part is repaired or restored. Because broken parts are detached, look for them in system.detached_parts rather than system.parts:
SELECT database, table, count() AS broken_parts
FROM system.detached_parts
WHERE reason LIKE 'broken%'
GROUP BY database, table
ORDER BY broken_parts DESC;
Healthy value: 0. Any broken parts require immediate attention. See our ClickHouse broken parts fix guide.
2. Replication Lag
For replicated tables, replication lag measures how far behind a replica is from the source.
SELECT database, table, replica_name,
absolute_delay AS lag_seconds
FROM system.replicas
WHERE absolute_delay > 0
ORDER BY absolute_delay DESC;
Healthy value: Under 30 seconds. Over 5 minutes indicates a serious problem. See our replication lag guide.
3. Merge Queue Depth
ClickHouse merges data parts in the background. A deep merge queue means inserts are creating parts faster than ClickHouse can merge them — a common cause of "too many parts" errors.
SELECT database, table, count() AS merges_in_queue
FROM system.merges
GROUP BY database, table
ORDER BY merges_in_queue DESC;
Healthy value: Under 10 per table. See our merge queue guide.
4. Mutation Queue Length
Mutations (ALTER TABLE UPDATE/DELETE) in ClickHouse are expensive background operations. Stuck mutations block future merges.
SELECT database, table, mutation_id, command,
parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0
ORDER BY create_time ASC;
Healthy value: 0 pending mutations. Any stuck mutation needs investigation.
5. ZooKeeper/Keeper Session Health
Replicated tables depend on ZooKeeper (or ClickHouse Keeper). A lost or degraded session cascades into replication failures.
SELECT *
FROM system.zookeeper
WHERE path = '/clickhouse';
See our ZooKeeper dependency guide.
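If the session itself is suspect, system.replicas exposes per-table flags that usually show the fallout first. A minimal sketch, assuming a lost Keeper session surfaces as read-only or session-expired replicas:
SELECT database, table, replica_name, is_readonly, is_session_expired
FROM system.replicas
WHERE is_readonly OR is_session_expired;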
6. Query Error Rate
SELECT
toStartOfMinute(event_time) AS minute,
countIf(type = 'ExceptionWhileProcessing') AS errors,
count() AS total,
round(errors / total * 100, 2) AS error_pct
FROM system.query_log
WHERE event_time >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute DESC
LIMIT 30;
Healthy value: Under 1% error rate.
7. Disk Usage by Table
SELECT database, table,
formatReadableSize(sum(bytes_on_disk)) AS disk_usage,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
round(sum(bytes_on_disk) / sum(data_uncompressed_bytes) * 100, 1) AS compression_ratio_pct
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;
8. Slow Query Detection
SELECT query_id, user, query,
round(query_duration_ms / 1000, 2) AS duration_sec,
read_rows, read_bytes
FROM system.query_log
WHERE type = 'QueryFinish'
AND query_duration_ms > 5000
AND event_time >= now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
See our slow queries guide.
9. Part Count Per Table
Too many parts in a table is a leading indicator of merge queue issues and will eventually cause insert errors.
SELECT database, table, count() AS part_count
FROM system.parts
WHERE active = 1
GROUP BY database, table
ORDER BY part_count DESC
LIMIT 20;
Healthy value: Under 300 parts per table. Over 1000 is critical.
10. Memory Usage by Query
SELECT query_id, user,
formatReadableSize(memory_usage) AS memory,
query
FROM system.processes
ORDER BY memory_usage DESC
LIMIT 10;
Setting Up ClickHouse Monitoring: Three Approaches
Approach 1: Manual Grafana + Prometheus Setup
Time to set up: 4–8 hours
Metrics coverage: ~30% (Tier 1 only)
Fix guidance: None
You'll need to configure the ClickHouse Prometheus exporter, set up a Prometheus scrape job, build Grafana dashboards, and write alerting rules manually. This approach covers basic infrastructure metrics but misses all Tier 2 and Tier 3 operational metrics.
Approach 2: Custom Scripts + Alerting
Time to set up: 2–5 days
Metrics coverage: ~60% (Tiers 1–2)
Fix guidance: Manual
Write cron jobs or scripts that query system.parts, system.replicas, and system.merges and send alerts to Slack. Effective but requires ongoing maintenance as ClickHouse versions change.
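As a sketch of what such a script might run, a few of the checklist thresholds from the end of this guide can be folded into a single query that returns one row per breach. The check names and thresholds below are illustrative, not canonical:
SELECT 'replication_lag' AS check_name, database, table, toString(absolute_delay) AS value
FROM system.replicas
WHERE absolute_delay > 300
UNION ALL
SELECT 'stuck_mutation' AS check_name, database, table, mutation_id AS value
FROM system.mutations
WHERE is_done = 0
UNION ALL
SELECT 'too_many_parts' AS check_name, database, table, toString(count()) AS value
FROM system.parts
WHERE active = 1
GROUP BY database, table
HAVING count() > 300;
An empty result means all three checks pass; any rows can be forwarded to Slack or PagerDuty by the wrapping script.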
Approach 3: Purpose-Built Tool (Recommended)
Time to set up: Under 8 minutes
Metrics coverage: 100% (All 3 tiers, 400+ metrics)
Fix guidance: Every alert ships with a copy-pasteable SQL fix command
ClusterSight was built exclusively for ClickHouse. Deploy with Docker Compose, connect your cluster, and get a health score (0–100) with pre-built panels covering all 10 critical metrics above — plus 290 more.
The PULSE Framework: A Structured Approach to ClickHouse Health
The 10 metrics above are essential, but in production you need a mental model for organizing them — especially when triaging an incident at 3 AM. The PULSE Framework maps every ClickHouse operational metric to five dimensions:
P — Parts
Are merges keeping up? This dimension covers part counts per table, tiny parts accumulation, detached parts, and compression ratios. When the P score drops, you're heading toward "too many parts" errors and degraded query performance.
Key metrics: Part count per table (#9 above), broken parts (#1), merge queue depth (#3), parts per partition, detached part count.
Warning signs: Part count above 300 per table, any broken or detached parts, merge queue growing faster than it drains.
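One useful companion to metric #9: the "too many parts" limit (parts_to_throw_insert) is evaluated per partition rather than per table, and its default varies by ClickHouse version, so a per-partition breakdown is worth watching. A minimal sketch:
SELECT database, table, partition_id, count() AS parts_in_partition
FROM system.parts
WHERE active = 1
GROUP BY database, table, partition_id
ORDER BY parts_in_partition DESC
LIMIT 20;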
U — Uptime
Is replication healthy? This dimension goes beyond absolute_delay to check actual data consistency across replicas — log pointer divergence, queue depth trends, part count agreement, and ZooKeeper session health.
Key metrics: Replication lag (#2), ZooKeeper health (#5), replica queue size, log pointer gaps, cross-replica data consistency.
Warning signs: absolute_delay above 30 seconds, growing queue size, replicas with different part counts for the same partition.
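The raw fields behind these warning signs can be sampled straight from system.replicas; a healthy replica keeps log_pointer close to log_max_index and queue_size near zero. A minimal sketch:
SELECT database, table, replica_name,
       queue_size, log_max_index, log_pointer, absolute_delay
FROM system.replicas
ORDER BY queue_size DESC
LIMIT 10;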
L — Latency
Are queries performant? This dimension tracks slow queries, memory pressure, thread pool saturation, and error rates. A latency problem often signals an upstream issue in Parts or Uptime.
Key metrics: Slow query detection (#8), query error rate (#6), memory usage by query (#10), thread pool utilization, query queue depth.
Warning signs: Error rate above 1%, queries exceeding 10 seconds, memory usage spikes on specific queries.
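Beyond flagging individual slow queries (#8), a per-minute p95 makes latency drift visible before users complain. A minimal sketch over system.query_log:
SELECT
    toStartOfMinute(event_time) AS minute,
    count() AS queries,
    round(quantile(0.95)(query_duration_ms), 0) AS p95_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute DESC
LIMIT 30;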
S — Stability
Are mutations and background processes healthy? Stuck mutations silently block merges and leave stale data in place. Distributed DDL queue backlogs indicate cluster coordination issues. Error trending reveals chronic problems before they become acute.
Key metrics: Mutation queue (#4), distributed DDL queue, system.errors trending, background pool utilization, failed background operations.
Warning signs: Any mutation pending longer than 10 minutes, growing error counts in system.errors, background pool at capacity.
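system.errors keeps a counter per error type since server start, so sorting by count is a quick way to spot chronic problems. A minimal sketch (the last_error_* columns exist on recent ClickHouse versions):
SELECT name, code, value, last_error_time, last_error_message
FROM system.errors
WHERE value > 0
ORDER BY value DESC
LIMIT 20;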
E — Efficiency
Are resources used well? This dimension covers disk utilization, compression analysis, TTL compliance, and cost per query. Efficiency problems don't cause incidents — they cause bills.
Key metrics: Disk usage by table (#7), compression ratios, TTL compliance, unused tables and columns, read amplification per query.
Warning signs: Compression ratio above 60%, tables with expired TTL data still on disk, unused tables consuming significant storage.
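Compression analysis is most actionable at the column level, since a single poorly encoded column often dominates disk usage. A minimal sketch using system.parts_columns; the 60% threshold above applies the same way here:
SELECT database, table, column,
       formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
       formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
       round(sum(column_data_compressed_bytes) / sum(column_data_uncompressed_bytes) * 100, 1) AS ratio_pct
FROM system.parts_columns
WHERE active = 1
GROUP BY database, table, column
HAVING sum(column_data_uncompressed_bytes) > 0
ORDER BY sum(column_data_compressed_bytes) DESC
LIMIT 20;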
Every alert, every dashboard panel, and every health check in ClusterSight maps back to one or more PULSE letters. When your cluster's health score drops, PULSE tells you which dimension degraded — so you know whether to investigate parts, replication, queries, mutations, or resources.
Health Score: A Single Number for ClickHouse Health
The PULSE Framework feeds into a composite health score that weighs the most critical metrics across all five dimensions into a single 0–100 value. ClusterSight's health score works as follows:
- 90–100: Healthy — no action needed
- 70–89: Warning — investigate within 24 hours
- 50–69: Degraded — investigate within 1 hour
- Below 50: Critical — immediate action required
The score penalizes: broken parts (severe), stuck mutations (moderate), replication lag (moderate), merge queue depth (mild), and disk pressure (mild).
Monitoring Checklist
Before going to production with a ClickHouse cluster, verify:
- Broken parts alert configured (threshold: > 0)
- Replication lag alert configured (threshold: > 300 seconds)
- Merge queue alert configured (threshold: > 50 per table)
- Mutation queue alert configured (threshold: > 0 stuck)
- Disk usage alert configured (threshold: > 80%)
- Query error rate alert configured (threshold: > 5%)
- ZooKeeper session health monitored
- Slow query detection in place (threshold: > 10 seconds)
Next Steps
- Your Replicas Are Lying: How Silent Replication Drift Corrupts Analytics
- 312 Parts Per Partition: The Merge Backlog Time Bomb
- Grafana + Prometheus Is Not ClickHouse Monitoring
- PULSE Check: 5 Queries Before Every ClickHouse Upgrade
- PULSE Check: Is Your Compression Actually Working?
- The Complete Guide to system.replicas
- system.merges Decoded: Understanding ClickHouse's Background Engine
- The Mutation That Ran for 72 Hours
- PULSE Check: Detecting Read-Only Replicas
- system.query_log: The 15 Fields That Actually Matter
- Designing Health Scores: How ClusterSight Weights 400+ Metrics
- Why We're Building a ClickHouse Health Tool
- Why system.errors Has 40 Error Types You've Never Checked
- PULSE Check: The One Query That Shows If Merges Are Falling Behind
- system.parts_columns: Tables Destroying Your Disk Budget
- Stop Running OPTIMIZE TABLE in Production
- First 10 Users: What Operators Actually Want
- Month 3: Building ClusterSight
- ClickHouse Broken Parts: Causes, Detection & Fix
- ClickHouse Replication Lag: How to Diagnose & Fix
- ClickHouse Merge Queue: Why It Happens & How to Fix
- ClickHouse System Tables: The Complete Guide
- Try ClusterSight — deploy in under 8 minutes
Frequently Asked Questions
What is the best way to monitor ClickHouse?
The most effective approach combines ClickHouse system tables (system.metrics, system.asynchronous_metrics, system.query_log) with an observability tool purpose-built for ClickHouse. Standard tools like Grafana miss 70% of the operational metrics that predict failures.
How do I check ClickHouse cluster health?
Query system.replicas for replication status, system.merges for merge queue depth, system.parts for broken parts, and system.mutations for stuck mutations. A healthy cluster has 0 broken parts, replication lag under 30 seconds, and no stuck mutations.
What ClickHouse metrics should I monitor in production?
Critical metrics include: broken parts count, replication lag seconds, merge queue size, mutation queue length, ZooKeeper session status, disk usage percent, memory usage, and query error rate.
What is the PULSE Framework for ClickHouse?
PULSE is a structured approach to ClickHouse operational health. It organizes all monitoring metrics into five dimensions: Parts (merge health), Uptime (replication health), Latency (query performance), Stability (mutations and background processes), and Efficiency (resource utilization). ClusterSight uses PULSE to generate a 0–100 health score for your cluster.
Does Grafana work for ClickHouse monitoring?
Grafana can visualize ClickHouse metrics but requires significant manual setup and misses 70% of operational metrics hidden in ClickHouse system tables. Purpose-built tools like ClusterSight surface these automatically.