Can I monitor ClickHouse with Grafana?

You can monitor ClickHouse infrastructure metrics (CPU, memory, disk, query rate) with Grafana + Prometheus. But the standard exporter setup misses 70% of the operational metrics that predict failures: broken parts, stuck mutations, replication drift, merge backlogs, and ZooKeeper session health. Monitoring these requires querying ClickHouse system tables directly.

What does Grafana miss for ClickHouse monitoring?

Grafana with the standard ClickHouse exporter misses: broken/detached parts, stuck mutations, replication consistency across replicas, merge queue depth and trends, part count per partition, ZooKeeper session health, error type trending in system.errors, and cross-replica data agreement.

What is the best monitoring tool for ClickHouse?

Purpose-built tools like ClusterSight monitor all 400+ ClickHouse system table metrics including broken parts, replication drift, merge backlogs, and stuck mutations. General-purpose tools (Grafana, Datadog, New Relic) only cover infrastructure metrics and miss the operational signals that predict ClickHouse-specific failures.

Grafana + Prometheus Is Not ClickHouse Monitoring

Grafana + Prometheus is infrastructure monitoring that happens to include ClickHouse. It is not ClickHouse monitoring. This distinction sounds pedantic until you're staring at an all-green Grafana dashboard while your analytics reports show different numbers on every refresh.

This isn't a hot take. It's an observation from running ClickHouse in production and watching the same pattern repeat: team sets up Grafana, team feels monitored, cluster silently degrades, incident happens, team is surprised.

What Grafana Actually Monitors

The standard ClickHouse + Prometheus setup (ClickHouse exporter → Prometheus → Grafana) captures approximately 40 metrics:

CPU usage per node
Memory usage (RSS, ClickHouse tracked)
Disk usage and I/O
Query count and rate
Connection count
Network bytes in/out
ReplicasMaxAbsoluteDelay (a single replication metric)
Insert rows/second
A handful of asynchronous metrics

These are infrastructure metrics. They answer: "Is the machine running? Is the process alive? Is it doing work?"

They do not answer: "Is the data correct? Is replication consistent? Are background processes healthy?"

The 70% You're Not Monitoring

ClickHouse exposes 400+ metrics across its system tables. Here's what the standard Grafana setup never queries:

Part Health (PULSE → P)

Metric	Where It Lives	What It Catches	Grafana Status
Broken parts	`system.parts`	Data corruption	Not monitored
Detached parts	`system.detached_parts`	Past corruption, failed recoveries	Not monitored
Part count per partition	`system.parts`	Merge backlog time bombs	Not monitored
Tiny parts accumulation	`system.parts`	Insert pattern problems	Not monitored
Part creation vs merge rate	`system.part_log`	Merge queue falling behind	Not monitored
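
As a rough illustration of the "part count per partition" row, the sketch below pairs a query against system.parts with a small filter. The function name and the threshold of 300 are assumptions made for this example, not ClickHouse defaults; the query text is only held as a string and is not executed here.

```python
# Query an external client would run; columns are standard system.parts columns.
PARTS_PER_PARTITION_SQL = """
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
"""


def partitions_over_limit(rows, max_parts=300):
    """Return (database, table, partition, active_parts) for crowded partitions.

    `rows` is the result of PARTS_PER_PARTITION_SQL as a list of dicts.
    `max_parts` is an illustrative threshold chosen for this sketch.
    """
    return [
        (r["database"], r["table"], r["partition"], r["active_parts"])
        for r in rows
        if r["active_parts"] > max_parts
    ]
```

A sensible threshold depends on the table's insert pattern and merge settings, so it is worth tuning per table rather than globally.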

Replication Health (PULSE → U)

Metric	Where It Lives	What It Catches	Grafana Status
Cross-replica data agreement	`system.parts` (per replica)	Silent replication drift	Not monitored
Log pointer divergence	`system.replicas`	Stuck replication	Not monitored
Queue depth trending	`system.replicas`	Replication falling behind	Not monitored
Read-only replicas	`system.replicas`	ZooKeeper session loss	Not monitored
Replication queue details	`system.replication_queue`	Stuck fetches and merges	Not monitored

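Grafana monitors ReplicasMaxAbsoluteDelay, a single number that reports how far a node's replication queue is behind. It says nothing about whether replicas actually hold the same data, so it can read zero while replicas disagree. That's it.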
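
One way to approximate "cross-replica data agreement" is to collect the active part names for a table from each replica and diff them. The sketch below assumes that collection has already happened (for example by running SELECT name FROM system.parts WHERE active on every replica); the function name and input shape are hypothetical.

```python
def part_disagreements(parts_by_replica):
    """Map each replica to the active part names it lacks relative to the union.

    `parts_by_replica` maps a replica name to the set of active part names
    for one table. Replicas that are missing nothing are omitted.
    """
    all_parts = set().union(*parts_by_replica.values())
    missing = {}
    for replica, parts in parts_by_replica.items():
        diff = all_parts - set(parts)
        if diff:
            missing[replica] = sorted(diff)
    return missing
```

A brief difference is normal while fetches and merges are in flight, so a check like this is only meaningful when the same disagreement persists across several snapshots.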

Mutation Health (PULSE → S)

Metric	Where It Lives	What It Catches	Grafana Status
Stuck mutations	`system.mutations`	Stale data being served	Not monitored
Mutation age	`system.mutations`	Long-running background ops	Not monitored
Mutation failure reasons	`system.mutations`	Silent mutation failures	Not monitored
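
A minimal sketch of a stuck-mutation check, assuming rows fetched from system.mutations with the query shown. The one-hour age limit and the function name are illustrative choices, not values from ClickHouse or from any particular tool.

```python
from datetime import datetime, timedelta

# Unfinished mutations only; columns are standard system.mutations columns.
STUCK_MUTATIONS_SQL = """
SELECT database, table, mutation_id, create_time, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE NOT is_done
"""


def stuck_mutations(rows, now, max_age=timedelta(hours=1)):
    """Return mutation_ids that are older than max_age or report a failure.

    `rows` is the result of STUCK_MUTATIONS_SQL as a list of dicts, with
    create_time already parsed into datetime objects.
    """
    stuck = []
    for r in rows:
        too_old = now - r["create_time"] > max_age
        failing = bool(r["latest_fail_reason"])
        if too_old or failing:
            stuck.append(r["mutation_id"])
    return stuck
```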

Merge Health (PULSE → P)

Metric	Where It Lives	What It Catches	Grafana Status
Active merge count	`system.merges`	Merge queue pressure	Not monitored
Merge progress	`system.merges`	Stalled merges	Not monitored
Merge throughput trending	`system.merges`	I/O contention	Not monitored

See "system.merges Decoded" for the full breakdown.
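
Stalled merges only show up when two readings of system.merges are compared. The sketch below assumes each snapshot has been reduced to a dict keyed by (database, table, result_part_name) with the progress column as the value; the function name and the minimum delta are assumptions for illustration.

```python
def stalled_merges(previous, current, min_progress_delta=0.01):
    """Return keys of merges whose progress barely moved between two snapshots.

    `previous` and `current` map (database, table, result_part_name) to the
    `progress` value (0.0 to 1.0) read from system.merges. Merges that only
    appear in `current` are treated as new and are not reported.
    """
    return sorted(
        key
        for key, progress in current.items()
        if key in previous and progress - previous[key] < min_progress_delta
    )
```

The interval between snapshots matters: a large merge can legitimately advance less than one percent in a few seconds, so the delta should be sized to the polling period.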

Error Trending (PULSE → S)

Metric	Where It Lives	What It Catches	Grafana Status
Error type distribution	`system.errors`	Chronic error patterns	Not monitored
Error count trending	`system.errors`	Degradation over time	Not monitored

ClickHouse has 40+ error types in system.errors. Grafana doesn't query any of them.
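
The counters in system.errors are cumulative, so trending means diffing snapshots. A small sketch, assuming each snapshot is a dict built from SELECT name, value FROM system.errors; the function name is hypothetical.

```python
def error_deltas(previous, current):
    """Return {error_name: increase} for errors whose count rose between snapshots.

    `previous` and `current` map the `name` column of system.errors to its
    `value` column. Counters that went down (for example after a server
    restart) are ignored rather than reported as negative.
    """
    deltas = {}
    for name, value in current.items():
        increase = value - previous.get(name, 0)
        if increase > 0:
            deltas[name] = increase
    return deltas
```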

"But I Added Custom Panels"

Yes, you can write custom ClickHouse queries in Grafana using the ClickHouse data source plugin. Some teams do. Here's what happens:

Setup: You write 5-10 custom SQL queries as Grafana panels. This takes a day.
Month 1: The queries work. You feel monitored.
Month 3: A ClickHouse upgrade renames a column in system.parts. Two panels silently break. Grafana shows "No data," which looks the same as "zero problems."
Month 6: You realize you're monitoring 15 metrics out of 400+. The merge backlog that just caused an outage wasn't one of them.
Month 9: Nobody maintains the custom queries. Half the panels are broken. The team trusts Grafana anyway because it's Grafana.
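
The Month 3 failure comes from treating an empty result as a healthy one. One way to keep the two apart, sketched here under the assumption that every check query returns exactly one row with a numeric count column, is to classify the result explicitly. The function name and the three state labels are made up for this example.

```python
def classify_check(rows, column):
    """Classify a check query result as 'ok', 'alert', or 'broken'.

    `rows` is a list of dicts, or None if the query failed. A missing row or
    a missing column means the check itself no longer works, which is
    reported separately from a genuine zero.
    """
    if not rows or column not in rows[0]:
        return "broken"
    return "ok" if rows[0][column] == 0 else "alert"
```

With this shape, a renamed column surfaces as "broken" and can page someone, instead of rendering as an empty panel.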

The problem isn't Grafana — it's a purpose mismatch. Grafana is designed to visualize metrics you already have. ClickHouse monitoring requires actively querying system tables, understanding the relationships between metrics, and alerting on trends — not just thresholds.

What Purpose-Built Monitoring Looks Like

A monitoring tool designed for ClickHouse does things Grafana structurally cannot:

1. Cross-replica comparison. Grafana queries one ClickHouse instance at a time. Real replication monitoring requires comparing data across replicas — part counts, log positions, data freshness. This is a fundamentally different query pattern.

2. Trend-based alerting. A merge queue of 20 isn't a problem. A merge queue that grew from 5 to 20 in the last 10 minutes is. Grafana can do threshold alerts. It can't easily do "alert when the growth rate of this metric exceeds X."

3. Fix commands. When a part is broken, you need ALTER TABLE ... DROP DETACHED PART. When replication lags, you need SYSTEM SYNC REPLICA database.specific_table. A purpose-built tool knows which command fixes which problem and generates it for the specific table affected.

4. PULSE health scoring. A single 0-100 number that weighs 400+ metrics across five operational dimensions. Not 40 Grafana panels that you have to mentally synthesize at 3 AM.
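
Points 2 and 3 can be sketched in a few lines: compute a growth rate from timestamped queue samples, then emit the fix command named above for the affected table. Both function names are hypothetical, the samples are assumed to be (unix_seconds, value) pairs ordered by time, and this is not a description of how ClusterSight implements either feature.

```python
def growth_per_minute(samples):
    """Return the average change per minute between the first and last sample.

    `samples` is a time-ordered list of (unix_seconds, value) pairs and must
    span a positive time interval.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (v1 - v0) * 60.0 / (t1 - t0)


def sync_replica_command(database, table):
    """Build the SYSTEM SYNC REPLICA statement for one table.

    Identifiers are inserted as given; quoting unusual names is left out of
    this sketch.
    """
    return f"SYSTEM SYNC REPLICA {database}.{table}"
```

For the example in point 2, a queue that grew from 5 to 20 over 10 minutes gives growth_per_minute([(0, 5), (600, 20)]) == 1.5, which can be compared against a rate limit instead of a fixed depth.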

This Isn't About Grafana Being Bad

Grafana is excellent infrastructure monitoring. Use it for CPU, memory, disk, network. Use it for application metrics. Use it for everything it was designed for.

But when someone asks "Is our ClickHouse cluster healthy?" and you point at a green Grafana dashboard — you're not answering the question. You're answering a different, easier question: "Is our ClickHouse cluster running?"

Running and healthy are not the same thing.

ClusterSight monitors the 70% that Grafana misses. Deploy in under 8 minutes. See what your cluster is actually doing.

Check your cluster's PULSE.

This post is part of the Opinions series — honest takes on ClickHouse operations.
