2026-04-13 · 7 min read
Grafana + Prometheus Is Not ClickHouse Monitoring
Grafana + Prometheus is infrastructure monitoring that happens to include ClickHouse. It is not ClickHouse monitoring. This distinction sounds pedantic until you're staring at an all-green Grafana dashboard while your analytics reports show different numbers on every refresh.
This isn't a hot take. It's an observation from running ClickHouse in production and watching the same pattern repeat: team sets up Grafana, team feels monitored, cluster silently degrades, incident happens, team is surprised.
What Grafana Actually Monitors
The standard ClickHouse + Prometheus setup (ClickHouse exporter → Prometheus → Grafana) captures approximately 40 metrics:
- CPU usage per node
- Memory usage (RSS, ClickHouse tracked)
- Disk usage and I/O
- Query count and rate
- Connection count
- Network bytes in/out
- ReplicasMaxAbsoluteDelay (a single replication metric)
- Insert rows/second
- A handful of asynchronous metrics
These are infrastructure metrics. They answer: "Is the machine running? Is the process alive? Is it doing work?"
They do not answer: "Is the data correct? Is replication consistent? Are background processes healthy?"
The 70% You're Not Monitoring
ClickHouse exposes 400+ metrics across its system tables. Here's what Grafana never queries:
Part Health (PULSE → P)
| Metric | Where It Lives | What It Catches | Grafana Status |
|---|---|---|---|
| Broken parts | system.parts | Data corruption | Not monitored |
| Detached parts | system.detached_parts | Past corruption, failed recoveries | Not monitored |
| Part count per partition | system.parts | Merge backlog time bombs | Not monitored |
| Tiny parts accumulation | system.parts | Insert pattern problems | Not monitored |
| Part creation vs merge rate | system.part_log | Merge queue falling behind | Not monitored |
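None of this needs an exporter; it's a query away. A sketch of two part-health checks (the threshold of 100 is illustrative — the default `parts_to_throw_insert` limit is 300; verify column names against your ClickHouse version):

```sql
-- Detached parts by reason: names are prefixed broken_, unexpected_, ignored_, etc.
SELECT database, table, reason, count() AS parts
FROM system.detached_parts
GROUP BY database, table, reason;

-- Partitions accumulating active parts faster than merges can keep up.
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
HAVING active_parts > 100
ORDER BY active_parts DESC;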
Replication Health (PULSE → U)
| Metric | Where It Lives | What It Catches | Grafana Status |
|---|---|---|---|
| Cross-replica data agreement | system.parts (per replica) | Silent replication drift | Not monitored |
| Log pointer divergence | system.replicas | Stuck replication | Not monitored |
| Queue depth trending | system.replicas | Replication falling behind | Not monitored |
| Read-only replicas | system.replicas | ZooKeeper session loss | Not monitored |
| Replication queue details | system.replication_queue | Stuck fetches and merges | Not monitored |
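A sketch of what querying `system.replicas` directly can surface — read-only replicas, deep queues, and log-pointer lag in one pass (the thresholds 20 and 10 are illustrative, not recommendations):

```sql
-- Replicas in trouble: read-only, stuck queues, or lagging log pointers.
SELECT database, table, replica_name,
       is_readonly, absolute_delay,
       queue_size, inserts_in_queue, merges_in_queue,
       log_max_index - log_pointer AS log_lag
FROM system.replicas
WHERE is_readonly
   OR queue_size > 20
   OR log_max_index - log_pointer > 10;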
Grafana monitors ReplicasMaxAbsoluteDelay — a single number that lies about replication health. That's it.
Mutation Health (PULSE → S)
| Metric | Where It Lives | What It Catches | Grafana Status |
|---|---|---|---|
| Stuck mutations | system.mutations | Stale data being served | Not monitored |
| Mutation age | system.mutations | Long-running background ops | Not monitored |
| Mutation failure reasons | system.mutations | Silent mutation failures | Not monitored |
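A minimal sketch of a stuck-mutation check — unfinished mutations with their age and failure reason, which is empty on healthy ones:

```sql
SELECT database, table, mutation_id, command,
       create_time,
       now() - create_time AS age_seconds,
       parts_to_do,
       latest_fail_reason
FROM system.mutations
WHERE NOT is_done
ORDER BY create_time;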
Merge Health (PULSE → P)
| Metric | Where It Lives | What It Catches | Grafana Status |
|---|---|---|---|
| Active merge count | system.merges | Merge queue pressure | Not monitored |
| Merge progress | system.merges | Stalled merges | Not monitored |
| Merge throughput trending | system.merges | I/O contention | Not monitored |
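A sketch of a merge-health check: long-running merges with little progress point at I/O contention or a stall (sorting by `elapsed` puts the suspects first):

```sql
SELECT database, table, elapsed, round(progress, 2) AS progress,
       num_parts,
       formatReadableSize(total_size_bytes_compressed) AS size
FROM system.merges
ORDER BY elapsed DESC;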
See system.merges Decoded for the full breakdown.
Error Trending (PULSE → S)
| Metric | Where It Lives | What It Catches | Grafana Status |
|---|---|---|---|
| Error type distribution | system.errors | Chronic error patterns | Not monitored |
| Error count trending | system.errors | Degradation over time | Not monitored |
ClickHouse has 40+ error types in system.errors. Grafana doesn't query any of them.
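Surfacing them is one query. Note that `system.errors` counters cover the server's lifetime, so the trend comes from sampling this over time, not from a single read:

```sql
SELECT name, code, value, last_error_time, last_error_message
FROM system.errors
ORDER BY value DESC
LIMIT 10;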
"But I Added Custom Panels"
Yes, you can write custom ClickHouse queries in Grafana using the ClickHouse data source plugin. Some teams do. Here's what happens:
- Setup: You write 5-10 custom SQL queries as Grafana panels. This takes a day.
- Month 1: The queries work. You feel monitored.
- Month 3: A ClickHouse upgrade renames a column in system.parts. Two panels silently break. Grafana shows "No data," which looks the same as "zero problems."
- Month 6: You realize you're monitoring 15 metrics out of 400+. The merge backlog that just caused an outage wasn't one of them.
- Month 9: Nobody maintains the custom queries. Half the panels are broken. The team trusts Grafana anyway because it's Grafana.
The problem isn't Grafana — it's a purpose mismatch. Grafana is designed to visualize metrics you already have. ClickHouse monitoring requires actively querying system tables, understanding the relationships between metrics, and alerting on trends — not just thresholds.
What Purpose-Built Monitoring Looks Like
A monitoring tool designed for ClickHouse does things Grafana structurally cannot:
1. Cross-replica comparison. Grafana queries one ClickHouse instance at a time. Real replication monitoring requires comparing data across replicas — part counts, log positions, data freshness. This is a fundamentally different query pattern.
2. Trend-based alerting. A merge queue of 20 isn't a problem. A merge queue that grew from 5 to 20 in the last 10 minutes is. Grafana can do threshold alerts. It can't easily do "alert when the growth rate of this metric exceeds X."
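One way to express that growth-rate check in ClickHouse itself, assuming `metric_log` is enabled (it is by default on recent versions; the `CurrentMetric_Merge` column name is version-dependent, so treat this as a sketch):

```sql
-- How much did the active-merge count grow over the last 10 minutes?
SELECT
    argMax(CurrentMetric_Merge, event_time)
  - argMin(CurrentMetric_Merge, event_time) AS growth_last_10m
FROM system.metric_log
WHERE event_time > now() - INTERVAL 10 MINUTE;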
3. Fix commands. When a part is broken, you need ALTER TABLE ... DROP DETACHED PART. When replication lags, you need SYSTEM SYNC REPLICA database.specific_table. A purpose-built tool knows which command fixes which problem and generates it for the specific table affected.
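For concreteness, here is roughly what those generated commands look like. The database, table, and part name are hypothetical placeholders — substitute your own, and note that dropping a detached part requires the `allow_drop_detached` setting:

```sql
SET allow_drop_detached = 1;
ALTER TABLE db.events DROP DETACHED PART 'broken_20260413_1_1_0';

SYSTEM SYNC REPLICA db.events;
```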
4. PULSE health scoring. A single 0-100 number that weighs 400+ metrics across five operational dimensions. Not 40 Grafana panels that you have to mentally synthesize at 3 AM.
This Isn't About Grafana Being Bad
Grafana is excellent at infrastructure monitoring. Use it for CPU, memory, disk, and network. Use it for application metrics. Use it for everything it was designed for.
But when someone asks "Is our ClickHouse cluster healthy?" and you point at a green Grafana dashboard — you're not answering the question. You're answering a different, easier question: "Is our ClickHouse cluster running?"
Running and healthy are not the same thing.
ClusterSight monitors the 70% that Grafana misses. Deploy in under 8 minutes. See what your cluster is actually doing.
Check your cluster's PULSE.
This post is part of the Opinions series — honest takes on ClickHouse operations.
Frequently Asked Questions
Can I monitor ClickHouse with Grafana?
You can monitor ClickHouse infrastructure metrics (CPU, memory, disk, query rate) with Grafana + Prometheus. But Grafana misses 70% of operational metrics that predict failures — broken parts, stuck mutations, replication drift, merge backlogs, and ZooKeeper session health. These require querying ClickHouse system tables directly.
What does Grafana miss for ClickHouse monitoring?
Grafana with the standard ClickHouse exporter misses: broken/detached parts, stuck mutations, replication consistency across replicas, merge queue depth and trends, part count per partition, ZooKeeper session health, error type trending in system.errors, and cross-replica data agreement.
What is the best monitoring tool for ClickHouse?
Purpose-built tools like ClusterSight monitor all 400+ ClickHouse system table metrics including broken parts, replication drift, merge backlogs, and stuck mutations. General-purpose tools (Grafana, Datadog, New Relic) only cover infrastructure metrics and miss the operational signals that predict ClickHouse-specific failures.