2026-04-13 · 7 min read

Grafana + Prometheus Is Not ClickHouse Monitoring

Grafana + Prometheus is infrastructure monitoring that happens to include ClickHouse. It is not ClickHouse monitoring. This distinction sounds pedantic until you're staring at an all-green Grafana dashboard while your analytics reports show different numbers on every refresh.

This isn't a hot take. It's an observation from running ClickHouse in production and watching the same pattern repeat: team sets up Grafana, team feels monitored, cluster silently degrades, incident happens, team is surprised.

What Grafana Actually Monitors

The standard ClickHouse + Prometheus setup (ClickHouse exporter → Prometheus → Grafana) captures approximately 40 metrics:

  • CPU usage per node
  • Memory usage (RSS, ClickHouse tracked)
  • Disk usage and I/O
  • Query count and rate
  • Connection count
  • Network bytes in/out
  • ReplicasMaxAbsoluteDelay (a single replication metric)
  • Insert rows/second
  • A handful of asynchronous metrics

These are infrastructure metrics. They answer: "Is the machine running? Is the process alive? Is it doing work?"

They do not answer: "Is the data correct? Is replication consistent? Are background processes healthy?"

The 70% You're Not Monitoring

ClickHouse exposes 400+ metrics across its system tables. Here's what Grafana never queries:

Part Health (PULSE → P)

Metric | Where It Lives | What It Catches | Grafana Status
Broken parts | system.parts | Data corruption | Not monitored
Detached parts | system.detached_parts | Past corruption, failed recoveries | Not monitored
Part count per partition | system.parts | Merge backlog time bombs | Not monitored
Tiny parts accumulation | system.parts | Insert pattern problems | Not monitored
Part creation vs merge rate | system.part_log | Merge queue falling behind | Not monitored
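
As a rough illustration of the "part count per partition" row, here is a minimal sketch. It assumes you already have some client that runs the query and returns rows as dicts; the function name and the limit of 300 are hypothetical choices for the example, not ClickHouse defaults.

```python
# Assumed query: column names follow the system.parts schema.
PARTS_PER_PARTITION_SQL = """
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
"""


def partitions_over_limit(rows, limit=300):
    """Return the rows whose active part count exceeds `limit`, worst first.

    `limit` is an illustrative threshold; tune it to your insert pattern.
    """
    flagged = [row for row in rows if row["active_parts"] > limit]
    return sorted(flagged, key=lambda row: row["active_parts"], reverse=True)
```

Feeding it the query result gives a short, ordered list of partitions worth looking at before inserts start getting delayed.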

Replication Health (PULSE → U)

Metric | Where It Lives | What It Catches | Grafana Status
Cross-replica data agreement | system.parts (per replica) | Silent replication drift | Not monitored
Log pointer divergence | system.replicas | Stuck replication | Not monitored
Queue depth trending | system.replicas | Replication falling behind | Not monitored
Read-only replicas | system.replicas | ZooKeeper session loss | Not monitored
Replication queue details | system.replication_queue | Stuck fetches and merges | Not monitored

Grafana monitors ReplicasMaxAbsoluteDelay: a single lag figure that says nothing about whether your replicas actually hold the same data. That's it.
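
To make "cross-replica data agreement" concrete, here is a hedged sketch. The cluster name 'my_cluster' is a placeholder, the query assumes a single-shard cluster (replicas of different shards are expected to differ), and the helper name is hypothetical. A brief mismatch during active inserts is normal; the signal is a mismatch that persists across several checks.

```python
from collections import defaultdict

# Assumptions: 'my_cluster' is a placeholder; single shard; replicated tables only.
REPLICA_ROWS_SQL = """
SELECT hostName() AS replica, database, table, sum(rows) AS total_rows
FROM clusterAllReplicas('my_cluster', system.parts)
WHERE active AND engine LIKE 'Replicated%'
GROUP BY replica, database, table
"""


def disagreeing_tables(rows):
    """Map (database, table) to {replica: total_rows} where replicas differ."""
    per_table = defaultdict(dict)
    for row in rows:
        per_table[(row["database"], row["table"])][row["replica"]] = row["total_rows"]
    return {
        key: counts
        for key, counts in per_table.items()
        if len(set(counts.values())) > 1
    }
```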

Mutation Health (PULSE → S)

Metric | Where It Lives | What It Catches | Grafana Status
Stuck mutations | system.mutations | Stale data being served | Not monitored
Mutation age | system.mutations | Long-running background ops | Not monitored
Mutation failure reasons | system.mutations | Silent mutation failures | Not monitored
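
A small sketch of how those three rows could be checked together. It assumes rows come back as dicts with create_time already parsed into a datetime; the one-hour cutoff and the three labels are hypothetical choices for illustration.

```python
from datetime import datetime, timedelta

# Assumed query: column names follow the system.mutations schema.
UNFINISHED_MUTATIONS_SQL = """
SELECT database, table, mutation_id, create_time, parts_to_do, latest_fail_reason
FROM system.mutations
WHERE NOT is_done
"""


def classify_mutation(row, now, max_age=timedelta(hours=1)):
    """Label an unfinished mutation as 'failing', 'stuck', or 'running'."""
    if row["latest_fail_reason"]:
        return "failing"  # the server recorded an error for at least one part
    if now - row["create_time"] > max_age:
        return "stuck"  # no recorded failure, but older than the cutoff
    return "running"
```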

Merge Health (PULSE → P)

Metric | Where It Lives | What It Catches | Grafana Status
Active merge count | system.merges | Merge queue pressure | Not monitored
Merge progress | system.merges | Stalled merges | Not monitored
Merge throughput trending | system.merges | I/O contention | Not monitored

See system.merges Decoded for the full breakdown.
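
One way to spot a stalled merge is to sample system.merges twice and compare progress. This is a sketch under assumptions: the sampling interval is up to you, the helper name and the minimum-progress value are hypothetical, and rows are dicts keyed by column name.

```python
# Assumed query, run on a schedule; column names follow the system.merges schema.
MERGES_SQL = """
SELECT database, table, result_part_name, progress, elapsed
FROM system.merges
"""


def stalled_merges(previous, current, min_progress=0.001):
    """Return keys of merges seen in both samples whose progress barely moved."""

    def key(row):
        return (row["database"], row["table"], row["result_part_name"])

    before = {key(row): row["progress"] for row in previous}
    return [
        key(row)
        for row in current
        if key(row) in before and row["progress"] - before[key(row)] < min_progress
    ]
```

Merges that appear only in the newer sample are ignored, since there is nothing to compare them against yet.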

Error Trending (PULSE → S)

Metric | Where It Lives | What It Catches | Grafana Status
Error type distribution | system.errors | Chronic error patterns | Not monitored
Error count trending | system.errors | Degradation over time | Not monitored

ClickHouse defines hundreds of error codes, and system.errors counts every one that has fired since the server started. Grafana doesn't query any of them.
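
Because the counts in system.errors are cumulative, trending means diffing snapshots. A minimal sketch, assuming you store the previous snapshot somewhere yourself; the helper name is hypothetical.

```python
# Assumed query: sums local and remote occurrences per error name.
ERRORS_SQL = """
SELECT name, sum(value) AS value
FROM system.errors
GROUP BY name
"""


def error_deltas(previous, current):
    """Return {error_name: new_occurrences} for errors that grew between snapshots.

    Counters reset when the server restarts; negative deltas are dropped here
    rather than interpreted.
    """
    before = {row["name"]: row["value"] for row in previous}
    deltas = {}
    for row in current:
        delta = row["value"] - before.get(row["name"], 0)
        if delta > 0:
            deltas[row["name"]] = delta
    return deltas
```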

"But I Added Custom Panels"

Yes, you can write custom ClickHouse queries in Grafana using the ClickHouse data source plugin. Some teams do. Here's what happens:

  1. Setup: You write 5-10 custom SQL queries as Grafana panels. This takes a day.
  2. Month 1: The queries work. You feel monitored.
  3. Month 3: A ClickHouse upgrade renames a column in system.parts. Two panels silently break. Grafana shows "No data", which looks exactly like "zero problems."
  4. Month 6: You realize you're monitoring 15 metrics out of 400+. The merge backlog that just caused an outage wasn't one of them.
  5. Month 9: Nobody maintains the custom queries. Half the panels are broken. The team trusts Grafana anyway because it's Grafana.
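
The Month 3 failure is avoidable if a check verifies its own schema before trusting an empty result. Here is a sketch of that idea; the function names and status strings are hypothetical, and it assumes you fetch the live column list with the query shown.

```python
# Assumed query: lists the columns the running server exposes for system.parts.
PARTS_COLUMNS_SQL = """
SELECT name
FROM system.columns
WHERE database = 'system' AND table = 'parts'
"""


def missing_columns(required, available):
    """Return required column names absent from the live schema, sorted."""
    return sorted(set(required) - set(available))


def panel_status(required, available, result_rows):
    """Separate 'the check is broken' from 'the check found nothing'."""
    missing = missing_columns(required, available)
    if missing:
        return "broken: missing columns " + ", ".join(missing)
    if not result_rows:
        return "ok: no problems found"
    return f"alert: {len(result_rows)} rows"
```

The point of the design is that an empty result only counts as good news after the schema check passes.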

The problem isn't Grafana — it's a purpose mismatch. Grafana is designed to visualize metrics you already have. ClickHouse monitoring requires actively querying system tables, understanding the relationships between metrics, and alerting on trends — not just thresholds.

What Purpose-Built Monitoring Looks Like

A monitoring tool designed for ClickHouse does things Grafana structurally cannot:

1. Cross-replica comparison. Grafana queries one ClickHouse instance at a time. Real replication monitoring requires comparing data across replicas — part counts, log positions, data freshness. This is a fundamentally different query pattern.

2. Trend-based alerting. A merge queue of 20 isn't a problem. A merge queue that grew from 5 to 20 in the last 10 minutes is. Grafana can do threshold alerts. It can't easily do "alert when the growth rate of this metric exceeds X."

3. Fix commands. When a part is broken, you need ALTER TABLE ... DROP DETACHED PART. When replication lags, you need SYSTEM SYNC REPLICA database.specific_table. A purpose-built tool knows which command fixes which problem and generates it for the specific table affected.

4. PULSE health scoring. A single 0-100 number that weighs 400+ metrics across five operational dimensions. Not 40 Grafana panels that you have to mentally synthesize at 3 AM.
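
The trend-based alerting point (item 2) can be sketched in a few lines. This assumes you sample a value such as the active merge count (for example, count() over system.merges) on a schedule and keep (timestamp, value) pairs; the function names and the limit of one per minute are hypothetical.

```python
def growth_per_minute(samples):
    """Average growth per minute between the oldest and newest sample.

    `samples` is a list of (unix_seconds, value) pairs, oldest first.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (v1 - v0) / ((t1 - t0) / 60)


def should_alert(samples, max_growth_per_minute=1.0):
    """Alert on how fast the value is rising, not on its absolute level."""
    return len(samples) >= 2 and growth_per_minute(samples) > max_growth_per_minute
```

With the numbers from item 2, a queue going from 5 to 20 over 10 minutes grows at 1.5 per minute and trips the alert, while a queue sitting flat at 20 does not.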

This Isn't About Grafana Being Bad

Grafana is excellent infrastructure monitoring. Use it for CPU, memory, disk, network. Use it for application metrics. Use it for everything it was designed for.

But when someone asks "Is our ClickHouse cluster healthy?" and you point at a green Grafana dashboard — you're not answering the question. You're answering a different, easier question: "Is our ClickHouse cluster running?"

Running and healthy are not the same thing.

ClusterSight monitors the 70% that Grafana misses. Deploy in under 8 minutes. See what your cluster is actually doing.

Check your cluster's PULSE.


This post is part of the Opinions series — honest takes on ClickHouse operations.

Frequently Asked Questions

Can I monitor ClickHouse with Grafana?

You can monitor ClickHouse infrastructure metrics (CPU, memory, disk, query rate) with Grafana + Prometheus. But Grafana misses 70% of operational metrics that predict failures — broken parts, stuck mutations, replication drift, merge backlogs, and ZooKeeper session health. These require querying ClickHouse system tables directly.

What does Grafana miss for ClickHouse monitoring?

Grafana with the standard ClickHouse exporter misses: broken/detached parts, stuck mutations, replication consistency across replicas, merge queue depth and trends, part count per partition, ZooKeeper session health, error type trending in system.errors, and cross-replica data agreement.

What is the best monitoring tool for ClickHouse?

Purpose-built tools like ClusterSight monitor all 400+ ClickHouse system table metrics including broken parts, replication drift, merge backlogs, and stuck mutations. General-purpose tools (Grafana, Datadog, New Relic) only cover infrastructure metrics and miss the operational signals that predict ClickHouse-specific failures.