2026-03-28 · 7 min read
Why We're Building a ClickHouse Health Tool
Every ClickHouse cluster has a moment when the dashboards say green and the data says otherwise. For some teams it's a revenue number that changes between refreshes. For others it's a mutation stuck for 72 hours while stakeholders trust stale data. For us, it was a replication gap that silently grew for 6 days before anyone noticed.
That's why we built ClusterSight.
The Problem: ClickHouse Is a Black Box in Production
ClickHouse is one of the most powerful analytical databases ever built. It's also one of the hardest to monitor in production. Not because it lacks observability — ClickHouse exposes over 400 metrics across its system tables. The problem is that no standard monitoring tool queries them.
Here's what a typical ClickHouse monitoring setup looks like:
- Prometheus scrapes the ClickHouse metrics endpoint → gets ~40 infrastructure metrics
- Grafana displays CPU, memory, disk, query rate → the basics
- Maybe a custom script checks system.replicas for replication lag
That covers about 30% of what matters. The other 70% sits in system tables that nobody queries:
- system.parts: Are parts accumulating? Are any broken or detached?
- system.mutations: Are mutations completing? Or stuck for hours?
- system.merges: Is the merge queue keeping up with inserts?
- system.replicas: Beyond absolute_delay, are replicas actually consistent?
- system.errors: Are errors trending upward across 40+ error types?
- system.replication_queue: What's actually in the replication queue?
These aren't edge cases. These are the metrics that predict whether your cluster will be healthy tomorrow.
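Each of those questions translates fairly directly into SQL. The following is a minimal sketch, not ClusterSight's actual check set: the check names, the LIMIT values, and the num_tries > 10 threshold are assumptions chosen for illustration. Detached parts are read from system.detached_parts, which ClickHouse keeps separately from system.parts. The column names are those documented for recent ClickHouse releases; older versions may differ.

```python
import re

# Hypothetical check catalogue. Names and thresholds are illustrative
# assumptions, not taken from ClusterSight.
CHECKS = {
    "active_parts_per_table": """
        SELECT database, table, count() AS active_parts
        FROM system.parts
        WHERE active
        GROUP BY database, table
        ORDER BY active_parts DESC
        LIMIT 10
    """,
    "detached_parts": """
        SELECT database, table, reason, count() AS detached
        FROM system.detached_parts
        GROUP BY database, table, reason
    """,
    "unfinished_mutations": """
        SELECT database, table, mutation_id, create_time, latest_fail_reason
        FROM system.mutations
        WHERE NOT is_done
        ORDER BY create_time
    """,
    "running_merges": """
        SELECT database, table, elapsed, progress, num_parts
        FROM system.merges
        ORDER BY elapsed DESC
    """,
    "replica_state": """
        SELECT database, table, is_readonly, absolute_delay, queue_size
        FROM system.replicas
        WHERE is_readonly OR absolute_delay > 0 OR queue_size > 0
    """,
    "error_counters": """
        SELECT name, value, last_error_time
        FROM system.errors
        ORDER BY value DESC
        LIMIT 20
    """,
    "retrying_replication_entries": """
        SELECT database, table, type, num_tries, last_exception
        FROM system.replication_queue
        WHERE num_tries > 10
    """,
}


def tables_covered(checks):
    """Return the sorted list of system tables referenced by the checks."""
    found = set()
    for sql in checks.values():
        found.update(re.findall(r"system\.\w+", sql))
    return sorted(found)
```

Each query can be pasted into clickhouse-client as is. tables_covered(CHECKS) is only a quick way to confirm which system tables a given catalogue touches.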
What We Tried First
Before building ClusterSight, we tried the standard approaches:
Grafana + ClickHouse Exporter
Result: Got infrastructure metrics. Missed every operational issue. A mutation ran for 4,000 seconds while every Grafana panel was green. We wrote about that gap in How to Monitor ClickHouse in Production.
Custom SQL Scripts via Cron
Result: Worked for the 5 checks we wrote. Then ClickHouse added new system tables, newer versions renamed columns, and the scripts broke silently. No one noticed for 3 weeks because the cron job itself wasn't monitored either.
Altinity Managed Service
Result: Great if you want someone else to run your cluster. Not great if you need to run it yourself, understand it deeply, and maintain control. The operational knowledge stays with Altinity, not with your team.
Datadog / New Relic ClickHouse Integrations
Result: Same 30% coverage as Grafana. These tools treat ClickHouse as one of 500 integrations. They'll never query system.replication_queue or detect merge backlog pressure — it's not in their data model.
None of these answered the question we kept asking: "Is my ClickHouse cluster actually healthy, or just running?"
The PULSE Framework
We needed a structured way to think about ClickHouse health. Not a random collection of metrics, but a framework that maps every operational signal to a diagnostic dimension. That became PULSE:
P — Parts: Are merges keeping up? Part counts, tiny parts, detached parts, compression ratios.
U — Uptime: Is replication healthy? Replica delays, read-only states, ZooKeeper health, queue backlogs. This is where silent replication drift lives — the most dangerous failure mode because everything looks green.
L — Latency: Are queries performant? Slow queries, memory pressure, thread pool saturation, error rates.
S — Stability: Are mutations and background processes healthy? Stuck mutations, distributed queue, error trending.
E — Efficiency: Are resources used well? Disk utilization, compression analysis, TTL compliance.
Every alert in ClusterSight maps to one or more PULSE letters. When your health score drops, you immediately know which dimension degraded.
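One way to picture the alert-to-dimension mapping is a small data model. This is a sketch under assumptions: the Pulse enum, the Alert type, the example alert names, and degraded_dimensions are all hypothetical and are not ClusterSight's internal representation.

```python
from dataclasses import dataclass
from enum import Enum


class Pulse(Enum):
    """The five PULSE dimensions, in framework order."""
    PARTS = "P"
    UPTIME = "U"
    LATENCY = "L"
    STABILITY = "S"
    EFFICIENCY = "E"


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record: a name plus the dimensions it maps to."""
    name: str
    dimensions: frozenset


def degraded_dimensions(alerts):
    """Return the letters of every dimension touched by at least one alert."""
    hit = set()
    for alert in alerts:
        hit |= alert.dimensions
    # Iterating the enum keeps the letters in P-U-L-S-E order.
    return "".join(d.value for d in Pulse if d in hit)


alerts = [
    Alert("stuck_mutation", frozenset({Pulse.STABILITY})),
    Alert("replication_queue_backlog", frozenset({Pulse.UPTIME, Pulse.STABILITY})),
]
print(degraded_dimensions(alerts))  # prints "US"
```

Allowing an alert to carry more than one dimension matches the "one or more PULSE letters" rule: a replication queue backlog can count against both Uptime and Stability.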
What ClusterSight Does
1. Queries 400+ System Table Metrics
ClusterSight connects to your ClickHouse cluster and queries every relevant system table — system.parts, system.replicas, system.merges, system.mutations, system.query_log, system.errors, and more. Every 60 seconds. On every node.
This isn't a metrics exporter that waits for Prometheus to scrape. It's an active health scanner that knows what to look for.
2. Generates a PULSE Health Score (0–100)
All 400+ metrics feed into a composite score:
- 90–100: Healthy
- 70–89: Warning — investigate within 24 hours
- 50–69: Degraded — investigate within 1 hour
- Below 50: Critical — immediate action
The score is weighted: broken parts are penalized severely, stuck mutations moderately, merge queue depth mildly. The weights reflect operational severity — a broken part is a data integrity issue, a deep merge queue is a performance issue.
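To make the weighting idea concrete, here is a minimal sketch of a penalty-based score. The penalty values (25, 10, 2) and the subtract-from-100 formula are assumptions for illustration only. The actual ClusterSight weights and formula are not given here. Only the ordering (broken parts, then stuck mutations, then merge queue depth) and the four score bands come from the text above.

```python
# Assumed per-finding penalties: severe, moderate, mild.
PENALTY_PER_FINDING = {
    "broken_part": 25,
    "stuck_mutation": 10,
    "merge_queue_depth": 2,
}


def pulse_score(findings):
    """Subtract weighted penalties from 100, never going below 0.

    `findings` maps a finding type to how many times it was observed.
    """
    penalty = sum(PENALTY_PER_FINDING[kind] * count
                  for kind, count in findings.items())
    return max(0, 100 - penalty)


def band(score):
    """Map a 0-100 score onto the four bands listed above."""
    if score >= 90:
        return "Healthy"
    if score >= 70:
        return "Warning"
    if score >= 50:
        return "Degraded"
    return "Critical"


score = pulse_score({"broken_part": 1, "stuck_mutation": 1, "merge_queue_depth": 3})
print(score, band(score))  # prints "59 Degraded"
```

With these assumed weights, a single broken part costs more than twelve merge-queue findings, which is the kind of asymmetry the severity ordering is meant to express.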
3. Ships Fix Commands With Every Alert
Every alert includes a copy-pasteable SQL command to fix the problem. Not "check your replication" — the actual SYSTEM SYNC REPLICA database.table command for the specific table that's lagging. Not "investigate merge queue" — the specific OPTIMIZE TABLE or settings change that will relieve pressure.
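As an illustration of how a table-specific fix command could be produced, the sketch below turns rows shaped like system.replicas output into SYSTEM SYNC REPLICA statements. The function names, the 300-second threshold, and the strict identifier check are assumptions, not ClusterSight's implementation.

```python
import re

# Conservative identifier check (an assumption): reject anything that would
# need quoting rather than trying to escape it.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def sync_replica_command(database, table):
    """Render a SYSTEM SYNC REPLICA statement for one replicated table."""
    for part in (database, table):
        if not _IDENT.fullmatch(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    return f"SYSTEM SYNC REPLICA {database}.{table}"


def fix_commands(replica_rows, max_delay_seconds=300):
    """Return one command per replica whose absolute_delay exceeds the threshold.

    `replica_rows` is a list of dicts with the keys database, table, and
    absolute_delay, as selected from system.replicas.
    """
    return [
        sync_replica_command(row["database"], row["table"])
        for row in replica_rows
        if row["absolute_delay"] > max_delay_seconds
    ]


rows = [
    {"database": "analytics", "table": "events", "absolute_delay": 5400},
    {"database": "analytics", "table": "sessions", "absolute_delay": 0},
]
print(fix_commands(rows))  # prints ['SYSTEM SYNC REPLICA analytics.events']
```

Rejecting unusual identifiers keeps the generated statement safe to paste; a fuller version would quote them instead.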
4. Deploys in Under 8 Minutes
docker pull clustersight/clustersight:latest
One Docker container. One connection string. No agents to install on ClickHouse nodes. No schema changes. No Prometheus. No Grafana. Just a health dashboard that works out of the box.
What's Next
We're building in the open. Here's what's on the roadmap:
- PULSE Health Check CLI — A free command-line tool that runs all five PULSE dimension checks and prints a report. No account needed.
- Slack/PagerDuty integration — Alerts where your team already works, with fix commands included.
- Historical trending — Track your PULSE score over time. See how deploys, upgrades, and schema changes affect cluster health.
- Multi-cluster support — One dashboard for all your ClickHouse clusters, with cross-cluster comparison.
Try It
If you run ClickHouse in production, your cluster has blind spots right now. Metrics that nobody's checking. Replication that looks healthy but isn't. Mutations that are stuck while dashboards say green.
Run a PULSE check today. Or let ClusterSight do it for you — deploy in under 8 minutes.
Check your cluster's PULSE.
Frequently Asked Questions
What is ClusterSight?
ClusterSight is an operational health platform for ClickHouse clusters. It monitors 400+ system table metrics, generates a PULSE health score (0–100), and provides copy-pasteable fix commands for every alert. Deploy in under 8 minutes with Docker Compose.
How is ClusterSight different from Grafana for ClickHouse?
Grafana monitors infrastructure metrics (CPU, memory, disk). ClusterSight monitors ClickHouse-specific operational metrics — broken parts, stuck mutations, replication drift, merge backlogs, ZooKeeper health — the 70% of signals that Grafana doesn't query. Every alert includes a fix command.
What is the PULSE Framework?
PULSE is a structured approach to ClickHouse health: Parts (merge health), Uptime (replication), Latency (query performance), Stability (mutations/background ops), Efficiency (resource utilization). ClusterSight maps all 400+ metrics to these five dimensions and generates a composite health score.
Is ClusterSight open source?
ClusterSight is available as a Docker image. Visit the docs at clustersight.io/docs for installation instructions and to get started in under 8 minutes.