2026-03-28 · 7 min read

Why We're Building a ClickHouse Health Tool

Every ClickHouse cluster has a moment where the dashboards say green and the data says otherwise. For some teams it's a revenue number that changes between refreshes. For others it's a mutation stuck for 72 hours while stakeholders trust stale data. For us, it was a replication gap that silently grew for 6 days before anyone noticed.

That's why we built ClusterSight.

The Problem: ClickHouse Is a Black Box in Production

ClickHouse is one of the most powerful analytical databases ever built. It's also one of the hardest to monitor in production. Not because it lacks observability — ClickHouse exposes over 400 metrics across its system tables. The problem is that no standard monitoring tool queries them.

Here's what a typical ClickHouse monitoring setup looks like:

  1. Prometheus scrapes the ClickHouse metrics endpoint → gets ~40 infrastructure metrics
  2. Grafana displays CPU, memory, disk, query rate → the basics
  3. Maybe a custom script checks system.replicas for replication lag

That covers about 30% of what matters. The other 70% sits in system tables that nobody queries:

  • system.parts — Are parts accumulating? Are any broken or detached?
  • system.mutations — Are mutations completing? Or stuck for hours?
  • system.merges — Is the merge queue keeping up with inserts?
  • system.replicas — Beyond absolute_delay, are replicas actually consistent?
  • system.errors — Are errors trending upward across 40+ error types?
  • system.replication_queue — What's actually in the replication queue?

These aren't edge cases. These are the metrics that predict whether your cluster will be healthy tomorrow.
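As a rough illustration of what querying these tables can look like, the sketch below pairs each table with one query. The column names follow recent ClickHouse releases and should be verified against your version. The check names and the `run_query` callable (a stand-in for whatever client you use) are hypothetical, and this is not ClusterSight's actual query set.

```python
from typing import Callable, Dict, List

# One illustrative query per system table listed above.
# Column names follow recent ClickHouse releases; verify them on your version.
CHECKS: Dict[str, str] = {
    "active_parts_per_table": (
        "SELECT database, table, count() AS active_parts "
        "FROM system.parts WHERE active "
        "GROUP BY database, table ORDER BY active_parts DESC LIMIT 10"
    ),
    "unfinished_mutations": (
        "SELECT database, table, mutation_id, create_time, latest_fail_reason "
        "FROM system.mutations WHERE is_done = 0"
    ),
    "running_merges": (
        "SELECT database, table, elapsed, progress, num_parts FROM system.merges"
    ),
    "replica_state": (
        "SELECT database, table, is_readonly, absolute_delay, queue_size "
        "FROM system.replicas"
    ),
    "top_errors": (
        "SELECT name, value, last_error_time "
        "FROM system.errors ORDER BY value DESC LIMIT 10"
    ),
    "replication_queue_retries": (
        "SELECT database, table, type, num_tries, last_exception "
        "FROM system.replication_queue WHERE num_tries > 1"
    ),
}


def run_checks(run_query: Callable[[str], List[tuple]]) -> Dict[str, List[tuple]]:
    """Run every check through the supplied query function and collect the rows."""
    return {name: run_query(sql) for name, sql in CHECKS.items()}
```

Passing the query function in keeps the sketch independent of any particular ClickHouse client library.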

What We Tried First

Before building ClusterSight, we tried the standard approaches:

Grafana + ClickHouse Exporter

Result: Got infrastructure metrics. Missed every operational issue. A mutation ran for 4,000 seconds while every Grafana panel was green. We wrote about that gap in How to Monitor ClickHouse in Production.

Custom SQL Scripts via Cron

Result: Worked for the 5 checks we wrote. Then ClickHouse added new system tables. Versions changed column names. The scripts broke silently. No one noticed for 3 weeks because the cron job failure wasn't monitored either.

Altinity Managed Service

Result: Great if you want someone else to run your cluster. Not great if you need to run it yourself, understand it deeply, and maintain control. The operational knowledge stays with Altinity, not with your team.

Datadog / New Relic ClickHouse Integrations

Result: Same 30% coverage as Grafana. These tools treat ClickHouse as one of 500 integrations. They'll never query system.replication_queue or detect merge backlog pressure — it's not in their data model.

None of these answered the question we kept asking: "Is my ClickHouse cluster actually healthy, or just running?"

The PULSE Framework

We needed a structured way to think about ClickHouse health. Not a random collection of metrics, but a framework that maps every operational signal to a diagnostic dimension. That became PULSE:

P — Parts: Are merges keeping up? Part counts, tiny parts, detached parts, compression ratios.

U — Uptime: Is replication healthy? Replica delays, read-only states, ZooKeeper health, queue backlogs. This is where silent replication drift lives — the most dangerous failure mode because everything looks green.

L — Latency: Are queries performant? Slow queries, memory pressure, thread pool saturation, error rates.

S — Stability: Are mutations and background processes healthy? Stuck mutations, distributed queue, error trending.

E — Efficiency: Are resources used well? Disk utilization, compression analysis, TTL compliance.

Every alert in ClusterSight maps to one or more PULSE letters. When your health score drops, you immediately know which dimension degraded.
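One way to picture that mapping is a small data structure in which each alert carries a set of PULSE dimensions. This is a minimal sketch: the `Alert` class and the two example alerts are hypothetical names chosen for illustration, not ClusterSight's internal model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Pulse(Enum):
    PARTS = "P"
    UPTIME = "U"
    LATENCY = "L"
    STABILITY = "S"
    EFFICIENCY = "E"


@dataclass(frozen=True)
class Alert:
    name: str
    dimensions: FrozenSet[Pulse]

    def letters(self) -> str:
        """Return the alert's PULSE letters in framework order (P, U, L, S, E)."""
        return "".join(d.value for d in Pulse if d in self.dimensions)


# Hypothetical alerts: one maps to a single dimension, one to two.
stuck_mutation = Alert("stuck_mutation", frozenset({Pulse.STABILITY}))
replica_queue_backlog = Alert(
    "replica_queue_backlog", frozenset({Pulse.UPTIME, Pulse.STABILITY})
)
```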

What ClusterSight Does

1. Queries 400+ System Table Metrics

ClusterSight connects to your ClickHouse cluster and queries every relevant system table — system.parts, system.replicas, system.merges, system.mutations, system.query_log, system.errors, and more. Every 60 seconds. On every node.

This isn't a metrics exporter that waits for Prometheus to scrape. It's an active health scanner that knows what to look for.

2. Generates a PULSE Health Score (0–100)

All 400+ metrics feed into a composite score:

  • 90–100: Healthy
  • 70–89: Warning — investigate within 24 hours
  • 50–69: Degraded — investigate within 1 hour
  • Below 50: Critical — immediate action

The score is weighted: broken parts are penalized severely, stuck mutations moderately, merge queue depth mildly. The weights reflect operational severity: a broken part is a data integrity issue, while a deep merge queue is a performance issue.
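To make the weighting idea concrete, here is a minimal sketch of a penalty-based score. The penalty values and finding names are assumptions invented for this example. Only their ordering (broken parts above stuck mutations above merge queue depth) and the four score bands come from the description above.

```python
from typing import Dict

# Hypothetical penalty per finding. Only the ordering
# (severe > moderate > mild) is taken from the text.
PENALTIES: Dict[str, float] = {
    "broken_part": 25.0,
    "stuck_mutation": 10.0,
    "merge_queue_depth": 2.0,
}


def pulse_score(findings: Dict[str, int]) -> float:
    """Subtract weighted penalties from 100 and clamp the result at 0.

    Raises KeyError for a finding kind that has no configured penalty.
    """
    penalty = sum(PENALTIES[kind] * count for kind, count in findings.items())
    return max(0.0, 100.0 - penalty)


def band(score: float) -> str:
    """Map a score to the bands listed above."""
    if score >= 90:
        return "Healthy"
    if score >= 70:
        return "Warning"
    if score >= 50:
        return "Degraded"
    return "Critical"
```

With these example weights, one broken part plus one stuck mutation gives a score of 65, which falls in the Degraded band.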

3. Ships Fix Commands With Every Alert

Every alert includes a copy-pasteable SQL command to fix the problem. Not "check your replication" — the actual SYSTEM SYNC REPLICA database.table command for the specific table that's lagging. Not "investigate merge queue" — the specific OPTIMIZE TABLE or settings change that will relieve pressure.
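A hedged sketch of how such a command could be generated from replica data follows. `SYSTEM SYNC REPLICA` is a real ClickHouse statement. The function names, the input tuple shape, and the 300-second threshold are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier; reject names that contain a backtick."""
    if "`" in name:
        raise ValueError(f"unsupported identifier: {name!r}")
    return f"`{name}`"


def sync_replica_command(database: str, table: str) -> str:
    """Build the SYSTEM SYNC REPLICA statement for one table."""
    return f"SYSTEM SYNC REPLICA {quote_ident(database)}.{quote_ident(table)}"


def fix_commands(
    replicas: Iterable[Tuple[str, str, int]],
    max_delay_seconds: int = 300,  # hypothetical alert threshold
) -> List[str]:
    """Return one command per (database, table, absolute_delay) row over the threshold."""
    return [
        sync_replica_command(database, table)
        for database, table, delay in replicas
        if delay > max_delay_seconds
    ]
```

Rows in this shape could come from a query such as SELECT database, table, absolute_delay FROM system.replicas.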

4. Deploys in Under 8 Minutes

docker pull clustersight/clustersight:latest

One Docker container. One connection string. No agents to install on ClickHouse nodes. No schema changes. No Prometheus. No Grafana. Just a health dashboard that works out of the box.

Get started →

What's Next

We're building in the open. Here's what's on the roadmap:

  • PULSE Health Check CLI — A free command-line tool that runs all five PULSE dimension checks and prints a report. No account needed.
  • Slack/PagerDuty integration — Alerts where your team already works, with fix commands included.
  • Historical trending — Track your PULSE score over time. See how deploys, upgrades, and schema changes affect cluster health.
  • Multi-cluster support — One dashboard for all your ClickHouse clusters, with cross-cluster comparison.

Try It

If you run ClickHouse in production, your cluster has blind spots right now. Metrics that nobody's checking. Replication that looks healthy but isn't. Mutations that are stuck while dashboards say green.

Run a PULSE check today. Or let ClusterSight do it for you — deploy in under 8 minutes.

Check your cluster's PULSE.


Frequently Asked Questions

What is ClusterSight?

ClusterSight is an operational health platform for ClickHouse clusters. It monitors 400+ system table metrics, generates a PULSE health score (0–100), and provides copy-pasteable fix commands for every alert. It deploys as a single Docker container in under 8 minutes.

How is ClusterSight different from Grafana for ClickHouse?

Grafana monitors infrastructure metrics (CPU, memory, disk). ClusterSight monitors ClickHouse-specific operational metrics — broken parts, stuck mutations, replication drift, merge backlogs, ZooKeeper health — the 70% of signals that Grafana doesn't query. Every alert includes a fix command.

What is the PULSE Framework?

PULSE is a structured approach to ClickHouse health: Parts (merge health), Uptime (replication), Latency (query performance), Stability (mutations/background ops), Efficiency (resource utilization). ClusterSight maps all 400+ metrics to these five dimensions and generates a composite health score.

Is ClusterSight open source?

ClusterSight is available as a Docker image. Visit the docs at clustersight.io/docs for installation instructions; getting started takes under 8 minutes.