2026-05-25 · 8 min read

Month 3: Building ClusterSight — What We Shipped, What Broke, What's Next

Three months ago, ClusterSight was a positioning document and a Docker container that queried system.replicas. Today it monitors 400+ metrics across five PULSE dimensions, sends Slack alerts with fix commands, and has been deployed on clusters ranging from 3-node dev setups to 50-node production fleets.

This is the honest update. What shipped. What users actually wanted (different from what we expected). What broke. What's next.

What Shipped

The Core Dashboard

The PULSE health score — a 0-100 number weighted across five dimensions — is live and updating every 60 seconds. The scoring algorithm survived contact with real clusters mostly intact. Two weight adjustments were needed:

Adjustment 1: The original P (Parts) scoring penalized new tables too aggressively. A freshly created table with 5 parts scored the same as a table with 5 parts that should have been merged hours ago. Fix: part count scoring now factors in table age and recent insert activity.

Adjustment 2: The U (Uptime) scoring didn't account for intentionally read-only replicas. Some users run read-only replicas deliberately for analytics isolation. These showed as "CRITICAL: Read-only" and tanked the score. Fix: users can now mark replicas as intentionally read-only, and the score excludes them. See Detecting Read-Only Replicas for why this distinction matters.
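To make the first adjustment concrete, here's a rough sketch of age-aware part scoring. The thresholds and point values are hypothetical, not ClusterSight's actual weights — the point is the shape: the same part count is penalized less when the table is brand new or actively receiving inserts.

```python
from datetime import datetime, timedelta

def part_count_score(parts, table_created, last_insert, now=None):
    """Score a table's part count 0-100, softening the penalty for
    tables that are new or actively inserting. All thresholds here
    are illustrative, not ClusterSight's real algorithm."""
    now = now or datetime.utcnow()
    base = max(0, 100 - parts)  # naive baseline: one point off per part

    # A table created in the last hour hasn't had time to merge yet.
    if now - table_created < timedelta(hours=1):
        return max(base, 90)

    # Recent inserts legitimately raise the part count; soften the hit.
    if now - last_insert < timedelta(minutes=10):
        return max(base, min(100, base + 20))

    return base
```

With this shape, a 50-part table scores 90 if it was created 30 minutes ago, but only 50 if it has sat idle for days — the distinction the original scoring missed.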

Alerting With Fix Commands

Every alert includes the specific SQL command to fix the problem. Not "check your replication" — the actual SYSTEM SYNC REPLICA analytics.events for the exact table that's lagging.

This turned out to be the feature users mention most. The feedback pattern: "I knew something was wrong but I didn't know the exact command." ClickHouse has dozens of SYSTEM commands and the syntax for each is subtly different. Having the fix pre-generated and copy-pasteable saves the 10-minute documentation lookup at 3 AM.
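The mechanism is simple: a template per alert type, rendered with the affected table. This sketch is illustrative — the alert names are invented, not ClusterSight's internal schema — but the SQL commands are standard ClickHouse statements.

```python
# Illustrative alert-to-fix mapping. The alert type names are
# hypothetical; the rendered SQL is standard ClickHouse syntax.
FIX_TEMPLATES = {
    "replication_lag": "SYSTEM SYNC REPLICA {db}.{table}",
    "too_many_parts":  "OPTIMIZE TABLE {db}.{table} FINAL",
    "stuck_mutation":  "KILL MUTATION WHERE database = '{db}' AND table = '{table}'",
}

def fix_command(alert_type: str, db: str, table: str) -> str:
    """Render the copy-pasteable fix for a given alert."""
    return FIX_TEMPLATES[alert_type].format(db=db, table=table)

print(fix_command("replication_lag", "analytics", "events"))
# SYSTEM SYNC REPLICA analytics.events
```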

Multi-Node Collection

ClusterSight connects to every node in the cluster and compares metrics cross-node. This is how it catches replication drift — comparing part counts and log positions across replicas, not just reading absolute_delay on one node.

The cross-node comparison also catches stuck mutations that complete on some replicas but not others — the failure mode that's invisible when monitoring each node independently.
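The core of the cross-node check is a comparison against the furthest-ahead replica, something like this sketch. The field name matches the log_pointer column in system.replicas; the gap threshold is an illustrative value, not ClusterSight's.

```python
def detect_drift(replica_metrics: dict, max_gap: int = 10) -> list:
    """Flag replicas whose log_pointer trails the furthest-ahead replica
    by more than max_gap log entries. replica_metrics maps replica name
    to the system.replicas row read on that node. The threshold is a
    hypothetical default."""
    newest = max(m["log_pointer"] for m in replica_metrics.values())
    return [
        name
        for name, m in replica_metrics.items()
        if newest - m["log_pointer"] > max_gap
    ]
```

Reading absolute_delay on one node can't see this, because each replica only knows its own position; the drift only appears when you line the positions up side by side.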

The PULSE Health Check CLI

The free tool. No account. No data leaves your network. Run it against your cluster and get a PULSE report:

npx clustersight-pulse-check --host your-clickhouse:8123

Output:

PULSE Health Report — your-cluster
═══════════════════════════════════
P (Parts):      92/100  ✓
U (Uptime):     87/100  ⚠ 2 tables with queue_size > 20
L (Latency):    95/100  ✓
S (Stability):  78/100  ⚠ 1 mutation pending > 10 min
E (Efficiency): 84/100  ✓
─────────────────────────
Overall PULSE:  87/100

Recommendations:
→ Check mutation on analytics.events (pending 14 minutes)
→ Monitor replication queue on logs.access_log (queue_size: 34)

This tool drives more sign-ups than any blog post. People run the CLI, see their score, want trending and alerting, and deploy the full product.

What Users Actually Wanted

Expected: System Table Deep Dives

We thought users would want deep visibility into every system table column. Detailed system.merges breakdowns. Full system.query_log analysis.

Reality: "Just Tell Me What's Wrong"

Most users don't want to become system.replicas experts. They want to know:

  1. Is my cluster healthy? (The score)
  2. If not, what's wrong? (The alert)
  3. How do I fix it? (The command)

The deep-dive capability exists for power users, but 80% of interactions are: glance at score → see alert → copy fix command → done.

This validated the PULSE Framework's core idea — that 400+ metrics are only useful if they're organized into a structure that humans can reason about quickly. Nobody wants to monitor 400 things. Everyone wants to know if five dimensions are healthy.
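The rollup itself is deliberately boring: five 0-100 dimension scores collapse into one number. Here's a sketch using equal weights as a stand-in — ClusterSight's real weighting isn't published, so treat WEIGHTS as an assumption.

```python
# Equal weights are a stand-in; the actual PULSE weighting differs.
WEIGHTS = {"P": 0.2, "U": 0.2, "L": 0.2, "S": 0.2, "E": 0.2}

def overall_pulse(dimension_scores: dict) -> int:
    """Collapse five 0-100 dimension scores into one 0-100 score."""
    return round(sum(WEIGHTS[d] * s for d, s in dimension_scores.items()))

# The dimension scores from the sample report above:
print(overall_pulse({"P": 92, "U": 87, "L": 95, "S": 78, "E": 84}))  # 87
```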

Surprise Request: Historical Trending

The most requested feature we didn't plan for: "Show me my PULSE score over the last 30 days." Users want to correlate score changes with deploys, upgrades, and schema changes. A score of 85 means nothing without context — is that up from 72 last week (recovering) or down from 94 (degrading)?

This is now the #1 roadmap priority.

Surprise Request: "Explain This Alert To My Manager"

Multiple users asked for a way to export or share alerts in a format that non-technical stakeholders can understand. An engineer knows what "312 parts per partition on analytics.events" means. Their VP doesn't.

We're considering a "plain English" alert summary: "The events table is experiencing write pressure due to data accumulation. Without intervention, write operations will fail within approximately 6 hours. Recommended action: [command]."

What Broke

The ZooKeeper False Positive

ClusterSight queries system.zookeeper to check Keeper health. On clusters using ClickHouse Keeper (not ZooKeeper), the system.zookeeper table still exists but behaves differently. Our health check flagged Keeper-based clusters as "ZooKeeper unhealthy" when they were perfectly fine.

Fix: detect whether the cluster uses ZooKeeper or ClickHouse Keeper and adjust queries accordingly.
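One way to tell the two apart — an assumption on our part, not necessarily what ClusterSight ships — is to send the srvr four-letter-word command to the coordination service and look at the first line of the response: real ZooKeeper answers "Zookeeper version: 3.x...", while ClickHouse Keeper answers "ClickHouse Keeper version: ...".

```python
def detect_coordinator(srvr_response: str) -> str:
    """Classify the coordination service from its `srvr` four-letter-word
    response. Heuristic sketch based on the documented response formats;
    not necessarily ClusterSight's detection method."""
    first_line = srvr_response.splitlines()[0]
    if first_line.startswith("ClickHouse Keeper"):
        return "keeper"
    if first_line.lower().startswith("zookeeper"):
        return "zookeeper"
    return "unknown"
```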

The query_log Explosion

On busy clusters (>10,000 queries/minute), querying system.query_log every 60 seconds for latency metrics caused... latency issues. ClusterSight's own monitoring queries were showing up as slow queries in the very table it was monitoring.

Fix: sample query_log instead of scanning it fully. Use SAMPLE 0.1 on high-traffic clusters and extrapolate. Also exclude ClusterSight's own queries from analysis.
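The probe query ends up looking roughly like this. Two caveats in this sketch: SAMPLE only works if query_log was given a sampling key via the server's query_log engine override (it has none by default), and the log_comment tag for self-exclusion is an assumed convention, not a documented ClusterSight detail.

```python
def latency_query(sample_ratio=None) -> str:
    """Build the latency probe against system.query_log, excluding our
    own queries (assumed to be tagged via the log_comment setting) and
    optionally sampling. SAMPLE requires query_log to have a sampling
    key configured; pass sample_ratio=None otherwise."""
    sample = f" SAMPLE {sample_ratio}" if sample_ratio else ""
    scale = 1.0 / sample_ratio if sample_ratio else 1.0
    return (
        "SELECT quantile(0.95)(query_duration_ms) AS p95, "
        f"count() * {scale} AS est_queries "
        f"FROM system.query_log{sample} "
        "WHERE event_time > now() - INTERVAL 1 MINUTE "
        "AND type = 'QueryFinish' "
        "AND log_comment != 'clustersight'"
    )
```

The percentile survives sampling unchanged; only the query count needs extrapolating, hence the single scale factor.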

The Part Count Race Condition

During a large merge operation, the part count momentarily increases (new merged part created before old parts are deleted). ClusterSight briefly flagged this as "part count spike" and sent alerts. Users got 30-second alert/recovery cycles during normal merge activity.

Fix: add a 3-minute debounce on part count alerts. The part count must remain elevated for 3 consecutive checks before alerting.
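The debounce is a consecutive-streak counter — at a 60-second check interval, requiring 3 consecutive elevated checks gives the 3-minute window. A minimal sketch of the logic (state handling is ours, not ClusterSight's actual code):

```python
class DebouncedAlert:
    """Fire only after the condition holds for `required` consecutive
    checks. One healthy check resets the streak, so a transient
    merge-time part-count spike never reaches the alert threshold."""

    def __init__(self, required: int = 3):
        self.required = required
        self.streak = 0

    def check(self, elevated: bool) -> bool:
        """Record one check; return True when the alert should fire."""
        self.streak = self.streak + 1 if elevated else 0
        return self.streak >= self.required
```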

What's Next

Q3 Roadmap

  1. Historical PULSE trending — Score history with deploy/upgrade event correlation. The most requested feature.
  2. Multi-cluster dashboard — One view across all ClickHouse clusters with comparative health scoring.
  3. PagerDuty/OpsGenie integration — Alerts beyond Slack for on-call workflows.
  4. Query performance advisor — Using system.query_log analysis to recommend specific optimizations (see The 15 Fields That Matter for the foundation).
  5. ClickHouse Keeper-native support — First-class monitoring for Keeper, not just ZooKeeper compatibility mode.

The Content Roadmap

Over the last 3 months we published 12 deep-dive articles across four content pillars: Operator's Blind Spot (war stories), PULSE Check (tactical how-tos), ClickHouse at Scale (system table deep dives), and Build in Public (product journey).

Plus one opinion piece that generated more discussion than everything else combined: Grafana + Prometheus Is Not ClickHouse Monitoring.

Month 4 onwards: the cadence continues. More system table guides, more PULSE checks, and the first conference talk at a ClickHouse meetup.

Try ClusterSight

Run the free PULSE Health Check CLI:

npx clustersight-pulse-check --host your-clickhouse:8123

Or deploy the full platform — under 8 minutes with Docker.

Check your cluster's PULSE.


This post is part of the Build in Public series. Previously: Why We're Building a ClickHouse Health Tool and Designing Health Scores.


Frequently Asked Questions

What is ClusterSight?

ClusterSight is an operational health platform for ClickHouse clusters. It monitors 400+ system table metrics, generates a PULSE health score (0-100), and provides copy-pasteable fix commands for every alert. Deploy with Docker in under 8 minutes.

What did ClusterSight ship in the first 3 months?

Core health dashboard with PULSE scoring, real-time monitoring of system.replicas/system.parts/system.merges/system.mutations, Slack alerting with fix commands, multi-node support, and the free PULSE Health Check CLI tool.

What is the PULSE Health Check CLI?

A free command-line tool that runs all five PULSE dimension checks (Parts, Uptime, Latency, Stability, Efficiency) against your ClickHouse cluster and prints a health report. No account needed, no data leaves your network.