2026-04-27 · 9 min read
Designing Health Scores: How ClusterSight Weights 400+ Metrics Into One Number
Reducing 400+ metrics to a single number sounds like it should lose information. Done right, it amplifies it. The trick isn't averaging — it's knowing which metrics are load-bearing and which are noise. This post explains how ClusterSight's PULSE health score works, the design decisions behind it, and why certain metrics get 50x the weight of others.
The Problem With Dashboards
A ClickHouse cluster with 15 replicated tables across 3 nodes generates hundreds of metric values every minute. A Grafana dashboard with 40 panels requires a human to visually scan each one, mentally synthesize the state, and decide if action is needed. At 3 AM, that human has a 50/50 chance of missing the one panel that matters.
The health score replaces that cognitive overhead with a single question: Is the number above 90? If yes, go back to sleep. If no, the score tells you which PULSE dimension degraded and what to investigate.
Architecture: Five Dimensions, One Score
The PULSE Framework defines five operational dimensions. Each gets its own 0-100 sub-score:
┌───────────────────────────────────────────────────────────┐
│                    PULSE Health Score                     │
│                          0 – 100                          │
├───────────┬───────────┬───────────┬───────────┬───────────┤
│ P: Parts  │ U: Uptime │ L: Lat.   │ S: Stab.  │ E: Eff.   │
│   0–100   │   0–100   │   0–100   │   0–100   │   0–100   │
├───────────┼───────────┼───────────┼───────────┼───────────┤
│ 12 metrics│ 14 metrics│ 8 metrics │ 9 metrics │ 7 metrics │
└───────────┴───────────┴───────────┴───────────┴───────────┘
The composite score is a weighted average of the five dimensions — but with hard caps applied when critical thresholds are violated.
Dimension Weights
Not all dimensions are equal. A compression ratio problem (Efficiency) is annoying. A broken part (Parts) is a data integrity emergency. The weights reflect this:
| Dimension | Weight | Rationale |
|---|---|---|
| P — Parts | 30% | Part health is the foundation. Broken parts = data loss. Merge backlogs = imminent insert failures. |
| U — Uptime | 30% | Replication health determines data consistency. Silent drift means wrong query results. |
| L — Latency | 20% | Query performance directly impacts users. But slow queries don't corrupt data. |
| S — Stability | 15% | Stuck mutations and background errors are serious but usually don't cause immediate data loss. |
| E — Efficiency | 5% | Resource waste costs money but doesn't cause incidents. |
Why P and U are tied at 30%: Both represent data integrity — the kind of problem where "we'll fix it tomorrow" isn't an option. Parts represent the physical integrity of data on disk. Uptime represents the logical consistency of data across replicas. Both can cause wrong query results.
Why E is only 5%: Bad compression doubles your storage bill but doesn't wake anyone up at 3 AM. It matters, but not at the same urgency level. See PULSE Check: Is Your Compression Working? for how to fix efficiency issues.
How Individual Dimensions Score
Each dimension uses a scoring model with three components:
1. Threshold Checks (Binary)
Some metrics have hard thresholds where any violation is critical:
Broken parts > 0 → P score capped at 30
Read-only replicas > 0 → U score capped at 40
Active replicas < total → U score capped at 60
These are non-negotiable. A cluster with broken parts cannot score above 30 on Parts, regardless of how perfect everything else is.
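The cap rule above can be sketched in a few lines. This is a minimal illustration of the described behavior, not ClusterSight's actual implementation; the function name and signature are assumptions.

```python
def apply_critical_caps(p_score: float, u_score: float,
                        broken_parts: int,
                        readonly_replicas: int,
                        active_replicas: int,
                        total_replicas: int) -> tuple[float, float]:
    """Cap the P and U sub-scores when hard thresholds are violated."""
    if broken_parts > 0:
        p_score = min(p_score, 30)   # broken parts: P capped at 30
    if readonly_replicas > 0:
        u_score = min(u_score, 40)   # read-only replicas: U capped at 40
    if active_replicas < total_replicas:
        u_score = min(u_score, 60)   # missing replicas: U capped at 60
    return p_score, u_score

# A cluster with one broken part cannot exceed P=30,
# no matter how good its other Parts metrics are:
print(apply_critical_caps(95, 92, broken_parts=1,
                          readonly_replicas=0,
                          active_replicas=3, total_replicas=3))
# → (30, 92)
```

Note the use of `min()` rather than subtraction: a cap is absolute, so an otherwise-perfect P sub-score and a mediocre one land at the same ceiling.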
2. Continuous Metrics (Graduated)
Most metrics use graduated scoring:
Part count per partition:
< 100 → 100 points (full marks)
100-200 → 80 points
200-300 → 50 points
> 300 → 20 points
Replication lag (absolute_delay):
< 10s → 100 points
10-30s → 80 points
30-300s → 50 points
> 300s → 20 points
Query error rate:
< 0.1% → 100 points
0.1-1% → 80 points
1-5% → 50 points
> 5% → 20 points
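The three tables above share one shape: ordered cut-offs mapping a value to points. A sketch of that shared bucket logic, using the cut-offs from the post (the helper name and structure are illustrative assumptions):

```python
def graduated(value: float, cutoffs: list[tuple[float, int]],
              floor: int = 20) -> int:
    """Return the points for the first cut-off the value falls under."""
    for upper_bound, points in cutoffs:
        if value < upper_bound:
            return points
    return floor  # worst bucket: anything past the last cut-off

# Cut-offs from the tables above
PART_COUNT  = [(100, 100), (200, 80), (300, 50)]      # parts per partition
REPLICA_LAG = [(10, 100), (30, 80), (300, 50)]        # absolute_delay, seconds
ERROR_RATE  = [(0.001, 100), (0.01, 80), (0.05, 50)]  # fraction, not percent

print(graduated(150, PART_COUNT))   # → 80
print(graduated(312, PART_COUNT))   # → 20
print(graduated(12, REPLICA_LAG))   # → 80
```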
3. Trend Penalties
A metric that's getting worse is more concerning than one that's stable at a bad value. ClusterSight applies trend penalties:
Queue size growing for 5+ minutes → -10 points from U
Part count growing for 1+ hour → -15 points from P
Error rate increasing over 30 min → -10 points from S
Trends catch problems earlier than thresholds. A queue size of 15 is fine. A queue size that went from 2 to 15 in 5 minutes is not. This is the signal that Grafana dashboards structurally miss.
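A rough sketch of the trend check, assuming one sample per minute. Treating "growing" as strictly increasing over the window is a simplifying assumption; a production detector would more likely use a smoothed slope.

```python
def is_growing(samples: list[float], window: int) -> bool:
    """True if the metric rose monotonically over the last `window` samples."""
    recent = samples[-window:]
    if len(recent) < window:
        return False  # not enough history to call a trend
    return all(b > a for a, b in zip(recent, recent[1:]))

# Queue size went from 2 to 15 over the last five minutes —
# still a "fine" absolute value, but the trend is the signal.
queue_sizes = [2, 2, 3, 5, 9, 15]

u_penalty = 10 if is_growing(queue_sizes, window=5) else 0
print(u_penalty)  # → 10 points subtracted from U
```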
The Composite Score Formula
composite = (P × 0.30) + (U × 0.30) + (L × 0.20) + (S × 0.15) + (E × 0.05)
But with a critical override: if any dimension scores below 30, the composite is capped at 49 (Critical). This prevents a cluster with broken parts from scoring 70 because replication and queries happen to be healthy.
Example 1: Healthy cluster
P=95, U=92, L=88, S=90, E=85
Score = (95×.30)+(92×.30)+(88×.20)+(90×.15)+(85×.05)
Score = 28.5 + 27.6 + 17.6 + 13.5 + 4.25 = 91.45 → 91
Example 2: Merge backlog forming
P=55, U=92, L=88, S=90, E=85
Score = (55×.30)+(92×.30)+(88×.20)+(90×.15)+(85×.05)
Score = 16.5 + 27.6 + 17.6 + 13.5 + 4.25 = 79.45 → 79
Example 3: Broken parts (critical override)
P=25, U=92, L=88, S=90, E=85
Score = capped at 49 (critical) because P < 30
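The formula and the override fit in a few lines, reproducing the three worked examples. The weights and the cap at 49 come straight from the post; the function shape is an assumption, and `int()` truncation matches the post's 91.45 → 91 behavior.

```python
WEIGHTS = {"P": 0.30, "U": 0.30, "L": 0.20, "S": 0.15, "E": 0.05}

def composite(scores: dict[str, float]) -> int:
    raw = sum(scores[d] * w for d, w in WEIGHTS.items())
    if min(scores.values()) < 30:
        return min(int(raw), 49)  # critical override: cap at 49
    return int(raw)               # truncate, matching 91.45 → 91

print(composite({"P": 95, "U": 92, "L": 88, "S": 90, "E": 85}))  # → 91
print(composite({"P": 55, "U": 92, "L": 88, "S": 90, "E": 85}))  # → 79
print(composite({"P": 25, "U": 92, "L": 88, "S": 90, "E": 85}))  # → 49
```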
Design Decisions and Trade-offs
Why Not Use ML/Anomaly Detection?
We considered it. Anomaly detection is great for "I don't know what normal looks like." But ClickHouse operators do know what normal looks like — zero broken parts, replication lag under 30 seconds, merge queue not growing. The thresholds are well-understood.
ML adds complexity, requires training data, produces false positives, and makes the score unexplainable. A human should be able to look at the score, understand why it dropped, and know what to fix. Weighted thresholds are transparent. ML is a black box.
Why Not Just Alert on Individual Metrics?
Because alert fatigue. A team with 15 replicated tables across 3 nodes, monitoring 50 metrics each, would need 2,250 alert rules. Even with careful thresholds, that's 5-10 alerts per day during normal operations.
The health score is a single signal. It drops → investigate. It doesn't drop → don't investigate. The dimension breakdown tells you where to look.
Why Weight Efficiency So Low?
Early versions weighted Efficiency at 15%. This caused the health score to drop every time a team added a new table with default codecs (before they optimized compression). Teams would see their score drop from 92 to 78 and investigate — only to find it was just compression ratios on a new table. False urgency erodes trust in the score.
5% keeps Efficiency visible without crying wolf. If a team's compression is terrible, the E dimension shows it. But it doesn't drag the composite score into "Warning" territory for something that isn't urgent.
Per-Table vs Per-Cluster Scoring
ClusterSight scores at both levels:
- Per-table PULSE score — useful for teams that own specific tables
- Cluster-wide PULSE score — the minimum per-table score across all tables
The cluster score uses the minimum (not average) because a cluster is only as healthy as its sickest table. One table with 312 parts per partition affects the whole cluster's insert path.
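The min-vs-average choice is easy to see with made-up numbers (table names here are purely illustrative):

```python
table_scores = {
    "events":     91,
    "sessions":   88,
    "page_views": 54,  # e.g. 312 parts per partition drags this table down
}

# Minimum, not average: the cluster is only as healthy as its sickest table.
cluster_score = min(table_scores.values())
print(cluster_score)  # → 54, not the ~78 an average would report
```

An average would hide the sick table behind the healthy ones; the minimum guarantees that a single degraded table surfaces at the cluster level.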
What the Score Doesn't Capture
No scoring system is perfect. Things the PULSE score deliberately excludes:
- Query correctness — The score can't know if your SQL returns the right business answer
- Schema design quality — Poor partition key choices don't affect the score until they cause symptoms
- Capacity planning — The score measures current health, not "will you run out of disk in 2 weeks"
- External dependencies — If your ingestion pipeline is broken upstream, the cluster looks healthy because nothing is arriving to cause problems
These are valid monitoring concerns but outside the scope of operational health scoring.
Try It
Deploy ClusterSight and see your cluster's PULSE score in under 8 minutes. The score updates every 60 seconds with full dimension breakdown. When it drops, you'll know exactly which dimension changed and what to investigate.
Check your cluster's PULSE.
This post is part of the Build in Public series — transparent documentation of building ClusterSight.
Frequently Asked Questions
What is ClusterSight's health score?
ClusterSight's health score is a composite 0-100 number that weights 400+ ClickHouse system table metrics across five PULSE dimensions: Parts, Uptime, Latency, Stability, and Efficiency. 90-100 is healthy, 70-89 is warning, 50-69 is degraded, and below 50 is critical.
How does ClusterSight calculate the health score?
Each PULSE dimension (Parts, Uptime, Latency, Stability, Efficiency) gets an individual 0-100 score based on its metrics. The composite score is a weighted average, with Parts and Uptime weighted highest (30% each) because they indicate data integrity issues. Critical violations (like broken parts) apply hard caps: any dimension below 30 caps the composite at 49.
What makes a good health score for ClickHouse?
A PULSE score above 90 means the cluster is healthy across all dimensions. Scores of 70-89 indicate issues in one or two dimensions that should be investigated within 24 hours. Below 70 means active problems affecting data integrity or performance.