2026-06-22 · 8 min read

First 10 Users: What ClickHouse Operators Actually Want to Monitor

Ten production clusters. Three continents. Deployments ranging from 3-node dev setups to a 50-node analytics fleet. After three months of real-world use, we know what ClickHouse operators actually care about — and it's not what we expected.

This is the third post in the Build in Public series. The first explained why we built ClusterSight. The second covered how we designed the health score. The Month 3 retrospective covered what shipped and what broke. This one covers what we learned from the people using it.

The Users

Without naming companies, here's who deployed ClusterSight in the first three months:

| # | Cluster Size | Industry | Primary Use Case |
|---|---|---|---|
| 1 | 6 nodes | Fintech | Event analytics, real-time dashboards |
| 2 | 3 nodes | SaaS startup | Product analytics |
| 3 | 12 nodes | Ad tech | Click stream processing, 500M events/day |
| 4 | 50 nodes | E-commerce | Search analytics, recommendation data |
| 5 | 8 nodes | Gaming | Player behavior analytics |
| 6 | 3 nodes | DevOps tooling | Log aggregation |
| 7 | 20 nodes | Media | Content engagement metrics |
| 8 | 6 nodes | Healthcare | Clinical data warehousing |
| 9 | 15 nodes | Telecom | Network event processing |
| 10 | 4 nodes | Education | Learning analytics platform |

What They Use Most

#1: The Health Score (Every Single User)

Every user checks the PULSE score at least once daily. Most have it on a shared screen or Slack channel. The score's value isn't precision — it's attention management. A score of 95 means "don't look at ClickHouse today." A score of 78 means "spend 10 minutes investigating."

User #4 (50-node e-commerce): "We used to have a 30-minute ClickHouse check every morning. Now someone glances at the score. If it's green, we skip the check."

#2: Slack Alerts With Fix Commands (9 of 10 Users)

The copy-pasteable fix command is the feature that converted trial users to regular users. Not the alert itself — the command attached to it.

What we expected: Users would read the alert, investigate the root cause, then craft a fix.

What actually happens: User sees alert → copies the command → runs it → moves on. The investigation happens only if the same alert fires again within 24 hours.

This changed how we write alert messages. Each alert now leads with the fix command, then explains the context. Not the other way around.
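The 24-hour re-fire window is the only thing that triggers a real investigation, and it's simple to express. A minimal sketch of that dedup rule — the `needs_investigation` helper and alert keys are hypothetical, not ClusterSight's actual API:

```python
from datetime import datetime, timedelta

# A re-fire within 24 hours suggests the copy-pasted command
# treated a symptom, not the root cause.
REFIRE_WINDOW = timedelta(hours=24)
_last_fired: dict[str, datetime] = {}

def needs_investigation(alert_key: str, fired_at: datetime) -> bool:
    """Return True when the same alert re-fires within the window."""
    previous = _last_fired.get(alert_key)
    _last_fired[alert_key] = fired_at
    return previous is not None and fired_at - previous < REFIRE_WINDOW
```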

#3: Replication Consistency Checker (7 of 10 Users)

The cross-replica part count comparison — the foundation of "Your Replicas Are Lying" — is the most valued deep feature. Three users discovered replication drift they didn't know they had within the first day of deployment.

User #1 (fintech): Found a 3-hour data gap on one replica that had been serving stale data to their risk calculation pipeline. The gap had existed for 2 weeks.
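The core of the check is unglamorous: run the same part count query against each replica and compare. A sketch under assumptions — `system.parts` is a real ClickHouse system table, but the query scope and the `find_drift` helper are illustrative, and real deployments would also account for parts that are mid-merge:

```python
# Run against each replica individually, then compare the results.
PART_COUNT_SQL = """
SELECT table, count() AS active_parts
FROM system.parts
WHERE active AND database = 'default'
GROUP BY table
"""

def find_drift(counts_by_replica: dict[str, dict[str, int]]) -> dict[str, set[str]]:
    """Map each table to the replicas whose active part count
    disagrees with the majority -- a hint that a replica is lying."""
    drift: dict[str, set[str]] = {}
    tables = {t for counts in counts_by_replica.values() for t in counts}
    for table in tables:
        per_replica = {r: c.get(table, 0) for r, c in counts_by_replica.items()}
        values = list(per_replica.values())
        majority = max(set(values), key=values.count)
        outliers = {r for r, n in per_replica.items() if n != majority}
        if outliers:
            drift[table] = outliers
    return drift
```

The fintech user's 3-hour gap showed up exactly this way: one replica's count for one table quietly disagreeing with the other two.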

What They Ignore

Detailed System Table Views

We built beautiful breakdowns of system.replicas, system.merges, and system.parts. Column-level detail. Sortable. Filterable.

Almost nobody uses them in day-to-day operations. They're useful during incident investigation — but that's 1% of the time. The other 99%, the health score and alerts are sufficient.

Lesson: Build for the 99% case (passive monitoring) and make the 1% case (active investigation) accessible but not prominent.

Historical Query Analysis

We added system.query_log analysis showing slowest queries, most frequent patterns, and resource consumption. Three users checked it once during setup. None use it regularly through ClusterSight — they already have their own query monitoring (usually through application-level metrics).

Lesson: Don't compete with what teams already have. ClickHouse operational health monitoring is our lane. Query performance monitoring is a different product.

Per-Column Compression Views

The system.parts_columns analysis showing column-level compression ratios and codec recommendations. Fascinating to engineers, rarely acted on.

Lesson: Compression optimization is a one-time activity, not an ongoing monitoring need. It belongs in an audit tool, not a dashboard.

Surprises

Surprise #1: Mutation Monitoring Was Undervalued

We initially weighted mutation monitoring (the S in PULSE) lower than replication (the U in PULSE). Three users independently reported discovering stuck mutations that had been running for days. One user had a mutation blocking merges on their largest table, causing a slow part count increase that would have hit the danger zone within a week.

Action taken: Increased mutation alert priority. Any mutation older than 30 minutes now gets a dedicated Slack message.
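The 30-minute rule reduces to an age check over `system.mutations` (a real ClickHouse system table). The `stuck_mutations` helper and row shape below are illustrative:

```python
from datetime import datetime, timedelta

# Mutations still running past this age get a dedicated Slack message.
STUCK_MUTATIONS_SQL = """
SELECT database, table, mutation_id, create_time
FROM system.mutations
WHERE NOT is_done
"""

MAX_MUTATION_AGE = timedelta(minutes=30)

def stuck_mutations(rows: list[dict], now: datetime) -> list[str]:
    """Return mutation_ids that have been running past the threshold."""
    return [
        row["mutation_id"]
        for row in rows
        if now - row["create_time"] > MAX_MUTATION_AGE
    ]
```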

Surprise #2: Teams Want Fewer Alerts, Not More

Early versions sent alerts for every threshold crossing. Users immediately asked to reduce alert volume. The common request: "Only alert me if I need to do something in the next 4 hours."

Action taken: Implemented alert severity tiers. Only "action needed" alerts go to Slack. "Awareness" alerts go to the dashboard only. "Informational" data is visible in deep-dive views.
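The tier-to-channel mapping is the whole mechanism. A minimal sketch, with hypothetical names — the real routing presumably carries more metadata:

```python
from enum import Enum

class Severity(Enum):
    ACTION_NEEDED = "action_needed"    # act within ~4 hours
    AWARENESS = "awareness"            # worth knowing, not urgent
    INFORMATIONAL = "informational"    # context for deep dives

def route(severity: Severity) -> str:
    """Only 'action needed' interrupts a human; the rest wait to be read."""
    return {
        Severity.ACTION_NEEDED: "slack",
        Severity.AWARENESS: "dashboard",
        Severity.INFORMATIONAL: "deep_dive",
    }[severity]
```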

Surprise #3: The Score Number Matters More Than the Breakdown

We spent weeks designing the per-dimension breakdown (P: 92, U: 87, L: 95, S: 78, E: 84). Users glance at it when the overall score drops, but otherwise they only care about the single number.

The insight: The score is a binary signal for most users: above 85 = fine, below 85 = investigate. The dimension breakdown is useful for the 5 minutes of investigation, not for the 23 hours and 55 minutes of passive monitoring.
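In other words, users collapse the whole score to a two-branch decision. A sketch of that reading (the function name and the exact handling of a score of exactly 85 are assumptions; users described it simply as "above 85 / below 85"):

```python
def score_action(pulse: int, threshold: int = 85) -> str:
    """The binary reading most users apply to the overall PULSE score."""
    return "fine" if pulse > threshold else "investigate"
```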

What Changed in the Product

Based on 10 users, we made these changes:

1. Alert-First Dashboard

The dashboard now opens to a list of active alerts, not the health score. The score is prominent but secondary. Users' first question is "what needs my attention?" not "what's my score?"

2. Fix Command Prominence

Fix commands moved from the bottom of alert descriptions to the top. Bold. Copy button. The explanation follows.

3. Alert Consolidation

Related alerts are grouped. "Replication lag on 5 tables" instead of 5 separate alerts. Each group has a single "fix all" command when applicable.
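Consolidation is a group-by over the alert kind. A minimal sketch — the alert dict shape and `consolidate` are illustrative, not ClusterSight's actual data model:

```python
from collections import defaultdict

def consolidate(alerts: list[dict]) -> list[dict]:
    """Group alerts by kind so 'replication lag on 5 tables'
    arrives as one message instead of five."""
    groups: dict[str, list[str]] = defaultdict(list)
    for alert in alerts:
        groups[alert["kind"]].append(alert["table"])
    return [
        {
            "kind": kind,
            "tables": tables,
            "summary": (
                f"{kind} on {len(tables)} tables"
                if len(tables) > 1
                else f"{kind} on {tables[0]}"
            ),
        }
        for kind, tables in groups.items()
    ]
```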

4. Quiet Hours

Users can set quiet hours when only critical alerts (broken parts, read-only replicas) fire. Non-critical alerts accumulate and are delivered as a morning summary.
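The delivery decision is two conditions: is the alert critical, and is the clock inside the quiet window. A sketch with assumed defaults (the 22:00–07:00 window and the critical-kind names are illustrative; the window wraps midnight):

```python
from datetime import time

# Kinds that page even during quiet hours.
CRITICAL = {"broken_parts", "read_only_replica"}

def deliver_now(
    kind: str,
    at: time,
    quiet_start: time = time(22, 0),
    quiet_end: time = time(7, 0),
) -> bool:
    """True if the alert should fire immediately; False means it
    accumulates into the morning summary."""
    in_quiet = at >= quiet_start or at < quiet_end
    return kind in CRITICAL or not in_quiet
```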

What's Next

The first 10 users shaped the product more than 3 months of internal development. The roadmap is now driven by their feedback:

  1. Historical trending (every user wants this) — PULSE score over time with event correlation
  2. Runbook integration — connect alerts to internal runbooks, not just fix commands
  3. Multi-cluster comparison — User #4 has 5 ClickHouse clusters and wants one view
  4. API access — Users want to pull PULSE scores into their own dashboards and CI/CD pipelines

Try It

Deploy ClusterSight on your cluster and see your PULSE score in under 8 minutes. The first alert usually fires within an hour — and you'll learn something about your cluster you didn't know.

Get started →

Check your cluster's PULSE.


This post is part of the Build in Public series. Previously: Why We Built ClusterSight, Designing Health Scores, and Month 3: Building ClusterSight.


Frequently Asked Questions

What do ClickHouse operators want to monitor most?

Based on ClusterSight's first 10 users: replication consistency (not just lag), part count trends (not just thresholds), mutation completion status, and a single health score. Most operators want 'tell me what's wrong and how to fix it' over detailed system table visibility.

What ClickHouse monitoring features do people actually use?

The most used features are: the PULSE health score (checked daily), Slack alerts with fix commands (acted on immediately), and the replication consistency checker. The least used: detailed system table breakdowns, historical query analysis, and per-column compression views.

How do teams adopt ClickHouse monitoring?

Typical adoption path: deploy ClusterSight, see their PULSE score, investigate the first alert, fix it with the provided command, gain trust. Ongoing usage is primarily passive — glance at score, act on alerts. Deep investigation happens during incidents only.