ClickHouse Monitoring Blog

Guides, deep dives, and troubleshooting for ClickHouse teams.

First 10 Users: What ClickHouse Operators Actually Want to Monitor

What we learned from ClusterSight's first 10 production deployments. The features operators actually use, what they ignore, and how real-world feedback reshaped the PULSE Framework.

2026-06-22 · 8 min read

Query Performance

Stop Running OPTIMIZE TABLE in Production — Here's Why

OPTIMIZE TABLE FINAL is the most misused command in ClickHouse. It rewrites every part, blocks merges, and can crash your cluster. Learn when it's safe, when it's dangerous, and what to do instead.

2026-06-15 · 7 min read

system.parts_columns: How to Find the Tables Destroying Your Disk Budget

ClickHouse's system.parts_columns reveals exactly which columns in which tables consume the most storage. Find disk waste, oversized columns, and compression opportunities with these SQL queries.

2026-06-15 · 6 min read

PULSE Check: The One Query That Shows If Merges Are Falling Behind

A single SQL query tells you if your ClickHouse merge queue is keeping pace with inserts. Learn the part creation rate vs merge rate check, what the numbers mean, and when to act.

2026-06-08 · 5 min read

Why system.errors Has 40 Error Types You've Never Checked

ClickHouse's system.errors table tracks 40+ error types that predict failures before they happen. Learn which errors matter, how to detect chronic patterns, and the queries that turn error noise into operational signal.

2026-06-01 · 7 min read

Month 3: Building ClusterSight — What We Shipped, What Broke, What's Next

A transparent look at three months of building ClusterSight — the ClickHouse operational health platform. What we shipped, what users actually wanted, lessons learned, and the roadmap ahead.

2026-05-25 · 8 min read

Query Performance

system.query_log: The 15 Fields That Actually Matter for Performance

ClickHouse's system.query_log has 60+ columns. Most are noise. These 15 fields diagnose every performance issue — slow queries, memory spikes, excessive reads, and error patterns. The operational guide to query_log.

2026-05-18 · 10 min read

PULSE Check: Detecting Read-Only Replicas Before Users Notice

ClickHouse replicas go read-only when ZooKeeper sessions fail, but they keep answering queries from stale data. Learn the SQL checks to detect read-only state, understand why load balancers miss it, and prevent silent stale data.

2026-05-11 · 8 min read

The Mutation That Ran for 72 Hours While Every Dashboard Said Green

A ClickHouse mutation stuck for 72 hours served stale data to every query while monitoring showed healthy. Learn how stuck mutations form, why they're invisible to standard monitoring, and how to detect them before they corrupt analytics.

2026-05-04 · 9 min read

Designing Health Scores: How ClusterSight Weights 400+ Metrics Into One Number

A behind-the-scenes look at how ClusterSight's PULSE health score works — how we weight 400+ ClickHouse metrics across five dimensions, handle edge cases, and turn operational complexity into a single 0-100 number.

2026-04-27 · 9 min read

Query Performance

system.merges Decoded: Understanding ClickHouse's Background Engine

A deep reference guide to ClickHouse's system.merges table. Learn what every column means, how to diagnose slow merges, detect I/O contention, and prevent merge backlogs before they cause too-many-parts errors.

2026-04-20 · 9 min read

Grafana + Prometheus Is Not ClickHouse Monitoring

Why Grafana dashboards give a false sense of security for ClickHouse clusters. The 70% of operational metrics Prometheus never scrapes, and what purpose-built monitoring actually looks like.

2026-04-13 · 7 min read

PULSE Check: Is Your ClickHouse Compression Actually Working?

Three SQL queries to verify your ClickHouse compression is effective. Find tables wasting disk with poor compression ratios, wrong codecs, and uncompressed columns eating your storage budget.

2026-04-13 · 6 min read

312 Parts Per Partition: The Merge Backlog Time Bomb Nobody Monitors

ClickHouse part counts silently grow until inserts fail. Learn how merge backlogs form, why standard monitoring misses them, and the SQL queries that catch part count explosions before they become incidents.

2026-04-06 · 9 min read

Why We're Building a ClickHouse Health Tool

The story behind ClusterSight — why standard monitoring tools fail ClickHouse operators, what the PULSE Framework is, and how we're building an operational health platform that catches the 70% of metrics Grafana misses.

2026-03-28 · 7 min read

The Complete Guide to system.replicas: What Every Column Actually Means

A deep reference guide to ClickHouse's system.replicas table. Learn what every column means operationally, which ones matter for production monitoring, and the queries that catch replication problems before they cause incidents.

2026-03-27 · 10 min read

PULSE Check: 5 Queries to Run Before Every ClickHouse Upgrade

Before upgrading ClickHouse, run these 5 diagnostic SQL queries to verify cluster health. Catch broken parts, stuck mutations, replication drift, and merge backlogs before they turn an upgrade into a rollback.

2026-03-26 · 6 min read

Your Replicas Are Lying: How Silent Replication Drift Corrupts Analytics

ClickHouse replicas can fall out of sync without triggering alerts. Learn how silent replication drift corrupts analytics dashboards, why standard monitoring misses it, and how to detect it before stakeholders notice.

2026-03-25 · 10 min read

ClickHouse Observability Tools Compared: ClusterSight vs Grafana vs Datadog vs HyperDX

A practical comparison of ClickHouse observability tools — what each one does well, where each falls short, and how to choose the right one for your team.

2026-02-24 · 3 min read

Query Performance

ClickHouse Query Performance: The Complete Guide

Everything you need to know about diagnosing and fixing slow ClickHouse queries — from system.query_log to EXPLAIN PLAN to schema optimization.

2026-02-24 · 3 min read

ClickHouse Broken Parts: Causes, Detection & Fix

Broken parts in ClickHouse mean data corruption. Learn what causes them, how to detect them with system.parts, and how to fix them.

2026-02-23 · 3 min read

ClickHouse Merge Queue: Why It Grows & How to Fix It

A deep merge queue in ClickHouse leads to too-many-parts errors and insert failures. Learn how to monitor it and keep it healthy.

2026-02-23 · 3 min read

Query Performance

ClickHouse Merges: How They Work and Why They Matter

ClickHouse merges are background operations that consolidate data parts. Understanding how merges work is essential for query performance, storage efficiency, and avoiding 'too many parts' errors.

2026-02-23 · 3 min read

How to Monitor ClickHouse in Production: The Complete Guide (2026)

Learn how to monitor ClickHouse in production using system tables, health metrics, and automated alerts. The definitive guide for SREs and data engineers.

2026-02-23 · 11 min read

ClickHouse Out of Memory Errors: Root Causes & Solutions

ClickHouse OOM errors crash queries and can destabilize a cluster. Learn what causes them, how to detect memory pressure early, and how to configure memory limits.

2026-02-23 · 3 min read

Query Performance

ClickHouse Query Profiling: How to Diagnose Expensive Queries

Use ClickHouse EXPLAIN, query_log, and flame graphs to profile expensive queries and understand exactly where time is being spent.

2026-02-23 · 2 min read

ClickHouse Replication Lag: How to Diagnose & Fix

ClickHouse replication lag means replicas are behind the source. Learn how to detect it, what causes it, and how to fix it with system.replicas.

2026-02-23 · 3 min read

Query Performance

ClickHouse Query Performance: Finding and Fixing Slow Queries

Learn how to identify slow queries in ClickHouse using system.query_log, understand what makes queries expensive, and optimize them.

2026-02-23 · 3 min read

ClickHouse System Tables: The Complete Guide

ClickHouse system tables expose hundreds of operational metrics. Learn which tables matter most for monitoring, performance tuning, and debugging.

2026-02-23 · 3 min read

Understanding ClickHouse's ZooKeeper Dependency

ClickHouse uses ZooKeeper (or ClickHouse Keeper) to coordinate replicated tables. Learn how this dependency works, what fails when ZooKeeper is unhealthy, and how to monitor it.

2026-02-23 · 3 min read