2026-06-01 · 7 min read

Why system.errors Has 40 Error Types You've Never Checked

ClickHouse has been tracking every error your cluster encounters since its last restart. It stores them in system.errors — a table with 40+ error types that most teams have never queried. These aren't query errors that users report. They're infrastructure errors: network failures between nodes, ZooKeeper session drops, memory limit violations, storage corruption signals. The errors that predict incidents hours before they happen.

This is the fourth post in the Operator's Blind Spot series. We've covered replication drift, merge backlogs, and stuck mutations. This one covers the error signals that tie all of them together.

The Table Nobody Queries

SELECT
    name,
    code,
    value AS error_count,
    last_error_time,
    last_error_message
FROM system.errors
ORDER BY last_error_time DESC;

Run this on any production cluster. You'll find errors you didn't know were happening. Errors that have been accumulating for weeks. Errors that explain the "random" performance degradation your team has been chasing.

The reason nobody queries this table: system.errors doesn't trigger alerts. It doesn't show up in Prometheus exports. Grafana doesn't know it exists. It just sits there, quietly accumulating evidence of problems.
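If you want these counters in your dashboards anyway, one option is to scrape the table yourself. Below is a minimal sketch of the formatting half: turning system.errors rows into Prometheus text exposition format. The metric name and the sample rows are illustrative; actually fetching the rows (via the HTTP interface, clickhouse-client, or a driver) depends on your setup and is left as an assumption.

```python
def errors_to_prometheus(rows):
    """Format system.errors rows as Prometheus text-exposition metrics.

    rows: iterable of (name, code, value) tuples, e.g. from
    SELECT name, code, value FROM system.errors WHERE value > 0
    """
    lines = [
        "# HELP clickhouse_error_total Cumulative error count since last restart",
        "# TYPE clickhouse_error_total counter",
    ]
    for name, code, value in rows:
        # One time series per error type, labeled by name and code.
        lines.append(
            f'clickhouse_error_total{{name="{name}",code="{code}"}} {value}'
        )
    return "\n".join(lines) + "\n"

# Hypothetical rows for illustration:
sample = [("NETWORK_ERROR", 210, 17), ("TOO_MANY_PARTS", 252, 3)]
print(errors_to_prometheus(sample))
```

Serve that string from any HTTP endpoint on a scrape interval and Prometheus (and therefore Grafana) can finally see the table.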

The Error Types That Matter

Not all 40+ error types are equally important. Here's the operational triage:

Critical (Alert Immediately)

  • KEEPER_EXCEPTION (code 999) — ZooKeeper/Keeper communication failure. Replication will stall; read-only replicas are likely.
  • TOO_MANY_PARTS (code 252) — Part count exceeded the threshold. Inserts are failing or about to fail.
  • CANNOT_READ_ALL_DATA (code 33) — Potential data corruption. Parts may be broken or incomplete.
  • MEMORY_LIMIT_EXCEEDED (code 241) — Queries or merges are hitting memory limits. OOM errors imminent.
  • NETWORK_ERROR (code 210) — Node-to-node communication failure. Distributed queries and replication affected.

Warning (Investigate Within Hours)

  • SOCKET_TIMEOUT (code 209) — Network latency between nodes. Replication slowing down.
  • CANNOT_WRITE_TO_FILE_DESCRIPTOR (code 28) — Disk write failures. Check disk health and space.
  • ALL_REPLICAS_ARE_STALE (code 508) — Distributed queries can't find a fresh replica. Data freshness problem.
  • NO_REPLICA_HAS_PART (code 234) — A required part doesn't exist on any replica. Data loss risk.
  • CHECKSUM_DOESNT_MATCH (code 40) — Data integrity mismatch during replication. Corruption signal.

Informational (Review Weekly)

  • AUTHENTICATION_FAILED (code 516) — Failed login attempts. Could be misconfigured clients or security probing.
  • UNKNOWN_TABLE (code 60) — Queries referencing tables that don't exist. Application misconfiguration.
  • SYNTAX_ERROR (code 62) — Malformed queries. Client-side bug.
  • QUERY_WAS_CANCELLED (code 394) — Queries killed by timeout or manual cancellation. May indicate slow-query patterns.

The Query That Catches Everything

Error Trending Over Time

system.errors only shows cumulative counts since the last restart. To detect trends, snapshot the table periodically and compare snapshots:

-- Current error state (snapshot this every 5 minutes)
SELECT
    name,
    code,
    value AS count,
    last_error_time,
    substring(last_error_message, 1, 200) AS last_message,
    now() AS snapshot_time
FROM system.errors
WHERE value > 0
ORDER BY value DESC;
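Because the counters are cumulative, the useful signal is the delta between consecutive snapshots. Here is a minimal sketch of the comparison step, assuming each snapshot has already been loaded into a dict of {error_name: cumulative_count}:

```python
def error_deltas(previous, current):
    """Compare two {error_name: cumulative_count} snapshots of system.errors.

    Returns only errors whose count grew between snapshots. A count that
    *dropped* means the server restarted and the counters reset, so the
    current value is treated as the delta in that case.
    """
    deltas = {}
    for name, count in current.items():
        prev = previous.get(name, 0)
        delta = count - prev if count >= prev else count  # reset on restart
        if delta > 0:
            deltas[name] = delta
    return deltas

prev = {"NETWORK_ERROR": 10, "SOCKET_TIMEOUT": 4}
curr = {"NETWORK_ERROR": 25, "SOCKET_TIMEOUT": 4, "KEEPER_EXCEPTION": 2}
print(error_deltas(prev, curr))  # {'NETWORK_ERROR': 15, 'KEEPER_EXCEPTION': 2}
```

An empty result between snapshots means nothing new happened in that interval, regardless of how large the cumulative counts are.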

Recent Error Activity

-- Errors that occurred in the last hour
SELECT
    name,
    code,
    value AS total_count,
    last_error_time,
    dateDiff('minute', last_error_time, now()) AS minutes_ago,
    substring(last_error_message, 1, 200) AS last_message
FROM system.errors
WHERE last_error_time > now() - INTERVAL 1 HOUR
ORDER BY last_error_time DESC;

If this returns results, something went wrong in the last hour. If KEEPER_EXCEPTION or NETWORK_ERROR appear, check replication immediately.

Critical Error Detection

-- Errors that require immediate attention
SELECT
    name,
    code,
    value AS count,
    last_error_time,
    last_error_message
FROM system.errors
WHERE code IN (
    999,  -- KEEPER_EXCEPTION
    252,  -- TOO_MANY_PARTS
    33,   -- CANNOT_READ_ALL_DATA
    241,  -- MEMORY_LIMIT_EXCEEDED
    210,  -- NETWORK_ERROR
    40,   -- CHECKSUM_DOESNT_MATCH
    234   -- NO_REPLICA_HAS_PART
)
AND value > 0
ORDER BY last_error_time DESC;

Alert threshold: Any result from this query needs investigation.

Error Patterns That Predict Incidents

Pattern 1: KEEPER_EXCEPTION → Read-Only Replicas

A spike in KEEPER_EXCEPTION errors precedes read-only replica state by 2-10 minutes. The ZooKeeper session degrades before it fails completely. If you catch the error trend, you can investigate ZooKeeper health before replicas go read-only.

-- Early warning for ZooKeeper issues
SELECT
    name, value, last_error_time
FROM system.errors
WHERE name LIKE '%KEEPER%'  -- also matches ZOOKEEPER-prefixed error names
ORDER BY last_error_time DESC;

Pattern 2: NETWORK_ERROR → Replication Lag

NETWORK_ERROR between nodes means part fetches are failing. Replication lag follows within minutes. The error appears in system.errors before absolute_delay increases in system.replicas.

Pattern 3: MEMORY_LIMIT_EXCEEDED → OOM Cascade

Individual MEMORY_LIMIT_EXCEEDED errors are normal — ClickHouse correctly kills oversized queries. But a growing count means queries are getting heavier or available memory is shrinking. The cascade: memory pressure → merge slowdowns → part accumulation → insert failures.
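Since some MEMORY_LIMIT_EXCEEDED errors are expected, alerting on presence produces noise; the signal is the rate of increase. A sketch of rate-based alerting over periodic snapshots — the window size and threshold here are illustrative, not recommendations:

```python
def memory_error_rate_alert(samples, window=3, per_minute_threshold=2.0):
    """Alert when MEMORY_LIMIT_EXCEEDED errors accumulate too fast.

    samples: list of (minutes_since_start, cumulative_count), oldest first.
    Fires when the average errors-per-minute over the last `window`
    samples exceeds the threshold (both values are illustrative).
    """
    if len(samples) < window:
        return False
    (t0, c0), (t1, c1) = samples[-window], samples[-1]
    if t1 == t0:
        return False
    return (c1 - c0) / (t1 - t0) > per_minute_threshold

# Cumulative MEMORY_LIMIT_EXCEEDED counts sampled every 5 minutes:
samples = [(0, 100), (5, 102), (10, 130), (15, 180)]
print(memory_error_rate_alert(samples))  # True: 78 errors in 10 minutes
```

A steady baseline stays below the threshold; the cascade described above shows up as a sustained rate increase well before inserts start failing.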

Pattern 4: CHECKSUM_DOESNT_MATCH → Silent Corruption

Even a single CHECKSUM_DOESNT_MATCH error is significant. It means data transferred between replicas didn't match the expected checksum. This could be a transient network error (benign) or actual data corruption (critical). Always investigate.

Combining system.errors With Other System Tables

The real power of system.errors is correlating it with other monitoring signals:

-- Correlate errors with replication state
SELECT
    e.name AS error_name,
    e.value AS error_count,
    e.last_error_time,
    r.database,
    r.table,
    r.absolute_delay,
    r.queue_size,
    r.is_readonly
FROM system.errors e
CROSS JOIN system.replicas r
WHERE e.name IN ('KEEPER_EXCEPTION', 'NETWORK_ERROR')
  AND e.value > 0
  AND (r.absolute_delay > 30 OR r.is_readonly = 1 OR r.queue_size > 20)
ORDER BY r.absolute_delay DESC;

If this returns results, errors and replication degradation are happening simultaneously — likely related.

How This Maps to PULSE

Error monitoring is the S (Stability) dimension of the PULSE Framework. But errors often signal problems in other dimensions:

  • KEEPER_EXCEPTION / NETWORK_ERROR → U (Uptime) — replication at risk
  • TOO_MANY_PARTS → P (Parts) — merge backlog critical
  • MEMORY_LIMIT_EXCEEDED → L (Latency) — query performance degrading

ClusterSight monitors all error types in system.errors every 60 seconds:

  • Error trending — alerts on rate of increase, not just presence
  • Cross-dimension correlation — links errors to replication, merge, and query degradation automatically
  • Critical error alerting — instant notification for corruption signals, ZooKeeper failures, and part count violations
  • Error history — tracks error patterns over time since system.errors resets on restart

Check your cluster's PULSE.


This post is part of the Operator's Blind Spot series. Previously: Your Replicas Are Lying, 312 Parts Per Partition, and The Mutation That Ran for 72 Hours.

Frequently Asked Questions

What is system.errors in ClickHouse?

system.errors is a system table that accumulates error counts by type since the last server restart. It tracks 40+ distinct error codes including network errors, ZooKeeper failures, memory limit violations, and storage errors — most of which never appear in standard monitoring.

How do I check for errors in ClickHouse?

Query SELECT name, code, value, last_error_time, last_error_message FROM system.errors ORDER BY last_error_time DESC. This shows all error types with their counts and most recent occurrence.

What ClickHouse errors should I monitor?

Critical errors to monitor: NETWORK_ERROR (replication/distributed query issues), KEEPER_EXCEPTION (ZooKeeper failures), MEMORY_LIMIT_EXCEEDED (OOM risk), TOO_MANY_PARTS (insert failures imminent), and CANNOT_READ_ALL_DATA (possible data corruption). Alert on any increase in these counters.

Does Grafana monitor ClickHouse system.errors?

No. Standard Grafana + Prometheus setups do not query system.errors. This table requires direct SQL queries against ClickHouse. Purpose-built tools like ClusterSight monitor all error types automatically.