Identifying and Resolving Hidden ClickHouse Bottlenecks: A Step-by-Step Guide

Introduction

Even when all the usual suspects look clean — I/O, memory, rows scanned, parts read — a ClickHouse query can still crawl. At Cloudflare, our billing pipeline, which processes hundreds of millions of dollars in usage revenue, suddenly slowed after a routine migration. The culprit turned out to be a hidden bottleneck buried deep inside ClickHouse internals. This guide walks you through the same diagnostic and resolution process we used, so you can detect and fix similar issues before they affect your critical pipelines.

Source: blog.cloudflare.com

Note: This guide assumes intermediate knowledge of ClickHouse. For absolute beginners, review the official documentation first.

What You Need

Step-by-Step Process

Step 1: Recognize the Symptoms

Your pipeline has suddenly become slow, and the problem appears after a migration or configuration change. Typical signs:

In our case, the billing pipeline timing became erratic, and invoices became increasingly difficult to reconcile.

Step 2: Check the Usual Suspects

Start with the metrics that normally pinpoint a slowdown: I/O, memory usage, rows scanned, and parts read.

In our scenario, all these metrics appeared normal. I/O was low, memory was fine, rows scanned hadn't increased, and parts read were stable. This told us the bottleneck was internal — something deeper in the query execution engine.
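
A minimal sketch of this first pass, assuming a recent ClickHouse release where system.query_log exposes ProfileEvents as a map (older releases use different column shapes, so verify against your version). The query text is meant to be run with whichever client you already use. The helper changed_metrics and the sample numbers are hypothetical and only illustrate the comparison: duration moves, everything else stays flat.

```python
# Query text only: run it with whichever ClickHouse client you already use.
# Column names follow system.query_log in recent releases (verify on yours).
USUAL_SUSPECTS_SQL = """
SELECT
    query_id,
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage,
    ProfileEvents['SelectedParts'] AS parts_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY query_duration_ms DESC
LIMIT 20
"""


def changed_metrics(before, after, tolerance=0.10):
    """Return the metrics whose relative change exceeds `tolerance`."""
    changed = {}
    for name, old in before.items():
        new = after.get(name, 0)
        if old == 0:
            ratio = float("inf") if new else 0.0
        else:
            ratio = abs(new - old) / old
        if ratio > tolerance:
            changed[name] = (old, new)
    return changed


# Hypothetical numbers for one query before and after a migration.
before = {
    "query_duration_ms": 4000,
    "read_rows": 1_000_000,
    "read_bytes": 8_000_000_000,
    "memory_usage": 2_000_000_000,
    "parts_read": 120,
}
after = {
    "query_duration_ms": 12000,
    "read_rows": 1_000_000,
    "read_bytes": 8_100_000_000,
    "memory_usage": 2_050_000_000,
    "parts_read": 120,
}

print(changed_metrics(before, after))
```

If only the duration shows up in the result, the usual suspects are clean and the slowdown has to come from somewhere the standard metrics do not cover.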

Step 3: Dig Deeper – Profile Internal Events

When normal checks fail, turn to the system.events table. Look for unusual values in low-level counters, especially those that count individual read operations and system calls.

We noticed a spike in the number of small read operations. Despite reading the same total bytes, ClickHouse was performing many more individual system calls. This pointed to a contention issue inside the ReadBuffer layer — the code responsible for reading data from disk.
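
One way to make "many small reads" concrete is to divide bytes read by the number of read calls. The sketch below assumes the counter names ReadBufferFromFileDescriptorRead and ReadBufferFromFileDescriptorReadBytes, which you should confirm exist in your version's system.events. Counters in that table are cumulative since server start, so in practice you would feed in the difference between two snapshots. The helper average_read_size and the sample values are hypothetical.

```python
# Query text only: the two event names are assumptions to verify on your version.
READ_EVENTS_SQL = """
SELECT event, value
FROM system.events
WHERE event IN ('ReadBufferFromFileDescriptorRead',
                'ReadBufferFromFileDescriptorReadBytes')
"""


def average_read_size(counters):
    """Average bytes returned per read call; 0.0 when no reads were recorded."""
    reads = counters.get("ReadBufferFromFileDescriptorRead", 0)
    total = counters.get("ReadBufferFromFileDescriptorReadBytes", 0)
    return total / reads if reads else 0.0


# Hypothetical deltas for the same query: identical bytes, very different call counts.
fast_run = {
    "ReadBufferFromFileDescriptorRead": 8_192,
    "ReadBufferFromFileDescriptorReadBytes": 8 * 1024**3,
}
slow_run = {
    "ReadBufferFromFileDescriptorRead": 524_288,
    "ReadBufferFromFileDescriptorReadBytes": 8 * 1024**3,
}

print(average_read_size(fast_run))  # bytes per read in the fast run
print(average_read_size(slow_run))  # bytes per read in the slow run
```

A sharp drop in the average while total bytes stay the same is the pattern described above.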

Step 4: Identify the Root Cause

Compare the system events between fast and slow runs (or before and after a migration). Look for events where the count increased dramatically while the total bytes remained constant. In our case, we found that a change in the way ClickHouse prefetches data had introduced a global mutex lock inside the ReadBufferFromFileDescriptor class. Normally, each thread has its own read buffer; after the migration, multiple threads were contending for a single buffer, causing severe serialization.
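
The comparison described above can be sketched as a small diff over two counter snapshots. This is an illustration under assumptions, not the tooling we used: suspicious_counters is a hypothetical helper, the caller supplies which count counter pairs with which bytes counter, and the thresholds are arbitrary starting points.

```python
def suspicious_counters(fast, slow, pairs, min_growth=5.0, bytes_tolerance=0.05):
    """Flag count counters that grew sharply while their bytes counter stayed flat.

    `pairs` is a list of (count_event, bytes_event) names chosen by the caller.
    Returns a list of (count_event, growth_factor).
    """
    findings = []
    for count_name, bytes_name in pairs:
        c_fast, c_slow = fast.get(count_name, 0), slow.get(count_name, 0)
        b_fast, b_slow = fast.get(bytes_name, 0), slow.get(bytes_name, 0)
        if c_fast == 0 or b_fast == 0:
            continue  # nothing to compare against
        growth = c_slow / c_fast
        bytes_drift = abs(b_slow - b_fast) / b_fast
        if growth >= min_growth and bytes_drift <= bytes_tolerance:
            findings.append((count_name, growth))
    return findings


# Hypothetical snapshots taken around a fast run and a slow run.
fast = {
    "ReadBufferFromFileDescriptorRead": 8_192,
    "ReadBufferFromFileDescriptorReadBytes": 8 * 1024**3,
    "WriteBufferFromFileDescriptorWrite": 100,
    "WriteBufferFromFileDescriptorWriteBytes": 1_000,
}
slow = {
    "ReadBufferFromFileDescriptorRead": 524_288,
    "ReadBufferFromFileDescriptorReadBytes": 8 * 1024**3,
    "WriteBufferFromFileDescriptorWrite": 110,
    "WriteBufferFromFileDescriptorWriteBytes": 1_100,
}
pairs = [
    ("ReadBufferFromFileDescriptorRead", "ReadBufferFromFileDescriptorReadBytes"),
    ("WriteBufferFromFileDescriptorWrite", "WriteBufferFromFileDescriptorWriteBytes"),
]

print(suspicious_counters(fast, slow, pairs))
```

Only the read counter is reported here, because its call count grew by a large factor while the bytes moved did not change.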

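Check your ClickHouse version’s changelog for any changes to read prefetch logic. If you suspect a similar mutex issue, you can confirm it by running perf top to see whether pthread_mutex_lock appears prominently during query execution, or strace to see whether futex system calls dominate. strace traces system calls only, so a contended lock shows up there as futex rather than under its library function name.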
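
If you capture a summary with strace -c, a few lines of Python can rank system calls by share of time. A contended mutex usually surfaces in that summary as futex calls. The sketch below assumes the usual column layout of the strace -c table (percent time, seconds, usecs/call, calls, optional errors, syscall name). The sample text is hypothetical, not output from our systems.

```python
def syscall_shares(strace_summary):
    """Parse `strace -c` summary text into {syscall: (percent_time, calls)}."""
    shares = {}
    for line in strace_summary.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[-1] == "total":
            continue
        try:
            percent, calls = float(parts[0]), int(parts[3])
        except ValueError:
            continue  # header or separator row
        shares[parts[-1]] = (percent, calls)
    return shares


# Hypothetical summary in the layout printed by `strace -c`.
SAMPLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.50    0.500000          50     10000      1200 futex
 25.00    0.200000          20     10000           pread64
 12.50    0.100000         100      1000           read
------ ----------- ----------- --------- --------- ----------------
100.00    0.800000                 21000      1200 total
"""

shares = syscall_shares(SAMPLE)
top = max(shares, key=lambda name: shares[name][0])
print(top, shares[top])
```

When futex takes more time than the read calls themselves, threads are spending more time waiting on locks than reading data.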

Step 5: Implement the Fixes

We wrote three patches to resolve the bottleneck. Depending on your exact issue, you may need to adapt them:

  1. Remove the global mutex in ReadBuffer. Replace it with a per-thread buffer allocation so that prefetch threads don’t compete for the same resource.
  2. Adjust prefetch size. The default prefetch amount was too small, causing many tiny reads. Increase it using the setting max_read_buffer_size (e.g., to 2 MB).
  3. Optimize async reads. Improve coordination between the main read thread and prefetch threads to reduce context switching.

Always test these changes in a staging environment first. Our patches increased query throughput by over 400% for the affected workloads.
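
To see why the read size matters so much, it helps to count the read calls needed for a fixed amount of data. This is back-of-the-envelope arithmetic, not a model of ClickHouse internals: real read counts also depend on compression, granule layout, and caching. The 16 KiB figure is a hypothetical "too small" read size, and reads_needed is an illustrative helper. The SET statement shows how max_read_buffer_size, which takes a value in bytes, would be raised to 2 MiB for a session.

```python
def reads_needed(total_bytes, buffer_bytes):
    """Minimum number of read calls to fetch total_bytes in buffer_bytes chunks."""
    if buffer_bytes <= 0:
        raise ValueError("buffer_bytes must be positive")
    return -(-total_bytes // buffer_bytes)  # ceiling division


GIB = 1024**3
total = 8 * GIB

small_reads = reads_needed(total, 16 * 1024)    # hypothetical 16 KiB reads
large_reads = reads_needed(total, 2 * 1024**2)  # 2 MiB reads

print(small_reads, large_reads, small_reads // large_reads)

# Session-level change; the setting is expressed in bytes (2 MiB here).
SET_BUFFER_SQL = "SET max_read_buffer_size = 2097152"
```

Every read call that goes through a contended lock pays the serialization cost, so cutting the number of calls reduces the contention even before the lock itself is removed.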

Step 6: Validate and Monitor

After applying the changes:

We saw immediate improvement: daily aggregation jobs returned to normal, and the billing pipeline cleared its backlog within 24 hours.
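
A simple guard for ongoing monitoring is to compare the median duration of recent runs against a baseline recorded before the regression. The sketch below is one possible check, with a hypothetical helper name, a 10% tolerance chosen arbitrarily, and made-up durations in milliseconds.

```python
from statistics import median


def has_recovered(baseline_ms, current_ms, tolerance=0.10):
    """True when the current median duration is within `tolerance` of the baseline median."""
    return median(current_ms) <= median(baseline_ms) * (1 + tolerance)


# Hypothetical job durations in milliseconds.
baseline = [4000, 4200, 3900]
during_regression = [12000, 11800, 12500]
after_fix = [4100, 4000, 4300]

print(has_recovered(baseline, during_regression))
print(has_recovered(baseline, after_fix))
```

Running the same check on a schedule turns a one-off validation into an alert for the next hidden regression.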

Tips and Best Practices

Hidden bottlenecks are rare but devastating. By systematically profiling internal ClickHouse events, you can uncover issues that traditional monitoring misses and keep your pipelines fast and reliable.
