# Monitor SLAs and scale ClickHouse Cloud with clickhousectl and agents

ClickHouse Cloud users can now monitor query-level SLAs and automate infrastructure scaling using the open-source tool clickhousectl and AI agents. The tool enables tagging specific queries, calculating percentile-based latency targets, and automatically investigating or remediating SLA breaches by scaling resources. This workflow allows teams to maintain performance guarantees for critical frontend queries without manual intervention.

ClickHouse Cloud makes it trivial to automatically scale your infrastructure up and down, horizontally or vertically, in response to resource pressure. But sometimes you want to go further and monitor SLAs on specific queries. Perhaps they're the queries fired off by your frontend app, and the user experience degrades when latency exceeds 200 ms.

This guide shows you how to tag queries so you can calculate SLAs, then use [clickhousectl](https://github.com/ClickHouse/clickhousectl) to query and scale ClickHouse Cloud to investigate and fix breaches. You'll also see how you can pass this workflow off to an agent to investigate and remediate for you.

Try out the [runnable example](https://github.com/ClickHouse/examples/tree/main/ai/clickhousectl/agentic-sla-scaling).

## Setup

Install [clickhousectl](https://clickhouse.com/docs/interfaces/cli) and authenticate with an API key:

```bash
curl https://clickhouse.com/cli | sh
clickhousectl cloud auth login --api-key "$CLICKHOUSE_CLOUD_API_KEY" --api-secret "$CLICKHOUSE_CLOUD_API_SECRET"
```

Confirm you can see your services. The first column is the service ID you'll use everywhere else:

```bash
clickhousectl cloud service list
```

## Defining and measuring your SLA

First, you need to define your SLA and know how to measure it.
An SLA is only useful if it's specific: a percentile, a latency target, and the queries it applies to. For a frontend dashboard, that might be "p99 under 200 ms for the queries behind the main view". That's what we'll use for the example here.

The `system.query_log` table records every query a ClickHouse service runs. The trick is to tag your queries so you can easily filter to them. Set [`log_comment`](https://clickhouse.com/docs/operations/settings/settings#log_comment) on the queries you want to track, and they become trivial to isolate later:

```sql
SELECT
    event_type,
    count(),
    avg(value),
    quantile(0.9)(value)
FROM events
WHERE event_type = 'purchase'
  AND event_time > now() - INTERVAL 1 DAY
GROUP BY event_type
SETTINGS log_comment = 'frontend-dashboard';
```

With the queries tagged, you can read them back from the log:

```bash
clickhousectl cloud service query --id "$SERVICE_ID" --query "
  SELECT event_time, query_duration_ms
  FROM clusterAllReplicas(default, system.query_log)
  WHERE type = 'QueryFinish'
    AND log_comment = 'frontend-dashboard'
  ORDER BY event_time DESC
  LIMIT 10"
```

Once you can see them, measuring the SLA is just an aggregation. Compute the p99 latency for exactly that workload over the last five minutes:

```bash
clickhousectl cloud service query --id "$SERVICE_ID" --query "
  SELECT
    toUInt64(quantile(0.99)(query_duration_ms)) AS p99_ms,
    count() AS queries
  FROM clusterAllReplicas(default, system.query_log)
  WHERE event_time > now() - INTERVAL 5 MINUTE
    AND type = 'QueryFinish'
    AND log_comment = 'frontend-dashboard'"
```

## Investigating a breach

A breached SLA tells you that latency went up, but not why. There are two places to look, and they answer different questions. Sometimes it's a simple case of CPU or memory being over-utilised. Other times the hardware stats look fine, and you need to dig a little deeper into what's going on inside the database.
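Before looking for causes, it helps to pin down what counts as a breach. The following is a minimal sketch, not part of clickhousectl: `sla_status` is a hypothetical helper that takes the `p99_ms` and `queries` values from the aggregation above and classifies the window. The 200 ms target comes from the example SLA; the minimum of 20 queries per window is an assumption, added so that a near-empty window is not judged on a handful of samples.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one measurement window against the SLA.
# Arguments: p99 latency in ms (integer), number of queries in the window.
SLA_P99_MS=200    # target from the example SLA: p99 under 200 ms
MIN_QUERIES=20    # assumed minimum sample size; tune for your traffic

sla_status() {
  local p99_ms="$1" queries="$2"
  if (( queries < MIN_QUERIES )); then
    echo "insufficient-data"
  elif (( p99_ms >= SLA_P99_MS )); then
    echo "breach"
  else
    echo "ok"
  fi
}

sla_status 143 512   # prints "ok"
sla_status 287 498   # prints "breach"
sla_status 950 3     # prints "insufficient-data"
```

Treating a sparse window as a separate state keeps a single slow query during a quiet period from triggering an investigation or a scaling action.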
### Inside the database

The first signal lives in ClickHouse itself. `system.query_log` doesn't just help you with the SLA query; you can ask questions about everything else that ran alongside it, too. That helps you understand whether something about the workload is changing. Bucketing volume and latency by minute is a good place to start:

```bash
clickhousectl cloud service query --id "$SERVICE_ID" --query "
  SELECT
    toStartOfMinute(event_time) AS minute,
    count() AS queries,
    toUInt64(quantile(0.99)(query_duration_ms)) AS p99_ms
  FROM clusterAllReplicas(default, system.query_log)
  WHERE event_time > now() - INTERVAL 30 MINUTE
    AND type = 'QueryFinish'
    AND log_comment = 'frontend-dashboard'
  GROUP BY minute
  ORDER BY minute"
```

A common cause is an increase in query volume or concurrency. As your application grows, more users are actively viewing their dashboards, firing off more queries at the same time. If query volume climbed in lockstep with p99, you probably have a concurrency problem. If p99 rose while volume stayed flat, something else is competing for resources, and you can widen the same query (drop the `log_comment` filter, group by `log_comment` or `query_kind`) to find the heavy queries, ingestion, or merges crowding out your dashboard.

### System metrics

The second signal is resource pressure. To see whether the service is actually saturated, look at its metrics. ClickHouse Cloud exposes a [Prometheus-compatible endpoint](https://clickhouse.com/docs/integrations/prometheus) per service. clickhousectl can help you take a quick peek:

```bash
clickhousectl cloud service prometheus "$SERVICE_ID" --filtered-metrics true
```

The snapshot is enough to get an idea of the current state. For trends over time, point a standing Prometheus scraper at the same endpoint.
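If you want to pull a single number out of that snapshot in a script, a small filter is enough. This is a rough sketch that assumes the snapshot is printed in the Prometheus text exposition format, with one sample per metric name and no trailing timestamp. `metric_value` is a hypothetical helper, and the sample lines and their label values are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the value of one metric from Prometheus
# text-exposition output on stdin. Assumes the value is the last field
# on the line (no trailing timestamp) and returns the first match only.
metric_value() {
  local name="$1"
  awk -v name="$name" '
    /^#/ { next }                 # skip "# HELP" and "# TYPE" lines
    {
      n = $1
      sub(/[{].*/, "", n)         # drop any {label="..."} suffix
      if (n == name) { print $NF; exit }
    }'
}

# Made-up sample standing in for real snapshot output.
sample='# TYPE ClickHouseMetrics_Query gauge
ClickHouseMetrics_Query{clickhouse_service="demo"} 12
ClickHouseAsyncMetrics_CGroupMemoryUsed{clickhouse_service="demo"} 6442450944'

printf '%s\n' "$sample" | metric_value ClickHouseMetrics_Query   # prints 12
```

Against a real service you would pipe the output of the `clickhousectl cloud service prometheus` command into `metric_value` instead of the sample text, after checking that its output matches the assumed format.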
Pay particular attention to these metrics:

| Resource | Metric(s) | How to read it |
|---|---|---|
| CPU | `ClickHouseAsyncMetrics_CGroupUserTimeNormalized` + `ClickHouseAsyncMetrics_CGroupSystemTimeNormalized`, vs. `ClickHouseAsyncMetrics_CGroupMaxCPU` | Sum the two normalized values to get cores in use. ~1.0 = one core saturated; approaching `CGroupMaxCPU` = CPU maxed out. |
| Memory | `ClickHouseAsyncMetrics_CGroupMemoryUsed` ÷ `ClickHouseAsyncMetrics_CGroupMemoryTotal` | Fraction of the memory limit in use. Approaching 1.0 = memory pressure. |
| Concurrency | `ClickHouseMetrics_Query` | Queries executing right now, a quick proxy for how busy the service is. |

The state of the service helps you determine the right action to take. High concurrency with low memory usage suggests adding replicas: you just need more cores to spread query concurrency over. Memory pinned near the limit on every replica suggests you need bigger replicas.

## Scaling with clickhousectl

`cloud service scale` allows you to scale a ClickHouse Cloud service horizontally and vertically:

```bash
clickhousectl cloud service scale "$SERVICE_ID" \
  --min-replica-memory-gb 8 \
  --max-replica-memory-gb 16 \
  --num-replicas 3
```

`--num-replicas` is the horizontal dimension (how many replicas run in parallel). The `--min-replica-memory-gb` and `--max-replica-memory-gb` flags control vertical scaling. ClickHouse Cloud has native auto-scaling that can vertically scale replicas when it sees resource pressure. Set the two values apart to let Cloud scale replicas up and down automatically; set them equal to fix the replica size. The example above runs 3 replicas, each free to scale between 8 and 16 GB.

## A simple cron

You could put this inside a simple cron:

bash /usr/bin/env bash set -euo pipefail SERVICE ID="