{"slug": "monitor-slas-and-scale-clickhouse-cloud-with-clickhousectl-and-agents", "title": "Monitor SLAs and scale ClickHouse Cloud with clickhousectl and agents", "summary": "ClickHouse Cloud users can now monitor query-level SLAs and automate infrastructure scaling using the open-source tool clickhousectl and AI agents. The tool enables tagging specific queries, calculating percentile-based latency targets, and automatically investigating or remediating SLA breaches by scaling resources. This workflow allows teams to maintain performance guarantees for critical frontend queries without manual intervention.", "body_md": "ClickHouse Cloud makes it trivial to automatically scale your infrastructure up and down, horizontally or vertically, in response to resource pressure. But sometimes you want to go further and monitor SLAs on specific queries. Perhaps they're the queries fired off by your frontend app, and the user experience degrades when latency exceeds 200 ms.\n\nThis guide shows you how to tag queries so you can calculate SLAs, then use [clickhousectl](https://github.com/ClickHouse/clickhousectl) to query and scale ClickHouse Cloud to investigate and fix breaches. You'll also see how you can hand this workflow off to an agent that investigates and remediates for you.\n\nTry out the [runnable example](https://github.com/ClickHouse/examples/tree/main/ai/clickhousectl/agentic-sla-scaling).\n\n## Setup [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#setup)\n\n[Install clickhousectl](https://clickhouse.com/docs/interfaces/cli) and authenticate with an API key:\n\n```\ncurl https://clickhouse.com/cli | sh\n\nclickhousectl cloud auth login --api-key \"$CLICKHOUSE_CLOUD_API_KEY\" --api-secret \"$CLICKHOUSE_CLOUD_API_SECRET\"\n```\n\nConfirm you can see your services. 
The first column is the service ID you'll use everywhere else:\n\n```\nclickhousectl cloud service list\n```\n\n## Defining and measuring your SLA [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#defining-and-measuring-your-sla)\n\nFirst, you need to define your SLA and know how to measure it. An SLA is only useful if it's specific: a percentile, a latency target, and the queries it applies to. For a frontend dashboard, that might be *\"p99 under 200 ms for the queries behind the main view\"*. That's what we'll use for the example here.\n\nThe `system.query_log` table records every query a ClickHouse service runs. The trick is to tag your queries so you can easily filter for them. Set [`log_comment`](https://clickhouse.com/docs/operations/settings/settings#log_comment) on the queries you want to track, and they become trivial to isolate later:\n\n```\nSELECT event_type, count(), avg(value), quantile(0.9)(value)\nFROM events\nWHERE event_type = 'purchase'\n  AND event_time > now() - INTERVAL 1 DAY\nGROUP BY event_type\nSETTINGS log_comment = 'frontend-dashboard';\n```\n\nWith the queries tagged, you can read them back from the log:\n\n```\nclickhousectl cloud service query --id \"$SERVICE_ID\" --query \"\n  SELECT event_time, query_duration_ms\n  FROM clusterAllReplicas(default, system.query_log)\n  WHERE type = 'QueryFinish'\n    AND log_comment = 'frontend-dashboard'\n  ORDER BY event_time DESC\n  LIMIT 10\"\n```\n\nOnce you can see them, measuring the SLA is just an aggregation. 
Compute the p99 latency for exactly that workload over the last five minutes:\n\n```\nclickhousectl cloud service query --id \"$SERVICE_ID\" --query \"\n  SELECT\n      toUInt64(quantile(0.99)(query_duration_ms)) AS p99_ms,\n      count() AS queries\n  FROM clusterAllReplicas(default, system.query_log)\n  WHERE event_time > now() - INTERVAL 5 MINUTE\n    AND type = 'QueryFinish'\n    AND log_comment = 'frontend-dashboard'\"\n```\n\n## Investigating a breach [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#investigating-a-breach)\n\nA breached SLA tells you that latency went up, but not why. There are two places to look, and they answer different questions. Sometimes it's a simple case of CPU or memory being over-utilised. Other times the hardware stats look fine, and you need to dig a little deeper into what's going on inside the database.\n\n### Inside the database [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#inside-the-database)\n\nThe first signal lives in ClickHouse itself. `system.query_log` doesn't just help you with the SLA query; you can also ask questions about everything else that ran alongside it. That helps you understand whether something about the workload is changing.\n\nBucketing volume and latency by minute is a good place to start:\n\n```\nclickhousectl cloud service query --id \"$SERVICE_ID\" --query \"\n  SELECT\n      toStartOfMinute(event_time) AS minute,\n      count() AS queries,\n      toUInt64(quantile(0.99)(query_duration_ms)) AS p99_ms\n  FROM clusterAllReplicas(default, system.query_log)\n  WHERE event_time > now() - INTERVAL 30 MINUTE\n    AND type = 'QueryFinish'\n    AND log_comment = 'frontend-dashboard'\n  GROUP BY minute\n  ORDER BY minute\"\n```\n\nA common cause is an increase in query volume or concurrency. 
As your application grows, more users are actively viewing their dashboards, firing off more queries at the same time.\n\nIf query volume climbed in lockstep with p99, you probably have a concurrency problem. If p99 rose while volume stayed flat, something *else* is competing for resources, and you can widen the same query (drop the `log_comment` filter, group by `log_comment` or `query_kind`) to find the heavy queries, ingestion, or merges crowding out your dashboard.\n\n### System metrics [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#system-metrics)\n\nThe second signal is resource pressure. To see whether the service is actually saturated, look at its metrics. ClickHouse Cloud exposes a [Prometheus-compatible endpoint](https://clickhouse.com/docs/integrations/prometheus) per service. `clickhousectl` can help you take a quick peek:\n\n```\nclickhousectl cloud service prometheus \"$SERVICE_ID\" --filtered-metrics true\n```\n\nThe snapshot is enough to get an idea of the current state. For trends over time, point a standing Prometheus scraper at the same endpoint.\n\nPay particular attention to these metrics:\n\n| Resource | Metric(s) | How to read it |\n|---|---|---|\n| CPU | `ClickHouseAsyncMetrics_CGroupUserTimeNormalized` + `ClickHouseAsyncMetrics_CGroupSystemTimeNormalized`, vs. `ClickHouseAsyncMetrics_CGroupMaxCPU` | Sum the two normalized values to get cores in use. ~1.0 = one core saturated; approaching `CGroupMaxCPU` = CPU maxed out. |\n| Memory | `ClickHouseAsyncMetrics_CGroupMemoryUsed` ÷ `ClickHouseAsyncMetrics_CGroupMemoryTotal` | Fraction of the memory limit in use. Approaching 1.0 = memory pressure. |\n| Concurrency | `ClickHouseMetrics_Query` | Queries executing right now, a quick proxy for how busy the service is. |\n\nThe state of the service helps you determine the right action to take. High concurrency with low memory usage suggests adding replicas: you need more cores to spread the concurrent queries over. 
Memory pinned near the limit on every replica suggests you need bigger replicas.\n\n## Scaling with clickhousectl [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#scaling-with-clickhousectl)\n\nThe `cloud service scale` command lets you scale a ClickHouse Cloud service horizontally and vertically:\n\n```\nclickhousectl cloud service scale \"$SERVICE_ID\" \\\n  --min-replica-memory-gb 8 \\\n  --max-replica-memory-gb 16 \\\n  --num-replicas 3\n```\n\n`--num-replicas` is the horizontal dimension (how many replicas run in parallel). The `--min-replica-memory-gb` and `--max-replica-memory-gb` flags control vertical scaling. ClickHouse Cloud has native auto-scaling that can vertically scale replicas when it sees resource pressure: set the two flags to different values to let Cloud scale replicas up and down within that range, or set them equal to fix the replica size. The example above runs 3 replicas, each free to scale between 8 and 16 GB.\n\n## A simple cron [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#a-simple-cron)\n\nYou could put this check inside a simple cron job:\n\n```bash\n#!/usr/bin/env bash\nset -euo pipefail\nSERVICE_ID=\"<your-service-id>\"\nSLA_MS=200\n\np99=$(clickhousectl cloud service query --id \"$SERVICE_ID\" --format TSV --query \"\n  SELECT toUInt64(quantile(0.99)(query_duration_ms))\n  FROM clusterAllReplicas(default, system.query_log)\n  WHERE event_time > now() - INTERVAL 1 MINUTE\n    AND type = 'QueryFinish'\n    AND log_comment = 'frontend-dashboard'\")\n\nif (( p99 > SLA_MS )); then\n  echo \"SLA breached: p99=${p99}ms > ${SLA_MS}ms. Scaling out\"\n  clickhousectl cloud service scale \"$SERVICE_ID\" --num-replicas 4\nelse\n  echo \"OK: p99=${p99}ms\"\nfi\n```\n\nRun it once a minute and you have a very simple way to give your application some breathing room. But you'll have to think about the rest of the flow, too. 
That includes scaling back down when pressure eases, scaling further when needed, and deciding between horizontal and vertical scaling.\n\n## Using agents to investigate and remediate [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#using-agents-to-investigate-and-remediate)\n\nIf you want to go beyond hard-coded heuristics, this is an interesting use case for agents.\n\nA cron job might still be the right way to run the SLA check every minute. But if the SLA is breached, an agent can help reason about what action to take.\n\nThe ClickHouse agent skills can help your agent make better use of ClickHouse and `clickhousectl`. You can install them using `clickhousectl` itself:\n\n```\nclickhousectl skills --agent claude\n```\n\nThe check itself can stay in cron: it's cheap and predictable. But instead of a hard-coded `scale --num-replicas 4`, you can pass the failure to an LLM, giving it context about the failure, how to investigate, and what remediation options it should consider:\n\n```\nif (( p99 > SLA_MS )); then\n  read -r -d '' PROMPT <<EOF || true\nThe 'frontend-dashboard' query latency SLA on ClickHouse Cloud service $SERVICE_ID\nhas just breached: p99 over the last minute is ${p99}ms against a ${SLA_MS}ms target.\n\nYou're the on-call agent. Work out WHY the SLA is breaching, then remediate it by\napplying exactly one scaling action to the service. Let the evidence drive the choice.\n\nWhat you have to work with (clickhousectl only):\n  - SQL against the service's system tables. 
system.query_log is the richest source:\n    one row per query, with its timing and memory use, each tagged with the workload\n    it belongs to in the log_comment column ('frontend-dashboard' is the SLA workload):\n      clickhousectl cloud service query --id $SERVICE_ID --format TSV --query \"<SQL>\"\n  - Live resource pressure from Prometheus (CPU, memory, query concurrency, merges):\n      clickhousectl cloud service prometheus $SERVICE_ID --filtered-metrics true\n\nYour two scaling levers. Apply only ONE, whichever the root cause calls for:\n  - Replica count:  clickhousectl cloud service scale $SERVICE_ID --num-replicas N\n  - Replica size:   clickhousectl cloud service scale $SERVICE_ID --min-replica-memory-gb M --max-replica-memory-gb M\n\nGeneral advice on which scaling pattern to use:\n- Prefer scaling vertically if cause is unclear.\n- Scale vertically if latency is likely caused by resource contention from other queries.\n- Scale horizontally if latency is caused by an increase in query concurrency or write throughput.\n\nApply one action, then explain the evidence you relied on and why that lever fits.\nEOF\n\n  printf '%s' \"$PROMPT\" | claude -p --model sonnet --allowedTools \"Bash(clickhousectl:*)\"\nfi\n```\n\n### Use your own scaling policy [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#use-your-own-scaling-policy)\n\nYou can take this further with your own rules and guidelines for scaling. 
Perhaps you want to guide the model not to scale beyond X replicas, or give it additional guidance on exactly what to look for (and how).\n\nCreating a context file in Markdown, or encoding it inside a custom agent skill, is a great way to guide the agent towards more desirable behaviour.\n\n### Auditing [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#auditing)\n\nEvery action performed via `clickhousectl` lands in the ClickHouse Cloud activity log, so you get an audit trail for free:\n\n```\nclickhousectl cloud activity list\n```\n\n## Get clickhousectl [#](/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl#get-clickhousectl)\n\nEverything in this guide uses [clickhousectl](https://github.com/ClickHouse/clickhousectl), the ClickHouse CLI for local and cloud. It's the single tool for taking a project from your laptop to production: spinning up ClickHouse locally, building against it, and managing the Cloud service it eventually runs on.\n\nInstall with:\n`curl https://clickhouse.com/cli | sh`", "url": "https://wpnews.pro/news/monitor-slas-and-scale-clickhouse-cloud-with-clickhousectl-and-agents", "canonical_source": "https://clickhouse.com/blog/monitor-and-scale-clickhouse-cloud-with-clickhousectl", "published_at": "2026-06-05 14:58:29+00:00", "updated_at": "2026-06-12 15:10:24.350149+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-tools", "ai-agents", "mlops"], "entities": ["ClickHouse Cloud", "clickhousectl"], "alternates": {"html": "https://wpnews.pro/news/monitor-slas-and-scale-clickhouse-cloud-with-clickhousectl-and-agents", "markdown": "https://wpnews.pro/news/monitor-slas-and-scale-clickhouse-cloud-with-clickhousectl-and-agents.md", "text": "https://wpnews.pro/news/monitor-slas-and-scale-clickhouse-cloud-with-clickhousectl-and-agents.txt", "jsonld": "https://wpnews.pro/news/monitor-slas-and-scale-clickhouse-cloud-with-clickhousectl-and-agents.jsonld"}}