{"slug": "executable-udfs-are-now-in-public-beta-on-clickhouse-cloud", "title": "Executable UDFs are now in public beta on ClickHouse Cloud", "summary": "ClickHouse Cloud now offers executable user-defined functions (UDFs) in public beta, allowing users to upload Python code as a zip file and call it from SQL like any built-in function. The feature runs inference code in a managed sandbox alongside data, enabling real-time model scoring within queries, joins, and materialized views without requiring a separate server. A demo shows the UDF powering an autoencoder that scores billions of stock trade ticks for anomalies inline with ingest.", "body_md": "Today we're excited to announce that executable UDFs are now available\nin public beta on ClickHouse Cloud. You can write a function in Python, upload it as a zip to your cluster, and call it from\nSQL like any built-in. ClickHouse manages a pool of long-lived sandboxed\nprocesses and routes rows through them at query speed. The function is\ncallable anywhere SQL is: ad-hoc queries, joins, even materialized views\nthat fire on every insert.\n\nThis isn't a brand-new idea. We've shipped executable UDFs in self-hosted\nClickHouse for a while. Our 2023 post on calling OpenAI from\nSQL\nwalked through the same mechanism. What's new today is that you don't\nneed to run your own server to use it. The model code lives where the\ndata is, runs in a managed sandbox, and the deployment surface is one\nupload screen in the Cloud console.\n\nTo show what this unlocks, we built a demo. A small PyTorch autoencoder\nscores ~6 billion equity trade ticks for anomalousness, inline with\ningest. A Next.js front-end consumes the embeddings. Full source for the\nnotebook, UDF bundle, SQL, and webapp is in this repo.\n\nYou have a trained model. You have a stream of data in ClickHouse.\nGetting them into the same room used to mean one of three options.\n\nStand up a separate scoring service. 
Now you maintain a model\nserver, an ingest pipeline that routes rows to it, and a way to write\nthe scores back into ClickHouse. The model is no longer near the data\nin any meaningful sense.\n\nTranslate the model into pure SQL. Workable for some tree-based\nmodels. Painful for anything with embeddings. Every retrain means\nregenerating thousands of lines of SQL by hand.\n\nBatch score offline and join later. Loses freshness. The \"anomaly\"\non a trade that just hit is only useful if you can react to it now.\n\nExecutable UDFs collapse all three into one. Write the inference code as\na normal Python file. Point ClickHouse at it. Call it from SQL. The\nfunction runs inline with whatever query needs it, including inside a\nmaterialized view, which is exactly what we do here.\n\nLast year we wrote \"Building StockHouse\",\nshowing how ClickHouse handles a continuous firehose of stock trade ticks\nin real time. That post stopped at the ingest and query layer. The\nnatural next question is: what if you wanted to apply a learned model to\nevery trade as it lands?\n\nWe picked an unsupervised anomaly-detection setup because it shows off\nthe shape of the problem cleanly.\n\nA small autoencoder (~270K parameters) is trained on 50M historical\ntrade ticks. Its inputs: a hashed ticker, 7 numeric features (price,\nsize, exchange, etc.), and 6 cyclical-encoded temporal features.\n\nFor each trade, the model produces a 32-dim embedding and a\nreconstruction error. High error means the model wasn't trained on\npatterns like this trade. It's anomalous in shape compared to what's\nnormal for that symbol's history.\n\nThe UDF that wraps this model is embed_trade. It's the only\nML-specific piece in the system. Everything else is plain SQL: the\nscore aggregation, the per-symbol baselines, the views.\n\nHere's the data flow:\n\n```\n            ┌───────────────────────────┐\n            │  default.trades           │     ← upstream feed (e.g. 
Polygon)\n            └──────────────┬────────────┘\n                           │ INSERT\n                           ▼\n            ┌───────────────────────────┐\n            │  trades_embeddings_mv     │     ← fires on every INSERT\n            │  (calls embed_trade UDF)  │\n            └──────────────┬────────────┘\n                           │\n                           ▼\n            ┌───────────────────────────┐\n            │  default.trades_embeddings│     ← same trade + 32-dim\n            │                           │       embedding + recon_score\n            └──────────────┬────────────┘\n              ▲            │\n              │            │ refresh hourly\n              │            ▼\n              │  ┌──────────────────────┐\n              │  │ trades_baselines     │     ← per-symbol score\n              │  │ trades_dim_baselines │       distribution stats\n              │  └──────────────────────┘\n              │\n              └──── consumed by webapp queries\n                    (anomalies are defined relative\n                     to each symbol's own baselines)\n```\n\nEvery INSERT INTO trades flows through the materialized view, gets\nscored, and lands in trades_embeddings. The webapp never re-runs the\nmodel. It only reads trades_embeddings and two cheap baseline tables.\nThe expensive inference happens exactly once per trade, inline with\ningest, and every downstream query is a normal aggregation.\n\nThe model itself is small and unremarkable as ML goes, but the training\npipeline is worth a quick look because it has to produce artifacts the\nUDF can load at runtime. The full walkthrough lives in\nnotebook/train_and_deploy_udf.ipynb.\nA summary:\n\nStream training data into Parquet chunks. A SELECT against\ndefault.trades derives the 14 input features server-side (price,\nsize, exchange, condition-code count, hashed ticker, and cyclical\nencodings of hour and day of week). 
The notebook pulls the result via\nquery_arrow_stream and writes 5M-row Parquet chunks to local disk.\nNothing is held in memory.\n\nFit a StandardScaler incrementally. Welford's algorithm via\npartial_fit gives the same mean and variance as a single\nscaler.fit() over the full dataset, with bounded memory. We fit on\nthe 7 base numeric features only. The hashed ticker is an integer key\nand the cyclical features are already on a sensible scale.\n\nTrain the autoencoder. TradeAutoencoderV2 is a 4-layer encoder\ninto a 32-dim latent, with a symmetric decoder back to the numeric\nfeature space. The sym embedding lookup happens at the input layer,\nsym_idx = xxHash32(sym) % NUM_HASH_BUCKETS. Loss is MSE on the\nreconstructed numeric features. Training streams rows out of the\nParquet chunks via an IterableDataset and stops when a 200-batch\nmoving-average loss fails to improve for 5 windows.\n\nSave two artifacts. scaler_params.pt holds mean_ and scale_\nas Float32 tensors. trade_autoencoder_v2.pt holds the model\nstate_dict plus a config dict with the constructor kwargs. The\nUDF's main.py reads these at startup and reconstructs the model.\n\nPackage the bundle. A final notebook cell zips main.py,\nrequirements.txt, and the two .pt files into embed_trade.zip,\nready to upload.\n\nThe deployment surface is a single upload screen in the Cloud console.\nYou give it a name, a zip containing your code and model files, and a\nfew runtime parameters.\n\nFor embed_trade we use:\n\nType: executable_pool. Long-lived processes, hot model in memory.\n\nPool size: 10 per replica. Each process loads the 2MB model at\nstartup (~1.5s) and reuses it for every subsequent call.\n\nRuntime: python3.11. Dependencies (torch==2.4.1,\nnumpy==1.26.4) come from the requirements.txt in the zip.\n\nFormat: TabSeparated. The UDF reads one TSV line per input row\non stdin and prints (embedding, recon_score) on stdout.\n\n14 arguments, each with an explicit ClickHouse type. 
The signature\nmatches the autoencoder's training schema exactly. See\nudf/cloud-deployment.md for the full\ntable.\n\nThe function is then callable from SQL like any built-in:\n\n```\nWITH\n    fromUnixTimestamp64Milli(t, 'America/New_York') AS ts,\n    embed_trade(\n        xxHash32(sym), p, s, x, z, toUInt64(length(c)), trfi, trft,\n        toUInt8(toHour(ts)), toUInt8(toDayOfWeek(ts, 1)),\n        sin((toHour(ts) * 2 * pi()) / 24),\n        cos((toHour(ts) * 2 * pi()) / 24),\n        sin((toDayOfWeek(ts, 1) * 2 * pi()) / 7),\n        cos((toDayOfWeek(ts, 1) * 2 * pi()) / 7)\n    ) AS result\nSELECT\n    sym, i, x, p, s, c, t, q, z, trfi, trft, inserted_at,\n    result.2 AS recon_score,\n    result.1 AS embedding\nFROM stockhouse.trades\nLIMIT 10;\n```\n\nThe interesting part isn't that you can do this. It's where you can\nput the call.\n\nEvery INSERT INTO trades fires this MV. The Python pool scores\nthe batch and lands the result in trades_embeddings. There's no other\nmover, no other service, no separate scheduler. Just SQL.\n\nThis is the part that wasn't possible before executable UDFs landed in\nCloud. The equivalent service architecture would be a Kafka consumer\nreading from trades, batching rows, posting to a model server, writing\nthe results back. Same end state, several more moving parts. Here it's\none DDL statement.\n\nThe performance shape is unsurprising. Cost per row is the model forward\npass (a few milliseconds on a warm pool) plus the TSV serialization.\nClickHouse batches rows into the UDF in chunks. The pool runs a handful\nof in-flight invocations in parallel. We backfilled ~6B historical rows\nat ~35K rows/sec sustained over several hours on a 3-replica cluster\nwith no manual scaling. Same UDF, same MV, same SQL.\n\nThe autoencoder gives us a raw recon_score per trade. That's a number\nbetween roughly 0.00002 and 1,000,000+ across the dataset. 
A naive\n\"trades above 0.062 are anomalous\" filter (using the global 99th\npercentile from the model's training distribution) sounds reasonable\nuntil you actually look at the data.\n\nA handful of symbols, like BRK.A and LLY, score every single trade above\nthat threshold because their share prices are unusually high. Their\nentire distribution sits in the right tail of the global one. A \"100%\nanomalous\" stat for those symbols is technically correct and practically\nuseless.\n\nSo we redefine \"anomaly\" relative to each symbol's own history. For\nevery symbol, we maintain its lifetime p95 of recon_score. A trade\nis anomalous for that symbol if it exceeds the symbol's own p95. About\n5% of trades qualify in a typical window, by construction. When that\nfraction spikes well above 5%, the symbol is having a genuinely unusual\nwindow.\n\nThe per-symbol baseline lives in another ClickHouse table:\n\n```\nCREATE TABLE trades_baselines (\n    sym         LowCardinality(String),\n    p50         Float32,\n    p95         Float32,\n    p99         Float32,\n    -- ...\n    computed_at DateTime\n)\nENGINE = MergeTree\nORDER BY sym;\n```\n\nA refreshable materialized view repopulates it every hour:\n\n```\nCREATE MATERIALIZED VIEW trades_baselines_mv\nREFRESH EVERY 1 HOUR\nTO trades_baselines\nAS\nSELECT\n    sym,\n    quantiles(0.5, 0.95, 0.99)(recon_score) AS qs,\n    qs[1] AS p50, qs[2] AS p95, qs[3] AS p99,\n    -- ...\nFROM trades_embeddings\nWHERE NOT has(c, 15) AND NOT has(c, 12)   -- exclude auction prints\nGROUP BY sym;\n```\n\nRefreshable MVs atomically truncate and replace the target table on each\nrefresh. 
Plain MergeTree is the right engine: no FINAL, no dedup\nlogic, no read-time overhead.\n\nThe leaderboard query then joins live trades against the baselines\ntable to count anomalies per symbol relative to their own baseline:\n\n```\nSELECT\n    e.sym,\n    countIf(e.recon_score > b.p95) AS anomaly_count,\n    round(sumIf(e.s, e.recon_score > b.p95) * 100.0 / sum(e.s), 2) AS pct_of_volume\nFROM stockhouse.trades_embeddings AS e\nINNER JOIN stockhouse.trades_baselines AS b ON e.sym = b.sym\nWHERE e.t >= now() - INTERVAL 1 HOUR\nGROUP BY e.sym\nORDER BY pct_of_volume DESC\nLIMIT 50;\n```\n\nThis query goes from ~1.7s (recomputing baselines inline as a CTE) to\n~0.27s (joining against the pre-computed table). Same answer, roughly 6x\nfaster. The expensive part is materialized exactly once an hour instead\nof on every page load.\n\nThe webapp is a Next.js + Click UI + Highcharts demo. It consumes\ntrades_embeddings and the baseline tables.\n\nThe anomaly dashboard ranks S&P 500 symbols by share of trading\nvolume that exceeds their own baseline.\n\nThe packed-bubble chart sizes and colors each symbol by pct_of_volume,\nthe share of total trading volume in the window that came from trades\nabove the symbol's lifetime p95. Symbols with redder, larger bubbles had\nunusually anomaly-heavy windows. The table on the left carries the same\nsort, with OHLC, max score, and the per-symbol baseline alongside.\n\nThe symbol drilldown zooms in on one ticker.\n\nA candlestick and volume pane sits on top. Both axes overlap a single\nplot area, with the price axis stretched downward to push candles into\nthe top 65% and volume bars into the bottom 30%. 
Hover any row in the\nanomalous-trades table and the corresponding candle's volume bar fills\nyellow, sized to that trade's share of the bucket's total volume.\nCrosshairs snap to the candle center.\n\nThe similarity search opens as a modal over the drilldown when you\nclick a trade.\n\nThe radar chart plots each trade's 13 input dimensions, normalized\nagainst the symbol's lifetime min, max, and avg per dim. Because the avg\nalways maps to 0.5, the baseline series renders as a perfect 13-sided\npolygon at the chart's midpoint, which makes deviations easy to spot. Hover a\nsimilar-trade row to overlay it. The 50 most-similar trades come from\ncosineDistance(embedding, target_embedding) over the same symbol's\nembedding column.\n\nThe model drift monitor tracks the score distribution over time.\n\nWeekly p50, p95, p99, and max of recon_score, with horizontal\nreference lines at the static thresholds the model was originally\ncalibrated against. If the p99 line starts climbing week over week, the\nmarket has drifted from the model's training distribution and it's time\nto retrain.\n\nThe auction print monitor is the home for the extreme tail. Opening\n(c=12) and closing (c=15) auction prints score in the thousands to\nmillions because of their massive share sizes.\n\nThey'd dominate every other view if we didn't filter them out everywhere\nelse. Here they get their own page.\n\nOne more thing: network-access UDFs (private beta)\n\nEverything you've seen so far runs on the deterministic path. embed_trade\nscores rows at ingest, baselines refresh hourly, the webapp reads\npre-computed tables. No external calls anywhere on the read path. That's\nthe shape you want for the load-bearing pieces: cheap, predictable, no\nupstream that can disappear on you.\n\nBut once a trade has been flagged as anomalous, the obvious next\nquestion is why. That answer lives outside ClickHouse, in news APIs,\nSEC filings, halt notices, social signals. 
To pull those in we need\nnetwork access from the UDF.\n\nNetwork-access executable UDFs are in private beta on ClickHouse\nCloud. Once enabled, the UDF runtime can make outbound HTTPS calls to\nany allowed host. We added two new UDFs in this repo to use it:\n\nnearby_events. Given (sym, t, window_min), it calls two external sources and returns a\nJSON array of events near that trade time:\n\nMassive News API (Polygon recently rebranded as Massive;\napi.polygon.io endpoints still respond as before).\n\nSEC EDGAR (free, public, no API key).\n\n```\nSELECT\n    sym,\n    JSONLength(nearby_events(sym, t, 120)) AS n_events\nFROM stockhouse.trades_embeddings\nWHERE recon_score > 1.0\nLIMIT 5;\n```\n\nYou could almost do this with url(). The differences that make it a\nUDF:\n\nIn-process composition. Polygon's results and EDGAR's filings get\ndeduped, sorted, and capped in a single Python call. Chaining two\nurl() calls in SQL would force the same logic into a UNION ALL\nwith downstream arrayJoin/groupArray plumbing, which is workable but\nuglier.\n\nAuth in env. The Polygon API key is read from\nPOLYGON_API_KEY at pool-process startup. It never appears in SQL.\n\nPer-process LRU cache. Each pool worker keeps recent results\nkeyed by (sym, minute, window). The same trade hovered twice in the\nUI costs one API call, not two.\n\nConnection reuse. A long-lived requests.Session() per process\nkeeps HTTP connections alive for the duration of that worker, which\nis hours.\n\nclassify_trade. Given (sym, t), it fetches context via nearby_events's internals, then\nasks Anthropic Claude to classify the most likely cause of the\nanomalous trade. It returns a typed tuple:\n\n```\nWITH classify_trade('LLY', 1778777944818) AS c\nSELECT c.1 AS cause, c.2 AS confidence, c.3 AS summary;\n```\n\nThe cause is constrained to a fixed taxonomy: earnings, m_and_a,\nhalt, rumor, sector_move, block_trade, no_news_found. We\nenforce this via Anthropic's tool-use mechanism. 
The model is\nrequired to call a tool whose input_schema includes an enum on the\ncause field, so the response is guaranteed to be parseable and the\ncause is guaranteed to be one of the known values. No regex parsing of\nfree-form prose, no \"the model returned something close to 'earnings'\nbut with extra words\" follow-up logic.\n\nRemember the similarity modal from the webapp? classify_trade and\nnearby_events drive a \"Why anomalous?\" panel pinned to the top of\nthat modal. When you open a trade, the panel hits both UDFs in parallel\nand shows:\n\nA badge with the classified cause and a confidence number\n\nA 1–2 sentence summary written by the model\n\nA short list of the news headlines and filings that drove the call\n\nurl() has been in ClickHouse for years and it's good for ad-hoc\nfetches. What network-access UDFs add is the rest of the picture:\nstateful clients, auth lifecycle, multi-step pipelines, structured\nLLM output, and per-process caching. The boundary between \"code that\nneeds to run\" and \"data that needs to be queried\" gets thinner.\n\nYou can put a 200-line Python function with three API calls and an LLM\nprompt into a SELECT. Nobody else has to learn it exists.\n\nWant to try it on your cluster? Network-access UDFs are in private\nbeta — reach out to ClickHouse Cloud support to get it enabled!\n\nMost ML-on-streaming-data architectures pay an integration tax. The\nmodel lives somewhere. The data lives somewhere else. The glue between\nthem is its own system. The setup in this repo flattens that. There's a\nClickHouse Cloud cluster, a 2MB Python file, and one DDL statement that\nbinds them together.\n\nEvery piece of UI logic in the webapp is a SQL query. Anomaly detection\nis the only ML in the system, and even that's not \"ML in the webapp\",\nit's a column in a table. 
The \"how anomalous is this symbol's last\nhour\" calculation, the \"find me similar trades by cosine distance\"\nquery, the per-symbol p95 baseline, the materialized views that keep it\nall fresh: standard SQL features, running against standard ClickHouse\ntables.\n\nExecutable UDFs in Cloud don't add new abstractions on top of\nClickHouse. They give you a way to make Python part of your SQL.\n\nBackfill historical data (optional). Bulk INSERT into\ntrades_embeddings using the same SELECT pattern as the MV, scoped to\nany time range. The MV in step 2 will catch every subsequent INSERT\ninto default.trades automatically.\n\nThe notebook in notebook/ walks through training your own autoencoder\nend to end. It streams training data from default.trades into Parquet\nchunks, fits a StandardScaler incrementally, trains with early\nstopping, and zips the artifacts into a deployable bundle.", "url": "https://wpnews.pro/news/executable-udfs-are-now-in-public-beta-on-clickhouse-cloud", "canonical_source": "https://clickhouse.com/blog/executable-udfs-clickhouse-cloud-beta", "published_at": "2026-06-01 14:08:38+00:00", "updated_at": "2026-06-03 10:36:42.711865+00:00", "lang": "en", "topics": ["machine-learning", "ai-infrastructure", "ai-tools", "ai-products", "mlops"], "entities": ["ClickHouse", "ClickHouse Cloud", "PyTorch", "Next.js", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/executable-udfs-are-now-in-public-beta-on-clickhouse-cloud", "markdown": "https://wpnews.pro/news/executable-udfs-are-now-in-public-beta-on-clickhouse-cloud.md", "text": "https://wpnews.pro/news/executable-udfs-are-now-in-public-beta-on-clickhouse-cloud.txt", "jsonld": "https://wpnews.pro/news/executable-udfs-are-now-in-public-beta-on-clickhouse-cloud.jsonld"}}