{"slug": "the-correctness-layer-how-we-beat-claude-code-on-the-ade-benchmark", "title": "The Correctness Layer: How We Beat Claude Code on the ADE Benchmark", "summary": "Altimate-code achieved the top spot on the ADE and DAB benchmarks by splitting its architecture into a probabilistic LLM layer for strategy and code generation and a deterministic Rust and TypeScript stack for verifiable operations. The company built the system after finding that LLMs, as probability distributions, produce unreliable and unrepeatable answers for deterministic data-engineering tasks like semantic equivalence checks and lineage verification. The architecture moves three specific operations entirely out of the model to eliminate the randomness that caused a single benchmark task to pass once and fail twice across three identical runs.", "body_md": "We started [altimate-code](https://www.altimate.sh/) with one goal: build a harness that data engineers could actually trust with their pipelines and platforms. The first thing that became clear is that LLMs are the wrong tool for half of that job. Two queries either are or are not semantically equivalent. A lineage edge either does or does not exist. A row-level diff is either correct or not. None of those are creative questions, and a probability distribution is the wrong answer to any of them.\n\nOver the past year, we have built a three-layer architecture that puts the LLM in charge of the work it is good at (strategy, intent-parsing, code generation) and a deterministic Rust and TypeScript stack underneath for everything that has to be reproducible. The split (detailed later) is what gets us to the top of the ADE and the [DAB](https://ucbepic.github.io/DataAgentBench/) benchmarks, and more importantly, what makes the result repeatable on someone else's hardware.\n\nThis post walks through how the split actually works: where the boundary sits, the three operations we moved out of the model entirely, and what thousands of structured benchmark traces taught us about where agents actually fail.\n\n## LLMs are probability distributions\n\nRun the same prompt through the same model twice, and you'll get two different answers. That's not a bug. It's the substrate. Language models sample from a probability distribution every time they generate a token, and the distribution is shaped — sometimes broadly — by sampling temperature[1], prompt-cache state, and the internal nondeterminism of large batched inference on shared hardware[2].\n\nFor creative work, this is exactly what you want. Drafting a SQL query, summarizing a result, recovering from an unexpected error — these are open-ended tasks with many acceptable answers, and probabilistic output is what makes the model useful.\n\nFor correctness, it's a liability. \"Are these two queries semantically equivalent?\" has one right answer. \"What columns does this output project from this source?\" has one right answer. \"Are these two tables byte-equal?\" has one right answer. When a probabilistic system answers a deterministic question, you can't cache the answer (it might be different next time), you can't debug it (a single sample tells you nothing about the distribution), and you can't benchmark it (a single pass-rate is a coin flip away from a different number).\n\n## What this looks like on a real benchmark\n\nWe took one task off the [ADE benchmark](https://github.com/dbt-labs/ade-bench), a dbt refactor called `asana004`\n\n, and ran it through altimate-code three times with the same model, the same prompt, and the same starting state. The agent passed it once and failed it twice. Across the three runs, dbt's test counts came back as 6/6, then 4/6, then 5/6.\n\nIf you are building a data-engineering agent, that is the problem you have to design around. You cannot trust a system whose output is intrinsically random. You cannot cache it. You cannot reproduce it. You cannot debug it. And you cannot reliably benchmark it.\n\n## The comfortable go-to (but wrong) answer\n\nThe instinct is to make the model itself more deterministic. Lower the temperature. Tighten the prompts. Introduce multi-agent voting. Run test-time consensus across N samples. Vendors in the data space are responding to this demand, and their responses make sense on some level. Temperature=0 is easy (provides marginal repeatability at the cost of much of the agent's ability to explore or recover from mistakes), ensemble voting is well-understood, and prompt engineering keeps getting better.\n\nThis works, partially. Lower temperature reduces variance but does not eliminate it. Multi-agent voting trades dollars for variance reduction but never reaches zero. None of it changes the fundamental shape of the problem. The model is a probability distribution, and you are sampling from it.\n\n## So stop fixing the model\n\nThe LLM should stay out of the correctness layer.\n\nThat is the bet behind altimate-code. The LLM stays in charge of strategy, intent-parsing, and code generation, which are the genuinely creative parts of the work. Everything else runs in a deterministic Rust and TypeScript layer underneath: SQL validation, schema diffing, query equivalence, lineage extraction, cross-database join inference, data-parity diffing. The agent does not decide whether two queries are equivalent. It calls a function that proves they are.\n\nI keep coming back to this framing because it is the only one I have seen that survives contact with a real benchmark. Once you have watched the same prompt produce three different answers, you stop wanting the model to be more reliable. You start wanting it to be less involved.\n\nOnce you have watched the same prompt produce three different answers, you stop wanting the model to be more reliable. You start wanting it to be less involved.\n\n## Three layers\n\naltimate-code's architecture has three layers. The decision that matters most is where the boundary between them sits.\n\nThe **deterministic core** is a Rust library (`altimate-core`\n\n) that exposes 34 SQL operations as pure functions over abstract syntax trees and schemas. It parses, validates, transpiles, fingerprints, checks equivalence, diffs schemas, classifies PII, extracts column lineage, and diffs rows across warehouses. Every operation runs sub-millisecond. Our benchmark battery validates 1,000 SQL queries in 30 ms and lints 1,000 queries in 250 ms. All of it is reproducible. All of it is zero-cost at the LLM API layer because it never touches the model. The engine carries 5,700+ Rust tests, runs on a single sqlparser-based AST representation, and targets 34 SQL dialects, from Postgres and Snowflake to Trino and TSQL.\n\nThe **deterministic harness** is the TypeScript code that runs the agent. A dispatcher maintains a registry of native handlers. When the agent calls `altimate_core.transpile`\n\n, the dispatcher routes the call to compiled Rust via napi-rs bindings, not to the model. Skills (encoded playbooks) prescribe step orderings. Helpers like dialect-aware identifier quoting prevent whole classes of \"the model generated almost-correct SQL\" failures.\n\nThe **probabilistic agent** is the LLM. It reads task descriptions, plans, picks tools, writes drafts, summarizes results, and recovers from failures. It sits in the creative layer, not the correctness layer.\n\nThe agent does not always know which layer it is calling into. But it always can know. A `hasNativeHandler`\n\ncheck classifies a tool call as deterministic or not before it runs. That asymmetry is most of the value.\n\nConsider what happens when the agent calls `altimate_core.track_lineage`\n\n. The tool wrapper packages the arguments. The dispatcher looks up the handler registered for that method name. The handler normalizes the agent-supplied schema, accepting either the flat `{table: {col: TYPE}}`\n\nshape or the nested `SchemaDefinition`\n\nshape, and converts both to a canonical form before passing through. The napi-rs binding calls into compiled Rust. The Rust engine builds the lineage graph and returns it. The tool formats the output for the agent. From the agent's perspective, it called a tool and got back lineage. From the architecture's perspective, every step is deterministic, including the schema normalization, which the harness does, not the model.\n\n## What this looks like in practice\n\nI will walk through three concrete operations, because adjectives do not earn the claim.\n\nTake two queries written differently but doing the same thing:\n\n```\n-- Writer A\nSELECT id FROM users\nWHERE status = 'active'\n  AND created > NOW() - INTERVAL 30 DAY\n\n-- Writer B\nSELECT id FROM users\nWHERE created > DATE_SUB(NOW(), INTERVAL 30 DAY)\n  AND status = 'active'\n```\n\n`altimate_core.checkEquivalence`\n\nproves these are semantically identical without running either of them. The comparison happens on parsed ASTs against the provided schema. It handles predicate reordering, equivalent date-arithmetic functions, and identical column projection. The result is a boolean plus a confidence score, not a sample of an LLM's opinion. Downstream, that means refactor verification, migration safety checks, and regression tests that confirm a query's behavior did not change. None of it requires a live database or a model that \"looks at\" the queries.\n\nDialect translation is the same story. `transpile(sql, fromDialect, toDialect)`\n\ntakes a Snowflake query and produces a BigQuery one (or BigQuery to Postgres, or Postgres to Databricks) via `sqlparser-rs`\n\nand dialect-specific AST transforms. It runs sub-millisecond. The agent never has to memorize that Snowflake's `IFF(cond, a, b)`\n\nbecomes `CASE WHEN cond THEN a ELSE b END`\n\nin ANSI dialects. We have measured the LLM-direct alternative. The failure mode is not \"wrong output.\" It is almost-right output that compiles in dev and breaks in prod when an edge-case function shows up. Determinism here is not about speed. It is about coverage. The AST transform handles every case the parser handles, not every case the model happens to have seen in training.\n\n`schemaFingerprint(schema)`\n\nreturns a SHA-256 hash of a serialized schema. That is the entire interface. The hash is what makes caching trustable. When the agent is validating many queries against a large schema, we compute the fingerprint once and use it as a cache key. Identical schemas produce identical hashes, and the cache hits. An LLM-derived \"is this schema the same?\" cannot offer that guarantee. A model might say \"yes\" twice and \"no\" the third time, and you would have to debug whether the third answer was the wrong one. The fingerprint also acts as the index for progressive context compression. The core exposes five disclosure levels: fingerprint-only, table list, column names (the default), full detail for relevant tables only, and the complete schema - and the agent picks the level it needs for the question at hand. The selection is deterministic, the cache key is deterministic, and nothing about either step is the LLM's opinion of \"is this schema close enough?\"\n\nReliability and performance turn out to be the same lever pulled twice. Determinism is what makes the cache safe. The cache is what makes the agent fast. You cannot get the second without the first.\n\n## Case study: cross-database join inference\n\nThe cleanest illustration of this principle is a tool we shipped in late April 2026. [altimate-code](https://github.com/AltimateAI/altimate-code) is often connected to several warehouses at once (Snowflake, BigQuery, Postgres), and the agent needs to suggest cross-database joins. One warehouse calls an entity `customer_42`\n\n. Another calls it `cust_42`\n\n, or `c-42`\n\n, or `42`\n\n. Naming and formatting disagree, so naive column-name matching fails.\n\nA pure-LLM approach to this is straightforward to describe and unreliable to ship. You prompt the model with the schemas and ask \"which columns join across these DBs?\" It works on the cases the model has seen. It hallucinates plausible-but-wrong matches on the cases it has not. Different runs return different rankings.\n\nThe deterministic version is about 400 lines of TypeScript. The algorithm is small enough to summarize:\n\nFrom each connection, sample ~50 string values per column.\n\nCompute the longest common prefix per column, trimmed back to the last\n\n`_`\n\n,`-`\n\n, or`:`\n\nseparator. So`[\"businessid_42\", \"businessid_43\", …]`\n\nproduces prefix`\"businessid_\"`\n\n. Sequences without a separator are rejected.For each pair of columns from different databases, require distinct non-empty prefixes (so the prefix actually distinguishes the two columns), strip the prefixes, and compute the set intersection of suffixes.\n\nEmit a candidate when the suffix overlap is non-empty. Rank by overlap size, then by\n\n`overlap / min(left, right)`\n\n.\n\n`businessid_42`\n\non one DB and `businessref_42`\n\non another both reduce to suffix `42`\n\n. The algorithm pairs them with score 1.0 and reports the rule as \"shared suffix after stripping distinct prefixes.\"\n\nThe function that computes the common prefix is fewer than 30 lines. It walks character-by-character and trims back to the last separator. Identical inputs produce identical outputs, every time, in any order, on any machine. We can ship the algorithm to a customer and tell them exactly what will happen.\n\n## Case study: column lineage with depth tiering\n\nLineage is the question of where each output column comes from. Given a query with joins, CTEs, and expressions like `SUM(o.amount + o.tax)`\n\n, you need to know which upstream columns flow into each output. An LLM can answer this question in plain English. It cannot answer it precisely enough to chain operations on. Agents that act on lineage need exact edges, not summaries.\n\n`altimate_core.column_lineage`\n\ndoes this as an AST walk. Each output column gets a list of source columns plus a transformation lens: direct projection, function call, expression, or aggregation. The walk is deterministic by construction. Subqueries and CTEs are inlined. Window functions decompose into their partition and order clauses. CASE expressions report all branches as sources.\n\nThe piece worth featuring is the *depth tiering*. The same engine exposes three modes:\n\n**Basic**: direct source-target edges only, with minimal context.** Deep**: adds transformation lenses (expression, function, aggregation), confidence scores, and the internals of every CTE the lineage flows through.**Full**: transitive closure across multi-query chains, plus indirect dependencies from`WHERE`\n\n,`JOIN`\n\n, and`GROUP BY`\n\npredicates that affect row selection.\n\nThe agent picks the depth it needs for the question at hand. The classification of what counts as \"basic\" versus \"deep\" is encoded. It is not an LLM call disguised as one. The pattern repeats: every level of detail an agent can ask for is a level the engine knows how to produce, and the choice between them is a parameter, not a guess.\n\nThe same principle controls the output format. Lineage, like every operation in altimate-core, ships with two formatters: a verbose one for humans, and a compact one optimized for an agent's context window. Compact is the default. The agent gets a headline token (`LINEAGE | 5 cols, 3 tables, depth=basic`\n\n) followed by structured `column <- source`\n\nrows it can parse without spending tokens on prose. Depth chooses how much detail; the format chooses how that detail is encoded; both compound across a trajectory of hundreds of tool calls.\n\n## Case study: data parity\n\nA different shape of problem: prod was migrated to a new warehouse, and we need to know whether the rows match.\n\nThe naive answers are bad. You write a 200-line custom SQL diff query and hope you covered the edge cases, or you ask an LLM to \"compare these two tables\" and get an unactionable summary.\n\nThe engine implements two complementary algorithms. **HashDiff** uses bisection plus checksums for cross-database diffs. It partitions both tables, checksums each segment, recursively bisects any segment whose checksums disagree, and terminates at the row level on exactly the segments that contain mismatches. The bisection lets it pinpoint the exact rows that differ instead of scanning every row, and the module ships with 303 dedicated tests covering every dialect pair we support. **JoinDiff** uses a same-database `FULL OUTER JOIN`\n\nfor when both tables live in the same warehouse. The cost characteristics differ. The correctness guarantee is the same.\n\nThe engine ships with a cooperative state-machine API. It emits the SQL it wants executed, the caller runs it, and the engine steps forward based on the result. There is no point in the loop where the engine guesses, or where its behavior depends on which LLM is in front. Given identical inputs, the diffs come out identical, every time.\n\n\"*Are these two tables byte-equal?*\" is the antithesis of a creative task.\n\n## Smaller wins, in the harness\n\nSo far the story has been about the Rust core. The TypeScript harness gets the same treatment, and the wins are smaller per-instance but they compound.\n\nA single function called `quoteIdentForDialect`\n\nholds the truth about identifier quoting across our supported warehouses. MySQL gets backticks, T-SQL gets brackets, and everything else gets ANSI double quotes. The LLM never has to remember which is which, because it never builds quoted SQL directly. It passes column names through tools that route through `quoteIdentForDialect`\n\n. That is not a clever optimization. It is a category of failure removed from the system. The same pattern applies to `EXPLAIN`\n\n-plan retrieval. Postgres has `EXPLAIN (ANALYZE, BUFFERS)`\n\n. Snowflake has `EXPLAIN USING TEXT`\n\n. BigQuery has no `EXPLAIN`\n\nstatement at all, and you read plans from job metadata. A ~50-line decision table maps each warehouse to its plan-retrieval mechanism. The agent does not ask the model \"what is the BigQuery EXPLAIN syntax?\" because a tool already knows.\n\nSkills are altimate-specific procedures encoded in markdown. The `dbt-develop`\n\nskill prescribes a four-step workflow: plan (check the layer and naming conventions), discover (use schema-search tools, read existing YAML), write (generate SQL with validated joins), and validate (call `altimate_core_validate`\n\n, then `column_lineage`\n\n, then `dbt build`\n\n). The skill does not replace the LLM. It constrains where the LLM is creative and where the harness is prescriptive. Our trace data shows skill-driven sessions complete in fewer turns and with fewer tool errors than ad-hoc sessions on the same task.\n\nNone of these are individually dramatic. Cumulatively they are the difference between an agent that almost works and one you can deploy.\n\n## What thousands of session traces taught us\n\nWe ran the ADE-Bench DuckDB benchmark against altimate-code and captured structured traces for every session. Our benchmark topping score was 74.4%.[[2]](2) The traces are where the interesting story lives.\n\nFailure has three distinct causes, and they require three different responses.\n\nThe first cause is stochastic LLM noise. Two airbnb tasks (`airbnb005`\n\n, `airbnb006`\n\n) failed the original baseline but then passed all three reruns. `asana004`\n\n(the task from the opening) passed once out of three. The inputs are the same, the outputs differ, and the cause is intrinsic to the model. There is nothing to fix here except the model.\n\nThe second cause is deterministic capability gaps. `asana003`\n\nfailed at exactly 16 of 17 dbt tests, every single time across reruns. Same failure mode every time, with the same dbt-test count. The agent's output has a stable bug we can investigate and fix.\n\nThe third cause is bench-fixture bugs. Three airbnb tasks (`airbnb001`\n\n, `airbnb002`\n\n, `airbnb008`\n\n) failed reproducibly, but the root cause was a pre-existing bug in the bench's own dbt incremental models: a broken `is_incremental()`\n\nfilter that collapses the date series to a single row, so the `LAG`\n\nwindow function in the MoM/WoW aggregation returns `NULL`\n\non rerun and overwrites the correct first-run values. This is not an altimate-code problem. The maintainers shipped a test-setup workaround in dbt-labs/ade-bench#106, but the model SQL itself is still broken — fix proposed in [dbt-labs/ade-bench#145](https://github.com/dbt-labs/ade-bench/pull/145).\n\nFrom the outside, all three look identical. The test failed. Only the deterministic substrate (structured traces, reproducible reruns, exact dbt-test counts) lets us separate them. Without that, we would be guessing about what is actually wrong, and our improvement work would be aimed at the wrong target.\n\n## Where the market is\n\nMost coding agents in the field are general-purpose. Cursor, Cline, Claude Code, Codex, and GitHub Copilot Workspace all sit in that category. They are built for the generic developer-with-a-text-editor case. When a data engineer uses them to write a Snowflake query or refactor a dbt model, the tools treat SQL like any other code. Text goes in, text comes out, and the model decides. There is no deterministic substrate for the data-specific operations because there does not need to be one for general coding.\n\nThe data-native AI products (dbt Cloud's AI features, in-warehouse LLMs like Snowflake Cortex) are closer to the right surface area, but they are still primarily LLM wrappers with a different UI. They generate SQL. They do not prove SQL.\n\nOur bet is that for data engineering specifically, the LLM-as-creative-layer / code-as-correctness-layer split is the right architecture, and that the deterministic infrastructure is where the actual moat lives. Models will get better. APIs will get cheaper. Prompt techniques will improve. None of that changes the calculus on whether two queries are semantically equivalent, or whether a cross-warehouse data diff is correct. Those are facts about code and data, decidable by code that operates on code and data.\n\naltimate-code currently sits at number one on the [ADE benchmark](https://www.altimate.sh/benchmarks), ahead of Claude Code, Cortex Code CLI, and dbt Labs, by a wide margin. It also recently scored #1 on the [Berkeley EPIC Data Lab DAB](https://ucbepic.github.io/DataAgentBench/) (Data Agent Benchmark)\n\n## What is still LLM territory\n\nIt would be dishonest to claim we replaced the LLM. We did not, and we will not. The LLM still owns intent-parsing (\"refactor the `asana__project`\n\nmodel into an intermediate one that…\"), code generation, description-writing for dbt model YAML, and strategy under uncertainty. When a build fails for reasons that are not immediately classifiable, the model is what figures out the next move.\n\n## What's next\n\n* Comparative trace analysis across models*. Sonnet, Opus, DeepSeek v4 pro, Kimi K2, and Qwen 2.5 Coder all run through the same altimate-code stack on the same benchmarks. What we don't yet have is a systematic comparison of how they differ on data-engineering tasks — which failure modes are model-specific vs architecture-specific, which models pair best with which skills, where the cost-quality frontier sits per task category. The traces are captured; we’ll share the analysis soon.\n\n* Harness resilience*. The same trace data surfaces a handful of harness-side next moves. We are making the agent loop resilient to truncated model outputs from streaming providers. We are adding a per-generation timeout so a hung request does not tie up a worker for 25 minutes. We are detecting when the agent has fallen into a same-tool-same-argument loop and forcing a replan. Each one is a place where the harness can be more deterministic about how it handles a probabilistic model.\n\n* Cost / quality frontier mapping*. Our DeepSeek v4 pro run on DAB scored 0.5693 stratified Pass@1 for $9 of inference, ahead of the then benchmark leader Pi's Opus 4.6 submission — is one data point in a larger frontier. We’re in touch with DAB maintainers to add this run to the leaderboard. The natural follow-up is systematic: across the model lineup we already run through altimate-code, at what cost does each model fall below a quality threshold, broken out by task category (SQL refactor, dialect translation, lineage extraction, cross-DB joins)? The architecture is model-agnostic; the right backbone for a given customer workload may be category-dependent.\n\n## Conclusion\n\nThe claim is narrower than \"no LLM.\" It is that the LLM stays out of correctness. The probabilistic layer handles probabilistic work. The deterministic layer handles deterministic work. The boundary is where the payoff lives.\n\nProbabilistic models are a remarkable substrate for the things they are good at: language, intent, creativity. They are a poor substrate for proof, equivalence, and reproducibility. The architectural question for any agent in this space is not \"can we use the LLM for X?\" It is *\"should we, and if not, what should we use instead?\"*\n\nFor the data-engineering work altimate-code does, the answer is the correctness layer underneath: a deterministic core in Rust, a deterministic harness in TypeScript, with the LLM kept exactly where it belongs.\n\n[[2]](2): [Defeating Nondeterminism in LLM Inference](https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/)\n\n*altimate-code is an open source project:* *github.com/AltimateAI/altimate-code**.** If you are building in this space and want to compare notes, reach out by email at* info-at-altimate.ai*.*", "url": "https://wpnews.pro/news/the-correctness-layer-how-we-beat-claude-code-on-the-ade-benchmark", "canonical_source": "https://www.altimate.ai/blog/the-correctness-layer-in-ade", "published_at": "2026-05-27 14:31:04+00:00", "updated_at": "2026-05-27 14:46:15.940292+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-tools", "ai-infrastructure", "ai-startups"], "entities": ["altimate-code", "Altimate", "ADE", "DAB", "Claude Code", "Rust", "TypeScript"], "alternates": {"html": "https://wpnews.pro/news/the-correctness-layer-how-we-beat-claude-code-on-the-ade-benchmark", "markdown": "https://wpnews.pro/news/the-correctness-layer-how-we-beat-claude-code-on-the-ade-benchmark.md", "text": "https://wpnews.pro/news/the-correctness-layer-how-we-beat-claude-code-on-the-ade-benchmark.txt", "jsonld": "https://wpnews.pro/news/the-correctness-layer-how-we-beat-claude-code-on-the-ade-benchmark.jsonld"}}