{"slug": "grounding-claude-in-truth", "title": "Grounding Claude in truth", "summary": "Anthropic's Claude AI agent was producing inaccurate data analysis for Fin, a company where 700 of 1,400 employees now query the warehouse daily using the tool. Internal audits found consistent errors including wrong tables, metric definitions, and inferences, with resolution rate accuracy at only 65-70% — prompting Fin to build Ground Truth, an AI-first semantic layer that codifies approved metric definitions, SQL templates, and business context to prevent agents from improvising against raw schema.", "body_md": "# Grounding Claude in truth\n\n### The semantic layer for an AI-first data team\n\nAI is breaking down the rigid boundaries between roles. PMs are taking on design and designers are running analysis. But what does that mean for specialists? I don’t think the answer is that specialists go away. I think part of their job changes from doing the work for everyone else to making sure everyone else, and every Agent, can do the work **well**.\n\nNon-engineers will keep shipping code. Non-analysts will keep running analysis. We are not going backwards. Still, that work can’t be improvised against raw systems. To be most effective, this new way of working will require the people with deep knowledge to encode that expertise into tools, definitions, guardrails, and evals the whole company can use.\n\nThat’s the direction we’re taking with Ground Truth, the AI-first semantic layer we’re building at Fin. Ground Truth gives Agents approved metric definitions, SQL templates, and business context to query the warehouse and help teams throughout the business produce accurate analysis.\n\n## The problem we ran into\n\nWe rolled out Claude Code across the company a few months ago. To say it ripped through the place like wildfire is an understatement. 
Of the 1,400 people here, around 700 now query our warehouse on a typical day using Claude, up from 220 six months ago.\n\nThat unlocked a lot of speed. It also exposed a problem.\n\nWhen we audited the analysis output, we found consistent errors from the Agent: wrong tables, wrong metric definitions, wrong inferences, and wrong answers. The worrying part was that every error we found had only been caught because a human happened to notice. We had no systematic approach for catching the errors no one was watching for.\n\nDuring testing, a skill that had been scoped to a single customer was picked up on an org-wide query and returned an 80% automation rate as the number. The real number was 32%.\n\nWhen we ran an internal eval across a sample of key metrics, resolution rate, one of our core business metrics, was accurate about 65-70% of the time. The Agent was often close, but close isn’t the bar if the output is going into comp, QBRs, or customer-facing analysis.\n\n## What we’re building\n\nWe’re building Ground Truth to solve this problem: an AI-first semantic layer for our data warehouse.\n\nThe core idea is simple. Instead of letting the Agent guess how to calculate resolution rate or ARR every time it is asked, we codify each definition once in a central place, written and reviewed by the domain experts who own that metric.\n\nWhen someone asks a data question, the Agent searches the semantic layer first. It pulls the right definition, the right table, the right filters, the right SQL pattern, and the known traps before it touches the warehouse.\n\nThe Agent stops improvising from raw schema information and starts from the same definition an analyst would use.\n\n## How the definitions are structured\n\nThe layer is organised by business domain: Fin, helpdesk, finance, GTM, and so on.\n\nEach domain contains two kinds of files.\n\nThe first is a _context.md file. 
Think of it as the five-minute briefing you’d get if you grabbed an analyst who knows that area of the warehouse and asked them what you need to know before you start querying. It maps the domain: which metrics exist, what their synonyms are, how they relate to each other, which entity filters to apply, and how to interpret common patterns.\n\nThe second kind is a metric file - one per metric. Each includes the description, formula, SQL templates at each grain, variants, and the gotchas that tend to trip people up.\n\nFor resolution rate, that includes things like:\n\nNever average daily rates. Always sum the numerator and denominator separately, then divide.\n\nUse the approved resolved conversation definition, not a status field that happens to look similar.\n\nApply the right customer, paying, and channel filters before aggregating.\n\nThese files are the source of truth for Agent-facing metric definitions. They are version-controlled, owned by named domain experts, and reviewed like code.\n\n## How the Agent uses it at runtime\n\nAt runtime, the Agent does three things.\n\nFirst, it searches the semantic layer by metric name, synonym, and related concept.\n\nSecond, if it finds a match, it injects the full metric definition into context: the correct table, filters, SQL template, grain, variants, and known failure modes.\n\nThird, it generates and runs SQL from that approved definition rather than reconstructing the metric from scratch.\n\nIf there is no match, the Agent can still fall back to exploring the warehouse schema on its own. But that answer is treated with lower confidence.\n\nMore importantly, the miss is logged.\n\n## How the semantic layer keeps itself current\n\nWhen the Agent cannot find a definition, the query and surrounding context are logged. 
A separate Agent then picks it up with the goal of creating a new metric definition or improving an existing one.\n\nThat Agent does three things.\n\nFirst, it searches source tables that could plausibly answer the question. It ranks candidates by schema priority, with curated marts ahead of raw sources, and by how often each table appears in past queries on the same topic.\n\nSecond, it drafts the semantic metric file: description, formula, SQL template, variants, and gotchas inferred from related metrics and previous failure patterns.\n\nThird, it routes the change to the right human reviewer. That might be the owner of the underlying table, the most active querier in that area, or the domain analyst.\n\nThe human then reviews, edits if required, and approves. The result is a PR into the semantic layer.\n\nThe point is that every question the system cannot answer well becomes a chance to improve the system. Accuracy compounds with usage.\n\n## Early results\n\nWe’re still early, but the results have been strong.\n\nAccuracy on core metrics like resolution rate has moved from ~70% to 100%. How we ultimately evaluate the system is a topic in itself, which I’ll cover in a follow-up post on the eval system we built as part of Ground Truth.\n\nThe median number of SQL queries the Agent writes to land an answer dropped from six to two, because it isn’t exploring the warehouse from scratch.\n\nTime to answer is down about 90% for the same reason.\n\nAnd, just as importantly, the system now has memory. When it fails, we can see where it failed, turn that failure into a reviewed definition, and make the next answer better.\n\n## The role shift\n\nAs more people across a company run their own analysis, analysts become part of the underlying infrastructure. 
The repeat questions, like “what’s our resolution rate this quarter” or “how has it trended,” get handled by the system.\n\nAnalysts spend their time on higher-leverage work: improving the system, adding guardrails to keep Agent-led analysis accurate and unbiased, and pushing into more complex forms of analysis.\n\nI see specialists across the company undergoing similar changes in responsibility.\n\nDesigners will not just design every experience. They’ll build and maintain systems that help others design better. Engineers will build and maintain platforms that make it safe for others to ship.\n\nAnalysts will not answer every data question. They will define what good analysis looks like, encode it into the tools, and measure whether the Agents are living up to it.", "url": "https://wpnews.pro/news/grounding-claude-in-truth", "canonical_source": "https://ideas.fin.ai/p/grounding-claude-in-truth", "published_at": "2026-06-05 14:35:58+00:00", "updated_at": "2026-06-05 15:09:29.976717+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "ai-products", "ai-tools"], "entities": ["Claude", "Fin", "Claude Code"], "alternates": {"html": "https://wpnews.pro/news/grounding-claude-in-truth", "markdown": "https://wpnews.pro/news/grounding-claude-in-truth.md", "text": "https://wpnews.pro/news/grounding-claude-in-truth.txt", "jsonld": "https://wpnews.pro/news/grounding-claude-in-truth.jsonld"}}