Grounding Claude in truth

The semantic layer for an AI-first data team

AI is breaking down the rigid boundaries between roles. PMs are taking on design and designers are running analysis. But what does that mean for specialists? I don’t think the answer is that specialists go away. I think part of their job changes from doing the work for everyone else to making sure everyone else, and every Agent, can do the work well.

Non-engineers will keep shipping code. Non-analysts will keep running analysis. We are not going backwards. Still, that work can’t be improvised against raw systems. To be most effective, this new way of working will require the people with deep knowledge to encode that expertise into tools, definitions, guardrails, and evals the whole company can use.

That’s the direction we’re taking with Ground Truth, the AI-first semantic layer we’re building at Fin. Ground Truth gives Agents approved metric definitions, SQL templates, and business context to query the warehouse and help teams throughout the business produce accurate analysis.

The problem we ran into

We rolled out Claude Code across the company a few months ago. To say it ripped through the place like wildfire is an understatement. Of the 1,400 people here, around 700 now query our warehouse on a typical day using Claude, up from 220 six months ago.

That unlocked a lot of speed. It also exposed a problem.

When we audited the analysis output, we found consistent errors from the Agent: wrong tables, wrong metric definitions, wrong inferences, and wrong answers. The worrying part was that every error we found had only been caught because a human happened to notice. We had no systematic approach for catching the errors no one was watching for.

During testing, a skill that had been scoped to a single customer was picked up on an org-wide query and returned an automation rate of 80%. The real number was 32%.

When we ran an internal eval across a sample of key metrics, resolution rate, one of our core business metrics, was accurate about 65-70% of the time. The Agent was often close, but close isn’t the bar if the output is going into comp, QBRs, or customer-facing analysis.

What we’re building

We’re building Ground Truth to solve this problem: an AI-first semantic layer for our data warehouse.

The core idea is simple. Instead of letting the Agent guess how to calculate resolution rate or ARR every time it is asked, we codify each definition once in a central place, written and reviewed by the domain experts who own that metric.

When someone asks a data question, the Agent searches the semantic layer first. It pulls the right definition, the right table, the right filters, the right SQL pattern, and the known traps before it touches the warehouse.

The Agent stops improvising from raw schema information and starts from the same definition an analyst would use.

How the definitions are structured

The layer is organised by business domain: Fin, helpdesk, finance, GTM, and so on.

Each domain contains two kinds of files.

The first is a _context.md file. Think of it as the five-minute briefing you’d get if you grabbed an analyst who knows that area of the warehouse and asked them what you need to know before you start querying. It maps the domain: which metrics exist, what their synonyms are, how they relate to each other, which entity filters to apply, and how to interpret common patterns.

The second kind is a metric file, one per metric. Each includes the description, formula, SQL templates at each grain, variants, and the gotchas that tend to trip people up.

For resolution rate, that includes things like: Never average daily rates. Always sum the numerator and denominator separately, then divide.

Use the approved resolved conversation definition, not a status field that happens to look similar. Apply the right customer, paying, and channel filters before aggregating.
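To see why averaging daily rates goes wrong, here is a small worked example. The daily counts are invented for illustration and the function names are hypothetical; only the rule itself (sum the numerator and denominator separately, then divide) comes from the definition above.

```python
# Hypothetical daily counts: (resolved conversations, total conversations).
daily = [(90, 100), (5, 10), (200, 1000)]


def average_of_daily_rates(rows):
    """The trap: every day gets equal weight, regardless of volume."""
    return sum(resolved / total for resolved, total in rows) / len(rows)


def ratio_of_sums(rows):
    """The approved pattern: sum numerator and denominator, then divide."""
    resolved = sum(r for r, _ in rows)
    total = sum(t for _, t in rows)
    return resolved / total


print(f"average of daily rates: {average_of_daily_rates(daily):.3f}")
print(f"ratio of sums:          {ratio_of_sums(daily):.3f}")
```

With these numbers the average of daily rates is about 0.533, while the ratio of sums is about 0.266. The 10-conversation day counts as much as the 1,000-conversation day in the first calculation, which is exactly the distortion the gotcha warns about.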

These files are the source of truth for Agent-facing metric definitions. They are version-controlled, owned by named domain experts, and reviewed like code.
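As a rough illustration of what one metric file carries, the fields can be sketched as a small data structure. This is not the actual Ground Truth file format, which the post does not show. The class name, field names, table name, column names, and owner value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """Hypothetical in-memory view of one metric file."""
    name: str
    description: str
    formula: str
    owner: str                      # the named domain expert who reviews changes
    sql_templates: dict[str, str]   # grain -> approved SQL template
    variants: tuple[str, ...] = ()
    gotchas: tuple[str, ...] = ()

    def template_for(self, grain: str) -> str:
        """Return the approved template for a grain, or fail loudly."""
        if grain not in self.sql_templates:
            raise KeyError(f"{self.name}: no approved template at grain {grain!r}")
        return self.sql_templates[grain]


# Illustrative instance; table and column names are invented.
resolution_rate = MetricDefinition(
    name="resolution_rate",
    description="Share of conversations resolved, per the approved definition.",
    formula="SUM(resolved_conversations) / SUM(total_conversations)",
    owner="domain-analyst (placeholder)",
    sql_templates={
        "month": (
            "SELECT DATE_TRUNC('month', day) AS month, "
            "SUM(resolved_conversations) * 1.0 / SUM(total_conversations) AS resolution_rate "
            "FROM marts.conversations_daily GROUP BY 1"
        ),
    },
    gotchas=("Never average daily rates; sum numerator and denominator, then divide.",),
)
```

Failing on an unknown grain, rather than falling back to some other template, keeps the Agent from quietly answering at a grain nobody approved.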

How the Agent uses it at runtime

At runtime, the Agent does three things.

First, it searches the semantic layer by metric name, synonym, and related concept.

Second, if it finds a match, it injects the full metric definition into context: the correct table, filters, SQL template, grain, variants, and known failure modes.

Third, it generates and runs SQL from that approved definition rather than reconstructing the metric from scratch.

If there is no match, the Agent can still fall back to exploring the warehouse schema on its own. But that answer is treated with lower confidence. More importantly, the miss is logged.
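The lookup, fallback, and miss logging described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the real implementation: the index is a hypothetical in-memory dictionary, matching is plain substring search on names and synonyms (the real search also covers related concepts), and names such as `METRICS`, `build_context`, and `missed_questions` are invented.

```python
import logging
from typing import Optional

logger = logging.getLogger("semantic_layer")

# Hypothetical index: canonical metric name -> synonyms and approved definition.
METRICS = {
    "resolution_rate": {
        "synonyms": {"resolution rate", "resolved rate"},
        "definition": "SUM(resolved_conversations) / SUM(total_conversations)",
    },
}

# Misses are kept so a later process can draft new definitions from them.
missed_questions: list[str] = []


def normalize(text: str) -> str:
    """Lowercase, treat underscores as spaces, collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def find_metric(question: str) -> Optional[str]:
    """Step 1: search by metric name and synonym."""
    q = normalize(question)
    for name, entry in METRICS.items():
        terms = {normalize(name)} | {normalize(s) for s in entry["synonyms"]}
        if any(term in q for term in terms):
            return name
    return None


def build_context(question: str) -> dict:
    """Step 2: inject the approved definition, or fall back and log the miss."""
    name = find_metric(question)
    if name is None:
        missed_questions.append(question)
        logger.info("semantic layer miss: %s", question)
        return {"source": "schema_exploration", "confidence": "low", "definition": None}
    return {
        "source": "semantic_layer",
        "confidence": "high",
        "metric": name,
        "definition": METRICS[name]["definition"],
    }
```

Calling `build_context("How has our resolution rate trended?")` returns the approved definition with high confidence, while an unmatched question returns a low-confidence fallback and lands in `missed_questions`. SQL generation (step three) is left out because it depends on the warehouse and the model.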

How the semantic layer keeps itself current

When the Agent cannot find a definition, the query and surrounding context are logged. A separate Agent then picks it up with the goal of creating a new metric definition or improving an existing one.

That Agent does three things.

First, it searches source tables that could plausibly answer the question. It ranks candidates by schema priority, with curated marts ahead of raw sources, and by how often each table appears in past queries on the same topic.

Second, it drafts the semantic metric file: description, formula, SQL template, variants, and gotchas inferred from related metrics and previous failure patterns.

Third, it routes the change to the right human reviewer. That might be the owner of the underlying table, the most active querier in that area, or the domain analyst.

The human then reviews, edits if required, and approves. The result is a PR into the semantic layer.
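The candidate ranking in the first step could look something like the sketch below. The post does not say how schema priority and query frequency are combined, so this assumes schema priority comes first and past query count breaks ties. The schema names, priority values, and table names are hypothetical.

```python
from dataclasses import dataclass

# Assumed priority order: lower number means a more trusted schema.
SCHEMA_PRIORITY = {"marts": 0, "staging": 1, "raw": 2}


@dataclass(frozen=True)
class Candidate:
    schema: str
    table: str
    past_query_count: int  # appearances in past queries on the same topic


def rank_candidates(candidates):
    """Sort by schema priority first, then by past usage (most used first)."""
    unknown = len(SCHEMA_PRIORITY)  # unrecognised schemas sort last
    return sorted(
        candidates,
        key=lambda c: (SCHEMA_PRIORITY.get(c.schema, unknown), -c.past_query_count),
    )


candidates = [
    Candidate("raw", "events", 500),
    Candidate("marts", "conversations_daily", 40),
    Candidate("marts", "conversations", 120),
]

for c in rank_candidates(candidates):
    print(f"{c.schema}.{c.table} ({c.past_query_count} past queries)")
```

Under this assumption a heavily queried raw table still ranks below every curated mart, which matches the stated preference for marts over raw sources. A weighted score would be a reasonable alternative if usage should be able to outweigh schema.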

The point is that every question the system cannot answer well becomes a chance to improve the system. Accuracy compounds with usage.

Early results

We’re still early, but the results have been strong.

Accuracy on core metrics like resolution rate has moved from ~70% to 100%. How we ultimately evaluate the system is a topic in itself, which I’ll cover in a follow-up post on the eval system we built as part of Ground Truth.

The median number of SQL queries the Agent writes to land an answer dropped from six to two, because it isn’t exploring the warehouse from scratch.

Time to answer is down about 90% for the same reason.

And, just as importantly, the system now has memory. When it fails, we can see where it failed, turn that failure into a reviewed definition, and make the next answer better.

The role shift

As more people across a company run their own analysis, analysts become part of the underlying infrastructure. The repeat questions, like “what’s our resolution rate this quarter” or “how has it trended,” get handled by the system.

Analysts spend their time on higher-leverage work: improving the system, adding guardrails to keep Agent-led analysis accurate and unbiased, and pushing into more complex forms of analysis.

I see specialists across the company undergoing similar changes in responsibility.

Designers will not just design every experience. They’ll build and maintain systems that help others design better. Engineers will build and maintain platforms that make it safe for others to ship.

Analysts will not answer every data question. They will define what good analysis looks like, encode it into the tools, and measure whether the Agents are living up to it.

Source: ideas.fin.ai (original article)