Practical Breakdown of the Value of the Semantic Layer for AI Agents: Results of A/B Testing

An A/B test by DataForge on a Fashion Retail database found that adding a semantic layer improved LLM-generated analytical answer quality by 19%, from 7.27 to 8.67 out of 10. Without semantic context, models made mistakes mirroring those of new analysts, such as confusing revenue with gross sales, despite writing correct SQL syntax.

Over the past two years, numerous expectations have formed around Text-to-SQL. It seemed that the problem had practically been solved: all you had to do was connect GPT, Claude, or another language model to an enterprise data warehouse, after which any employee would be able to get answers to business questions without involving analysts or developers. At conferences and in demos, everything looked exactly like that: a user asks a question in natural language, the model generates an SQL query, returns an answer, and builds a chart. It created the impression that the analyst profession might soon undergo significant change. Practice shows that reality turned out to be much more complicated. Modern LLMs are indeed very good at SQL syntax. They know how to build JOINs, use aggregate functions, perform groupings, and write fairly complex queries. However, the real complexity of analytics does not lie in syntax at all. It lies in understanding what the data actually means and how the company has agreed to interpret it. To test how important this context really is, we conducted an experiment using the DataForge Analyst Agent. We took a real Fashion Retail database containing approximately 6.5 million rows and asked the agent fifteen analytical questions of varying complexity. Each question was asked twice. In the first case, the agent only saw the physical database schema: tables, columns, and data types. In the second case, it was additionally provided with the semantic layer — metric definitions, calculation formulas, business rules, and pre-defined relationships between entities. The only difference between the two scenarios was the presence of semantic context. Everything else remained unchanged. The results turned out to be quite revealing. In thirteen out of fifteen cases, the version with the semantic layer came out ahead. The average answer quality score increased from 7.27 to 8.67 points out of 10, corresponding to an improvement of approximately 19%. An even more interesting observation emerged: the model’s mistakes almost completely mirrored the mistakes a new analyst makes when encountering an unfamiliar data warehouse for the first time. This allows us to view the issue not as a limitation of a particular technology, but as a fundamental characteristic of working with organizational knowledge. Why LLMs Make Mistakes Even If They Know SQL Perfectly When specialists talk about the shortcomings of Text-to-SQL, it often creates the impression that the problem lies in poor query generation quality. Our experiment demonstrated the opposite. In most cases, the SQL itself was written correctly. The mistakes occurred much earlier — at the stage of interpreting business meaning. Imagine a new employee who is placed in front of a data warehouse and told: “Figure it out on your own.” They will see tables and columns, be able to build queries, and even obtain some numbers. However, they will not know how revenue is calculated within the company, what constitutes a “receipt excluding refunds,” which metrics are considered corporate standards, and which exceptions exist in the calculations. This is exactly the situation a language model finds itself in when operating without a semantic layer. The most dangerous outcomes are not the obviously incorrect answers. If the model returns an SQL execution error, the user understands that something has gone wrong. Much more dangerous are answers that look convincing, are accompanied by attractive charts, and differ from the correct values by only a few percentage points. It is precisely these kinds of mistakes that can quietly find their way into management reporting. Mistake №1. AI Confused Revenue and Gross Sales The first question sounded as simple as possible: What is the total amount of sales for 2024? The agent operating without a semantic layer identified the data sales refunds table and the total gross column. From the perspective of a statistical model, this looked entirely logical: if the user is asking about sales, then a monetary field should be summed. The following query was generated: SELECT SUM total gross FROM data sales refunds WHERE date = ‘2024–01–01’ AND date < ‘2025–01–01’; The resulting figure amounted to USD 23.81 billion. If you were to show this SQL to a junior analyst, many would fail to notice the problem. The query is written correctly. The period filter is present. An obvious monetary column is being used. However, the table simultaneously stored both sales and refunds. Employees of the company knew this, but nothing in the data structure explicitly indicated it. Within the business logic, the following rule applied: operation id = 1 → sale operation id = 0 → refund Consequently, net revenue had to be calculated as follows: SUM CASE WHEN operation id = 1 THEN total gross ELSE 0 END — SUM CASE WHEN operation id = 0 THEN total gross ELSE 0 END After applying the correct formula, the picture changed: The difference amounted to USD 3.34 billion. In a training example, this looks like an interesting model error. In a real company, a discrepancy of this magnitude could lead to an incorrect assessment of business performance, flawed plan-versus-actual comparisons, and distorted financial indicators. It is important to note that the model did not make a mistake in SQL. It made a mistake in understanding the business. And that is significantly more difficult to correct than a syntactic inaccuracy. Why the Model Made This Decision Modern LLMs operate based on probabilistic patterns. They analyze field names, the surrounding context, and attempt to reconstruct the most plausible interpretation. The name total gross genuinely resembles a sales metric. From the model’s perspective, its decision was rational. However, corporate KPIs rarely align directly with the physical fields of a database. Net revenue may account for refunds, discounts, bonuses, tax adjustments, and numerous other factors. Without explicit descriptions of these rules, the model is forced to formulate hypotheses. Sometimes it guesses correctly. Sometimes it does not. This is precisely why organizations have invested in KPI standardization for decades. In the era of generative AI, the value of that work only increases. Mistake №2. When Billions of US Dollars Suddenly Become “Receipts” The next question turned out to be even more revealing. Show Receipts before refunds by quarter for 2024. For retail specialists, the term Receipts before refunds is perfectly familiar. It usually refers to the number of sales receipts before accounting for refunds. For a language model, however, it is merely an unfamiliar phrase whose meaning has to be inferred independently. Without a semantic layer, the agent constructed the following calculation: SUM CASE WHEN operation id = 1 THEN total gross ELSE 0 END ASreceipts before refunds The resulting figures looked as follows: The numbers looked convincing. The quarters were present. The trend was present. The SQL query executed without errors. However, an obvious question arose: why is the number of receipts measured in billions of US Dollars? The answer turned out to be simple. The model decided that the metric Receipts was somehow related to monetary turnover. It matched an unfamiliar term with the most probable monetary metric. The correct calculation looked different: COUNT DISTINCT CASE WHEN operation id = 1 THEN receipt number END After that, the results aligned with business expectations: This example clearly demonstrates a fundamental limitation of modern AI systems. If an organization’s terminology is not explicitly described, the model begins interpreting it on its own. The more specific the organization’s language is, the greater the value of the semantic layer. Mistake №3. Why COUNT Is Sometimes More Dangerous Than SUM Counting the number of objects is traditionally perceived as one of the simplest analytical tasks. Our experiment showed that this is precisely where numerous pitfalls are hidden. We asked the following question: Show the number of receipts by city for 2024. Without a semantic layer, the agent used the following construct: COUNT DISTINCT receipt number At first glance, everything looked correct. Unique receipt numbers were being used, duplicates were excluded, and the results were grouped by city. However, it later became clear that the calculation included not only sales transactions but also refund documents. Refunds also have receipt numbers, and from the database’s perspective, they are no different from regular transactions. The correct version looked as follows: COUNT DISTINCT CASE WHEN operation id = 1 THEN receipt number END The difference between the two queries appears minimal. In practice, only a single condition changed. However, it is precisely these kinds of details that determine the quality of corporate analytics. The user sees the final number and assumes that it reflects reality. Yet sometimes thousands of extra records are hidden behind a single additional filter. Errors of this kind are especially dangerous because they propagate into derived metrics. An incorrect receipt count affects conversion rates, average basket size, assessments of store performance, and plan achievement calculations. An error in a primary metric inevitably scales across the entire reporting landscape. Mistake №4. UPT Cannot Be Guessed The example involving the UPT metric proved to be particularly interesting. UPT stands for Units Per Transaction and represents the average number of product items within a single receipt. For retail employees, it is one of the fundamental KPIs. For the model, however, it is simply an abbreviation whose meaning has to be inferred from context. Without a semantic layer, the agent generated the following calculation: COUNT / COUNT DISTINCT receipt number The logic was understandable. The model assumed that the number of rows corresponded to the number of units sold, while the number of receipts represented the number of transactions. The resulting values fell within the range of 3.25 to 4.03. However, the corporate definition of the metric looked different: COUNT DISTINCT receipt position / COUNT DISTINCT receipt number Additionally, only sales transactions were included: CASE WHEN operation id = 1 THEN … END After applying the correct formula, the metric was reduced by almost half, falling into the range of 1.63 to 2.01. This example illustrates an important point. Knowing SQL does not automatically mean understanding analytics. It is entirely possible to write a technically correct query and still arrive at the wrong business result. In many cases, knowing the formula is more important than knowing the query language. Why the Semantic Layer Turned Out to Be So Effective Across all of the examples discussed above, the semantic layer performed the same function: it eliminated ambiguity. Instead of guessing the meaning of a metric, the agent received a ready-made definition. Instead of exploring possible table join paths, it relied on pre-defined JOINs. Instead of statistical assumptions, it applied corporate standards. For example, the agent was provided with definitions such as the following: In essence, the semantic layer acted as methodological guidance from an experienced analyst. It explained to the model not only the structure of the data, but also the rules governing its interpretation. An Interesting Observation: Semantics Is Not Always Necessary Interestingly, in two of the questions, the version without the semantic layer received higher scores. For example, when the task simply required listing countries, the additional context offered almost no advantage. This result also carries practical value. It demonstrates that the semantic layer is especially useful when composite metrics, hidden business rules, complex relationships between entities, and corporate KPIs are involved. If the task boils down to a straightforward lookup query, the model is often perfectly capable of handling it on its own. This means that when designing AI analysts, attention should be paid not only to the completeness of the semantic model, but also to the appropriate amount of context provided. Our experiment allows us to formulate several recommendations First, organizations should avoid connecting LLMs directly to enterprise data warehouses without first documenting the business logic. The existence of a database schema alone does not guarantee correct answers. Second, corporate KPIs should be documented centrally. Metrics such as UPT, LFL, conversion rate, profitability, plan attainment, or refund rate should all have a single, agreed-upon definition. Third, AI systems should be tested using real user questions. Tests such as “How many rows are in this table?” reveal almost nothing about analytical quality. The real difficulties emerge in questions related to decision-making. Finally, it is important to remember that artificial intelligence does not replace domain knowledge. It merely enables organizations to apply their accumulated expertise more quickly. Our A/B test demonstrated that the use of a semantic layer improved the quality of analytical answers by approximately 19% and delivered an advantage in 87% of the questions. However, the most important outcome of the experiment was not the numbers themselves, but a deeper understanding of the nature of artificial intelligence errors. Modern LLMs genuinely have an excellent command of SQL. They can explore data schemas and construct sophisticated queries. But they do not know what your organization considers to be revenue, which receipts should be excluded from calculations, or how internal KPIs are defined. For many years, semantic models were viewed as a supporting element of BI platforms — an instrument for building reports and enabling self-service analytics. The emergence of AI analysts unexpectedly changed their role. It turned out that the semantic layer is precisely the mechanism that enables a general-purpose language model to become an analyst who understands the specifics of a particular business. In other words, the era of generative AI has not diminished the value of an organization’s accumulated knowledge. On the contrary, it has transformed that knowledge into one of the key determinants of analytical quality. Perhaps the most unexpected conclusion from our experiment is that the era of generative AI does not reduce the value of the work carried out by BI and DWH teams over the past decades. On the contrary, it turns that work into the foundation for the next generation of analytical tools. Where semantic models were previously used primarily for reporting and self-service analytics, they are now becoming the backbone of enterprise AI agents. This is why an increasing number of organizations are beginning to treat semantic management as a strategic asset. One example of this approach has been implemented in the DataForge https://dataforge.one/ platform, where a unified business model is used simultaneously for BI, documentation, and AI assistant workflows. semantic layer, semantic layer for AI, AI agents, enterprise AI agents, Text-to-SQL, Text to SQL, Text-to-SQL accuracy, LLM analytics, LLM for business intelligence, LLM for SQL generation, SQL generation with LLMs, AI-powered analytics, AI analyst, AI data analyst, semantic modeling, enterprise semantic layer Practical Breakdown of the Value of the Semantic Layer for AI Agents: Results of A/B Testing https://pub.towardsai.net/practical-breakdown-of-the-value-of-the-semantic-layer-for-ai-agents-results-of-a-b-testing-5a2c6b130e22 was originally published in Towards AI https://pub.towardsai.net on Medium, where people are continuing the conversation by highlighting and responding to this story.