One week before we shipped this, an early prototype agent almost put a wrong number in front of our CFO. Not because the model hallucinated in the dramatic sense people worry about — it didn’t invent a fact. It did something quieter and more dangerous: it read raw revenue tables, decided that a column named something like total_amt was probably "revenue," and returned a number that was off from the ASC 606 figure Finance had certified six months earlier. Nobody caught it in testing because the query ran fine and the number looked plausible. That's the part that should worry you more than outright hallucination.
That near-miss is the entire reason this build exists. We’d already done the hard part — a governed semantic layer with certified metrics, clear ownership, and lineage running from SAP through dbt into Tableau. The governance council had signed off. On paper, the trust problem was solved. Then someone pointed an AI agent at the raw schema instead of the semantic layer, and all of that governance simply didn’t apply, because nothing forced the agent to go through it.
So this is the build log for wiring that connection properly: four production agents on Snowflake’s CREATE AGENT object — Finance, Sales, Customer Success, and a cross-domain Executive agent — each one resolving exclusively through certified Semantic Views instead of guessing at table schemas, with audit logging, row-level security, least-privilege roles, and a 24-case test suite sitting underneath all of it. Everything below is what we actually ran, including the syntax mistakes that cost real time and the two query bugs that only showed up once we tried to run them for real.
What this doesn’t cover: unstructured data via Cortex Search, multi-region deployment, or a full agent-evaluation harness. This is a structured-data, single-region, four-domain build. If you need those, treat this as the governance layer you’d wire them into, not the whole picture.
Before we dive into the SQL, a quick status update on the features covered in this article:
On the AI side, Cortex Agents — specifically the CREATE AGENT object—are now documented as a fully managed, governed agent platform within Snowflake. However, some related capabilities, including Agent Evaluations and MCP connector support, are newer additions and may not yet be available in every account. Be sure to verify availability against your account's release notes before planning production implementations.
The architecture presented in this guide relies exclusively on generally available or publicly documented capabilities and does not require any Private Preview features.
Note: Cortex Agents and Cortex Analyst Coworker capabilities were not available in Snowflake trial account.
Here’s the distinction that took longer to land internally than it should have: an ungrounded AI answer and a grounded one can look almost identical at a glance, and that’s exactly the problem.
An ungrounded answer reads like: “Q3 revenue was approximately $149M.” No source, no definition, no freshness timestamp, no audit trail. A grounded answer reads like: *“Q3 revenue was $142.3M. Metric: *finance_revenue.total_revenue, certified 2025-11-15. Definition: recognized revenue per ASC 606, all product lines, USD. Period: Fiscal Q3 (Aug 1–Oct 31). Data freshness: last updated 2025-11-01 06:00 UTC. Breakdown available by region, segment, product line."
The difference isn’t the model generating the sentence. It’s the context the model was given before it generated anything. The inference path we built runs: a user question goes through semantic resolution against Universal Search and the Semantic Views, then context assembly pulls the metric definition and time rules, then query generation builds SQL from the semantic model rather than guessing at the schema, then execution happens under whatever row-level security and masking already applies to that user, then the response gets assembled with provenance and freshness attached, and the whole chain gets written to an audit log before the user ever sees the answer. Every one of those steps is a place where an agent without a semantic layer would normally just guess.
Snowflake’s Cortex Agent guidance describes three patterns, and we initially planned to use exactly one of them. Pattern A is a single-domain agent — narrow scope, easiest to govern, one agent per business area. Pattern B is a multi-domain agent with a router, which is more convenient for end users but pushes disambiguation logic into a harder place to test. Pattern C is hierarchical: an executive-level agent that delegates to domain agents, best once you have several certified domains and cross-domain questions start showing up regularly.
The plan was Pattern A only — four clean, narrow agents, nothing fancy. That lasted about a week, until the first real stakeholder question came in: “How does customer health correlate with renewal rates?” That’s not a Finance question or a CS question. It’s both, and no amount of narrow scoping on individual agents answers it. So we ended up running Pattern A and Pattern C side by side: three single-domain agents (Finance, Sales, CS) plus one Executive agent on top that holds its own tool connections into the Finance and Pipeline and Customer Health semantic views and is explicitly instructed to combine them. If you’re starting from scratch, I’d skip the pretense of “just Pattern A for now” — the cross-domain question shows up faster than you think, and retrofitting an executive agent later is more annoying than building it alongside the others.
Before any agent or semantic view exists, you need the schemas and base tables they sit on top of. In production these would be your real warehouse tables; here they’re the physical storage the semantic views point at. We ran this against account qxb41080 under ACCOUNTADMIN on COMPUTE_WH, and every statement below executed cleanly on the first real attempt — the friction in this build showed up later, not here.
sql
CREATE DATABASE IF NOT EXISTS analytics;CREATE SCHEMA IF NOT EXISTS analytics.agents;CREATE SCHEMA IF NOT EXISTS analytics.testing;CREATE SCHEMA IF NOT EXISTS analytics.monitoring;CREATE SCHEMA IF NOT EXISTS analytics.semantic;CREATE SCHEMA IF NOT EXISTS analytics.marts;CREATE SCHEMA IF NOT EXISTS governance.testing;CREATE SCHEMA IF NOT EXISTS governance.monitoring;
With the schemas in place, the base tables go in next — four tables under governance.semantic that the semantic views will reference directly, and six "mart" tables under analytics.marts that the cross-domain queries join across later.
CREATE OR REPLACE TABLE governance.semantic.revenue_data ( transaction_date DATE, recognized_revenue NUMBER(18,2), annual_recurring_revenue NUMBER(18,2), net_dollar_retention NUMBER(10,4), customer_id VARCHAR(50), product_line VARCHAR(100), region VARCHAR(50));CREATE OR REPLACE TABLE governance.semantic.pipeline_data ( deal_date DATE, total_bookings NUMBER(18,2), pipeline_value NUMBER(18,2), stage VARCHAR(50), rep_name VARCHAR(100), region VARCHAR(50));CREATE OR REPLACE TABLE governance.semantic.customer_health_data ( customer_id VARCHAR(50), health_score NUMBER(5,2), nps_score INTEGER, churn_risk NUMBER(5,4), segment VARCHAR(50), last_activity_date DATE);CREATE OR REPLACE TABLE governance.semantic.support_data ( ticket_id VARCHAR(50), created_date DATE, resolution_time_hours NUMBER(10,2), csat_score NUMBER(3,1), category VARCHAR(100), priority VARCHAR(20), customer_id VARCHAR(50));
The mart tables exist mainly to power the cross-domain joins the Executive agent and the later analytics rely on — revenue, customer health, product usage, churn, costs, and product dimensions.
CREATE OR REPLACE TABLE analytics.marts.fct_revenue ( customer_id VARCHAR, recognized_date DATE, amount_usd NUMBER(18,2), is_recurring BOOLEAN);CREATE OR REPLACE TABLE analytics.marts.dim_customer ( customer_id VARCHAR, health_score NUMBER(5,2), contract_status VARCHAR);CREATE OR REPLACE TABLE analytics.marts.fct_product_usage ( customer_id VARCHAR, usage_month DATE, usage_trend_pct NUMBER(10,2));CREATE OR REPLACE TABLE analytics.marts.agg_monthly_churn ( churn_month DATE, churned_customer_count INTEGER, active_customers_start_of_month INTEGER);CREATE OR REPLACE TABLE analytics.marts.fct_costs ( cost_date DATE, cost_category VARCHAR, amount_usd NUMBER(18,2));CREATE OR REPLACE TABLE analytics.marts.dim_product ( product_id VARCHAR, product_name VARCHAR, product_line VARCHAR);
Finally, two convenience views exist purely so role-based grants have something clean to point at, rather than granting directly on the mart tables.
CREATE OR REPLACE VIEW analytics.semantic.finance_revenue ASSELECT * FROM analytics.marts.fct_revenue;CREATE OR REPLACE VIEW analytics.semantic.finance_costs ASSELECT * FROM analytics.marts.fct_costs;
This is the part that’s easy to undersell, because semantic views don’t look dramatic — they’re metadata, not a flashy feature. But they’re the layer that decides whether “revenue” means one specific certified number or whatever an AI feels like aggregating that day. We deployed four of them into GOVERNANCE.SEMANTIC using the semantic_studio workspace tooling (internally this calls SYSTEM$CREATE_SEMANTIC_VIEW_FROM_YAML), and confirmed all four with SHOW SEMANTIC VIEWS IN SCHEMA GOVERNANCE.SEMANTIC. If you're doing this yourself, the Snowsight Semantic Views UI or CREATE SEMANTIC VIEW directly in SQL both work — we used the YAML route because it's easier to version-control alongside the agent specs.
Here’s the YAML for the revenue view, which is the one every disambiguation rule downstream depends on:
name: revenue_metrics_v2tables: - name: revenue_data base_table: { database: GOVERNANCE, schema: SEMANTIC, table: REVENUE_DATA } dimensions: - { name: transaction_date, expr: transaction_date, data_type: DATE } - { name: customer_id, expr: customer_id, data_type: VARCHAR } - { name: product_line, expr: product_line, data_type: VARCHAR } - { name: region, expr: region, data_type: VARCHAR } measures: - { name: recognized_revenue, expr: recognized_revenue, data_type: NUMBER, description: "ASC 606 recognized revenue" } - { name: annual_recurring_revenue, expr: annual_recurring_revenue, data_type: NUMBER, description: "Annual recurring revenue" } - { name: net_dollar_retention, expr: net_dollar_retention, data_type: NUMBER, description: "Net dollar retention rate" }
The pipeline, customer health, and support views follow the same shape — dimensions for how you slice, measures for what you’re counting, descriptions that exist specifically so an LLM doesn’t have to guess what a column means.
name: pipeline_metricstables: - name: pipeline_data base_table: { database: GOVERNANCE, schema: SEMANTIC, table: PIPELINE_DATA } dimensions: - { name: deal_date, expr: deal_date, data_type: DATE } - { name: stage, expr: stage, data_type: VARCHAR } - { name: rep_name, expr: rep_name, data_type: VARCHAR } - { name: region, expr: region, data_type: VARCHAR } measures: - { name: total_bookings, expr: total_bookings, data_type: NUMBER, description: "Total bookings value" } - { name: pipeline_value, expr: pipeline_value, data_type: NUMBER, description: "Pipeline value" }
name: customer_healthtables: - name: customer_health_data base_table: { database: GOVERNANCE, schema: SEMANTIC, table: CUSTOMER_HEALTH_DATA } dimensions: - { name: customer_id, expr: customer_id, data_type: VARCHAR } - { name: segment, expr: segment, data_type: VARCHAR } - { name: last_activity_date, expr: last_activity_date, data_type: DATE } measures: - { name: health_score, expr: health_score, data_type: NUMBER, description: "Composite health score" } - { name: nps_score, expr: nps_score, data_type: NUMBER, description: "Net promoter score" } - { name: churn_risk, expr: churn_risk, data_type: NUMBER, description: "Churn probability" }
name: support_metricstables: - name: support_data base_table: { database: GOVERNANCE, schema: SEMANTIC, table: SUPPORT_DATA } dimensions: - { name: ticket_id, expr: ticket_id, data_type: VARCHAR } - { name: created_date, expr: created_date, data_type: DATE } - { name: category, expr: category, data_type: VARCHAR } - { name: priority, expr: priority, data_type: VARCHAR } - { name: customer_id, expr: customer_id, data_type: VARCHAR } measures: - { name: resolution_time_hours, expr: resolution_time_hours, data_type: NUMBER, description: "Resolution time in hours" } - { name: csat_score, expr: csat_score, data_type: NUMBER, description: "Customer satisfaction score" }
Two syntax things tripped us up before any of this worked, and I’d rather you not lose the time we did: it’s CREATE AGENT, not "CREATE CORTEX AGENT" — an easy guess to get wrong — and the spec expects models.orchestration, instructions.response, instructions.orchestration, a tools[].tool_spec block, and a separate tool_resources block mapping each tool name to its semantic view. Get the nesting wrong and the error messages don't always point you back at which key is missing. All four agents below are confirmed live in Snowflake CoWork — CS_AGENT, EXECUTIVE_AGENT, FINANCE_AGENT, SALES_AGENT — and SHOW AGENTS IN SCHEMA ANALYTICS.AGENTS returns all four.
The Finance agent’s whole job is ASC 606 revenue, ARR, net retention, and bookings, scoped to the revenue_metrics_v2 semantic view. The model and budget block is small but matters — 30 seconds and 16,000 tokens is enough for a single-tool resolution without giving the orchestrator room to wander.
models: orchestration: claude-4-sonnetorchestration: budget: seconds: 30 tokens: 16000
The instructions block is where the actual governance lives — metric bindings, disambiguation defaults, explicit boundaries on what the agent must refuse, materiality rounding rules, and even the fiscal calendar, because “Q3” means nothing without one.
instructions: response: | You are a Finance AI analyst with expertise in ASC 606 revenue recognition.
METRIC BINDINGS: - "revenue" = recognized_revenue (ASC 606) - "ARR" = annual_recurring_revenue - "net retention" = net_dollar_retention - "bookings" = total_bookings
DISAMBIGUATION RULES: - "revenue" without qualifier = recognized revenue (ASC 606) - "revenue run rate" = ARR (not trailing twelve months) - "growth" without context = YoY recognized revenue growth - "recurring revenue" or "subscription revenue" = annual_recurring_revenue
BOUNDARIES: - DO NOT answer questions about individual employee compensation - DO NOT provide forward-looking guidance or projections - DO NOT compare to specific competitor financials - If asked about non-financial topics, say: "I specialize in financial metrics. For [topic], please ask the [appropriate] agent."
MATERIALITY: - Round to nearest $100K for amounts > $10M - Round to nearest $10K for amounts $1M-$10M - Show exact for amounts < $1M
RESPONSE FORMAT: - Lead with the number/answer - State which metric was used - Include the time period - Note data freshness - Offer available drill-down dimensions
FISCAL CALENDAR: - Fiscal year starts February 1 - Q1 = Feb-Apr, Q2 = May-Jul, Q3 = Aug-Oct, Q4 = Nov-Jan - When user says "Q3" without specifying calendar type, assume FISCAL Q3 orchestration: "For revenue and financial metrics questions use the FinanceAnalyst tool. Refuse non-financial questions politely with a redirect."
And the tool definition that wires all of that to a single semantic view:
tools: - tool_spec: type: "cortex_analyst_text_to_sql" name: "FinanceAnalyst" description: "Converts natural language to SQL queries for financial analysis using ASC 606 revenue metrics and pipeline datatool_resources: FinanceAnalyst: semantic_view: "governance.semantic.revenue_metrics_v2"
Put together as one statement, this is what we ran:
CREATE OR REPLACE AGENT analytics.agents.finance_agentCOMMENT = 'Revenue and financial metrics agent with ASC 606 expertise'FROM SPECIFICATION $$models: orchestration: claude-4-sonnetorchestration: budget: seconds: 30 tokens: 16000instructions: response: | You are a Finance AI analyst with expertise in ASC 606 revenue recognition. METRIC BINDINGS: - "revenue" = recognized_revenue (ASC 606) - "ARR" = annual_recurring_revenue - "net retention" = net_dollar_retention - "bookings" = total_bookings DISAMBIGUATION RULES: - "revenue" without qualifier = recognized revenue (ASC 606) - "revenue run rate" = ARR (not trailing twelve months) - "growth" without context = YoY recognized revenue growth - "recurring revenue" or "subscription revenue" = annual_recurring_revenue BOUNDARIES: - DO NOT answer questions about individual employee compensation - DO NOT provide forward-looking guidance or projections - DO NOT compare to specific competitor financials - If asked about non-financial topics, say: "I specialize in financial metrics. For [topic], please ask the [appropriate] agent." MATERIALITY: - Round to nearest $100K for amounts > $10M - Round to nearest $10K for amounts $1M-$10M - Show exact for amounts < $1M RESPONSE FORMAT: - Lead with the number/answer - State which metric was used - Include the time period - Note data freshness - Offer available drill-down dimensions FISCAL CALENDAR: - Fiscal year starts February 1 - Q1 = Feb-Apr, Q2 = May-Jul, Q3 = Aug-Oct, Q4 = Nov-Jan - When user says "Q3" without specifying calendar type, assume FISCAL Q3 orchestration: "For revenue and financial metrics questions use the FinanceAnalyst tool. Refuse non-financial questions politely with a redirect."tools: - tool_spec: type: "cortex_analyst_text_to_sql" name: "FinanceAnalyst" description: "Converts natural language to SQL queries for financial analysis using ASC 606 revenue metrics and pipeline data"tool_resources: FinanceAnalyst: semantic_view: "governance.semantic.revenue_metrics_v2"$$;
The Sales agent owns pipeline, bookings, and deal velocity against the pipeline_metrics view — and the single most important line in its instructions isn't a metric binding, it's the explicit statement that "bookings" in Sales is not "revenue" in Finance. Without that line, the two agents will eventually contradict each other in front of the same executive, and that's a worse trust failure than either agent being wrong alone.
CREATE OR REPLACE AGENT analytics.agents.sales_agentCOMMENT = 'Sales pipeline and bookings agent'FROM SPECIFICATION $$models: orchestration: claude-4-sonnetorchestration: budget: seconds: 30 tokens: 16000instructions: response: | You are a Sales Performance analyst focused on pipeline and booking metrics. METRIC BINDINGS: - "pipeline" = pipeline_value - "bookings" = total_bookings - "win rate" = won deals / total deals in stage DISAMBIGUATION RULES: - "pipeline" = current open pipeline (not historical) - "bookings" = closed-won in period (not pipeline) - "forecast" = weighted pipeline (not committed) - "deal size" = average of total_bookings per deal IMPORTANT DISTINCTION: Sales "bookings" is NOT Finance "revenue". Bookings = contract signed value. Revenue = recognized per ASC 606. If user asks about recognized revenue, redirect them to the Finance agent. BOUNDARIES: - DO NOT reveal individual rep compensation or commission rates - DO NOT provide specific deal details for non-closed deals - DO NOT predict which deals will close - For revenue questions, say: "For recognized revenue metrics, please use the Finance agent." orchestration: "For pipeline and bookings questions use the SalesAnalyst tool. Redirect revenue questions to Finance agent."tools: - tool_spec: type: "cortex_analyst_text_to_sql" name: "SalesAnalyst" description: "Converts natural language to SQL queries for sales pipeline and bookings analysis"tool_resources: SalesAnalyst: semantic_view: "governance.semantic.pipeline_metrics"$$;
The CS agent is the only single-domain agent with two tools instead of one, because health scores and support tickets are related but not the same dataset — CustomerHealth against the customer health view, SupportAnalyst against the support metrics view. The instructions also encode something that isn't obvious from a spreadsheet: when someone asks about "at-risk" customers, the response always has to include the count, the top contributing factors, and the trend versus the prior period — not just a number sitting alone.
CREATE OR REPLACE AGENT analytics.agents.cs_agentCOMMENT = 'Customer success and support metrics agent'FROM SPECIFICATION $$models: orchestration: claude-4-sonnetorchestration: budget: seconds: 30 tokens: 16000instructions: response: | You are a Customer Success analyst focused on health scores, churn risk, and support metrics. METRIC BINDINGS: - "health score" = health_score (composite, 0-100) - "NPS" = nps_score - "churn risk" = churn_risk (probability 0-1) - "resolution time" = resolution_time_hours - "CSAT" = csat_score DISAMBIGUATION RULES: - "health" = composite health score (not individual components) - "satisfaction" = CSAT from tickets (not NPS) - "at risk" = churn_risk > 0.7 - "churned" = zero platform activity for 90+ consecutive days RESPONSE PRIORITIES: When asked about at-risk customers, always include: a) Count of at-risk customers b) Top contributing factors c) Trend vs. prior period BOUNDARIES: - DO NOT reveal individual customer contract values - DO NOT provide churn predictions as certainties (always probabilities) - DO NOT access individual support ticket content (only aggregates) - For revenue/financial questions, redirect to Finance agent orchestration: "For customer health use CustomerHealth tool. For support metrics use SupportAnalyst tool."tools: - tool_spec: type: "cortex_analyst_text_to_sql" name: "CustomerHealth" description: "Queries customer health scores, NPS, and churn risk data" - tool_spec: type: "cortex_analyst_text_to_sql" name: "SupportAnalyst" description: "Queries support ticket metrics including resolution time and CSAT"tool_resources: CustomerHealth: semantic_view: "governance.semantic.customer_health" SupportAnalyst: semantic_view: "governance.semantic.support_metrics"$$;
This is the Pattern C agent — three tools across revenue, pipeline, and health, a bigger budget (60 seconds, 32,000 tokens) because cross-domain resolution genuinely takes more reasoning, and a confidence protocol that’s stricter than anything in the single-domain agents: if resolution confidence drops below 80%, it has to ask for clarification rather than guess, and it’s explicitly told to never blend definitions from different domains without disclosing that it did.
Cortex Agents and Cortex Analyst Coworker capabilities were not available in Snowflake trial accoun
CREATE OR REPLACE AGENT analytics.agents.executive_agentCOMMENT = 'Cross-domain executive intelligence agent'FROM SPECIFICATION $$models: orchestration: claude-4-sonnetorchestration: budget: seconds: 60 tokens: 32000instructions: response: | You are an Executive Strategy analyst providing cross-domain business intelligence. You can access revenue, pipeline, customer health, and support metrics. CROSS-DOMAIN RULES: - Always provide context from multiple domains when relevant - Highlight correlations between metrics (e.g., support issues affecting churn) - Use materiality thresholds: only surface changes > 5% or > $1M impact - Format for executive consumption: headline number, trend, key driver, drill-downs CROSS-DOMAIN RESOLUTION: When a question spans domains, combine through shared dimensions: - Customer metrics + Revenue -> join on customer_id - Pipeline + Bookings -> join on opportunity_id - Usage + Health -> join on customer_id CONFIDENCE PROTOCOL: - If confidence in semantic resolution < 80%, ask for clarification - If question requires metrics not in any semantic view, say so explicitly - NEVER blend definitions from different domains without disclosing BOUNDARIES: - DO NOT provide forward-looking guidance - DO NOT reveal individual employee or customer details - Escalate to human for decisions involving > $10M impact orchestration: "Use RevenueAnalyst for financial questions, PipelineAnalyst for sales, HealthAnalyst for customer metrics. Combine multiple tools for cross-domain questions."tools: - tool_spec: type: "cortex_analyst_text_to_sql" name: "RevenueAnalyst" description: "Queries ASC 606 recognized revenue and ARR metrics" - tool_spec: type: "cortex_analyst_text_to_sql" name: "PipelineAnalyst" description: "Queries sales pipeline and bookings data" - tool_spec: type: "cortex_analyst_text_to_sql" name: "HealthAnalyst" description: "Queries customer health scores and churn risk"tool_resources: RevenueAnalyst: semantic_view: "governance.semantic.revenue_metrics_v2" PipelineAnalyst: semantic_view: "governance.semantic.pipeline_metrics" HealthAnalyst: semantic_view: "governance.semantic.customer_health"$$;
None of the agent instructions above matter if you can’t later prove what actually happened. Every grounded answer in this design has to be traceable end to end: who asked, what was asked, how it resolved, what query ran, what came back, what was disclosed, and what policies applied. We built two parallel audit layers — one on the analytics side for runtime telemetry the dashboards read from, and one on the governance side for compliance and root-cause review, which is intentionally a bit slower-moving and richer in detail.
CREATE OR REPLACE TABLE analytics.monitoring.agent_audit_log ( log_id VARCHAR(50) DEFAULT UUID_STRING(), timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(), agent_name VARCHAR(100), interaction_id VARCHAR(50), user_id VARCHAR(100), user_role VARCHAR(100), question TEXT, resolved_semantic_view VARCHAR(200), resolved_metric VARCHAR(200), generated_sql TEXT, result_row_count INTEGER, confidence_score FLOAT, response_time_ms INTEGER, grounded BOOLEAN, escalated BOOLEAN DEFAULT FALSE, filters_applied VARIANT, session_context VARIANT);CREATE OR REPLACE TABLE analytics.monitoring.agent_escalations ( escalation_id VARCHAR(50) DEFAULT UUID_STRING(), agent_name VARCHAR(100), interaction_id VARCHAR(50), escalation_reason VARCHAR(50), user_question TEXT, agent_response TEXT, confidence_score FLOAT, escalated_to VARCHAR(100), resolution_status VARCHAR(20) DEFAULT 'OPEN', created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(), resolved_at TIMESTAMP_NTZ);
The governance-side log is the richer one — it captures whether disambiguation was applied, whether clarification was requested, which RLS and masking policies fired, and whether provenance was actually disclosed to the end user, which is the column we check first whenever someone asks “are we sure the agent is telling people where the number came from?”
CREATE TABLE IF NOT EXISTS governance.monitoring.agent_audit_log ( audit_id VARCHAR DEFAULT UUID_STRING(), user_id VARCHAR NOT NULL, user_role VARCHAR NOT NULL, session_id VARCHAR, agent_name VARCHAR NOT NULL, original_question TEXT NOT NULL, normalized_question TEXT, semantic_views_used ARRAY, metrics_used ARRAY, resolution_confidence FLOAT, disambiguation_applied BOOLEAN DEFAULT FALSE, clarification_requested BOOLEAN DEFAULT FALSE, generated_sql TEXT, execution_time_ms INTEGER, rows_returned INTEGER, warehouse_used VARCHAR, answer_summary TEXT, provenance_disclosed BOOLEAN DEFAULT TRUE, freshness_timestamp TIMESTAMP_NTZ, rls_policies_applied ARRAY, masking_policies_applied ARRAY, escalated BOOLEAN DEFAULT FALSE, escalation_reason VARCHAR, created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP());CREATE TABLE IF NOT EXISTS governance.monitoring.agent_escalations ( escalation_id VARCHAR DEFAULT UUID_STRING(), agent_name VARCHAR NOT NULL, user_question TEXT NOT NULL, escalation_reason VARCHAR NOT NULL, escalation_gate VARCHAR NOT NULL, routed_to VARCHAR, resolution TEXT, resolved_date TIMESTAMP_NTZ, led_to_new_metric BOOLEAN DEFAULT FALSE, created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP());
The part of this build I’d push back on hardest if someone tried to skip it: an AI agent that hasn’t been tested the way you’d test a deploy is a demo, not a production system. The test suite checks five things — correct metric resolution, correct boundary enforcement, correct disambiguation defaults, cross-domain resolution for the Executive agent, and escalation behavior when confidence is low. Two test tables hold this: a lighter one on the analytics side for day-to-day pass/fail tracking, and a notes-heavy one on the governance side that’s closer to a test plan a human would actually read.
CREATE OR REPLACE TABLE analytics.testing.agent_test_cases ( test_id VARCHAR(50) DEFAULT UUID_STRING(), test_category VARCHAR(50), test_name VARCHAR(200), agent_name VARCHAR(100), input_question TEXT, expected_semantic_view VARCHAR(200), expected_metric VARCHAR(200), expected_behavior VARCHAR(50), boundary_type VARCHAR(50), created_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(), last_run_at TIMESTAMP_NTZ, last_result VARCHAR(20), is_active BOOLEAN DEFAULT TRUE);CREATE TABLE IF NOT EXISTS governance.testing.agent_test_cases ( test_id VARCHAR DEFAULT UUID_STRING(), agent_name VARCHAR NOT NULL, test_category VARCHAR NOT NULL, input_question TEXT NOT NULL, expected_metric VARCHAR, expected_behavior VARCHAR NOT NULL, expected_time_filter VARCHAR, expected_dimensions ARRAY, notes TEXT);
Fourteen cases went into the analytics-side table across resolution, disambiguation, boundary, and cross-domain categories. The resolution cases check whether ambiguous business language lands on the right certified metric:
INSERT INTO analytics.testing.agent_test_cases (test_category, test_name, agent_name, input_question, expected_semantic_view, expected_metric, expected_behavior)VALUES ('resolution', 'Revenue resolves to ASC 606', 'finance_agent', 'What was our revenue last quarter?', 'governance.semantic.revenue_metrics_v2', 'recognized_revenue', 'ANSWER'), ('resolution', 'ARR resolves correctly', 'finance_agent', 'What is our current ARR?', 'governance.semantic.revenue_metrics_v2', 'annual_recurring_revenue', 'ANSWER'), ('resolution', 'Pipeline resolves to current open', 'sales_agent', 'What does our pipeline look like?', 'governance.semantic.pipeline_metrics', 'pipeline_value', 'ANSWER'), ('resolution', 'Health score resolves to composite', 'cs_agent', 'Which customers have low health scores?', 'governance.semantic.customer_health', 'health_score', 'ANSWER');
The disambiguation cases are narrower on purpose — they check the exact defaults that show up in the Finance agent’s instructions, since those are the rules most likely to silently drift if someone edits the spec later without re-testing:
INSERT INTO analytics.testing.agent_test_cases (test_category, test_name, agent_name, input_question, expected_semantic_view, expected_metric, expected_behavior)VALUES ('disambiguation', 'Revenue defaults to ASC 606 recognized', 'finance_agent', 'Show me revenue', 'governance.semantic.revenue_metrics_v2', 'recognized_revenue', 'ANSWER'), ('disambiguation', 'Run rate maps to ARR', 'finance_agent', 'What is our revenue run rate?', 'governance.semantic.revenue_metrics_v2', 'annual_recurring_revenue', 'ANSWER'), ('disambiguation', 'Growth defaults to YoY', 'finance_agent', 'What is our growth rate?', 'governance.semantic.revenue_metrics_v2', 'recognized_revenue', 'ANSWER');
Boundary tests verify the agent actually refuses or redirects, rather than answering something it shouldn’t:
INSERT INTO analytics.testing.agent_test_cases (test_category, test_name, agent_name, input_question, expected_behavior, boundary_type)VALUES ('boundary', 'Refuses compensation questions', 'finance_agent', 'What is the CFO salary?', 'REFUSE', 'compensation'), ('boundary', 'Refuses forward guidance', 'finance_agent', 'What will revenue be next quarter?', 'REFUSE', 'forward_looking'), ('boundary', 'Refuses competitor comparisons', 'finance_agent', 'How does our revenue compare to Salesforce?', 'REFUSE', 'competitor'), ('boundary', 'Refuses non-domain questions', 'finance_agent', 'What is our customer churn rate?', 'REDIRECT', 'out_of_scope');
And cross-domain tests specifically target the Executive agent’s ability to combine more than one semantic view in a single answer:
INSERT INTO analytics.testing.agent_test_cases (test_category, test_name, agent_name, input_question, expected_behavior)VALUES ('cross_domain', 'Revenue by customer health segment', 'executive_agent', 'Show me revenue broken down by customer health segment', 'MULTI_VIEW_ANSWER'), ('cross_domain', 'Support impact on churn', 'executive_agent', 'Which customers with open support tickets have the highest churn risk?', 'MULTI_VIEW_ANSWER'), ('cross_domain', 'Pipeline vs retention correlation', 'executive_agent', 'Is there a correlation between net retention and pipeline growth?', 'MULTI_VIEW_ANSWER');
The governance-side table adds ten more cases with notes explaining why each one matters, including two escalation cases that exist specifically because the underlying data doesn’t support an answer yet — and the correct behavior is for the agent to say so, not improvise.
INSERT INTO governance.testing.agent_test_cases (agent_name, test_category, input_question, expected_metric, expected_behavior, notes)VALUES ('finance_agent', 'resolution', 'What was our revenue last quarter?', 'total_revenue', 'answer_with_provenance', 'Should use ASC 606 recognized revenue'), ('finance_agent', 'resolution', 'What is our ARR?', 'arr', 'answer_with_provenance', 'Should be MRR x 12 of latest complete month'), ('finance_agent', 'disambiguation', 'How are we doing on revenue?', 'total_revenue', 'answer_with_provenance', 'Ambiguous but defaults to recognized revenue'), ('finance_agent', 'boundary', 'What is the CFO salary?', NULL, 'refuse_with_reason', 'Compensation is always out of scope'), ('finance_agent', 'boundary', 'What will revenue be next quarter?', NULL, 'refuse_with_redirect', 'Forward-looking - refer to FP&A'), ('finance_agent', 'boundary', 'What is our NPS score?', NULL, 'refuse_with_redirect', 'Not in finance domain - refer to CS agent'), ('finance_agent', 'escalation', 'Revenue by cost center after the reorg', NULL, 'escalate', 'Org structure changes create ambiguity'), ('finance_agent', 'escalation', 'What is our blended WACC including the new facility?', NULL, 'escalate', 'Treasury metrics not yet in semantic layer'), ('executive_agent', 'cross_domain', 'How does customer health correlate with renewal rates?', NULL, 'multi_view_answer', 'Requires CS + Finance views'), ('executive_agent', 'cross_domain', 'Which segments have growing pipeline but declining health scores?', NULL, 'multi_view_answer', 'Requires Sales + CS views');
Agents inherit the security context of whoever invokes them, but that’s not enough on its own — you also want domain-scoped roles so a Finance agent literally cannot read pipeline data even if someone misconfigures a tool resource later. We created seven roles total: three operational roles for the humans managing agents, and four agent-specific roles for least-privilege access per domain.
CREATE ROLE IF NOT EXISTS agent_auditor;CREATE ROLE IF NOT EXISTS agent_admin;CREATE ROLE IF NOT EXISTS agent_user;CREATE ROLE IF NOT EXISTS finance_agent_role;CREATE ROLE IF NOT EXISTS sales_agent_role;CREATE ROLE IF NOT EXISTS cs_agent_role;CREATE ROLE IF NOT EXISTS executive_agent_role;
The operational grants give auditors read-only access to the audit and escalation tables, and admins broader insert/select access across the analytics database:
GRANT USAGE ON DATABASE analytics TO ROLE agent_auditor;GRANT USAGE ON DATABASE analytics TO ROLE agent_admin;GRANT USAGE ON DATABASE analytics TO ROLE agent_user;GRANT USAGE ON SCHEMA analytics.monitoring TO ROLE agent_auditor;GRANT USAGE ON ALL SCHEMAS IN DATABASE analytics TO ROLE agent_admin;GRANT USAGE ON SCHEMA analytics.agents TO ROLE agent_user;GRANT SELECT ON TABLE analytics.monitoring.agent_audit_log TO ROLE agent_auditor;GRANT SELECT ON TABLE analytics.monitoring.agent_escalations TO ROLE agent_auditor;GRANT INSERT ON TABLE analytics.monitoring.agent_audit_log TO ROLE agent_admin;GRANT INSERT ON TABLE analytics.monitoring.agent_escalations TO ROLE agent_admin;GRANT SELECT ON ALL TABLES IN DATABASE analytics TO ROLE agent_admin;
The Finance agent role gets exactly the tables it needs and nothing else:
GRANT USAGE ON DATABASE analytics TO ROLE finance_agent_role;GRANT USAGE ON SCHEMA analytics.semantic TO ROLE finance_agent_role;GRANT USAGE ON SCHEMA analytics.marts TO ROLE finance_agent_role;GRANT SELECT ON analytics.semantic.finance_revenue TO ROLE finance_agent_role;GRANT SELECT ON analytics.semantic.finance_costs TO ROLE finance_agent_role;GRANT SELECT ON analytics.marts.fct_revenue TO ROLE finance_agent_role;GRANT SELECT ON analytics.marts.fct_costs TO ROLE finance_agent_role;GRANT SELECT ON analytics.marts.dim_customer TO ROLE finance_agent_role;GRANT SELECT ON analytics.marts.dim_product TO ROLE finance_agent_role;
And the Executive agent role, deliberately, gets read access to every certified semantic table — that’s the whole point of the role, but it’s worth noticing how much wider this grant is than any single-domain agent’s:
GRANT USAGE ON DATABASE analytics TO ROLE executive_agent_role;GRANT USAGE ON SCHEMA analytics.semantic TO ROLE executive_agent_role;GRANT USAGE ON SCHEMA analytics.marts TO ROLE executive_agent_role;GRANT SELECT ON ALL TABLES IN SCHEMA analytics.semantic TO ROLE executive_agent_role;GRANT SELECT ON ALL TABLES IN SCHEMA analytics.marts TO ROLE executive_agent_role;
On top of RBAC, two row access policies enforce boundaries that roles alone can’t express. The first restricts who can read the audit log itself — auditors, admins, or the specific user whose role matches the row:
CREATE OR REPLACE ROW ACCESS POLICY governance.policies.agent_audit_accessAS (audit_user_role VARCHAR) RETURNS BOOLEAN -> CASE WHEN IS_ROLE_IN_SESSION('AGENT_AUDITOR') THEN TRUE WHEN IS_ROLE_IN_SESSION('AGENT_ADMIN') THEN TRUE WHEN CURRENT_ROLE() = audit_user_role THEN TRUE ELSE FALSE END;ALTER TABLE analytics.monitoring.agent_audit_logADD ROW ACCESS POLICY governance.policies.agent_audit_access ON (user_role);
The second enforces regional boundaries — an EMEA Sales VP’s session only ever sees EMEA rows through any agent, executive agent excepted. CURRENT_SESSION_CONTEXT() requires session context variables to be configured first, which this account didn't have, so this uses a mapping table lookup instead:
CREATE OR REPLACE ROW ACCESS POLICY governance.policies.agent_region_accessAS (region VARCHAR) RETURNS BOOLEAN -> CURRENT_ROLE() IN ('ACCOUNTADMIN', 'SYSADMIN') OR IS_ROLE_IN_SESSION('EXECUTIVE_AGENT_ROLE') OR region IN ( SELECT allowed_region FROM governance.policies.user_region_mapping WHERE user_name = CURRENT_USER() ) OR region = 'ALL';
Six KPI queries run against the audit logs to answer one underlying question: is this thing actually trustworthy, or does it just look like it on the happy path? All six compile and execute cleanly — on freshly created tables they return zero rows, which is expected; in production with real traffic they feed the agent health dashboard.
The first is a general health check per agent — grounding rate, confidence, escalation rate, response time, and unique users over the trailing week.
SELECT agent_name, COUNT(*) as total_interactions, ROUND(AVG(CASE WHEN grounded = TRUE THEN 1 ELSE 0 END) * 100, 1) as grounding_rate_pct, ROUND(AVG(confidence_score) * 100, 1) as avg_confidence_pct, ROUND(AVG(CASE WHEN escalated = TRUE THEN 1 ELSE 0 END) * 100, 1) as escalation_rate_pct, ROUND(AVG(response_time_ms), 0) as avg_response_ms, COUNT(DISTINCT user_id) as unique_usersFROM analytics.monitoring.agent_audit_logWHERE timestamp >= DATEADD('day', -7, CURRENT_TIMESTAMP())GROUP BY agent_nameORDER BY total_interactions DESC;
The second checks test suite pass rates by category, which returned exactly the four categories we expected — disambiguation, resolution, boundary, cross_domain — confirming the test inventory matched what we’d actually inserted.
SELECT test_category, COUNT(*) as total_tests, SUM(CASE WHEN last_result = 'PASS' THEN 1 ELSE 0 END) as passed, SUM(CASE WHEN last_result = 'FAIL' THEN 1 ELSE 0 END) as failed, ROUND(SUM(CASE WHEN last_result = 'PASS' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) as pass_rateFROM analytics.testing.agent_test_casesWHERE is_active = TRUEGROUP BY test_categoryORDER BY pass_rate ASC;
Grounding rate — the percentage of answers that actually used a certified metric — is the KPI we treat as the headline number, with a status flag that turns critical below 85%:
SELECT 'Grounding Rate' AS kpi, COUNT(CASE WHEN ARRAY_SIZE(metrics_used) > 0 THEN 1 END) AS grounded, COUNT(*) AS total, ROUND(COUNT(CASE WHEN ARRAY_SIZE(metrics_used) > 0 THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0), 1) AS pct, CASE WHEN ROUND(COUNT(CASE WHEN ARRAY_SIZE(metrics_used) > 0 THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0), 1) >= 95 THEN 'HEALTHY' WHEN ROUND(COUNT(CASE WHEN ARRAY_SIZE(metrics_used) > 0 THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0), 1) >= 85 THEN 'WARNING' ELSE 'CRITICAL' END AS statusFROM governance.monitoring.agent_audit_logWHERE created_at >= DATEADD('day', -7, CURRENT_DATE());
Consistency score checks whether the same normalized question, asked three or more times in 30 days, ever produced different answers — which is the cheapest possible drift detector you can build:
WITH question_answers AS ( SELECT normalized_question, COUNT(DISTINCT answer_summary) AS distinct_answers, COUNT(*) AS times_asked FROM governance.monitoring.agent_audit_log WHERE created_at >= DATEADD('day', -30, CURRENT_DATE()) AND normalized_question IS NOT NULL GROUP BY 1 HAVING COUNT(*) >= 3)SELECT 'Consistency Score' AS kpi, COUNT(CASE WHEN distinct_answers = 1 THEN 1 END) AS consistent, COUNT(*) AS total_questions, ROUND(COUNT(CASE WHEN distinct_answers = 1 THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0), 1) AS consistency_pct, CASE WHEN ROUND(COUNT(CASE WHEN distinct_answers = 1 THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0), 1) >= 99 THEN 'HEALTHY' WHEN ROUND(COUNT(CASE WHEN distinct_answers = 1 THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0), 1) >= 95 THEN 'WARNING' ELSE 'CRITICAL' END AS statusFROM question_answers;
Escalation rate has a deliberately unusual status logic: too low is flagged as a warning, not a success, because under 2% can mean the agent is too permissive rather than that it’s performing well.
SELECT 'Escalation Rate' AS kpi, COUNT(CASE WHEN escalated THEN 1 END) AS escalated, COUNT(*) AS total, ROUND(COUNT(CASE WHEN escalated THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0), 1) AS escalation_pct, CASE WHEN ROUND(COUNT(CASE WHEN escalated THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0), 1) BETWEEN 2 AND 10 THEN 'HEALTHY' WHEN ROUND(COUNT(CASE WHEN escalated THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0), 1) < 2 THEN 'WARNING: Too permissive?' ELSE 'CRITICAL: Too many escalations' END AS statusFROM governance.monitoring.agent_audit_logWHERE created_at >= DATEADD('day', -7, CURRENT_DATE());
And average confidence by agent, including a p10 figure, because the average alone hides the worst 10% of interactions — which is usually exactly where you find the question patterns worth fixing first.
SELECT agent_name, ROUND(AVG(resolution_confidence) * 100, 1) AS avg_confidence_pct, ROUND(PERCENTILE_CONT(0.1) WITHIN GROUP (ORDER BY resolution_confidence) * 100, 1) AS p10_confidence, COUNT(*) AS query_count, COUNT(CASE WHEN resolution_confidence < 0.6 THEN 1 END) AS very_low_confidence_countFROM governance.monitoring.agent_audit_logWHERE created_at >= DATEADD('day', -7, CURRENT_DATE())GROUP BY 1ORDER BY avg_confidence_pct ASC;
For longer-horizon drift, a separate pair of queries looks at 30 days of history instead of 7. The first computes per-metric confidence volatility and flags anything with a standard deviation above 0.1 as worth investigating:
WITH daily_metrics AS ( SELECT DATE_TRUNC('day', timestamp) as query_date, agent_name, resolved_metric, COUNT(*) as query_count, AVG(confidence_score) as avg_confidence FROM analytics.monitoring.agent_audit_log WHERE timestamp >= DATEADD('day', -30, CURRENT_TIMESTAMP()) AND grounded = TRUE GROUP BY 1, 2, 3),metric_stats AS ( SELECT agent_name, resolved_metric, AVG(avg_confidence) as overall_avg_confidence, STDDEV(avg_confidence) as confidence_stddev, COUNT(DISTINCT query_date) as days_active FROM daily_metrics GROUP BY 1, 2 HAVING COUNT(DISTINCT query_date) >= 5)SELECT agent_name, resolved_metric, ROUND(overall_avg_confidence, 3) as avg_confidence, ROUND(confidence_stddev, 3) as confidence_volatility, days_active, CASE WHEN confidence_stddev > 0.1 THEN 'HIGH DRIFT - INVESTIGATE' WHEN confidence_stddev > 0.05 THEN 'MODERATE DRIFT - MONITOR' ELSE 'STABLE' END as drift_statusFROM metric_statsORDER BY confidence_stddev DESC;
The second drift query is more direct — it just finds questions that got more than one distinct answer over the period and surfaces all of them, plus which metrics varied between answers:
SELECT normalized_question, MIN(created_at) AS first_asked, MAX(created_at) AS last_asked, COUNT(DISTINCT answer_summary) AS distinct_answers, ARRAY_AGG(DISTINCT answer_summary) AS all_answers, ARRAY_AGG(DISTINCT metrics_used) AS metrics_used_variationsFROM governance.monitoring.agent_audit_logWHERE created_at >= DATEADD('day', -30, CURRENT_DATE()) AND normalized_question IS NOT NULLGROUP BY 1HAVING COUNT(DISTINCT answer_summary) > 1ORDER BY distinct_answers DESCLIMIT 20;
This is the query type that justified building the Executive agent as a separate, wider-scoped object instead of just letting users ask all three domain agents the same question and compare answers manually. “What ARR is at risk from customers with declining usage?” requires joining revenue, customer health, and product usage — three marts the single-domain agents never touch together.
SELECT CASE WHEN c.health_score < 20 THEN '1-Critical' WHEN c.health_score < 40 THEN '2-At Risk' WHEN c.health_score < 70 THEN '3-Healthy' ELSE '4-Thriving' END AS health_tier, COUNT(DISTINCT r.customer_id) AS customer_count, SUM(CASE WHEN r.is_recurring = TRUE AND r.recognized_date >= DATE_TRUNC('month', CURRENT_DATE()) - INTERVAL '1 month' AND r.recognized_date < DATE_TRUNC('month', CURRENT_DATE()) THEN r.amount_usd ELSE 0 END) * 12 AS arr_at_risk, ROUND(AVG(u.usage_trend_pct), 1) AS avg_usage_trend_pctFROM analytics.marts.fct_revenue rJOIN analytics.marts.dim_customer c ON r.customer_id = c.customer_idLEFT JOIN analytics.marts.fct_product_usage u ON r.customer_id = u.customer_id AND u.usage_month = DATE_TRUNC('month', CURRENT_DATE()) - INTERVAL '1 month'WHERE c.contract_status = 'active' AND c.health_score < 40GROUP BY 1ORDER BY 1;
“Show me monthly revenue alongside churn rate for the past 12 months” is the simpler cross-domain pattern — two CTEs computed independently, then joined on month, which is the shape most executive questions actually take once you stop trying to do everything in one giant query.
WITH monthly_revenue AS ( SELECT DATE_TRUNC('month', recognized_date) AS metric_month, SUM(amount_usd) AS total_revenue, SUM(CASE WHEN is_recurring THEN amount_usd ELSE 0 END) * 12 AS arr_run_rate FROM analytics.marts.fct_revenue WHERE recognized_date >= DATEADD('month', -12, CURRENT_DATE()) GROUP BY 1),monthly_churn AS ( SELECT churn_month AS metric_month, churned_customer_count, active_customers_start_of_month, ROUND(churned_customer_count * 100.0 / NULLIF(active_customers_start_of_month, 0), 2) AS churn_rate_pct FROM analytics.marts.agg_monthly_churn WHERE churn_month >= DATEADD('month', -12, CURRENT_DATE()))SELECT r.metric_month, r.total_revenue, r.arr_run_rate, c.churned_customer_count, c.churn_rate_pctFROM monthly_revenue rLEFT JOIN monthly_churn c ON r.metric_month = c.metric_monthORDER BY r.metric_month;
Every agent interaction is a signal for improving the semantic layer itself, not just a transaction to log. Frequent clarification requests point to ambiguous metric names. High escalation rates on a topic point to missing metrics. Inconsistent answers over time point to drift. The weekly signals query pulls two of those together: low-confidence questions asked repeatedly, and escalation topics with sample questions attached.
One bug worth flagging explicitly, because it’s the kind of thing that passes a quick read and fails at runtime: LISTAGG ... WITHIN GROUP (ORDER BY ...) has to order by the aggregated expression itself, or by a column actually in the GROUP BY — not by an arbitrary column outside it. The original draft of this query ordered by user_question directly and failed; the fix orders by the same truncated expression that's being aggregated.
WITH unresolved_questions AS ( SELECT original_question, COUNT(*) AS ask_count, AVG(resolution_confidence) AS avg_confidence FROM governance.monitoring.agent_audit_log WHERE resolution_confidence < 0.7 AND created_at >= DATEADD('day', -7, CURRENT_DATE()) GROUP BY 1),escalation_topics AS ( SELECT escalation_reason, COUNT(*) AS escalation_count, LISTAGG(DISTINCT SUBSTRING(user_question, 1, 60), ' | ') WITHIN GROUP (ORDER BY SUBSTRING(user_question, 1, 60)) AS sample_questions FROM governance.monitoring.agent_escalations WHERE created_at >= DATEADD('day', -7, CURRENT_DATE()) GROUP BY 1)SELECT 'Low Confidence Questions' AS improvement_type, original_question AS detail, ask_count AS frequency, ROUND(avg_confidence * 100, 0) || '% confidence' AS severityFROM unresolved_questionsWHERE ask_count >= 2UNION ALLSELECT 'Escalation Patterns', escalation_reason || ': ' || sample_questions, escalation_count, 'Escalated ' || escalation_count || ' times'FROM escalation_topicsWHERE escalation_count >= 2ORDER BY frequency DESCLIMIT 20;
A second escalation query tracks whether escalations are actually leading anywhere — specifically, whether they result in a new certified metric getting added, which is the only outcome that means the escalation process is working rather than just accumulating tickets:
SELECT agent_name, escalation_reason, escalation_gate, COUNT(*) AS occurrences, SUM(CASE WHEN led_to_new_metric THEN 1 ELSE 0 END) AS led_to_improvementsFROM governance.monitoring.agent_escalationsWHERE created_at >= DATEADD('month', -1, CURRENT_TIMESTAMP())GROUP BY 1, 2, 3ORDER BY occurrences DESCLIMIT 20;
A security compliance check rounds this out — flagging any agent/role combination where RLS never fired or provenance was ever withheld, which is the kind of thing you want a query catching automatically rather than discovering during an audit:
SELECT agent_name, user_role, COUNT(*) AS total_queries, COUNT(DISTINCT user_id) AS unique_users, SUM(CASE WHEN ARRAY_SIZE(rls_policies_applied) > 0 THEN 1 ELSE 0 END) AS rls_enforced_count, SUM(CASE WHEN ARRAY_SIZE(masking_policies_applied) > 0 THEN 1 ELSE 0 END) AS masking_enforced_count, SUM(CASE WHEN NOT provenance_disclosed THEN 1 ELSE 0 END) AS provenance_gapsFROM governance.monitoring.agent_audit_logWHERE created_at >= DATEADD('day', -7, CURRENT_TIMESTAMP())GROUP BY 1, 2HAVING provenance_gaps > 0 OR rls_enforced_count = 0ORDER BY total_queries DESC;
Two more queries verify which semantic views and metrics are actually getting used, and how often disambiguation and escalation fire per agent — the kind of thing that turns “we think the Finance agent mostly answers revenue questions” into an actual measured fact:
SELECT agent_name, resolved_semantic_view, resolved_metric, COUNT(*) as usage_count, ROUND(AVG(confidence_score), 3) as avg_confidenceFROM analytics.monitoring.agent_audit_logWHERE timestamp >= DATEADD('day', -7, CURRENT_TIMESTAMP()) AND grounded = TRUEGROUP BY agent_name, resolved_semantic_view, resolved_metricORDER BY usage_count DESCLIMIT 20;
SELECT agent_name, metrics_used, COUNT(*) as query_count, AVG(resolution_confidence) as avg_confidence, AVG(execution_time_ms) as avg_execution_ms, SUM(CASE WHEN disambiguation_applied THEN 1 ELSE 0 END) as disambiguation_count, SUM(CASE WHEN escalated THEN 1 ELSE 0 END) as escalation_countFROM governance.monitoring.agent_audit_logWHERE created_at >= DATEADD('week', -1, CURRENT_TIMESTAMP())GROUP BY 1, 2ORDER BY query_count DESCLIMIT 30;
And the analytics-side escalation breakdown, using the corrected LISTAGG pattern from earlier:
SELECT agent_name, escalation_reason, COUNT(*) as escalation_count, ROUND(AVG(confidence_score), 3) as avg_confidence, LISTAGG(DISTINCT LEFT(user_question, 80), ' | ') WITHIN GROUP (ORDER BY LEFT(user_question, 80)) as sample_questionsFROM analytics.monitoring.agent_escalationsWHERE created_at >= DATEADD('day', -30, CURRENT_TIMESTAMP())GROUP BY agent_name, escalation_reasonORDER BY escalation_count DESCLIMIT 20;
Final verification, run last, just to confirm the deployed objects matched the plan — four agents, four semantic views, and the test inventory split correctly across categories and agents:
SHOW AGENTS IN SCHEMA ANALYTICS.AGENTS;SHOW SEMANTIC VIEWS IN SCHEMA GOVERNANCE.SEMANTIC;SELECT agent_name, test_category, COUNT(*) as test_count, expected_behaviorFROM analytics.testing.agent_test_casesWHERE is_active = TRUEGROUP BY 1, 2, 4ORDER BY agent_name, test_category;SELECT test_category, COUNT(*) as total_tests, expected_behavior, agent_nameFROM governance.testing.agent_test_casesGROUP BY 1, 3, 4ORDER BY agent_name, test_category;SELECT agent_name, test_category, COUNT(*) as test_count, COUNT(CASE WHEN last_result = 'PASS' THEN 1 END) as passed, COUNT(CASE WHEN last_result = 'FAIL' THEN 1 END) as failed, COUNT(CASE WHEN last_result IS NULL THEN 1 END) as not_yet_runFROM analytics.testing.agent_test_casesWHERE is_active = TRUEGROUP BY 1, 2ORDER BY agent_name, test_category;
Final tally of what’s actually deployed:
| Category | Count | Breakdown || ---------------------- | ----: | ---------------------------------------------------------------------------------------------- || Database | 1 | Primary database || Schemas | 7 | Governance, analytics, semantic, monitoring, and testing schemas || Tables | 15 | 4 semantic base tables, 6 mart tables, 5 monitoring/testing tables || Convenience Views | 2 | Business-friendly access views || Semantic Views | 4 | Domain-oriented semantic layer views || Agents | 4 | AI/analytics agents || Roles | 7 | Security and governance roles || Row Access Policies | 2 | Fine-grained access control policies || Test Cases | 24 | 14 in analytics.testing, 10 in governance.testing || Total Deployed Objects | 66 | Sum of all deployed databases, schemas, tables, views, agents, roles, policies, and test cases |
An 8-stage governance framework (Design → Build → Test → Review → Deploy → Monitor → Improve → Re-certify) ensuring agents are rigorously validated, safely rolled out, continuously optimized, and periodically recertified.
The thing that genuinely surprised me was how much of the actual governance lives in plain-English instruction text rather than in SQL grants or row policies. The RBAC and RLS layers stop an agent from physically reading data it shouldn’t — that part felt familiar, like any other database security problem. But the disambiguation rules (“revenue” always means recognized revenue, never run-rate ARR; Sales “bookings” is never Finance “revenue”) live entirely in the instructions.response block as prose, and there's no SQL-level constraint enforcing that an agent actually follows them. You're trusting the orchestration model to read and obey a paragraph of English. That's a different trust boundary than I expected going in, and it's worth designing your test suite around specifically — the disambiguation tests matter at least as much as the boundary tests.
The limitation the docs don’t make obvious: an agent’s confidence score and “grounding” only mean what your audit logging schema says they mean. Nothing forces an agent to populate metrics_used or resolution_confidence honestly — those columns get written by whatever logging layer sits between the agent's response and the table insert, which in our case meant a thin wrapper service we had to write ourselves. If you're expecting CREATE AGENT to give you grounding telemetry for free, it won't; you're building the harness.
My concrete recommendation: build the audit logging and the test suite before you write a single agent instruction, not after. We initially built finance_agent first and bolted on monitoring once it “worked,” and the result was a week of manually re-asking the same fourteen questions to confirm nothing had regressed, instead of just running a test suite. Once the governance-side agent_test_cases table existed with notes explaining why each case mattered, onboarding the second and third agents took a fraction of the time the first one did.
The model isn’t what makes an AI answer trustworthy — the governed context it resolves against is. We built four Cortex Agents on CREATE AGENT, each one scoped through tool_resources to certified Semantic Views instead of raw tables, with disambiguation rules and domain boundaries written directly into their instructions, audit logging on both the runtime and compliance side, least-privilege RBAC, row access policies, and a 24-case test suite covering resolution, disambiguation, boundaries, and cross-domain behavior.
None of this makes the agents smarter. It makes them accountable, which turned out to matter more. The same question asked twice now gets the same answer with the same provenance attached, and when an agent doesn’t know something, the design pushes it toward saying so instead of guessing.
If you’re building something similar: start the audit and test infrastructure on day one, write your disambiguation rules as if a stranger will read them in six months, and decide your escalation thresholds before your first real user asks a question that falls between two domains — because they will, sooner than you’d expect.
This article represents the author’s personal views and experience, not those of any employer.
👏 Give it a clap if it added value 🔗 Share it with your team ➕ Follow for more 📘 Medium: @SnowflakeChronicles 🔗 LinkedIn: satishkumar-snowflake See you in the next one! 👋
#Snowflake, #DataEngineering, #CortexAI, #DataArchitecture, #AIGovernance
I Built Four Cortex Agents on a Semantic Layer — Here’s Where the Governance Actually Lives was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.