How I Replaced 1,000 Brittle Rules with 3 AI Calls: A Hybrid Data Quality Framework

wpnews.pro

I was tasked with building an auditing pipeline for a massive stream of messy, third-party vendor data. My first instinct was the traditional one: stack up a few hundred lines of regex, maintain huge lookup tables, and chain together complex CASE WHEN blocks to flag anomalies before the data hit our production dashboards.

It didn’t take long for that approach to fall apart.

The moment a vendor misspelled an asset name, switched a catalog ledger code, or introduced an ambiguous product category, the pipeline failed in one of two ways. It either triggered hundreds of false-positive alerts that the team quickly learned to ignore, or it quietly let massive classification errors slide straight through to executive reporting.

Traditional Data Quality (DQ) checks excel at validating schemas — null checks, data types, string lengths — but they struggle profoundly with semantic validity. A rigid SQL rule cannot tell you if an unstructured text description actually aligns with its assigned ledger category. At the same time, off your entire data quality check to a heavy LLM pipeline is a great way to create an expensive, over-engineered infrastructure headache.

To bridge this gap, I deployed a hybrid DQ framework in Databricks SQL that balances deterministic rules with inline cognitive functions. By leveraging ai_classify(), we built an intelligent semantic auditing layer directly in the data stream—without managing an ML endpoint, writing custom Python deployment scripts, or managing complex extraction pipelines.

Here is the exact architectural pattern we used, the production lessons we learned, and why treating data engineering as Context Engineering changes the game.

To understand why a hybrid framework makes sense, it helps to look at modern aviation.

A commercial airliner doesn’t rely on a human pilot to manually adjust the flight controls for every single second of a cross-country flight. That would be an incredibly inefficient waste of cognitive energy. Instead, the plane relies on autopilot for speed and stability. The automated system excels at processing predictable, deterministic inputs at scale.

However, the autopilot isn’t sentient. If the plane encounters unmapped weather anomalies or a complex mechanical emergency, control is immediately handed over to the captain. The human pilot brings a holistic, semantic understanding of context that static code simply does not possess.

[Raw Data Ingestion]                │                ▼┌───────────────────────────────────┐│ 1. DETERMINISTIC LAYER (Autopilot)│ ──> Cruising speed, zero-cost└───────────────────────────────────┘                │                │ (Anomalies & Ambiguity)                ▼┌───────────────────────────────────┐│ 2. COGNITIVE LAYER (The Captain)  │ ──> High-judgment, targeted token spend└───────────────────────────────────┘                │                ▼      [Audited Reporting Layer]

For years, data engineering has suffered from an architectural flaw: we’ve been trying to force our “autopilot” (regex and static code) to navigate complex contextual storms it was never built to understand. The secret to scalable data quality isn’t firing the autopilot and making the captain fly the whole route manually — it’s knowing exactly when to pass control.

The core operating philosophy here is direct: Rules for speed, AI for judgment.

Running every single incoming row through an LLM will quickly destroy your Databricks budget and tank your Spark pipeline’s throughput. Instead, this framework uses a three-stage funnel designed to isolate the LLM, reserving token consumption exclusively for rows that actually require human-like evaluation.

1. Deterministic (The Shield): Utilizing high-speed metadata regex and exact ledger join rules, this initial layer scans millions of rows instantly at zero compute cost. Its engineering purpose is to catch high-confidence, obvious violations right at the threshold of the pipeline.

2. AI-Verified (The Filter): Operating on an ultra-low-cost targeted sampling profile, this layer triggers ai_classify() exclusively to evaluate the flags raised by the Deterministic layer. It acts as a semantic second opinion to confirm or demote alerts, slashing false-positive noise before human stewards are notified.

3. AI-Discovery (The Net): Running a controlled, sampled scan on otherwise unflagged, edge-case data blocks, this final layer deploys ai_classify() to scour ambiguous records. It functions as a safety net to uncover deep, latent semantic violations that static rules are fundamentally blind to.

To make this framework production-grade, we had to solve for cost, accuracy, and real-world data messiness. Three distinct engineering patterns made it work:

1. AI as a Second Opinion (The Noise Filter)

We don’t let the AI have the first word. When a deterministic regex rule flags a catalog item as a potential mismatch, the AI layer acts as a sanity check to verify the context.

For example, a traditional rule might flag a line-item because it sees a specific restricted keyword. But a human looking at the record would instantly look at the broader department context and realize it’s a completely safe, miscategorized piece of infrastructure equipment. By feeding these flagged rows to ai_classify(), the AI confirms or demote the alert. This single pattern dramatically reduced our pipeline's false-positive noise.

2. Category-to-Eligibility Mismatch

Instead of maintaining an impossible, fast-moving lookup list of thousands of raw vendor item names, we shifted the focus to Context Engineering.

We use ai_classify() to evaluate the raw, uncurated item description and output its true, objective product type. Once the AI establishes that clean semantic context, we use a simple, deterministic SQL join to compare that type against the allowed classification rules for that ledger category. The AI standardizes the messy reality; the SQL handles the policy enforcement.

3. Blank-Aware Field Resolution & Edge-Case Guardrails

Real-world enterprise data rarely gives you clean NULL values; it gives you empty strings, double spaces, or hidden characters that bypass standard IS NOT NULL checks. We standardized our text ingestion using a blank-aware pattern before it ever touched our auditing logic:

COALESCE(NULLIF(TRIM(item_name), ''), fallback_column, 'Unknown')

Additionally, we implemented safe-zone guardrails. If an item belongs to an obvious, entirely non-violating catalog department (e.g., standard internal hardware matched with an IT infrastructure category), it skips the AI classification layer entirely. We don’t waste tokens proving an obvious compliance match is compliant.

The beauty of this stack is its simplicity: Databricks SQL, Unity Catalog for governance, and notebook-based orchestration scheduled as a daily job. There are no Python ML frameworks, no custom container endpoints, and no token-streaming boilerplate.

Here is a generalized, production-ready example of how the semantic evaluation layer looks inside the pipeline:

WITH normalized_source AS (    SELECT         transaction_id,        ledger_category,        -- Clean up messy space-as-blank anomalies        COALESCE(NULLIF(TRIM(item_desc), ''), CONCAT(dept_name, ' Item'), 'Unknown') AS clean_item_desc,        dept_name    FROM gold_catalog_table    WHERE dept_name NOT IN ('INTERNAL_IT', 'FACILITIES') -- Guardrail: Skip safe zones to save costs)SELECT     transaction_id,    clean_item_desc,    dept_name,    ledger_category,    -- Inline LLM classification directly within the data stream    ai_classify(        clean_item_desc,         ARRAY('Software Subscription', 'Physical Asset', 'Office Supply', 'Travel Expense')    ) AS inferred_item_typeFROM normalized_source;

While implementing this, we hit several real-world quirks of inline SQL AI functions that you won’t find in the basic documentation:

By shifting from a purely rule-based system to this hybrid, cognitive data pipeline, the operational metrics and data ecosystem changed completely:

You don’t need a massive data science team, a dedicated ML Ops pipeline, or a multi-million dollar budget to build intelligent data quality systems. By treating AI as a high-level cognitive layer for judgment and wrapping it in classic, deterministic SQL engineering guardrails, you can build self-healing, context-aware pipelines right inside your existing Lakehouse architecture.

How I Replaced 1,000 Brittle Rules with 3 AI Calls: A Hybrid Data Quality Framework was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

source & further reading

pub.towardsai.net — original article DeepSeek V4 vs DeepSeek V4 Flash: Which Model Should Developers Choose in 2026? I Built an AI Agent That Can Query My Kubernetes Cluster, But Never Break It How to Anonymize Personal Data in an LLM Pipeline.

How I Replaced 1,000 Brittle Rules with 3 AI Calls: A Hybrid Data Quality Framework

Run your AI side-project on zahid.host