Designing Hybrid Edge AI Systems for Low-Latency Intent Classification in Mobile Applications

wpnews.pro

Large Language Models (LLMs) have fundamentally changed how applications process natural language. They excel at reasoning, summarization, question answering, and generating human-like responses. As a result, many modern applications route every user message directly to a cloud-hosted LLM.

While this approach is effective for complex conversations, it is often unnecessary for deterministic interactions. Commands such as "Show my leave balance", "Open settings", or "Contact HR" do not require generative reasoning. They require identifying a known intent and triggering a predefined workflow.

Sending these requests to the cloud introduces avoidable latency, increases operational costs, depends on network availability, and transmits user data that could otherwise remain on the device.

This article presents a hybrid architecture that performs intent classification entirely on the client using a lightweight machine learning model. By classifying predictable requests locally and forwarding only ambiguous or complex queries to a cloud-based LLM, applications can provide a significantly faster, more private, and more resilient user experience.

Although the implementation examples reference Core ML on iOS, the architectural principles discussed here apply equally to Android, desktop, and embedded systems.

Over the past few years, conversational interfaces have evolved from simple rule-based chatbots into sophisticated AI assistants capable of understanding natural language.

As engineers, it is tempting to assume that every user message deserves the full reasoning power of a Large Language Model. In practice, however, most application interactions are remarkably predictable.

Consider requests such as "Show my leave balance", "Open settings", or "Contact HR".

These requests are not open-ended questions.

They are commands.

Their purpose is not to generate new knowledge but to identify the user's intent and execute an existing application workflow.

Yet many applications still send these requests to remote AI services.

Although this simplifies implementation, it often creates unnecessary architectural complexity.

Each interaction now depends on network availability, a round trip to a remote service, and a per-request API cost.

The user experiences several hundred milliseconds—or even multiple seconds—of delay simply to navigate to a screen that already exists inside the application.

This raises an important architectural question:

Should every natural language request be processed by a Large Language Model?

For many applications, the answer is no.

Modern AI systems are incredibly capable, but capability alone should not dictate architecture.

One of the fundamental responsibilities of software architecture is selecting the appropriate technology for each problem.

A calculator does not require a database.

A login screen does not require distributed computing.

Likewise, deterministic user commands often do not require generative AI.

Consider an enterprise HR application with features such as leave management, payroll, working hours, and an HR contact screen.

A conversational interface might receive thousands of requests every day, but a significant percentage of those requests fall into a relatively small number of predictable categories.

Examples include:

User Request	Intended Action
"How many leaves do I have?"	Open Leave Balance
"Apply leave tomorrow"	Open Leave Application
"Show my salary slip"	Navigate to Payroll
"Office timings"	Display Working Hours
"Email HR"	Open Contact Screen

Each request maps directly to an existing application feature.

No reasoning is required.

No content generation is required.

No external knowledge retrieval is required.

The challenge is simply determining which predefined action should be executed.

This is fundamentally a classification problem, not a reasoning problem.

Recognizing this distinction opens the door to a much simpler architecture.

Instead of treating every request as an AI problem, we can divide user interactions into two categories.

The first category is deterministic requests, which have known outcomes.

Examples include "Show my leave balance", "Open settings", and "Contact HR".

The expected action is already implemented inside the application.

The only missing piece is determining which action the user intended.

A lightweight text classifier can solve this in just a few milliseconds.

The second category is complex requests, which require reasoning beyond predefined workflows.

Examples include:

Compare my leave history over the last three years and suggest the best vacation period.

or

Summarize the company's parental leave policy.

or

Explain why my reimbursement request was rejected.

These requests benefit from the contextual understanding and reasoning capabilities of an LLM.

Rather than replacing the cloud entirely, the objective is to ensure that only requests requiring advanced reasoning are forwarded to it.

This observation naturally leads to a hybrid architecture.

Instead of placing the LLM at the front of every interaction, the application first evaluates whether the request belongs to a known intent.

                    User Input
                         │
                         ▼
           On-Device Intent Classifier
                         │
          ┌──────────────┴──────────────┐
          │                             │
   High Confidence               Low Confidence
          │                             │
          ▼                             ▼
 Execute Local Action          Forward to Cloud LLM

This design introduces an intelligent routing layer between the user interface and the network.

The classifier becomes responsible for determining whether the application already knows how to satisfy the request.

If it does, the workflow executes immediately without leaving the device.

If not, the request is escalated to a cloud-based language model.

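This architecture combines the strengths of both approaches: the speed, privacy, and offline availability of on-device inference, and the reasoning ability of a cloud LLM.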

Rather than viewing edge AI and cloud AI as competing technologies, they become complementary components within the same system.

Choosing between local inference and cloud inference is not about determining which technology is "better."

Each solves a different class of problems.

Architectural Characteristic	Cloud LLM	On-Device Intent Classifier
Network Connectivity	Required	Not Required
Average Response Time	1–4 seconds	Typically under 5 ms
Operational Cost	Per-request API cost	Zero after deployment
Privacy	Data transmitted externally	Data remains on device
Offline Capability	No	Yes
Reasoning Ability	Excellent	Limited
Deterministic Commands	Overkill	Ideal

The objective is not to eliminate cloud AI.

Instead, it is to reserve expensive reasoning engines for situations that genuinely require them.

A useful mental model is:

Use edge AI for routing. Use cloud AI for reasoning.

This simple design principle can significantly improve responsiveness while reducing unnecessary infrastructure costs.
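To make the effect on responsiveness concrete, the sketch below estimates the average response time of the hybrid design. The function name `expectedLatencyMs`, the 70% local share, and the 2-second cloud latency are illustrative assumptions, not measurements from this article.

```swift
/// Expected response time (in milliseconds) when every request is first
/// classified on device and only the remaining share is sent to the cloud.
/// All values passed in below are illustrative assumptions.
func expectedLatencyMs(localShare: Double, localMs: Double, cloudMs: Double) -> Double {
    precondition((0.0...1.0).contains(localShare), "localShare must be in 0...1")
    // The classifier runs for every request; the cloud call is added only
    // for the fraction of requests that are not handled locally.
    return localMs + (1.0 - localShare) * cloudMs
}

let hybrid = expectedLatencyMs(localShare: 0.7, localMs: 5, cloudMs: 2000)
let cloudOnly = expectedLatencyMs(localShare: 0.0, localMs: 0, cloudMs: 2000)
print("hybrid: \(hybrid) ms, cloud only: \(cloudOnly) ms")
```

Under these assumed numbers the average drops from about 2000 ms to about 605 ms, and the requests that are handled locally never wait on the network at all.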

Intent classification is one of the oldest and most successful applications of Natural Language Processing.

Unlike generative models, which attempt to produce new text, a classifier performs a much simpler task:

Determine which predefined category best matches the input.

For example:

"Check my leave balance"

might produce

leave_balance

while

"What are today's office timings?"

might produce

working_hours

The output is not a paragraph.

It is simply a label.
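Because the output is only a label, the application still needs a mapping from that label to a workflow. A minimal sketch of such a mapping follows; the `Intent` enum, its case names, and the `destination(forLabel:)` helper are hypothetical, while the raw values and screen names follow the examples used in this article.

```swift
/// Intents the application knows how to handle. The raw values are the
/// classifier labels; the case names are hypothetical.
enum Intent: String, CaseIterable {
    case leaveBalance = "leave_balance"
    case applyLeave = "apply_leave"
    case salaryInfo = "salary_info"
    case workingHours = "working_hours"
    case contactHR = "contact_hr"

    /// Screen to open for this intent.
    var destination: String {
        switch self {
        case .leaveBalance: return "Leave Balance"
        case .applyLeave: return "Leave Application"
        case .salaryInfo: return "Payroll"
        case .workingHours: return "Working Hours"
        case .contactHR: return "Contact Screen"
        }
    }
}

/// Returns nil for labels the app has no workflow for, so the caller can
/// fall back to another path instead of navigating somewhere wrong.
func destination(forLabel label: String) -> String? {
    Intent(rawValue: label)?.destination
}

print(destination(forLabel: "leave_balance") ?? "unhandled")  // prints "Leave Balance"
print(destination(forLabel: "unknown_label") ?? "unhandled")  // prints "unhandled"
```

Using an enum rather than raw strings means the compiler flags any intent that has no destination, which helps when new labels are added to the model.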

Because the problem is constrained, the resulting model is dramatically smaller than a Large Language Model.

In many production systems, an intent classifier occupies only a few tens of kilobytes while performing inference in just a few milliseconds.

This makes it an excellent candidate for on-device deployment.

Like every supervised learning problem, model quality depends heavily on training data.

Fortunately, intent classification requires relatively straightforward datasets.

Each row contains two values: the user's text and the intent label it should map to.

For example:

text,label
hello,greeting
hi there,greeting
good morning,greeting
how many leaves do i have,leave_balance
check my remaining leave,leave_balance
apply leave tomorrow,apply_leave
request leave for friday,apply_leave
show my salary,salary_info
salary slip,salary_info
company policy,policy_info
working hours,working_hours
contact hr,contact_hr
email hr,contact_hr
thank you,goodbye
bye,goodbye

Although this appears simple, dataset quality often determines whether the classifier succeeds or fails.
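Before training, it can help to load the file and inspect it programmatically. The following is a minimal sketch of a parser for the `text,label` format shown above; the `Example` type and `parseExamples` function are hypothetical names, and the parser assumes labels never contain commas and fields are not quoted.

```swift
import Foundation

/// One labeled training example.
struct Example: Equatable {
    let text: String
    let label: String
}

/// Parses simple "text,label" CSV content with a header row.
/// Assumes the label contains no comma, so each line is split at its LAST comma.
/// Quoted fields and escaped commas are not handled in this sketch.
func parseExamples(csv: String) -> [Example] {
    var examples: [Example] = []
    // dropFirst() skips the header row.
    for line in csv.split(whereSeparator: \.isNewline).dropFirst() {
        guard let comma = line.lastIndex(of: ",") else { continue }
        let text = line[..<comma].trimmingCharacters(in: .whitespaces)
        let label = line[line.index(after: comma)...].trimmingCharacters(in: .whitespaces)
        if !text.isEmpty && !label.isEmpty {
            examples.append(Example(text: text, label: label))
        }
    }
    return examples
}

let sample = """
text,label
hello,greeting
how many leaves do i have,leave_balance
apply leave tomorrow,apply_leave
"""
let examples = parseExamples(csv: sample)
print(examples.count)  // prints 3
```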

Users rarely express the same request in identical words.

For example, all of the following sentences should ideally map to the same intent:

leave balance
remaining leave
how many leaves do I have
show available leave
check my leave count

Including multiple phrasings helps the model generalize beyond the exact examples seen during training.

Each intent should represent one distinct action.

For example:

leave_balance

should never contain examples such as

apply leave tomorrow

Mixing multiple concepts under the same label introduces ambiguity and reduces prediction accuracy.

Suppose one intent contains:

500 examples

while another contains only:

12 examples

The model naturally becomes biased toward the larger class.

Maintaining approximately equal representation across intents generally produces more consistent predictions.
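A quick way to spot this problem is to count the examples per label before training. The sketch below does that; the helper names and the ratio limit of 3.0 are illustrative assumptions rather than recommended values.

```swift
/// Counts how many training examples each label has.
func labelCounts(_ labels: [String]) -> [String: Int] {
    var counts: [String: Int] = [:]
    for label in labels {
        counts[label, default: 0] += 1
    }
    return counts
}

/// Ratio between the largest and smallest class; 1.0 means perfectly balanced.
/// Returns nil when there are no labels.
func imbalanceRatio(_ counts: [String: Int]) -> Double? {
    guard let largest = counts.values.max(),
          let smallest = counts.values.min(),
          smallest > 0 else {
        return nil
    }
    return Double(largest) / Double(smallest)
}

// Reproduces the 500 vs 12 situation described above.
let labels = Array(repeating: "leave_balance", count: 500)
    + Array(repeating: "contact_hr", count: 12)
let counts = labelCounts(labels)

if let ratio = imbalanceRatio(counts), ratio > 3.0 {  // 3.0 is an arbitrary, illustrative limit
    print("Dataset is imbalanced, largest/smallest ratio: \(ratio)")
}
```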

One of the most valuable exercises during dataset creation is imagining how real users naturally phrase requests.

Engineers often write technically correct examples.

Users rarely do.

A robust dataset includes the varied, informal phrasings that real users actually type, not only well-formed sentences.

The closer the training data resembles production traffic, the better the classifier performs.

With a well-structured dataset in place, the next step is converting those examples into a model capable of recognizing user intent from previously unseen text.

Unlike Large Language Models, which are pretrained on vast amounts of unlabeled text, intent classifiers are trained on explicitly labeled examples. During training, each sentence is associated with a predefined label, allowing the algorithm to learn statistical relationships between words, phrases, and the corresponding intent.

Conceptually, the training pipeline can be represented as:

              Training Dataset
                     │
                     ▼
          Text Preprocessing Pipeline
                     │
                     ▼
          Feature Extraction / Tokenization
                     │
                     ▼
          Intent Classification Model
                     │
                     ▼
              Evaluation & Validation
                     │
                     ▼
             Core ML Model (.mlmodel)
                     │
                     ▼
            Bundled with Mobile App

Although the underlying mathematics may differ depending on the chosen algorithm, the overall workflow remains remarkably consistent.

The model repeatedly analyzes labeled examples, gradually adjusting its internal parameters until it can reliably associate previously unseen sentences with the correct intent.

Once training is complete, the learned parameters are exported as a compact Core ML model that executes entirely on the device.

One common misconception is that every Natural Language Processing problem requires a transformer or Large Language Model.

For intent classification, this is rarely true.

The objective is not to generate language.

It is simply to determine which predefined category best matches an input.

Several lightweight algorithms perform exceptionally well for this task.

Apple's Create ML abstracts much of this complexity, allowing developers to train high-quality text classifiers without implementing these algorithms manually.
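As a rough illustration, a training script using the Create ML framework on macOS might look like the sketch below. The file paths, the 80/20 split, the seed, and the metadata values are placeholder assumptions; the column names follow the `text,label` dataset shown earlier. The block is wrapped in `#if canImport(CreateML)` because the framework is only available on macOS.

```swift
#if canImport(CreateML)
import CreateML
import Foundation

// Placeholder paths: point these at your own dataset and output location.
let datasetURL = URL(fileURLWithPath: "intents.csv")
let outputURL = URL(fileURLWithPath: "HRIntentClassifier.mlmodel")

do {
    // Load the "text,label" CSV and hold out 20% of the rows for testing.
    let data = try MLDataTable(contentsOf: datasetURL)
    let (trainingData, testingData) = data.randomSplit(by: 0.8, seed: 5)

    // Train a text classifier on the labeled examples.
    let classifier = try MLTextClassifier(
        trainingData: trainingData,
        textColumn: "text",
        labelColumn: "label"
    )

    // Evaluate on rows the model has not seen during training.
    let metrics = classifier.evaluation(on: testingData, textColumn: "text", labelColumn: "label")
    print("Held-out accuracy: \((1.0 - metrics.classificationError) * 100)%")

    // Export the trained model so it can be added to an Xcode project.
    let metadata = MLModelMetadata(
        author: "Example Author",
        shortDescription: "HR intent classifier",
        version: "1.0"
    )
    try classifier.write(to: outputURL, metadata: metadata)
} catch {
    print("Training failed: \(error)")
}
#endif
```

The Create ML app offers the same workflow through a graphical interface, which may be the simpler route for a first experiment.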

The choice of algorithm is generally less important than the quality of the training dataset.

In many practical systems, careful dataset engineering yields larger accuracy improvements than switching between classification algorithms.

Before text can be processed by a machine learning model, it must be transformed into numerical representations.

This process is known as feature extraction.

Although modern frameworks automate much of this work, understanding the pipeline helps explain why dataset quality is so important.

A simplified transformation pipeline looks like this:

Original Sentence

"How many leaves do I have?"

        │

        ▼

Tokenization

["how","many","leaves","do","i","have"]

        │

        ▼

Normalization (stop-word removal and lemmatization)

["how","many","leave","have"]

        │

        ▼

Numerical Representation

[0.14, 0.82, 0.53, ... ]

        │

        ▼

Intent Prediction

The model never understands English in the human sense.

Instead, it learns statistical relationships between numerical representations and known intent labels.

This distinction explains why diverse training examples matter.

The model is learning patterns—not memorizing complete sentences.
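The pipeline above can be approximated in a few lines. This is a simplified sketch, not what Core ML or Create ML does internally: it uses a bag-of-words count vector, a tiny hand-written stop-word list, and a hypothetical vocabulary, and it skips lemmatization (so "leaves" is not reduced to "leave").

```swift
import Foundation

/// Lowercases the text and splits it into alphanumeric word tokens.
func tokenize(_ text: String) -> [String] {
    text.lowercased()
        .components(separatedBy: CharacterSet.alphanumerics.inverted)
        .filter { !$0.isEmpty }
}

/// Drops very common words. The default list is a tiny illustrative assumption.
func removeStopWords(_ tokens: [String],
                     stopWords: Set<String> = ["do", "i", "my", "the", "a"]) -> [String] {
    tokens.filter { !stopWords.contains($0) }
}

/// Bag-of-words vector: one count per vocabulary entry, in vocabulary order.
func vectorize(_ tokens: [String], vocabulary: [String]) -> [Double] {
    vocabulary.map { word in Double(tokens.filter { $0 == word }.count) }
}

let tokens = removeStopWords(tokenize("How many leaves do I have?"))
print(tokens)  // prints ["how", "many", "leaves", "have"]

let vocabulary = ["how", "many", "leaves", "salary", "have"]  // hypothetical vocabulary
print(vectorize(tokens, vocabulary: vocabulary))  // prints [1.0, 1.0, 1.0, 0.0, 1.0]
```

A sentence with different wording but overlapping tokens produces a similar vector, which is why varied training phrasings help the model generalize.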

Training accuracy alone is not sufficient.

A model that memorizes its training examples may perform poorly when presented with real user input.

A typical evaluation process measures the model on held-out examples that were not used during training.

One particularly useful visualization is the confusion matrix.

Instead of simply reporting an overall accuracy value, the confusion matrix reveals where the model makes mistakes.

For example:

                 Predicted

             Leave   Salary   Policy

Actual Leave    95       2        3

Actual Salary    1      98        1

Actual Policy    4       2       94

This information often exposes overlapping intent definitions, enabling developers to improve the dataset rather than endlessly tuning the model.

In practice, improving the dataset usually produces larger gains than modifying the learning algorithm.
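To show how the confusion matrix translates into per-intent metrics, the sketch below computes precision, recall, and overall accuracy from the example matrix above. The function names are hypothetical; the numbers are the ones from the matrix.

```swift
/// Confusion matrix from the example above: rows are actual classes,
/// columns are predicted classes, in the order Leave, Salary, Policy.
let classNames = ["Leave", "Salary", "Policy"]
let matrix = [
    [95, 2, 3],
    [1, 98, 1],
    [4, 2, 94],
]

/// Recall: of all examples that truly belong to class `c`, the share predicted correctly.
func recall(_ m: [[Int]], _ c: Int) -> Double {
    Double(m[c][c]) / Double(m[c].reduce(0, +))
}

/// Precision: of all examples predicted as class `c`, the share that truly belong to it.
func precision(_ m: [[Int]], _ c: Int) -> Double {
    Double(m[c][c]) / Double(m.map { $0[c] }.reduce(0, +))
}

/// Accuracy: correct predictions (the diagonal) divided by all predictions.
func accuracy(_ m: [[Int]]) -> Double {
    let correct = m.indices.map { m[$0][$0] }.reduce(0, +)
    let total = m.flatMap { $0 }.reduce(0, +)
    return Double(correct) / Double(total)
}

for (i, name) in classNames.enumerated() {
    print(name, "precision:", precision(matrix, i), "recall:", recall(matrix, i))
}
print("accuracy:", accuracy(matrix))
```

Here the Policy row has the lowest recall (94 of 100), and most of its errors land in the Leave column, which hints that those two intents share wording in the dataset.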

After validation, the trained classifier is exported as a Core ML model.

HRIntentClassifier.mlmodel

During the build process, Xcode automatically compiles the model into an optimized runtime representation.

HRIntentClassifier.mlmodel
          │
          ▼
HRIntentClassifier.mlmodelc

The compiled asset becomes part of the application bundle and requires no additional downloads or runtime dependencies.

Unlike cloud-hosted models, inference occurs entirely within the application's process.

No API requests are necessary.

No authentication tokens are required.

No network connection is needed.

Once the model has been bundled with the application, the implementation becomes surprisingly straightforward.

The classifier behaves like any other local resource.

A dedicated routing service encapsulates the interaction with Core ML, keeping the user interface independent from the machine learning implementation.

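import Foundation
import CoreML

/// Routes user text to an intent using the Core ML classifier bundled with the app.
public final class LocalIntentRouter {

    private let model: MLModel

    public init(configuration: MLModelConfiguration = .init()) throws {

        // Xcode compiles HRIntentClassifier.mlmodel into HRIntentClassifier.mlmodelc
        // and copies it into the app bundle at build time.
        guard let modelURL = Bundle.main.url(
            forResource: "HRIntentClassifier",
            withExtension: "mlmodelc"
        ) else {
            throw RouterError.modelNotFound
        }

        model = try MLModel(
            contentsOf: modelURL,
            configuration: configuration
        )
    }

    /// Returns the most likely intent and its confidence, or nil when the
    /// input is empty or the prediction fails.
    public func predictIntent(from text: String) -> PredictionResult? {

        let cleaned = text
            .trimmingCharacters(in: .whitespacesAndNewlines)

        guard !cleaned.isEmpty else {
            return nil
        }

        do {

            // The feature names "text", "label", and "labelProbability" must
            // match the model's interface as shown in Xcode. Some models expose
            // only a label output; in that case the guard below fails and this
            // method returns nil.
            let provider = try MLDictionaryFeatureProvider(
                dictionary: [
                    "text": MLFeatureValue(string: cleaned)
                ]
            )

            let prediction = try model.prediction(from: provider)

            guard
                let label =
                    prediction.featureValue(for: "label")?.stringValue,
                let probabilities =
                    prediction.featureValue(for: "labelProbability")?
                    .dictionaryValue as? [String : Double]
            else {
                return nil
            }

            return PredictionResult(
                intent: label,
                confidence: probabilities[label] ?? 0
            )

        } catch {

            print(error.localizedDescription)
            return nil
        }
    }
}

/// The predicted intent label and the model's confidence in it (0 to 1).
/// Declared public because the public predictIntent(from:) method returns it.
public struct PredictionResult {

    public let intent: String
    public let confidence: Double
}

/// Errors raised while setting up the router.
public enum RouterError: Error {

    case modelNotFound
}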

Notice that the service returns not only the predicted intent but also its associated confidence score.

This confidence value plays an important role in production systems.

Machine learning predictions should never be treated as absolute truth.

Instead, every prediction carries a confidence score representing how certain the model is about its decision.

A practical routing strategy compares this score against a threshold. For example:

Prediction:

leave_balance

Confidence:

0.97

Since confidence is very high, the application immediately opens the Leave Balance screen.

Now consider another example.

Prediction:

policy_info

Confidence:

0.41

A confidence of 41% suggests uncertainty.

Rather than risking an incorrect navigation, the application forwards the request to a cloud-based LLM for further interpretation.

This hybrid decision process provides the best of both worlds.

                 User Query
                      │
                      ▼
             Intent Classifier
                      │
          Confidence Score Generated
                      │
      ┌───────────────┴────────────────┐
      │                                │
 Confidence ≥ Threshold         Confidence < Threshold
      │                                │
      ▼                                ▼
 Execute Local Action          Forward to Cloud AI

Rather than replacing the LLM, the classifier becomes an intelligent gatekeeper that filters predictable requests before they ever leave the device.
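The decision in the diagram above can be sketched as a small pure function. The `Prediction` and `Route` types are hypothetical stand-ins so the example is self-contained, and the default threshold of 0.8 is an illustrative assumption that should be tuned against real traffic.

```swift
/// A classifier result: the predicted label and the model's confidence (0 to 1).
struct Prediction {
    let intent: String
    let confidence: Double
}

/// Where a request should be handled.
enum Route: Equatable {
    case local(intent: String)
    case cloud
}

/// Routes a prediction. The default threshold of 0.8 is an illustrative
/// assumption, not a recommended value.
func route(_ prediction: Prediction?, threshold: Double = 0.8) -> Route {
    // A missing prediction (empty input, model error) also escalates to the cloud.
    guard let prediction = prediction, prediction.confidence >= threshold else {
        return .cloud
    }
    return .local(intent: prediction.intent)
}

// The two examples from the text: 0.97 stays on device, 0.41 is escalated.
print(route(Prediction(intent: "leave_balance", confidence: 0.97)))
print(route(Prediction(intent: "policy_info", confidence: 0.41)))
```

Keeping the decision in a pure function makes the threshold easy to unit test and to adjust without touching the Core ML code.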

From the user's perspective, the entire interaction is almost instantaneous.

User types message

        │

        ▼

Text cleaned

        │

        ▼

Core ML Prediction

        │

        ▼

Confidence Evaluation

        │

        ▼

Execute Local Workflow

The total execution time is typically measured in only a few milliseconds.
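Latency claims like this are worth verifying on target devices. The helper below is a minimal timing sketch; `measureMs` is a hypothetical name, and the closure is a stand-in for a real classifier call.

```swift
import Foundation

/// Runs `work` once and returns its result with the elapsed time in milliseconds.
/// Uses the monotonic system uptime so wall-clock adjustments do not skew the result.
func measureMs<T>(_ work: () -> T) -> (result: T, milliseconds: Double) {
    let start = ProcessInfo.processInfo.systemUptime
    let result = work()
    let elapsed = ProcessInfo.processInfo.systemUptime - start
    return (result, elapsed * 1000)
}

// Stand-in for a real classifier call; replace the closure body with the
// actual prediction to measure it.
let (label, ms) = measureMs { "leave_balance" }
print("Predicted \(label) in \(ms) ms")
```

A single run is noisy, so averaging over many predictions on a real device gives a more trustworthy number than one measurement in the simulator.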

Unlike cloud inference, there are no network handshakes, serialization overhead, authentication requests, or server scheduling delays.

The interaction feels immediate because it occurs entirely inside the application.

This architectural pattern becomes especially valuable in environments with poor connectivity, intermittent network access, or strict privacy requirements.

More importantly, it demonstrates that not every AI interaction requires cloud-scale infrastructure.

Sometimes, the most effective solution is also the simplest: a small, focused model executing directly where the user already is.

Source: dev.to (original article)