# How to Integrate AI and LLMs into Production Web Apps (Lessons from the Field)

> Source: <https://dev.to/ahadnawaz/how-to-integrate-ai-and-llms-into-production-web-apps-lessons-from-the-field-4nj5>
> Published: 2026-05-28 18:42:55+00:00

Everyone is adding AI to their product right now. Most of them are doing it wrong.

Not because they chose the wrong model. Not because they used the wrong library. But because they treated AI integration like a regular feature and skipped all the engineering discipline that production systems require.

I have integrated LLMs into multiple production applications. This is what I wish I had known before I started.

A traditional API call is deterministic. You send a request, you get a predictable response. You can write tests against it. You can cache it. You can reason about it.

An LLM call is not deterministic. The same input can produce different outputs on different runs. The model can refuse, hallucinate, or return output in a format you did not expect. Your system needs to be designed around this reality, not in spite of it.

This means defensive parsing, fallback logic, output validation, and graceful degradation are not optional extras. They are the core of the feature.

The biggest LLMs are not always the right choice. I learned this building EditDeck Pro, an AI creative platform for music.

Some tasks needed a large frontier model for nuanced creative output. Others needed a fast, cheap model that could run many times per session without accumulating significant latency or cost.

The pattern that works:

- Use a lighter model for classification, extraction, and short structured outputs.
- Use a larger model for generation tasks where quality matters more than speed.
- Route dynamically between them based on the task type.

This can reduce your inference costs by 60 to 80 percent on workloads that mix simple and complex tasks.
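A minimal sketch of task-based routing. The task categories, model names, and token limits here are hypothetical placeholders, not the ones used in EditDeck Pro; substitute your provider's real identifiers.

```typescript
// Hypothetical task categories; adjust to the kinds of calls your app makes.
type TaskType = "classification" | "extraction" | "short_structured" | "generation";

interface ModelChoice {
  model: string; // placeholder name, not a real model identifier
  maxOutputTokens: number;
}

const LIGHT_MODEL: ModelChoice = { model: "small-fast-model", maxOutputTokens: 256 };
const LARGE_MODEL: ModelChoice = { model: "large-frontier-model", maxOutputTokens: 2048 };

// One routing table, so adding a task type forces an explicit model decision.
const ROUTES: Record<TaskType, ModelChoice> = {
  classification: LIGHT_MODEL,
  extraction: LIGHT_MODEL,
  short_structured: LIGHT_MODEL,
  generation: LARGE_MODEL,
};

function chooseModel(task: TaskType): ModelChoice {
  return ROUTES[task];
}
```

Keeping the mapping in a single table makes the cost trade-off reviewable in one place, and the `Record<TaskType, ...>` type makes the compiler flag any task type that has no route.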

Prompts are code. They should be versioned, tested, and reviewed like code.

I store prompts in a dedicated module with version numbers. When I change a prompt I run it against a fixed evaluation set of inputs and compare the outputs to the previous version. If the quality drops on any test case, the change does not ship.

This sounds like overhead. It is not. Prompts drift over time as you iterate. Without a system to track changes, you will introduce regressions you cannot diagnose because you do not know what changed.
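The regression check described above could be sketched like this. The `{{input}}` placeholder convention, the `PromptVersion` shape, and the `run` and `score` functions are assumptions for illustration; the actual prompt module is not shown in this post.

```typescript
// Hypothetical shapes for a versioned prompt and an evaluation case.
interface PromptVersion {
  id: string;
  version: number;
  template: string; // contains a {{input}} placeholder (assumed convention)
}

interface EvalCase {
  name: string;
  input: string;
}

// `run` wraps your model call; `score` is a task-specific quality measure you define.
type ModelRunner = (prompt: string) => Promise<string>;
type Scorer = (testCase: EvalCase, output: string) => number;

function renderPrompt(prompt: PromptVersion, input: string): string {
  // split/join avoids the special "$" patterns that String.replace interprets.
  return prompt.template.split("{{input}}").join(input);
}

// Returns the names of eval cases where the candidate scores lower than the previous version.
async function findRegressions(
  previous: PromptVersion,
  candidate: PromptVersion,
  cases: EvalCase[],
  run: ModelRunner,
  score: Scorer
): Promise<string[]> {
  const regressions: string[] = [];
  for (const testCase of cases) {
    const oldScore = score(testCase, await run(renderPrompt(previous, testCase.input)));
    const newScore = score(testCase, await run(renderPrompt(candidate, testCase.input)));
    if (newScore < oldScore) regressions.push(testCase.name);
  }
  return regressions;
}
```

A non-empty result blocks the change. Because model output varies between runs, a single run per case can be noisy; averaging several runs per case is one way to reduce false alarms.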

A practical prompt structure that works well across most tasks:

Keep prompts short and explicit. Long prompts with conflicting instructions produce inconsistent outputs.

LLM calls are slow. A typical generation can take two to ten seconds. You cannot make users wait for a synchronous response on most interactions.

The architecture that works best for most production use cases:

When the user triggers an AI action, the API immediately returns a job ID and sets the status to processing. A background worker handles the actual LLM call. The frontend polls for status or receives an update over WebSocket when the job completes.

This keeps your API response times predictable, lets you retry failed jobs, and gives you visibility into queue depth and processing time.
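A minimal sketch of the job pattern, using an in-memory map in place of a real queue. The `Job` shape, the function names, and the retry count are assumptions for illustration; a production system would use a durable queue and persistent storage.

```typescript
type JobStatus = "processing" | "completed" | "failed";

interface Job {
  id: string;
  status: JobStatus;
  attempts: number;
  result?: string;
  error?: string;
}

type ModelCall = (prompt: string) => Promise<string>;

// In-memory store for illustration only; jobs are lost on restart.
const jobs = new Map<string, Job>();
let nextJobNumber = 0;

// Background worker: retries the model call up to maxAttempts times.
async function processJob(job: Job, prompt: string, callModel: ModelCall, maxAttempts: number): Promise<void> {
  while (job.attempts < maxAttempts) {
    job.attempts += 1;
    try {
      job.result = await callModel(prompt);
      job.status = "completed";
      return;
    } catch (err) {
      job.error = err instanceof Error ? err.message : String(err);
    }
  }
  job.status = "failed";
}

// Called by the API handler: registers the job and returns its ID immediately.
function enqueueJob(prompt: string, callModel: ModelCall, maxAttempts = 3): string {
  nextJobNumber += 1;
  const job: Job = { id: `job_${nextJobNumber}`, status: "processing", attempts: 0 };
  jobs.set(job.id, job);
  void processJob(job, prompt, callModel, maxAttempts); // deliberately not awaited
  return job.id;
}

// Called by the polling endpoint.
function getJob(id: string): Job | undefined {
  return jobs.get(id);
}
```

The handler returns as soon as `enqueueJob` does, so API latency no longer depends on the model. The `attempts` counter and the `jobs` map are also where queue depth and retry metrics would come from.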

For streaming responses, where you want to show output in real time as the model generates it, use Server-Sent Events. They are simpler than WebSockets for unidirectional streaming and well supported in Node.js with NestJS.
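To show what goes over the wire, here is a framework-free sketch of the Server-Sent Events framing. It deliberately avoids NestJS so that it stays self-contained. The `SseSink` interface, the `done` event name, and the `[DONE]` marker are assumptions for illustration; Node's `http.ServerResponse` satisfies the sink interface.

```typescript
// Format one SSE frame. Multi-line data needs one "data:" line per line,
// and a blank line terminates the event.
function sseFrame(data: string, event?: string): string {
  const head = event ? [`event: ${event}`] : [];
  const body = data.split("\n").map((line) => `data: ${line}`);
  return [...head, ...body].join("\n") + "\n\n";
}

// Minimal writable surface (assumed); http.ServerResponse matches it structurally.
interface SseSink {
  setHeader(name: string, value: string): unknown;
  write(chunk: string): unknown;
  end(): unknown;
}

// Forward each model token to the client as soon as it arrives.
async function streamTokens(tokens: AsyncIterable<string>, res: SseSink): Promise<void> {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  for await (const token of tokens) {
    res.write(sseFrame(token));
  }
  // Hypothetical end-of-stream marker so the client knows when to close.
  res.write(sseFrame("[DONE]", "done"));
  res.end();
}
```

On the browser side, an `EventSource` pointed at this endpoint receives each token through its `message` handler and the final marker through a listener for the `done` event.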

If you ask an LLM to return JSON, it will sometimes return malformed JSON. If you ask it to follow a schema, it will occasionally miss a required field. If you ask it to stay within a character limit, it will sometimes exceed it.

Every LLM response in a production system should go through a validation layer before it reaches the user or gets stored in the database.

I use Zod for schema validation in TypeScript. The pattern looks like this: parse the model output, validate it against the expected schema, and if validation fails, either retry the call with the validation error included in the prompt or return a graceful fallback response to the user.

Never pass raw LLM output directly to your frontend or database without validation.
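A sketch of the parse, validate, and retry-or-fallback loop. To stay dependency-free it uses a hand-written validator in place of Zod; with Zod, a schema's `safeParse` would take the place of `parseSummary`. The `Summary` shape, the title length limit, and the retry wording are assumptions for illustration.

```typescript
// Hypothetical expected output shape.
interface Summary {
  title: string;
  bullets: string[];
}

type ParseResult =
  | { kind: "valid"; value: Summary }
  | { kind: "invalid"; error: string };

// Hand-written stand-in for a schema validator.
function parseSummary(raw: string, maxTitleLength = 80): ParseResult {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return { kind: "invalid", error: "Output was not valid JSON." };
  }
  if (typeof data !== "object" || data === null) {
    return { kind: "invalid", error: "Expected a JSON object." };
  }
  const { title, bullets } = data as Record<string, unknown>;
  if (typeof title !== "string" || title.length === 0) {
    return { kind: "invalid", error: 'Missing required string field "title".' };
  }
  if (title.length > maxTitleLength) {
    return { kind: "invalid", error: `"title" exceeds ${maxTitleLength} characters.` };
  }
  if (!Array.isArray(bullets) || !bullets.every((b) => typeof b === "string")) {
    return { kind: "invalid", error: '"bullets" must be an array of strings.' };
  }
  return { kind: "valid", value: { title, bullets: bullets as string[] } };
}

// Retry with the validation error appended to the prompt; fall back when attempts run out.
async function generateValidated(
  prompt: string,
  callModel: (p: string) => Promise<string>,
  fallback: Summary,
  maxAttempts = 2
): Promise<Summary> {
  let currentPrompt = prompt;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = parseSummary(await callModel(currentPrompt));
    if (result.kind === "valid") return result.value;
    currentPrompt = `${prompt}\n\nYour previous reply was rejected: ${result.error} Reply with valid JSON only.`;
  }
  return fallback;
}
```

Only the value returned by `generateValidated` should reach the frontend or the database. This sketch does not catch errors thrown by `callModel` itself; those would need their own handling.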

LLM API costs can escalate quickly. A single user making many requests in a session can generate significant spend if you do not have controls in place.

In every AI feature I build, I implement:

- Per-user rate limits at the API gateway level.
- Daily and monthly spend limits per workspace or account.
- Usage logging so you can analyze which features are generating the most cost.
- Automatic fallback to a cheaper model when the primary model is rate limited or slow.

Set hard cost limits in your provider dashboard as a safety net. You do not want to discover a runaway process or an abuse pattern through your invoice.
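A sketch of the first two controls, combined in one in-process guard. This is a simplification: a gateway-level limiter would normally keep its counters in a shared store, and the class name, the fixed one-minute window, and the USD figures are assumptions for illustration.

```typescript
interface UsageLimits {
  maxRequestsPerMinute: number;
  maxDailySpendUsd: number;
}

interface UsageState {
  windowStart: number; // start of the current one-minute window (ms since epoch)
  requestsInWindow: number;
  day: string; // UTC date, YYYY-MM-DD
  spendTodayUsd: number;
}

// In-memory guard keyed by user or workspace ID (illustration only).
class UsageGuard {
  private readonly limits: UsageLimits;
  private readonly state = new Map<string, UsageState>();

  constructor(limits: UsageLimits) {
    this.limits = limits;
  }

  // Returns true if the request may proceed, and counts it against the window.
  allow(key: string, now: number = Date.now()): boolean {
    const s = this.stateFor(key, now);
    if (s.spendTodayUsd >= this.limits.maxDailySpendUsd) return false;
    if (s.requestsInWindow >= this.limits.maxRequestsPerMinute) return false;
    s.requestsInWindow += 1;
    return true;
  }

  // Call after each model response with the cost computed from token usage.
  recordSpend(key: string, costUsd: number, now: number = Date.now()): void {
    this.stateFor(key, now).spendTodayUsd += costUsd;
  }

  private stateFor(key: string, now: number): UsageState {
    const day = new Date(now).toISOString().slice(0, 10);
    let s = this.state.get(key);
    if (!s) {
      s = { windowStart: now, requestsInWindow: 0, day, spendTodayUsd: 0 };
      this.state.set(key, s);
    }
    if (now - s.windowStart >= 60000) {
      s.windowStart = now;
      s.requestsInWindow = 0;
    }
    if (s.day !== day) {
      s.day = day;
      s.spendTodayUsd = 0;
    }
    return s;
  }
}
```

When `allow` returns false, the caller can reject the request or switch to a cheaper model, depending on which limit was hit.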

Not every request needs to reach the model. For any task where the same input reliably produces equivalent output, caching can dramatically reduce both latency and cost.

Semantic caching is particularly useful here. Instead of exact match caching, you embed the input and store the response keyed by that embedding vector. When a similar input comes in, you retrieve the cached response if the similarity is above a threshold.

This works well for FAQ-style features, content suggestions based on category, and any task where slight variations in input should produce the same response.
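A minimal sketch of a semantic cache using cosine similarity and a linear scan. It assumes the embeddings come from an embedding model called elsewhere, and the 0.9 default threshold is an arbitrary placeholder to tune against real traffic. At scale, a vector index would replace the scan.

```typescript
type Embedding = number[];

function cosineSimilarity(a: Embedding, b: Embedding): number {
  if (a.length !== b.length) throw new Error("Embeddings must have the same length.");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private readonly threshold: number;
  private readonly entries: { embedding: Embedding; response: string }[] = [];

  constructor(threshold = 0.9) {
    this.threshold = threshold; // assumed default; tune per feature
  }

  // Returns the most similar cached response at or above the threshold, if any.
  get(query: Embedding): string | undefined {
    let bestScore = -1;
    let bestResponse: string | undefined;
    for (const entry of this.entries) {
      const score = cosineSimilarity(query, entry.embedding);
      if (score >= this.threshold && score > bestScore) {
        bestScore = score;
        bestResponse = entry.response;
      }
    }
    return bestResponse;
  }

  set(embedding: Embedding, response: string): void {
    this.entries.push({ embedding, response });
  }
}
```

A threshold that is too low serves wrong answers to questions that merely look similar, so it is worth logging cache hits alongside the similarity score while tuning.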

Standard application monitoring is not enough for AI features. You need to track:

- Latency per model and per task type.
- Token usage per request, broken down by input and output.
- Validation failure rates, which indicate prompt quality issues.
- User-level engagement with AI-generated content, which tells you whether the outputs are actually useful.

A feature that generates outputs users never interact with is not a working feature regardless of whether the API calls succeed.
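A sketch of how the first three metrics could be aggregated from per-call records. The record fields and the `model:taskType` grouping key are assumptions for illustration; engagement data would come from your product analytics and is not covered here.

```typescript
// One record per LLM call (hypothetical shape).
interface LlmCallRecord {
  model: string;
  taskType: string;
  latencyMs: number;
  inputTokens: number;
  outputTokens: number;
  validationPassed: boolean;
}

interface TaskStats {
  calls: number;
  avgLatencyMs: number;
  inputTokens: number;
  outputTokens: number;
  validationFailureRate: number;
}

// Group records by model and task type, then compute averages and rates.
function summarizeCalls(records: LlmCallRecord[]): Map<string, TaskStats> {
  const totals = new Map<string, { calls: number; latencyMs: number; inputTokens: number; outputTokens: number; failures: number }>();
  for (const r of records) {
    const key = `${r.model}:${r.taskType}`;
    const t = totals.get(key) ?? { calls: 0, latencyMs: 0, inputTokens: 0, outputTokens: 0, failures: 0 };
    t.calls += 1;
    t.latencyMs += r.latencyMs;
    t.inputTokens += r.inputTokens;
    t.outputTokens += r.outputTokens;
    if (!r.validationPassed) t.failures += 1;
    totals.set(key, t);
  }
  const stats = new Map<string, TaskStats>();
  totals.forEach((t, key) => {
    stats.set(key, {
      calls: t.calls,
      avgLatencyMs: t.latencyMs / t.calls,
      inputTokens: t.inputTokens,
      outputTokens: t.outputTokens,
      validationFailureRate: t.failures / t.calls,
    });
  });
  return stats;
}
```

A rising `validationFailureRate` for one task type after a prompt change or a provider model update is an early signal worth alerting on.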

One mistake to avoid: shipping an AI feature without a way to turn it off.

Model quality changes when providers update their models. API reliability has incidents. Your prompt may suddenly produce bad outputs for a class of inputs you did not anticipate.

Every AI feature should have a feature flag that lets you disable it instantly without a code deployment. The fallback should be a non-AI version of the same functionality where possible.
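A minimal sketch of a kill switch wrapper. The `isEnabled` lookup stands in for whatever flag service or config store you use, and all names here are hypothetical. Treating a failed flag lookup as "disabled" is a design choice in this sketch, not a requirement.

```typescript
// Reads the flag from a remote config or flag service (assumed to be async).
type FlagSource = () => Promise<boolean>;

// Runs the AI path only when the flag is on; otherwise, or on any failure,
// runs the non-AI fallback.
async function withKillSwitch<T>(
  isEnabled: FlagSource,
  aiPath: () => Promise<T>,
  fallbackPath: () => Promise<T>
): Promise<T> {
  let enabled = false;
  try {
    enabled = await isEnabled();
  } catch {
    enabled = false; // flag lookup failed: treat the feature as disabled
  }
  if (!enabled) return fallbackPath();
  try {
    return await aiPath();
  } catch {
    return fallbackPath(); // AI path failed: degrade instead of erroring
  }
}
```

Because the flag is read on every call, flipping it in the flag store takes effect without a deployment.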

This is not pessimism. It is the same defensive engineering you apply to any external dependency.

If you are adding your first LLM integration to a production application, start with a low-stakes, read-only feature. Summaries, suggestions, and search enhancements are good first candidates. They add value without being in the critical path, which gives you space to learn how the model behaves in your specific context before you build anything that writes data or makes decisions.

Get the infrastructure right first: async handling, output validation, rate limiting, and monitoring. Then expand.

AI is powerful software. It rewards the same engineering discipline that all powerful software requires.

I am Ahad, Founder of REIVEX Technologies. I build AI platforms and production web systems for clients across the US, Middle East, and South Asia. See more at ahadnawaz.dev.
