Why I Move AI Model Calls to the Server — Security, Performance, and Everything In Between


When I was building Logicvisor — an AI-powered tool that reviews your algorithmic code, breaks down time and space complexity, and gives you the kind of feedback you'd want before a technical interview — I had to make a foundational architectural decision early on.

Where do the AI calls actually live?

It sounds simple. It really isn't. And I think it's a decision a lot of developers make too quickly, usually defaulting to whatever gets something running fastest. So I want to walk through how I thought about it, what the tradeoffs actually look like in practice, and why for Logicvisor — and honestly most production projects I work on — the answer was never really up for debate.

When your app needs to talk to an AI model — Gemini, Claude, GPT, whatever — that HTTP request to the model provider has to originate somewhere. You have two options:

Client-side: The browser makes the call directly to the AI provider's API.

Server-side: The browser calls your server, your server calls the AI provider, and the response comes back through your infrastructure.

That's the whole decision. But the consequences of each branch run deep.

Let's be fair to the other side first, because client-side AI calls aren't just laziness — there are legitimate reasons to reach for them.

Zero backend overhead. If you're prototyping, building an MVP, or hacking something together for a weekend project, standing up a server just to proxy AI calls adds friction you might not need yet. The client calls the API, gets a response, done.

One less network hop. Client → AI provider is a straight line. Client → your server → AI provider is also a straight line, but a longer one. Every additional hop is a potential source of latency, and if your server is not geographically close to the AI provider, that gap compounds.

Fast iteration during development. Tweak a prompt, refresh the page, see the result. No redeployment cycle, no server restart. For the early exploratory phase of building with AI, this feedback loop is genuinely valuable.

Fine for purely client-facing tools. If you're building something that doesn't touch your own database, doesn't need user sessions, and doesn't have sensitive business logic — a personal productivity tool, a browser extension, an internal utility — client-side calls can be perfectly appropriate.

So that's the honest upside. Now here's where it falls apart.

This is the most obvious one, but it's worth being precise about why it's as bad as it is.

When you make an API call from the browser, your API key has to be in that request. There's no way around this — the provider needs to authenticate you. And since that request is made from the browser, the key is accessible to anyone who opens DevTools, intercepts traffic, or extracts it from your bundled JavaScript.

The consequence isn't just that someone can see your key. It's that they can use it. At your expense. Without your knowledge. AI API billing is usage-based, which means a single bad actor with your key can run up a bill that drains your account before your monitoring even fires an alert — if you have monitoring at all.

Key rotation helps, but it's reactive. The damage is usually already done.

This one gets less attention but matters more than people realize.

The prompts you write are often where your actual product value lives. If you've spent time crafting a system prompt that makes your AI reviewer give structured, consistent, high-quality feedback on algorithmic code — that prompt is the product. Client-side calls expose it completely. A competitor can open DevTools, read your system prompt, and replicate your core feature in an afternoon.

On the server, your prompts never leave your infrastructure. The client sends input; the server decides what to do with it.

On the client side, there's nothing stopping a user from writing a script that hammers your AI endpoint in a loop. Every one of those requests hits the AI provider and costs you tokens. You have no rate limiting, no request validation, no way to enforce quotas per user.

You're not just vulnerable to malicious actors either — a bug in your own frontend code that causes unintended re-fetching can silently burn through your API budget.

AI API calls cost money per token. If multiple users ask your tool to review functionally identical code, why would you want to pay for that same inference three hundred times?

On the client side, you can't cache at the API level. Every identical request goes to the provider, incurs latency, and costs tokens. On the server, you can cache responses intelligently — hash the input, check your cache layer, return the cached result. You pay once.

Without server-side infrastructure, you have no centralized view of how your AI layer is actually being used. Which prompts are performing well? Which inputs are producing garbage responses? Which users are hitting rate limits? Where is your token spend going?

Client-side AI calls mean you're guessing at all of this. Logs, monitoring, and observability — the basic instrumentation of a production system — require a server in the loop.

With that context established, here's what you actually get when the AI calls live on the server.

Your key lives in an environment variable on the server. The client has zero knowledge of it, zero access to it, and zero ability to extract it. This is the minimum acceptable security posture for any application that will see real users.
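A minimal sketch of that posture, assuming a hypothetical variable name `AI_PROVIDER_API_KEY` (the article doesn't say what Logicvisor's is called). The environment is passed in as a parameter so the function is easy to test; on a Node server you would pass `process.env`.

```typescript
// Hypothetical variable name; substitute whatever your deployment uses.
const KEY_NAME = "AI_PROVIDER_API_KEY";

type Env = Record<string, string | undefined>;

// Read the provider key from the server's environment and fail fast if it
// is missing, so a misconfigured deploy breaks at startup, not mid-request.
function loadProviderKey(env: Env): string {
  const key = env[KEY_NAME]?.trim();
  if (!key) {
    throw new Error(`${KEY_NAME} is not set on the server`);
  }
  return key;
}

// Mask all but the last four characters, for use in logs.
function maskKey(key: string): string {
  if (key.length <= 4) return "****";
  return "*".repeat(key.length - 4) + key.slice(-4);
}
```

The key is only ever read inside server code, so it never ends up in the bundle shipped to the browser. If it has to appear in a log line, the masked form is the one to print.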

You decide how many requests a given user can make in a given window. You can enforce this per account, per IP, per session — whatever your threat model calls for. Abuse becomes something you manage rather than something that happens to you.

export async function enforceAIRateLimit(
    userId: string,
    request?: NextRequest
): Promise<RateLimitResult> {
    const tierLimits = await getUserTierLimits(userId);

    if (!tierLimits) {
        throw new RateLimitError("Unable to determine user tier limits");
    }

    // Free/Pro users: per-minute limits
    if (tierLimits.ai_requests_per_minute !== null) {
        const minuteLimit = await checkRateLimit(userId, "ai_request", "minute");

        if (!minuteLimit.isAllowed) {
            await logRateLimitViolation(userId, "ai_request", "minute", ...);
            throw new RateLimitError(`Rate limit exceeded. Resets at ${minuteLimit.resetTime.toISOString()}`);
        }

        return minuteLimit;
    }

    // Admin users: daily + monthly
    const dailyLimit = await checkRateLimit(userId, "ai_request", "daily");
    // ...monthly check follows same pattern
}
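The `checkRateLimit` helper isn't shown above. As a rough illustration of what a check like that might do, here is a minimal in-memory fixed-window limiter. The class name and the `remaining` field are assumptions made for this sketch; only `isAllowed` and `resetTime` come from the excerpt. A real deployment would presumably keep the counters in a database or another shared store so they survive restarts and work across multiple instances.

```typescript
interface WindowCheck {
  isAllowed: boolean;
  remaining: number;
  resetTime: Date;
}

// Hypothetical fixed-window limiter: at most `limit` requests per key
// in each window of `windowMs` milliseconds.
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is injectable so the behaviour can be tested without waiting.
  check(key: string, now: number = Date.now()): WindowCheck {
    let entry = this.windows.get(key);
    // Start a fresh window if none exists or the old one has elapsed
    if (!entry || now - entry.start >= this.windowMs) {
      entry = { start: now, count: 0 };
      this.windows.set(key, entry);
    }
    const resetTime = new Date(entry.start + this.windowMs);
    if (entry.count >= this.limit) {
      return { isAllowed: false, remaining: 0, resetTime };
    }
    entry.count += 1;
    return { isAllowed: true, remaining: this.limit - entry.count, resetTime };
  }
}
```

Fixed windows are the simplest scheme to reason about, but they allow short bursts at window boundaries. A sliding window or token bucket smooths that out at the cost of more bookkeeping.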

Identical or near-identical inputs can return cached results, cutting both latency and cost. For a tool like Logicvisor where multiple users might submit similar sorting algorithm implementations, the savings on repeated inferences compound quickly.

// Normalize the code to a canonical form before hashing
// so that formatting differences don't result in cache misses
const canonicalCode = await canonicalizeCodeAST(solution, preferred_language);
const canonicalHash = await createCanonicalHash(
  typeof canonicalCode === "string" ? canonicalCode : ""
);

// Check cache before hitting the AI provider
const cachedReview = await getCachedAIReview(
  preferred_model.id + "-" + canonicalHash
);
if (cachedReview) {
  return NextResponse.json(
    { success: true, data: cachedReview },
    { status: 201 }
  );
}
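The helpers in that excerpt (`canonicalizeCodeAST`, `createCanonicalHash`, `getCachedAIReview`) aren't shown, so the following is only a sketch of the same idea with cruder stand-ins, all of them assumptions: whitespace collapsing instead of AST canonicalization, a small FNV-1a-style hash instead of whatever hash the real code uses, and an in-memory map with a TTL instead of a real cache layer.

```typescript
// Crude stand-in for AST canonicalization: collapse whitespace runs.
// Unlike a real AST pass, this also touches whitespace inside string literals.
function normalizeSource(code: string): string {
  return code.replace(/\s+/g, " ").trim();
}

// 32-bit FNV-1a-style hash over UTF-16 code units, hex-encoded.
// Fine for a sketch; production would want a cryptographic hash such as SHA-256.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

class ReviewCache<T> {
  private entries = new Map<string, { value: T; expiresAt: number }>();
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Same key shape as the excerpt above: model id, a dash, then the hash
  static key(modelId: string, code: string): string {
    return `${modelId}-${fnv1a(normalizeSource(code))}`;
  }

  get(key: string, now: number = Date.now()): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (now >= entry.expiresAt) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

Including the model id in the key matters: two models reviewing the same code should not share a cached answer. Whitespace collapsing only catches formatting differences between tokens that are already separated, so `a=1` and `a = 1` still hash differently, which is the kind of gap an AST-based canonical form closes.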

This is where things get architecturally interesting. The client sends raw input — code, a question, a request. But your server knows things the client doesn't: who the user is, what their history looks like, what tier they're on, what language they've selected, what results they've already received. All of that context can be injected into the prompt before it ever leaves your infrastructure.

The client can't fake or manipulate that context because it never touches it.
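As an illustration, server-side prompt assembly might look something like the sketch below. Every name here is hypothetical, and the system prompt is a placeholder rather than Logicvisor's actual prompt. What it shows is that the client contributes only the code and the language, while the tier and history come from data the server looked up itself.

```typescript
// What the browser is allowed to send
interface ClientInput {
  code: string;
  language: string;
}

// What the server looks up on its own (session, database); never client-supplied
interface ServerContext {
  tier: "free" | "pro";
  priorReviewCount: number;
}

// Placeholder text; the real system prompt stays on the server and is not shown here.
const SYSTEM_PROMPT = "You are a code reviewer. Analyse time and space complexity.";

function buildPrompt(input: ClientInput, ctx: ServerContext): string {
  const depth = ctx.tier === "pro" ? "detailed" : "concise";
  return [
    SYSTEM_PROMPT,
    `Review depth: ${depth}`,
    `Prior reviews for this user: ${ctx.priorReviewCount}`,
    `Language: ${input.language}`,
    "Code to review:",
    input.code,
  ].join("\n");
}
```

The client-supplied fields are still untrusted input, so length limits and validation belong in front of a function like this.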

Every request is logged. Every response is traceable. You can monitor token usage, flag anomalous behavior, track which prompts produce the best results, and debug production issues with actual data. This is what running software in production looks like.

await updateAPIUsageAnalytics("/api/internal", true, responseTime, aiTokensUsed, estimatedCost, user.id, {
  modelName: modelUsed,
  modelProvider,
  inputTokens,
  outputTokens,
  totalTokens: aiTokensUsed,
  actualCostUsd,
});
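How `estimatedCost` is computed isn't shown in the article. One plausible way to derive it from the token counts is sketched below. The model names and per-million-token prices are placeholders invented for the example, not any provider's real pricing.

```typescript
interface TokenPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Placeholder prices in USD per million tokens. NOT real provider pricing.
const PRICE_TABLE: Record<string, TokenPrice> = {
  "example-large": { inputPerMillion: 2.0, outputPerMillion: 8.0 },
  "example-small": { inputPerMillion: 0.5, outputPerMillion: 1.5 },
};

// Input and output tokens are usually priced differently, so cost them separately.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const price = PRICE_TABLE[model];
  if (!price) {
    throw new Error(`No price configured for model "${model}"`);
  }
  const micros = inputTokens * price.inputPerMillion + outputTokens * price.outputPerMillion;
  return micros / 1000000;
}
```

With these placeholder numbers, a request with 1,000 input tokens and 500 output tokens on `example-large` costs (1000 × 2 + 500 × 8) / 1,000,000 = 0.006 USD. Throwing on an unknown model is a deliberate choice in this sketch: silently recording a cost of zero would hide spend rather than surface a missing configuration.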

There's a broader architectural win here that goes beyond just AI calls. Without a server in the middle, the client has to orchestrate everything itself: call the AI, wait for the response, then maybe hit your database, wait again, then update the UI. Each of those is a visible wait for the user.

With a Backend for Frontend (BFF) pattern, the client makes one request. The server handles the AI call, processes the response, queries the database if needed, applies any business logic, and returns a single resolved payload. The user feels one network round trip instead of a cascading waterfall of them.
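A minimal sketch of that aggregation step, with the data sources injected as plain functions. The names (`PageDeps`, `loadReviewPage`) are hypothetical, and the sketch assumes the three lookups are independent of each other. If the AI call needs the user context first, that part has to stay sequential.

```typescript
interface ReviewPagePayload {
  user: { id: string; tier: string };
  review: string;
  history: string[];
}

// Hypothetical data sources; in a real app these would wrap the
// database client and the AI provider SDK.
interface PageDeps {
  getUser(userId: string): Promise<{ id: string; tier: string }>;
  getHistory(userId: string): Promise<string[]>;
  runReview(code: string): Promise<string>;
}

// One server-side function resolves everything the page needs,
// so the browser makes a single request and gets a single payload.
async function loadReviewPage(
  deps: PageDeps,
  userId: string,
  code: string
): Promise<ReviewPagePayload> {
  // Independent lookups run concurrently instead of as a waterfall
  const [user, history, review] = await Promise.all([
    deps.getUser(userId),
    deps.getHistory(userId),
    deps.runReview(code),
  ]);
  return { user, review, history };
}
```

If any one of the three calls rejects, `Promise.all` rejects as a whole, which gives the route handler a single place to turn the failure into a proper error response.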

Before getting into the implementation specifics, here's the architectural difference visualised.

Client-side: Browser → AI Provider directly · API key exposed in transit

Server-side: Browser → API Route → [Cache · Rate Limiter · AI Provider · DB] → Browser

Let me get concrete. Here's what moving the AI layer to the server actually looked like in practice.

Logicvisor uses Supabase on the backend. At various points in the app, I need to pull data from multiple tables, run the AI review, and return everything the page needs in one shot.

If this were all happening on the client, you'd be looking at: call Supabase for user context → wait → call the AI provider → wait → call Supabase again for historical reviews → wait → render. Each of those waits is visible to the user, and each one is an opportunity for something to fail mid-chain.

On the server, those calls happen in close proximity to each other and to the data. The AI call, the database queries, and any necessary transformations all resolve server-side, and the client gets one clean response. The user experiences a single state, not a series of UI flickers.

There's also the matter of compute. Parsing and processing a large AI response — stripping JSON fences, validating structure, transforming the output into the format the UI expects — is work that browsers are not well-suited for. Browsers are memory and CPU constrained by design, and they're competing with the DOM, with other tabs, with everything the user has open. A server doesn't have those constraints.

Moving calls to the server means the client's entire interaction is scoped to your own API. It calls your endpoint, gets a response, done. It has no visibility into what your server does with that request internally — which external APIs it calls, what keys those calls carry, or how the response was constructed.

This isn't security through obscurity; the separation is a structural property of the architecture. The server-to-provider traffic never passes through the user's machine, so DevTools and client-side network inspection only ever see calls to your own API. An attacker now has to compromise your server rather than read a browser tab, which dramatically raises the bar.

Your system prompts — the part of the product that actually encodes your domain knowledge and review methodology — never leave the server. That's not a small thing.

A server in the loop makes proper auth architecture dramatically cleaner. I was able to set HttpOnly cookies, attach signed JWT tokens, and build a stateless authentication and authorization system that the client participates in without controlling.

Without a server, you end up storing tokens in localStorage or client-side state, where any script running on the page can read them. That makes them a well-documented target for XSS. The session becomes something the client manages, which means it's something an attacker can manipulate.
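For illustration, here is roughly what the server sends when it sets such a cookie. This is a hand-rolled sketch of the `Set-Cookie` header value, not Logicvisor's actual code. The function name and the choice of `SameSite=Lax` are assumptions, and in practice a framework's cookie helper would normally build this string for you.

```typescript
interface SessionCookieOptions {
  maxAgeSeconds: number;
  // Defaults to true; only disable for local development over plain HTTP
  secure?: boolean;
}

// Build a Set-Cookie header value for a session token.
// HttpOnly keeps the cookie out of reach of page JavaScript.
function buildSessionCookie(
  name: string,
  token: string,
  opts: SessionCookieOptions
): string {
  const parts = [
    `${name}=${encodeURIComponent(token)}`,
    "Path=/",
    `Max-Age=${opts.maxAgeSeconds}`,
    "HttpOnly",
    "SameSite=Lax",
  ];
  if (opts.secure !== false) {
    parts.push("Secure"); // only send over HTTPS
  }
  return parts.join("; ");
}
```

With `HttpOnly` set, the browser attaches the cookie to requests automatically but page scripts cannot read it, which is the property localStorage cannot offer.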

Token costs are real. For Logicvisor, where users might submit variations of common algorithm patterns — bubble sort, binary search, dynamic programming problems — I can cache AI responses keyed on a normalized hash of the input. A user submitting a well-known algorithm implementation gets a fast, cached response. The AI provider gets called once.

This also improves response times for cached queries significantly. The round trip to an AI provider is the most expensive part of the request by a wide margin. Eliminating it for repeat queries is the single highest-leverage performance optimization available to you.

Here's something the client-side approach makes nearly impossible: swapping AI providers without your frontend caring at all.

Logicvisor supports both Gemini and Groq depending on the user's selected model. Gemini for deeper analysis with its thinking budget and Google Search grounding, Groq for speed. Two different SDKs, two different response shapes, two different token counting strategies, two different pricing models. The client knows none of this. It sends a request, it gets a review back.

That abstraction only works because the AI calls live on the server:

switch (modelProvider.toLowerCase()) {
  case "google":
    // Gemini — thinking budget + Google Search grounding
    const response = await ai.models.generateContent({
      model: preferred_model.id,
      config: {
        thinkingConfig: { thinkingBudget: prompt.estimatedTokens },
        tools: [{ googleSearch: {} }],
        seed: SEED,
        temperature: TEMPERATURE,
      },
      contents,
    });
    aiReviewText = response.text ?? "";
    // Extract actual token counts from usageMetadata
    inputTokens = response.usageMetadata?.promptTokenCount || 0;
    outputTokens = response.usageMetadata?.candidatesTokenCount || 0;
    break;

  case "groq":
    // Llama 3.3 70B — forces structured JSON response
    const groqResponse = await groq.chat.completions.create({
      messages: [{ role: "user", content: prompt.content }],
      model: groqModel,
      temperature: TEMPERATURE,
      seed: SEED,
      response_format: { type: "json_object" },
      stream: false,
    });
    aiReviewText = groqResponse.choices[0].message.content ?? "";
    inputTokens = groqResponse.usage?.prompt_tokens || 0;
    outputTokens = groqResponse.usage?.completion_tokens || 0;
    break;
}

If I wanted to add a third provider tomorrow (say, Claude for a specific model tier), that's a new case block on the server. The client contract doesn't change. No frontend deployment, no API key exposure, no breaking changes for users mid-session.

Try doing this cleanly when the calls are in the browser. You'd be shipping provider-specific SDK logic, API keys, and token counting math directly to the client — and every provider swap would mean a frontend change. The server is the only place this kind of abstraction is clean.

This also matters for cost tracking. Notice that each branch extracts token counts differently: Gemini from usageMetadata, Groq from usage. That per-provider normalization feeds into the analytics pipeline downstream, giving you a consistent view of cost across providers regardless of how each SDK reports it. That's only possible because all the provider-specific handling is in one place.
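The normalization step can be pulled out into a small pure function. The sketch below reuses the field names from the switch statement above (`promptTokenCount`, `candidatesTokenCount`, `prompt_tokens`, `completion_tokens`). The surrounding types are simplified stand-ins written for this example, not the SDKs' real response types.

```typescript
interface NormalizedUsage {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
}

// Simplified stand-ins for the parts of each SDK response that carry usage
interface GeminiLikeResponse {
  usageMetadata?: { promptTokenCount?: number; candidatesTokenCount?: number };
}
interface GroqLikeResponse {
  usage?: { prompt_tokens?: number; completion_tokens?: number };
}

type ProviderResponse =
  | { provider: "google"; response: GeminiLikeResponse }
  | { provider: "groq"; response: GroqLikeResponse };

// Map each provider's usage fields onto one shape for analytics
function normalizeUsage(result: ProviderResponse): NormalizedUsage {
  let inputTokens = 0;
  let outputTokens = 0;
  switch (result.provider) {
    case "google":
      inputTokens = result.response.usageMetadata?.promptTokenCount ?? 0;
      outputTokens = result.response.usageMetadata?.candidatesTokenCount ?? 0;
      break;
    case "groq":
      inputTokens = result.response.usage?.prompt_tokens ?? 0;
      outputTokens = result.response.usage?.completion_tokens ?? 0;
      break;
  }
  return { inputTokens, outputTokens, totalTokens: inputTokens + outputTokens };
}
```

Because the union is discriminated on `provider`, adding a third provider means adding one variant and one case, and the compiler points at every place that needs to handle it.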

Here's the unglamorous part that doesn't get written about enough: AI models don't always return clean output.

Gemini, for example, sometimes wraps JSON responses in markdown fences even when you've explicitly told it not to. If your UI is trying to parse that response and render structured data, you need to clean it before it gets anywhere near the client.

On the server, I handle all of that: strip the fences, validate the JSON structure, handle the cases where the model returned something unexpected, and only send a clean, predictable payload to the client. If something goes wrong at this layer, I can log it, inspect it, and fix the prompt. The client just sees a well-structured response or a proper error.

export function extractJSONFromMarkdown(markdownString: string) {
  try {
    // Remove the ```json wrapper
    let jsonString = markdownString.trim();

    // Remove ```json from start (7 characters)
    if (jsonString.startsWith("```json")) {
      jsonString = jsonString.substring(7);
    }

    // Remove ``` from end
    if (jsonString.endsWith("```")) {
      jsonString = jsonString.substring(0, jsonString.length - 3);
    }

    // Parse the JSON
    const parsedData: PromptResult = JSON.parse(jsonString.trim());

    // Extract and decode the markdown_review field
    let markdownReview = parsedData.markdown_review;
    if (markdownReview) {
      // Decode JSON escape sequences
      markdownReview = markdownReview
        .replace(/\\n/g, "\n")   // Convert \n to actual newlines
        .replace(/\\t/g, "\t")   // Convert \t to actual tabs
        .replace(/\\r/g, "\r")   // Convert \r to carriage returns
        .replace(/\\"/g, '"')    // Convert \" to actual quotes
        .replace(/\\\\/g, "\\"); // Convert \\ to actual backslashes

      return { parsedData, markdownReview };
    } else {
      console.log("No markdown_review field found");
      return null;
    }
  } catch (error) {
    console.error("Error parsing JSON:", error);
    return null;
  }
}

If this processing happened on the client, every user's browser would be doing it — inconsistently, with no visibility, and with no way to fix edge cases without a frontend deployment.
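The "validate the JSON structure" step can be made explicit with a type guard, so the rest of the server only ever handles a payload that has been checked. This is a sketch under assumptions: the only field known from the article is `markdown_review`, and the names `ReviewPayload`, `stripFence` and `parseReview` are made up for the example.

```typescript
// The only field the article confirms; a real payload likely has more.
interface ReviewPayload {
  markdown_review: string;
}

type ParseResult =
  | { ok: true; data: ReviewPayload }
  | { ok: false; error: string };

// Built at runtime so this snippet contains no literal fence markers
const FENCE = "`".repeat(3);

// Remove an optional opening fence line (with or without a language tag)
// and an optional closing fence.
function stripFence(raw: string): string {
  let text = raw.trim();
  if (text.startsWith(FENCE)) {
    const firstNewline = text.indexOf("\n");
    text = firstNewline === -1 ? "" : text.slice(firstNewline + 1);
  }
  if (text.endsWith(FENCE)) {
    text = text.slice(0, -FENCE.length);
  }
  return text.trim();
}

function isReviewPayload(value: unknown): value is ReviewPayload {
  if (typeof value !== "object" || value === null) return false;
  const review = (value as Record<string, unknown>).markdown_review;
  return typeof review === "string" && review.trim().length > 0;
}

// Returns a tagged result instead of null so callers can log why parsing failed
function parseReview(raw: string): ParseResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(stripFence(raw));
  } catch {
    return { ok: false, error: "invalid JSON" };
  }
  if (!isReviewPayload(parsed)) {
    return { ok: false, error: "missing markdown_review" };
  }
  return { ok: true, data: parsed };
}
```

Returning a reason alongside the failure makes it possible to tell "the model returned prose" apart from "the model returned JSON with the wrong shape" in the logs, and those two usually call for different prompt fixes.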

Honestly, not for Logicvisor.

Client-side AI calls are a valid tool in specific, narrow contexts — personal tools, internal utilities, quick prototypes where security is not a concern and scale is not a goal. The moment you have real users, real API costs, and real data flowing through your system, the calculus changes completely.

Security, performance, observability, and maintainability all point in the same direction. The extra infrastructure is real overhead. But it's the kind of overhead that pays for itself the first time an abuse attempt hits your rate limiter instead of your API bill.

For any production system where AI inference is part of the core product — keep it on the server. Build the client thin. Let the backend do the heavy lifting.

source & further reading

dev.to (original article)
