Reducing LLM Costs: Best Practices and Techniques

wpnews.pro

LLM costs accumulate in ways that are not always obvious. Tokens consumed by system prompts, repeated context windows, and verbose JSON outputs all inflate bills before a single useful response is returned. For teams running agentic workflows or processing long documents, the standard token-based meter can turn a prototype into a budget risk. The good news is that cost optimization is a systems problem, not just a modeling problem. With the right architecture decisions, you can cut inference spend without sacrificing quality.

Most providers bill by the token. That design rewards short prompts and penalizes long context. If your application passes entire documents, maintains multi-turn agent memory, or implements retrieval-augmented generation with large chunks, input tokens often outpace output tokens by an order of magnitude.

Oxlo.ai uses flat, per-request pricing. One API call costs the same whether you send a 50-token greeting or a 50,000-token legal brief. For long-context summarization, coding agents that keep full file trees in context, or conversational assistants with extensive system prompts, that model removes the direct coupling between context size and cost. You can design for accuracy and depth rather than token economy. See Oxlo.ai pricing for plan details.
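To make the difference between the two billing models concrete, the sketch below estimates what a single call costs under each. Every price in it is a hypothetical placeholder chosen for illustration, not Oxlo.ai's or any other provider's actual rate, and the function names are invented for this example. Substitute the figures from the pricing pages you actually use.

```python
def token_cost(input_tokens, output_tokens, in_price_per_m, out_price_per_m):
    """Cost of one call when billed per token (prices are per million tokens)."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000


def break_even_input_tokens(flat_price, output_tokens, in_price_per_m, out_price_per_m):
    """Input size above which a flat per-request price beats token billing."""
    output_cost = output_tokens * out_price_per_m / 1_000_000
    remaining = max(flat_price - output_cost, 0.0)
    return remaining * 1_000_000 / in_price_per_m


# Hypothetical rates for illustration only (USD).
IN_PRICE, OUT_PRICE, FLAT_PRICE = 3.00, 15.00, 0.01

greeting = token_cost(50, 100, IN_PRICE, OUT_PRICE)
legal_brief = token_cost(50_000, 1_000, IN_PRICE, OUT_PRICE)
print(f"greeting: ${greeting:.5f}  legal brief: ${legal_brief:.5f}  flat: ${FLAT_PRICE:.5f}")
```

With these made-up numbers, the short greeting is cheaper under token billing and the long brief is cheaper under the flat rate. Where the break-even point falls depends entirely on the real prices and on your typical context size.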

Even under flat pricing, smaller payloads improve latency and reduce noise. Under token-based pricing, every token you remove is money saved, so prompt compression pays off directly. Trim obsolete metadata, collapse redundant instructions, and evict stale conversation turns.
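As one possible reading of "collapse redundant instructions", the sketch below normalizes whitespace and drops exact duplicate lines from an instruction block. The function name `compact_prompt` is hypothetical, and the approach is deliberately naive: it assumes the text is a list of standalone instructions, so it should not be applied to code or data where repeated lines are meaningful.

```python
import re


def compact_prompt(text):
    """Collapse whitespace runs and drop duplicate lines, keeping first occurrences."""
    seen = set()
    kept = []
    for line in text.splitlines():
        normalized = re.sub(r"\s+", " ", line).strip()
        if not normalized:
            continue  # Skip blank lines entirely
        key = normalized.lower()  # Treat case-only differences as duplicates
        if key in seen:
            continue
        seen.add(key)
        kept.append(normalized)
    return "\n".join(kept)
```

Calling `compact_prompt` on a system prompt assembled from several templates removes instructions that were pasted in more than once, while leaving the order of the remaining lines unchanged.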

The following helper keeps the last k user-assistant pairs and replaces older history with a single system message. The message shown is only a placeholder; in production you would fill it with a real summary of the dropped turns. Because Oxlo.ai is fully OpenAI SDK compatible, you can drop this into an existing client without changing your transport code.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.getenv("OXLO_API_KEY")
)

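def trim_history(messages, keep_pairs=3):
    """Retain recent turns; replace older ones with a placeholder summary."""
    # Nothing to trim: at most one system message plus keep_pairs exchanges.
    if len(messages) <= (keep_pairs * 2) + 1:
        return messages
    system_msg = messages[0] if messages[0]["role"] == "system" else None
    # Keep the last keep_pairs exchanges (guard against the [-0:] slice).
    recent = messages[-(keep_pairs * 2):] if keep_pairs > 0 else []
    # Placeholder only: swap in a real summary of the dropped turns.
    summary = {"role": "system", "content": "Earlier conversation summarized."}
    out = [system_msg, summary] if system_msg else [summary]
    out.extend(recent)
    return out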

messages = [
    {"role": "system", "content": "You are a precise coding assistant."},
    {"role": "user", "content": "Refactor this 500-line module."},
    {"role": "assistant", "content": "..."},
]

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=trim_history(messages, keep_pairs=2)
)

Not every query needs your largest model. A fast classifier or even a cheap heuristic can route straightforward requests to smaller weights and reserve heavy reasoning for complex tasks. Oxlo.ai hosts more than 45 models across seven categories, all behind the same flat-request pricing. That means routing does not force you into a maze of token-rate tiers.
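A "cheap heuristic" router can be as simple as a length check plus a keyword scan. The sketch below is one such heuristic, not a recommended policy: the hint list, the 400-character threshold, and the function name `pick_model` are all assumptions made for illustration, and the default model names are simply the ones used elsewhere in this article.

```python
# Hypothetical keywords that suggest a request needs heavier reasoning.
REASONING_HINTS = ("prove", "step by step", "debug", "refactor", "analyze", "trade-off")


def pick_model(prompt, fast_model="qwen3-32b", strong_model="deepseek-r1-671b",
               max_fast_chars=400):
    """Route long or reasoning-heavy prompts to the strong model, the rest to the fast one."""
    if len(prompt) > max_fast_chars:
        return strong_model
    lowered = prompt.lower()
    if any(hint in lowered for hint in REASONING_HINTS):
        return strong_model
    return fast_model
```

The returned string can be passed straight to the `model` argument of a chat completion call. A heuristic like this is worth validating against logged traffic before relying on it, since keyword matches are easy to fool.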

Here is a minimal cascade that tries a fast model first and escalates only if the answer looks incomplete:

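def cascaded_chat(user_prompt, fast_model="qwen3-32b", strong_model="deepseek-r1-671b"):
    """Try the fast model first; escalate if the reply looks incomplete."""
    fast = client.chat.completions.create(
        model=fast_model,
        messages=[{"role": "user", "content": user_prompt}],
        max_tokens=256
    )
    # content can be None (for example, an empty completion), so fall back to "".
    text = fast.choices[0].message.content or ""

    # Crude incompleteness check: an explicit "I don't know" or a very short reply.
    if "i don't know" in text.lower() or len(text.strip()) < 20:
        return client.chat.completions.create(
            model=strong_model,
            messages=[{"role": "user", "content": user_prompt}]
        )
    return fast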

Unconstrained outputs waste tokens on rambling preambles. Use JSON mode, constrained grammars, or stop sequences to force the model to quit once the answer is complete. This improves reliability and, on token-based platforms, directly lowers cost.

response = client.chat.completions.create(
    model="qwen3-32b",
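    # OpenAI's JSON mode requires the word "JSON" to appear in the messages,
    # so the prompt states the output format explicitly.
    messages=[{"role": "user", "content": "Extract the sender and date from this email as a JSON object."}],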
    response_format={"type": "json_object"},
    stop=["\n\n"],          # Halt on first blank line after JSON
    max_tokens=128          # Hard ceiling
)

On Oxlo.ai, you are not billed extra if the model runs to the max_tokens limit, but tighter constraints still yield lower latency and cleaner client code.
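A hard max_tokens ceiling can cut a reply off mid-object, so it is worth parsing constrained output defensively instead of assuming it is valid. The sketch below shows one way to do that with the standard library. The helper name `parse_json_reply` and the required keys `sender` and `date` are assumptions based on the extraction prompt above, not part of any SDK.

```python
import json


def parse_json_reply(text, required_keys=("sender", "date")):
    """Return the parsed object if it is a dict with the required keys, else None."""
    if not text:
        return None
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # Truncated or malformed output
    if not isinstance(data, dict):
        return None
    if any(key not in data for key in required_keys):
        return None
    return data
```

A `None` result is a natural trigger for a retry with a larger max_tokens value or for escalation to a stronger model.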

Some token-based providers offer prefix caching, which discounts repeated system prompts and static few-shot examples. That helps, yet any new user message, retrieved document, or agent scratchpad still adds fresh input tokens to the meter.

With Oxlo.ai’s flat per-request pricing, you can pass full system context every time without watching variable costs climb. That simplifies agent frameworks that rebuild the entire prompt state on each turn. You still benefit from application-level caching for latency, but your budget is not hostage to context-window inflation.
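Application-level caching can be sketched as a dictionary keyed by a hash of the model name and the full message list. The class below is a minimal in-memory illustration under that assumption; `ResponseCache` and its `call` parameter are hypothetical names, and `call` stands in for whatever function actually sends the request.

```python
import hashlib
import json


class ResponseCache:
    """In-memory cache for identical (model, messages) requests."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(model, messages):
        # sort_keys makes the hash independent of dict key order.
        payload = json.dumps({"model": model, "messages": messages}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model, messages, call):
        """Return the cached result, invoking call(model, messages) only on a miss."""
        k = self.key(model, messages)
        if k not in self._store:
            self._store[k] = call(model, messages)
        return self._store[k]
```

Exact-match caching only helps when the same prompt recurs verbatim, and it returns the same answer every time, so it suits deterministic tasks such as classification or extraction better than open-ended chat.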

Proprietary endpoints often carry hidden premiums. Flat-rate platforms like Oxlo.ai let you experiment with open-source and proprietary models under identical billing logic. The catalog includes general-purpose flagships such as Llama 3.3 70B, reasoning specialists like DeepSeek R1 671B and Kimi K2.6, and efficient coding models like Qwen 3 Coder 30B.

Because you pay per request, not per parameter, A/B testing a 70B open model against a closed alternative is a straight quality comparison. Token math does not distort the result, and you can swap endpoints by changing a single string in your OpenAI SDK client.
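A quality-only A/B comparison can be sketched as a loop that sends the same prompts to two models and averages a score for each. Everything here is an assumption for illustration: `ab_compare` is a hypothetical helper, `ask` is a caller-supplied function that returns a model's answer as text, and `score` is whatever quality metric you trust (an exact-match check, a rubric, or a human rating).

```python
def ab_compare(prompts, ask, score, model_a, model_b):
    """Average a caller-supplied quality score for two models over the same prompts."""
    totals = {model_a: 0.0, model_b: 0.0}
    for prompt in prompts:
        for model in (model_a, model_b):
            answer = ask(model, prompt)           # e.g. wraps a chat completion call
            totals[model] += score(prompt, answer)
    n = max(len(prompts), 1)                      # Avoid division by zero on empty input
    return {model: total / n for model, total in totals.items()}
```

In practice `ask` would wrap the SDK call and differ between runs only in the model string, which keeps the comparison about answer quality and nothing else.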

Grouping non-urgent work lets you saturate throughput and amortize connection overhead. Oxlo.ai has no cold starts on popular models, so batch pipelines do not suffer hidden warm-up latency.

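import asyncio
import os
from openai import AsyncOpenAI

# The synchronous client would block the event loop and serialize the calls,
# so the batch helper uses the SDK's async client instead.
async_client = AsyncOpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.getenv("OXLO_API_KEY")
)

async def batch_chat(prompts, model="llama-3.3-70b"):
    sem = asyncio.Semaphore(10)  # Cap concurrent in-flight requests

    async def fetch(p):
        async with sem:
            return await async_client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": p}]
            )

    return await asyncio.gather(*(fetch(p) for p in prompts))

results = asyncio.run(batch_chat(["Summarize A", "Summarize B", "Summarize C"]))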

Cost optimization works best when it is embedded in design, not patched in later. Token-based pricing forces you to micro-manage every delimiter, every turn of conversation, and every retrieved paragraph. That is feasible, but it consumes engineering time that could go into product features.

Oxlo.ai’s request-based model inverts the incentive. Because the price is flat per API call, you can prioritize accuracy, user experience, and clean context management over token counting. For long-context workloads and agentic systems, that architectural shift often delivers larger savings than incremental prompt tweaks alone. Review the Oxlo.ai pricing page to see which plan fits your request volume, and start optimizing at the infrastructure layer.

source & further reading

dev.to: original article