cd /news/generative-ai/how-to-run-enterprise-genai-like-a-p… · home topics generative-ai article
[ARTICLE · art-20189] src=infoworld.com pub= topic=generative-ai verified=true sentiment=· neutral

How to run enterprise GenAI like a production service

Enterprise generative AI deployments require the same operational discipline as other user-facing production services, with the model functioning as one component in a pipeline handling identity, policy, retrieval, inference, and logging. Teams must define explicit service-level contracts including p95 latency, availability, error budgets, and per-request cost envelopes, as these constraints directly influence design choices for retrieval, model tier, and routing. Production success depends on continuous evaluation through representative query sets, separate measurement of retrieval and generation quality, structured observability with per-request traces, and cost-aware routing that caches results and selects the lightest model meeting the service contract.

read4 min publishedJun 1, 2026

Enterprise GenAI (generative AI) deployments succeed when teams run them with the same discipline they apply to other user-facing services. The model sits in the middle of a pipeline that handles identity, policy, retrieval, inference, and logging. Each stage affects quality, latency, cost, and risk. A pilot can hide these dependencies. Production traffic exposes them.

Familiar sequences are seen across large organizations. A small group proves a use case in days. Leadership asks for broad rollout. Usage climbs and the system behaves differently. Response times vary across the day. The assistant answers confidently with incomplete context. Cloud spend drifts upward without a clear owner. Teams respond by stacking more controls and more prompt variants. Progress slows.

Scale becomes manageable when GenAI is treated as a service with explicit constraints and measurable outcomes. It’s best to rely on a set of production disciplines to get there.

UST

Write the contract for the experience you plan to operate. Put numbers on it. Include p95 latency, availability, error budget, and expected behavior under load. Add a cost envelope per request. Capture policy requirements for data access, citation, and tool use.

This step changes design choices quickly. A team with a 2.5-second p95 target makes different retrieval and routing choices than a team that can tolerate 10 seconds. A team with a three-cent per answer budget makes different model tier choices than a team with a fifty-cent budget.

Most enterprise assistants rely on retrieval-augmented generation. Retrieval drives answer quality. Retrieval also drives unit economics through context size, re-ranking, and repeat work. I spend more time on retrieval quality than on prompt wording.

A production retrieval layer has four properties.

Continuous evaluation keeps the system stable as it evolves. A practical harness starts small.

Create a representative set of queries based on real user logs. Include ambiguous questions, known failure cases, and requests that require refusal.

Attach expectations. Some questions have a clear ground truth. Others can be expressed as constraints such as required citations, prohibited claims, or required policy language.

Measure retrieval and generation separately. Track recall and precision for retrieval with a labeled set of relevant documents. And track answer quality with automated checks plus targeted human review on high-risk paths.

Run the suite on every material change. That includes prompt updates, retriever tweaks, new data sources, and model version updates.

Teams often log only the prompt and the response. Production debugging needs more structure. I want a trace per request that includes the retrieval set, re-ranking scores, model routing decision, tool calls, policy decisions, and final output. I want a stable request ID that ties into incident workflows.

Observability should include outcome signals. A thumbs-up metric helps. A downstream outcome metric helps more. In support settings, track ticket resolution time. In engineering settings, track review cycle time.

UST

Token costs become material at scale. Cost control works best when it sits in the request path.

Use routing rules that start with cache and narrow context. Then choose the lightest model that meets the contract for the request. Reserve larger models for complex queries and tool-heavy flows. Keep a fallback that returns sources, asks for clarification, or hands off to a human queue.

A simplified routing sketch looks like this:

def answer(query, user):
    policy = enforce_policy(query, user)
    hit = cache.get(policy.cache_key)
    if hit and hit.fresh:
        return hit.value

    passages = retrieve(policy.sanitized_query, user_acl=policy.acl)
    top = rerank(passages)[:6]
    model = select_model(top, latency_ms=2500, cost_cents=3)
    draft = model.generate(build_prompt(top, policy))
    checked = verify(draft, top, policy)
    return finalize(checked, fallback="sources_only")

Each line should map to an owned component with metrics and an on-call plan.

GenAI systems degrade in many ways. The vector store slows down. The model endpoint rate limits. A data source disappears. A tool returns partial results. Production readiness depends on predictable behavior during these moments.

Teams should design a small set of degradation modes and test them. Common modes include sources-only answers, reduced context, smaller models, and explicit handoff. The experience stays coherent when the system signals what it can do and logs why it changed behavior.

Use a short checklist before a broad rollout.

Enterprise GenAI becomes dependable when the surrounding system is engineered for operation. The work looks familiar to anyone who has run services at scale. It includes contracts, measurements, routing, and ownership. Teams that invest in these disciplines can change the system without guessing the impact.

New Tech Forum** provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to *** doug_dineley@foundryco.com.*

── more in #generative-ai 4 stories · sorted by recency
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/how-to-run-enterpris…] indexed:0 read:4min 2026-06-01 ·