How to run enterprise GenAI like a production service

wpnews.pro

Enterprise GenAI (generative AI) deployments succeed when teams run them with the same discipline they apply to other user-facing services. The model sits in the middle of a pipeline that handles identity, policy, retrieval, inference, and logging. Each stage affects quality, latency, cost, and risk. A pilot can hide these dependencies. Production traffic exposes them.

Familiar sequences are seen across large organizations. A small group proves a use case in days. Leadership asks for broad rollout. Usage climbs and the system behaves differently. Response times vary across the day. The assistant answers confidently with incomplete context. Cloud spend drifts upward without a clear owner. Teams respond by stacking more controls and more prompt variants. Progress slows.

Scale becomes manageable when GenAI is treated as a service with explicit constraints and measurable outcomes. It’s best to rely on a set of production disciplines to get there.

UST

Write the contract for the experience you plan to operate. Put numbers on it. Include p95 latency, availability, error budget, and expected behavior under load. Add a cost envelope per request. Capture policy requirements for data access, citation, and tool use.

This step changes design choices quickly. A team with a 2.5-second p95 target makes different retrieval and routing choices than a team that can tolerate 10 seconds. A team with a three-cent per answer budget makes different model tier choices than a team with a fifty-cent budget.

Most enterprise assistants rely on retrieval-augmented generation. Retrieval drives answer quality. Retrieval also drives unit economics through context size, re-ranking, and repeat work. I spend more time on retrieval quality than on prompt wording.

A production retrieval layer has four properties.

Continuous evaluation keeps the system stable as it evolves. A practical harness starts small.

Create a representative set of queries based on real user logs. Include ambiguous questions, known failure cases, and requests that require refusal.

Attach expectations. Some questions have a clear ground truth. Others can be expressed as constraints such as required citations, prohibited claims, or required policy language.

Measure retrieval and generation separately. Track recall and precision for retrieval with a labeled set of relevant documents. And track answer quality with automated checks plus targeted human review on high-risk paths.

Run the suite on every material change. That includes prompt updates, retriever tweaks, new data sources, and model version updates.

Teams often log only the prompt and the response. Production debugging needs more structure. I want a trace per request that includes the retrieval set, re-ranking scores, model routing decision, tool calls, policy decisions, and final output. I want a stable request ID that ties into incident workflows.

Observability should include outcome signals. A thumbs-up metric helps. A downstream outcome metric helps more. In support settings, track ticket resolution time. In engineering settings, track review cycle time.

UST

Token costs become material at scale. Cost control works best when it sits in the request path.

Use routing rules that start with cache and narrow context. Then choose the lightest model that meets the contract for the request. Reserve larger models for complex queries and tool-heavy flows. Keep a fallback that returns sources, asks for clarification, or hands off to a human queue.

A simplified routing sketch looks like this:

def answer(query, user):
    policy = enforce_policy(query, user)
    hit = cache.get(policy.cache_key)
    if hit and hit.fresh:
        return hit.value

    passages = retrieve(policy.sanitized_query, user_acl=policy.acl)
    top = rerank(passages)[:6]
    model = select_model(top, latency_ms=2500, cost_cents=3)
    draft = model.generate(build_prompt(top, policy))
    checked = verify(draft, top, policy)
    return finalize(checked, fallback="sources_only")

Each line should map to an owned component with metrics and an on-call plan.

GenAI systems degrade in many ways. The vector store slows down. The model endpoint rate limits. A data source disappears. A tool returns partial results. Production readiness depends on predictable behavior during these moments.

Teams should design a small set of degradation modes and test them. Common modes include sources-only answers, reduced context, smaller models, and explicit handoff. The experience stays coherent when the system signals what it can do and logs why it changed behavior.

Use a short checklist before a broad rollout.

Enterprise GenAI becomes dependable when the surrounding system is engineered for operation. The work looks familiar to anyone who has run services at scale. It includes contracts, measurements, routing, and ownership. Teams that invest in these disciplines can change the system without guessing the impact.

—

New Tech Forum** provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to *** doug_dineley@foundryco.com.*

source & further reading

infoworld.com — original article An explosion of software is coming Snowflake’s Horizon Context aims to give AI agents a common understanding of the business Workday launches Agent Passport to test and monitor AI agents in the enterprise

How to run enterprise GenAI like a production service

Run your AI side-project on zahid.host