Agentic Inference Deployment: From Prose Skills to Deployed Endpoints

NVIDIA researchers developed an agentic system for deploying machine learning models to ephemeral SageMaker endpoints, generating runtime code at deployment time from prose artifacts rather than reusable scripts. The approach addresses the complexity of GPU-bound inference deployments, including LLM-specific optimizations like KV cache management and quantization, by using a language-model orchestrator to produce concrete Python and shell commands for each deployment.

This article describes an agentic approach to deploying machine learning models to ephemeral SageMaker endpoints using a multi-agent system in which all runtime code is generated at deployment time rather than committed as reusable scripts. The approach relies on prose artifacts (rules, prompts, reference layouts, code-generation blueprints, and standard operating procedures) and uses a language-model orchestrator to produce the concrete Python and shell commands for each deployment. We discuss the problem the design addresses, the architectural choices that support it, the prompt and skill structure, and the conditions under which the pattern generalizes beyond NVIDIA GPUs. We extend the analysis to cover Large Language Model (LLM) deployment patterns for both SageMaker and Amazon Bedrock, examining how the same agentic approach handles LLM-specific optimizations like KV cache management, quantization strategies, and multi-modal inference patterns. We also catalogue realistic limitations and extensions that fit within the same pattern.

Development-time deployment of GPU-bound inference models to a managed endpoint is repetitive but not uniform. Every deployment carries a different combination of:

For Large Language Models, additional complexity emerges:

Most of these variables interact.
A Triton ensemble that uses torch.compile during initialize requires both a sufficiently long stub timeout and a sufficiently long container-startup health-check timeout, both of which must be expressed through environment variables the container understands. An ensemble model that is missing a version directory with a placeholder file will not survive tar extraction, yielding an opaque “No model version was found” error. A DLC tag that looks compatible on paper may require a GPU driver newer than what the chosen instance family ships with. For LLMs, a quantized model that fits in GPU memory during loading may exceed memory limits once KV cache allocation begins, and a vLLM configuration optimized for throughput may time out during the first inference because of CUDA graph compilation overhead.

Engineers therefore tend to accumulate personal shell scripts that deploy exactly one model on exactly one instance type, and these scripts drift quickly. When a new framework, model, or container tag is introduced, the scripts must be duplicated and edited, and the invariants that should be enforced across all deployments (resource tagging, teardown, timeouts, tarball structure) are re-derived by whoever is doing the edit.

The agentic approach studied here attempts to solve the repetition without freezing the variability. Rather than shipping a parameterized deployment script, it uses rules and references that describe what a deployment script should do, and an agent writes the script each time with the user’s values substituted in.

A SageMaker inference endpoint is a managed process that loads a model from an S3 tarball into a container image and exposes an HTTPS inference API. The user supplies:

SageMaker creates three linked resources (a model, an endpoint configuration, and an endpoint) and polls until the endpoint reaches InService. Teardown requires deletion of all three.

Triton is sensitive to directory conventions.
Each model directory at the tarball root must contain a config.pbtxt and at least one numbered version directory (e.g. 1/). The version directory must be non-empty after tar extraction; a placeholder file such as 1/.keep is used because empty directories do not always survive packaging. An ensemble model references its component models by name, and the top-level ensemble name must be passed to the container through SAGEMAKER_TRITON_DEFAULT_MODEL_NAME.

Python-backend models carry additional constraints. Long-running initialisation code, most commonly torch.compile over multiple input shapes, can exceed Triton’s default 30-second stub timeout, at which point the backend process is terminated. The correct remedy is to extend the stub timeout through SAGEMAKER_TRITON_ADDITIONAL_ARGS=--backend-config=python,stub-timeout-seconds=; the alternative SAGEMAKER_TRITON_BACKEND_CONFIG variable is not honoured by the SageMaker Triton DLC.

Large Language Models introduce additional serving complexity beyond traditional ML models:

vLLM: Optimized for high-throughput LLM serving with PagedAttention for efficient KV cache management. Supports continuous batching, speculative decoding, and multi-LoRA serving. Requires careful tuning of the --max-model-len, --gpu-memory-utilization, and --max-num-seqs parameters.

TensorRT-LLM: NVIDIA’s optimized inference engine with support for INT8/INT4 quantization, multi-GPU tensor parallelism, and CUDA graph optimization. Requires model compilation to TensorRT engines, which is time-consuming but yields superior throughput.

Text Generation Inference (TGI): HuggingFace’s production-ready serving solution with built-in quantization (GPTQ, AWQ, EETQ), flash attention, and streaming support. Simpler configuration but potentially lower peak throughput than vLLM or TensorRT-LLM.

Transformers + Accelerate: Direct PyTorch serving using HuggingFace transformers with device_map="auto" for multi-GPU sharding.
Most flexible but requires manual optimization for production workloads.

Amazon Bedrock provides managed LLM inference without infrastructure management, supporting both foundation models (Claude, Llama, Mistral) and custom models via Provisioned Throughput or On-Demand inference.

Provisioned Throughput: Reserved capacity with guaranteed performance, billed hourly regardless of usage. Suitable for predictable workloads requiring consistent latency.

On-Demand: Pay-per-token pricing with variable latency based on service load. Cost-effective for development and variable workloads.

Custom Model Import: Supports importing fine-tuned models in specific formats (safetensors for Llama, specific checkpoint structures for other architectures), with validation requirements that vary by base model family.

Every NVIDIA CUDA release has a minimum driver version. SageMaker instances ship with a fixed driver per family that cannot be upgraded by the user. A DLC built with CUDA 12.2 requires a 535-series driver or newer; one built with CUDA 12.4 requires 550 or newer. As a result:

Users who pick an instance by GPU memory alone, without checking the driver, will see deployment failures that look like CUDA errors but are really provisioning mismatches.

This approach inherits a pattern observed in other agentic workflow systems: ship prose, generate code. The implementation contains no Python deployment scripts. It contains skill documents (SKILL.md), framework references, system prompts, agent specifications, SOPs, and test scenarios. A language-model orchestrator reads those files and writes the Python fresh for each run, with the user’s account, bucket, model name, paths, and environment values already substituted into the source. This is the same pattern used for PySpark job generation in some data-pipeline systems: code generation is preferred over parameterisation when every instance of the task has a slightly different shape.
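As an illustration of the kind of small, single-purpose check such a generated script might contain, the sketch below encodes the driver constraint described above. Only the two CUDA-to-driver minimums stated in the text are included; the function names and the example driver strings are hypothetical, and the driver a given instance family actually ships with would have to be looked up rather than assumed.

```python
# Minimum NVIDIA driver series per CUDA release, limited to the two
# pairs stated in the text. Any other CUDA version is treated as unknown.
MIN_DRIVER_FOR_CUDA = {"12.2": 535, "12.4": 550}


def driver_major(driver_version: str) -> int:
    """Return the leading series number of a driver string such as '535.104.05'."""
    return int(driver_version.split(".")[0])


def is_compatible(cuda_version: str, instance_driver: str) -> bool:
    """Check whether a container's CUDA version can run on the instance driver.

    Hypothetical helper: `instance_driver` is whatever driver version the
    chosen instance family ships with, supplied by the caller.
    """
    if cuda_version not in MIN_DRIVER_FOR_CUDA:
        raise ValueError(f"no minimum driver recorded for CUDA {cuda_version}")
    return driver_major(instance_driver) >= MIN_DRIVER_FOR_CUDA[cuda_version]


if __name__ == "__main__":
    # Example driver strings are illustrative, not tied to any instance family.
    print(is_compatible("12.2", "535.104.05"))
    print(is_compatible("12.4", "535.104.05"))
```

Raising on an unknown CUDA version, rather than returning a default, keeps the check from silently passing a container it has no data for.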
Four principles from the broader inference-engineering literature [1, 2] inform every decision the agent makes downstream. They appear here so later sections can refer to them without re-deriving the framing.

Arithmetic intensity and the roofline model. Every GPU has an ops:byte ratio: its peak compute speed divided by its memory bandwidth. An H100 in FP16 sits at roughly 295 (989 TFLOPS / 3.35 TB/s). An algorithm whose arithmetic intensity (operations per byte of memory traffic) is below that ratio is memory-bound; above it, compute-bound. Standard attention during decode has arithmetic intensity around 62, which is firmly memory-bound. This is the frame for every bandwidth and instance-selection decision in §6.2.

Five independent optimization levers. Inference performance is shaped by five techniques that can be applied independently and combined: quantization (§6.3), speculative decoding (§6.4), caching (§6.1, §6.2.1), model parallelism (§10), and disaggregation (§7). Each touches a different bottleneck; they can compose, and they can interfere (e.g., higher batch size starves speculation of compute).

More constraints, better performance. The more the deployment workload is constrained (model architecture, sequence length distribution, latency SLO, traffic shape), the more aggressively each lever can be tuned. This is also the agent’s value proposition: it extracts those constraints from the user via Q&A and resolves them at deployment time, where a static script cannot.

Three layers of the inference stack. Runtime (single-instance performance), infrastructure (scaling across replicas, regions, clouds), and tooling (the abstraction surface). The agent operates at the runtime and infrastructure layers; the user retains control via the prose-driven Q&A as the tooling layer.

The following principles determine most of the approach’s structure.

No committed runtime scripts. The implementation does not contain a .py file that will run on the user’s machine.
Every Python script that executes during deployment is produced at runtime by an agent from a code-generation blueprint. This avoids the usual drift between a generic deployment tool and the per-deployment values it needs to carry.

Blueprints, not templates. A blueprint in this approach is a prose specification of a script’s behaviour: which boto3 calls to make, in which order, with which arguments, which errors to catch, and what to print to stdout versus stderr. The agent reads the blueprint and writes code; it does not render a template. This keeps the authored artifact focused on what the script must do and leaves the surface syntax flexible.

RFC 2119 constraints. Every skill expresses its obligations as MUST / MUST NOT / SHOULD / SHOULD NOT / MAY constraints in tabular form. The orchestrator’s system prompt instructs it to treat these as hard rules rather than advisory guidance. This discipline is the principal mechanism for making agent behaviour deterministic enough to audit.

Package-aware vs. generic execution. When the user points the agent at an existing ML package that already has a model converter and integration tests, the agent discovers them and reuses them; it does not generate a parallel packaging or testing script. When no such package exists, it falls back to generic framework references. The preference is stated explicitly: the agent MUST NOT generate a generic tarball script when the package has a converter, and MUST NOT generate a generic smoke test when the package has integration tests.

The system comprises one orchestrator and specialized leaf agents for different deployment targets.
model-deployer (orchestrator)
├── model-packager: generates packaging scripts
├── deployment-validator: generates and runs validation
├── teardown-sweeper: generates cleanup scripts
├── llm-optimizer: LLM-specific optimization and configuration
├── bedrock-deployer: Bedrock custom model deployment
└── kv-cache-tuner: KV cache and memory optimization

The orchestrator runs a higher-capacity reasoning model and has access to all skills, the shared deployment-rules context, and standard operating procedures for end-to-end deploy and teardown workflows. It is the only agent that holds conversational context with the user and the only agent permitted to delegate to sub-agents. Its allowed tools are read, write, shell, subagent, an introspection tool, and a skill-loader tool; its shell allow-list is narrow (python3, aws, docker, tar, and a small number of read-only utilities).

The leaf agents run smaller models with specialized responsibilities:

A full deployment proceeds as follows:

The approach organizes skills into three categories: shared infrastructure, traditional ML models, and LLM-specific capabilities.

These skills apply to all model types:

The kv-cache-optimizer skill encodes memory management strategies critical for LLM performance:

The agent MUST generate optimization parameters based on model architecture and target hardware:

1. Calculate base model memory: parameter count × precision bytes × safety factor (1.2)
2. Estimate KV cache memory: max seq len × batch size × num layers × hidden size × 2 × precision bytes
3. Determine optimal allocation: total GPU memory - base model memory - overhead (2 GB)
4. Configure serving parameters:
   - vLLM: --gpu-memory-utilization=0.85, --max-model-len=