# Agentic Inference Deployment: From Prose Skills to Deployed Endpoints

> Source: <https://pub.towardsai.net/agentic-inference-deployment-from-prose-skills-to-deployed-endpoints-1efcfdd47807?source=rss----98111c9905da---4>
> Published: 2026-06-15 23:01:01+00:00

This article describes an agentic approach to deploying machine learning models to ephemeral SageMaker endpoints using a multi-agent system in which all runtime code is generated at deployment time rather than committed as reusable scripts. The approach relies on prose artifacts — rules, prompts, reference layouts, code-generation blueprints, and standard operating procedures — and uses a language-model orchestrator to produce the concrete Python and shell commands for each deployment. We discuss the problem the design addresses, the architectural choices that support it, the prompt and skill structure, and the conditions under which the pattern generalizes beyond NVIDIA GPUs. We extend the analysis to cover Large Language Model (LLM) deployment patterns for both SageMaker and Amazon Bedrock, examining how the same agentic approach handles LLM-specific optimizations like KV cache management, quantization strategies, and multi-modal inference patterns. We also catalogue realistic limitations and extensions that fit within the same pattern.

Development-time deployment of GPU-bound inference models to a managed endpoint is repetitive but not uniform. Every deployment carries a different combination of:

For Large Language Models, additional complexity emerges:

Most of these variables interact. A Triton ensemble that uses torch.compile during initialize() requires both a sufficiently long stub timeout and a sufficiently long container-startup health-check timeout, both of which must be expressed through environment variables the container understands. An ensemble model that is missing a version directory with a placeholder file will not survive tar extraction, yielding an opaque “No model version was found” error. A DLC tag that looks compatible on paper may require a GPU driver newer than what the chosen instance family ships with. For LLMs, a quantized model that fits in GPU memory during loading may exceed memory limits once KV cache allocation begins, and a vLLM configuration optimized for throughput may time out during the first inference because of CUDA graph compilation overhead.

Engineers therefore tend to accumulate personal shell scripts that deploy exactly one model on exactly one instance type, and these scripts drift quickly. When a new framework, model, or container tag is introduced, the scripts must be duplicated and edited, and the invariants that should be enforced across all deployments — resource tagging, teardown, timeouts, tarball structure — are re-derived by whoever is doing the edit.

The agentic approach studied here attempts to solve the repetition without freezing the variability. Rather than shipping a parameterized deployment script, it uses rules and references that describe what a deployment script should do, and an agent writes the script each time with the user’s values substituted in.

A SageMaker inference endpoint is a managed process that loads a model from an S3 tarball into a container image and exposes an HTTPS inference API. The user supplies:

SageMaker creates three linked resources — a model, an endpoint configuration, and an endpoint — and polls until the endpoint reaches InService. Teardown requires deletion of all three.
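The three-resource lifecycle described above can be sketched with boto3’s SageMaker client. This is a minimal sketch under stated assumptions, not the script the agent generates: the function names `deploy_endpoint` and `teardown_endpoint`, the choice to reuse one name for all three resources, and the `AllTraffic` variant name are illustrative. The client is passed in (for example `boto3.client("sagemaker")`), so the block itself imports nothing.

```python
def deploy_endpoint(sm, name, image_uri, model_data_url, role_arn,
                    instance_type, env=None, tags=None):
    """Create model -> endpoint config -> endpoint, then block until InService.

    `sm` is a boto3 SageMaker client. Reusing `name` for all three
    resources is an illustrative convention, not a SageMaker requirement.
    """
    tags = tags or []
    sm.create_model(
        ModelName=name,
        PrimaryContainer={
            "Image": image_uri,
            "ModelDataUrl": model_data_url,
            "Environment": env or {},
        },
        ExecutionRoleArn=role_arn,
        Tags=tags,
    )
    sm.create_endpoint_config(
        EndpointConfigName=name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
        Tags=tags,
    )
    sm.create_endpoint(EndpointName=name, EndpointConfigName=name, Tags=tags)
    # The built-in waiter polls DescribeEndpoint until the status is InService.
    sm.get_waiter("endpoint_in_service").wait(EndpointName=name)


def teardown_endpoint(sm, name):
    """Delete all three linked resources, endpoint first."""
    sm.delete_endpoint(EndpointName=name)
    sm.delete_endpoint_config(EndpointConfigName=name)
    sm.delete_model(ModelName=name)
```

Passing the client in rather than constructing it inside the functions keeps the sketch testable with a stub and leaves region and credential handling to the caller.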

Triton is sensitive to directory conventions. Each model directory at the tarball root must contain a config.pbtxt and at least one numbered version directory (e.g. 1/). The version directory must be non-empty after tar extraction; a placeholder file such as 1/.keep is used because empty directories do not always survive packaging. An ensemble model references its component models by name, and the top-level ensemble name must be passed to the container through SAGEMAKER_TRITON_DEFAULT_MODEL_NAME.
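The layout rules above lend themselves to a small packaging and validation helper. The following is a hedged sketch using only the standard library; the function names (`scaffold_model_dir`, `package_repo`, `validate_tarball`) are hypothetical, and the checks cover only the conventions stated in this section (config.pbtxt present, non-empty numbered version directory), not everything Triton validates.

```python
import tarfile
from pathlib import Path


def scaffold_model_dir(repo_root, model_name, config_text, version="1"):
    """Create <repo_root>/<model_name>/config.pbtxt plus a non-empty version dir."""
    model_dir = Path(repo_root) / model_name
    version_dir = model_dir / version
    version_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(config_text)
    # Placeholder file so the version directory is never empty after extraction.
    if not any(version_dir.iterdir()):
        (version_dir / ".keep").touch()
    return model_dir


def package_repo(repo_root, tarball_path):
    """Tar each model directory so that it sits at the tarball root."""
    with tarfile.open(str(tarball_path), "w:gz") as tar:
        for model_dir in sorted(Path(repo_root).iterdir()):
            if model_dir.is_dir():
                tar.add(str(model_dir), arcname=model_dir.name)


def validate_tarball(tarball_path):
    """Return a list of layout problems; an empty list means none were found."""
    with tarfile.open(str(tarball_path), "r:gz") as tar:
        members = tar.getmembers()
    names = [m.name for m in members]
    files = [m.name for m in members if m.isfile()]
    problems = []
    for model in sorted({n.split("/")[0] for n in names}):
        if f"{model}/config.pbtxt" not in files:
            problems.append(f"{model}: missing config.pbtxt")
        versioned = [
            f for f in files
            if f.startswith(model + "/") and f.count("/") >= 2
            and f.split("/")[1].isdigit()
        ]
        if not versioned:
            problems.append(f"{model}: no non-empty numbered version directory")
    return problems
```

Running `validate_tarball` on the packaged artifact before upload turns the “No model version was found” failure into a local check.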

Python-backend models carry additional constraints. Long-running initialisation code (most commonly torch.compile over multiple input shapes) can exceed Triton’s default 30-second stub timeout, at which point the backend process is terminated. The correct remedy is to extend the stub timeout through SAGEMAKER_TRITON_ADDITIONAL_ARGS=--backend-config=python,stub-timeout-seconds= followed by the timeout value; the alternative SAGEMAKER_TRITON_BACKEND_CONFIG variable is not honoured by the SageMaker Triton DLC.
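As a small illustration of the two environment variables named in this section, the container environment can be assembled as below. The helper name `triton_container_env` is hypothetical; the variable names and the `--backend-config` argument form are the ones given in the text.

```python
def triton_container_env(default_model_name, stub_timeout_seconds):
    """Build the Environment mapping for the SageMaker Triton container."""
    if stub_timeout_seconds <= 0:
        raise ValueError("stub_timeout_seconds must be positive")
    return {
        # Top-level model (for an ensemble, the ensemble name).
        "SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": default_model_name,
        # Extends the Python-backend stub timeout beyond the 30-second default.
        "SAGEMAKER_TRITON_ADDITIONAL_ARGS": (
            "--backend-config=python,"
            f"stub-timeout-seconds={int(stub_timeout_seconds)}"
        ),
    }


print(triton_container_env("ensemble", 600))
```

The returned dictionary has string values only, which is what a container `Environment` mapping expects.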

Large Language Models introduce additional serving complexity beyond traditional ML models:

**vLLM**: Optimized for high-throughput LLM serving with PagedAttention for efficient KV cache management. Supports continuous batching, speculative decoding, and multi-LoRA serving. Requires careful tuning of the --max-model-len, --gpu-memory-utilization, and --max-num-seqs parameters.

**TensorRT-LLM**: NVIDIA’s optimized inference engine with support for INT8/INT4 quantization, multi-GPU tensor parallelism, and CUDA graph optimization. Requires model compilation to TensorRT engines, which is time-consuming but yields superior throughput.

**Text Generation Inference (TGI)**: HuggingFace’s production-ready serving solution with built-in quantization (GPTQ, AWQ, EETQ), flash attention, and streaming support. Simpler configuration but potentially lower peak throughput than vLLM or TensorRT-LLM.

**Transformers + Accelerate**: Direct PyTorch serving using HuggingFace transformers with device_map="auto" for multi-GPU sharding. Most flexible but requires manual optimization for production workloads.

Amazon Bedrock provides managed LLM inference without infrastructure management, supporting both foundation models (Claude, Llama, Mistral) and custom models via Provisioned Throughput or On-Demand inference.

**Provisioned Throughput**: Reserved capacity with guaranteed performance, billed hourly regardless of usage. Suitable for predictable workloads requiring consistent latency.

**On-Demand**: Pay-per-token pricing with variable latency based on service load. Cost-effective for development and variable workloads.

**Custom Model Import**: Supports importing fine-tuned models in specific formats (safetensors for Llama, specific checkpoint structures for other architectures) with validation requirements that vary by base model family.

Every NVIDIA CUDA release has a minimum driver version. SageMaker instances ship with a fixed driver per family that cannot be upgraded by the user. A DLC built with CUDA 12.2 requires a 535-series driver or newer; one built with CUDA 12.4 requires 550 or newer. As a result:

Users who pick an instance by GPU memory alone — without checking the driver — will see deployment failures that look like CUDA errors but are really provisioning mismatches.
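The driver constraint can be checked before any resource is created. The sketch below encodes only the two CUDA-to-driver pairs stated above; the function name `driver_problem` is hypothetical, and the driver branch an instance family actually ships with is an input the caller must supply from a verified table, since none is given here.

```python
# Minimum NVIDIA driver branch per CUDA release; both entries come from the text above.
MIN_DRIVER_FOR_CUDA = {"12.2": 535, "12.4": 550}


def driver_problem(dlc_cuda_version, instance_driver_branch):
    """Return None when compatible, otherwise a reason to stop before deploying."""
    required = MIN_DRIVER_FOR_CUDA.get(dlc_cuda_version)
    if required is None:
        return f"CUDA {dlc_cuda_version} is not in the compatibility table; refusing to guess"
    if instance_driver_branch < required:
        return (f"CUDA {dlc_cuda_version} needs driver >= {required}, "
                f"but the instance ships {instance_driver_branch}")
    return None


# Example: a CUDA 12.4 container on an instance assumed to ship a 535-series driver.
print(driver_problem("12.4", 535))
print(driver_problem("12.2", 535))
```

Returning a message instead of raising lets a pre-flight step collect every incompatibility and report them together.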

This approach inherits a pattern observed in other agentic workflow systems: ship prose, generate code. The implementation contains no Python deployment scripts. It contains skill documents (SKILL.md), framework references, system prompts, agent specifications, SOPs, and test scenarios. A language-model orchestrator reads those files and writes the Python fresh for each run, with the user’s account, bucket, model name, paths, and environment values already substituted into the source. This is the same pattern used for PySpark job generation in some data-pipeline systems: code generation is preferred over parameterisation when every instance of the task has a slightly different shape.

Four principles from the broader inference-engineering literature [1] [2] inform every decision the agent makes downstream. They appear here so later sections can refer to them without re-deriving the framing.

**Arithmetic intensity and the roofline model.** Every GPU has an ops:byte ratio — its peak compute speed divided by its memory bandwidth. An H100 in FP16 sits at roughly 295 (989 TFLOPS / 3.35 TB/s). An algorithm whose arithmetic intensity (operations per byte of memory traffic) is below that ratio is memory-bound; above, compute-bound. Standard attention during decode has arithmetic intensity around 62 — firmly memory-bound. This is the frame for every bandwidth and instance-selection decision in §6.2.

**Five independent optimization levers.** Inference performance is shaped by five techniques that can be applied independently and combined: quantization (§6.3), speculative decoding (§6.4), caching (§6.1, §6.2.1), model parallelism (§10), and disaggregation (§7). Each touches a different bottleneck; they can compose, and they can interfere (e.g., higher batch size starves speculation of compute).

**More constraints, better performance.** The more the deployment workload is constrained — model architecture, sequence length distribution, latency SLO, traffic shape — the more aggressively each lever can be tuned. This is also the agent’s value proposition: it extracts those constraints from the user via Q&A and resolves them at deployment time, where a static script cannot.

**Three layers of the inference stack.** Runtime (single-instance performance), infrastructure (scaling across replicas, regions, clouds), and tooling (the abstraction surface). The agent operates at the runtime and infrastructure layers; the user retains control via the prose-driven Q&A as the tooling layer.
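The roofline principle above reduces to two lines of arithmetic. This is a minimal sketch using the H100 figures quoted in the text (989 TFLOPS FP16, 3.35 TB/s) and the quoted decode arithmetic intensity of 62; the function names are illustrative.

```python
def ops_byte_ratio(peak_flops, memory_bandwidth_bytes_per_s):
    """Peak compute divided by memory bandwidth (operations per byte)."""
    return peak_flops / memory_bandwidth_bytes_per_s


def bound_by(arithmetic_intensity, ratio):
    """Classify a workload against the GPU's ops:byte ratio."""
    return "memory-bound" if arithmetic_intensity < ratio else "compute-bound"


h100_fp16 = ops_byte_ratio(989e12, 3.35e12)
print(round(h100_fp16))         # 295
print(bound_by(62, h100_fp16))  # memory-bound
```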

Four principles determine most of the approach’s structure.

**No committed runtime scripts.** The implementation does not contain a .py file that will run on the user’s machine. Every Python script that executes during deployment is produced at runtime by an agent from a code-generation blueprint. This avoids the usual drift between a generic deployment tool and the per-deployment values it needs to carry.

**Blueprints, not templates.** A blueprint in this approach is a prose specification of a script’s behaviour: which boto3 calls to make, in which order, with which arguments, which errors to catch, and what to print to stdout versus stderr. The agent reads the blueprint and writes code; it does not render a template. This keeps the authored artifact focused on what the script must do and leaves the surface syntax flexible.

**RFC 2119 constraints.** Every skill expresses its obligations as MUST / MUST NOT / SHOULD / SHOULD NOT / MAY constraints in tabular form. The orchestrator’s system prompt instructs it to treat these as hard rules rather than advisory guidance. This discipline is the principal mechanism for making agent behaviour deterministic enough to audit.

**Package-aware vs. generic execution.** When the user points the agent at an existing ML package that already has a model converter and integration tests, the agent discovers them and reuses them; it does not generate a parallel packaging or testing script. When no such package exists, it falls back to generic framework references. The preference is stated explicitly: the agent MUST NOT generate a generic tarball script when the package has a converter, and MUST NOT generate a generic smoke test when the package has integration tests.

The system comprises one orchestrator and specialized leaf agents for different deployment targets.

```
model-deployer (orchestrator)
    ├── model-packager        - generates packaging scripts
    ├── deployment-validator  - generates and runs validation
    ├── teardown-sweeper      - generates cleanup scripts
    ├── llm-optimizer         - LLM-specific optimization and configuration
    ├── bedrock-deployer      - Bedrock custom model deployment
    └── kv-cache-tuner        - KV cache and memory optimization
```

The orchestrator runs a higher-capacity reasoning model and has access to all skills, the shared deployment-rules context, and standard operating procedures for end-to-end deploy and teardown workflows. It is the only agent that holds conversational context with the user and the only agent permitted to delegate to sub-agents. Its allowed tools are read, write, shell, subagent, an introspection tool, and a skill-loader tool; its shell allow-list is narrow (python3, aws, docker, tar, and a small number of read-only utilities).

The leaf agents run smaller models with specialized responsibilities:

A full deployment proceeds as follows:

The approach organizes skills into three categories: shared infrastructure, traditional ML models, and LLM-specific capabilities.

These skills apply to all model types:

The kv-cache-optimizer skill encodes memory management strategies critical for LLM performance:

```
The agent MUST generate optimization parameters based on model architecture and target hardware:

1. Calculate base model memory: parameter count × precision bytes × safety factor (1.2)
2. Estimate KV cache memory: max_seq_len × batch_size × num_layers × hidden_size × 2 × precision_bytes
3. Determine optimal allocation: total_gpu_memory - base_model_memory - overhead (2GB)
4. Configure serving parameters:
   - vLLM: --gpu-memory-utilization=0.85, --max-model-len=<calculated>, --max-num-seqs=<optimal_batch>
   - TGI: --max-total-tokens=<calculated>, --max-batch-prefill-tokens=<optimized>
   - TensorRT-LLM: --max_batch_size=<calculated>, --max_input_len=<optimized>, --max_output_len=<optimized>

Memory allocation MUST account for:
- Quantization impact on base model size
- KV cache growth during generation
- Intermediate activation memory for long sequences
- Multi-GPU sharding overhead when tensor_parallel_size > 1
```
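The sizing rules in the skill can be worked through numerically. The sketch below applies the three formulas exactly as written (including the hidden_size-based KV estimate, which does not account for grouped-query attention); the function name `kv_token_budget`, the decimal-gigabyte convention, and the example model shape are assumptions for illustration.

```python
def kv_token_budget(param_count, precision_bytes, num_layers, hidden_size,
                    total_gpu_memory_gb, safety_factor=1.2, overhead_gb=2.0):
    """Return how many cached tokens (max_seq_len x batch_size) fit in memory.

    Follows the skill's steps: base model memory, then the memory left
    after model and overhead, divided by KV bytes per cached token.
    """
    gb = 1e9  # decimal gigabytes, an assumption of this sketch
    base_model_gb = param_count * precision_bytes * safety_factor / gb
    available_gb = total_gpu_memory_gb - base_model_gb - overhead_gb
    if available_gb <= 0:
        raise ValueError(
            f"model needs {base_model_gb:.1f} GB plus {overhead_gb} GB overhead; "
            f"only {total_gpu_memory_gb} GB available"
        )
    # K and V (factor 2) for every layer, per cached token.
    kv_bytes_per_token = num_layers * hidden_size * 2 * precision_bytes
    return int(available_gb * gb // kv_bytes_per_token)


# Example: a 7B-parameter FP16 model with 32 layers and hidden size 4096 on a 24 GB GPU.
budget = kv_token_budget(7e9, 2, 32, 4096, total_gpu_memory_gb=24)
print(budget)       # total cached tokens that fit
print(budget // 4)  # a candidate --max-model-len when batch_size is 4
```

Raising when nothing is left for the KV cache surfaces the "fits at load time, fails at allocation" case before deployment rather than after.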

LLM inference has two distinct phases with fundamentally different performance characteristics. **Prefill** (processing the input prompt) is compute-bound because it processes many tokens in parallel with high arithmetic intensity. **Decode** (generating output tokens one at a time) is memory-bandwidth-bound because each generated token requires reading the entire KV cache from HBM with very low arithmetic intensity (≈1 FLOP per byte read). This means decode throughput is dominated by GPU memory bandwidth, not compute capability, and the KV cache is the single largest consumer of that bandwidth. (References [1][2])

The kv-cache-optimizer skill encodes bandwidth-aware optimization:

```
The agent MUST analyze memory bandwidth before recommending serving configuration:

1. Calculate KV cache bytes read per generated token:
   bytes_per_token = 2 × num_layers × num_kv_heads × head_dim × seq_len × precision_bytes × batch_size
   (factor of 2 for both K and V tensors)

2. Calculate theoretical decode throughput ceiling:
   max_tokens_per_sec = gpu_memory_bandwidth_gb_per_sec / bytes_per_token_gb

   Example: Llama-2-70B FP16 at seq_len=2048, batch=1:
   - 80 layers × 8 KV heads × 128 head_dim × 2048 × 2 bytes × 2 (K+V) ≈ 671 MB/token
   - On H100 (3.35 TB/s HBM3): ceiling ≈ 5000 tok/s (theoretical, single request)
   - On A100 (2.0 TB/s HBM2e): ceiling ≈ 3000 tok/s
   - On L4 (300 GB/s): ceiling ≈ 450 tok/s

3. Instance selection MUST reflect bandwidth, not just capacity:
   - ml.p5.48xlarge (H100 SXM, 3.35 TB/s × 8): optimal for large models and long contexts
   - ml.p4d.24xlarge (A100 SXM, 2.0 TB/s × 8): balanced choice for 13B-70B
   - ml.g6.48xlarge (L4, 300 GB/s × 8): economical but bandwidth-constrained for decode
   - ml.g5.48xlarge (A10G, 600 GB/s × 8): reasonable for 7B-13B models
   - ml.inf2.48xlarge (Inferentia2, high-bandwidth HBM): Neuron SDK with bandwidth-aware kernels

4. Bandwidth optimization techniques MUST be evaluated:
   a. Grouped-Query Attention (GQA): reduces num_kv_heads, proportionally reduces KV bytes
      - Llama-2-70B uses GQA with 8 KV heads vs 64 query heads (8× reduction)
      - Llama-3 family uses GQA across all sizes
   b. Multi-Query Attention (MQA): single KV head, maximum bandwidth reduction
      - Falcon, PaLM use MQA
   c. FlashAttention / FlashAttention-2: fuses attention ops, reduces HBM round-trips
      - MUST be enabled in serving framework (vLLM default, TGI --sharded)
   d. PagedAttention: eliminates KV cache fragmentation, increases effective batch size
      - vLLM native; increases bandwidth utilization efficiency
   e. KV cache quantization: FP8 or INT8 K/V reduces bytes read per token
      - vLLM --kv-cache-dtype fp8 (H100/H200 hardware support)
      - Halves decode bandwidth cost with minimal quality impact

5. Batching amortizes bandwidth cost:
   - Single request: bandwidth-bound, batch of N: bandwidth shared across N decodes
   - Continuous batching (vLLM, TGI) MUST be enabled for multi-tenant workloads
   - Larger batches increase TPS/instance until memory or compute saturation

6. Long context bandwidth scaling:
   - KV cache grows linearly with sequence length
   - At 32K context, decode TPS drops ~16× vs 2K context on same hardware
   - Prefix caching (vLLM --enable-prefix-caching) amortizes repeated prompt prefixes
   - MUST warn user when max_model_len causes decode TPS to fall below target SLO
```
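The ceiling calculation in steps 1 and 2 is short enough to reproduce directly. This sketch implements the skill’s formula as written, counting KV cache reads only (model weight reads are ignored, as in the skill), and checks it against the Llama-2-70B example; the function names are illustrative.

```python
def kv_bytes_per_step(num_layers, num_kv_heads, head_dim, seq_len,
                      precision_bytes=2, batch_size=1):
    """KV cache bytes read per decode step (factor 2 covers K and V)."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * precision_bytes * batch_size)


def decode_ceiling_tok_s(bandwidth_bytes_per_s, bytes_per_step, batch_size=1):
    """Upper bound on generated tokens per second from KV reads alone."""
    return batch_size * bandwidth_bytes_per_s / bytes_per_step


# Llama-2-70B, FP16, GQA with 8 KV heads, seq_len=2048, batch=1.
step_bytes = kv_bytes_per_step(80, 8, 128, 2048)
print(step_bytes)  # 671088640, i.e. about 671 MB

for gpu, bandwidth in {"H100": 3.35e12, "A100": 2.0e12, "L4": 300e9}.items():
    print(gpu, round(decode_ceiling_tok_s(bandwidth, step_bytes)))

# Linear growth with context: 32K context reads 16x the bytes of 2K context.
print(kv_bytes_per_step(80, 8, 128, 32768) / step_bytes)  # 16.0
```

The unrounded ceilings land slightly under the skill’s rounded figures of 5000, 3000, and 450 tok/s, which is expected for a back-of-envelope bound.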

The agent generates a bandwidth analysis table as part of the Resolved Requirements block:

```
## KV Cache Bandwidth Analysis

**Model:** Llama-3-70B FP16, GQA (8 KV heads)
**KV bytes per token (seq=4096, batch=1):** 1.34 GB
**KV bytes per token (seq=4096, batch=8):** 10.7 GB
**Selected instance:** ml.p4d.24xlarge (A100 80GB × 8, 2.0 TB/s each)
**Theoretical decode ceiling at 2.0 TB/s (batch=1):** ~1490 tok/s
**Theoretical decode ceiling at 2.0 TB/s (batch=8):** ~1490 tok/s aggregate, ~186 tok/s per request
**With FP8 KV cache:** 2× improvement, requires H100/H200
**Recommendation:** enable continuous batching, FlashAttention-2, and prefix caching
```

This analysis converts a class of production surprises (“we deployed on a cheaper instance and TPS is 10× worse”) into a pre-deployment calculation the user signs off on.

**Anchoring the analysis to the roofline model.** The bandwidth calculations above are an instance of the general roofline framing introduced in §2.7: when an algorithm’s arithmetic intensity (operations per byte of memory traffic) is below the GPU’s ops:byte ratio, performance is memory-bandwidth-bound rather than compute-bound. For decode on an H100 (ops:byte ≈ 295 in FP16), standard attention’s arithmetic intensity for typical sequence lengths is roughly 62, well below 295, which is why decode TPS is dominated by HBM bandwidth, not by FLOPS. The agent’s instance-selection logic and KV-cache-tuning decisions follow directly from this asymmetry.

KV cache rarely fits entirely in GPU VRAM at production scale. The agent’s storage decisions follow a four-tier ladder, ordered by bandwidth to the GPU [1][2]:

The agent MUST treat KV cache placement as a tier-routing decision, not a binary “fits or doesn’t” check:

Mechanism: NVIDIA Dynamo’s **KV Block Manager (KVBM)** is the named API for moving KV blocks across these tiers and is the recommended substrate when the deployment uses Dynamo for orchestration. For non-Dynamo deployments, the agent SHOULD generate explicit offload configuration for the chosen serving framework (vLLM --cpu-offload-gb, TGI --cpu-memory-fraction, etc.) and warn when the chosen instance lacks a high-bandwidth CPU↔GPU interconnect that would make G2 offload practical.

Standard prefix caching only saves prefill work for the contiguous matching prefix between two requests. For RAG-style inputs where N retrieved chunks are concatenated in arbitrary order, only the first chunk benefits — chunks 2..N must be prefilled fresh on every request even when their KV caches were precomputed elsewhere. The naive workaround (concatenating precomputed KV caches without recomputing cross-attention) produces wrong outputs because each chunk’s KV was computed in isolation and never attended to its preceding chunks. (References [1][2])
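The contiguous-prefix limitation is easy to see in a toy model where each request is a list of chunk identifiers. This is an illustrative sketch only: real prefix caches match on token blocks rather than named chunks, and the function name `reusable_prefix_chunks` is hypothetical.

```python
def reusable_prefix_chunks(cached_order, request_order):
    """Count the leading chunks two requests share; only these hit a prefix cache."""
    count = 0
    for cached, requested in zip(cached_order, request_order):
        if cached != requested:
            break  # everything after the first mismatch must be prefilled again
        count += 1
    return count


cached = ["system", "A", "B", "C"]
print(reusable_prefix_chunks(cached, ["system", "A", "B", "C"]))  # 4
print(reusable_prefix_chunks(cached, ["system", "B", "A", "C"]))  # 1
```

In the second call chunk "C" sits in the same position in both requests, yet it is not reusable because the prefix diverged before it, which is the gap that selective recomputation targets.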

**CacheBlend** and **LMCache** address this by selectively recomputing KV for a small fraction (~10–15%) of cross-chunk-sensitive tokens — empirically the connector tokens whose contextual representation depends on neighbors outside their chunk — while reusing the precomputed KV for the rest. This restores the cross-attention information at a fraction of full-prefill cost, with TTFT reductions of 2.2–3.3× over full prefill and quality within 0.02 F1 of full recompute.

**RFC 2119 selection rules:**

The llm-quantization-manager skill resolves quantization in two ordered steps: **what to quantize** (component sensitivity), then **what format** (dynamic range and granularity). This ordering matches the canonical inference-engineering framing and prevents the common failure mode of picking a format before knowing how aggressively it can be applied. (References [1][2])

**Step 1: Component sensitivity ordering.** Components of a model differ by orders of magnitude in how tolerant they are of reduced precision. Quantization decisions MUST proceed from least sensitive to most sensitive:

**Step 2: Format selection.** Once the agent knows which components to quantize, the format is selected by **dynamic range** (the span of representable values) and **granularity** (how many values share a single scale factor). Floating-point formats are preferred over integer formats for production because their exponent bits give them the dynamic range needed to represent outlier activations after quantization.

**RFC 2119 selection rules:**

**Step 3: Quality validation.** Quantization is not done until quality has been measured. The agent MUST generate a validation script that runs three checks against the original-precision baseline:

The agent MUST NOT mark a quantized deployment as production-ready if any of the three checks shows non-noise quality regression.

Speculative decoding generates more than one token per forward pass through the target model by drafting candidate tokens cheaply and validating them in parallel. It improves TPS and inter-token latency, never TTFT. The speculation-selector skill chooses among algorithms based on workload shape, batch size, and available training/distillation budget. (References [1][2])

**The shared mechanism.** All speculation algorithms follow the same loop: a speculator generates N draft tokens; the target model validates them in a single forward pass; accepted tokens are kept and the target generates one additional token, yielding N+1 tokens per pass. The speedup depends on three factors — draft cost, draft sequence length, and acceptance rate — and acceptance falls off rapidly past the first few draft tokens, so short, high-confidence sequences win.
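The draft-then-validate loop can be sketched with toy next-token functions standing in for the two models. This is a simplified greedy variant under stated assumptions: `target_next` and `draft_next` are hypothetical callables that map a token list to the next token, and the single target forward pass is simulated by evaluating the target on each draft prefix.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, draft_len=4):
    """Greedy speculative decoding; returns (generated tokens, target passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < num_tokens:
        # 1. The speculator drafts draft_len tokens autoregressively.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2. One target pass validates every draft position (simulated here).
        passes += 1
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed != expected:
                # First mismatch: keep the target's own token and stop.
                accepted.append(expected)
                break
            accepted.append(proposed)
        else:
            # Every draft token accepted: the target adds one more (N+1 total).
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + num_tokens], passes


def toy_target(tokens):
    """Toy target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    """Toy draft model: agrees with the target except right after a 7."""
    return 0 if tokens[-1] == 7 else (tokens[-1] + 1) % 10


generated, target_passes = speculative_decode(toy_target, toy_draft, [0], 20)
print(generated)
print(target_passes)
```

The output is token-for-token identical to decoding with the target alone; only the number of target passes changes, which is why speculation affects TPS and inter-token latency but not output quality in the greedy case.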

**Algorithm selection.** The agent MUST select among four families based on workload shape:

**RFC 2119 selection rules:**

**Profiling output.** The agent generates a speculation analysis section in the deployment plan that reports the expected acceptance rate (estimated from a small calibration sample), the chosen draft sequence length, and the projected TPS uplift versus baseline decode. If the projected uplift is below 15%, the agent MUST recommend disabling speculation and surface the rationale.
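A projected uplift of this kind can be estimated from the three factors named earlier (draft cost, draft sequence length, acceptance rate). The sketch below uses the common simplifying assumption that each draft token is accepted independently with the same probability, which is not stated in the text; the function names and the `draft_cost_ratio` parameter (cost of one draft step relative to one target pass) are illustrative. The 15% threshold is the one given above.

```python
def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens per target pass under independent per-token acceptance."""
    if acceptance_rate >= 1.0:
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


def projected_uplift(acceptance_rate, draft_len, draft_cost_ratio):
    """Fractional TPS gain versus baseline decode (0.25 means +25%)."""
    # One target pass plus draft_len draft steps, in units of a target pass.
    cost_per_pass = 1 + draft_len * draft_cost_ratio
    speedup = expected_tokens_per_pass(acceptance_rate, draft_len) / cost_per_pass
    return speedup - 1.0


def recommend_speculation(acceptance_rate, draft_len, draft_cost_ratio,
                          min_uplift=0.15):
    """Return (enable, uplift); disable when the projection is below the threshold."""
    uplift = projected_uplift(acceptance_rate, draft_len, draft_cost_ratio)
    return uplift >= min_uplift, uplift


print(recommend_speculation(0.8, 4, 0.05))
print(recommend_speculation(0.3, 4, 0.10))
```

In practice the acceptance rate would come from the calibration sample mentioned above rather than from a guess, and the projection should be treated as an upper bound because batching competes with speculation for compute.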

The multi-modal-extensions skill addresses vision-language models:

```
Multi-modal deployment MUST handle both vision and language components:

1. Model architecture detection:
   - LLaVA: Separate vision encoder (CLIP) + language model (Llama)
   - BLIP-2: Q-Former bridge + frozen vision encoder + language model
   - Flamingo: Perceiver resampler + cross-attention layers
   - GPT-4V style: Integrated vision-language architecture

2. Preprocessing pipeline configuration:
   - Image preprocessing: resize, normalize, tensor conversion
   - Text tokenization: model-specific tokenizer with special tokens
   - Multi-modal fusion: attention mask generation, position embeddings

3. Serving framework adaptations:
   - vLLM: Multi-modal support via custom model implementations
   - TGI: Limited multi-modal support, prefer custom containers
   - Transformers: Full flexibility with pipeline abstractions

4. Memory optimization:
   - Vision encoder memory: typically 1-4GB depending on architecture
   - Cross-attention cache: additional KV cache for vision tokens
   - Batch processing: group requests by image resolution for efficiency
```

The bedrock-custom-models skill handles the Bedrock-specific deployment path:

```
Bedrock custom model import MUST follow these steps:

1. Model validation:
   - Check model format compatibility (safetensors for Llama family)
   - Validate checkpoint structure matches base model requirements
   - Verify tokenizer compatibility and special token handling
   - Confirm model size limits (varies by base model family)

2. S3 preparation:
   - Upload model artifacts to S3 with proper IAM permissions
   - Structure follows Bedrock requirements: model files, tokenizer, config
   - Generate model manifest with metadata and validation checksums

3. Import job creation:
   - Call bedrock.create_model_import_job with proper role and S3 paths
   - Monitor import status with exponential backoff polling
   - Handle validation failures with specific error code interpretation

4. Provisioned throughput setup:
   - Calculate minimum units based on model size and expected throughput
   - Create provisioned throughput with appropriate commitment term
   - Wait for READY status before enabling inference

5. Validation and testing:
   - Test inference API with model-appropriate prompts
   - Validate response format and quality
   - Measure latency and throughput characteristics
```
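Step 3 of the skill (job creation and backoff polling) might look like the following with boto3’s Bedrock control-plane client. This is a hedged sketch, not the generated script: the function name `import_custom_model`, the delay values, and the injectable `sleep` parameter are assumptions, and the sketch has no overall deadline, which a real script would want.

```python
import time


def import_custom_model(bedrock, job_name, model_name, role_arn, s3_uri,
                        initial_delay_s=5.0, max_delay_s=300.0, sleep=time.sleep):
    """Start a model import job and poll it with exponential backoff.

    `bedrock` is a boto3 client created with boto3.client("bedrock").
    Returns the final job description; raises RuntimeError if the job fails.
    """
    job = bedrock.create_model_import_job(
        jobName=job_name,
        importedModelName=model_name,
        roleArn=role_arn,
        modelDataSource={"s3DataSource": {"s3Uri": s3_uri}},
    )
    delay = initial_delay_s
    while True:
        description = bedrock.get_model_import_job(jobIdentifier=job["jobArn"])
        if description["status"] == "Completed":
            return description
        if description["status"] == "Failed":
            raise RuntimeError(description.get("failureMessage", "model import failed"))
        sleep(delay)
        # Double the wait after every poll, capped at max_delay_s.
        delay = min(delay * 2, max_delay_s)
```

Injecting `sleep` keeps the polling loop testable without waiting and makes the backoff schedule easy to assert on.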

The orchestrator uses this decision matrix to recommend deployment targets:

The package’s LLM_DEPLOYMENT_LESSONS.md catalogues LLM-specific errors and their prevention rules:

The llm-performance-profiler skill generates comprehensive performance analysis:

```
LLM performance profiling MUST measure these metrics:

1. Latency metrics:
   - Time to first token (TTFT): Critical for interactive applications
   - Inter-token latency: Consistency of generation speed
   - End-to-end latency: Total request processing time

2. Throughput metrics:
   - Tokens per second per request: Individual request performance
   - Total tokens per second: System-wide throughput
   - Requests per second: Concurrent request handling

3. Memory utilization:
   - GPU memory usage during inference
   - KV cache memory growth patterns
   - Peak memory usage vs. sustained usage

4. Quality metrics:
   - Response coherence (via automated evaluation)
   - Instruction following accuracy
   - Multi-turn conversation consistency

5. Cost efficiency:
   - Cost per 1K tokens generated
   - Instance utilization percentage
   - Right-sizing recommendations based on usage patterns
```
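The latency and per-request throughput metrics in items 1 and 2 can be derived from token arrival timestamps collected while streaming a response. The sketch below is one way to compute them; the function name `latency_metrics` and the convention that all timestamps are seconds on a shared monotonic clock are assumptions.

```python
def latency_metrics(request_start, token_times):
    """Compute TTFT, mean inter-token latency, end-to-end latency, and TPS.

    request_start: time the request was sent.
    token_times: arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    end_to_end = token_times[-1] - request_start
    return {
        "ttft_s": token_times[0] - request_start,
        "inter_token_latency_s": sum(gaps) / len(gaps) if gaps else 0.0,
        "end_to_end_s": end_to_end,
        "tokens_per_s": len(token_times) / end_to_end,
    }


# Example: first token after 0.5 s, then one token every 0.1 s.
metrics = latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
print(metrics)
```

Reporting TTFT separately from inter-token latency matters because the two are governed by different phases (prefill and decode) and respond to different optimizations.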

The system generates adaptive configurations based on profiling results:

```
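# Example generated optimization script


def calculate_optimal_tp_size(model_size_gb, gpu_memory_gb):
    """Smallest power-of-two GPU count whose memory holds the weights with ~10% headroom.

    Illustrative helper: the heuristic is an assumption, not a documented rule.
    """
    tp_size = 1
    while model_size_gb > 0.9 * tp_size * gpu_memory_gb:
        tp_size *= 2
    return tp_size


def optimize_vllm_config(model_size_gb, target_latency_ms, concurrent_users,
                         kv_cache_estimate_gb, gpu_memory_gb=80.0):
    """Generated by llm-optimizer based on profiling results"""
    # gpu_memory_gb is per GPU; the 80 GB default is a placeholder value.
    tensor_parallel_size = calculate_optimal_tp_size(model_size_gb, gpu_memory_gb)
    total_gpu_memory = tensor_parallel_size * gpu_memory_gb

    # Calculate optimal memory allocation
    gpu_memory_utilization = min(
        0.95, (model_size_gb + kv_cache_estimate_gb) / total_gpu_memory
    )

    # Optimize for latency vs throughput trade-off
    if target_latency_ms < 100:
        max_num_seqs = min(4, concurrent_users)  # Prioritize latency
        max_model_len = 2048  # Shorter sequences for faster processing
    else:
        max_num_seqs = min(32, concurrent_users * 2)  # Maximize throughput
        max_model_len = 4096  # Allow longer sequences

    return {
        "gpu_memory_utilization": gpu_memory_utilization,
        "max_num_seqs": max_num_seqs,
        "max_model_len": max_model_len,
        "tensor_parallel_size": tensor_parallel_size,
    }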
```

The architecture does not depend on NVIDIA-specific concepts except in the decision tables that encode driver-and-DLC compatibility. The rest of the system (Q&A protocol, package discovery, blueprint-based code generation, tagging and TTL conventions, tarball validation, endpoint lifecycle, and post-deploy profiling) is orthogonal to the accelerator type.

**AWS Inferentia and Trainium.** These accelerators are served via the Neuron SDK, which distributes its own inference containers. For LLMs, Neuron supports optimized transformers with automatic model partitioning across NeuronCores. The llm-serving-frameworks skill would add Neuron-specific configurations for sequence length optimization and NeuronCore utilization.

**AMD Instinct (ROCm).** ROCm support for LLMs is emerging through ROCm-compatible PyTorch builds and specialized serving frameworks. The quantization strategies would need ROCm-specific calibration, and the KV cache optimization would account for AMD GPU memory hierarchies.

**Intel Gaudi (HPU).** Intel’s Habana framework provides LLM optimization through specialized attention kernels and memory management. The multi-GPU sharding patterns would adapt to Gaudi’s scale-out architecture and inter-chip communication patterns.

**CPU-only inference.** For smaller LLMs or development scenarios, CPU inference using optimized frameworks (llama.cpp, ONNX Runtime) becomes viable. The KV cache optimization shifts to system memory management, and the performance profiling focuses on CPU utilization and memory bandwidth.

Several extensions are consistent with the same pattern and artifact types:

**Multi-LoRA serving.** Support for serving multiple LoRA adapters on a single base model, with dynamic adapter switching and memory-efficient adapter storage.

**Speculative decoding.** Integration of draft models for faster generation, with automatic draft model selection and verification threshold tuning.

**Retrieval-Augmented Generation (RAG).** Deployment patterns that include vector databases, embedding models, and retrieval pipelines alongside the LLM endpoint.

**Function calling and tool use.** Structured output generation with JSON schema validation and external tool integration patterns.

**Multi-modal chain deployment.** End-to-end deployment of vision-language pipelines including object detection, OCR, and cross-modal reasoning components.

**Federated LLM serving.** Deployment across multiple regions or accounts with load balancing and failover capabilities.

**Cost optimization automation.** Dynamic instance scaling based on usage patterns, automatic spot instance utilization, and cost-aware routing between Bedrock and SageMaker.

Several limitations are worth noting explicitly:

The agentic pattern is not tied to AWS. The AWS-specific parts (SageMaker endpoint lifecycle, Bedrock import APIs, ECR/S3, Service Quotas) are all contained in skills and decision tables, while the invariants that make the system useful (blueprint-driven code generation, package-aware execution, RFC 2119 constraints, pre-flight compatibility resolution, KV-cache bandwidth analysis, tagged ephemeral resources with TTL sweep) are provider-agnostic. Porting to another cloud is therefore primarily a matter of replacing the provider-specific skill set and decision tables, not rewriting the orchestrator.

**Google Cloud Platform.** SageMaker endpoints map to Vertex AI Endpoints, Bedrock custom models map to Vertex AI Model Garden and the Generative AI on Vertex AI API, ECR maps to Artifact Registry, and S3 maps to Cloud Storage. The quota-check skill would query the Cloud Quotas API instead of Service Quotas, and the instance tables would encode the A2, A3, and G2 families (A100, H100, L4) with their associated driver versions. Vertex AI’s Prediction Container contract is similar enough to SageMaker’s (health-check endpoint, model artifact in GCS, environment variables for configuration) that the packaging blueprints need only minor edits. Gemini-family and partner models served through Vertex’s online prediction fit the same pattern as Bedrock on-demand vs provisioned throughput.

**Microsoft Azure.** Azure Machine Learning managed online endpoints replace SageMaker endpoints, with analogous create-deployment, rolling-update, and traffic-splitting semantics. Azure Container Registry replaces ECR, Azure Blob Storage replaces S3, and the ND/NC/NG VM series (A100, H100, MI300X) replace the G/P instance families. Azure AI Foundry (formerly Azure OpenAI and Azure AI Studio) is the Bedrock analogue for managed LLM access, with its own custom-model-import flow for Llama and Mistral fine-tunes. The driver-and-container compatibility matrix differs, since Azure ships a wider range of driver versions across VM sizes, but the shape of the decision table is identical.

**Oracle Cloud Infrastructure.** OCI Data Science Model Deployment and the OCI Generative AI Service provide a smaller but structurally similar surface. The agent needs an OCI-specific skill for Object Storage, OCIR container registry, and the Data Science API, and a compatibility table for the BM.GPU.A10 / BM.GPU.H100 / BM.GPU.MI300X shapes. OCI is notable as the primary AWS alternative for AMD Instinct (MI300X) deployments, which remains a custom-container path on AWS.

**Specialized AI clouds.** CoreWeave, Lambda Cloud, Crusoe, and similar providers expose Kubernetes-native GPU fleets rather than managed endpoint services. For these, the sagemaker-endpoint-manager skill would be replaced by a kubernetes-endpoint-manager skill that generates Deployment, Service, and HorizontalPodAutoscaler manifests alongside the same container images. The KV-cache bandwidth analysis becomes more relevant here, not less, because these providers often offer H100/H200/B200 instances at materially different price-per-bandwidth points than the hyperscalers, and the agent’s pre-deployment ceiling calculation is what lets a user choose between them rationally.

**Cross-provider orchestration.** Once provider-specific skills exist, the orchestrator can treat cloud choice as another Q&A answer, with a decision table that routes based on existing account footprint, data residency, GPU availability, and price-per-token targets. The same deployment plan template, confirmation step, and teardown sweep apply. A natural extension is cost-aware routing: given a target latency SLO and a token budget, the agent computes the cheapest cloud-and-instance combination that satisfies both and, if the user agrees, deploys there rather than defaulting to the provider the user happened to ask about first.

In every case, the invariant parts of the system are retained: prose skills, RFC 2119 constraints, blueprint-driven code generation, pre-deployment compatibility resolution, and tagged ephemeral endpoints with TTL-scoped teardown. The variability is localised to provider-specific API blueprints and lookup tables, which is where the architecture already places provider-specific concerns.
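
One way to picture the provider-specific lookup tables is as plain data keyed by provider and role. The sketch below is an assumption about shape only: the role names (`endpoint`, `managed_llm`, `registry`, `object_store`, `quota`) and the helper functions are hypothetical, and the service names are the ones mentioned in this section. Where the text names no counterpart, the entry is left as `None` instead of being guessed.

```python
from typing import Dict, List, Optional

# Hypothetical role names; the article does not define this schema.
ROLES = ("endpoint", "managed_llm", "registry", "object_store", "quota")

PROVIDER_SERVICES: Dict[str, Dict[str, Optional[str]]] = {
    "aws": {
        "endpoint": "SageMaker endpoints",
        "managed_llm": "Amazon Bedrock",
        "registry": "ECR",
        "object_store": "S3",
        "quota": "Service Quotas",
    },
    "gcp": {
        "endpoint": "Vertex AI Endpoints",
        "managed_llm": "Vertex AI Model Garden",
        "registry": "Artifact Registry",
        "object_store": "Cloud Storage",
        "quota": "Cloud Quotas API",
    },
    "azure": {
        "endpoint": "Azure Machine Learning managed online endpoints",
        "managed_llm": "Azure AI Foundry",
        "registry": "Azure Container Registry",
        "object_store": "Azure Blob Storage",
        "quota": None,  # not named in the text
    },
    "oci": {
        "endpoint": "OCI Data Science Model Deployment",
        "managed_llm": "OCI Generative AI Service",
        "registry": "OCIR",
        "object_store": "Object Storage",
        "quota": None,  # not named in the text
    },
}


def resolve(provider: str, role: str) -> str:
    """Look up the service filling `role` on `provider`, or fail loudly."""
    service = PROVIDER_SERVICES[provider][role]
    if service is None:
        raise LookupError(f"no {role!r} mapping recorded for {provider!r}")
    return service


def missing_roles(provider: str) -> List[str]:
    """List the roles that still need a provider-specific skill or entry."""
    return [r for r in ROLES if PROVIDER_SERVICES[provider].get(r) is None]


print(resolve("gcp", "registry"))
print(missing_roles("azure"))
```

Keeping the table as data means the orchestrator code is untouched when a provider is added. A gap surfaces as an explicit `LookupError` at planning time, not as a failed API call during deployment.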

The agentic approach studied here shows how agent-generated deployment code can handle the complexity of both traditional ML models and Large Language Models without freezing the deployment process into rigid parameterized scripts. Extending the prose-driven approach to LLM-specific concerns (quantization strategies, KV cache optimization, multi-modal preprocessing, and Bedrock integration) preserves the system's core principle: situational logic lives in prose, not in code.

The LLM extensions illustrate how the same architectural pattern scales to handle fundamentally different model types and deployment targets. The agent’s ability to generate framework-specific optimizations, calculate memory requirements, and adapt to different serving patterns demonstrates the flexibility of the blueprint-based approach.

The main consequences of this choice remain: (i) a sharp separation between what the system knows (prose skills and decision tables) and what it does (generated deployment code), (ii) deployment-time resolution of complex compatibility matrices that would otherwise surface as runtime errors, and (iii) an architecture whose model-specific and accelerator-specific parts are contained in extensible tables rather than hardcoded control flow.

Within its acknowledged limitations (development focus, maintenance requirements, and prose contract enforcement), the pattern provides a practical approach to managing the deployment complexity that emerges as ML models grow larger, more diverse, and more operationally demanding. The same principles that make traditional model deployment repeatable extend naturally to the LLM domain, where the stakes of getting the configuration right are often higher and mistakes are more expensive.

[Agentic Inference Deployment: From Prose Skills to Deployed Endpoints](https://pub.towardsai.net/agentic-inference-deployment-from-prose-skills-to-deployed-endpoints-1efcfdd47807) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.