{"slug": "amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads", "title": "Amazon SageMaker AI Async Inference now supports inline request payloads", "summary": "Amazon SageMaker AI Async Inference now supports inline request payloads up to 128,000 bytes, eliminating the need to upload input data to Amazon S3 before each invocation. This reduces latency, simplifies client-side code, and lowers operational overhead for asynchronous inference workloads.", "body_md": "# Amazon SageMaker AI Async Inference now supports inline request payloads\n\nToday, we’re announcing inline payload support for Amazon SageMaker AI Async Inference. Customers can now send inference payloads directly in the request body of the `InvokeEndpointAsync` API, removing the need to upload input data to Amazon Simple Storage Service (Amazon S3) before each invocation.\n\nFor payloads up to 128,000 bytes, this removes an entire network round-trip, simplifies client-side code, and reduces the operational surface area of asynchronous inference workloads.\n\nIn this post, we explain the motivation behind this feature, walk through the customer experience before and after, and show you how to start using inline payloads today.\n\n## Background: How async inference worked before\n\nYou can use [Amazon SageMaker AI Async Inference](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) to queue inference requests and process them asynchronously. It’s a good fit for workloads with large payloads, variable traffic, or tolerance for seconds-to-minutes latency. 
It supports automatic scaling to zero, making it cost-efficient for bursty or batch-style workloads.\n\nUntil now, the workflow required two steps on every invocation:\n\n1. **Upload** the input payload to an Amazon S3 bucket.\n2. **Invoke** the endpoint, passing the S3 object URI as `InputLocation`.\n\nThe endpoint processes the request asynchronously and writes the output to a configured S3 output location. The client either polls that location or receives an Amazon Simple Notification Service (Amazon SNS) notification when the result is ready.\n\nThis two-step pattern works well for large payloads (images, audio, multi-MB documents). But for customers with small input payloads (in the KB range) who need longer processing times than real-time inference allows, the mandatory S3 dependency added unnecessary complexity.\n\n## What’s new: Inline payload via the Body parameter\n\nWith today’s launch, `InvokeEndpointAsync` accepts a new `Body` parameter. When present, the payload is sent inline in the API request itself, with no S3 upload required.\n\n**Key details:**\n\n| Aspect | Details |\n| --- | --- |\n| New parameter | `Body`, raw bytes, capped at 128,000 bytes. |\n| Max inline size | 128,000 bytes (raw payload). |\n| Mutual exclusivity | `Body` and `InputLocation` are mutually exclusive. The API rejects requests that set both. |\n| Output behavior | Unchanged. Output is written to the S3 `OutputLocation`. |\n| Endpoint compatibility | Designed to work with existing async endpoints; no model or container changes expected. |\n| Error handling | Size and mutual-exclusivity violations return synchronous `ValidationError` responses. |\n| Availability | Available in 31 commercial AWS Regions (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, KUL, BKK, HYD, TPE, CPT, MXP, TLV). |\n\n## Before and after: The customer experience\n\nThe change is clearest in code. The two examples that follow perform the same async invocation against the same endpoint. 
The first uses the S3 upload step that was required until now, and the second uses the inline `Body` parameter that replaces it.\n\n### Before: Upload to S3 first, then invoke\n\nThis approach requires:\n\n- An S3 client and input bucket provisioned.\n- AWS Identity and Access Management (IAM) `s3:PutObject` permission on the caller.\n- A naming scheme (UUID or similar) to avoid key collisions.\n- A cleanup strategy for stale input objects.\n\n### After: Send the payload inline\n\nNo S3 client, no `uuid`, no input bucket, no IAM grants on the input path, no stale-object cleanup.\n\n## Customer benefits\n\nSending the payload inline removes a network hop and a dependency from each request. That translates into five concrete benefits:\n\n- **Reduced latency.** One network round-trip and one S3 PUT are removed per request. For fan-out workloads, these savings compound across every request.\n- **Simpler architecture.** Removes the need for input bucket provisioning, lifecycle policies, cross-account access patterns, and the caller’s IAM `s3:PutObject` permission on the input path.\n- **Fewer error paths.** The request is a single API call. It either enqueues or it doesn’t.\n- **Lower cost.** Removes the S3 PUT charge for the input upload on every inline invocation.\n- **Immediate validation feedback.** Size and mutual-exclusivity errors are returned synchronously.\n\n## When to use each approach\n\nInline payloads are typically the simpler choice for small payloads, but `InputLocation` still has its place. Use the following table to decide which path fits a given workload:\n\n| Scenario | Recommended approach |\n| --- | --- |\n| Payload <= 128,000 bytes (JSON prompts, structured data) | Inline `Body`. Simpler. Avoids one network round-trip and S3 PUT charges. |\n| Payload > 128,000 bytes (images, audio, large documents) | `InputLocation`. Upload to S3 first. |\n| Mixed workload with variable payload sizes | Branch on size. Use `Body` for small, `InputLocation` for large. 
|\n| Need to retain input data in S3 for audit or replay | `InputLocation`. Keeps inputs in your bucket. |\n\n## Getting started\n\nSee the [example code notebook](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/03-features/async-inference-inline-payload/async_inline_payload.ipynb) for a full walkthrough.\n\nBefore you begin, make sure you have:\n\n- An existing Amazon SageMaker AI Async Inference endpoint (verify with `aws sagemaker describe-endpoint --endpoint-name my-async-endpoint`).\n- The latest AWS SDK for Python (Boto3) installed and configured with credentials.\n- IAM permissions for `sagemaker:InvokeEndpointAsync`.\n- An S3 output bucket configured for your async endpoint (for example, `my-output-bucket`).\n\n**Note:** Following this guide uses billable AWS resources. SageMaker AI async inference endpoints incur charges for instance hours, and S3 buckets incur charges for storage and requests. Follow the cleanup steps after completing the tutorial to avoid ongoing charges.\n\n### Steps\n\nInline payload support is available today. 
To use it:\n\n1. **Update your AWS SDK.** Install or upgrade Boto3 to the latest version: `pip install --upgrade boto3`.\n2. **Verify the installation:** `pip show boto3`.\n3. **Replace your invocation code.** In your application, substitute the S3 upload + `InputLocation` pattern with a direct `Body` parameter, as shown in the preceding code example.\n4. **Test your invocation** by calling the `InvokeEndpointAsync` API with the `Body` parameter.\n5. **Verify the response** contains an `OutputLocation` field.\n6. **Poll or monitor the S3 `OutputLocation`** to confirm your inference result was written successfully.\n\nNo changes are needed to your endpoint configuration, model container, or output S3 setup.\n\n## Clean up\n\nTo avoid ongoing charges, delete the resources used in this walkthrough:\n\n- Delete the SageMaker AI endpoint if it was created for testing.\n- Delete the output S3 bucket (if no longer needed). **Warning:** Deleting an S3 bucket permanently removes the objects within it. Verify you have backed up any inference results you need to retain.\n- Remove any IAM policies created specifically for this tutorial.\n\n## Conclusion\n\nInline payload support for SageMaker AI Async Inference removes a common friction point in asynchronous inference workflows: the mandatory S3 upload for every request. For the majority of inference payloads that fit within 128,000 bytes, you can now make a single API call and let SageMaker AI handle the rest.\n\nThe feature is designed to be backward-compatible. Existing `InputLocation` workflows continue unchanged. Both inline and S3 inputs are processed identically once the request is accepted, and models receive identical requests regardless of input source.\n\nGet started today by updating your AWS SDK and using the `Body` parameter on the [SageMaker AI InvokeEndpointAsync API](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html). 
To learn more about asynchronous inference, see the [Amazon SageMaker AI Async Inference documentation](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html).", "url": "https://wpnews.pro/news/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads", "canonical_source": "https://aws.amazon.com/blogs/machine-learning/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads/", "published_at": "2026-06-17 20:56:36+00:00", "updated_at": "2026-06-17 21:22:58.054575+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-infrastructure", "ai-tools"], "entities": ["Amazon SageMaker", "Amazon S3", "AWS", "Amazon Simple Notification Service"], "alternates": {"html": "https://wpnews.pro/news/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads", "markdown": "https://wpnews.pro/news/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads.md", "text": "https://wpnews.pro/news/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads.txt", "jsonld": "https://wpnews.pro/news/amazon-sagemaker-ai-async-inference-now-supports-inline-request-payloads.jsonld"}}