Amazon SageMaker AI Async Inference now supports inline request payloads

Amazon SageMaker AI Async Inference now supports inline request payloads up to 128,000 bytes, eliminating the need to upload input data to Amazon S3 before each invocation. This reduces latency, simplifies client-side code, and lowers operational overhead for asynchronous inference workloads.

Today, we’re announcing inline payload support for Amazon SageMaker AI Async Inference. Customers can now send inference payloads directly in the request body of the InvokeEndpointAsync API, removing the need to upload input data to Amazon Simple Storage Service (Amazon S3) before each invocation. For payloads up to 128,000 bytes, this removes an entire network round-trip, simplifies client-side code, and reduces the operational surface area of asynchronous inference workloads.

In this post, we explain the motivation behind this feature, walk through the customer experience before and after, and show you how to start using inline payloads today.

Background: How async inference worked before

You can use Amazon SageMaker AI Async Inference (https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html) to queue inference requests and process them asynchronously. It’s a good fit for workloads with large payloads, variable traffic, or tolerance for seconds-to-minutes latency. It supports automatic scaling to zero, making it cost-efficient for bursty or batch-style workloads.

Until now, the workflow required two steps on every invocation:

1. Upload the input payload to an Amazon S3 bucket.
2. Invoke the endpoint, passing the S3 object URI as InputLocation.

The endpoint processes the request asynchronously and writes the output to a configured S3 output location, which the client polls or learns about through an Amazon Simple Notification Service (Amazon SNS) notification.

This two-step pattern works well for large payloads (images, audio, multi-MB documents). But for customers with small input payloads (in KB) who need longer processing times than real-time inference allows, the mandatory S3 dependency added unnecessary complexity.
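The two-step workflow described above can be sketched in Python roughly as follows. This is a minimal illustration, not the post's original sample: the helper name `invoke_async_via_s3`, the `async-inputs/` key prefix, and the bucket and endpoint names in the usage comment are assumptions made for the example. The clients are passed in as arguments so the helper can be exercised without AWS credentials.

```python
import uuid


def invoke_async_via_s3(s3_client, runtime_client, endpoint_name, bucket, payload,
                        content_type="application/json"):
    """Upload the payload to S3, then invoke the async endpoint with InputLocation.

    s3_client and runtime_client are expected to behave like boto3's "s3" and
    "sagemaker-runtime" clients.
    """
    # A UUID-based key (hypothetical naming scheme) avoids collisions between
    # concurrent callers writing to the same input bucket.
    key = f"async-inputs/{uuid.uuid4()}.json"

    # Step 1: upload the input payload to the input bucket.
    s3_client.put_object(Bucket=bucket, Key=key, Body=payload, ContentType=content_type)

    # Step 2: invoke the endpoint, pointing it at the uploaded object.
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{bucket}/{key}",
        ContentType=content_type,
    )

    # The response reports where the result will be written once processing finishes.
    return response["OutputLocation"]


# Example wiring (requires AWS credentials, an input bucket, and an async endpoint):
#   import boto3
#   output_uri = invoke_async_via_s3(
#       boto3.client("s3"),
#       boto3.client("sagemaker-runtime"),
#       "my-async-endpoint",
#       "my-input-bucket",
#       b'{"inputs": "hello"}',
#   )
```

Note that the caller needs both an S3 client and a SageMaker runtime client, and that every invocation leaves an input object behind in the bucket.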
What’s new: Inline payload via the Body parameter

With today’s launch, InvokeEndpointAsync accepts a new Body parameter. When present, the payload is sent inline in the API request itself, with no S3 upload required. Key details:

Aspect | Details
New parameter | Body: raw bytes, capped at 128,000 bytes.
Max inline size | 128,000 bytes (raw payload).
Mutual exclusivity | Body and InputLocation are mutually exclusive. The API rejects requests that set both.
Output behavior | Unchanged. Output is written to the S3 OutputLocation.
Endpoint compatibility | Designed to work with existing async endpoints; no model or container changes expected.
Error handling | Size and mutual-exclusivity violations return synchronous ValidationError responses.
Availability | Available in 31 commercial AWS Regions (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, KUL, BKK, HYD, TPE, CPT, MXP, TLV).

Before and after: The customer experience

The change is clearest in code. The two approaches that follow perform the same async invocation against the same endpoint: the first uses the S3 upload step that was required until now, and the second uses the inline Body parameter that replaces it.

Before: Upload to S3 first, then invoke

This approach requires:

- An S3 client and input bucket provisioned.
- AWS Identity and Access Management (IAM) s3:PutObject permission on the caller.
- A naming scheme (UUID or similar) to avoid key collisions.
- A cleanup strategy for stale input objects.

After: Send the payload inline

No S3 client, no uuid, no input bucket, no IAM grants on the input path, no stale-object cleanup.

Customer benefits

Sending the payload inline removes a network hop and a dependency from each request. That translates into five concrete benefits:

- Reduced latency. One network round-trip and one S3 PUT are removed per request. For fan-out workloads that issue many requests, these savings add up.
- Simpler architecture. Avoids the input bucket provisioning, lifecycle policies, cross-account access patterns, and the caller’s IAM s3:PutObject permission on the input path.
- Fewer error paths. The request is a single API call. It either enqueues or it doesn’t.
- Lower cost. Removes the S3 PUT charge for the input upload on every inline invocation.
- Immediate validation feedback. Size and mutual-exclusivity errors are returned synchronously.

When to use each approach

Inline payloads are typically the simpler choice for small payloads, but InputLocation still has its place. Use the following table to decide which path fits a given workload:

Scenario | Recommended approach
Payload <= 128,000 bytes (JSON prompts, structured data) | Inline (Body). Simpler. Avoids one network round-trip and S3 PUT charges.
Payload > 128,000 bytes (images, audio, large documents) | InputLocation. Upload to S3 first.
Mixed workload with variable payload sizes | Branch on size. Use Body for small payloads, InputLocation for large ones.
Need to retain input data in S3 for audit or replay | InputLocation. Keeps inputs in your bucket.

Getting started

See the example code notebook (https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/03-features/async-inference-inline-payload/async_inline_payload.ipynb) for a full walkthrough.

Before you begin, make sure you have:

- An existing Amazon SageMaker AI Async Inference endpoint (verify with aws sagemaker describe-endpoint --endpoint-name my-async-endpoint).
- The latest AWS SDK for Python (Boto3) installed and configured with credentials.
- IAM permissions for sagemaker:InvokeEndpointAsync.
- An S3 output bucket configured for your async endpoint (for example, my-output-bucket).

Note: Following this guide uses billable AWS resources. SageMaker AI async inference endpoints incur charges for instance hours, and S3 buckets incur charges for storage and requests. Follow the cleanup steps after completing the tutorial to avoid ongoing charges.

Steps

Inline payload support is available today. To use it:

1. Update your AWS SDK. Install or upgrade Boto3 to the latest version: pip install --upgrade boto3. Verify the installation: pip show boto3.
2. Replace your invocation code. In your application, substitute the S3 upload + InputLocation pattern with a direct Body parameter.
3. Test your invocation by calling the InvokeEndpointAsync API with the Body parameter. Verify that the response contains an OutputLocation field. Poll or monitor the S3 OutputLocation to confirm that your inference result was written successfully.

No changes are needed to your endpoint configuration, model container, or output S3 setup.

Clean up

To avoid ongoing charges, delete the resources used in this walkthrough:

- Delete the SageMaker AI endpoint if it was created for testing (for example, aws sagemaker delete-endpoint --endpoint-name my-async-endpoint).
- Delete the output S3 bucket (if no longer needed). Warning: Deleting an S3 bucket permanently removes the objects within it. Verify you have backed up any inference results you need to retain.
- Remove any IAM policies created specifically for this tutorial.

Conclusion

Inline payload support for SageMaker AI Async Inference removes a common friction point in asynchronous inference workflows: the mandatory S3 upload for every request. For the majority of inference payloads that fit within 128,000 bytes, you can now make a single API call and let SageMaker AI handle the rest.

The feature is designed to be backward-compatible. Existing InputLocation workflows continue unchanged. Both inline and S3 inputs are processed identically once the request is accepted, and models receive identical requests regardless of input source.

Get started today by updating your AWS SDK and using the Body parameter on the SageMaker AI InvokeEndpointAsync API (https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpointAsync.html).
To learn more about asynchronous inference, see the Amazon SageMaker AI Async Inference documentation https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html .
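
As a closing illustration, the inline path and the "branch on size" guidance from this post can be sketched in Python as follows. This is a hedged sketch under stated assumptions, not the notebook's code: it assumes your installed Boto3 version exposes the new parameter as a `Body` keyword argument on `invoke_endpoint_async`, matching the API parameter name given in this post. The helper name `invoke_async` and the `upload_to_s3` callback are hypothetical names introduced for the example.

```python
MAX_INLINE_BYTES = 128_000  # inline payload limit stated in this post


def invoke_async(runtime_client, endpoint_name, payload, upload_to_s3=None,
                 content_type="application/json"):
    """Send small payloads inline via Body; use InputLocation for larger ones.

    runtime_client is expected to behave like boto3's "sagemaker-runtime" client.
    upload_to_s3 is a hypothetical caller-supplied function that takes the payload
    bytes, uploads them, and returns the resulting "s3://..." URI.
    """
    request = {"EndpointName": endpoint_name, "ContentType": content_type}

    if len(payload) <= MAX_INLINE_BYTES:
        # Inline path: no S3 upload. Assumes the SDK accepts Body as described
        # in this post.
        request["Body"] = payload
    else:
        if upload_to_s3 is None:
            raise ValueError(
                f"Payload is {len(payload)} bytes, above the {MAX_INLINE_BYTES}-byte "
                "inline limit, and no upload_to_s3 function was provided."
            )
        # Large path: upload first and pass the object URI instead.
        request["InputLocation"] = upload_to_s3(payload)

    # Body and InputLocation are mutually exclusive, so exactly one is set above.
    response = runtime_client.invoke_endpoint_async(**request)
    return response["OutputLocation"]


# Example wiring (requires AWS credentials and an existing async endpoint):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   output_uri = invoke_async(runtime, "my-async-endpoint", b'{"inputs": "hello"}')
```

Checking the size on the client mirrors the service-side limit, so oversized payloads are routed through S3 instead of being rejected with a validation error.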