Amazon SageMaker AI Async Inference now supports inline request payloads


Today, we’re announcing inline payload support for Amazon SageMaker AI Async Inference. Customers can now send inference payloads directly in the request body of the InvokeEndpointAsync API, removing the need to upload input data to Amazon Simple Storage Service (Amazon S3) before each invocation.

For payloads up to 128,000 bytes, this removes an entire network round-trip, simplifies client-side code, and reduces the operational surface area of asynchronous inference workloads. In this post, we explain the motivation behind this feature, walk through the customer experience before and after, and show you how to start using inline payloads today.

Background: How async inference worked before #

You can use Amazon SageMaker AI Async Inference to queue inference requests and process them asynchronously. It’s a good fit for workloads with large payloads, variable traffic, or tolerance for seconds-to-minutes latency. It supports automatic scaling to zero, making it cost-efficient for bursty or batch-style workloads.

Until now, the workflow required two steps on every invocation:

1. Upload the input payload to an Amazon S3 bucket.
2. Invoke the endpoint, passing the S3 object URI as InputLocation.

The endpoint processes the request asynchronously and writes the output to a configured S3 output location, which the client polls or receives via Amazon Simple Notification Service (Amazon SNS) notification.

This two-step pattern works well for large payloads (images, audio, multi-MB documents). But for customers with small input payloads (in KB) who need longer processing times than real-time inference allows, the mandatory S3 dependency added unnecessary complexity.

What’s new: Inline payload via the Body parameter #

With today’s launch, InvokeEndpointAsync accepts a new Body parameter. When present, the payload is sent inline in the API request itself, with no S3 upload required.

Key details:

| Aspect | Details |
| --- | --- |
| New parameter | Body, raw bytes, capped at 128,000 bytes. |
| Max inline size | 128,000 bytes (raw payload). |
| Mutual exclusivity | Body and InputLocation are mutually exclusive. The API rejects requests that set both. |
| Output behavior | Unchanged. Output is written to the S3 OutputLocation. |
| Endpoint compatibility | Designed to work with existing async endpoints; no model or container changes expected. |
| Error handling | Size and mutual-exclusivity violations return synchronous ValidationError responses. |
| Availability | Available in 31 commercial AWS Regions (BOM, PDX, YUL, IAD, CMH, SFO, LHR, ICN, SYD, HKG, YYC, GRU, QRO, DUB, CDG, FRA, ZRH, ARN, ZAZ, NRT, KIX, SIN, CGK, MEL, KUL, BKK, HYD, TPE, CPT, MXP, TLV). |
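The size cap and the mutual-exclusivity rule can also be checked on the client before a request is sent. The following is a minimal sketch, not an official helper: `build_invoke_kwargs` and `MAX_INLINE_BYTES` are hypothetical names, and rejecting a request that sets neither input is an assumption. It also assumes the Boto3 keyword is named `Body`, matching the API parameter described here.

```python
MAX_INLINE_BYTES = 128_000  # inline cap stated in the key details


def build_invoke_kwargs(endpoint_name, body=None, input_location=None,
                        content_type="application/json"):
    """Return keyword arguments for invoke_endpoint_async, or raise ValueError."""
    if body is not None and input_location is not None:
        raise ValueError("Body and InputLocation are mutually exclusive")
    if body is None and input_location is None:
        # Assumption: a request needs exactly one input source.
        raise ValueError("set either Body or InputLocation")

    kwargs = {"EndpointName": endpoint_name, "ContentType": content_type}
    if body is not None:
        if len(body) > MAX_INLINE_BYTES:
            raise ValueError(
                f"inline payload is {len(body)} bytes; the cap is {MAX_INLINE_BYTES}"
            )
        kwargs["Body"] = body
    else:
        kwargs["InputLocation"] = input_location
    return kwargs
```

The returned dictionary could then be unpacked into the call, for example `runtime.invoke_endpoint_async(**kwargs)`. A local check like this only saves a round-trip; the service-side ValidationError remains the authoritative answer.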

Before and after: The customer experience #

The change is clearest in code. The two examples that follow perform the same async invocation against the same endpoint. The first uses the S3 upload step that was required until now, and the second uses the inline Body parameter that replaces it.

Before: Upload to S3 first, then invoke
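A minimal sketch of the two-step pattern, not the article's own listing, might look like the following. The clients are passed in (for example `boto3.client("s3")` and `boto3.client("sagemaker-runtime")`); the function name, the `async-inputs/` key prefix, and the JSON content type are illustrative assumptions.

```python
import json
import uuid


def invoke_with_s3_upload(s3_client, runtime_client, endpoint_name, input_bucket, payload):
    """Upload the payload to S3, then invoke the async endpoint with its URI."""
    body = json.dumps(payload).encode("utf-8")

    # Step 1: stage the input in S3. A UUID key avoids collisions between requests.
    key = f"async-inputs/{uuid.uuid4()}.json"  # hypothetical key layout
    s3_client.put_object(
        Bucket=input_bucket, Key=key, Body=body, ContentType="application/json"
    )

    # Step 2: invoke the endpoint, pointing it at the staged object.
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        InputLocation=f"s3://{input_bucket}/{key}",
        ContentType="application/json",
    )
    return response["OutputLocation"]
```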

This approach requires:

- An S3 client and input bucket provisioned.
- AWS Identity and Access Management (IAM) s3:PutObject permission on the caller.
- A naming scheme (UUID or similar) to avoid key collisions.
- A cleanup strategy for stale input objects.

After: Send the payload inline
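A minimal sketch of the inline call, again not the article's own listing. It assumes the installed Boto3 version exposes the new parameter as the `Body` keyword of `invoke_endpoint_async`; the function name and the JSON content type are illustrative.

```python
import json


def invoke_inline(runtime_client, endpoint_name, payload):
    """Send the payload in the request itself; no S3 upload is involved."""
    body = json.dumps(payload).encode("utf-8")
    response = runtime_client.invoke_endpoint_async(
        EndpointName=endpoint_name,
        Body=body,  # assumed keyword name, matching the API's Body parameter
        ContentType="application/json",
    )
    # Output handling is unchanged: the result is still written to S3.
    return response["OutputLocation"]
```

A caller would pass a SageMaker runtime client, for example `invoke_inline(boto3.client("sagemaker-runtime"), "my-async-endpoint", {"prompt": "hello"})`.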

No S3 client, no uuid, no input bucket, no IAM grants on the input path, no stale-object cleanup.

Customer benefits #

Sending the payload inline removes a network hop and a dependency from each request. That translates into five concrete benefits:

- Reduced latency. One network round-trip and one S3 PUT removed per request. For fan-out workloads, this latency savings compounds meaningfully.
- Simpler architecture. Avoids the input bucket provisioning, lifecycle policies, cross-account access patterns, and the caller’s IAM s3:PutObject permission on the input path.
- Fewer error paths. The request is a single API call. It either enqueues or it doesn’t.
- Lower cost. Removes the S3 PUT charge for the input upload on every inline invocation.
- Immediate validation feedback. Size and mutual-exclusivity errors are returned synchronously.

When to use each approach #

Inline payloads are typically the simpler choice for small payloads, but InputLocation still has its place. Use the following table to decide which path fits a given workload:

| Scenario | Recommended approach |
| --- | --- |
| Payload <= 128,000 bytes (JSON prompts, structured data) | Inline Body. Simpler. Avoids one network round-trip and S3 PUT charges. |
| Payload > 128,000 bytes (images, audio, large documents) | InputLocation. Upload to S3 first. |
| Mixed workload with variable payload sizes | Branch on size. Use Body for small, InputLocation for large. |
| Need to retain input data in S3 for audit or replay | InputLocation. Keeps inputs in your bucket. |
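For the mixed-workload row, the branch on size could be sketched as follows. This is one possible shape under stated assumptions, not a prescribed implementation: `invoke_auto`, `MAX_INLINE_BYTES`, and the `async-inputs/` key prefix are hypothetical, and the `Body` keyword is assumed to be available in the installed Boto3 version.

```python
import uuid

MAX_INLINE_BYTES = 128_000  # inline cap from the announcement


def invoke_auto(runtime_client, s3_client, endpoint_name, input_bucket,
                payload_bytes, content_type="application/json"):
    """Send small payloads inline; stage larger ones in S3 first.

    Returns a tuple of (input mode used, OutputLocation).
    """
    common = {"EndpointName": endpoint_name, "ContentType": content_type}

    if len(payload_bytes) <= MAX_INLINE_BYTES:
        response = runtime_client.invoke_endpoint_async(Body=payload_bytes, **common)
        return "Body", response["OutputLocation"]

    # Too large for inline: fall back to the upload-then-invoke path.
    key = f"async-inputs/{uuid.uuid4()}"  # hypothetical key layout
    s3_client.put_object(
        Bucket=input_bucket, Key=key, Body=payload_bytes, ContentType=content_type
    )
    response = runtime_client.invoke_endpoint_async(
        InputLocation=f"s3://{input_bucket}/{key}", **common
    )
    return "InputLocation", response["OutputLocation"]
```

Workloads that must retain inputs for audit or replay would skip the inline branch and always take the S3 path.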

Getting started #

See the example code notebook for a full walkthrough. Before you begin, make sure you have:

- An existing Amazon SageMaker AI Async Inference endpoint (verify with aws sagemaker describe-endpoint --endpoint-name my-async-endpoint).
- The latest AWS SDK for Python (Boto3) installed and configured with credentials.
- IAM permissions for sagemaker:InvokeEndpointAsync.
- An S3 output bucket configured for your async endpoint (for example, my-output-bucket).
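The endpoint check in the first prerequisite can also be done from Python. The sketch below assumes a SageMaker client such as `boto3.client("sagemaker")` is passed in; `async_output_path` is a hypothetical helper name. It relies on the `EndpointStatus` and `AsyncInferenceConfig` fields of the DescribeEndpoint response.

```python
def async_output_path(sagemaker_client, endpoint_name):
    """Return the endpoint's configured S3 output path.

    Raises RuntimeError if the endpoint is not in service or has no
    async inference configuration.
    """
    desc = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)

    status = desc["EndpointStatus"]
    if status != "InService":
        raise RuntimeError(f"endpoint {endpoint_name!r} is {status}, not InService")

    async_config = desc.get("AsyncInferenceConfig")
    if not async_config:
        raise RuntimeError(f"endpoint {endpoint_name!r} is not an async endpoint")

    # May be None if the endpoint config does not set an output path.
    return async_config.get("OutputConfig", {}).get("S3OutputPath")
```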

Note: Following this guide uses billable AWS resources. SageMaker AI async inference endpoints incur charges for instance hours, and S3 buckets incur charges for storage and requests. Follow the cleanup steps after completing the tutorial to avoid ongoing charges.

Steps

Inline payload support is available today. To use it:

1. Update your AWS SDK. Install or upgrade Boto3 to the latest version: pip install --upgrade boto3.
2. Verify the installation: pip show boto3.
3. Replace your invocation code. In your application, substitute the S3 upload + InputLocation pattern with a direct Body parameter, as described in the before-and-after section.
4. Test your invocation by calling the InvokeEndpointAsync API with the Body parameter.
5. Verify the response contains an OutputLocation field.
6. Poll or monitor the S3 OutputLocation to confirm your inference result was written successfully.
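The final polling step could be sketched as follows. This is an illustrative helper, not part of the SDK: `split_s3_uri` and `wait_for_output` are hypothetical names, the timeout and interval defaults are arbitrary, and the S3 client is passed in. It assumes the caller has permission to list the output bucket, so that a missing object surfaces as `NoSuchKey` rather than an access error.

```python
import time
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def wait_for_output(s3_client, output_location, timeout_s=300.0, interval_s=5.0,
                    sleep=time.sleep):
    """Poll the OutputLocation until the result object exists, then return its bytes."""
    bucket, key = split_s3_uri(output_location)
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            obj = s3_client.get_object(Bucket=bucket, Key=key)
            return obj["Body"].read()
        except s3_client.exceptions.NoSuchKey:
            # Result not written yet: wait and retry until the deadline passes.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no result at {output_location} after {timeout_s}s")
            sleep(interval_s)
```

Polling is the simplest option to demonstrate; for production traffic, the Amazon SNS notification path mentioned earlier avoids repeated S3 requests.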

No changes are needed to your endpoint configuration, model container, or output S3 setup.

Clean up #

To avoid ongoing charges, delete the resources used in this walkthrough:

- Delete the SageMaker AI endpoint if it was created for testing.
- Delete the output S3 bucket (if no longer needed). Warning: Deleting an S3 bucket permanently removes the objects within it. Verify you have backed up any inference results you need to retain.
- Remove any IAM policies created specifically for this tutorial.
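The first two cleanup items could be scripted along these lines. This is a sketch under assumptions, not the article's own commands: `delete_test_resources` is a hypothetical name, the clients are passed in (for example `boto3.client("sagemaker")` and `boto3.client("s3")`), and the bucket is assumed to be unversioned. It is destructive, so it should only be pointed at resources created for this walkthrough.

```python
def delete_test_resources(sagemaker_client, s3_client, endpoint_name, output_bucket=None):
    """Delete the test endpoint and, optionally, empty and delete the output bucket.

    Returns the number of S3 objects deleted.
    """
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)

    if output_bucket is None:
        return 0  # keep the bucket and any inference results in it

    deleted = 0
    # A bucket must be empty before it can be deleted.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=output_bucket):
        objects = [{"Key": item["Key"]} for item in page.get("Contents", [])]
        if objects:
            s3_client.delete_objects(Bucket=output_bucket, Delete={"Objects": objects})
            deleted += len(objects)

    s3_client.delete_bucket(Bucket=output_bucket)
    return deleted
```

IAM policies are left out on purpose, since which ones are safe to remove depends on how they were created.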

Conclusion #

Inline payload support for SageMaker AI Async Inference removes a common friction point in asynchronous inference workflows: the mandatory S3 upload for every request. For the majority of inference payloads that fit within 128,000 bytes, you can now make a single API call and let SageMaker AI handle the rest. The feature is designed to be backward-compatible. Existing InputLocation workflows continue unchanged. Both inline and S3 inputs are processed identically once the request is accepted, and models receive identical requests regardless of input source.

Get started today by updating your AWS SDK and using the Body parameter on the SageMaker AI InvokeEndpointAsync API. To learn more about asynchronous inference, see the Amazon SageMaker AI Async Inference documentation.

Source: aws.amazon.com (original article)