{"slug": "announcing-openai-compatible-api-support-for-amazon-sagemaker-ai-endpoints", "title": "Announcing OpenAI-compatible API support for Amazon SageMaker AI endpoints", "summary": "Amazon SageMaker AI now supports an OpenAI-compatible API for real-time inference endpoints, allowing users to invoke models by changing only the endpoint URL without custom clients or code rewrites. The update enables bearer token authentication and works with the OpenAI SDK, LangChain, and Strands Agents frameworks for agentic workflows and multi-model hosting on dedicated infrastructure.", "body_md": "# Announcing OpenAI-compatible API support for Amazon SageMaker AI endpoints\n\nToday, Amazon SageMaker AI introduces OpenAI-compatible API support for real-time inference endpoints. If you use the OpenAI SDK, LangChain, or Strands Agents, you can now invoke models on SageMaker AI by changing only your endpoint URL. You don’t need a custom client, a SigV4 wrapper, or code rewrites.\n\n## Overview\n\nWith this launch, SageMaker AI endpoints expose an `/openai/v1` path that accepts Chat Completions requests and returns the container's responses as is, including streamed responses. The OpenAI-compatible path is turned on for all endpoints and inference components that use the standard SageMaker AI APIs and SDK.\n\nSageMaker AI routes based on the endpoint name in the URL, so any OpenAI-compatible client works out of the box. You can now create time-limited bearer tokens for your endpoints and use them with your OpenAI clients.\n\nFor a working example that includes deployment and invocation, see the accompanying [notebook on GitHub](https://github.com/aws-samples/sagemaker-genai-hosting-examples/blob/main/03-features/openai/sagemaker-inference-openai-api.ipynb).\n\n“We run AI coding agents that use multiple LLM providers through an LLM gateway (Bifrost) speaking the OpenAI chat completions protocol. 
The bearer token feature lets us add SageMaker as a drop-in OpenAI-compatible inference endpoint (no custom SigV4 signing) so it works natively with our gateway, Vercel AI SDK, and standard OpenAI clients,” says Giorgio Piatti, AI/ML Engineer at Caffeine.AI.\n\n## Use cases\n\n### Agentic workflows on owned infrastructure\n\nIf you build multi-step AI agents with frameworks like Strands Agents or LangChain, you can now run those workflows entirely on your own SageMaker AI endpoints. Your agents call models using the same OpenAI-compatible interface they were built on, but inference runs on dedicated GPU instances in your own account.\n\n### Multi-model hosting with a single interface\n\nIf you run multiple models (for example, Llama for general tasks, a fine-tuned Mistral for domain-specific work, and a smaller model for classification), you can host all of them on a single SageMaker AI endpoint using inference components. Each model gets its own resource allocation, and every one is callable through the same OpenAI SDK. You don’t need separate API clients or routing logic in application code.\n\n### Serving fine-tuned models without code changes\n\nIf you fine-tune open source models for your specific use case, you can deploy them on SageMaker AI and call them through the same OpenAI-compatible interface that your applications already use. The only change is the endpoint URL. 
The rest of the application (the SDK calls, the streaming logic, and the prompt formatting) stays the same.\n\n## Solution overview\n\nIn this post, we walk through the following:\n\n- How bearer token authentication works with SageMaker AI endpoints.\n- Deploying and invoking a single-model endpoint.\n- Deploying and invoking inference components for multi-model deployments.\n- Integration with the Strands Agents framework.\n\n### Prerequisites\n\nTo follow along with this walkthrough, you must have the following:\n\n- An AWS account with permissions to create SageMaker AI endpoints.\n- The SageMaker Python SDK (`pip install sagemaker`).\n- The OpenAI Python SDK (`pip install openai`).\n- A model stored in Amazon Simple Storage Service (Amazon S3). For example, Qwen3-4B downloaded from Hugging Face.\n- An AWS Identity and Access Management (IAM) execution role to create the endpoints, with the `AmazonSageMakerFullAccess` policy.\n- An IAM execution role with the `sagemaker:CallWithBearerToken` and `sagemaker:InvokeEndpoint` permissions to invoke the endpoint.\n\n### Authentication with bearer tokens\n\nSageMaker AI OpenAI-compatible endpoints use bearer token authentication. The SageMaker Python SDK includes a token generator that creates time-limited tokens (valid for up to 12 hours) from your existing AWS credentials. No additional secrets or API keys are required.\n\nThe token is signed with your role or user credentials, and that identity requires the `sagemaker:CallWithBearerToken` and `sagemaker:InvokeEndpoint` action permissions.\n\n### Generate a token\n\nUse the following Python script to generate a token.\n\nThe token generator uses whatever AWS credentials are available in your environment: IAM user credentials, an instance profile on Amazon Elastic Compute Cloud (Amazon EC2), or an AWS IAM Identity Center (SSO) session.\n\nThe `generate_token` function generates a time-limited bearer token for authenticating with SageMaker APIs. 
By default, tokens are valid for 12 hours, though you can override this with the `expiry` parameter using a `timedelta` value anywhere between 1 second and 12 hours. The function accepts a region, an optional `aws_credentials_provider`, and the expiry duration. If no AWS Region is provided, it falls back to the `AWS_REGION` environment variable. If no credentials provider is supplied, it resolves credentials using the default AWS credential chain, which searches multiple sources, including environment variables, `~/.aws/credentials`, `~/.aws/config`, container credentials, and instance profiles. For the full resolution order, see the [Boto3 credentials documentation](https://docs.aws.amazon.com/boto3/latest/guide/configuration.html).\n\n### Auto-refresh tokens for long-running applications\n\nFor applications that run continuously, you can implement an auto-refreshing pattern using `httpx` so that a fresh token is generated on each request:\n\n### IAM permissions\n\nThe IAM role or user invoking the endpoint needs the following permissions:\n\nAs a best practice, always restrict the `Resource` to specific endpoint ARNs for `InvokeEndpoint` rather than using a wildcard. The bearer token generated from this role has the same level of access, so a narrowly scoped policy limits the blast radius if a token is inadvertently exposed. Note that `CallWithBearerToken` requires a wildcard (`\"*\"`) for the `Resource` field. It doesn’t support resource-level restrictions.\n\n### How the token works\n\nThe bearer token is a base64-encoded SigV4 pre-signed URL. When you call `generate_token`, the SageMaker AI SDK constructs a request to the SageMaker AI service for the `CallWithBearerToken` action, signs it locally using your AWS credentials, and encodes the resulting signed URL as a portable token string. No network call is made during token generation. The signing happens entirely on the client side. 
When you present this token to a SageMaker AI endpoint, the service decodes it, validates the SigV4 signature, verifies that the token hasn’t expired, and confirms that the originating IAM identity has the required permissions. The token’s effective lifetime is the lesser of the expiry value and the remaining validity of the AWS credentials used to sign it.\n\n**Security best practice:** The bearer token carries the same authorization as the underlying AWS credentials used to generate it. Treat tokens with the same care as credentials. Scope the IAM role used for token generation to the minimum permissions required: `sagemaker:CallWithBearerToken`, plus `sagemaker:InvokeEndpoint` on only the endpoint ARNs that the caller needs to access. Don’t generate tokens from roles with expansive permissions, such as those granted by the `AdministratorAccess` or `AmazonSageMakerFullAccess` managed policies.\n\nDon’t store tokens on disk, in environment variables, in configuration files, in databases, or in distributed caches. Don’t log tokens, and only transmit them over encrypted communication protocols such as HTTPS. Token generation is a local operation with no network overhead, so the recommended practice is to generate a fresh token at the point of use or use the auto-refreshing `httpx.Auth` pattern described earlier. This reduces the risk of token leakage and helps you use a token with maximum remaining validity. As a best practice, set the token expiry to the shortest duration your workload requires.\n\n### Deploy a single-model endpoint\n\nA single-model endpoint hosts one model and serves requests directly. The following example deploys Qwen3-4B using the SageMaker AI vLLM Deep Learning Container on an `ml.g6.2xlarge` instance.\n\nNote: SageMaker AI endpoints incur charges while in service, regardless of traffic. 
For more details, see the [Amazon SageMaker AI pricing page](https://aws.amazon.com/sagemaker-ai/pricing/).\n\nThe endpoint transitions to `InService` status within a few minutes. When ready, it serves both the standard SageMaker AI `/invocations` path and the OpenAI-compatible path at `/openai/v1/chat/completions`.\n\n### Invoke a single-model endpoint\n\nWith the endpoint in service, invoke it using the OpenAI Python SDK. The base URL follows this format:\n\nThe `model` field is passed through to the container. Because SageMaker AI routes requests based on the endpoint name in the URL, you can keep this field empty or set it to match the model name your container expects.\n\n### Deploy an inference component endpoint\n\nWith inference components, you can host multiple models on a single endpoint, each with dedicated compute resource allocations. In this setup, the model is associated with the component rather than the endpoint configuration:\n\nYou can create additional inference components on the same endpoint to host multiple models with independent scaling and resource allocation.\n\n### Invoke inference components\n\nTo invoke a specific inference component, include its name in the URL path:\n\nThe following example shows two inference components on a shared endpoint, each targeted by a separate OpenAI client that shares a connection pool:\n\nThe shared `httpx.Client` allows both OpenAI client instances to reuse the same TLS sessions and connection pool.\n\n### Integrate with Strands Agents\n\nStrands Agents is an open source SDK for building AI agents. Because Strands Agents supports OpenAI-compatible model providers, you can now run multi-agent workflows entirely on your own SageMaker AI infrastructure. This gives you the flexibility of agentic applications with the control of dedicated endpoints. 
Your data never leaves your account, and you choose exactly which model version your agents run.\n\n## Clean up\n\nTo avoid ongoing charges, delete your endpoints and associated resources when you’re done. SageMaker AI endpoints incur costs while in service, regardless of whether they are receiving traffic.\n\n## Conclusion\n\nWith OpenAI-compatible API support, Amazon SageMaker AI removes the integration barrier between where most AI applications are today and the infrastructure they need to scale. You can keep your existing code, use any OpenAI-compatible framework, and run inference on dedicated endpoints with the GPU, scaling, and data residency controls you need. To get started, deploy a model on a SageMaker AI real-time endpoint using a [supported container](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-openai-compatible.html#realtime-endpoints-openai-compatible-containers), install the [SageMaker Python SDK](https://sagemaker.readthedocs.io/), and point your OpenAI client at the endpoint URL. 
To learn more, see [Use SageMaker AI with OpenAI-compatible APIs](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-openai-compatible.html) in the *Amazon SageMaker AI Developer Guide*, or open the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/) to create your first endpoint.", "url": "https://wpnews.pro/news/announcing-openai-compatible-api-support-for-amazon-sagemaker-ai-endpoints", "canonical_source": "https://aws.amazon.com/blogs/machine-learning/announcing-openai-compatible-api-support-for-amazon-sagemaker-ai-endpoints/", "published_at": "2026-05-20 23:59:57+00:00", "updated_at": "2026-05-26 08:08:20.088148+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "generative-ai", "ai-infrastructure"], "entities": ["Amazon SageMaker AI", "OpenAI", "LangChain", "Strands Agents", "GitHub", "Bifrost", "AWS"], "alternates": {"html": "https://wpnews.pro/news/announcing-openai-compatible-api-support-for-amazon-sagemaker-ai-endpoints", "markdown": "https://wpnews.pro/news/announcing-openai-compatible-api-support-for-amazon-sagemaker-ai-endpoints.md", "text": "https://wpnews.pro/news/announcing-openai-compatible-api-support-for-amazon-sagemaker-ai-endpoints.txt", "jsonld": "https://wpnews.pro/news/announcing-openai-compatible-api-support-for-amazon-sagemaker-ai-endpoints.jsonld"}}