How to securely connect ADK agents to models on Cloud Run


The Agent Development Kit (ADK) simplifies authentication for agents and tools, but authentication is more challenging with the LiteLLM connector when accessing models hosted on Cloud Run. This guide explores how to acquire Google-signed OpenID Connect (ID) tokens and inject them into the LiteLLM communication channel using ADK.

Google Cloud Run provides a robust, built-in access control mechanism based on enforced authentication and IAM policies. When it is enabled, Cloud Run accepts only calls made by authenticated accounts that have the Cloud Run Invoker role, protecting your service from unauthorized invocations.

To make an authenticated call, follow these steps:

- Acquire credentials, either by implementing a sign-in process or by using Application Default Credentials.
- Fetch an ID token from a user or workload identity.
- Use the ID token as a bearer token in the HTTP Authorization header when you call the Cloud Run endpoint.
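The three steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the `google-auth` package (with its `requests` transport) is installed and that Application Default Credentials come from a service account or the metadata server. The helper names `fetch_google_id_token` and `build_authorized_request` are hypothetical and are not part of ADK.

```python
import urllib.request


def fetch_google_id_token(audience: str) -> str:
    """Steps 1 and 2: obtain an ID token for the given audience.

    Assumes google-auth is installed; it is imported lazily so the rest
    of this sketch runs without it.
    """
    import google.auth.transport.requests
    from google.oauth2 import id_token

    auth_req = google.auth.transport.requests.Request()
    return id_token.fetch_id_token(auth_req, audience)


def build_authorized_request(url: str, token: str) -> urllib.request.Request:
    """Step 3: attach the ID token as a bearer token."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )


# Placeholder token for illustration only. A real call would pass
# fetch_google_id_token(service_url) instead of "example-token".
service_url = "https://model-123456789012.us-central1.run.app"
request = build_authorized_request(service_url, "example-token")
print(request.get_header("Authorization"))
```

The request object can then be sent with `urllib.request.urlopen(request)` or rebuilt with any other HTTP client; only the `Authorization` header matters to Cloud Run.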

Agent Development Kit (ADK) greatly simplifies implementation of these steps for you. When your agent calls another agent or an MCP (Model Context Protocol) server that runs on Cloud Run, the framework handles discovering the application’s default credentials, fetching the token, and injecting the token into each call between agents and calls from agents to remote MCP tools. The framework also implements token refreshing when the current token is expired. As a developer, you don’t need to implement anything beyond configuring your remote MCP server or agent objects.

The situation is different when you configure an ADK agent to use a model that is deployed in Cloud Run. ADK provides a LiteLLM connector that allows an agent to use non-Gemini models hosted at remote endpoints. However, you will need to take care of making authenticated calls to the model yourself. How can you do that?

Method 1: Static header

The LiteLLM connector uses the litellm Python package to call remote endpoints that expose Ollama, vLLM, and other LLM engines. The package supports passing custom HTTP headers in calls to LiteLLM APIs such as acompletion through the extra_headers parameter, which takes a map of header names to values.

from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm
import google.auth
import google.auth.transport.requests
from google.oauth2 import id_token

# The audience of the ID token is the URL of the Cloud Run service.
aud = "https://model-123456789012.us-central1.run.app"

# Load Application Default Credentials and refresh them.
creds, _ = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

# Some credentials already carry an ID token; otherwise fetch one
# for the target audience.
if hasattr(creds, "id_token") and creds.id_token:
    token = creds.id_token
else:
    token = id_token.fetch_id_token(auth_req, aud)

# The token is fetched once and sent with every call to the model.
model = LiteLlm(
    model="ollama_chat/gemma3:270m",
    api_base=aud,
    extra_headers={
        "Authorization": f"Bearer {token}",
    },
)

agent = Agent(
    name="content_builder",
    model=model,
    instruction="agent system instructions",
)

This approach works as long as the token is valid. Once the fetched token expires, calls to the model fail with the “HTTP 401 Unauthorized” status code because Cloud Run rejects calls made with an expired token. This method fits agents deployed on Cloud Run with a scale-to-zero configuration and a low invocation frequency. In this deployment pattern the agent restarts frequently, and each restart fetches a new authentication token.
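To see how long a fetched token remains usable, you can read its expiry time directly. An ID token is a JWT, and its payload carries an `exp` claim with the expiry as a Unix timestamp. The sketch below decodes that claim without verifying the signature, which is enough for diagnostics but not for trust decisions. The function names and the unsigned sample token are illustrative assumptions so that the example runs offline.

```python
import base64
import json
import time
from typing import Optional


def token_seconds_remaining(token: str, now: Optional[float] = None) -> float:
    """Return seconds until the JWT's `exp` claim (signature NOT verified)."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded with padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    current = time.time() if now is None else now
    return claims["exp"] - current


def make_sample_token(exp: int) -> str:
    """Build an unsigned, illustrative token so the sketch runs offline."""

    def encode(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{encode({'alg': 'none'})}.{encode({'exp': exp})}.signature"


# A sample token that expires one hour after the chosen reference time.
sample = make_sample_token(exp=1_700_003_600)
print(token_seconds_remaining(sample, now=1_700_000_000))
```

Logging this value at startup is a simple way to check whether an instance is likely to outlive its token.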

Method 2: Dynamic token injection

When the agent is expected to run continuously, the token has to be refreshed when a call fails with an authentication error. To implement this, extend the LiteLLMClient class of ADK:

from typing import Any, Optional, Union
from fastapi import HTTPException
from google.adk.models.lite_llm import LiteLLMClient
import google.auth
import google.auth.transport.requests
from google.oauth2 import id_token
from litellm.exceptions import AuthenticationError
from litellm import CustomStreamWrapper, ModelResponse

_creds, _ = google.auth.default()

def _get_auth_token(aud: str) -> Optional[str]:
  """
  Obtains a Google-signed ID token for the given audience (Cloud Run service URL).
  """
    
  try:
    auth_req = google.auth.transport.requests.Request()
    _creds.refresh(auth_req)
    if hasattr(_creds, "id_token") and _creds.id_token:
      return _creds.id_token
    fetched_id_token = id_token.fetch_id_token(auth_req, aud)
    return fetched_id_token
  except Exception as e:
    print(f"Error obtaining ID token for {aud}: {e}")
    return None

class LiteLLMClientEx(LiteLLMClient):
  """
  Overrides the LiteLLMClient to inject a bearer token into the request headers.
  """

  def __init__(self, audience: str, **data: Any) -> None:
    self.token: Optional[str] = None
    self.aud = audience
    super().__init__(**data)

  async def acompletion(
      self, model, messages, tools, **kwargs
  ) -> Union[ModelResponse, CustomStreamWrapper]:
    """
    Updating headers to inject bearer token.
    """

    self.token = self.token or _get_auth_token(self.aud)
    if "extra_headers" not in kwargs or kwargs["extra_headers"] is None:
      kwargs["extra_headers"] = {}
    kwargs["extra_headers"]["Authorization"] = f"Bearer {self.token}"

    try:
      return await super().acompletion(
          model=model, messages=messages, tools=tools, **kwargs
      )
    except (AuthenticationError, HTTPException) as e:
      if isinstance(e, HTTPException) and e.status_code != 401:
        raise
      self.token = _get_auth_token(self.aud)
      kwargs["extra_headers"]["Authorization"] = f"Bearer {self.token}"
      return await super().acompletion(
          model=model, messages=messages, tools=tools, **kwargs
      )

Then use the extended LiteLLMClientEx when initializing the LiteLLM connector:

model = LiteLlm(
  model="ollama_chat/gemma3:270m",
  api_base="https://model-123456789012.us-central1.run.app",
  llm_client=LiteLLMClientEx(
    audience="https://model-123456789012.us-central1.run.app"
  ),
)
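As a variation on Method 2, the token can also be refreshed shortly before it expires instead of only after a 401 response. This is a different technique from the retry shown above, not a replacement for it. The sketch below is a minimal cache under stated assumptions: the class name `CachedToken` is hypothetical, the one-hour lifetime and five-minute margin are assumed defaults, and the fetcher and clock are injected so the example runs without network access.

```python
import time
from typing import Callable, Optional


class CachedToken:
    """Caches a token and refetches it shortly before its assumed expiry."""

    def __init__(
        self,
        fetch: Callable[[], str],
        lifetime_s: float = 3600.0,  # assumption: tokens last about one hour
        margin_s: float = 300.0,  # assumption: refresh five minutes early
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._lifetime_s = lifetime_s
        self._margin_s = margin_s
        self._clock = clock
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Return the cached token, refetching it when it is close to expiry."""
        age = self._clock() - self._fetched_at
        if self._token is None or age >= self._lifetime_s - self._margin_s:
            self._token = self._fetch()
            self._fetched_at = self._clock()
        return self._token


# Offline demonstration with a fake fetcher and a controllable clock.
now = [0.0]
calls = []


def fake_fetch() -> str:
    calls.append(now[0])
    return f"token-{len(calls)}"


cache = CachedToken(fake_fetch, clock=lambda: now[0])
first = cache.get()  # nothing cached yet, so the token is fetched
now[0] = 1000.0
second = cache.get()  # still fresh, so the cached token is reused
now[0] = 3400.0
third = cache.get()  # older than lifetime minus margin, so it is refetched
print(first, second, third)
```

In a client such as LiteLLMClientEx, a cache like this could supply the token before each call, with the 401 retry kept as a fallback for tokens that expire earlier than assumed.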

Note that the exception handler in the LiteLLMClientEx code snippet, which handles the HTTP 401 status error, is simplified for this post. Cloud Run returns HTTP 401 Unauthorized in multiple scenarios, including a mismatched token audience, a wrong token type, a malformed token, and more. The recommended way to distinguish an expired token from other scenarios is to examine the WWW-Authenticate header for the error description.
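A sketch of such a check follows. It assumes the header uses the Bearer challenge format from RFC 6750, with `error` and `error_description` parameters. The function names are hypothetical, the sample header values are illustrative rather than captured from Cloud Run, and matching on the word "expired" is a heuristic that you should confirm against real responses from your service.

```python
import re
from typing import Dict

# Matches key="value" pairs inside the challenge parameters.
_PARAM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_bearer_challenge(header_value: str) -> Dict[str, str]:
    """Extract key="value" parameters from a Bearer WWW-Authenticate header."""
    scheme, _, params = header_value.partition(" ")
    if scheme.lower() != "bearer":
        return {}
    return dict(_PARAM_RE.findall(params))


def looks_expired(header_value: str) -> bool:
    """Heuristic: an invalid_token error whose description mentions expiry.

    The matched wording is an assumption, not documented Cloud Run behavior.
    """
    params = parse_bearer_challenge(header_value)
    description = params.get("error_description", "").lower()
    return params.get("error") == "invalid_token" and "expired" in description


# Illustrative header values, not captured from a real Cloud Run response.
expired = 'Bearer error="invalid_token", error_description="The token is expired"'
wrong_aud = 'Bearer error="invalid_token", error_description="Invalid audience"'
print(looks_expired(expired), looks_expired(wrong_aud))
```

With a check like this, the retry in the exception handler can be limited to the expired-token case, while other 401 causes are raised immediately.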

Method 3: Using litellm-proxy

If you cannot modify the agent code, or Method 2 does not work for you for other reasons, you can use a litellm-proxy configuration to automatically populate the Authorization header of the proxied requests. The LiteLLM proxy can be deployed as a sidecar container alongside your agent, offloading the authentication logic from your primary agent code.

Once configured, use the local endpoint with the proxy port in the LiteLLM connector configuration; no additional modifications of the agent’s code are needed.

model = LiteLlm(
    model="ollama_chat/gemma3:270m",
    api_base="http://localhost:4000",
)

What’s next

Wrapping up, there are three methods for injecting an ID token into the LiteLLM connector’s communication with a model hosted on Cloud Run. Choosing the best one for you is simple.

Do you expect your agent instance to run less than one hour before being terminated? If you deploy your agent on Cloud Run with scaling down to zero and expect a low request frequency, your instance will be shut down after being idle. In this scenario you can use Method 1. It is simple, and the fetched ID token, which is valid for about one hour, is unlikely to expire before the instance is shut down.

Do you need to centralize authentication logic for calling your model’s endpoint outside of agents? If you plan to run more than one agent that uses the same model, or you have decided not to add the model authentication logic to the agent, use Method 3: deploy the LiteLLM proxy as a sidecar to each agent or as a standalone service, and configure the agents to use the proxy’s endpoint to access the model. One subtlety of running the proxy as a standalone Cloud Run service is that you will need to implement authentication logic to call that service, which is effectively the same as implementing Method 2.

If you answered “No” to both of the above questions, your choice should be Method 2.

Here are a few useful references that you can use to move forward with implementing authentication:

- Take the codelabs Building a Multi-Agent System using Gemini and Gemma models and Create a Cloud Run service with a sidecar.
- Explore the Method 2 implementation in the Multi-Agent Course Creator with Self-Hosted LLM on GPU GitHub repo.
- Read the blog post Securely Call Cloud Run Service From Anywhere.

Original article: leoy.blog
