Building Micro Agents as Production-Grade Microservices

This article describes how to build production-grade AI agent systems using a microservices architecture, moving beyond single-process prototypes that fail at scale. It advocates for designing each "micro agent" as an independent service with its own API contract, memory scope, and SLA, using technologies like FastAPI, gRPC, Kafka, and Kubernetes. The guide provides concrete implementation patterns including stateless LLM inference, external memory stores, idempotent tool calls, async task queues, and a standardized project structure with health checks and observability.

Build production-grade AI agent systems using microservices. Covers FastAPI, gRPC, Kafka, Kubernetes, OpenTelemetry, and fault-tolerant orchestration patterns in Python.

Table of Contents

- Introduction & Motivation
- Core Architecture Principles
- Agent Service Design
- The AgentRunner Loop
- Inter-Agent Communication
- Tool Registry Service
- Memory Architecture
- Context Window Management
- Orchestrator & Supervisor Pattern
- Security & Authorization
- Observability: Traces, Logs, Metrics
- Deployment on Kubernetes
- Scaling Strategies
- Fault Tolerance & Retry Strategies
- Testing Agent Microservices
- CI/CD Pipeline for Agent Services
- Cost Management & Token Budgeting
- Production Readiness Checklist
- Reference Architecture Diagram

Introduction & Motivation

Why monolithic agent systems fail in production

A single-process agent that handles reasoning, tool calls, memory retrieval, and output generation works well in prototypes. In production it breaks in predictable ways:

- Latency coupling: one slow tool call blocks the entire inference loop
- Unscalable compute: you cannot scale the summarization workload independently from the search workload
- Blast radius: a single LLM API timeout or memory corruption takes the whole system down
- Zero deployment granularity: updating one tool integration requires redeploying everything
- No isolation for billing: it is impossible to attribute compute cost to individual agent functions

The microservice solution

Each autonomous capability becomes an independently deployable, independently scalable service with:

- Its own API surface (HTTP/gRPC)
- Its own health checks and readiness probes
- Its own memory scope (no shared in-process state)
- Its own tool bindings (resolved at runtime from a Tool Registry)
- Its own observability (distributed traces, metrics, structured logs)

What is a Micro Agent?
A micro agent is a bounded autonomous service that:

- Accepts a task (prompt + context + session ID) via an API call
- Runs a plan → act → observe loop using an LLM backend
- Invokes tools via a centralized Tool Registry
- Stores and retrieves conversation state from an external memory store
- Returns a typed result or emits an event to downstream consumers

Key insight: A micro agent is not a "smart function"; it is a service with its own API contract, memory scope, failure modes, and SLA. Design it accordingly.

Core Architecture Principles

Single Responsibility

Each agent owns exactly one reasoning domain. Examples used later in this article include agent-search, agent-email, agent-code, and agent-data.

Stateless Reasoning, Stateful Memory

The LLM inference step must be stateless. Memory lives in external stores, such as the Redis session cache and the vector database used by the ContextManager later in this article. No conversation history should ever live in in-process RAM between requests.

Schema-First Tool Contracts

Every tool must have a JSON Schema definition published to a shared Tool Registry before any agent can invoke it. No ad-hoc function signatures. This enables:

- Runtime input validation before LLM output reaches backend services
- Auto-generated documentation
- Tool versioning with backwards compatibility checks

Idempotent Actions

Any tool call that modifies external state (send email, write to DB, trigger webhook) must be idempotent. Strategies:

- Use idempotency keys at the HTTP layer (pass an Idempotency-Key header)
- Use message deduplication at the queue level (Kafka exactly-once semantics)
- Design tool handlers to be safe to retry: check-then-act patterns

Async by Default

Long-running agent tasks (multi-step research, code generation + execution) must use async task queues, not synchronous HTTP with long timeouts.

    Client ──► POST /tasks      ──► Kafka/BullMQ ──► AgentWorker
    Client ──► GET /tasks/{id}  ──► Redis (status polling)
           ◄── WebSocket/SSE push (optional)

Explicit Context Boundaries

Each agent invocation carries a bounded context packet; message histories must never grow without limit.
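As a rough illustration of what "bounded" can mean in practice, the sketch below enforces both a message-count limit and a size limit every time a message is appended. The `ContextPacket` class, its field names, and the character-based size limit (a crude stand-in for a real token count) are assumptions made for this example, not part of the article's codebase.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Message = Dict[str, str]


@dataclass
class ContextPacket:
    """Hypothetical bounded context handed to a single agent invocation."""
    session_id: str
    max_messages: int = 12
    max_chars: int = 8_000  # crude stand-in for a token limit
    messages: List[Message] = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        """Append a message, then trim so both bounds hold again."""
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _size(self) -> int:
        return sum(len(m["content"]) for m in self.messages)

    def _trim(self) -> None:
        # Drop the oldest non-system message until the packet fits.
        while len(self.messages) > self.max_messages or self._size() > self.max_chars:
            for i, msg in enumerate(self.messages):
                if msg["role"] != "system":
                    del self.messages[i]
                    break
            else:
                break  # only system messages remain; nothing more to drop


packet = ContextPacket(session_id="s1", max_messages=3, max_chars=1_000)
packet.add("system", "You are a search agent.")
for text in ("first question", "second question", "third question"):
    packet.add("user", text)
```

After the last `add`, the packet holds the system message plus the two most recent user messages; the oldest user turn has been dropped rather than allowed to accumulate.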
A ContextManager service compresses or summarizes history before it is injected into the prompt.

Agent Service Design

Project Layout

Each agent is a containerized FastAPI or gRPC service with this canonical structure:

    agent-search/
    ├── agent/
    │   ├── core.py        # AgentRunner: plan → act → observe loop
    │   ├── prompts.py     # System prompt + few-shot templates
    │   ├── memory.py      # ContextManager: load/compress/save
    │   ├── tools.py       # Tool bindings (calls Tool Registry)
    │   └── schemas.py     # Pydantic models for all I/O
    ├── api/
    │   ├── routes.py      # POST /run, POST /tasks, GET /tasks/{task_id}
    │   ├── middleware.py  # Auth, rate limiting, request tracing
    │   └── deps.py        # Dependency injection: DB, Redis, LLM client
    ├── tests/
    │   ├── unit/
    │   ├── integration/
    │   └── fixtures/
    ├── Dockerfile
    ├── pyproject.toml
    └── k8s/
        ├── deployment.yaml
        ├── service.yaml
        ├── hpa.yaml
        └── configmap.yaml

API Contract

Every agent exposes these HTTP endpoints at minimum:

    POST /run              Submit a task (sync, short tasks only)
    POST /tasks            Submit a task (async, returns task_id)
    GET  /tasks/{task_id}  Poll task status and result
    GET  /health           Liveness probe
    GET  /ready            Readiness probe (checks LLM + memory store)
    GET  /metrics          Prometheus metrics endpoint

    # agent/schemas.py
    from enum import Enum
    from typing import Any, Dict, Optional

    from pydantic import BaseModel, Field


    class TaskStatus(str, Enum):
        PENDING = "pending"
        RUNNING = "running"
        COMPLETED = "completed"
        FAILED = "failed"
        CANCELLED = "cancelled"


    class AgentTask(BaseModel):
        id: str
        session_id: str
        prompt: str
        metadata: Dict[str, Any] = Field(default_factory=dict)
        max_steps: int = Field(default=10, ge=1, le=25)
        token_budget: int = Field(default=8192, ge=512, le=32768)


    class AgentResult(BaseModel):
        task_id: str
        status: TaskStatus
        output: Optional[str] = None
        steps_used: int = 0
        tokens_used: int = 0
        tool_calls: int = 0
        error: Optional[str] = None
        duration_ms: int = 0

The AgentRunner Loop

Full Implementation

    # agent/core.py
    import asyncio
    import time

    from opentelemetry import trace
    from tenacity import retry, stop_after_attempt, wait_exponential_jitter

    from agent.memory import ContextManager
    from agent.schemas import AgentResult, AgentTask, TaskStatus
    # AgentConfig, LLMClient, ToolRegistryClient, AgentMetrics, TokenBudgetExceeded,
    # build_messages and tool_result_messages are project-local helpers not shown here.

    tracer = trace.get_tracer(__name__)
    MAX_STEPS = 15  # hard ceiling, regardless of what the task requests


    class AgentRunner:
        def __init__(self, agent_id: str, config: AgentConfig):
            self.agent_id = agent_id
            self.llm = LLMClient(model=config.model, timeout=30)
            self.memory = ContextManager(agent_id, max_tokens=config.context_limit)
            self.tools = ToolRegistryClient(config.tool_registry_url)
            self.metrics = AgentMetrics(agent_id)

        async def run(self, task: AgentTask) -> AgentResult:
            start = time.monotonic()
            with tracer.start_as_current_span("agent.run") as span:
                span.set_attribute("agent.id", self.agent_id)
                span.set_attribute("agent.task_id", task.id)
                span.set_attribute("agent.session", task.session_id)
                try:
                    result = await self._run_loop(task, span)
                except TokenBudgetExceeded as e:
                    # Budget overrun is reported as a completed task with partial output
                    result = AgentResult(task_id=task.id, status=TaskStatus.COMPLETED,
                                         output=e.partial_output, tokens_used=e.tokens_used,
                                         error="token budget exceeded")
                except Exception as e:
                    span.record_exception(e)
                    result = AgentResult(task_id=task.id, status=TaskStatus.FAILED, error=str(e))
                # Recorded after the try block (not in `finally`) so that task
                # cancellation is not swallowed by an early return.
                result.duration_ms = int((time.monotonic() - start) * 1000)
                self.metrics.record(result)
                return result

        async def _run_loop(self, task: AgentTask, span) -> AgentResult:
            # Load available tools from the registry
            tool_schemas = await self.tools.fetch(agent_id=self.agent_id)
            # Load and compress conversation history
            context = await self.memory.load(task.session_id)
            messages = build_messages(context, task.prompt)
            total_tokens = 0
            tool_call_count = 0
            max_steps = min(task.max_steps, MAX_STEPS)

            for step in range(max_steps):
                span.set_attribute("agent.current_step", step)
                with tracer.start_as_current_span("agent.llm_call") as llm_span:
                    response = await self._complete_with_retry(messages, tool_schemas)
                    llm_span.set_attribute("llm.prompt_tokens", response.usage.prompt_tokens)
                    llm_span.set_attribute("llm.completion_tokens", response.usage.completion_tokens)

                total_tokens += response.usage.total_tokens
                if total_tokens > task.token_budget:
                    raise TokenBudgetExceeded(partial_output=response.content,
                                              tokens_used=total_tokens)

                if response.finish_reason == "stop":
                    await self.memory.save(task.session_id, messages + [response.message])
                    return AgentResult(task_id=task.id, status=TaskStatus.COMPLETED,
                                       output=response.content, steps_used=step + 1,
                                       tokens_used=total_tokens, tool_calls=tool_call_count)

                if response.tool_calls:
                    tool_call_count += len(response.tool_calls)
                    results = await self._execute_tools(response.tool_calls)
                    messages.append(response.message)
                    messages.extend(tool_result_messages(results))

            # Hit max steps: return the best available output
            return AgentResult(task_id=task.id, status=TaskStatus.COMPLETED,
                               output=response.content, steps_used=max_steps,
                               tokens_used=total_tokens, tool_calls=tool_call_count,
                               error="max steps reached")

        @retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter(max=15))
        async def _complete_with_retry(self, messages, tools):
            return await self.llm.complete(messages=messages, tools=tools)

        async def _execute_tools(self, tool_calls):
            # Run tool calls concurrently; failures come back as exception objects
            tasks = [self.tools.invoke(tc) for tc in tool_calls]
            return await asyncio.gather(*tasks, return_exceptions=True)

Inter-Agent Communication

Pattern Selection Matrix

gRPC Service Definition

For synchronous sub-agent calls, gRPC provides strong typing, bidirectional streaming, and efficient binary serialization.
    // proto/agent_service.proto
    syntax = "proto3";
    package agents.v1;

    service AgentService {
      rpc RunTask(TaskRequest) returns (TaskResponse);
      rpc StreamSteps(TaskRequest) returns (stream StepEvent);
      rpc Health(HealthRequest) returns (HealthResponse);
    }

    message TaskRequest {
      string task_id = 1;
      string session_id = 2;
      string prompt = 3;
      map<string, string> metadata = 4;
      int32 max_steps = 5;
      int32 token_budget = 6;
    }

    message TaskResponse {
      string task_id = 1;
      string status = 2;
      string output = 3;
      int32 steps_used = 4;
      int32 tokens_used = 5;
      string error = 6;
    }

    message StepEvent {
      int32 step_number = 1;
      string type = 2;     // "llm_call" | "tool_call" | "tool_result"
      string content = 3;
    }

    // HealthRequest and HealthResponse are referenced by the service but not
    // defined in the original listing; these are minimal placeholders.
    message HealthRequest {}
    message HealthResponse { bool ok = 1; }

Kafka Event Schema

For async pipeline handoffs between agents, use Avro or JSON schemas registered in a Schema Registry.

    {
      "schema": {
        "type": "record",
        "name": "AgentTaskEvent",
        "namespace": "com.myco.agents.v1",
        "fields": [
          {"name": "task_id", "type": "string"},
          {"name": "source_agent", "type": "string"},
          {"name": "target_agent", "type": "string"},
          {"name": "session_id", "type": "string"},
          {"name": "prompt", "type": "string"},
          {"name": "context", "type": {"type": "map", "values": "string"}},
          {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
        ]
      }
    }

Kafka Producer in Orchestrator

    # In the orchestrator, when dispatching to agent-search
    import json
    import time

    from aiokafka import AIOKafkaProducer

    # KAFKA_BROKERS and get_current_trace_id are project-local and not shown here.

    async def dispatch_to_agent(target_agent: str, task: AgentTask):
        # A producer per call keeps the example short; a long-lived service
        # would normally start one producer and reuse it.
        producer = AIOKafkaProducer(bootstrap_servers=KAFKA_BROKERS)
        await producer.start()
        try:
            event = {
                "task_id": task.id,
                "source_agent": "orchestrator",
                "target_agent": target_agent,
                "session_id": task.session_id,
                "prompt": task.prompt,
                "created_at": int(time.time() * 1000),
            }
            await producer.send_and_wait(
                topic=f"agent.tasks.{target_agent}",
                value=json.dumps(event).encode(),
                key=task.session_id.encode(),  # partition by session
                headers=[("trace-id", get_current_trace_id().encode())],
            )
        finally:
            await producer.stop()

Tool Registry Service

Architecture
The Tool Registry is a centralized FastAPI service that stores, validates, and serves tool definitions. It acts as a typed API gateway for all agent→tool traffic.

Tool Registration Schema

    # Tool self-registers on startup
    from typing import Any, Dict

    from pydantic import BaseModel


    class ToolDefinition(BaseModel):
        name: str
        version: str
        description: str
        parameters: Dict[str, Any]  # JSON Schema
        returns: Dict[str, Any]     # JSON Schema
        endpoint: str               # where the registry routes calls
        health_url: str
        auth_type: str              # "api_key" | "oauth2" | "none"
        rate_limit: int             # calls per minute per agent
        timeout_ms: int = 10000


    # Registration call at tool service startup
    @app.on_event("startup")
    async def register_tool():
        registry = ToolRegistryClient(TOOL_REGISTRY_URL)
        await registry.register(ToolDefinition(
            name="web_search",
            version="2.1.0",
            description="Search the web and return ranked results",
            parameters={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "maxLength": 500},
                    "num_results": {"type": "integer", "minimum": 1, "maximum": 20},
                },
                "required": ["query"],
            },
            returns={
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "url": {"type": "string"},
                        "title": {"type": "string"},
                        "snippet": {"type": "string"},
                    },
                },
            },
            endpoint=f"{SERVICE_URL}/invoke",
            health_url=f"{SERVICE_URL}/health",
            auth_type="api_key",
            rate_limit=60,
            timeout_ms=8000,
        ))

Registry Validation Layer

    # Tool Registry validates before forwarding
    from uuid import uuid4

    import httpx
    import jsonschema

    # db, rate_limiter, ToolNotFoundError and RateLimitExceeded are project-local.

    async def invoke_tool(agent_id: str, tool_name: str, params: dict):
        tool = await db.get_tool(tool_name)
        if not tool:
            raise ToolNotFoundError(tool_name)

        # Validate against JSON Schema (raises ValidationError on invalid input)
        jsonschema.validate(params, tool.parameters)

        # Check rate limit
        if not await rate_limiter.check(agent_id, tool_name, tool.rate_limit):
            raise RateLimitExceeded(f"{tool_name} limit: {tool.rate_limit}/min")

        # Forward to the tool service with a timeout
        async with httpx.AsyncClient(timeout=tool.timeout_ms / 1000) as client:
            response = await client.post(
                tool.endpoint,
                json={"params": params},
                headers={"X-Agent-Id": agent_id, "X-Request-Id": str(uuid4())},
            )
            response.raise_for_status()
            return response.json()

Memory Architecture

Memory Tier Selection

ContextManager Implementation

    # agent/memory.py
    import json
    from typing import List

    from qdrant_client import AsyncQdrantClient  # async client, since calls are awaited
    from qdrant_client.models import PointStruct
    from redis.asyncio import Redis

    # REDIS_URL, QDRANT_URL, EmbeddingClient and estimate_tokens are project-local.


    class ContextManager:
        def __init__(self, agent_id: str, max_tokens: int = 4096):
            self.agent_id = agent_id
            self.max_tokens = max_tokens
            self.redis = Redis.from_url(REDIS_URL)
            self.qdrant = AsyncQdrantClient(QDRANT_URL)
            self.embedder = EmbeddingClient()

        async def load(self, session_id: str) -> List[dict]:
            # 1. Load recent turns from Redis
            raw = await self.redis.get(f"session:{session_id}:messages")
            messages = json.loads(raw) if raw else []

            # 2. Retrieve semantically relevant past context
            last_user_msg = next(
                (m for m in reversed(messages) if m["role"] == "user"), None
            )
            if last_user_msg:
                embedding = await self.embedder.embed(last_user_msg["content"])
                relevant = await self.qdrant.search(
                    collection_name=f"agent_{self.agent_id}_memory",
                    query_vector=embedding,
                    limit=3,
                )
                # Prepend as system context
                for hit in relevant:
                    messages.insert(0, {
                        "role": "system",
                        "content": f"[Past context] {hit.payload['summary']}",
                    })

            # 3. Compress if over the token limit
            return await self._compress_if_needed(messages)

        async def save(self, session_id: str, messages: List[dict]):
            # Save the last 20 turns to Redis
            recent = messages[-20:]
            await self.redis.setex(
                f"session:{session_id}:messages",
                86400,  # 24h TTL
                json.dumps(recent),
            )

            # If the session is long, generate and store a summary in the vector DB
            if len(messages) > 30:
                summary = await self._summarize(messages)
                embedding = await self.embedder.embed(summary)
                # Qdrant point IDs must be unsigned integers or UUIDs, so
                # session_id is assumed to be a UUID string here.
                await self.qdrant.upsert(
                    collection_name=f"agent_{self.agent_id}_memory",
                    points=[PointStruct(
                        id=session_id,
                        vector=embedding,
                        payload={"summary": summary, "session_id": session_id},
                    )],
                )

        async def _compress_if_needed(self, messages: List[dict]) -> List[dict]:
            token_count = estimate_tokens(messages)
            if token_count <= self.max_tokens:
                return messages
            # Keep system messages + the last N user/assistant turns.
            # Filtering first avoids duplicating system messages in the tail.
            system_msgs = [m for m in messages if m["role"] == "system"]
            other_msgs = [m for m in messages if m["role"] != "system"]
            recent_turns = other_msgs[-12:]  # last 6 exchanges
            return system_msgs + recent_turns

Context Window Management

Token Estimation

    import json

    import tiktoken


    def estimate_tokens(messages: list, model: str = "gpt-4o") -> int:
        enc = tiktoken.encoding_for_model(model)
        total = 0
        for msg in messages:
            total += 4  # per-message overhead
            total += len(enc.encode(msg.get("content", "") or ""))
            if "tool_calls" in msg:
                for tc in msg["tool_calls"]:
                    total += len(enc.encode(json.dumps(tc)))
        return total


    class TokenBudget:
        def __init__(self, total: int, model: str):
            self.total = total
            self.model = model
            self.used = 0
            self.reserved = 1024  # always reserve for output

        @property
        def available_for_input(self):
            return self.total - self.reserved - self.used

        def consume(self, tokens: int):
            self.used += tokens
            if self.used > self.total - self.reserved:
                raise TokenBudgetExceeded(tokens_used=self.used)

Orchestrator & Supervisor Pattern

Orchestrator: Task Decomposition

The Orchestrator is itself an agent microservice, but its role is planning and coordination rather than execution.
    # orchestrator/core.py
    import asyncio


    class OrchestratorAgent:
        async def execute(self, user_request: str, session_id: str) -> str:
            # Step 1: Decompose into a DAG of sub-tasks
            plan = await self.planner.decompose(user_request)
            # Returns:
            # [
            #   {"id": "t1", "agent": "search",    "task": "...", "deps": []},
            #   {"id": "t2", "agent": "summarize", "task": "...", "deps": ["t1"]},
            #   {"id": "t3", "agent": "email",     "task": "...", "deps": ["t2"]},
            # ]

            # Step 2: Execute in topological order, in parallel where possible
            results = {}
            for wave in topological_waves(plan):
                # All tasks in a wave have their deps satisfied
                wave_results = await asyncio.gather(
                    *[self.supervisor.dispatch(step, results) for step in wave]
                )
                for step, result in zip(wave, wave_results):
                    results[step["id"]] = result

            # Step 3: Synthesize final output
            return await self.synthesizer.merge(results, user_request)


    def topological_waves(plan: list) -> list:
        """Return plan steps grouped into parallel execution waves."""
        completed = set()
        waves = []
        remaining = list(plan)
        while remaining:
            wave = [s for s in remaining if all(d in completed for d in s["deps"])]
            if not wave:
                # Without this guard a cyclic or dangling dependency loops forever
                raise ValueError("plan has a dependency cycle or an unknown dependency")
            waves.append(wave)
            completed.update(s["id"] for s in wave)
            remaining = [s for s in remaining if s["id"] not in completed]
        return waves

Supervisor: Retry & Escalation

    class Supervisor:
        MAX_ATTEMPTS = 3

        def __init__(self, agent_clients: dict):
            self.agent_clients = agent_clients

        async def dispatch(self, step: dict, context: dict) -> StepResult:
            task_prompt = self._inject_context(step["task"], context, step["deps"])
            for attempt in range(self.MAX_ATTEMPTS):
                last_attempt = attempt == self.MAX_ATTEMPTS - 1
                try:
                    return await asyncio.wait_for(
                        self.agent_clients[step["agent"]].run(task_prompt),
                        timeout=60.0,
                    )
                except asyncio.TimeoutError:
                    if last_attempt:
                        raise SupervisorEscalation(step, "timeout after 3 attempts")
                except AgentError as e:
                    # Escalate on the final attempt too, so dispatch never returns None
                    if e.is_unrecoverable or last_attempt:
                        raise SupervisorEscalation(step, str(e))
                await asyncio.sleep(2 ** attempt)  # back off 1s, then 2s

        def _inject_context(self, task: str, results: dict, dep_ids: list) -> str:
            context_parts = [
                results[dep_id].output for dep_id in dep_ids if dep_id in results
            ]
            if context_parts:
                joined = "\n".join(context_parts)
                return f"Context from previous steps:\n{joined}\n\nTask: {task}"
            return task

Security & Authorization

Agent Identity & JWT Verification

Each agent service must verify that incoming requests come from authorized callers. Use short-lived JWTs signed by an internal auth service.

    # api/middleware.py
    from fastapi import HTTPException, Request
    from jose import JWTError, jwt

    ALLOWED_CALLERS = {"orchestrator", "supervisor", "api-gateway"}

    # PUBLIC_KEY is the auth service's RS256 public key, loaded elsewhere.

    async def verify_agent_token(request: Request):
        token = request.headers.get("Authorization", "").removeprefix("Bearer ")
        if not token:
            raise HTTPException(status_code=401, detail="Missing auth token")
        try:
            payload = jwt.decode(token, PUBLIC_KEY, algorithms=["RS256"])
        except JWTError as e:
            raise HTTPException(status_code=401, detail=f"Invalid token: {e}")
        caller = payload.get("sub")
        if caller not in ALLOWED_CALLERS:
            raise HTTPException(status_code=403, detail=f"Caller {caller} not authorized")
        request.state.caller = caller

Secrets Management

Never store API keys as literal environment values in manifests or in ConfigMaps. Use Kubernetes Secrets mounted as environment variables, or preferably HashiCorp Vault with the Vault Agent Sidecar.
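On the application side, a service can fail fast at startup when an expected secret is absent, and avoid ever printing the full value. The following is a minimal sketch, assuming secrets arrive as environment variables; `load_secret`, `redact`, and `MissingSecretError` are illustrative names invented for this example and do not come from any library.

```python
import os
from typing import Mapping


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret is not provided."""


def load_secret(name: str, env: Mapping[str, str] = os.environ) -> str:
    """Return the secret value, or fail fast if it is missing or blank."""
    value = env.get(name, "").strip()
    if not value:
        # The message names the variable but never includes a value.
        raise MissingSecretError(f"required secret {name} is not set")
    return value


def redact(value: str, visible: int = 4) -> str:
    """Mask a secret for logging, keeping only the last few characters."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Example with an explicit mapping instead of the real environment
fake_env = {"OPENAI_API_KEY": "sk-test-1234"}
api_key = load_secret("OPENAI_API_KEY", fake_env)
masked = redact(api_key)
```

Calling `load_secret` for each required variable during application startup means a misconfigured pod crashes immediately instead of failing on its first LLM call.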
    # k8s/deployment.yaml (secrets section)
    env:
      - name: OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: agent-secrets
            key: openai-api-key
      - name: TOOL_REGISTRY_TOKEN
        valueFrom:
          secretKeyRef:
            name: agent-secrets
            key: tool-registry-token

Tool Call Authorization

The Tool Registry enforces agent-level RBAC: which agents can invoke which tools.

    # Tool Registry ACL check
    TOOL_ACL = {
        "agent-search": ["web_search", "vector_search", "knowledge_base"],
        "agent-email": ["send_email", "get_email_thread"],
        "agent-code": ["code_exec", "git_read", "package_search"],
        "agent-data": ["sql_query", "csv_read", "chart_generate"],
    }

    async def check_tool_acl(agent_id: str, tool_name: str):
        allowed_tools = TOOL_ACL.get(agent_id, [])
        if tool_name not in allowed_tools:
            raise PermissionError(f"{agent_id} is not authorized to call {tool_name}")

Observability: Traces, Logs, Metrics

Distributed Tracing Setup (OpenTelemetry)

    # observability/tracing.py
    import os

    from opentelemetry import trace
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
    from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
    from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
    from opentelemetry.instrumentation.redis import RedisInstrumentor
    from opentelemetry.sdk.resources import SERVICE_NAME, Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor

    # Set from the ConfigMap in the deployment manifest
    OTEL_ENDPOINT = os.environ["OTEL_EXPORTER_OTLP_ENDPOINT"]

    def setup_tracing(service_name: str):
        provider = TracerProvider(
            resource=Resource(attributes={SERVICE_NAME: service_name})
        )
        provider.add_span_processor(
            BatchSpanProcessor(OTLPSpanExporter(endpoint=OTEL_ENDPOINT))
        )
        trace.set_tracer_provider(provider)

        # Auto-instrument frameworks
        FastAPIInstrumentor().instrument()
        RedisInstrumentor().instrument()
        HTTPXClientInstrumentor().instrument()

Standard Span Attributes for Agent Calls

Always set these attributes on every agent and LLM span:

    # In AgentRunner._run_loop:
    span.set_attribute("agent.id", self.agent_id)
    span.set_attribute("agent.task_id", task.id)
    span.set_attribute("agent.session_id", task.session_id)
    span.set_attribute("agent.step", step)
    span.set_attribute("llm.model", config.model)
    span.set_attribute("llm.prompt_tokens", response.usage.prompt_tokens)
    span.set_attribute("llm.completion_tokens", response.usage.completion_tokens)
    span.set_attribute("llm.finish_reason", response.finish_reason)

    # In Tool Registry on invoke:
    span.set_attribute("tool.name", tool_name)
    span.set_attribute("tool.version", tool.version)
    span.set_attribute("tool.caller_agent", agent_id)
    span.set_attribute("tool.latency_ms", latency_ms)

Prometheus Metrics

    # observability/metrics.py
    from prometheus_client import Counter, Gauge, Histogram

    agent_tasks_total = Counter(
        "agent_tasks_total", "Total tasks processed", ["agent_id", "status"]
    )
    agent_task_duration = Histogram(
        "agent_task_duration_seconds", "Task end-to-end latency", ["agent_id"],
        buckets=[0.5, 1, 2, 5, 10, 30, 60, 120],
    )
    agent_llm_tokens = Counter(
        "agent_llm_tokens_total", "LLM tokens consumed",
        ["agent_id", "token_type"],  # token_type: prompt | completion
    )
    agent_tool_calls = Counter(
        "agent_tool_calls_total", "Tool invocations",
        ["agent_id", "tool_name", "status"],
    )
    agent_steps_per_task = Histogram(
        "agent_steps_per_task", "Number of steps per task (runaway guard)",
        ["agent_id"], buckets=[1, 2, 3, 5, 8, 10, 15, 20, 25],
    )
    orchestrator_queue_depth = Gauge(
        "orchestrator_queue_depth", "Pending tasks in orchestrator queue"
    )

Alert Rules

    # alerting/rules.yaml
    groups:
      - name: agent-alerts
        rules:
          - alert: AgentHighErrorRate
            # Ratio of failed to total tasks, so that 0.05 really means 5%
            expr: |
              sum by (agent_id) (rate(agent_tasks_total{status="failed"}[5m]))
                / sum by (agent_id) (rate(agent_tasks_total[5m])) > 0.05
            for: 2m
            annotations:
              summary: "{{ $labels.agent_id }} failure rate above 5%"
          - alert: AgentRunawayTask
            # histogram_quantile needs the rate of the _bucket series
            expr: |
              histogram_quantile(0.99,
                sum by (le, agent_id) (rate(agent_steps_per_task_bucket[5m]))) > 15
            for: 5m
            annotations:
              summary: "Agent tasks exceeding 15 steps: possible runaway loop"
          - alert: LLMTokenCostSpike
            expr: rate(agent_llm_tokens_total[10m]) > 50000
            for: 5m
            annotations:
              summary: "Token consumption rate spike: check for loops"
          - alert: AgentLatencyHigh
            expr: |
              histogram_quantile(0.99,
                sum by (le, agent_id) (rate(agent_task_duration_seconds_bucket[5m]))) > 10
            for: 5m
            annotations:
              summary: "p99 task latency above 10s"

Structured Logging

    # Never log raw prompts or PII. Log task IDs and outcome codes.
    import structlog

    log = structlog.get_logger()

    log.info(
        "agent.task.completed",
        task_id=task.id,
        session_id=task.session_id,  # hashed in prod
        agent_id=self.agent_id,
        steps=result.steps_used,
        tokens=result.tokens_used,
        duration_ms=result.duration_ms,
        tool_calls=result.tool_calls,
        status=result.status,
        trace_id=get_current_trace_id(),
    )

Deployment on Kubernetes

Deployment Manifest

    # k8s/deployment.yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: agent-search
      labels:
        app: agent-search
        version: v1.4.2
        team: ai-platform
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: agent-search
      template:
        metadata:
          labels:
            app: agent-search
            version: v1.4.2
          annotations:
            prometheus.io/scrape: "true"
            prometheus.io/path: "/metrics"
            prometheus.io/port: "8080"
        spec:
          serviceAccountName: agent-search
          containers:
            - name: agent
              image: registry.myco.io/agent-search@sha256:<digest>  # Always pin by digest
              ports:
                - containerPort: 8080   # HTTP API
                  name: http
                - containerPort: 50051  # gRPC
                  name: grpc
              env:
                - name: AGENT_ID
                  value: "agent-search"
                - name: TOOL_REGISTRY_URL
                  valueFrom: {configMapKeyRef: {name: agent-config, key: tool-registry-url}}
                - name: REDIS_URL
                  valueFrom: {secretKeyRef: {name: agent-secrets, key: redis-url}}
                - name: OPENAI_API_KEY
                  valueFrom: {secretKeyRef: {name: agent-secrets, key: openai-api-key}}
                - name: OTEL_EXPORTER_OTLP_ENDPOINT
                  valueFrom: {configMapKeyRef: {name: observability-config, key: otel-endpoint}}
              resources:
                requests:
                  cpu: "500m"
                  memory: "512Mi"
                limits:
                  cpu: "2"
                  memory: "2Gi"
              livenessProbe:
                httpGet:
                  path: /health
                  port: 8080
                initialDelaySeconds: 10
                periodSeconds: 15
                failureThreshold: 3
              readinessProbe:
                httpGet:
                  path: /ready
                  port: 8080
                initialDelaySeconds: 5
                periodSeconds: 10
                failureThreshold: 2
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh", "-c", "sleep 5"]  # drain connections before shutdown
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey: kubernetes.io/hostname
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels: {app: agent-search}

Horizontal Pod Autoscaler (Custom Metrics)

Scale on Kafka consumer lag (and, where it is exported, p99 task latency), not just CPU. The manifest below combines consumer lag with CPU utilization:

    # k8s/hpa.yaml
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: agent-search-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: agent-search
      minReplicas: 2
      maxReplicas: 20
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
            - type: Pods
              value: 4
              periodSeconds: 60
        scaleDown:
          stabilizationWindowSeconds: 300  # be conservative scaling down
      metrics:
        - type: External
          external:
            metric:
              name: kafka_consumer_group_lag
              selector:
                matchLabels:
                  topic: agent.tasks.search
            target:
              type: AverageValue
              averageValue: "100"
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70

PodDisruptionBudget

Ensure at least one replica stays available during voluntary disruptions such as node drains and cluster upgrades:

    # k8s/pdb.yaml
    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: agent-search-pdb
    spec:
      minAvailable: 1
      selector:
        matchLabels:
          app: agent-search

Scaling Strategies

Per-Agent Scaling Logic

Multi-Model Fallback

If the primary LLM is unavailable or rate-limited, automatically route to a fallback:

    class LLMClient:
        MODEL_CASCADE = [
            "gpt-4o",             # primary
            "gpt-4o-mini",        # cheaper fallback
            "claude-sonnet-4-6",  # cross-vendor fallback
        ]

        async def complete(self, messages: list, **kwargs) -> LLMResponse:
            for model in self.MODEL_CASCADE:
                try:
                    return await self._call_model(model, messages, **kwargs)
                except (RateLimitError, ModelUnavailable):
                    log.warning("llm.fallback", from_model=model,
                                reason="rate limit or unavailable")
                    continue
            raise AllModelsUnavailable()

Fault Tolerance & Retry Strategies

Circuit Breaker on LLM Client

    from circuitbreaker import circuit


    class LLMClientWithCircuitBreaker:
        @circuit(failure_threshold=5, recovery_timeout=30, expected_exception=LLMError)
        async def complete(self, messages: list, **kwargs) -> LLMResponse:
            return await self._raw_complete(messages, **kwargs)

The circuit opens after 5 consecutive failures and remains open for 30 seconds. While it is open, calls fail immediately, so the caller should serve a fallback response or route to a secondary model during that window.

Exponential Backoff with Jitter

    from tenacity import (
        retry,
        retry_if_exception_type,
        stop_after_attempt,
        wait_exponential_jitter,
    )


    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential_jitter(initial=1, max=60),
        retry=retry_if_exception_type((RateLimitError, TimeoutError, ServiceUnavailable)),
    )
    async def call_tool_with_retry(tool_name: str, params: dict):
        return await tool_registry.invoke(tool_name, params)

Dead Letter Queue Handler

    # dlq_handler.py: consumes from the dead-letter topic
    class DLQHandler:
        async def process(self, event: AgentTaskEvent):
            log.error(
                "agent.task.dlq",
                task_id=event.task_id,
                target_agent=event.target_agent,
                attempt_count=event.retry_count,
                original_error=event.last_error,
            )

            # Alert on-call if the error is novel
            if await self.is_novel_error(event.last_error):
                await self.pagerduty.alert(event)

            # Store for the human review dashboard
            await self.db.insert_dlq_item(event)

            # Auto-re-queue with modified params after 1 hour (optional).
            # Note: sleeping here blocks this handler for the full hour; a delayed
            # retry topic or a scheduler is usually a better fit in production.
            if event.retry_count < 2 and event.auto_retry_eligible:
                await asyncio.sleep(3600)
                event.retry_count += 1
                await self.kafka.send("agent.tasks." + event.target_agent, event)

Step-Level Checkpointing

    class CheckpointedAgentRunner(AgentRunner):
        # Simplified: tool execution is omitted, and self.redis is assumed to be
        # an async Redis client created in __init__.
        async def _run_loop(self, task: AgentTask, span) -> AgentResult:
            tool_schemas = await self.tools.fetch(agent_id=self.agent_id)

            # Restore from checkpoint if available
            checkpoint = await self.redis.get(f"checkpoint:{task.id}")
            if checkpoint:
                state = json.loads(checkpoint)
                messages = state["messages"]
                total_tokens = state["total_tokens"]
                start_step = state["step"] + 1
                log.info("agent.checkpoint.restored", task_id=task.id, step=start_step)
            else:
                context = await self.memory.load(task.session_id)
                messages = build_messages(context, task.prompt)
                total_tokens = 0
                start_step = 0

            response = None
            step = start_step
            for step in range(start_step, task.max_steps):
                response = await self._complete_with_retry(messages, tool_schemas)
                messages.append(response.message)
                total_tokens += response.usage.total_tokens

                # Persist a checkpoint after each step
                await self.redis.setex(
                    f"checkpoint:{task.id}", 3600,
                    json.dumps({"messages": messages,
                                "total_tokens": total_tokens, "step": step}),
                )
                if response.finish_reason == "stop":
                    await self.redis.delete(f"checkpoint:{task.id}")
                    break

            return build_result(task, response, total_tokens, step)

Testing Agent Microservices

Testing Pyramid

    # tests/unit/test_agent_runner.py
    from unittest.mock import AsyncMock, patch

    import pytest

    # AgentRunner, AgentTask, TaskStatus, LLMResponse, Usage and test_config
    # come from the agent package and test fixtures.


    @pytest.fixture
    def mock_llm():
        llm = AsyncMock()
        llm.complete.return_value = LLMResponse(
            content="Here is the search result.",
            finish_reason="stop",
            usage=Usage(prompt_tokens=100, completion_tokens=50, total_tokens=150),
        )
        return llm


    @pytest.mark.asyncio
    async def test_agent_completes_in_one_step(mock_llm):
        runner = AgentRunner("agent-search", test_config)
        runner.llm = mock_llm
        result = await runner.run(
            AgentTask(id="t1", session_id="s1", prompt="find AI news")
        )
        assert result.status == TaskStatus.COMPLETED
        assert result.steps_used == 1
        assert result.tokens_used == 150
        mock_llm.complete.assert_called_once()


    @pytest.mark.asyncio
    async def test_agent_respects_token_budget(mock_llm):
        mock_llm.complete.return_value = LLMResponse(
            content="...",
            finish_reason="tool_calls",
            usage=Usage(prompt_tokens=900, completion_tokens=100, total_tokens=1000),
        )
        # 512 is the smallest budget the AgentTask schema accepts (ge=512)
        task = AgentTask(id="t1", session_id="s1", prompt="...", token_budget=512)
        runner = AgentRunner("agent-search", test_config)
        runner.llm = mock_llm
        result = await runner.run(task)
        assert result.error == "token budget exceeded"

Integration Testing with a Mock LLM Server

Use a local mock LLM server (e.g., WireMock or a FastAPI stub) that returns deterministic responses, so tool call flows can be tested end-to-end without hitting real APIs.

    # tests/integration/test_tool_flow.py
    @pytest.mark.asyncio
    async def test_search_agent_calls_web_search_tool(
        mock_llm_server, real_redis, real_tool_registry
    ):
        # Configure the mock LLM to respond with a tool call on the first turn
        mock_llm_server.set_response(step=0, response=TOOL_CALL_RESPONSE)
        mock_llm_server.set_response(step=1, response=FINAL_RESPONSE)

        runner = AgentRunner("agent-search", integration_config)
        result = await runner.run(
            AgentTask(id="t1", session_id="s1", prompt="Search for AI news")
        )
        assert result.status == TaskStatus.COMPLETED
        assert result.tool_calls == 1
        assert real_tool_registry.was_invoked("web_search")

Chaos Testing

Use Chaos Mesh or Litmus to test resilience:

- Pod kill: kill a random agent pod and verify that Supervisor retries succeed
- Network partition: block agent→tool-registry traffic and verify that the circuit breaker opens
- LLM latency injection: add a 15s delay to LLM calls and verify that timeout and fallback activate
- Kafka partition leader election: simulate Kafka failover and verify no task loss via consumer offset management

CI/CD Pipeline for Agent Services

    # .github/workflows/agent-service.yml
    name: Agent Service CI/CD
    on:
      push:
        paths: ["agents/agent-search/**"]

    jobs:
      test:
        runs-on: ubuntu-latest
        services:
          redis:
            image: redis:7-alpine
            ports: ["6379:6379"]
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with: {python-version: "3.12"}
          - run: pip install -e ".[dev]"
          - run: pytest tests/ --cov=agent --cov-fail-under=85

      security-scan:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          # Assumes the image tag is available to this runner
          - name: Trivy vulnerability scan
            uses: aquasecurity/trivy-action@master
            with: {image-ref: "agent-search:${{ github.sha }}", exit-code: "1"}

      build-push:
        needs: [test, security-scan]
        runs-on: ubuntu-latest
        # $GITHUB_ENV is scoped to one job, so the digest is passed on as a job output
        outputs:
          digest: ${{ steps.build.outputs.digest }}
        steps:
          - uses: actions/checkout@v4
          - name: Build and push (pinned by digest)
            id: build
            run: |
              docker buildx build --platform linux/amd64,linux/arm64 \
                -t registry.myco.io/agent-search:${{ github.sha }} \
                --push agents/agent-search/
              # Capture digest for deployment. A multi-platform --push build is not
              # loaded into the local image store, so this lookup may need adapting.
              REF=$(docker inspect --format='{{index .RepoDigests 0}}' \
                registry.myco.io/agent-search:${{ github.sha }})
              echo "digest=${REF##*@}" >> "$GITHUB_OUTPUT"

      deploy-staging:
        needs: build-push
        runs-on: ubuntu-latest
        steps:
          - name: Deploy to staging
            run: |
              kubectl set image deployment/agent-search \
                agent=registry.myco.io/agent-search@${{ needs.build-push.outputs.digest }} \
                -n staging
              kubectl rollout status deployment/agent-search -n staging --timeout=120s

      smoke-test-staging:
        needs: deploy-staging
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - run: python tests/smoke/run_smoke_tests.py --env staging

      deploy-production:
        needs: [build-push, smoke-test-staging]
        runs-on: ubuntu-latest
        environment: production
        steps:
          - name: Rolling deploy to production
            run: |
              kubectl set image deployment/agent-search \
                agent=registry.myco.io/agent-search@${{ needs.build-push.outputs.digest }} \
                -n production
              kubectl rollout status deployment/agent-search -n production --timeout=300s

Cost Management & Token Budgeting

Per-Agent Token Accounting

Track token usage per agent, per session, and per user to enable chargebacks and anomaly detection.
    class TokenAccountant:
        async def record(self, agent_id: str, session_id: str, usage: Usage):
            # Increment per-agent daily counter and keep it for seven days
            await self.redis.incrby(f"tokens:{agent_id}:{today()}", usage.total_tokens)
            await self.redis.expire(f"tokens:{agent_id}:{today()}", 86400 * 7)
            # Increment per-session counter for user billing
            await self.redis.incrby(f"tokens:session:{session_id}", usage.total_tokens)
            # Write to time-series DB for cost dashboards
            await self.influx.write(
                measurement="llm_tokens",
                tags={"agent_id": agent_id, "model": usage.model},
                fields={"prompt": usage.prompt_tokens, "completion": usage.completion_tokens},
            )

        async def get_estimated_cost(self, agent_id: str) -> float:
            tokens = int(await self.redis.get(f"tokens:{agent_id}:{today()}") or 0)
            # GPT-4o pricing: $2.50/1M prompt, $10/1M completion (example).
            # $5.00/1M is a blended estimate that assumes about two prompt
            # tokens for every completion token.
            return tokens / 1_000_000 * 5.0

Budget Enforcement at Session Level

    MAX_SESSION_TOKENS = 50_000  # hard cap per user session

    async def check_session_budget(session_id: str):
        used = int(await redis.get(f"tokens:session:{session_id}") or 0)
        if used > MAX_SESSION_TOKENS:
            raise SessionBudgetExceeded(
                session_id=session_id, tokens_used=used, limit=MAX_SESSION_TOKENS
            )

Production Readiness Checklist

Service-Level Requirements

- Agent has a /health endpoint that checks LLM client connectivity
- Agent has a /ready endpoint that checks memory store (Redis) and Tool Registry reachability
- All tool calls are schema-validated by the Tool Registry before execution
- Agent-level RBAC enforced: agent X cannot invoke tools it is not authorized for
- JWT verification on all inter-agent gRPC and HTTP calls
- Secrets loaded from Kubernetes Secrets or Vault, never from env literals or ConfigMaps

Reliability Requirements

- Context window size is bounded, with no unbounded message history growth
- Token budget enforced per task with a hard ceiling
- MAX_STEPS guard in place to prevent runaway loops
- Exponential backoff with jitter on all LLM calls
- Circuit breaker configured on LLM client (threshold, recovery timeout)
- Exponential backoff on all tool calls
- Failed tasks routed to a Dead Letter Queue, not silently dropped
- Step-level checkpointing for tasks expected to exceed 60 seconds
- Multi-model fallback cascade configured (primary → cheaper → cross-vendor)

Observability Requirements

- OpenTelemetry distributed tracing with trace context propagation
- All LLM completions traced with token counts and latency
- All tool calls traced with tool name, version, and outcome
- Prometheus metrics exported: task count, duration, token usage, tool calls, step count
- Alerts configured: high error rate, runaway steps, token cost spike, high latency
- Structured JSON logging with task_id, session_id (hashed), and trace_id; no raw prompt content

Deployment Requirements

- Agent image pinned to digest, not a mutable tag (never :latest)
- HPA configured with appropriate metrics (queue lag and latency, not just CPU)
- PodDisruptionBudget set (minAvailable = 1)
- Pod topology spread constraints configured for HA across nodes
- Resource requests and limits set (no QoS class "BestEffort")
- Rolling update strategy with preStop sleep for graceful shutdown
- Integration tests cover the "tool call fails → agent recovers" path
- Load tests simulate 10× expected peak concurrency before go-live

Cost Control Requirements

- Token usage recorded per agent and per session
- Session-level budget cap enforced
- Token cost alerting configured per agent
- DLQ monitored; no silent retry storms
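
The "token cost alerting configured per agent" item can be made concrete with a small check over daily cost figures. The sketch below is one possible shape, not the article's implementation: `PRICE_PER_MILLION`, `estimate_cost`, `is_cost_spike`, and the 3× threshold are all hypothetical names and values chosen for illustration, and the rates reuse the example GPT-4o pricing quoted in the cost section. It prices prompt and completion tokens separately rather than using a blended rate, then flags a day whose cost exceeds a multiple of the trailing daily mean.

```python
from statistics import mean

# Example rates in USD per 1M tokens (the example pricing quoted earlier);
# placeholders only, real rates should come from configuration.
PRICE_PER_MILLION = {"prompt": 2.50, "completion": 10.00}


def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Price prompt and completion tokens separately instead of blending."""
    return (
        prompt_tokens * PRICE_PER_MILLION["prompt"]
        + completion_tokens * PRICE_PER_MILLION["completion"]
    ) / 1_000_000


def is_cost_spike(history: list[float], today: float, factor: float = 3.0) -> bool:
    """Return True if today's cost exceeds `factor` times the trailing daily mean.

    With no history the function returns False, so a newly deployed agent
    does not alert on its first day.
    """
    if not history:
        return False
    return today > factor * mean(history)


if __name__ == "__main__":
    cost = estimate_cost(prompt_tokens=900_000, completion_tokens=100_000)
    print(f"estimated cost: ${cost:.2f}")
    print("spike:", is_cost_spike([1.0, 1.2, 0.9], today=cost))
```

In practice the `history` list would be read from the same per-agent daily counters the accounting code writes, and the result would feed an alerting rule rather than a print statement. A fixed multiple of the mean is a deliberately simple baseline; agents with strongly weekly traffic patterns may need a comparison against the same weekday instead.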