{"slug": "building-micro-agents-as-production-grade-microservices", "title": "Building Micro Agents as Production-Grade Microservices", "summary": "This article describes how to build production-grade AI agent systems using a microservices architecture, moving beyond single-process prototypes that fail at scale. It advocates for designing each \"micro agent\" as an independent service with its own API contract, memory scope, and SLA, using technologies like FastAPI, gRPC, Kafka, and Kubernetes. The guide provides concrete implementation patterns including stateless LLM inference, external memory stores, idempotent tool calls, async task queues, and a standardized project structure with health checks and observability.", "body_md": "Build production-grade AI agent systems using microservices. Covers FastAPI, gRPC, Kafka, Kubernetes, OpenTelemetry, and fault-tolerant orchestration patterns in Python.\n\n### Table of Contents\n\n- Introduction & Motivation\n- Core Architecture Principles\n- Agent Service Design\n- The AgentRunner Loop\n- Inter-Agent Communication\n- Tool Registry Service\n- Memory Architecture\n- Context Window Management\n- Orchestrator & Supervisor Pattern\n- Security & Authorization\n- Observability: Traces, Logs, Metrics\n- Deployment on Kubernetes\n- Scaling Strategies\n- Fault Tolerance & Retry Strategies\n- Testing Agent Microservices\n- CI/CD Pipeline for Agent Services\n- Cost Management & Token Budgeting\n- Production Readiness Checklist\n- Reference Architecture Diagram\n\n### Introduction & Motivation\n\n#### Why monolithic agent systems fail in production\n\nA single-process agent that handles reasoning, tool calls, memory retrieval, and output generation works well in prototypes. 
In production it breaks in predictable ways:\n\n- **Latency coupling**: one slow tool call blocks the entire inference loop\n- **Unscalable compute**: you cannot scale the summarization workload independently from the search workload\n- **Blast radius**: a single LLM API timeout or memory corruption takes the whole system down\n- **Zero deployment granularity**: updating one tool integration requires redeploying everything\n- **No isolation for billing**: compute cost cannot be attributed to individual agent functions\n\n#### The microservice solution\n\nEach autonomous capability becomes an independently deployable, independently scalable service with:\n\n- Its own API surface (HTTP/gRPC)\n- Its own health checks and readiness probes\n- Its own memory scope (no shared in-process state)\n- Its own tool bindings (resolved at runtime from a Tool Registry)\n- Its own observability (distributed traces, metrics, structured logs)\n\n#### What is a Micro Agent?\n\nA **micro agent** is a bounded autonomous service that:\n\n- Accepts a task (prompt + context + session ID) via an API call\n- Runs a plan → act → observe loop using an LLM backend\n- Invokes tools via a centralized Tool Registry\n- Stores and retrieves conversation state from an external memory store\n- Returns a typed result or emits an event to downstream consumers\n\nKey insight: A micro agent is not a “smart function”; it is a service with its own API contract, memory scope, failure modes, and SLA. Design it accordingly.\n\n### Core Architecture Principles\n\n#### Single Responsibility\n\nEach agent owns exactly one reasoning domain, for example search, email, code, or data analysis.\n\n#### Stateless Reasoning, Stateful Memory\n\nThe LLM inference step **must be stateless**. Memory lives in external stores, such as Redis for recent turns and a vector database for long-term summaries.\n\nNo conversation history should ever live in in-process RAM between requests.\n\n#### Schema-First Tool Contracts\n\nEvery tool must have a JSON Schema definition published to a shared Tool Registry before any agent can invoke it. 
No ad-hoc function signatures. This enables:\n\n- Runtime input validation before LLM output reaches backend services\n- Auto-generated documentation\n- Tool versioning with backwards compatibility checks\n\n#### Idempotent Actions\n\nAny tool call that modifies external state (send email, write to DB, trigger webhook) must be idempotent. Strategies:\n\n- Use **idempotency keys** at the HTTP layer (pass an `Idempotency-Key` header)\n- Use **message deduplication** at the queue level (Kafka exactly-once semantics)\n- Design tool handlers to be safe to retry: check-then-act patterns\n\n#### Async by Default\n\nLong-running agent tasks (multi-step research, code generation + execution) must use async task queues, not synchronous HTTP with long timeouts.\n\nClient ──► POST /tasks ──► Kafka/BullMQ ──► AgentWorker\n\nClient ──► GET /tasks/{id} ──► Redis (status polling)\n\n◄── WebSocket/SSE push (optional)\n\n#### Explicit Context Boundaries\n\nEach agent invocation carries a **bounded context packet**; message histories must never grow unbounded. 
A ContextManager service compresses/summarizes history before injection.\n\n### Agent Service Design\n\n#### Project Layout\n\nEach agent is a containerized FastAPI or gRPC service with this canonical structure:\n\nagent-search/\n\n├── agent/\n\n│ ├── core.py # AgentRunner: plan → act → observe loop\n\n│ ├── prompts.py # System prompt + few-shot templates\n\n│ ├── memory.py # ContextManager: load/compress/save\n\n│ ├── tools.py # Tool bindings (calls Tool Registry)\n\n│ └── schemas.py # Pydantic models for all I/O\n\n├── api/\n\n│ ├── routes.py # POST /run, GET /status/{task_id}\n\n│ ├── middleware.py # Auth, rate limiting, request tracing\n\n│ └── deps.py # Dependency injection: DB, Redis, LLM client\n\n├── tests/\n\n│ ├── unit/\n\n│ ├── integration/\n\n│ └── fixtures/\n\n├── Dockerfile\n\n├── pyproject.toml\n\n└── k8s/\n\n├── deployment.yaml\n\n├── service.yaml\n\n├── hpa.yaml\n\n└── configmap.yaml\n\n#### API Contract\n\nEvery agent exposes these HTTP endpoints at minimum:\n\nPOST /run Submit a task (sync, short tasks only)\n\nPOST /tasks Submit a task (async, returns task_id)\n\nGET /tasks/{task_id} Poll task status and result\n\nGET /health Liveness probe\n\nGET /ready Readiness probe (checks LLM + memory store)\n\nGET /metrics Prometheus metrics endpoint\n\n``` python\n# agent/schemas.py\nfrom pydantic import BaseModel, Field\nfrom typing import Optional, Dict, Any\nfrom enum import Enum\n\nclass TaskStatus(str, Enum):\n    PENDING = \"pending\"\n    RUNNING = \"running\"\n    COMPLETED = \"completed\"\n    FAILED = \"failed\"\n    CANCELLED = \"cancelled\"\n\nclass AgentTask(BaseModel):\n    id: str\n    session_id: str\n    prompt: str\n    metadata: Dict[str, Any] = Field(default_factory=dict)\n    max_steps: int = Field(default=10, ge=1, le=25)\n    token_budget: int = Field(default=8192, ge=512, le=32768)\n\nclass AgentResult(BaseModel):\n    task_id: str\n    status: TaskStatus\n    output: Optional[str] = None\n    steps_used: int = 0\n    tokens_used: 
int = 0\n    tool_calls: int = 0\n    error: Optional[str] = None\n    duration_ms: int = 0\n```\n\n### The AgentRunner Loop\n\n#### Full Implementation\n\n``` python\n# agent/core.py\nimport asyncio\nimport time\nfrom opentelemetry import trace\nfrom tenacity import retry, stop_after_attempt, wait_exponential_jitter\n\ntracer = trace.get_tracer(__name__)\nMAX_STEPS = 15\n\nclass AgentRunner:\n    def __init__(self, agent_id: str, config: AgentConfig):\n        self.agent_id = agent_id\n        self.llm = LLMClient(model=config.model, timeout=30)\n        self.memory = ContextManager(agent_id, max_tokens=config.context_limit)\n        self.tools = ToolRegistryClient(config.tool_registry_url)\n        self.metrics = AgentMetrics(agent_id)\n\n    async def run(self, task: AgentTask) -> AgentResult:\n        start = time.monotonic()\n\n        with tracer.start_as_current_span(\"agent.run\") as span:\n            span.set_attribute(\"agent.id\", self.agent_id)\n            span.set_attribute(\"agent.task_id\", task.id)\n            span.set_attribute(\"agent.session_id\", task.session_id)\n\n            try:\n                result = await self._run_loop(task, span)\n            except TokenBudgetExceeded as e:\n                result = AgentResult(\n                    task_id=task.id,\n                    status=TaskStatus.COMPLETED,\n                    output=e.partial_output,\n                    error=\"token_budget_exceeded\"\n                )\n            except Exception as e:\n                span.record_exception(e)\n                result = AgentResult(\n                    task_id=task.id,\n                    status=TaskStatus.FAILED,\n                    error=str(e)\n                )\n\n            # Recorded after the try/except rather than in a finally block, so a\n            # cancellation (not an Exception subclass) never reads an unbound result\n            result.duration_ms = int((time.monotonic() - start) * 1000)\n            self.metrics.record(result)\n\n            return result\n\n    async def _run_loop(self, task: AgentTask, span) -> AgentResult:\n        # Load available 
tools from registry\n        tool_schemas = await self.tools.fetch(agent_id=self.agent_id)\n\n        # Load and compress conversation history\n        context = await self.memory.load(task.session_id)\n        messages = build_messages(context, task.prompt)\n\n        total_tokens = 0\n        tool_call_count = 0\n\n        for step in range(task.max_steps):\n            span.set_attribute(\"agent.current_step\", step)\n\n            with tracer.start_as_current_span(\"agent.llm_call\") as llm_span:\n                response = await self._complete_with_retry(messages, tool_schemas)\n                llm_span.set_attribute(\"llm.prompt_tokens\", response.usage.prompt_tokens)\n                llm_span.set_attribute(\"llm.completion_tokens\", response.usage.completion_tokens)\n\n            total_tokens += response.usage.total_tokens\n\n            if total_tokens > task.token_budget:\n                raise TokenBudgetExceeded(\n                    partial_output=response.content,\n                    tokens_used=total_tokens\n                )\n\n            if response.finish_reason == \"stop\":\n                await self.memory.save(task.session_id, messages + [response.message])\n                return AgentResult(\n                    task_id=task.id,\n                    status=TaskStatus.COMPLETED,\n                    output=response.content,\n                    steps_used=step + 1,\n                    tokens_used=total_tokens,\n                    tool_calls=tool_call_count\n                )\n\n            if response.tool_calls:\n                tool_call_count += len(response.tool_calls)\n                results = await self._execute_tools(response.tool_calls)\n                messages.append(response.message)\n                messages.extend(tool_result_messages(results))\n\n        # Hit max steps — return best available output\n        return AgentResult(\n            task_id=task.id,\n            status=TaskStatus.COMPLETED,\n            
output=response.content,\n            steps_used=task.max_steps,\n            tokens_used=total_tokens,\n            tool_calls=tool_call_count,\n            error=\"max_steps_reached\"\n        )\n\n    @retry(stop=stop_after_attempt(3), wait=wait_exponential_jitter(max=15))\n    async def _complete_with_retry(self, messages, tools):\n        return await self.llm.complete(messages=messages, tools=tools)\n\n    async def _execute_tools(self, tool_calls):\n        # Run all tool calls concurrently; failures come back as exception objects\n        tasks = [self.tools.invoke(tc) for tc in tool_calls]\n        return await asyncio.gather(*tasks, return_exceptions=True)\n```\n\n### Inter-Agent Communication\n\n#### Pattern Selection Matrix\n\n#### gRPC Service Definition\n\nFor synchronous sub-agent calls, gRPC provides strong typing, bidirectional streaming, and efficient binary serialization.\n\n```\n// proto/agent_service.proto\nsyntax = \"proto3\";\npackage agents.v1;\n\nservice AgentService {\n  rpc RunTask (TaskRequest) returns (TaskResponse);\n  rpc StreamSteps (TaskRequest) returns (stream StepEvent);\n  rpc Health (HealthRequest) returns (HealthResponse);\n}\n\nmessage TaskRequest {\n  string task_id = 1;\n  string session_id = 2;\n  string prompt = 3;\n  map<string, string> metadata = 4;\n  int32 max_steps = 5;\n  int32 token_budget = 6;\n}\n\nmessage TaskResponse {\n  string task_id = 1;\n  string status = 2;\n  string output = 3;\n  int32 steps_used = 4;\n  int32 tokens_used = 5;\n  string error = 6;\n}\n\nmessage StepEvent {\n  int32 step_number = 1;\n  string type = 2; // \"llm_call\" | \"tool_call\" | \"tool_result\"\n  string content = 3;\n}\n\n// Assumed shapes: the health messages must be defined for the file to compile\nmessage HealthRequest {}\n\nmessage HealthResponse {\n  string status = 1;\n}\n```\n\n#### Kafka Event Schema\n\nFor async pipeline handoffs between agents, use Avro or JSON schemas registered in a Schema Registry.\n\n```\n{\n  \"schema\": {\n    \"type\": \"record\",\n    \"name\": \"AgentTaskEvent\",\n    \"namespace\": \"com.myco.agents.v1\",\n    \"fields\": [\n      {\"name\": \"task_id\", \"type\": \"string\"},\n      {\"name\": \"source_agent\", \"type\": \"string\"},\n      {\"name\": \"target_agent\", 
\"type\": \"string\"},\n      {\"name\": \"session_id\", \"type\": \"string\"},\n      {\"name\": \"prompt\", \"type\": \"string\"},\n      {\"name\": \"context\", \"type\": {\"type\": \"map\", \"values\": \"string\"}},\n      {\"name\": \"created_at\", \"type\": {\"type\": \"long\", \"logicalType\": \"timestamp-millis\"}}\n    ]\n  }\n}\n```\n\n#### Kafka Producer (in Orchestrator)\n\n``` python\n# In orchestrator when dispatching to agent-search\nfrom aiokafka import AIOKafkaProducer\nimport json\nimport time\n\nasync def dispatch_to_agent(target_agent: str, task: AgentTask):\n    # A producer per call keeps the example short; reuse one long-lived producer in practice\n    producer = AIOKafkaProducer(bootstrap_servers=KAFKA_BROKERS)\n    await producer.start()\n    try:\n        event = {\n            \"task_id\": task.id,\n            \"source_agent\": \"orchestrator\",\n            \"target_agent\": target_agent,\n            \"session_id\": task.session_id,\n            \"prompt\": task.prompt,\n            \"context\": {}, # required by the event schema; populate as needed\n            \"created_at\": int(time.time() * 1000)\n        }\n        await producer.send_and_wait(\n            topic=f\"agent.tasks.{target_agent}\",\n            value=json.dumps(event).encode(),\n            key=task.session_id.encode(), # partition by session\n            headers=[(\"trace-id\", get_current_trace_id().encode())]\n        )\n    finally:\n        await producer.stop()\n```\n\n### Tool Registry Service\n\n#### Architecture\n\nThe Tool Registry is a centralized FastAPI service that stores, validates, and serves tool definitions. 
It acts as a typed API gateway for all agent→tool traffic.\n\n#### Tool Registration Schema\n\n``` python\n# Tool self-registers on startup\nfrom typing import Any, Dict\n\nfrom pydantic import BaseModel\n\nclass ToolDefinition(BaseModel):\n    name: str\n    version: str\n    description: str\n    parameters: Dict[str, Any] # JSON Schema\n    returns: Dict[str, Any] # JSON Schema\n    endpoint: str # where registry routes calls\n    health_url: str\n    auth_type: str # \"api_key\" | \"oauth2\" | \"none\"\n    rate_limit: int # calls per minute per agent\n    timeout_ms: int = 10000\n\n# Registration call at tool service startup\n# (on_event is deprecated in recent FastAPI releases; a lifespan handler is the current equivalent)\n@app.on_event(\"startup\")\nasync def register_tool():\n    registry = ToolRegistryClient(TOOL_REGISTRY_URL)\n    await registry.register(ToolDefinition(\n        name=\"web_search\",\n        version=\"2.1.0\",\n        description=\"Search the web and return ranked results\",\n        parameters={\n            \"type\": \"object\",\n            \"properties\": {\n                \"query\": {\"type\": \"string\", \"maxLength\": 500},\n                \"num_results\": {\"type\": \"integer\", \"minimum\": 1, \"maximum\": 20}\n            },\n            \"required\": [\"query\"]\n        },\n        returns={\n            \"type\": \"array\",\n            \"items\": {\n                \"type\": \"object\",\n                \"properties\": {\n                    \"url\": {\"type\": \"string\"},\n                    \"title\": {\"type\": \"string\"},\n                    \"snippet\": {\"type\": \"string\"}\n                }\n            }\n        },\n        endpoint=f\"{SERVICE_URL}/invoke\",\n        health_url=f\"{SERVICE_URL}/health\",\n        auth_type=\"api_key\",\n        rate_limit=60,\n        timeout_ms=8000\n    ))\n```\n\n#### Registry Validation Layer\n\n``` python\n# Tool Registry validates before forwarding\nfrom uuid import uuid4\n\nimport httpx\nimport jsonschema\n\nasync def invoke_tool(agent_id: str, tool_name: str, params: dict):\n    tool = await db.get_tool(tool_name)\n\n    if not tool:\n        raise ToolNotFoundError(tool_name)\n\n    # Validate 
against JSON Schema\n    jsonschema.validate(params, tool.parameters) # raises on invalid input\n\n    # Check rate limit\n    if not await rate_limiter.check(agent_id, tool_name, tool.rate_limit):\n        raise RateLimitExceeded(f\"{tool_name} limit: {tool.rate_limit}/min\")\n\n    # Forward to tool service with timeout\n    async with httpx.AsyncClient(timeout=tool.timeout_ms / 1000) as client:\n        response = await client.post(\n            tool.endpoint,\n            json={\"params\": params},\n            headers={\"X-Agent-Id\": agent_id, \"X-Request-Id\": str(uuid4())}\n        )\n        response.raise_for_status()\n        return response.json()\n```\n\n### Memory Architecture\n\n#### Memory Tier Selection\n\n#### ContextManager Implementation\n\n``` python\n# agent/memory.py\nimport json\nfrom redis.asyncio import Redis\nfrom qdrant_client import AsyncQdrantClient # async client, because its calls are awaited\nfrom typing import List\n\nclass ContextManager:\n    def __init__(self, agent_id: str, max_tokens: int = 4096):\n        self.agent_id = agent_id\n        self.max_tokens = max_tokens\n        self.redis = Redis.from_url(REDIS_URL)\n        self.qdrant = AsyncQdrantClient(url=QDRANT_URL)\n        self.embedder = EmbeddingClient()\n\n    async def load(self, session_id: str) -> List[dict]:\n        # 1. Load recent turns from Redis\n        raw = await self.redis.get(f\"session:{session_id}:messages\")\n        messages = json.loads(raw) if raw else []\n\n        # 2. 
Retrieve semantically relevant past context\n        # Default to None so a history without user turns does not raise StopIteration\n        last_user_msg = next((m for m in reversed(messages) if m[\"role\"] == \"user\"), None)\n        if last_user_msg:\n            embedding = await self.embedder.embed(last_user_msg[\"content\"])\n            relevant = await self.qdrant.search(\n                collection_name=f\"agent_{self.agent_id}_memory\",\n                query_vector=embedding,\n                limit=3\n            )\n            # Prepend as system context\n            for hit in relevant:\n                messages.insert(0, {\n                    \"role\": \"system\",\n                    \"content\": f\"[Past context] {hit.payload['summary']}\"\n                })\n\n        # 3. Compress if over token limit\n        return await self._compress_if_needed(messages)\n\n    async def save(self, session_id: str, messages: List[dict]):\n        # Save the last 20 messages to Redis\n        recent = messages[-20:]\n        await self.redis.setex(\n            f\"session:{session_id}:messages\",\n            86400, # 24h TTL\n            json.dumps(recent)\n        )\n\n        # If session is long, generate and store a summary in vector DB\n        if len(messages) > 30:\n            summary = await self._summarize(messages)\n            embedding = await self.embedder.embed(summary)\n            await self.qdrant.upsert(\n                collection_name=f\"agent_{self.agent_id}_memory\",\n                points=[{\n                    \"id\": session_id, # Qdrant point IDs must be unsigned integers or UUIDs, so session_id must be a UUID string\n                    \"vector\": embedding,\n                    \"payload\": {\"summary\": summary, \"session_id\": session_id}\n                }]\n            )\n\n    async def _compress_if_needed(self, messages: List[dict]) -> List[dict]:\n        token_count = estimate_tokens(messages)\n        if token_count <= self.max_tokens:\n            return messages\n\n        # Keep system messages + last N user/assistant turns\n        system_msgs = [m for m in messages if m[\"role\"] == \"system\"]\n        
recent_turns = [m for m in messages if m[\"role\"] != \"system\"][-12:] # last 6 exchanges; system messages are already kept above\n        return system_msgs + recent_turns\n```\n\n### Context Window Management\n\n#### Token Estimation\n\n``` python\nimport json\n\nimport tiktoken\n\ndef estimate_tokens(messages: list, model: str = \"gpt-4o\") -> int:\n    enc = tiktoken.encoding_for_model(model)\n    total = 0\n    for msg in messages:\n        total += 4 # approximate per-message overhead\n        total += len(enc.encode(msg.get(\"content\", \"\") or \"\"))\n        if \"tool_calls\" in msg:\n            for tc in msg[\"tool_calls\"]:\n                total += len(enc.encode(json.dumps(tc)))\n    return total\n\nclass TokenBudget:\n    def __init__(self, total: int, model: str):\n        self.total = total\n        self.model = model\n        self.used = 0\n        self.reserved = 1024 # always reserve for output\n\n    @property\n    def available_for_input(self):\n        return self.total - self.reserved - self.used\n\n    def consume(self, tokens: int):\n        self.used += tokens\n        if self.used > self.total - self.reserved:\n            raise TokenBudgetExceeded(tokens_used=self.used)\n```\n\n### Orchestrator & Supervisor Pattern\n\n#### Orchestrator: Task Decomposition\n\nThe Orchestrator is itself an agent microservice, but its role is planning and coordination rather than execution.\n\n``` python\n# orchestrator/core.py\nimport asyncio\n\nclass OrchestratorAgent:\n    async def execute(self, user_request: str, session_id: str) -> str:\n        # Step 1: Decompose into a DAG of sub-tasks\n        plan = await self.planner.decompose(user_request)\n        # Returns: [{\"id\": \"t1\", \"agent\": \"search\", \"task\": \"...\", \"deps\": []},\n        # {\"id\": \"t2\", \"agent\": \"summarize\", \"task\": \"...\", \"deps\": [\"t1\"]},\n        # {\"id\": \"t3\", \"agent\": \"email\", \"task\": \"...\", \"deps\": [\"t2\"]}]\n\n        # Step 2: Execute in topological order, parallel where possible\n        results = {}\n        for wave in topological_waves(plan):\n            # 
All tasks in a wave have their deps satisfied\n            wave_results = await asyncio.gather(*[\n                self.supervisor.dispatch(step, results)\n                for step in wave\n            ])\n            for step, result in zip(wave, wave_results):\n                results[step[\"id\"]] = result\n\n        # Step 3: Synthesize final output\n        return await self.synthesizer.merge(results, user_request)\n\ndef topological_waves(plan: list) -> list:\n    \"\"\"Return plan steps grouped into parallel execution waves.\"\"\"\n    completed = set()\n    waves = []\n    remaining = list(plan)\n    while remaining:\n        wave = [s for s in remaining if all(d in completed for d in s[\"deps\"])]\n        if not wave:\n            # Nothing is runnable: without this guard the loop would never terminate\n            raise ValueError(\"plan has a cycle or an unknown dependency\")\n        waves.append(wave)\n        completed.update(s[\"id\"] for s in wave)\n        remaining = [s for s in remaining if s[\"id\"] not in completed]\n    return waves\n```\n\n#### Supervisor: Retry & Escalation\n\n``` python\nimport asyncio\n\nclass Supervisor:\n    def __init__(self, agent_clients: dict):\n        self.agent_clients = agent_clients\n\n    async def dispatch(self, step: dict, context: dict) -> StepResult:\n        task_prompt = self._inject_context(step[\"task\"], context, step[\"deps\"])\n\n        for attempt in range(3):\n            try:\n                return await asyncio.wait_for(\n                    self.agent_clients[step[\"agent\"]].run(task_prompt),\n                    timeout=60.0\n                )\n            except asyncio.TimeoutError:\n                if attempt == 2:\n                    raise SupervisorEscalation(step, \"timeout_after_3_attempts\")\n                await asyncio.sleep(2 ** attempt) # backoff: 1s, then 2s\n            except AgentError as e:\n                # Escalate unrecoverable errors at once, and recoverable ones when retries run out,\n                # so the loop never falls through and returns None\n                if e.is_unrecoverable or attempt == 2:\n                    raise SupervisorEscalation(step, str(e))\n                await asyncio.sleep(2 ** attempt)\n\n    def _inject_context(self, task: str, results: dict, dep_ids: list) -> str:\n        context_parts = [results[dep_id].output for dep_id in 
dep_ids if dep_id in results]\n        if context_parts:\n            return f\"Context from previous steps:\\n{chr(10).join(context_parts)}\\n\\nTask: {task}\"\n        return task\n```\n\n### Security & Authorization\n\n#### Agent Identity & JWT Verification\n\nEach agent service must verify that incoming requests are from authorized callers. Use short-lived JWT tokens signed by an internal auth service.\n\n``` python\n# api/middleware.py\nfrom fastapi import Request, HTTPException\nfrom jose import jwt, JWTError\n\nALLOWED_CALLERS = {\"orchestrator\", \"supervisor\", \"api-gateway\"}\n\nasync def verify_agent_token(request: Request):\n    token = request.headers.get(\"Authorization\", \"\").removeprefix(\"Bearer \")\n    if not token:\n        raise HTTPException(status_code=401, detail=\"Missing auth token\")\n    try:\n        payload = jwt.decode(token, PUBLIC_KEY, algorithms=[\"RS256\"])\n        caller = payload.get(\"sub\")\n        if caller not in ALLOWED_CALLERS:\n            raise HTTPException(status_code=403, detail=f\"Caller {caller} not authorized\")\n        request.state.caller = caller\n    except JWTError as e:\n        raise HTTPException(status_code=401, detail=f\"Invalid token: {e}\")\n```\n\n#### Secrets Management\n\nNever store API keys in environment literals or ConfigMaps. 
Use Kubernetes Secrets mounted as environment variables, or preferably HashiCorp Vault with the Vault Agent Sidecar.\n\n```\n# k8s/deployment.yaml (secrets section)\nenv:\n  - name: OPENAI_API_KEY\n    valueFrom:\n      secretKeyRef:\n        name: agent-secrets\n        key: openai-api-key\n  - name: TOOL_REGISTRY_TOKEN\n    valueFrom:\n      secretKeyRef:\n        name: agent-secrets\n        key: tool-registry-token\n```\n\n#### Tool Call Authorization\n\nThe Tool Registry enforces agent-level RBAC: which agents can invoke which tools.\n\n``` python\n# Tool Registry ACL check\nTOOL_ACL = {\n    \"agent-search\": [\"web_search\", \"vector_search\", \"knowledge_base\"],\n    \"agent-email\": [\"send_email\", \"get_email_thread\"],\n    \"agent-code\": [\"code_exec\", \"git_read\", \"package_search\"],\n    \"agent-data\": [\"sql_query\", \"csv_read\", \"chart_generate\"],\n}\n\nasync def check_tool_acl(agent_id: str, tool_name: str):\n    # Unknown agents get an empty allow-list, so they are denied by default\n    allowed_tools = TOOL_ACL.get(agent_id, [])\n    if tool_name not in allowed_tools:\n        raise PermissionError(f\"{agent_id} is not authorized to call {tool_name}\")\n```\n\n### Observability: Traces, Logs, Metrics\n\n#### Distributed Tracing Setup (OpenTelemetry)\n\n``` python\n# observability/tracing.py\nfrom opentelemetry import trace\nfrom opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter\nfrom opentelemetry.sdk.resources import SERVICE_NAME, Resource\nfrom opentelemetry.sdk.trace import TracerProvider\nfrom opentelemetry.sdk.trace.export import BatchSpanProcessor\nfrom opentelemetry.instrumentation.fastapi import FastAPIInstrumentor\nfrom opentelemetry.instrumentation.redis import RedisInstrumentor\nfrom opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor\n\ndef setup_tracing(service_name: str):\n    provider = TracerProvider(\n        resource=Resource(attributes={SERVICE_NAME: service_name})\n    )\n    provider.add_span_processor(\n        BatchSpanProcessor(OTLPSpanExporter(endpoint=OTEL_ENDPOINT))\n    )\n    
trace.set_tracer_provider(provider)\n\n    # Auto-instrument frameworks\n    FastAPIInstrumentor().instrument()\n    RedisInstrumentor().instrument()\n    HTTPXClientInstrumentor().instrument()\n```\n\n#### Standard Span Attributes for Agent Calls\n\nAlways set these attributes on every agent and LLM span:\n\n```\n# In AgentRunner._run_loop:\nspan.set_attribute(\"agent.id\", self.agent_id)\nspan.set_attribute(\"agent.task_id\", task.id)\nspan.set_attribute(\"agent.session_id\", task.session_id)\nspan.set_attribute(\"agent.step\", step)\nspan.set_attribute(\"llm.model\", config.model)\nspan.set_attribute(\"llm.prompt_tokens\", response.usage.prompt_tokens)\nspan.set_attribute(\"llm.completion_tokens\", response.usage.completion_tokens)\nspan.set_attribute(\"llm.finish_reason\", response.finish_reason)\n\n# In Tool Registry on invoke:\nspan.set_attribute(\"tool.name\", tool_name)\nspan.set_attribute(\"tool.version\", tool.version)\nspan.set_attribute(\"tool.caller_agent\", agent_id)\nspan.set_attribute(\"tool.latency_ms\", latency_ms)\n```\n\n#### Prometheus Metrics\n\n``` python\n# observability/metrics.py\nfrom prometheus_client import Counter, Histogram, Gauge\n\nagent_tasks_total = Counter(\n    \"agent_tasks_total\",\n    \"Total tasks processed\",\n    [\"agent_id\", \"status\"]\n)\n\nagent_task_duration = Histogram(\n    \"agent_task_duration_seconds\",\n    \"Task end-to-end latency\",\n    [\"agent_id\"],\n    buckets=[0.5, 1, 2, 5, 10, 30, 60, 120]\n)\n\nagent_llm_tokens = Counter(\n    \"agent_llm_tokens_total\",\n    \"LLM tokens consumed\",\n    [\"agent_id\", \"token_type\"] # token_type: prompt | completion\n)\n\nagent_tool_calls = Counter(\n    \"agent_tool_calls_total\",\n    \"Tool invocations\",\n    [\"agent_id\", \"tool_name\", \"status\"]\n)\n\nagent_steps_per_task = Histogram(\n    \"agent_steps_per_task\",\n    \"Number of steps per task (runaway guard)\",\n    [\"agent_id\"],\n    buckets=[1, 2, 3, 5, 8, 10, 15, 20, 
25]\n)\n\norchestrator_queue_depth = Gauge(\n    \"orchestrator_queue_depth\",\n    \"Pending tasks in orchestrator queue\"\n)\n```\n\n#### Alert Rules\n\n```\n# alerting/rules.yaml\ngroups:\n  - name: agent-alerts\n    rules:\n      # Ratio of failed to all tasks; a bare rate() would be failures per second, not a percentage\n      - alert: AgentHighErrorRate\n        expr: sum by (agent_id) (rate(agent_tasks_total{status=\"failed\"}[5m])) / sum by (agent_id) (rate(agent_tasks_total[5m])) > 0.05\n        for: 2m\n        annotations:\n          summary: \"{{ $labels.agent_id }} failure rate above 5%\"\n\n      # histogram_quantile needs the rate of the _bucket series, aggregated by le\n      - alert: AgentRunawayTask\n        expr: histogram_quantile(0.99, sum by (le, agent_id) (rate(agent_steps_per_task_bucket[5m]))) > 15\n        for: 5m\n        annotations:\n          summary: \"Agent tasks exceeding 15 steps: possible runaway loop\"\n\n      - alert: LLMTokenCostSpike\n        expr: rate(agent_llm_tokens_total[10m]) > 50000\n        for: 5m\n        annotations:\n          summary: \"Token consumption rate spike: check for loops\"\n\n      - alert: AgentLatencyHigh\n        expr: histogram_quantile(0.99, sum by (le, agent_id) (rate(agent_task_duration_seconds_bucket[5m]))) > 10\n        for: 5m\n        annotations:\n          summary: \"p99 task latency above 10s\"\n```\n\n#### Structured Logging\n\n``` python\n# Never log raw prompts or PII. 
Log task IDs and outcome codes.\nimport structlog\n\nlog = structlog.get_logger()\n\nlog.info(\"agent.task.completed\",\n    task_id=task.id,\n    session_id=task.session_id, # hashed in prod\n    agent_id=self.agent_id,\n    steps=result.steps_used,\n    tokens=result.tokens_used,\n    duration_ms=result.duration_ms,\n    tool_calls=result.tool_calls,\n    status=result.status,\n    trace_id=get_current_trace_id()\n)\n```\n\n### Deployment on Kubernetes\n\n#### Deployment Manifest\n\n```\n# k8s/deployment.yaml\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  name: agent-search\n  labels:\n    app: agent-search\n    version: v1.4.2\n    team: ai-platform\nspec:\n  replicas: 2\n  selector:\n    matchLabels:\n      app: agent-search\n  template:\n    metadata:\n      labels:\n        app: agent-search\n        version: v1.4.2\n      annotations:\n        prometheus.io/scrape: \"true\"\n        prometheus.io/path: \"/metrics\"\n        prometheus.io/port: \"8080\"\n    spec:\n      serviceAccountName: agent-search\n      containers:\n        - name: agent\n          image: registry.myco.io/agent-search@sha256:<digest> # Always pin by digest\n          ports:\n            - containerPort: 8080 # HTTP API\n              name: http\n            - containerPort: 50051 # gRPC\n              name: grpc\n          env:\n            - name: AGENT_ID\n              value: \"agent-search\"\n            - name: TOOL_REGISTRY_URL\n              valueFrom: {configMapKeyRef: {name: agent-config, key: tool-registry-url}}\n            - name: REDIS_URL\n              valueFrom: {secretKeyRef: {name: agent-secrets, key: redis-url}}\n            - name: OPENAI_API_KEY\n              valueFrom: {secretKeyRef: {name: agent-secrets, key: openai-api-key}}\n            - name: OTEL_EXPORTER_OTLP_ENDPOINT\n              valueFrom: {configMapKeyRef: {name: observability-config, key: otel-endpoint}}\n          resources:\n            requests:\n              cpu: \"500m\"\n              
memory: \"512Mi\"\n            limits:\n              cpu: \"2\"\n              memory: \"2Gi\"\n          livenessProbe:\n            httpGet:\n              path: /health\n              port: 8080\n            initialDelaySeconds: 10\n            periodSeconds: 15\n            failureThreshold: 3\n          readinessProbe:\n            httpGet:\n              path: /ready\n              port: 8080\n            initialDelaySeconds: 5\n            periodSeconds: 10\n            failureThreshold: 2\n          lifecycle:\n            preStop:\n              exec:\n                command: [\"/bin/sh\", \"-c\", \"sleep 5\"] # drain connections before shutdown\n      topologySpreadConstraints:\n        - maxSkew: 1\n          topologyKey: kubernetes.io/hostname\n          whenUnsatisfiable: DoNotSchedule\n          labelSelector:\n            matchLabels: {app: agent-search}\n```\n\n#### Horizontal Pod Autoscaler (Custom Metrics)\n\nScale on Kafka consumer lag in addition to CPU, so replica count tracks queued work rather than CPU alone. The manifest below uses those two signals. The `External` metric assumes a metrics adapter (for example KEDA or prometheus-adapter) exposes `kafka_consumer_group_lag`; a p99 task latency metric could be added the same way:\n\n```\n# k8s/hpa.yaml\napiVersion: autoscaling/v2\nkind: HorizontalPodAutoscaler\nmetadata:\n  name: agent-search-hpa\nspec:\n  scaleTargetRef:\n    apiVersion: apps/v1\n    kind: Deployment\n    name: agent-search\n  minReplicas: 2\n  maxReplicas: 20\n  behavior:\n    scaleUp:\n      stabilizationWindowSeconds: 60\n      policies:\n        - type: Pods\n          value: 4\n          periodSeconds: 60\n    scaleDown:\n      stabilizationWindowSeconds: 300 # be conservative scaling down\n  metrics:\n    - type: External\n      external:\n        metric:\n          name: kafka_consumer_group_lag\n          selector:\n            matchLabels:\n              topic: agent.tasks.search\n        target:\n          type: AverageValue\n          averageValue: \"100\"\n    - type: Resource\n      resource:\n        name: cpu\n        target:\n          type: Utilization\n          averageUtilization: 70\n```\n\n#### PodDisruptionBudget\n\nKeep at least one replica available through voluntary disruptions, such as the node drains performed 
during rolling updates:\n\n```\n# k8s/pdb.yaml\napiVersion: policy/v1\nkind: PodDisruptionBudget\nmetadata:\n  name: agent-search-pdb\nspec:\n  minAvailable: 1\n  selector:\n    matchLabels:\n      app: agent-search\n```\n\n### Scaling Strategies\n\n#### Per-Agent Scaling Logic\n\n#### Multi-Model Fallback\n\nIf the primary LLM is unavailable or rate-limited, automatically route to a fallback:\n\n```\nclass LLMClient:\n    MODEL_CASCADE = [\n        \"gpt-4o\", # primary\n        \"gpt-4o-mini\", # cheaper fallback\n        \"claude-sonnet-4-6\", # cross-vendor fallback\n    ]\n\n    async def complete(self, messages: list, **kwargs) -> LLMResponse:\n        for model in self.MODEL_CASCADE:\n            try:\n                return await self._call_model(model, messages, **kwargs)\n            except (RateLimitError, ModelUnavailable):\n                log.warning(\"llm.fallback\", from_model=model, reason=\"rate_limit_or_unavailable\")\n                continue\n        raise AllModelsUnavailable()\n```\n\n### Fault Tolerance & Retry Strategies\n\n#### Circuit Breaker on LLM Client\n\n``` python\nfrom circuitbreaker import circuit\n\nclass LLMClientWithCircuitBreaker:\n    @circuit(failure_threshold=5, recovery_timeout=30, expected_exception=LLMError)\n    async def complete(self, messages: list, **kwargs) -> LLMResponse:\n        return await self._raw_complete(messages, **kwargs)\n```\n\nThe circuit opens after 5 consecutive failures and remains open for 30 seconds, serving fallback responses or routing to a secondary model during that window.\n\n#### Exponential Backoff with Jitter\n\n``` python\nfrom tenacity import (\n    retry, stop_after_attempt,\n    wait_exponential_jitter, retry_if_exception_type\n)\n\n@retry(\n    stop=stop_after_attempt(3),\n    wait=wait_exponential_jitter(initial=1, max=60),\n    retry=retry_if_exception_type((RateLimitError, TimeoutError, ServiceUnavailable))\n)\nasync def call_tool_with_retry(tool_name: str, params: dict):\n    
return await tool_registry.invoke(tool_name, params)\n```\n\n#### Dead Letter Queue Handler\n\n```\n# dlq_handler.py: consumes from the dead-letter topic\nimport asyncio\n\nclass DLQHandler:\n    async def process(self, event: AgentTaskEvent):\n        log.error(\"agent.task.dlq\",\n            task_id=event.task_id,\n            target_agent=event.target_agent,\n            attempt_count=event.retry_count,\n            original_error=event.last_error\n        )\n\n        # Alert on-call if error is novel\n        if await self.is_novel_error(event.last_error):\n            await self.pagerduty.alert(event)\n\n        # Store for human review dashboard\n        await self.db.insert_dlq_item(event)\n\n        # Auto-re-queue after 1 hour (optional).\n        # NOTE: sleeping inline parks this coroutine for the full hour; run process() as its own task per event,\n        # or move the delay to a delayed-retry topic, so the consumer keeps draining the DLQ.\n        if event.retry_count < 2 and event.auto_retry_eligible:\n            await asyncio.sleep(3600)\n            event.retry_count += 1\n            await self.kafka.send(\"agent.tasks.\" + event.target_agent, event)\n```\n\n#### Step-Level Checkpointing\n\n``` python\nimport json\n\nclass CheckpointedAgentRunner(AgentRunner):\n    async def _run_loop(self, task: AgentTask, span) -> AgentResult:\n        # Restore from checkpoint if available\n        checkpoint = await self.redis.get(f\"checkpoint:{task.id}\")\n        if checkpoint:\n            state = json.loads(checkpoint)\n            messages = state[\"messages\"]\n            total_tokens = state[\"total_tokens\"]\n            start_step = state[\"step\"] + 1\n            log.info(\"agent.checkpoint.restored\", task_id=task.id, step=start_step)\n        else:\n            context = await self.memory.load(task.session_id)\n            messages = build_messages(context, task.prompt)\n            total_tokens = 0\n            start_step = 0\n\n        for step in range(start_step, task.max_steps):\n            # tool_schemas: assumed to be resolved as in the base AgentRunner (not shown in this excerpt)\n            response = await self._complete_with_retry(messages, tool_schemas)\n            messages.append(response.message)\n            # Keep the running total so a restored run resumes with the right count\n            total_tokens += response.usage.total_tokens\n\n            # Persist checkpoint after each step (messages must be JSON-serializable, e.g. plain dicts)\n   
         await self.redis.setex(\n                f\"checkpoint:{task.id}\",\n                3600,\n                json.dumps({\"messages\": messages, \"total_tokens\": total_tokens, \"step\": step})\n            )\n\n            if response.finish_reason == \"stop\":\n                await self.redis.delete(f\"checkpoint:{task.id}\")\n                break\n\n        return build_result(task, response, total_tokens, step)\n```\n\n### Testing Agent Microservices\n\n#### Testing Pyramid\n\n``` python\n# tests/unit/test_agent_runner.py\n# The async tests assume the pytest-asyncio plugin (or asyncio_mode = auto in the pytest config)\nimport pytest\nfrom unittest.mock import AsyncMock, patch\n\n@pytest.fixture\ndef mock_llm():\n    llm = AsyncMock()\n    llm.complete.return_value = LLMResponse(\n        content=\"Here is the search result.\",\n        finish_reason=\"stop\",\n        usage=Usage(prompt_tokens=100, completion_tokens=50, total_tokens=150)\n    )\n    return llm\n\n@pytest.mark.asyncio\nasync def test_agent_completes_in_one_step(mock_llm):\n    runner = AgentRunner(\"agent-search\", test_config)\n    runner.llm = mock_llm\n\n    result = await runner.run(AgentTask(id=\"t1\", session_id=\"s1\", prompt=\"find AI news\"))\n\n    assert result.status == TaskStatus.COMPLETED\n    assert result.steps_used == 1\n    assert result.tokens_used == 150\n    mock_llm.complete.assert_called_once()\n\n@pytest.mark.asyncio\nasync def test_agent_respects_token_budget(mock_llm):\n    # A single 1000-token response against a 500-token budget must abort the task\n    mock_llm.complete.return_value = LLMResponse(\n        content=\"...\", finish_reason=\"tool_calls\",\n        usage=Usage(prompt_tokens=900, completion_tokens=100, total_tokens=1000)\n    )\n    task = AgentTask(id=\"t1\", session_id=\"s1\", prompt=\"...\", token_budget=500)\n    runner = AgentRunner(\"agent-search\", test_config)\n    runner.llm = mock_llm\n\n    result = await runner.run(task)\n    assert result.error == \"token_budget_exceeded\"\n```\n\n#### Integration Testing with a Mock LLM Server\n\nUse a local mock LLM server (e.g., WireMock or a FastAPI stub) that returns deterministic responses for testing tool call 
flows end-to-end without hitting real APIs.\n\n```\n# tests/integration/test_tool_flow.py\nasync def test_search_agent_calls_web_search_tool(mock_llm_server, real_redis, real_tool_registry):\n    # Configure mock LLM to respond with a tool call on first turn\n    mock_llm_server.set_response(step=0, response=TOOL_CALL_RESPONSE)\n    mock_llm_server.set_response(step=1, response=FINAL_RESPONSE)\n\n    runner = AgentRunner(\"agent-search\", integration_config)\n    result = await runner.run(AgentTask(id=\"t1\", session_id=\"s1\", prompt=\"Search for AI news\"))\n\n    assert result.status == TaskStatus.COMPLETED\n    assert result.tool_calls == 1\n    assert real_tool_registry.was_invoked(\"web_search\")\n```\n\n#### Chaos Testing\n\nUse Chaos Mesh or Litmus to test resilience:\n\n- **Pod kill:** Kill a random agent pod; verify Supervisor retries succeed\n- **Network partition:** Block agent→tool-registry traffic; verify circuit breaker opens\n- **LLM latency injection:** Add 15s delay to LLM calls; verify timeout and fallback activate\n- **Kafka partition leader election:** Simulate Kafka failover; verify no task loss via consumer offset management\n\n### CI/CD Pipeline for Agent Services\n\n```\n# .github/workflows/agent-service.yml\nname: Agent Service CI/CD\n\non:\n  push:\n    paths: [\"agents/agent-search/**\"]\n\njobs:\n  test:\n    runs-on: ubuntu-latest\n    services:\n      redis:\n        image: redis:7-alpine\n        ports: [\"6379:6379\"]\n    steps:\n      - uses: actions/checkout@v4\n      - uses: actions/setup-python@v5\n        with: {python-version: \"3.12\"}\n      - run: pip install -e \".[dev]\"\n      - run: pytest tests/ --cov=agent --cov-fail-under=85\n\n  security-scan:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n      - name: Build image locally so Trivy has something to scan\n        run: docker build -t agent-search:${{ github.sha }} agents/agent-search/\n      - name: Trivy vulnerability scan\n        uses: aquasecurity/trivy-action@master\n        with: {image-ref: \"agent-search:${{ github.sha }}\", exit-code: \"1\"}\n\n  build-push:\n    
needs: [test, security-scan]\n    runs-on: ubuntu-latest\n    outputs:\n      image_digest: ${{ steps.build.outputs.digest }}\n    steps:\n      - uses: actions/checkout@v4\n      - name: Build and push (pinned by digest)\n        id: build\n        run: |\n          docker buildx build --platform linux/amd64,linux/arm64 \\\n            -t registry.myco.io/agent-search:${{ github.sha }} \\\n            --metadata-file build-metadata.json \\\n            --push agents/agent-search/\n          # Capture the pushed digest (sha256:...) as a job output;\n          # values written to GITHUB_ENV are not visible to other jobs\n          DIGEST=$(jq -r '.\"containerimage.digest\"' build-metadata.json)\n          echo \"digest=$DIGEST\" >> \"$GITHUB_OUTPUT\"\n\n  deploy-staging:\n    needs: build-push\n    runs-on: ubuntu-latest\n    steps:\n      - name: Deploy to staging\n        run: |\n          kubectl set image deployment/agent-search \\\n            agent=registry.myco.io/agent-search@${{ needs.build-push.outputs.image_digest }} \\\n            -n staging\n          kubectl rollout status deployment/agent-search -n staging --timeout=120s\n\n  smoke-test-staging:\n    needs: deploy-staging\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n      - run: python tests/smoke/run_smoke_tests.py --env staging\n\n  deploy-production:\n    needs: [build-push, smoke-test-staging] # build-push is listed so its digest output is readable here\n    runs-on: ubuntu-latest\n    environment: production\n    steps:\n      - name: Rolling deploy to production\n        run: |\n          kubectl set image deployment/agent-search \\\n            agent=registry.myco.io/agent-search@${{ needs.build-push.outputs.image_digest }} \\\n            -n production\n          kubectl rollout status deployment/agent-search -n production --timeout=300s\n```\n\n### Cost Management & Token Budgeting\n\n#### Per-Agent Token Accounting\n\nTrack token usage per agent, per session, and per user to enable chargebacks and anomaly detection.\n\n``` python\nclass TokenAccountant:\n    async def record(self, agent_id: str, session_id: str, usage: Usage):\n        # Increment per-agent daily counter\n        await self.redis.incrby(f\"tokens:{agent_id}:{today()}\", usage.total_tokens)\n        await self.redis.expire(f\"tokens:{agent_id}:{today()}\", 86400 * 7)\n\n        # Increment per-session 
counter (for user billing)\n        await self.redis.incrby(f\"tokens:session:{session_id}\", usage.total_tokens)\n\n        # Write to time-series DB for cost dashboards\n        await self.influx.write(\n            measurement=\"llm_tokens\",\n            tags={\"agent_id\": agent_id, \"model\": usage.model},\n            fields={\"prompt\": usage.prompt_tokens, \"completion\": usage.completion_tokens},\n        )\n\nasync def get_estimated_cost(agent_id: str) -> float:\n    tokens = int(await redis.get(f\"tokens:{agent_id}:{today()}\") or 0)\n    # GPT-4o pricing: $2.50/1M prompt, $10/1M completion (example)\n    return (tokens / 1_000_000) * 5.0 # blended estimate\n```\n\n#### Budget Enforcement at Session Level\n\n```\nMAX_SESSION_TOKENS = 50_000 # hard cap per user session\n\nasync def check_session_budget(session_id: str):\n    used = int(await redis.get(f\"tokens:session:{session_id}\") or 0)\n    if used > MAX_SESSION_TOKENS:\n        raise SessionBudgetExceeded(\n            session_id=session_id,\n            tokens_used=used,\n            limit=MAX_SESSION_TOKENS\n        )\n```\n\n### Production Readiness Checklist\n\n#### Service-Level Requirements\n\n- Agent has /health endpoint that checks LLM client connectivity\n- Agent has /ready endpoint that checks memory store (Redis) and Tool Registry reachability\n- All tool calls are schema-validated by Tool Registry before execution\n- Agent-level RBAC enforced: agent X cannot invoke tools it is not authorized for\n- JWT verification on all inter-agent gRPC and HTTP calls\n- Secrets loaded from Kubernetes Secrets or Vault — never from env literals or ConfigMaps\n\n#### Reliability Requirements\n\n- Context window size is bounded — no unbounded message history growth\n- Token budget enforced per task with hard ceiling\n- MAX_STEPS guard in place to prevent runaway loops\n- Exponential backoff with jitter on all LLM calls\n- Circuit breaker configured on LLM client (threshold, recovery timeout)\n- 
Exponential backoff on all tool calls\n- Failed tasks routed to Dead Letter Queue — not silently dropped\n- Step-level checkpointing for tasks expected to exceed 60 seconds\n- Multi-model fallback cascade configured (primary → cheaper → cross-vendor)\n\n#### Observability Requirements\n\n- OpenTelemetry distributed tracing with trace context propagation\n- All LLM completions traced with token counts and latency\n- All tool calls traced with tool name, version, and outcome\n- Prometheus metrics exported: task count, duration, token usage, tool calls, step count\n- Alerts configured: high error rate, runaway steps, token cost spike, high latency\n- Structured logging (JSON) with task_id, session_id (hashed), trace_id — no raw prompt content\n\n#### Deployment Requirements\n\n- Agent image pinned to digest, not mutable tag (never :latest)\n- HPA configured with appropriate metrics (queue lag and latency, not just CPU)\n- PodDisruptionBudget set (minAvailable >= 1)\n- Pod topology spread constraints configured for HA across nodes\n- Resource requests and limits set (no QoS class “BestEffort”)\n- Rolling update strategy with preStop sleep for graceful shutdown\n- Integration tests cover “tool call fails → agent recovers” path\n- Load tests simulate 10× expected peak concurrency before go-live\n\n#### Cost Control Requirements\n\n- Token usage recorded per agent and per session\n- Session-level budget cap enforced\n- Token cost alerting configured per agent\n- DLQ monitored — no silent retry storms", "url": "https://wpnews.pro/news/building-micro-agents-as-production-grade-microservices", "canonical_source": "https://dev.to/murali8k/building-micro-agents-as-production-grade-microservices-f4j", "published_at": "2026-05-24 04:24:16+00:00", "updated_at": "2026-05-24 05:02:28.513968+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "developer-tools", "cloud-computing"], "entities": ["FastAPI", "gRPC", "Kafka", 
"Kubernetes", "OpenTelemetry", "Python", "Redis", "WebSocket"], "alternates": {"html": "https://wpnews.pro/news/building-micro-agents-as-production-grade-microservices", "markdown": "https://wpnews.pro/news/building-micro-agents-as-production-grade-microservices.md", "text": "https://wpnews.pro/news/building-micro-agents-as-production-grade-microservices.txt", "jsonld": "https://wpnews.pro/news/building-micro-agents-as-production-grade-microservices.jsonld"}}