{"slug": "build-long-running-ai-agents-that-pause-resume-and-never-lose-context-with-adk", "title": "Build Long-running AI agents that pause, resume, and never lose context with ADK", "summary": "Stateless chatbots fail in enterprise workflows like HR onboarding or invoice disputes, which require long pauses and multi-step processes spanning days or weeks. This tutorial introduces the Agent Development Kit (ADK) for building long-running AI agents that can pause, resume, and maintain context without relying on raw conversation history. It covers three key architectural shifts (explicit state schemas, persistent sessions, and decoupled memory) that prevent context pollution, token cost explosion, and reasoning hallucinations during idle periods.", "body_md": "Most agent tutorials end at a stateless chatbot – a conversational loop that forgets everything the moment the container restarts. Real enterprise workflows don't wrap up in a single API call.\n\nHR onboarding spans two weeks. Invoice disputes stall for days waiting on vendor replies. Sales prospecting sequences stretch across multiple touchpoints over a month. These processes are dominated by \"idle time\" – long pauses where an agent sits dormant, waiting for a human signature, a shipping confirmation, or an approval gate. A stateless chatbot can't survive that.\n\nThis tutorial walks through building a **New Hire Onboarding Coordinator Agent** with the [Agent Development Kit (ADK)](https://adk.dev/) that runs reliably for weeks. 
The agent sends a welcome packet, pauses for days while the employee signs documents, delegates IT provisioning to a specialized sub-agent, waits again for hardware delivery, and finally sends a personalized day-one schedule – all without losing a single byte of context.\n\nAlong the way, you'll learn three architectural shifts that separate production agents from demo chatbots: explicit state schemas, persistent sessions, and decoupled memory.\n\nThe complete source code is available on [GitHub](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/agents/adk/new-hire-onboarding).\n\nThe standard stateless pattern appends every user message and model response to a growing conversation history, then feeds the entire blob back into the next LLM call. This works fine for a five-minute Q&A session. It falls apart over days or weeks in three specific ways:\n\n**Prompt context pollution:** After hundreds of turns spread across a two-week onboarding flow, the conversation history fills up with irrelevant chatter, old tool outputs, and duplicated instructions. The model starts confusing which step it's on.\n\n**Token cost explosion:** Replaying a full two-week conversation history on every inference call burns through token budgets fast. A single onboarding run could generate thousands of turns – most of them no longer relevant to the current decision.\n\n**Reasoning hallucinations over idle time:** When an agent pauses for three days waiting on a document signature, then resumes with a massive context dump, the model frequently hallucinates intermediate steps that never happened. It \"remembers\" approvals that weren't given or skips steps it assumes were completed.\n\nThe fix isn't a bigger context window. It's a fundamentally different architecture – one where the agent's state is explicit, durable, and decoupled from raw chat history.\n\nConsider what happens when a company brings on a new employee: a welcome packet goes out, documents wait on signatures, IT provisions accounts, hardware ships, and a day-one schedule follows.\n\nThis isn't a single conversation. 
It's a background process with multiple pause-and-resume cycles, human approval gates, and cross-team handoffs. The same pattern shows up in invoice dispute resolution (pause for vendor reply, resume for AP routing), sales prospecting (pause between outreach touchpoints), and dozens of other operational workflows.\n\nThe [Agents CLI](https://github.com/google/agents-cli) is the official command-line interface for the Gemini Enterprise Agent Platform. Rather than running CLI commands manually, the workflow in this tutorial uses a coding agent to do the heavy lifting. Feed it a high-level, intent-driven prompt, and it handles the scaffolding for you. First, install the CLI globally:\n\n```\nuv tool install google-agents-cli\n```\n\nThen give your coding agent this prompt:\n\n```\nCreate an HR onboarding agent using ADK. It needs to run as a long-running background process with persistent sessions.\n```\n\nThe coding agent runs the appropriate agents-cli commands, generates the project structure, and wires up persistent session and memory bank settings from the start. This iterative prompt-driven approach continues throughout the tutorial: describe what you need, and the coding agent produces the code shown in each section below.\n\nInstead of relying on conversation history to track progress, define an explicit state schema that tells the agent exactly where it is in the workflow at all times. Give your coding agent this prompt:\n\n```\n\"Add a state machine to track onboarding progress. I need steps like START, WELCOME_SENT, DOCUMENTS_SIGNED, IT_PROVISIONED, HARDWARE_DELIVERED, and COMPLETED. 
The agent should read its current step from the session state, not from chat history.\"\n```\n\nCreate a simple class with named constants for each checkpoint in the onboarding flow:\n\n```\n# app/state_schema.py\n\nclass OnboardingStep:\n    START = \"START\"\n    WELCOME_SENT = \"WELCOME_SENT\"\n    DOCUMENTS_SIGNED = \"DOCUMENTS_SIGNED\"\n    IT_PROVISIONED = \"IT_PROVISIONED\"\n    HARDWARE_DELIVERED = \"HARDWARE_DELIVERED\"\n    COMPLETED = \"COMPLETED\"\n```\n\nSix states. No ambiguity. The agent can't skip a step or hallucinate progress because the state machine enforces the sequence.\n\nThe agent's system prompt reads its current position directly from session state variables – not from replaying old messages:\n\n``` python\n# app/agent.py\n\nfrom google.adk.agents import Agent\nfrom google.adk.agents.callback_context import CallbackContext\nfrom google.adk.models import Gemini\nfrom app.state_schema import OnboardingStep\nfrom app.tools import (\n    send_welcome_packet,\n    check_hardware_delivery,\n    send_day_one_schedule,\n)\n\nasync def initialize_onboarding_state(callback_context: CallbackContext) -> None:\n    \"\"\"Ensures all state machine keys are initialized to prevent errors.\"\"\"\n    state = callback_context.state\n    if \"current_step\" not in state:\n        state[\"current_step\"] = OnboardingStep.START\n    if \"new_hire_details\" not in state:\n        state[\"new_hire_details\"] = {}\n    if \"pending_signals\" not in state:\n        state[\"pending_signals\"] = []\n\ninstruction = \"\"\"You are an HR Onboarding Coordinator Agent.\n\nCurrent Step: {current_step}\nNew Hire Details: {new_hire_details}\nPending Signals: {pending_signals}\n\nFollow this state machine flow exactly:\n1. If current_step is 'START': Ask for name, email, and start date. Then invoke 'send_welcome_packet'.\n2. If current_step is 'WELCOME_SENT': Inform the user you are paused waiting for document signatures. Do not call other tools.\n3. 
If current_step is 'DOCUMENTS_SIGNED': Delegate IT provisioning to 'it_agent'.\n4. If current_step is 'IT_PROVISIONED': Ask for the hardware tracking ID, then invoke 'check_hardware_delivery'.\n5. If current_step is 'HARDWARE_DELIVERED': Invoke 'send_day_one_schedule'.\n6. If current_step is 'COMPLETED': Confirm onboarding is done.\n\nAlways stay grounded in your tools and current state. Do not skip steps.\"\"\"\n```\n\nBecause `{current_step}`, `{new_hire_details}`, and `{pending_signals}` appear directly in the instruction, ADK fills these placeholders from session state every time the agent runs. The model always sees the exact status of the onboarding workflow without needing to guess or dig through old chat messages.\n\nEach tool function updates the checkpoint atomically through ADK's `ToolContext.state`:\n\n``` python\n# app/tools.py\n\nfrom google.adk.tools import ToolContext\nfrom app.state_schema import OnboardingStep\n\ndef send_welcome_packet(\n    name: str, email: str, start_date: str, tool_context: ToolContext\n) -> dict:\n    \"\"\"Sends the welcome packet and transitions to WELCOME_SENT.\"\"\"\n    state = tool_context.state\n    state[\"new_hire_details\"] = {\n        \"name\": name, \"email\": email, \"start_date\": start_date\n    }\n    state[\"current_step\"] = OnboardingStep.WELCOME_SENT\n    state[\"pending_signals\"] = [\"document_signed\"]\n\n    return {\n        \"status\": \"success\",\n        \"message\": f\"Welcome packet sent to {name} ({email}). Documents pending signature.\",\n    }\n```\n\nEvery tool call creates an automatic checkpoint. If the container crashes immediately after `send_welcome_packet` runs, the state has already been written. When the agent restarts, it reads `current_step = WELCOME_SENT` and picks up exactly where it left off.\n\nThe state machine is only durable if the underlying session storage survives restarts. 
In a containerized environment like [Cloud Run](https://cloud.google.com/run?e=48754805), containers cold-start, scale to zero during idle periods, and restart unexpectedly. If sessions live in volatile memory, every in-flight onboarding run is lost. Give your coding agent this prompt:\n\n```\n\"Switch our session storage to persistent SQLite so the agent survives server restarts.\"\n```\n\nSwap in-memory sessions for ADK's `DatabaseSessionService` backed by SQLite (locally) or Cloud SQL (in production):\n\n``` python\n# app/fast_api_app.py\n\nimport os\n\nfrom fastapi import FastAPI\nfrom google.adk.cli.fast_api import get_fast_api_app\nfrom google.adk.sessions.database_session_service import DatabaseSessionService\n\n# Assumed layout: this file lives in app/, so the agents directory is its parent.\nAGENT_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))\n\n# Persistent SQLite session configuration\nsession_service_uri = \"sqlite+aiosqlite:///sessions.db\"\n\napp: FastAPI = get_fast_api_app(\n    agents_dir=AGENT_DIR,\n    web=True,\n    session_service_uri=session_service_uri,\n)\n```\n\nThat's it. One configuration change, and every `ToolContext.state` write is durably persisted to disk. Kill the server mid-onboarding, restart it, and the agent resumes from the correct checkpoint with all new hire details intact.\n\nFor production deployments, replace the SQLite URI with a Cloud SQL connection string – the API is identical.\n\nIdle time is the defining challenge of long-running agents. After sending the welcome packet, the agent enters a dormant state that might last days while the employee signs documents. Active polling wastes compute. Blocked threads don't scale. The agent needs to sleep – truly sleep – and wake up only when an external event arrives. Give your coding agent this prompt:\n\n```\n\"Add webhook endpoints for document signature and hardware delivery. 
When a webhook fires, the agent should wake up, hydrate its session, and pick up where it left off.\"\n```\n\nExpose FastAPI endpoints that external systems (or a demo UI) call when real-world events complete:\n\n``` python\n# app/fast_api_app.py\n# Continues the file above: `app`, `DatabaseSessionService`, and\n# `session_service_uri` are already defined there.\n\nfrom google.adk.runners import Runner\nfrom pydantic import BaseModel\n\n# Assumption: app/agent.py exports an ADK App object named `app`.\nfrom app.agent import app as agent_app\nfrom app.resume_handler import OnboardingResumeHandler\n\ndb_session_service = DatabaseSessionService(db_url=session_service_uri)\nwebhook_runner = Runner(app=agent_app, session_service=db_session_service)\nresume_handler = OnboardingResumeHandler(runner=webhook_runner)\n\nclass WebhookPayload(BaseModel):\n    user_id: str\n    session_id: str\n\n@app.post(\"/webhooks/document_signed\")\nasync def trigger_document_signed_webhook(payload: WebhookPayload) -> dict[str, str]:\n    \"\"\"Wakes up the onboarding agent when the employee signs their contract.\"\"\"\n    await resume_handler.receive_signed_documents_callback(\n        user_id=payload.user_id, session_id=payload.session_id\n    )\n    return {\"status\": \"success\", \"message\": \"Document signature processed, agent resumed.\"}\n```\n\nThe `OnboardingResumeHandler` hydrates the persisted session, transitions the state machine, and wakes the agent programmatically using `runner.run_async` with a `state_delta`:\n\n``` python\n# app/resume_handler.py\n\nimport json\nimport logging\n\nfrom google.adk.runners import Runner\nfrom google.genai import types\nfrom app.state_schema import OnboardingStep\n\nlogger = logging.getLogger(__name__)\n\nclass OnboardingResumeHandler:\n    def __init__(self, runner: Runner):\n        self.runner = runner\n\n    async def receive_signed_documents_callback(\n        self, user_id: str, session_id: str\n    ) -> None:\n        \"\"\"Hydrates the session, transitions to DOCUMENTS_SIGNED, and resumes.\"\"\"\n        async for event in self.runner.run_async(\n            user_id=user_id,\n            session_id=session_id,\n            new_message=types.Content(\n                role=\"user\",\n 
               parts=[types.Part.from_text(\n                    text=\"Resume onboarding: Contract has been signed.\"\n                )],\n            ),\n            state_delta={\n                \"current_step\": OnboardingStep.DOCUMENTS_SIGNED,\n                \"pending_signals\": [],\n            },\n        ):\n            logger.info(json.dumps({\n                \"severity\": \"INFO\",\n                \"message\": f\"Wake-up execution event: {event}\",\n                \"event\": \"runner_event\",\n                \"session_id\": session_id,\n            }))\n```\n\nThe key mechanism is `state_delta`. When the webhook fires, `run_async` atomically applies the state transition *before* the agent's next inference call. The model sees `current_step = DOCUMENTS_SIGNED` in its system prompt and immediately knows to delegate IT provisioning – no replaying of old conversation history, no hallucinated intermediate steps.\n\nThe same pattern applies to the hardware delivery webhook. The container can scale to zero during the entire idle period. When the webhook arrives, the container spins up, the session is hydrated from the session database, and the agent resumes its reasoning chain exactly where it paused.\n\nStuffing all tools into a single agent's system prompt degrades reasoning quality, especially in long-running contexts where the prompt is already loaded with state variables and workflow instructions. ADK's multi-agent architecture lets you delegate specialized tasks to focused sub-agents. Give your coding agent this prompt:\n\n```\n\"Don't put IT provisioning in the main agent. 
Create a separate it_agent sub-agent that handles setting up corporate accounts, and have the coordinator delegate to it after documents are signed.\"\n```\n\nThe onboarding coordinator delegates IT provisioning to a dedicated `it_agent`:\n\n``` python\n# app/agent.py\n\nfrom app.tools import provision_software_accounts\n\nit_agent = Agent(\n    name=\"it_agent\",\n    model=Gemini(model=\"gemini-3.1-flash-lite\"),\n    instruction=\"\"\"You are an IT Provisioning Agent. Provision corporate software\n    accounts (email, Slack) for the new hire.\n\n    Current Step: {current_step}\n    New Hire Details: {new_hire_details}\n\n    1. Collect the desired corporate username prefix.\n    2. Invoke 'provision_software_accounts'.\n    3. After provisioning, transfer control back to the coordinator.\"\"\",\n    tools=[provision_software_accounts],\n)\n\nroot_agent = Agent(\n    name=\"hr_onboarding_coordinator\",\n    model=Gemini(model=\"gemini-3.1-flash-lite\"),\n    instruction=instruction,\n    tools=[send_welcome_packet, check_hardware_delivery, send_day_one_schedule],\n    sub_agents=[it_agent],\n    before_agent_callback=initialize_onboarding_state,\n)\n```\n\nWhen the coordinator reaches `DOCUMENTS_SIGNED`, it transfers execution to `it_agent`. The sub-agent handles account provisioning independently, updates the shared state to `IT_PROVISIONED`, and hands control back. Each agent has a focused prompt and a narrow tool set, which keeps reasoning sharp even after weeks of accumulated state.\n\nNotice that when creating the `root_agent`, we pass `initialize_onboarding_state` to the `before_agent_callback` parameter. ADK runs this callback before the agent executes, and the function only fills in keys that are missing, so all our tracking variables exist from the very first interaction and are never overwritten on later runs. 
Because the agent dynamically fills those variables into its prompt every time it wakes up, it knows exactly where it stands, no matter how many days pass between steps.\n\nYou can't wait two weeks to find out your agent skips a step. ADK evaluation sets let you simulate idle time delays and webhook triggers in seconds by pre-seeding session state. Give your coding agent this prompt:\n\n```\n\"Write eval tests that simulate idle time. I need a test where the agent waits 48 hours for hardware delivery, resumes, and still remembers the new hire's details.\"\n```\n\nHere's a golden test case that verifies the agent correctly enforces the idle-time pause gate – refusing to skip ahead when asked:\n\n```\n{\n  \"eval_id\": \"idle_time_pause_safety_gate\",\n  \"conversation\": [\n    {\n      \"user_content\": {\"parts\": [{\"text\": \"Start onboarding for Jane Doe, email: jane@example.com, starting on 2026-06-01.\"}]},\n      \"intermediate_data\": {\n        \"tool_uses\": [{\"name\": \"send_welcome_packet\", \"args\": {\"name\": \"Jane Doe\", \"email\": \"jane@example.com\", \"start_date\": \"2026-06-01\"}}]\n      }\n    },\n    {\n      \"user_content\": {\"parts\": [{\"text\": \"Can we skip the document signing and provision corporate accounts now?\"}]},\n      \"final_response\": {\"parts\": [{\"text\": \"waiting for the employee to sign\"}]},\n      \"intermediate_data\": {\"tool_uses\": []}\n    }\n  ]\n}\n```\n\nThe second turn verifies that the agent refuses to call any tools and stays in the `WELCOME_SENT` gate. 
A second test case pre-seeds the state to `IT_PROVISIONED` and confirms the agent correctly resumes after a simulated 48-hour hardware delay, calling `check_hardware_delivery` and `send_day_one_schedule` in sequence without dropping the new hire's original context.\n\nRun evaluations locally:\n\n```\n.venv/bin/adk eval ./app tests/eval/evalsets/idle_time_delay_eval.json \\\n  --config_file_path tests/eval/eval_config.json\n```\n\nThese golden tests slot directly into CI/CD pipelines, catching state machine regressions before they reach production.\n\nWhen evaluations pass, it's time to deploy. Give your coding agent this prompt:\n\n```\n\"Deploy this to Agent Runtime with Cloud Trace enabled so we can monitor pause-and-resume latencies in production.\"\n```\n\nThe coding agent scaffolds the `AgentEngineApp` wrapper that bridges your ADK application to Agent Runtime:\n\n``` python\n# app/agent_runtime_app.py\n\nimport vertexai\nfrom vertexai.agent_engines.templates.adk import AdkApp\nfrom app.agent import app as adk_app\n\nclass AgentEngineApp(AdkApp):\n    def set_up(self) -> None:\n        \"\"\"Initialize with logging and telemetry.\"\"\"\n        vertexai.init()\n        super().set_up()\n\nagent_runtime = AgentEngineApp(app=adk_app)\n```\n\nDeploy with a single command:\n\n```\nagents-cli deploy\n```\n\nAgent Runtime handles session persistence, auto-scaling (including scale-to-zero during idle time), and Cloud Trace integration out of the box. The same checkpoint-and-resume architecture that runs locally against SQLite works in production against managed cloud storage – no code changes required.\n\nStateless agents are a subset of what agents can be. 
The patterns in this tutorial – durable state machines, persistent checkpoint-and-resume, event-driven idle time handling, and multi-agent delegation – transform agents from conversational toys into production background processes that reliably manage workflows spanning days or weeks.\n\nTo get started:\n\nThe onboarding agent is just one example. Any workflow with human-in-the-loop pauses, cross-system handoffs, or multi-day timelines is a candidate for this architecture. Invoice disputes, procurement approvals, sales prospecting sequences, compliance audits – the pattern is the same. Define the state machine, persist the checkpoints, sleep through the idle time, and wake up exactly where you left off.", "url": "https://wpnews.pro/news/build-long-running-ai-agents-that-pause-resume-and-never-lose-context-with-adk", "canonical_source": "https://developers.googleblog.com/build-long-running-ai-agents-that-pause-resume-and-never-lose-context-with-adk/", "published_at": "2026-05-20 03:10:49.705567+00:00", "updated_at": "2026-05-20 03:10:53.547924+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "developer-tools", "enterprise-software"], "entities": ["Agent Development Kit", "ADK", "GitHub", "New Hire Onboarding Coordinator Agent"], "alternates": {"html": "https://wpnews.pro/news/build-long-running-ai-agents-that-pause-resume-and-never-lose-context-with-adk", "markdown": "https://wpnews.pro/news/build-long-running-ai-agents-that-pause-resume-and-never-lose-context-with-adk.md", "text": "https://wpnews.pro/news/build-long-running-ai-agents-that-pause-resume-and-never-lose-context-with-adk.txt", "jsonld": "https://wpnews.pro/news/build-long-running-ai-agents-that-pause-resume-and-never-lose-context-with-adk.jsonld"}}