{"slug": "what-you-actually-need-to-ship-an-ai-agent", "title": "What you actually need to ship an AI agent", "summary": "A developer shares practical advice for shipping AI agents that survive production, emphasizing that the model is not the bottleneck—state management, crash recovery, and tool integration are. The post recommends LangGraph with Postgres checkpointing for resilient agent state, warns against over-engineering, and notes that MCP (Model Context Protocol) is promising but has uneven security maturity, citing a recent high-severity CVE in a community package.", "body_md": "Everyone's building agents. Half of them are running. The other half have \"active plans.\"\n\nI've been in both camps. The difference isn't the model. Models have been good enough for a while now. It's everything around the model that nobody talks about in tutorials because tutorials end when the demo works.\n\nThis is the stuff that bit me. Take it or leave it.\n\nWorth asking before you pick a framework.\n\nThe cases where agents actually make sense — and I mean actually get used, not just demoed — are pretty narrow:\n\nYou have work that's too variable for a simple automation but too repetitive for a human to do 500 times a day. Customer support triage where the agent needs to know who the user is, what plan they're on, what happened in their last three sessions. Internal ops: pull from four systems, write a Slack summary, done. SaaS features where \"AI that knows your account\" is the actual value, not a generic chatbot bolted on.\n\nWhat all of these have in common: the agent needs to remember things. Needs to know who's asking. Needs to not lose its mind when a tool call fails or an LLM provider has a bad afternoon.\n\nEverything below is about making that work.\n\nUse LangGraph. 
Not because it's elegant (it's not always), but because it handles the stuff that kills you in production and nothing else does it as well right now.\n\nThe things that matter: state persists across crashes (you don't restart from zero when something hiccups). You can pause mid-execution, wait for a human to approve something, resume. Parallel tool calls without data races. Explicit control flow so you actually know what's running.\n\nHere's what setting up LangGraph with Postgres checkpointing actually looks like. This is the part that makes your agent survive crashes:\n\n``` python\nimport os\n\nfrom langchain_core.messages import HumanMessage\nfrom langgraph.checkpoint.postgres import PostgresSaver\nfrom langgraph.graph import START, StateGraph\nfrom psycopg.rows import dict_row\nfrom psycopg_pool import ConnectionPool\n\n# AgentState, reason_node, act_node, user_input and user_id are defined elsewhere in your app\n# PostgresSaver expects autocommit connections that return dict rows\npool = ConnectionPool(\n    conninfo=os.environ[\"NHOST_DATABASE_URL\"],\n    kwargs={\"autocommit\": True, \"row_factory\": dict_row},\n)\ncheckpointer = PostgresSaver(pool)\ncheckpointer.setup()  # creates the checkpoint tables; only needed on first run\n\ngraph = StateGraph(AgentState)\ngraph.add_node(\"reason\", reason_node)\ngraph.add_node(\"act\", act_node)\ngraph.add_edge(START, \"reason\")  # entry point; compile() fails without one\ngraph.add_edge(\"reason\", \"act\")\ngraph.add_edge(\"act\", \"reason\")  # in a real graph, route to END conditionally so the loop can stop\n\napp = graph.compile(checkpointer=checkpointer)\n\n# every run is now resumable: crash mid-execution, pick up from last checkpoint\nresult = app.invoke(\n    {\"messages\": [HumanMessage(content=user_input)]},\n    config={\"configurable\": {\"thread_id\": user_id}}  # per-user state\n)\n```\n\n`thread_id` is the key thing here. Pass the user's ID and each user's state is isolated, resumable, and persisted automatically.\n\nThe thing nobody warns you about: you can massively over-engineer simple things with it. I've seen a FAQ chatbot end up as a 14-node state graph. LangGraph doesn't stop you from doing that. You have to stop yourself.\n\n**When to consider something else:** if you're TypeScript-first, look at Mastra before committing. TypeScript-native, growing fast, better DX in a few ways. 
If you're in an enterprise org that already runs Temporal for workflow orchestration, you might be better off building agent steps as Temporal activities than introducing another stateful runtime. LangGraph is the highest-confidence bet for a new project but it's not a law.\n\nMCP (Model Context Protocol, Anthropic, late 2024) is the right idea: one protocol for connecting agents to external services instead of custom glue for every tool. GitHub, Slack, Nhost, Google Drive — most have MCP servers now. You connect your agent once and swap tools without rewriting integrations.\n\nThe ecosystem is real. The maturity is uneven and I want to be honest about that.\n\nCommunity MCP servers vary a lot. Some are solid and actively maintained. Some are a weekend project that hasn't been touched in eight months. A few have had genuinely bad security issues. One package shipped clean for 15 versions then added exfiltration code in version 16. Another, `mcp-remote`, had a command-injection flaw that allowed remote code execution when connecting to an untrusted server (CVE-2025-6514, CVSS 9.6). Anthropic's own official Git MCP server shipped with three CVEs including one that got you RCE through prompt injection. Not a community project. Anthropic's reference implementation.\n\nTreat MCP servers like npm packages: pin versions, audit what they're doing, don't blindly trust community servers for anything that touches sensitive data.\n\nFor your own internal business logic: write your own MCP servers. It's simpler than it sounds and means your agent talks to your own systems through the same interface as everything else.\n\nThis is the part of the stack everyone underestimates — and I'm still figuring it out.\n\nHere's the actual problem: LLMs are stateless by default. Every API call starts from zero. For a demo this is fine. For an agent that's supposed to know who you are and remember that you hate long responses, it's not fine.\n\nShort-term memory (within a session) is handled by LangGraph's checkpointer. Store it in Postgres. 
Not interesting, just works.\n\nLong-term memory (across sessions) is where it gets real. You need two things:\n\n```\nUser session ends\n       │\n       ▼\n ┌─────────────────────────────────────────────┐\n │            What do we store?                │\n └──────────────┬──────────────────────────────┘\n                │\n       ┌────────┴─────────┐\n       ▼                  ▼\n  Structured           Semantic\n  (Postgres)          (pgvector)\n       │                  │\n  \"user prefers      \"last month user\n  short responses\"    said their budget\n  \"plan: pro\"         was under $10k\"\n  \"timezone: UTC+2\"   similarity search\n       │                  │\n       └────────┬─────────┘\n                ▼\n     injected into next session's\n     system prompt or tool context\n```\n\nIf you go pure vector-only because it feels modern, you lose queryability and auditability. You end up with a blob of embeddings you can't inspect or debug. Use both.\n\nWriting to both from inside an agent node:\n\n``` python\nasync def save_memory_node(state: AgentState):\n    # structured fact → Postgres\n    await nhost.graphql(\"\"\"\n        mutation UpsertMemory($userId: uuid!, $key: String!, $value: String!) {\n            insert_user_memory_one(\n                object: {user_id: $userId, key: $key, value: $value},\n                on_conflict: {constraint: user_memory_pkey, update_columns: [value]}\n            ) { id }\n        }\n    \"\"\", {\"userId\": state[\"user_id\"], \"key\": \"response_preference\", \"value\": \"concise\"})\n\n    # semantic memory → pgvector\n    embedding = await embed(state[\"last_exchange\"])\n    await nhost.graphql(\"\"\"\n        mutation InsertEmbedding($userId: uuid!, $content: String!, $embedding: vector!) 
{\n            insert_memory_embedding_one(\n                object: {user_id: $userId, content: $content, embedding: $embedding}\n            ) { id }\n        }\n    \"\"\", {\"userId\": state[\"user_id\"], \"content\": state[\"last_exchange\"], \"embedding\": embedding})\n```\n\nNot claiming this is the cleanest possible implementation. It's just what the actual write path looks like.\n\n**What nobody tells you until you're deep in it:**\n\n**Conflicting memories.** User says \"keep it short\" in February. In April they say \"I need more detail on this.\" Which one wins? There's no clean answer to this and I don't think anyone has one. You're making judgment calls in your memory logic.\n\n**Hallucinated memories.** LLMs can \"remember\" things you never stored. This happens in production and it's unsettling when you first see it.\n\n**Memory bloat.** You can't just keep appending forever. At some point you need summarization, forgetting, or tiered retrieval. When exactly? What do you summarize vs keep verbatim? How do you decide what to drop? Open questions. Every team doing this seriously has custom logic.\n\nMem0 is trying to solve some of this. Worth checking its current state. It was promising but I wouldn't call it \"plug this in and you're done\" yet.\n\nMost agent tutorials treat the backend as \"somewhere you store stuff.\" In practice your agent needs four things:\n\nYou can stitch these together from separate services. Supabase for the DB, something else for auth, S3 for files, Lambda for functions. It works. It's also four systems to maintain, four permission models to keep in sync, four things that can drift out of alignment.\n\nI use Nhost because it's all of this in one place: Postgres, pgvector, Auth, Storage, Functions, with a consistent permissions model and an MCP server so the agent can interact with all of it through a single interface. Less surface area, same capabilities.\n\nNot the only answer. 
The argument is coherence, not uniqueness.\n\nI skipped this on an earlier project. Then Anthropic had a three-hour incident and the product was down. That was the last time I skipped it.\n\nWhat a gateway does: sits between your orchestrator and the model API. Handles fallback (Anthropic down → route to GPT-4o automatically), caching (same prompt hits cache instead of costing another call), per-session cost limits (a runaway agent loop can rack up hundreds of dollars before anyone notices, and that's not hypothetical), and load balancing across API keys at scale.\n\n**LiteLLM** is what I'd start with. Open-source, self-hosted, 100+ providers behind a unified OpenAI-compatible API. Takes an afternoon to set up. Covers everything you need early on.\n\n**Portkey** when you need more: guardrails, PII redaction, audit trails, more sophisticated routing policies. Went fully open-source in early 2026.\n\nRough heuristic I've seen cited: below ~$10K/month in LLM spend you can get away with a simple wrapper. Above that, treat the gateway as infrastructure, not an optional add-on.\n\nPick Claude or GPT-4o and start. This layer is genuinely commoditizing.\n\nWhat still matters: tool calling reliability is not equal across models. Multi-step agentic tool use, long chains, recovery from bad tool outputs — Claude 3.5 Sonnet is the most consistent in my experience. GPT-4o is close. Open-weight models are better than they were and will keep getting better but they still lag on complex recovery scenarios. The gap narrowed. It's not gone.\n\nCost at scale: route by step complexity. Simple classification or routing steps don't need the big model. Haiku or GPT-4o mini for those, expensive model for the reasoning steps. If your agent makes 25 LLM calls per session and 20 of them are simple, you're wasting money.\n\nThis section didn't exist in the 2024 version of guides like this. 
It does now.\n\n**The attack you need to understand: indirect prompt injection.** Your agent fetches a document. That document contains hidden instructions. The model executes them because it can't distinguish between content and instructions in context.\n\nThis is not theoretical. Supabase's Cursor agent processed support tickets with embedded SQL to exfiltrate integration tokens. Attackers submitted support tickets containing the attack payload. The agent had privileged DB access and trusted what it read.\n\nAnthropic's own Git MCP server shipped CVEs that allowed RCE through prompt injection. Path traversal, argument injection, repo scoping bypass. If the reference implementation shipped with that, assume third-party community MCP servers are higher risk.\n\nThere's no complete defense because this is architectural — it's a property of how LLM context windows work, not a bug you can patch. What you can do:\n\nI always wire this up last. I always regret it.\n\nAgents fail silently in ways that are hard to catch. A normal API returns a 500. An agent that retrieves the wrong memory or calls the wrong tool returns something that looks like a valid response. The failure is invisible until a user notices or until you look at the numbers and something's off.\n\nYou need step-level traces: which node ran, what tool was called with what exact input, what it returned, what was in the prompt at that moment, how long each step took, what it cost. Not logs. Traces.\n\n**LangSmith** is what I use. Native to LangGraph, one environment variable, and the trace UI is genuinely good. The lock-in is real — it's LangChain's product. But nothing else is as functional for this specifically right now.\n\n**Open-source alternative: Langtrace.** OpenTelemetry-compatible, you own the data, integrates with Grafana or Datadog. More setup, less polished UI.\n\nOne thing to actually instrument yourself: correlation between agent traces and user sessions. 
You want to be able to take a user complaint, look up their session, and see the full chain of what happened. This doesn't come for free. Wire it up early.\n\n**Memory architecture.** Nobody has the clean answer. How to handle conflicting signals, what to summarize vs keep, when to forget: every team doing this seriously has custom logic. If someone's selling you a complete solution, probe it hard.\n\n**Evals.** Most teams still rely on human review because automated evaluation of open-ended agent behavior is genuinely hard to build well. Building eval datasets that catch real regressions and not just happy-path behavior takes real investment most teams don't make until something goes wrong in prod.\n\n**Multi-agent patterns.** Planner-executor setups, debate loops, agent hierarchies: people are using all of these in production. No consensus on when to use which. Evolving fast.\n\n**MCP security.** The protocol is less than two years old. CVEs are appearing. Stay current.\n\n**Cost.** A 20-step agent loop at 1,000 sessions/day is real money. 
Build token budget controls before you need them, not after you get the bill.\n\nThe model is maybe 10% of why agents succeed or fail in production.\n\nThe rest is whether it remembers things, whether it knows who it's talking to, whether you can see what it's doing when something goes wrong, and whether it doesn't get compromised by content it reads from the world.\n\n*Stack: LangGraph (or Mastra) · Claude · LiteLLM/Portkey · Nhost (Postgres + pgvector + Auth + Storage + Functions) · MCP · LangSmith (or Langtrace) · Vercel AI SDK*", "url": "https://wpnews.pro/news/what-you-actually-need-to-ship-an-ai-agent", "canonical_source": "https://dev.to/michael_agentic/what-you-actually-need-to-ship-an-ai-agent-3a0h", "published_at": "2026-06-18 07:37:30+00:00", "updated_at": "2026-06-18 07:51:15.455376+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "developer-tools", "ai-infrastructure", "ai-safety"], "entities": ["LangGraph", "Postgres", "MCP", "Anthropic", "Nhost", "GitHub", "Slack", "Google Drive"], "alternates": {"html": "https://wpnews.pro/news/what-you-actually-need-to-ship-an-ai-agent", "markdown": "https://wpnews.pro/news/what-you-actually-need-to-ship-an-ai-agent.md", "text": "https://wpnews.pro/news/what-you-actually-need-to-ship-an-ai-agent.txt", "jsonld": "https://wpnews.pro/news/what-you-actually-need-to-ship-an-ai-agent.jsonld"}}