Everyone's building agents. Half of them are running. The other half have "active plans."
I've been in both camps. The difference isn't the model. Models have been good enough for a while now. It's everything around the model that nobody talks about in tutorials because tutorials end when the demo works.
This is the stuff that bit me. Take it or leave it.
Whether you need an agent at all is worth asking before you pick a framework.
The cases where agents actually make sense — and I mean actually get used, not just demoed — are pretty narrow:
You have work that's too variable for a simple automation but too repetitive for a human to do 500 times a day. Customer support triage where the agent needs to know who the user is, what plan they're on, what happened in their last three sessions. Internal ops: pull from four systems, write a Slack summary, done. SaaS features where "AI that knows your account" is the actual value, not a generic chatbot bolted on.
What all of these have in common: the agent needs to remember things. Needs to know who's asking. Needs to not lose its mind when a tool call fails or an LLM provider has a bad afternoon.
Everything below is about making that work.
Use LangGraph. Not because it's elegant (it's not always), but because it handles the stuff that kills you in production and nothing else does it as well right now.
The things that matter: state persists across crashes (you don't restart from zero when something hiccups). You can pause mid-execution, wait for a human to approve something, then resume. Parallel tool calls without data races. Explicit control flow so you actually know what's running.
Here's what setting up LangGraph with Postgres checkpointing actually looks like. This is the part that makes your agent survive crashes:
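import os

from langchain_core.messages import HumanMessage
from langgraph.checkpoint.postgres import PostgresSaver
from langgraph.graph import START, StateGraph
from psycopg_pool import ConnectionPool

# AgentState, reason_node, act_node, user_input and user_id are defined elsewhere.
# PostgresSaver expects autocommit connections (per the LangGraph docs at time of writing).
pool = ConnectionPool(
    conninfo=os.environ["NHOST_DATABASE_URL"],
    kwargs={"autocommit": True, "prepare_threshold": 0},
)
checkpointer = PostgresSaver(pool)
checkpointer.setup()  # creates the checkpoint tables; only needed on first run

graph = StateGraph(AgentState)
graph.add_node("reason", reason_node)
graph.add_node("act", act_node)
graph.add_edge(START, "reason")  # entry point; compile() fails without one
graph.add_edge("reason", "act")
graph.add_edge("act", "reason")  # in a real graph, make one edge conditional so the loop can reach END

app = graph.compile(checkpointer=checkpointer)
result = app.invoke(
    {"messages": [HumanMessage(content=user_input)]},
    config={"configurable": {"thread_id": user_id}},  # per-user state
)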
thread_id is the key thing here. Pass the user's ID and each user's state is isolated, resumable, and persisted automatically. If one user can have several separate conversations, pass a per-conversation ID instead.
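If the checkpointing idea feels abstract, here is a toy sketch of what a checkpointer does conceptually: save state under a thread ID after every step, reload it on resume. This is not how LangGraph implements it. ToyCheckpointer and run_step are hypothetical names, and SQLite stands in for Postgres so the sketch runs anywhere.

```python
import json
import sqlite3


class ToyCheckpointer:
    """Hypothetical stand-in for a checkpointer: one saved state per thread_id."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(thread_id TEXT PRIMARY KEY, state TEXT)"
        )

    def save(self, thread_id: str, state: dict) -> None:
        # Replace the row so each thread keeps only its latest state.
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints (thread_id, state) VALUES (?, ?)",
            (thread_id, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, thread_id: str) -> dict:
        row = self.conn.execute(
            "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        # Unknown threads start from an empty state.
        return json.loads(row[0]) if row else {"messages": []}


def run_step(cp: ToyCheckpointer, thread_id: str, message: str) -> dict:
    """Load the thread's state, append one message, persist, and return it."""
    state = cp.load(thread_id)
    state["messages"].append(message)
    cp.save(thread_id, state)
    return state
```

Because state is written after every step, a process that dies between steps picks up from the last saved state instead of from zero. The real checkpointer also keeps history and handles concurrent writes, which this sketch ignores.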
The thing nobody warns you about: you can massively over-engineer simple things with it. I've seen a FAQ chatbot end up as a 14-node state graph. LangGraph doesn't stop you from doing that. You have to stop yourself.
When to consider something else: if you're TypeScript-first, look at Mastra before committing. TypeScript-native, growing fast, better DX in a few ways. If you're in an enterprise org that already runs Temporal for workflow orchestration, you might be better off building agent steps as Temporal activities than introducing another stateful runtime. LangGraph is the highest-confidence bet for a new project but it's not a law.
MCP (Model Context Protocol, Anthropic, late 2024) is the right idea: one protocol for connecting agents to external services instead of custom glue for every tool. GitHub, Slack, Nhost, Google Drive — most have MCP servers now. You connect your agent once and swap tools without rewriting integrations.
The ecosystem is real. The maturity is uneven and I want to be honest about that.
Community MCP servers vary a lot. Some are solid and actively maintained. Some are a weekend project that hasn't been touched in eight months. A few have had genuinely bad security issues. One package shipped clean for 15 versions, then added exfiltration code in version 16. Another, mcp-remote, had a remote code execution flaw rated CVSS 9.6 (CVE-2025-6514). Anthropic's own official Git MCP server shipped with three CVEs, including one that got you RCE through prompt injection. Not a community project. Anthropic's reference implementation.
Treat MCP servers like npm packages: pin versions, audit what they're doing, don't blindly trust community servers for anything that touches sensitive data.
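Pinning is easy to check mechanically. A minimal sketch, assuming your MCP client config uses the common "mcpServers" shape with a command and args per server, and that servers are launched through npx. The function name and the example package names in the usage are made up.

```python
import re

PINNED = re.compile(r"@\d+\.\d+\.\d+$")  # exact version suffix, e.g. pkg@1.4.2


def unpinned_servers(config: dict) -> list[str]:
    """Return names of npx-launched servers whose package spec lacks an exact version."""
    flagged = []
    for name, server in config.get("mcpServers", {}).items():
        if server.get("command") != "npx":
            continue  # this sketch only understands npx-launched servers
        # The package spec is the first argument that is not a flag such as -y.
        spec = next((a for a in server.get("args", []) if not a.startswith("-")), "")
        if not PINNED.search(spec):
            flagged.append(name)
    return flagged
```

Run it in CI against your checked-in config and fail the build when the list is non-empty. A tag like @latest counts as unpinned here on purpose, since that is exactly how a bad version 16 reaches you.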
For your own internal business logic: write your own MCP servers. It's simpler than it sounds and means your agent talks to your own systems through the same interface as everything else.
Memory is the part of the stack everyone underestimates, and I'm still figuring it out.
Here's the actual problem: LLMs are stateless by default. Every API call starts from zero. For a demo this is fine. For an agent that's supposed to know who you are and remember that you hate long responses, it's not fine.
Short-term memory (within a session) is handled by LangGraph's checkpointer. Store it in Postgres. Not interesting, just works.
Long-term memory (across sessions) is where it gets real. You need two kinds of storage: structured facts in Postgres and semantic recall in pgvector.
       User session ends
               │
               ▼
┌─────────────────────────────────────────────┐
│              What do we store?              │
└──────────────┬──────────────────────────────┘
               │
      ┌────────┴─────────┐
      ▼                  ▼
 Structured          Semantic
 (Postgres)         (pgvector)
      │                  │
"user prefers       "last month user
 short responses"    said their budget
"plan: pro"          was under $10k"
"timezone: UTC+2"   similarity search
      │                  │
      └────────┬─────────┘
               ▼
 injected into next session's
 system prompt or tool context
If you go pure vector-only because it feels modern, you lose queryability and auditability. You end up with a blob of embeddings you can't inspect or debug. Use both.
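The read path is the mirror image: pull the structured facts, pull the most similar snippets, and build a block of text for the next session's prompt. A rough sketch with plain Python standing in for both stores. In a real setup pgvector does the similarity search and the facts come from a SQL query. All function names and the prompt wording are my own assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], memories: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """memories is a list of (text, embedding); return the k most similar texts."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_memory_context(facts: dict[str, str], snippets: list[str]) -> str:
    """Render structured facts and retrieved snippets as one prompt section."""
    lines = ["Known facts about this user:"]
    lines += [f"- {key}: {value}" for key, value in sorted(facts.items())]
    if snippets:
        lines.append("Possibly relevant earlier exchanges:")
        lines += [f"- {s}" for s in snippets]
    return "\n".join(lines)
```

Keeping the two sources visibly separate in the prompt is the debuggability argument in miniature: when the agent says something odd, you can see whether it came from a stored fact or a fuzzy retrieval.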
Writing to both from inside an agent node:
# nhost (a GraphQL client), embed (an embedding helper) and AgentState are assumed to be defined elsewhere.
async def save_memory_node(state: AgentState):
    # Structured memory: upsert a key/value fact so it stays queryable in Postgres.
    await nhost.graphql("""
        mutation UpsertMemory($userId: uuid!, $key: String!, $value: String!) {
          insert_user_memory_one(
            object: {user_id: $userId, key: $key, value: $value},
            on_conflict: {constraint: user_memory_pkey, update_columns: [value]}
          ) { id }
        }
    """, {"userId": state["user_id"], "key": "response_preference", "value": "concise"})

    # Semantic memory: embed the last exchange and store it for similarity search.
    embedding = await embed(state["last_exchange"])
    await nhost.graphql("""
        mutation InsertEmbedding($userId: uuid!, $content: String!, $embedding: vector!) {
          insert_memory_embedding_one(
            object: {user_id: $userId, content: $content, embedding: $embedding}
          ) { id }
        }
    """, {"userId": state["user_id"], "content": state["last_exchange"], "embedding": embedding})
Not claiming this is the cleanest possible implementation. It's just what the actual write path looks like.
What nobody tells you until you're deep in it:
Conflicting memories. User says "keep it short" in February. In April they say "I need more detail on this." Which one wins? There's no clean answer to this and I don't think anyone has one. You're making judgment calls in your memory logic.
Hallucinated memories. LLMs can "remember" things you never stored. This happens in production and it's unsettling when you first see it.
Memory bloat. You can't just keep appending forever. At some point you need summarization, forgetting, or tiered retrieval. When exactly? What do you summarize vs keep verbatim? How do you decide what to drop? Open questions. Every team doing this seriously has custom logic.
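To show what "custom logic" tends to look like, here is a sketch of two of the simplest possible policies: newest-wins for conflicting facts, and keep-the-latest-N for bloat. Both are assumptions for illustration, not recommendations. The Memory class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Memory:
    key: str
    value: str
    recorded_at: datetime


def resolve_latest(memories: list[Memory]) -> dict[str, str]:
    """Newest-wins policy: for each key, keep the most recently recorded value."""
    current: dict[str, Memory] = {}
    for m in memories:
        if m.key not in current or m.recorded_at > current[m.key].recorded_at:
            current[m.key] = m
    return {key: m.value for key, m in current.items()}


def split_for_retention(memories: list[Memory], keep_verbatim: int) -> tuple[list[Memory], list[Memory]]:
    """Return (kept, older): the newest N stay verbatim, the rest are candidates for summarizing."""
    ordered = sorted(memories, key=lambda m: m.recorded_at, reverse=True)
    return ordered[:keep_verbatim], ordered[keep_verbatim:]
```

Newest-wins gets the February/April example wrong if "I need more detail on this" was a one-off request and not a change of preference. That is the judgment call: the code is trivial, deciding what counts as a durable preference is not.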
Mem0 is trying to solve some of this. Worth checking its current state. It was promising but I wouldn't call it "plug this in and you're done" yet.
Most agent tutorials treat the backend as "somewhere you store stuff." In practice your agent needs four things: a database, auth, file storage, and somewhere to run functions.
You can stitch these together from separate services. Supabase for the DB, something else for auth, S3 for files, Lambda for functions. It works. It's also four systems to maintain, four permission models to keep in sync, four things that can drift out of alignment.
I use Nhost because it's all of this in one place: Postgres, pgvector, Auth, Storage, Functions, with a consistent permissions model and an MCP server so the agent can interact with all of it through a single interface. Less surface area, same capabilities.
Not the only answer. The argument is coherence, not uniqueness.
I skipped the LLM gateway on an earlier project. Then Anthropic had a three-hour incident and the product was down. That was the last time I skipped it.
What a gateway does: sits between your orchestrator and the model API. Handles fallback (Anthropic down → route to GPT-4o automatically), caching (same prompt hits cache instead of costing another call), per-session cost limits (a runaway agent loop can rack up hundreds of dollars before anyone notices, and that's not hypothetical), and load balancing across API keys at scale.
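The fallback and caching parts are small enough to sketch. This is a toy, not a replacement for a real gateway: providers are plain callables, the cache is exact-match on the prompt string, and ToyGateway is a made-up name.

```python
from typing import Callable

Provider = Callable[[str], str]  # takes a prompt, returns a completion


class ToyGateway:
    """Tries providers in order and caches successful answers by exact prompt."""

    def __init__(self, providers: list[tuple[str, Provider]]) -> None:
        self.providers = providers        # (name, callable) pairs, tried in order
        self.cache: dict[str, str] = {}   # exact-match prompt cache
        self.calls: list[str] = []        # which provider served each uncached call

    def complete(self, prompt: str) -> str:
        if prompt in self.cache:
            return self.cache[prompt]
        errors = []
        for name, provider in self.providers:
            try:
                result = provider(prompt)
            except Exception as exc:  # a real gateway would only catch retryable errors
                errors.append(f"{name}: {exc}")
                continue
            self.cache[prompt] = result
            self.calls.append(name)
            return result
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The point of putting this in one place is that the orchestrator never learns which provider answered. When the primary has a bad afternoon, nothing upstream changes.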
LiteLLM is what I'd start with. Open-source, self-hosted, 100+ providers behind a unified OpenAI-compatible API. Takes an afternoon to set up. Covers everything you need early on.
Portkey when you need more: guardrails, PII redaction, audit trails, more sophisticated routing policies. Went fully open-source in early 2026.
Rough heuristic I've seen cited: below ~$10K/month in LLM spend you can get away with a simple wrapper. Above that, treat the gateway as infrastructure, not an optional add-on.
Pick Claude or GPT-4o and start. The model layer is genuinely commoditizing.
What still matters: tool calling reliability is not equal across models. Multi-step agentic tool use, long chains, recovery from bad tool outputs — Claude 3.5 Sonnet is the most consistent in my experience. GPT-4o is close. Open-weight models are better than they were and will keep getting better but they still lag on complex recovery scenarios. The gap narrowed. It's not gone.
Cost at scale: route by step complexity. Simple classification or routing steps don't need the big model. Haiku or GPT-4o mini for those, expensive model for the reasoning steps. If your agent makes 25 LLM calls per session and 20 of them are simple, you're wasting money.
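The arithmetic is worth doing once with your own numbers. A sketch with made-up per-call costs; the prices, step names, and tier labels below are all hypothetical, and real costs depend on token counts per call.

```python
# Hypothetical flat per-call costs in USD, for illustration only.
COST_PER_CALL = {"small": 0.001, "large": 0.02}
SIMPLE_STEPS = {"classify", "route", "extract"}


def pick_model(step_kind: str) -> str:
    """Send simple steps to the small tier and everything else to the large one."""
    return "small" if step_kind in SIMPLE_STEPS else "large"


def session_cost(steps: list[str], routed: bool) -> float:
    """Total cost of a session, with or without routing by step complexity."""
    return sum(COST_PER_CALL[pick_model(s) if routed else "large"] for s in steps)
```

With these made-up prices, a session of 20 classify steps and 5 reasoning steps costs $0.50 when everything goes to the large model and $0.12 when routed. The ratio is what matters, not the numbers.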
A security section didn't exist in the 2024 version of guides like this. It does now.
The attack you need to understand: indirect prompt injection. Your agent fetches a document. That document contains hidden instructions. The model executes them because it can't distinguish between content and instructions in context.
This is not theoretical. In the Supabase MCP case, a Cursor agent with privileged DB access processed support tickets, and attackers submitted tickets whose text contained SQL instructions to exfiltrate integration tokens. The agent trusted what it read.
Anthropic's own Git MCP server shipped CVEs that allowed RCE through prompt injection. Path traversal, argument injection, repo scoping bypass. If the reference implementation shipped with that, assume third-party community MCP servers are higher risk.
There's no complete defense because this is architectural — it's a property of how LLM context windows work, not a bug you can patch. What you can do:
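One mitigation in that spirit, sketched under my own assumptions: track whether the session has read untrusted content, and once it has, require human approval for privileged tools. The tool names, the Session class, and the delimiter format are all hypothetical.

```python
from dataclasses import dataclass, field

PRIVILEGED_TOOLS = {"run_sql", "send_email"}  # hypothetical tool names


@dataclass
class Session:
    tainted: bool = False  # True once untrusted content has entered the context
    context: list[str] = field(default_factory=list)

    def add_untrusted(self, text: str) -> None:
        # Wrap fetched content in delimiters and mark the session as tainted.
        # Delimiters alone do not stop injection; the taint flag is what gates tools.
        self.context.append(f"<untrusted>\n{text}\n</untrusted>")
        self.tainted = True


def authorize(session: Session, tool: str, human_approved: bool = False) -> bool:
    """Allow a tool call unless it is privileged and the session has read untrusted content."""
    if tool in PRIVILEGED_TOOLS and session.tainted:
        return human_approved
    return True
```

This doesn't make the model any better at telling content from instructions. It limits what a fooled model can do, which is the only lever you fully control.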
I always wire up observability last. I always regret it.
Agents fail silently in ways that are hard to catch. A normal API returns a 500. An agent that retrieves the wrong memory or calls the wrong tool returns something that looks like a valid response. The failure is invisible until a user notices or until you look at the numbers and something's off.
You need step-level traces: which node ran, what tool was called with what exact input, what it returned, what was in the prompt at that moment, how long each step took, what it cost. Not logs. Traces.
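For a sense of the difference between a log line and a trace record, here is a minimal sketch of a decorator that captures one span per step. It is a toy with an in-memory list, not how LangSmith or any tracing backend works. Span, TRACE, traced, and lookup_plan are hypothetical names.

```python
import time
from dataclasses import dataclass
from functools import wraps
from typing import Optional


@dataclass
class Span:
    node: str
    inputs: dict
    output: object
    seconds: float
    error: Optional[str] = None


TRACE: list[Span] = []  # in a real system this would be sent to a tracing backend


def traced(node: str):
    """Decorator that records one Span per call, including calls that raise."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(**kwargs):
            start = time.perf_counter()
            try:
                output = fn(**kwargs)
            except Exception as exc:
                TRACE.append(Span(node, kwargs, None, time.perf_counter() - start, repr(exc)))
                raise
            TRACE.append(Span(node, kwargs, output, time.perf_counter() - start))
            return output
        return wrapper
    return decorator


@traced("lookup_plan")
def lookup_plan(user_id: str) -> str:
    return "pro"  # stand-in for a real tool call
```

The exact inputs and outputs are the part logs usually drop. When the agent confidently answers with the wrong plan, the span tells you whether the tool returned the wrong value or the model ignored the right one.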
LangSmith is what I use. Native to LangGraph, one environment variable, and the trace UI is genuinely good. The lock-in is real — it's LangChain's product. But nothing else is as functional for this specifically right now.
Open-source alternative: Langtrace. OpenTelemetry-compatible, you own the data, integrates with Grafana or Datadog. More setup, less polished UI.
One thing to actually instrument yourself: correlation between agent traces and user sessions. You want to be able to take a user complaint, look up their session, and see the full chain of what happened. This doesn't come for free. Wire it up early.
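One way to do the correlation without threading a session ID through every function is a context variable set once per request. A sketch using the standard library's contextvars, with an in-memory dict standing in for wherever your traces actually live. The names are made up.

```python
import contextvars
from collections import defaultdict

current_session = contextvars.ContextVar("current_session", default="unknown")
EVENTS_BY_SESSION: dict[str, list[str]] = defaultdict(list)


def record(event: str) -> None:
    """Store an event under whatever session ID is active in this context."""
    EVENTS_BY_SESSION[current_session.get()].append(event)


def handle_request(session_id: str, steps: list[str]) -> None:
    """Set the session ID once, then let deeper code record events without passing it around."""
    token = current_session.set(session_id)
    try:
        for step in steps:
            record(step)
    finally:
        current_session.reset(token)  # never leak one user's ID into the next request
```

Given a complaint, you look up the session ID and read its event list top to bottom. Whatever tracing tool you use, the equivalent is attaching the session ID as metadata on every trace.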
Memory architecture. Nobody has the clean answer. How to handle conflicting signals, what to summarize vs keep, when to forget: every team doing this seriously has custom logic. If someone's selling you a complete solution, probe it hard.
Evals. Most teams still rely on human review because automated evaluation of open-ended agent behavior is genuinely hard to build well. Building eval datasets that catch real regressions and not just happy-path behavior takes real investment most teams don't make until something goes wrong in prod.
Multi-agent patterns. Planner-executor setups, debate loops, agent hierarchies: people are using all of these in production. No consensus on when to use which. Evolving fast.
MCP security. The protocol is less than two years old. CVEs are appearing. Stay current.
Cost. A 20-step agent loop at 1,000 sessions/day is real money. Build token budget controls before you need them, not after you get the bill.
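A token budget can be this small. A minimal sketch, assuming you can estimate tokens before each call; TokenBudget, BudgetExceeded, and run_loop are hypothetical names, and the loop body is a placeholder for the actual LLM call.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push a session past its token cap."""


class TokenBudget:
    """Per-session cap on total tokens."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        # Refuse the charge up front instead of discovering the overrun afterwards.
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"{self.used} used, {tokens} requested, cap {self.max_tokens}"
            )
        self.used += tokens


def run_loop(budget: TokenBudget, tokens_per_step: int, max_steps: int) -> int:
    """Run up to max_steps, stopping early once the budget is exhausted. Returns steps completed."""
    done = 0
    for _ in range(max_steps):
        try:
            budget.charge(tokens_per_step)
        except BudgetExceeded:
            break
        done += 1  # the real LLM call would go here
    return done
```

With a 10,000-token cap and 1,500 tokens per step, a 20-step loop stops after 6 steps. What the agent tells the user at that point is a product decision, but it beats finding out from the invoice.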
The model is maybe 10% of why agents succeed or fail in production.
The rest is whether it remembers things, whether it knows who it's talking to, whether you can see what it's doing when something goes wrong, and whether it doesn't get compromised by content it reads from the world.
Stack: LangGraph (or Mastra) · Claude · LiteLLM/Portkey · Nhost (Postgres + pgvector + Auth + Storage + Functions) · MCP · LangSmith (or Langtrace) · Vercel AI SDK