# Considering RAG for your Agent? Build this instead.

> Source: <https://dev.to/remybuilds/considering-rag-for-your-agent-build-this-instead-4ihf>
> Published: 2026-05-27 04:00:00+00:00

Key Takeaways

- Most SaaS AI agents don't need a vector database — file-based memory plus 1M-token context windows plus tool calls handle the typical case
- Anthropic's official "key primitive for just-in-time context retrieval" is filesystem-based, not vector-based
- Claude Code's pattern — an index file (MEMORY.md) plus per-topic markdown files loaded on demand — works for production SaaS agents too
- RAG still wins for large unstructured corpora, regulated multi-tenant data, and frequently-refreshed external knowledge — most SaaS use cases don't fit those criteria

If you're considering RAG for your AI agent in 2026, the most important question isn't which vector database to pick. It's whether you need one at all.

The first time I built a support agent, I reached straight for the default stack: a vector database, an embedding pipeline, a chunker, a reranker. Weeks of plumbing later, the agent still answered most questions by running a plain `SELECT` against my app's own database; the vector store barely earned its keep. I tore it out and replaced it with an index file plus a directory of markdown notes the agent read on demand. Same answers, four moving parts gone. The retrieval I thought I needed was something a single file read already handled.

For most SaaS agents, the simpler pattern is **file-based memory**: the agent stores what it learns in markdown files and reads them back on demand, the shape Claude Code uses internally. Add 1M-token context windows and tool calls against your existing database, and you handle the typical agent job with fewer moving parts than a vector-DB pipeline.

This isn't a "RAG is dead" piece. [Hamel Husain rebutted that take in July 2025](https://hamel.dev/notes/llm/rag/not_dead.html) and he's right. What's changing is which kind of retrieval you reach for first. If you've been [vibe coding](https://vibeready.sh/blog/what-is-vibe-coding/?utm_source=devto&utm_medium=syndication&utm_campaign=do-you-need-rag-for-your-ai-agent) with Claude Code or Cursor, you've already been using file-based memory without naming it.

Open any "build an AI agent" tutorial and the architecture is the same: pick a vector database (Pinecone, ChromaDB, pgvector), build an embedding pipeline, chunk your documents, write retrieval, layer in a reranker, hand the top-k chunks to the model. Each piece is a system you own and pay to run.

That stack made sense when frontier models had 8K-to-32K context windows and tool calling was experimental. It doesn't make sense as the default in 2026, when [Claude Sonnet 4.6 ships a 1M-token context window](https://www.anthropic.com/news/claude-sonnet-4-6) and function calling is universal. Most SaaS data already lives in a structured database; agents reach it through tool calls, not similarity search. That 2023-era stack is over-engineering for the job.

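Before pulling apart the default, name the cases where a full RAG pipeline is the right answer. There are real ones: large unstructured corpora, regulated multi-tenant data, and frequently refreshed external knowledge.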

If your use case fits one of those, build the RAG stack. The rest of this post is about every other case.

The typical SaaS agent operates over *your own structured data*: users, accounts, orders, tickets, audit logs. You don't need fuzzy similarity search to find a user record; you need a tool call that runs `SELECT * FROM users WHERE id = ?`. Tool calls beat vector retrieval here on three counts: precise structured records the model handles more reliably than chunks of prose; fresh data the moment it's written, with no embedding pipeline to re-run; and your existing database's access controls, transactions, and audit trail. None of that is true of a parallel vector store sitting alongside your DB.
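As a rough illustration of the tool-call side, here is a minimal sketch of the function an agent tool could wrap. It assumes SQLite and a `users` table; the function name `get_user` and the column names in the usage note are hypothetical, not taken from any particular app.

```python
import sqlite3
from typing import Optional


def get_user(conn: sqlite3.Connection, user_id: int) -> Optional[dict]:
    """Tool body: fetch one user row by primary key, or None if it is absent.

    The query is parameterized, so the model-supplied id is never
    spliced into the SQL string.
    """
    cur = conn.cursor()
    cur.row_factory = sqlite3.Row  # rows become name-addressable
    row = cur.execute("SELECT * FROM users WHERE id = ?", (user_id,)).fetchone()
    return dict(row) if row is not None else None
```

Registered as a tool, this hands the model an exact record (for example `{"id": 42, "email": ..., "plan": ...}`) rather than the nearest chunk of prose, and it reads whatever the database holds at call time.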

For the parts of agent context that *aren't* in your DB (system instructions, conventions, accumulated learnings about a user, prior conversation summaries, your product's docs), the math has changed too. With a 1M-token context window you can carry an enormous amount of state inline. You don't need to retrieve what already fits.

The architecture is simple: an **index file** listing what the agent knows, a **directory of per-topic markdown files** with the contents, and **file-read and file-write tools** the agent uses to navigate them.

Anthropic's official Memory tool documentation describes this as ["the key primitive for just-in-time context retrieval"](https://platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool): the agent stores what it learns in files in a `/memories` directory and reads them back on demand, instead of loading everything upfront. No embedding step, no vector store, no chunker. Just files.

Anthropic's September 2025 post on [effective context engineering](https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents) formalizes it: "agents built with the just in time approach maintain lightweight identifiers (file paths, stored queries, web links, etc.) and use these references to dynamically load data into context at runtime using tools." The same post names the failure mode this avoids: "context rot," where model recall degrades as context fills. File-based memory keeps context lean by design.

Working memory stays small: the system prompt, the conversation, and whichever topic files were pulled in for this step. Everything else sits on disk. Need more, read more. [Harness engineering](https://vibeready.sh/blog/what-is-harness-engineering/?utm_source=devto&utm_medium=syndication&utm_campaign=do-you-need-rag-for-your-ai-agent) calls this a feedforward control: structure the inputs so the agent doesn't have to guess.

The reference implementation is sitting on every Claude Code user's machine. Claude Code maintains a memory directory at `~/.claude/projects/<project>/memory/` with a single index file (`MEMORY.md`) and one or more topic-specific markdown files alongside it.

The [official docs](https://code.claude.com/docs/en/memory) spell out the rules: `MEMORY.md` loads first, capped at the first 200 lines or 25KB, and contains one-line entries pointing to per-topic memory files. Topic files don't load until the agent asks for one. The `/memory` command lists what's currently loaded, toggles auto-memory, and opens the underlying folder.

An easy-to-miss guideline in the same docs: target under 200 lines per memory file. The reason: [longer files consume more context and reduce adherence](https://code.claude.com/docs/en/memory). That's the principle making file-based memory work. Many small focused files beat one giant context dump.

Three properties map cleanly onto what an agent needs. The index gives *directional awareness*: the agent knows what it knows. Per-topic files provide *just-in-time depth*: they enter context only when the topic is live. The 200-line cap forces *summarization discipline*: topics that get longer have to be split, which keeps each load focused.
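If you want the same cap in your own agent, a sketch like the one below is enough. It is one interpretation of "the first 200 lines or 25KB" (keep whole lines until either limit is hit); how Claude Code applies the cap internally is not something this post documents, and `cap_index` is a hypothetical name.

```python
MAX_LINES = 200
MAX_BYTES = 25 * 1024


def cap_index(text: str, max_lines: int = MAX_LINES, max_bytes: int = MAX_BYTES) -> str:
    """Return the leading lines of an index that fit both the line and byte caps.

    Lines are kept whole: the first line that would exceed the byte
    budget, and everything after it, is dropped.
    """
    kept = []
    used = 0
    for line in text.splitlines(keepends=True)[:max_lines]:
        size = len(line.encode("utf-8"))
        if used + size > max_bytes:
            break
        kept.append(line)
        used += size
    return "".join(kept)
```

Truncating on load rather than on write means an oversized index degrades quietly instead of failing, which is also a reason to log when the cap actually trims something.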

None of this is novel infrastructure. It's a directory of markdown files plus a convention for organizing and reading them. It works because the convention matches how the model reasons about relevance.

Adapting this pattern for an agent inside your SaaS is mostly a question of mapping the same conventions onto your storage and your tools.

The simplest backend is a literal filesystem (fine for single-tenant, single-machine setups). For production multi-tenant SaaS, the pattern fits cleanly into S3 or Cloudflare R2 with one prefix per tenant, or a database table where each row is "a file" (`tenant_id`, `path`, `content`, `updated_at`). Pick whichever is closest to your stack. The agent's tools don't care.
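The database-table option can be sketched in a few lines. This version assumes SQLite; the table name `memory_files` and the helper names are hypothetical, and only the four columns come from the description above.

```python
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS memory_files (
    tenant_id  TEXT NOT NULL,
    path       TEXT NOT NULL,
    content    TEXT NOT NULL,
    updated_at REAL NOT NULL,
    PRIMARY KEY (tenant_id, path)
)
"""


def open_store(dsn: str = ":memory:") -> sqlite3.Connection:
    """Open the store and create the table if it does not exist yet."""
    conn = sqlite3.connect(dsn)
    conn.execute(SCHEMA)
    return conn


def write_file(conn: sqlite3.Connection, tenant_id: str, path: str, content: str) -> None:
    """Full rewrite: the row for (tenant_id, path) is replaced, never appended to."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO memory_files (tenant_id, path, content, updated_at) "
            "VALUES (?, ?, ?, ?)",
            (tenant_id, path, content, time.time()),
        )


def read_file(conn: sqlite3.Connection, tenant_id: str, path: str) -> Optional[str]:
    """Return the file content, or None if this tenant has no such path."""
    row = conn.execute(
        "SELECT content FROM memory_files WHERE tenant_id = ? AND path = ?",
        (tenant_id, path),
    ).fetchone()
    return row[0] if row else None
```

Because `tenant_id` is part of the primary key and of every query, one tenant's `MEMORY.md` can never be returned for another, which is the property the per-tenant S3 prefix gives you in the object-store variant.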

Your `MEMORY.md` is a markdown table of contents. Each entry is one line: a path, a short description, optionally a category tag. The agent loads it every turn, so keep it tight; same 200-line discipline as Claude Code.

Group topics by the dimension that matches your access pattern. A customer support agent usually wants per-user files: `memory/user-<id>/preferences.md`, `memory/user-<id>/recent-tickets.md`, `memory/user-<id>/open-issues.md`. A coding assistant groups per-project; a research agent groups per-topic.

Two invariants do most of the work. *Always load the index.* *Load topic files only when the conversation needs them.* The agent can decide what's worth saving in the moment, but deterministic capture is more reliable. Topic files get rewritten in full, not appended; that keeps them under 200 lines and forces summarization.

The interesting design question isn't *where* memory goes — it's *when* the agent writes to it. Two patterns combine to handle most of the work.

**Per-session hooks.** After a session ends, a deterministic trigger writes a short entry to `memory/sessions/<session-id>.md`: what the user did, what they pushed back on, what preferences came up, what broke. The agent doesn't decide mid-session; the hook captures at session close. Same shape as Claude Code's auto-memory: the model spots new conventions during the conversation, the system persists them at close.
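A session-close hook along these lines could look like the sketch below. The `SessionNotes` fields mirror the four things listed above; the class, the function names, and the markdown layout of the entry are assumptions for illustration.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class SessionNotes:
    """What the harness collected during one session (hypothetical shape)."""
    session_id: str
    actions: List[str] = field(default_factory=list)
    pushback: List[str] = field(default_factory=list)
    preferences: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)


def render_entry(notes: SessionNotes) -> str:
    """Render the notes as a short markdown entry, skipping empty sections."""
    sections = [
        ("What the user did", notes.actions),
        ("What they pushed back on", notes.pushback),
        ("Preferences that came up", notes.preferences),
        ("What broke", notes.failures),
    ]
    lines = [f"# Session {notes.session_id}", ""]
    for title, items in sections:
        if items:
            lines.append(f"## {title}")
            lines.extend(f"- {item}" for item in items)
            lines.append("")
    return "\n".join(lines).rstrip() + "\n"


def on_session_close(memory_root: Path, notes: SessionNotes) -> Path:
    """Deterministic trigger: write memory/sessions/<session-id>.md and return its path."""
    target = memory_root / "sessions" / f"{notes.session_id}.md"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_entry(notes), encoding="utf-8")
    return target
```

The hook runs in your application code when the session ends, so the capture happens whether or not the model remembered to save anything.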

**A daily diary.** Once a day, a scheduled job summarizes the last 24 hours of session logs into a single short entry at `memory/diary/2026-05-10.md`. One paragraph, no more. Old logs get folded in and archived. Over a month you have 30 diary entries instead of thousands of raw logs. Compress further over a year, with weekly summaries and monthly themes, and the agent has hierarchical memory that mirrors how humans remember: vivid for last week, summarized for last month, themes-only for last year.

The diary works for the same reason journaling does. It forces summarization, which forces relevance ranking. Deciding what mattered at the time is much cheaper than reconstructing relevance later from an unstructured pile. Unlike humans, the agent doesn't forget to do it. A scheduled function reads `memory/sessions/`, prompts the model with "summarize the last 24 hours of sessions into one paragraph, focused on durable learnings," writes the result, and archives the source. A 50-line cron job, not infrastructure.
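The daily job described above might be sketched like this. The model call is injected as a plain `summarize` callable so the sketch stays independent of any SDK; the `archive/sessions/<date>/` location and the function name are assumptions, and the job relies on `memory/sessions/` holding only the current day's logs rather than filtering by timestamp.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Callable, Optional

PROMPT = (
    "Summarize the last 24 hours of sessions into one paragraph, "
    "focused on durable learnings."
)


def run_daily_diary(
    memory_root: Path, day: date, summarize: Callable[[str], str]
) -> Optional[Path]:
    """Fold the raw session logs into one diary entry, then archive the logs.

    Returns the diary path, or None when there was nothing to summarize.
    """
    sessions_dir = memory_root / "sessions"
    sessions = sorted(sessions_dir.glob("*.md")) if sessions_dir.is_dir() else []
    if not sessions:
        return None

    transcript = "\n\n".join(p.read_text(encoding="utf-8") for p in sessions)
    paragraph = summarize(f"{PROMPT}\n\n{transcript}")

    diary = memory_root / "diary" / f"{day.isoformat()}.md"
    diary.parent.mkdir(parents=True, exist_ok=True)
    diary.write_text(paragraph.strip() + "\n", encoding="utf-8")

    # Archive only after the diary entry is safely on disk.
    archive = memory_root / "archive" / "sessions" / day.isoformat()
    archive.mkdir(parents=True, exist_ok=True)
    for p in sessions:
        shutil.move(str(p), str(archive / p.name))
    return diary
```

Writing the diary before moving the logs means a crash mid-run leaves the raw sessions in place, so the next run can simply redo the summary.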

[Andrej Karpathy's April 2026 "LLM Wiki" gist](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f) formalizes the same shape with a three-layer split: a `raw/` directory of immutable source documents, a `wiki/` directory of LLM-maintained markdown pages summarizing and cross-referencing the raw material, and a `CLAUDE.md` at the root defining the schema and update workflow. His framing: "LLMs don't get bored, don't forget to update a cross-reference, and can touch 15 files in one pass." Same skeleton, different vocabulary.

The strongest validation comes from another Anthropic post. In ["Code execution with MCP" (November 2025)](https://www.anthropic.com/engineering/code-execution-with-mcp), the team described a workflow that consumed ~150,000 tokens loading tool definitions upfront. Reimplemented with filesystem-style MCP APIs (tool definitions read on demand), the same workflow used ~2,000 tokens. A 98.7% reduction. They call it "progressive disclosure." Your file-based memory layer encodes the same idea.
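The 98.7% figure follows directly from the two token counts quoted above; a quick check, using the rounded numbers from the post:

```python
def token_reduction(before: int, after: int) -> float:
    """Fractional reduction in tokens (0.5 means half as many)."""
    return (before - after) / before


# 150,000 tokens upfront vs. 2,000 tokens on demand
print(f"{token_reduction(150_000, 2_000):.1%}")  # prints 98.7%
```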

Here's what this looks like end-to-end for a customer support agent inside a SaaS: no vector DB, no embeddings, just files and four tools.

Per-tenant root, three categories: per-user state (the agent's working knowledge of each customer), the time-decaying capture layer from the previous section (sessions and diary), and tenant-wide policies. One tenant's layout:

`memory/` layout

```
memory/MEMORY.md                       tenant-wide index

memory/user-42/preferences.md          explicit facts (timezone, plan tier, channels)
memory/user-42/recent-tickets.md       last 5 tickets, summarized
memory/user-42/open-issues.md          current state of unresolved issues

memory/sessions/2026-05-10-094217.md   raw session log (last 24h only)
memory/diary/2026-05-09.md             yesterday, one paragraph
memory/diary/2026-05-week-19.md        last week, two sentences
memory/diary/2026-04-themes.md         April, three bullet points

memory/policies/refunds.md             product-wide policy
memory/policies/escalation.md          escalation rules
```

Hierarchy from the previous section in action: sessions vivid and short-lived; dailies roll them up and live ~30 days; weeklies roll up the dailies; monthly themes carry only recurring patterns. Per-user state and tenant policies sit alongside, untouched.
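The file names in that layout follow a pattern the rollup jobs can compute from a date. A small sketch, assuming the week number in names like `2026-05-week-19.md` is the ISO week (the post does not say which numbering it uses) and with hypothetical function names:

```python
from datetime import date


def daily_path(day: date) -> str:
    """Diary entry for a single day, e.g. memory/diary/2026-05-09.md."""
    return f"memory/diary/{day.isoformat()}.md"


def weekly_path(day: date) -> str:
    """Weekly rollup containing `day`, named by calendar month plus ISO week."""
    iso_week = day.isocalendar()[1]
    return f"memory/diary/{day:%Y-%m}-week-{iso_week:02d}.md"


def monthly_path(day: date) -> str:
    """Monthly themes file for the month containing `day`."""
    return f"memory/diary/{day:%Y-%m}-themes.md"
```

One caveat with this naming: a week that straddles two months gets a different month prefix depending on which day you pass in, so the weekly job should pick one convention (for example, always the week's Monday) and stick to it.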

The index is the agent's table of contents: one line per file, enough metadata to decide what to load. Loading the whole index every turn costs almost nothing because the index itself stays small.

`memory/MEMORY.md`

```
# Memory Index

## User state
- user-42/preferences.md: Pro plan, async preferred, EU timezone
- user-42/recent-tickets.md: last 5 (1 refund, 2 billing, 2 onboarding)
- user-42/open-issues.md: webhook signature mismatch, opened 2026-05-08

## Capture layer
- sessions/: raw logs, last 24h only
- diary/2026-05-09.md: billing webhook day
- diary/2026-05-week-19.md: refund policy edge cases

## Policies
- policies/refunds.md: refund auth + escalation thresholds
- policies/escalation.md: when to involve a human
```
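Because every index entry has the same one-line shape, the harness can parse it when it needs the entries as data (for logging, validation, or pre-filtering). A minimal sketch, assuming the `- path: description` format and `##` section headings shown above; `IndexEntry` and `parse_index` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexEntry:
    section: str      # the "## ..." heading the entry sits under
    path: str         # path relative to the tenant's memory root
    description: str  # the one-line summary after the colon


def parse_index(text: str) -> List[IndexEntry]:
    """Parse '- path: description' lines, tracking the current '## ' section."""
    entries: List[IndexEntry] = []
    section = ""
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            section = line[3:].strip()
        elif line.startswith("- ") and ":" in line:
            # Split on the first colon only; descriptions may contain more.
            path, description = line[2:].split(":", 1)
            entries.append(IndexEntry(section, path.strip(), description.strip()))
    return entries
```

The model itself reads the index as plain markdown; parsing is only for the code around it.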

Four tools, defined the same way they would be in any modern AI SDK (Vercel AI SDK's `tool()`, OpenAI function-calling, or LangChain's `Tool` interface):

- `read_memory_index()`: returns `MEMORY.md` for the active tenant. Called every turn (cheap because the index is small).
- `read_memory_file(path)`
- `write_memory_file(path, content)`
- `delete_memory_file(path)`
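Stripped of any SDK wrapper, the four tools can be sketched over a plain directory as below. The behavior of the read, write, and delete tools is inferred from their names, and the `MemoryTools` class and its path check are assumptions for illustration rather than a reference implementation.

```python
from pathlib import Path


class MemoryTools:
    """The four memory tools, scoped to one tenant's directory."""

    def __init__(self, tenant_root: Path) -> None:
        self.root = tenant_root.resolve()

    def _resolve(self, path: str) -> Path:
        """Map a tool-supplied path into the tenant root, refusing escapes."""
        target = (self.root / path).resolve()
        if self.root not in target.parents:
            raise ValueError(f"path escapes memory root: {path!r}")
        return target

    def read_memory_index(self) -> str:
        return self.read_memory_file("MEMORY.md")

    def read_memory_file(self, path: str) -> str:
        return self._resolve(path).read_text(encoding="utf-8")

    def write_memory_file(self, path: str, content: str) -> None:
        # Full rewrite, not append, so files stay short and summarized.
        target = self._resolve(path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")

    def delete_memory_file(self, path: str) -> None:
        self._resolve(path).unlink()
```

The path check matters because the `path` argument comes from the model: without it, `../other-tenant/MEMORY.md` or an absolute path would read outside the tenant's directory.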

How one query flows through the system. The user asks: "what was the resolution to my webhook issue from last week?"

1. The agent calls `read_memory_index()`. Index entries flag `user-42/open-issues.md` (webhook signature mismatch) and `diary/2026-05-09.md` ("billing webhook day").
2. The agent calls `read_memory_file("user-42/open-issues.md")` and `read_memory_file("diary/2026-05-09.md")` in parallel.
3. After answering, the agent calls `write_memory_file` to remove the resolved entry from `user-42/open-issues.md`. A server-side validator checks schema, size, and rate before the write lands.
4. At session close, the per-session hook writes `memory/sessions/2026-05-10-094217.md`: "User asked about webhook resolution. Confirmed fix held. Removed entry from open-issues.md."
5. The daily job reads `memory/sessions/` from the last 24 hours, summarizes it into one paragraph at `memory/diary/2026-05-10.md`, and archives the raw files. A week later the dailies fold into `diary/2026-05-week-19.md`; a month later the weeklies fold into `diary/2026-05-themes.md`.

The decision rule is part of the system prompt, not the tool schema. Something like: "After resolving a ticket, update `recent-tickets.md` with a one-line summary. If the user states a durable preference ('always send me updates by email'), update `preferences.md`. Don't save transient facts ('the user said hi')."

Deterministic guards earn their keep here. For high-stakes writes (preferences, policy overrides), route the agent's `write_memory_file` calls through a server-side validator that enforces schema, size, and rate caps before the write lands. The agent thinks it's writing freely; the system enforces invariants. [Structured vibe coding](https://vibeready.sh/structured-vibe-coding/?utm_source=devto&utm_medium=syndication&utm_campaign=do-you-need-rag-for-your-ai-agent) calls this "guides plus guardrails": the same idea applied to agent runtime instead of code generation.
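A validator of that kind can be quite small. In the sketch below, "schema" is reduced to an allow-list of path shapes, "size" to the 200-line cap, and "rate" to a per-tenant sliding window; the specific pattern, the limit of 10 writes per minute, and all names are assumptions chosen for illustration.

```python
import re
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict

# Hypothetical allow-list: per-user files, policy files, and the index itself.
ALLOWED_PATH = re.compile(r"^(user-\d+/[a-z0-9-]+|policies/[a-z0-9-]+|MEMORY)\.md$")


class WriteRejected(Exception):
    """Raised when a write_memory_file call violates an invariant."""


class WriteValidator:
    def __init__(
        self,
        max_lines: int = 200,
        max_per_minute: int = 10,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_lines = max_lines
        self.max_per_minute = max_per_minute
        self.clock = clock
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def check(self, tenant_id: str, path: str, content: str) -> None:
        """Raise WriteRejected unless the write passes path, size, and rate checks."""
        if not ALLOWED_PATH.match(path):
            raise WriteRejected(f"path not allowed: {path}")
        if len(content.splitlines()) > self.max_lines:
            raise WriteRejected(f"file exceeds {self.max_lines} lines")

        # Sliding 60-second window of accepted writes per tenant.
        now = self.clock()
        window = self._recent[tenant_id]
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) >= self.max_per_minute:
            raise WriteRejected("rate limit exceeded")
        window.append(now)
```

Call `check()` inside the tool handler before touching storage, and return the rejection message to the model as the tool result so it can shorten the file or back off instead of retrying blindly.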

File-based memory isn't a free lunch. The biggest failure mode is **context rot**. [Chroma's July 2025 study](https://www.trychroma.com/research/context-rot) of 18 frontier models (including Claude Opus 4, Sonnet 4, GPT-4.1, GPT-4o, o3, and Gemini 2.5 Pro) found that "model performance degrades as input length increases" well before the stated max context window. A 200K-window model can show meaningful degradation at 50K tokens. The 200-line discipline matters because it caps how much memory enters context at once. The older "lost in the middle" finding from [Liu et al. (TACL 2024)](https://arxiv.org/abs/2307.03172) is softened in current frontier models but not eliminated; if you're packing 30 memory files into context, the order matters.

Two more failure modes are worth naming. **Fuzzy matching is genuinely harder.** If a user asks "what was that thing about Stripe webhooks we discussed?" and the relevant entry is in `memory/billing-debugging.md`, the agent has to either browse the index intelligently or accept that some queries will miss. With vector search, the same query lights up automatically. For most SaaS use cases this is acceptable; for a public-facing knowledge base where users phrase the same question 50 different ways, vector retrieval still wins. **Memory has to be maintained.** Files go stale, two files end up contradicting each other, and the agent saves a fact incorrectly and propagates the error on every read. None of these are unique to file-based memory; they're the same problems any RAG system has. The solution is different, though: explicit update and delete semantics in your write path, not incremental embedding refreshes.
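To make the fuzzy-matching limit concrete, here is a deliberately naive sketch of keyword matching over index entries. It is not a substitute for embeddings, and the stopword list, function names, and example descriptions are made up for illustration.

```python
import re
from typing import List, Set, Tuple

STOPWORDS = {"a", "an", "the", "was", "that", "about", "we", "what", "thing", "of", "to", "my"}


def tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens, minus a few stopwords."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def rank_index_entries(query: str, entries: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Score each (path, description) entry by exact-token overlap with the query.

    Returns (path, score) pairs with score > 0, best first.
    """
    q = tokens(query)
    scored = [(path, len(q & tokens(f"{path} {desc}"))) for path, desc in entries]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: -s[1])
```

With the query "what was that thing about Stripe webhooks we discussed?" and an entry described as "Stripe webhook signature failures", the only overlap is `stripe`: `webhooks` and `webhook` are different tokens. That gap is exactly where exact-match browsing loses to semantic search, and why a well-written index description (or the model reading the index itself) carries so much of the load.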

None of these tradeoffs make file-based memory wrong. They make it bounded. Know where the bounds are.

If this looked contrarian a year ago, it doesn't now. The major AI infrastructure players have adopted the pattern. The timeline:

- Anthropic's Memory tool is file-based (agents read and write files in a `/memories` directory), not a vector store. Tool version: `memory_20250818`.

Anthropic's official memory primitive, Anthropic's context-engineering guidance, the Linux Foundation's flagship agent standard, Karpathy's most recent public design: all point at file-based memory as the default for agent state. Major AI coding tools (Claude Code, Cursor, Windsurf, GitHub Copilot) consume this pattern natively. Convergence is moving faster than most teams have updated their architectures.

When I wired this into the Vercel AI SDK, the whole memory layer came down to three things: an index file, a per-user (or per-thread) directory convention, and a small set of read/write tools. RAG stayed an option I could layer on later if the data outgrew the files — not a prerequisite I had to build first.

If you're still on the fence, the decision is mostly mechanical. Run your use case down the comparison and the answer falls out.

| | File-based memory | Vector RAG | Long context only |
|---|---|---|---|
| Best for | Per-user/per-tenant agent state, conventions, summarized history | Large unstructured corpora, fuzzy semantic search | Single-shot tasks with bounded inputs |
| Corpus size | Up to a few thousand small files per scope | Tens of thousands to millions of documents | Whatever fits in 1M tokens |
| Data structure | Structured or summarized prose, agent-organized | Unstructured or semi-structured prose | Anything that fits |
| Infrastructure | Filesystem or object store, four tools | Embedding model, vector DB, chunker, reranker | None beyond the model API |
| Latency | One file-read per topic, fast | Embedding + vector search + rerank, several hops | Just the model |
| Cost shape | Storage + token cost on read | Storage + embedding compute + DB ops | Token cost only, scales with context size |
| Failure mode | Stale or contradictory memory files | Bad chunks retrieved, agent ignores them | Context rot, lost-in-the-middle |

The heuristic that captures most of this: *if your data fits in your existing database and your relevant memory fits in your context, you don't need a vector DB*. Reach for one when you outgrow that envelope, not before. Memory is one layer of a larger system; for the others, see the full AI agent SaaS tech stack.

The practical sequence: ship the agent with file-based memory first, watch how it fails in production, add RAG infrastructure only when a specific corpus demands it.

Originally published on VibeReady. Republished here for the dev.to community.
