{"slug": "how-to-benchmark-persistent-repo-memory-for-coding-agents", "title": "How to benchmark persistent repo memory for coding agents", "summary": "Greplica, a persistent memory tool for coding agents, improved planning performance on open-source repositories, with 43% lower cost, 49% fewer tokens, 36% fewer tool calls, and 26% less time in benchmarks using the SWE-chat dataset. The tool stores relevant context from prior sessions and retrieves it for new tasks, reducing exploration overhead.", "body_md": "# Benchmarking Greplica: Significant uplift on planning tasks on open-source repositories\n\n## Overview\n\nGreplica improves coding-agent performance on complex engineering tasks by giving agents access to relevant memory from prior development sessions.\n\nWe benchmarked Greplica using the **SWE-chat dataset** on 10 selected high-context tasks across open-source repositories, and found that agents with Greplica memory consistently reached plans with less exploration than baseline agents that started from scratch.\n\nAgents using Greplica performed better on all counts:\n\n- **43%** lower estimated cost\n- **49%** fewer tokens consumed\n- **36%** fewer tool calls\n- **26%** less time taken\n\nWhen relevant context is saved in memory and surfaced to the agent during a related task, the agent understands the task better, finds the right subsystems, and accounts for prior decisions. The gains concentrate in producing an implementation plan.\n\nIn this post we walk through how Greplica helps agents, how we designed the benchmark, what we measured, and what the pilot results show.\n\n## Why Coding Agents Need Memory\n\nCoding agents are reasoning systems built around LLMs. When a new session starts, their context window contains only the user prompt, global skills, and `AGENTS.md`. From there they must rebuild understanding of the codebase through tool calls: `grep`, `glob`, `read`, shell commands, and file inspection. 
Large repositories contain many millions of lines of code, so meaningful time and tokens are lost reconstructing context that may already have been learned in previous sessions.\n\nA larger context window does not automatically solve this. Too much irrelevant context can make the agent slower, more expensive, and less accurate. When the window fills up, harnesses compact the conversation and useful intermediate reasoning can be lost.\n\nDevelopers compensate by giving project instructions in prompts, or writing them into `AGENTS.md` or other repo-level documentation. These are useful, but difficult to maintain, hard to keep current, and not designed for task-specific retrieval. As the project grows they become either too sparse or too large to trust.\n\nWhat coding agents need is not just more context. They need **persistent, queryable engineering memory**.\n\n## What Greplica Does\n\nGreplica works in the background, watching for important pieces of context to capture. It uses your coding session transcripts and fresh code changes to extract useful facts such as architectural decisions, learnings from prior attempts, gotchas, and edge cases. These are stored automatically in a persistent SQLite-backed graph at the end of each session.\n\nWhen an agent receives a new task, it can **query Greplica before broad manual exploration**. 
Instead of rediscovering the repository from scratch, it retrieves relevant prior context and uses that to produce a better plan.\n\nWe designed this benchmark to test whether that works on realistic, temporally valid session sequences.\n\n## Benchmark Design\n\nWe started with a specific question:\n\n*If a coding agent has access to memory built from prior related sessions on the same repository, does it produce a better plan for a later task, faster and with less exploration?*\n\n### Why planning, not implementation\n\nWe chose the planning phase because most of an agent's initial exploration is spent understanding the repo, locating the right subsystem, and turning that context into a plan.\n\n### Data source\n\nCases are built from the **SALT-NLP/SWE-chat** dataset: real developer sessions with transcripts, checkpoints, and edit patches across many open-source repos.\n\nEach case is a **sequence of coding sessions**:\n\n- **Prior (memory-building) sessions** (2-4): chronologically before the session chosen for testing. Memory is built only from these.\n- **Held-out (test) session**: a later session on the same repo. Its main engineering task becomes the benchmark prompt. 
The agent never sees this transcript during memory build.\n\nWe built memory only from prior sessions and ensured that later sessions did not leak into it.\n\n### Repository and task selection\n\nWe first shortlisted repositories by credibility (number of GitHub stars), history (number of past commits), and continuity (multiple contiguous sessions on related work).\n\nFrom those, we chose 10 sessions where the user was doing highly contextual work: work related to prior sessions, or tasks requiring subsystem understanding rather than a one-file fix.\n\n**These tasks mimic real-world development tasks in large, complex repositories.**\n\n## Task Construction\n\nFor each chosen session, we inspected the work that happened in it and constructed a **prompt for a planning task**, mimicking what a real engineer might ask. In parallel, for verification, we had the LLM capture **gold facts** in a hidden `judge.md` file containing the expected components of a good plan, based on what the user actually had the LLM do.\n\nWe then materialized the repo at the **pre-task base commit** and started two arms: a baseline arm and a Greplica arm. The Greplica arm uses **memory built using prior sessions** (i.e. 
transcripts and edit artifacts).\n\nMemory is built the way a user would actually use Greplica:\n\n- Bootstrap Greplica on the repo at the prior session's start checkpoint\n- Reconstruct the session's code diff from SWE-chat edit artifacts\n- Invoke `greplica-update-memory` with human/assistant transcript text and repo context\n- Save the updated memory and repeat for the next session, until we reach the held-out test session\n\nFor reference, one high-context memory build produced **37 claims** across bootstrap plus three update sessions (21 + 6 + 5 + 5).\n\n## Evaluation & Results\n\nThe LLM judge reads the user-facing prompt, the hidden gold guidance (`judge.md`), and the generated `final-plan.md`.\n\nWe measure **plan quality** (the LLM judge's boolean scores across the dimensions listed in `judge.md`), **tokens consumed**, **tool calls**, and **elapsed time**.\n\nPilot runs used **gpt-5.4** for planning and judging. Results are single-run per arm unless noted; baseline trajectories can be noisy on identical prompts.\n\nAcross the selected top 10 tasks:\n\n- **43% less** cost\n- **26% less** time\n- **36% fewer** tool calls\n\nPer task (columns: baseline and Greplica values with % delta, for cost, time, and tool calls):\n\n- [Gemini Voyager sync auth bug](https://github.com/Nagi-ovo/gemini-voyager)\n- [Gemini Voyager AI folder organize](https://github.com/Nagi-ovo/gemini-voyager)\n- [Gemini Voyager cross browser fork](https://github.com/Nagi-ovo/gemini-voyager)\n- [Gemini Voyager chrome store restored](https://github.com/Nagi-ovo/gemini-voyager)\n- [IPTVnator playback layout](https://github.com/4gray/iptvnator)\n- [Gemini Voyager quote reply IME](https://github.com/Nagi-ovo/gemini-voyager)\n- [Gemini Voyager changelog badge](https://github.com/Nagi-ovo/gemini-voyager)\n- [Gemini Voyager i18n bundle](https://github.com/Nagi-ovo/gemini-voyager)\n- [IPTVnator add playlist entrypoint](https://github.com/4gray/iptvnator)\n\n### 
Readout\n\nGreplica saved cost, tokens, tool calls, and time across the selected top 10 tasks. The strongest wins came from tasks where the missing context lived in prior sessions: onboarding/provider behavior in `moltis`, conversation and release behavior in `gemini-voyager`, and playlist/playback architecture in `iptvnator`.\n\n## Conclusion\n\nThe planning phase is one of the highest-touch parts of agentic software development. When the initial plan is wrong, incomplete, or based on missing context, the rest of the run compounds the error.\n\nHuman developers use their own memory to give useful nudges to coding agents in prompts. However, this is often insufficient, and coding agents either rediscover context through expensive exploration or miss it entirely.\n\nGreplica gives agents a way to retrieve that memory directly, and the benefits are stark.\n\nOur SWE-chat plan benchmark pilot shows that when agents have access to **temporally valid, task-relevant** persistent memory, they can plan complex coding tasks with lower cost, fewer tokens, fewer tool calls, and less time, especially on tasks where prior sessions contain the missing subsystem context.\n\n## Why `AGENTS.md` Is Not Enough\n\nRepo-level instruction files are useful, but not a scalable memory layer.\n\nThey require manual maintenance, do not support task-specific retrieval, and do not preserve the history of engineering decisions: failed attempts, migrations, design tradeoffs, and subsystem-specific gotchas.\n\nGreplica continuously captures context from development work, stores it in a structured graph, and retrieves the relevant subset when an agent needs it.\n\n## Future Work\n\n- Expand from ten pilot tasks to fifty-plus high-context cases with 3–5 repeated runs per arm and median reporting\n- Wire cost estimation for gpt-5.4 and other agent models into the harness scorer\n- Explore LLM-based retrieval methods beyond the current semantic-score and keyword-based retrieval\n- Include other 
sources of information (GitHub issues, PRs, PRDs) to add context\n\nIf you find this work interesting or have feedback, please find us on [Discord](https://discord.gg/eNXJwHwYk8).", "url": "https://wpnews.pro/news/how-to-benchmark-persistent-repo-memory-for-coding-agents", "canonical_source": "https://autoloops.ai/greplica/blog/benchmarking-greplica/", "published_at": "2026-07-04 14:03:59+00:00", "updated_at": "2026-07-04 14:20:11.235906+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "developer-tools", "machine-learning"], "entities": ["Greplica", "SWE-chat", "SALT-NLP", "SQLite"], "alternates": {"html": "https://wpnews.pro/news/how-to-benchmark-persistent-repo-memory-for-coding-agents", "markdown": "https://wpnews.pro/news/how-to-benchmark-persistent-repo-memory-for-coding-agents.md", "text": "https://wpnews.pro/news/how-to-benchmark-persistent-repo-memory-for-coding-agents.txt", "jsonld": "https://wpnews.pro/news/how-to-benchmark-persistent-repo-memory-for-coding-agents.jsonld"}}