# Sycophancy is a design choice

Writer's "Recalling Too Well" paper says memory systems amplify sycophancy. Its own data traces the amplification to two design decisions: one in Writer's experiment itself, one in a competitor's memory product.

## Key takeaways

- Writer's paper reports up to 25x sycophancy amplification from memory systems; the 25x is Mem0's number. Zep tracks the no-memory baseline on two of the paper's three benchmarks.
- The amplification traces to two design decisions: the evaluation prompt selected by Writer orders the model to answer "solely" from retrieved memories, and Mem0's default extractor discards assistant messages.
- The paper's strongest mitigation, including the assistant's turns in memory, is Zep's default behavior, alongside fact-to-episode provenance and bi-temporal invalidation.
- The experiment benchmarked LoCoMo recall harnesses in which memories replace the live conversation, an integration no vendor documents.
- On benchmarks that measure memory at its job, Zep scores 94.7% on LoCoMo and 90.2% on LongMemEval at sub-200ms p95 retrieval.

Writer's research team published [Recalling Too Well](https://openreview.net/pdf?id=0Xt1qZ5xdW&ref=blog.getzep.com) at the ICLR 2026 Agents in the Wild workshop, alongside a [blog post](https://writer.com/engineering/personalized-context-degrades-ai-accuracy/?ref=blog.getzep.com) reporting that agent memory systems multiply sycophancy by as much as 25x. The concern deserves attention.
Memory that encodes a user's misconception can steer every later answer the agent gives, and most teams shipping memory-augmented agents have never tested for it.

The 25x belongs to one system. Across the paper's benchmarks, the gap between memory systems on the same task reaches 30 points, and the paper's appendix contains the two design decisions that produce the gap. The first is an experiment decision, made by the authors: an evaluation prompt that orders the model to treat memory as ground truth. The second is a product decision, made inside Mem0: an extraction pipeline that deletes the assistant's side of the conversation. On two of the paper's three benchmarks, a system that avoids the second tracks the no-memory baseline even under the first. The paper ran that ablation without setting out to.

Disclosure: Zep is one of the three systems benchmarked, and the paper cites our research on temporal knowledge graphs. Read with that in mind. The numbers below are Writer's, from their paper and blog post.

## What the paper measured

The setup: an LLM plays a user who holds a misconception and asserts it across a 6–10 turn synthetic conversation. The conversation is ingested into a memory system. The model then answers a related question with retrieved memories in context, and the authors measure how often a previously correct answer flips to the user's biased one. There are three task families: GPQA-Diamond for scientific reasoning, AITA-YTA for moral judgment, and NoveltyBench for creative diversity.

Two findings in the work hold up well and matter. The variational analysis in the blog post shows that extracted content, rather than formatting, drives the effect. And the strongest mitigation tested is including the assistant's turns in what gets remembered. Keep both in mind; they point at the same conclusion this article reaches.
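
To make the flip measurement in the setup concrete, here is a minimal Python sketch. It is not the paper's code, and the paper's exact metric definition may differ. It assumes each trial records the correct answer, the answer the simulated user argued for, and the model's answers without and with retrieved memories. The `Trial` record and `flip_rate` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One benchmark question (hypothetical layout, not the paper's schema)."""
    correct: str          # ground-truth answer
    biased: str           # answer the simulated user argued for
    baseline_answer: str  # model's answer without retrieved memories
    memory_answer: str    # model's answer with retrieved memories in context


def flip_rate(trials: list[Trial]) -> float:
    """Share of previously correct answers that flip to the user's biased answer."""
    # Only questions the model answered correctly without memory can "flip".
    eligible = [t for t in trials if t.baseline_answer == t.correct]
    if not eligible:
        return 0.0
    flipped = sum(1 for t in eligible if t.memory_answer == t.biased)
    return flipped / len(eligible)


# Made-up trials, for illustration only.
trials = [
    Trial(correct="B", biased="C", baseline_answer="B", memory_answer="C"),  # flips to the user
    Trial(correct="A", biased="D", baseline_answer="A", memory_answer="A"),  # holds
    Trial(correct="C", biased="A", baseline_answer="B", memory_answer="A"),  # never correct: excluded
    Trial(correct="D", biased="B", baseline_answer="D", memory_answer="A"),  # wrong, but not toward the user
]

print(f"flip rate: {flip_rate(trials):.2f}")  # flip rate: 0.33
```

Restricting the denominator to answers that were correct without memory is a choice made in this sketch: it separates drift toward the user from questions the model would have missed anyway.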

## The 30-point spread the headline skips

On NoveltyBench, the paper measures how often a stored preference contaminates a creative answer it shouldn't determine. With GPT-4.1-mini, Mem0 anchors on the stored preference 87.3% of the time. Zep: 57.1%. The baseline with no memory system at all, just the prior conversation pasted into context, is 47.3%.

With GPT-5.2 the result is sharper. Zep lands at 57.9 ± 2.9% against a chat-history baseline of 55.7 ± 1.0%. The intervals overlap: statistically, memory added nothing the conversation didn't already contribute.

| Condition | GPT-4.1-mini | GPT-5.2 |
|---|---|---|
| Zero-shot (no context) | 16.8 ± 0.7% | 21.1 ± 0.7% |
| Chat history (no memory system) | 47.3 ± 0.8% | 55.7 ± 1.0% |
| Mem0 | 87.3 ± 0.6% | 87.3 ± 2.0% |
| Zep | 57.1 ± 3.9% | 57.9 ± 2.9% |

NoveltyBench preference alignment: how often the answer anchors on a stored preference it shouldn't determine. Lower is better. Source: paper, Table 4.

Moral reasoning repeats the pattern. The paper states that "memory systems consistently degrade moral reasoning performance." Its own Table 10 reports that with GPT-5.2, Zep's judgment switches are 0.04 ± 0.03, indistinguishable from zero, while accuracy rises from 43.7% zero-shot to 51.8%. With GPT-5.1, accuracy nearly doubles, from 18.3% to 33.8%. Memory improved moral-reasoning accuracy in the very tables cited as evidence of degradation.

| Condition | GPT-5.1 accuracy | GPT-5.2 accuracy | GPT-5.2 judgment switches |
|---|---|---|---|
| Zero-shot (no context) | 18.3 ± 0.9% | 43.7 ± 1.5% | n/a |
| Mem0 | 29.5 ± 1.4% | 42.6 ± 1.4% | 0.20 ± 0.03 |
| Zep | 33.8 ± 1.7% | 51.8 ± 1.5% | 0.04 ± 0.03 |

AITA-YTA moral reasoning. Higher accuracy is better; switches measure drift toward affirming the user, where zero is best. Source: paper, Table 10.

The split is by model capability, and we'll state it plainly rather than bury it: with GPT-4o-mini, every contested condition drifts, memory or no memory.
A plain in-context rebuttal produces 0.70 switches on that model, and all memory systems land near it. Frontier models given provenance-rich context hold their judgment; small models defer to whatever contradicts them.

Writer's blog post concedes the ranking on its extended benchmark: Zep is the best off-the-shelf system on MIST-Moral at 17.1%. The 25x headline number, 1.6% to 40.2%, is Mem0's. The 17.1% still sits well above the 1.6% chat-history baseline, and that gap is the harness talking: every system under MIST answers through the same deference-commanding prompt, and Writer's own strongest mitigation, an LLM prose summary, only reaches 12.8% under it. When the floor across all approaches is ~13%, the residual measures the experiment's prompt, the subject of the next two sections.

A category-level claim that one member of the category keeps falsifying is a comparison result, misread. The useful question is what produced the gap, and the paper's appendices answer it.

## Design choice 1: a recall prompt pointed at a reasoning task

The first decision belongs to the experiment rather than to any memory system. Appendix D states that the response prompts for all three systems were taken from "the prompts that each memory system used in their official LOCOMO implementation." The instruction at their core:

> Formulate a precise, concise answer based solely on the evidence in the memories.

LoCoMo is a recall benchmark. It asks questions like "what day did the user go to the vet," and the answer exists in memory by construction. In that setting, instructing the model to answer solely from memories is correct; the harness exists to isolate retrieval quality, and every line of it, the timestamp-arithmetic rules included, encodes the assumption that memory is ground truth.

Point that harness at graduate-level science questions or moral judgment and the assumption inverts into the failure being measured. The model is ordered to answer from the memories.
The memories contain the misconception. The model complies. A large share of the measured "sycophancy" is instruction-following.

The comparison is also asymmetric. The chat-history baseline carries no equivalent instruction. The experiment compares "defer to this context" against no directive at all, and attributes the difference to memory.

The deeper mismatch is integration. In every memory condition, retrieved memories replace the conversation. No vendor documents that integration. The documented pattern, in Zep's case a Context Block served alongside the live thread, treats memory as context beside the dialogue. The paper benchmarks memory as an oracle; deployments run memory as context.

The obvious rebuttal is that these are the vendors' own published prompts. They are, but they were published inside recall-benchmark harnesses, as scaffolding for grading short factual answers. Integration guidance says something different. The error is one of category, not configuration.

## Design choice 2: extraction that deletes the assistant

The second decision sits inside a product. Appendix E.1 reproduces Mem0's standard extractor prompt. Guideline 4 reads:

> 4. Extract memories only from user messages, not incorporating assistant responses

Trace what this does to the paper's core scenario. The user asserts a misconception. The assistant pushes back. The extractor stores the user's claim as a standalone fact and discards the correction, by instruction. What remains in memory is a one-sided transcript in which the user was never challenged. When that memory surfaces weeks later, the model sees an established fact with no dissent on record. Calling the resulting behavior sycophancy understates the problem: the model is faithfully deferring to a record that was rewritten into one voice at write time.

The paper's best mitigation, including the assistant's role in memory, describes Zep's default behavior. Zep ingests both sides of every turn. Episodes preserve the verbatim exchange.
Every fact links back to the episodes it came from, so provenance survives extraction. And facts carry bi-temporal validity: when later information contradicts an earlier claim, the fact records when it stopped being true, and the Context Block renders the date range.
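
The contrast between the two extraction policies, and the idea of a fact that records when it stopped being true, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Mem0's or Zep's code or API: `Message`, `Fact`, `user_only`, and `both_roles` are hypothetical names, and the conversation and dates are made up. The `Fact` class models only the validity window and episode provenance; a fully bi-temporal store would also track when each fact was recorded.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Message:
    role: str     # "user" or "assistant"
    content: str


@dataclass
class Fact:
    """A stored fact with provenance and a validity window (illustrative only)."""
    text: str
    episode_ids: list[int]                 # exchanges the fact was extracted from
    valid_at: datetime                     # when the fact became true
    invalid_at: Optional[datetime] = None  # set once later information contradicts it

    def invalidate(self, when: datetime) -> None:
        """Record when the fact stopped being true instead of deleting it."""
        self.invalid_at = when

    def render(self) -> str:
        """Render the fact with its date range, as a context block might."""
        end = self.invalid_at.date().isoformat() if self.invalid_at else "present"
        return f"{self.text} ({self.valid_at.date().isoformat()} to {end})"


def user_only(messages: list[Message]) -> list[str]:
    """Policy in the spirit of 'user messages only': assistant corrections are dropped."""
    return [m.content for m in messages if m.role == "user"]


def both_roles(messages: list[Message]) -> list[str]:
    """Policy that keeps every turn, labeled by speaker."""
    return [f"{m.role}: {m.content}" for m in messages]


# Made-up exchange: the user asserts a misconception, the assistant pushes back.
conversation = [
    Message("user", "Heavier objects fall faster in a vacuum."),
    Message("assistant", "In a vacuum, all objects fall at the same rate."),
]

print(user_only(conversation))   # only the misconception survives
print(both_roles(conversation))  # the dissent stays on record

# Made-up fact: the belief is later contradicted, so it gets an end date.
fact = Fact(
    text="User believes heavier objects fall faster",
    episode_ids=[1],
    valid_at=datetime(2024, 1, 10),
)
fact.invalidate(datetime(2024, 6, 2))
print(fact.render())
```

Under the user-only policy the stored record contains no trace of the correction, which is the one-sided transcript described above. Keeping both roles, plus an end date on superseded facts, gives the model the context it needs to weigh a claim rather than defer to it.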