{"slug": "sonnet-hallucinated-my-agent-stored-it-as-fact", "title": "Sonnet hallucinated. My agent stored it as fact.", "summary": "On April 17, a developer took their AI agent offline after suspecting a compromise, only to discover four days later that the agent had poisoned its own memory with hallucinated information. The agent's orchestrator, routed through Anthropic's Sonnet model, falsely denied the existence of a real frontier model called \"Claude Mythos,\" and the system's memory-summarization layer stored that denial as a verified fact. The developer documented a reproducible case of self-poisoning, where the agent built a false reality without any external adversary.", "body_md": "On April 17, I took my AI agent offline thinking it had been compromised. I was on a bus, mobile hotspot, no safe way to investigate. Contain first. Diagnose later.\n\nFour days later I pulled the SQLite database and walked the trail.\n\nThe agent hadn't been hijacked. It had done something stranger: it had poisoned its own memory.\n\nOn day one, I asked it about an entity called \"Claude Mythos.\" The orchestrator — routed through Anthropic fallback because my local Ollama was timing out — answered confidently that it was \"folklore about Claude AI, not an actual model.\"\n\nConfident, and wrong. [Claude Mythos](https://red.anthropic.com/2026/mythos-preview/) is a real Anthropic frontier model, gatekept under [Project Glasswing](https://www.anthropic.com/glasswing) — an inter-vendor security consortium with AWS, Apple, Google, Microsoft, NVIDIA, Cisco, and others. Sonnet, lacking access, denied its existence. The denial was treated as fact downstream. 
(As of mid-May 2026, Anthropic quietly dropped the \"Preview\" label from cloud listings — a hint at wider access — but Mythos remains Glasswing-restricted with no public release.)\n\nMy memory-summarization layer extracted that incorrect denial from the conversation and stored it in the `memories` table with a `[fact]` tag.\n\n```\nsqlite> SELECT id, category, source, content FROM memories WHERE id BETWEEN 498 AND 502;\n\n498|decision|summary|The research covered historical background, characteristics, controversies, and current status for both subjects\n499|fact|summary|Claude Mythos is not a real AI model or cybersecurity system\n500|fact|summary|\"Claude Mythos\" refers to folklore or rumors about Claude AI rather than an actual product\n501|fact|summary|There is no actual \"Claude Mythos\" system to gain access to\n502|fact|summary|The user was asking about what they believed might be a cybersecurity-focused AI model\n```\n\nLook at the `source` column: `summary`. The summarization layer minted these as `fact` — no human, no verification, no provenance beyond \"a model said it.\"\n\nFour days later, I asked the same question in a fresh session. The agent repeated the same false claim, now backed by its own stored \"fact.\" When I challenged it, a keyword match on \"memory\" routed my question to the memory agent, which listed rows `#498–502` for me. My own agent's hallucinations, tagged as ground truth.\n\nThe system had built itself a false reality. No attacker needed.\n\nThe post-mortem surfaced nine findings — classic red-team material (routing bypass, post-hoc approval, identity confusion), observability gaps (bot tokens in journald, missing `model_used` column), and two architectural findings that outweigh the rest:\n\n**Memory poisoning by LLM self-assertion.** The schema stores model outputs as facts with no provenance tag. 
No verification, no decay, no audit trail on promotion from \"the model said this\" to \"this is true.\"\n\n**Local-first collapses to cloud-only under degradation.** When the local dependency fell over, every call was served by the cloud fallback. \"Local\" is a configuration, not a guarantee.\n\nThis isn't a novel discovery. Zhang & Press named hallucination snowballing in 2023. MINJA, MemoryGraft, and Lakera have all covered adversarial memory poisoning. What I'm reporting is the self-poisoning variant — no adversary, the agent poisons itself through its own summarization pipeline — with a 4-day reproducible trail and a DB snapshot SHA256 available on request.\n\nOne confession, because it proves the point. While writing this, I nearly did it myself. Mythos dropped its \"Preview\" label from cloud listings and I almost wrote that it had gone public — until I checked and found it's still Glasswing-restricted. The distance between \"I heard\" and \"I verified\" is one fact-check wide. My agent never closed that gap. I almost didn't either.\n\nDeeper posts coming over the next few weeks: the HECE forensics methodology, the fix architecture, and the honest tradeoffs of local-first agent design.\n\nIf you're building agents with long memory, I'd like to compare notes. Reply or DM. 
Honest disagreement especially welcome.", "url": "https://wpnews.pro/news/sonnet-hallucinated-my-agent-stored-it-as-fact", "canonical_source": "https://dev.to/israelhen153/sonnet-hallucinated-my-agent-stored-it-as-fact-3nl5", "published_at": "2026-05-26 03:21:52+00:00", "updated_at": "2026-05-26 03:33:34.959410+00:00", "lang": "en", "topics": ["ai-agents", "ai-safety", "large-language-models", "ai-research", "ai-products"], "entities": ["Anthropic", "Sonnet", "Claude Mythos", "Project Glasswing", "AWS", "Apple", "Google", "Microsoft"], "alternates": {"html": "https://wpnews.pro/news/sonnet-hallucinated-my-agent-stored-it-as-fact", "markdown": "https://wpnews.pro/news/sonnet-hallucinated-my-agent-stored-it-as-fact.md", "text": "https://wpnews.pro/news/sonnet-hallucinated-my-agent-stored-it-as-fact.txt", "jsonld": "https://wpnews.pro/news/sonnet-hallucinated-my-agent-stored-it-as-fact.jsonld"}}