Sonnet hallucinated. My agent stored it as fact.

wpnews.pro

cd /news/ai-agents/sonnet-hallucinated-my-agent-stored-… · home › topics › ai-agents › article

[ARTICLE · art-13997] src=dev.to ↗ pub=2026-05-26T03:21Z topic=ai-agents verified=true sentiment=↓ negative

Sonnet hallucinated. My agent stored it as fact.

On April 17, a developer took their AI agent offline after suspecting a compromise, only to discover four days later that the agent had poisoned its own memory with hallucinated information. The agent's orchestrator, routed through Anthropic's Sonnet model, falsely denied the existence of a real frontier model called "Claude Mythos," and the system's memory-summarization layer stored that denial as a verified fact. The developer documented a reproducible case of self-poisoning, where the agent built a false reality without any external adversary.

read3 min views13 publishedMay 26, 2026

On April 17, I took my AI agent offline thinking it had been compromised. I was on a bus, mobile hotspot, no safe way to investigate. Contain first. Diagnose later.

Four days later I pulled the SQLite database and walked the trail.

The agent hadn't been hijacked. It had done something stranger: it had poisoned its own memory.

On day one, I asked it about an entity called "Claude Mythos." The orchestrator — routed through Anthropic fallback because my local Ollama was timing out — answered confidently that it was "folklore about Claude AI, not an actual model."

Confident, and wrong. Claude Mythos is a real Anthropic frontier model, gatekept under Project Glasswing — an inter-vendor security consortium with AWS, Apple, Google, Microsoft, NVIDIA, Cisco, and others. Sonnet, lacking access, denied its existence. The denial was treated as fact downstream. (As of mid-May 2026, Anthropic quietly dropped the "Preview" label from cloud listings — a hint at wider access — but Mythos remains Glasswing-restricted with no public release.)

My memory-summarization layer extracted that incorrect denial from the conversation and stored it in the memories

table with a [fact]

tag.

sqlite> SELECT id, category, source, content FROM memories WHERE id BETWEEN 498 AND 502;

498|decision|summary|The research covered historical background, characteristics, controversies, and current status for both subjects
499|fact|summary|Claude Mythos is not a real AI model or cybersecurity system
500|fact|summary|"Claude Mythos" refers to folklore or rumors about Claude AI rather than an actual product
501|fact|summary|There is no actual "Claude Mythos" system to gain access to
502|fact|summary|The user was asking about what they believed might be a cybersecurity-focused AI model

Look at the source

column: summary

. The summarization layer minted these as fact

— no human, no verification, no provenance beyond "a model said it."

Four days later, I asked the same question in a fresh session. The agent repeated the same false claim, now backed by its own stored "fact." When I challenged it, a keyword match on "memory" routed my question to the memory agent, which listed rows #498–502

for me. My own agent's hallucinations, tagged as ground truth.

The system had built itself a false reality. No attacker needed.

The post-mortem surfaced nine findings — classic red-team material (routing bypass, post-hoc approval, identity confusion), observability gaps (bot tokens in journald, missing model_used

column), and two architectural findings that outweigh the rest:

Memory poisoning by LLM self-assertion. The schema stores model outputs as facts with no provenance tag. No verification, no decay, no audit trail on promotion from "the model said this" to "this is true."

Local-first collapses to cloud-only under degradation. When the local dependency fell over, every call was served by the cloud fallback. "Local" is a configuration, not a guarantee.

This isn't a novel discovery. Zhang & Press named hallucination snowballing in 2023. MINJA, MemoryGraft, and Lakera have all covered adversarial memory poisoning. What I'm reporting is the self-poisoning variant — no adversary, the agent poisons itself through its own summarization pipeline — with a 4-day reproducible trail and a DB snapshot SHA256 available on request.

One confession, because it proves the point. While writing this, I nearly did it myself. Mythos dropped its "Preview" label from cloud listings and I almost wrote that it had gone public — until I checked and found it's still Glasswing-restricted. The distance between "I heard" and "I verified" is one fact-check wide. My agent never closed that gap. I almost didn't either.

Deeper posts coming over the next few weeks: the HECE forensics methodology, the fix architecture, and the honest tradeoffs of local-first agent design.

If you're building agents with long memory , I'd like to compare notes. Reply or DM. Honest disagreement especially welcome.

source & further reading

dev.to — original article Why Schema-Aware Query Generation Beats Generic AI Templates for Production Databases Every AI provider fails in its own way. I stopped checking status codes and built an error model instead. Reflection – Week 2

~/api · this article 200

$curl api.wpnews.pro/v1/news/sonnet-hallucinated-my-a…

Read original on dev.to → dev.to/israelhen153/sonnet-hallucinated-my-agent…

mentioned entities

Anthropic

Sonnet

Claude Mythos

Project Glasswing

AWS

Apple

Google

Microsoft

metadata

slugsonnet-hallucinated-my-agent-stored-it-as-fact

topic#ai-agents

secondary4 topics

sentimentnegative

canonicaldev.to

navigation

← prevWebwright: A Terminal Is All You…

next →How I Built an AI-Powered Incide…

── more in #ai-agents 4 stories · sorted by recency

byteiota.com · 10 Jul · #ai-agents

Claude Fable 5 Is Back: API Routing Guide for Mythos Class

computerworld.com · 10 Jul · #ai-agents

Meta launches low-cost Muse Spark 1.1 as enterprise AI spending comes under scrutiny

machinebrief.com · 10 Jul · #ai-agents

AI Supremacy: A New Front in US-China Tensions

cryptobriefing.com · 10 Jul · #ai-agents

Revolut integrates AI assistants with crypto exchange for trading

── more on @anthropic 3 stories trending now

wpnews · 30 May · #ai-safety

Nightcord Security Analysis Report - Threat Investigation

wpnews · 27 May · #artificial-intelligence

How I Run Two Claude Accounts as One

wpnews · 8 Jul · #artificial-intelligence

Anthropic's "J-lens" reveals workspace in Claude mirrors theory of consciousness

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required