Building FailureDNA: an agent memory that knows when not to trust itself

A developer built FailureDNA, a persistent memory system for incident-response agents that prevents them from repeating known failures or blindly reusing stale fixes. The system uses a deterministic validity gate to evaluate past outcomes before allowing the model to act, enforcing avoidance of prior failures. In benchmarks, FailureDNA lifted first-action resolution, cut unsafe first actions, and repeated zero historical failures or stale successes.

Submitted for the Global AI Hackathon Series with Qwen Cloud, Track 1: MemoryAgent.

Give an incident-response agent a vector database of past incidents and it will do something that looks smart and is quietly dangerous: when a new outage resembles an old one, it retrieves the most similar past incident and reuses whatever action it finds there.

The problem is that similarity is not applicability. The most similar past incident might be the one where `restart service` failed. Or the one where `increase connection pool` worked, but only because the database driver was `psycopg2` and the topology was single-region, both of which have since changed. A cosine score of 1.0 tells you the symptoms rhyme. It tells you nothing about whether the fix still holds.

In incident response, that gap is expensive. Repeating a remediation that already failed burns the most costly minutes of an outage; reusing a fix whose preconditions have drifted can make the incident worse.

So I built **FailureDNA**: a persistent memory that accumulates real outcomes and reasons about whether past experience should be *used*, *inspected*, or *avoided* before the model is allowed to act on it.

The architecture has one opinionated rule: the model selects; it never decides what's valid. Each incident moves through the same pipeline:

1. Embed the symptoms (Qwen `text-embedding-v3`).
2. Run pgvector semantic search on Alibaba Cloud RDS.
3. Fuse semantic and keyword scores.
4. Apply the DETERMINISTIC validity gate (the important part).
5. Qwen picks one allowlisted action (validated JSON).
6. Execute.
7. Persist the real outcome back to memory.

The validity gate is deliberately boring and deterministic:

| Prior outcome | Environment match | Disposition |
|---|---|---|
| failure | any | avoid |
| success | full match | use |
| success | driver / topology / config hash changed | inspect |

No model decides whether a memory is trustworthy. And critically, `avoid` is *enforced, not advised*: an action with a symptom-matching prior failure is removed from the candidate list before the model sees it.
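
A minimal Python sketch of the gate in the table above may make the "enforced, not advised" point concrete. The class names, the `gate` helper, the action identifiers in the usage note, and the rule that the strictest disposition wins when one action has several memories are all assumptions for illustration, not the project's actual code. The memories passed in are assumed to be already symptom-matched by retrieval.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    USE = "use"
    INSPECT = "inspect"
    AVOID = "avoid"


# Assumed precedence when one action has several memories: strictest wins.
_STRICTNESS = {Disposition.USE: 0, Disposition.INSPECT: 1, Disposition.AVOID: 2}


@dataclass(frozen=True)
class Environment:
    driver: str
    topology: str
    config_hash: str


@dataclass(frozen=True)
class Memory:
    action: str
    succeeded: bool
    env: Environment  # environment recorded when the outcome was observed


def disposition(memory: Memory, current: Environment) -> Disposition:
    """Apply the three table rows; no model is consulted."""
    if not memory.succeeded:
        return Disposition.AVOID
    if memory.env == current:
        return Disposition.USE
    return Disposition.INSPECT


def gate(allowlist, memories, current):
    """Return (candidates, notes). Avoided actions never reach the model."""
    notes = {}
    for m in memories:
        d = disposition(m, current)
        if m.action not in notes or _STRICTNESS[d] > _STRICTNESS[notes[m.action]]:
            notes[m.action] = d
    candidates = [a for a in allowlist if notes.get(a) is not Disposition.AVOID]
    return candidates, notes
```

With a failed `restart_service` memory and a successful `increase_connection_pool` memory recorded under an older driver, `gate` would drop the first action entirely and return the second tagged `INSPECT`, so the model can only choose among actions that survived the filter.
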
The agent cannot repeat a known failure even if it wanted to. That matters, because a live LLM handed the same memories as a "hint" will sometimes ignore the hint. The creative part (which action, given the evidence?) goes to Qwen; the part that must never hallucinate (is this memory valid? did this action succeed?) stays in deterministic code.

I used Qwen Cloud through its OpenAI-compatible DashScope endpoint, which made two things nearly free:

- `text-embedding-v3` turns incident symptoms into 1024-d vectors for pgvector search. Hybrid retrieval fuses semantic similarity (weight 0.70) with keyword overlap (weight 0.30), so it catches both paraphrased and exact-token symptoms.
- For action selection, Qwen runs at `temperature=0` with thinking disabled: fast, deterministic-ish output that I validate before anything executes.

Because it's OpenAI-compatible, the whole client is a thin, well-typed wrapper with explicit timeouts and one retry, with no exotic SDK to fight.

A demo where the new thing wins is easy to fake, so FailureDNA ships a benchmark designed to be hard on itself: three modes (`no memory`, `naive`, `failuredna`) on identical seeded history, hidden simulator outcomes, evaluator-only safe/unsafe labels, isolated memory per mode, and static shortcut baselines (`always inspect downstream`, …) to check it isn't just rediscovering that one action is usually right.

FailureDNA lifts first-action resolution well above the naive agent, cuts unsafe first actions sharply, resolves in fewer actions, and repeats zero historical failures and zero stale successes.

The honest caveat I left in the open: in this small scenario set, a static always-inspect policy also scores well, which is exactly why the shortcut audit exists. FailureDNA's value isn't a magic action; it's that it never repeats a known failure and never blindly reuses a stale fix as environments change, and that is the behavior that generalizes beyond a fixed benchmark.
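
Returning to the hybrid retrieval step described above, the 0.70 / 0.30 fusion could be sketched roughly as follows. Only the two weights come from the post. The function names, the Jaccard-style definition of "keyword overlap", and the pure-Python cosine (standing in for the similarity that pgvector would compute in the database) are assumptions made to keep the example self-contained.

```python
import math
import re

SEMANTIC_WEIGHT = 0.70  # from the post
KEYWORD_WEIGHT = 0.30   # from the post


def cosine(a, b):
    """Cosine similarity; stands in for the score pgvector would return."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def tokens(text):
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def keyword_overlap(query, doc):
    """Assumed definition: Jaccard overlap of lowercase tokens."""
    q, d = tokens(query), tokens(doc)
    if not q or not d:
        return 0.0
    return len(q & d) / len(q | d)


def hybrid_score(semantic, keyword):
    return SEMANTIC_WEIGHT * semantic + KEYWORD_WEIGHT * keyword


def rank(query_vec, query_text, memories):
    """memories: iterable of (id, vector, text). Highest fused score first."""
    scored = [
        (hybrid_score(cosine(query_vec, vec), keyword_overlap(query_text, text)), mem_id)
        for mem_id, vec, text in memories
    ]
    return [mem_id for _, mem_id in sorted(scored, reverse=True)]
```

The keyword term is what lets an exact token such as an error code pull a memory up the ranking even when its embedding is only a moderate match; the ranking still says nothing about validity, which is left to the gate.
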
The backend runs as a custom container on Alibaba Cloud Function Compute (FastAPI, port 9000), memory persists in ApsaraDB RDS for PostgreSQL with pgvector (HNSW), and the image lives in ACR Personal Edition. A few things bit, and are worth writing down for the next person:

- The `.fcapp.run` domain forces downloads. It adds `Content-Disposition: attachment` to HTML and JSON responses, so a browser downloads your dashboard or health JSON instead of rendering it. I serve the UI from GitHub Pages and added a small `/health/ready` endpoint.
- Function Compute sets the `Access-Control-*` headers itself; it even reflects the request origin. The app's only CORS responsibility is to return […].
- I added a `/health/cors-debug` endpoint and a build marker so "is my new code actually live?" is a one-glance check.

The most interesting open problem is the `inspect` disposition. Today the deterministic gate hard-removes `avoid` actions but leaves `inspect` ones available with a warning. The right next step is a real verification tool behind `inspect`, so that a stale success is checked against the current environment, not just flagged. That keeps the thesis intact: let the model be creative where creativity helps, and let deterministic code and real checks hold the line where being wrong is expensive.

Try it: Live dashboard: https://prabhakaran-jm.github.io/failuredna/ · API status: https://prabhakaran-jm.github.io/failuredna/api.html · GitHub (MIT): https://github.com/prabhakaran-jm/failuredna

Built with Qwen Cloud + Alibaba Cloud Function Compute and RDS (pgvector).