# Building FailureDNA: an agent memory that knows when not to trust itself

> Source: <https://dev.to/prabhakaranjm/building-failuredna-an-agent-memory-that-knows-when-not-to-trust-itself-fbn>
> Published: 2026-06-27 20:07:30+00:00

*Submitted for the Global AI Hackathon Series with Qwen Cloud — Track 1: MemoryAgent.*

Give an incident-response agent a vector database of past incidents and it will do something that looks smart and is quietly dangerous: when a new outage resembles an old one, it retrieves the most *similar* past incident and reuses whatever action it finds there.

The problem is that **similarity is not applicability.** The most similar past incident might be the one where `restart_service` *failed*. Or where `increase_connection_pool` worked, but only because the database driver was `psycopg2` and the topology was single-region, both of which have since changed. A cosine score of 1.0 tells you the symptoms rhyme. It tells you nothing about whether the fix still holds.

In incident response, that gap is expensive. Repeating a remediation that already failed burns the most costly minutes of an outage; reusing a fix whose preconditions have drifted can make the incident worse. So I built **FailureDNA**: a persistent memory that accumulates real outcomes and reasons about whether past experience should be **used**, **inspected**, or **avoided** — *before* the model is allowed to act on it.

The architecture has one opinionated rule: **the model selects; it never decides what's valid.**

```text
Incident
  -> embed symptoms (Qwen text-embedding-v3)
  -> pgvector semantic search on Alibaba Cloud RDS
  -> fuse semantic + keyword scores
  -> DETERMINISTIC validity gate  <- the important part
  -> Qwen picks one allowlisted action (validated JSON)
  -> execute -> persist the real outcome back to memory
```

The validity gate is deliberately boring and deterministic:

| Prior outcome | Environment match | Disposition |
|---|---|---|
| failure | any | avoid |
| success | full match | use |
| success | driver / topology / config hash changed | inspect |
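The table above can be sketched as a small pure function. This is a minimal illustration, not the project's actual code: the `Environment` and `Memory` types, their field names, and the `disposition` function are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(str, Enum):
    USE = "use"
    INSPECT = "inspect"
    AVOID = "avoid"


@dataclass(frozen=True)
class Environment:
    # Hypothetical fields, taken from the three drift signals in the table.
    driver: str
    topology: str
    config_hash: str


@dataclass(frozen=True)
class Memory:
    action: str
    succeeded: bool            # the real, recorded outcome
    environment: Environment   # environment at the time of the past incident


def disposition(memory: Memory, current: Environment) -> Disposition:
    """Map a retrieved memory to use / inspect / avoid, following the table."""
    if not memory.succeeded:
        return Disposition.AVOID     # a prior failure is avoided in any environment
    if memory.environment == current:
        return Disposition.USE       # success, and nothing has drifted
    return Disposition.INSPECT       # success, but driver, topology, or config changed
```

Because the function has no model call and no I/O, the same memory and environment always produce the same disposition, which is what makes the gate auditable.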

No model decides whether a memory is trustworthy. And critically, `avoid` is **enforced, not advised**: an action with a symptom-matching prior failure is *removed from the candidate list before the model sees it*. The agent cannot repeat a known failure even if it wanted to, which matters because a live LLM handed the same memories as a "hint" will sometimes ignore the hint. The creative part (which action, given the evidence) goes to Qwen; the part that must never hallucinate (is this memory valid? did this action succeed?) stays in deterministic code.
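One way to picture "enforced, not advised" is a filter applied before the prompt is built, plus a strict check on the model's reply. The sketch below is an assumption about how that could look: the `ALLOWLIST` contents (beyond the two actions named in this post), the `{"action": ...}` reply shape, and both function names are hypothetical.

```python
import json

# Hypothetical allowlist; only the first two names appear in this post.
ALLOWLIST = {"restart_service", "increase_connection_pool", "rollback_deploy"}


def candidate_actions(allowlist, avoided):
    """Drop avoided actions before prompting, so the model never sees them."""
    return sorted(set(allowlist) - set(avoided))


def parse_choice(raw: str, candidates) -> str:
    """Validate the model's JSON reply; reject anything outside the candidates."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError("model reply is not valid JSON") from exc
    action = payload.get("action") if isinstance(payload, dict) else None
    if not isinstance(action, str) or action not in candidates:
        raise ValueError(f"action {action!r} is not an allowed candidate")
    return action
```

With `restart_service` marked as avoided, a reply of `{"action": "restart_service"}` raises `ValueError` instead of executing, even if the model produced it.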

I used Qwen Cloud through its OpenAI-compatible DashScope endpoint, which made two things nearly free:

- `text-embedding-v3` turns incident symptoms into 1024-d vectors for pgvector search. Hybrid retrieval fuses semantic similarity (weight 0.70) with keyword overlap (0.30), so it catches both paraphrased and exact-token symptoms.
- Action selection runs at `temperature=0` with thinking disabled: fast, deterministic-ish output that I validate before anything executes.

Because it's OpenAI-compatible, the whole client is a thin, well-typed wrapper with explicit timeouts and one retry, with no exotic SDK to fight.
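The 0.70 / 0.30 fusion mentioned above can be sketched as a weighted sum. The weights come from this post; everything else is an assumption for illustration, in particular the use of token-level Jaccard overlap as the keyword measure and the function names.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query: str, doc: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (an assumed keyword measure)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def fused_score(semantic: float, keyword: float,
                w_semantic: float = 0.70, w_keyword: float = 0.30) -> float:
    """Weighted sum of the semantic and keyword scores."""
    return w_semantic * semantic + w_keyword * keyword
```

A memory that matches only semantically tops out at 0.70, so an exact-token match on something like an error code can still change the ranking between two paraphrased candidates.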

A demo where the new thing wins is easy to fake, so FailureDNA ships a benchmark designed to be hard on itself: three modes (`no_memory`, `naive`, `failuredna`) on identical seeded history, **hidden** simulator outcomes, **evaluator-only** safe/unsafe labels, isolated memory per mode, and static shortcut baselines (`always_inspect_downstream`, …) to check it isn't just rediscovering that one action is usually right.
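The "identical seeded history" and "isolated memory per mode" properties can be illustrated with a small harness. This is a sketch under assumptions, not the shipped benchmark: `run_benchmark` and the `run_mode` callback signature are hypothetical, and only the three mode names come from this post.

```python
import copy
import random

MODES = ("no_memory", "naive", "failuredna")


def run_benchmark(seed_history, scenarios, run_mode, seed=0):
    """Run every mode on the same scenarios, each with its own copy of the history.

    `run_mode(mode, scenario, memory, rng)` is a hypothetical callback that plays
    one scenario and may append outcomes to `memory`.
    """
    results = {}
    for mode in MODES:
        memory = copy.deepcopy(seed_history)  # isolated memory: modes cannot leak into each other
        rng = random.Random(seed)             # same seed, so every mode sees the same randomness
        results[mode] = [run_mode(mode, scenario, memory, rng) for scenario in scenarios]
    return results
```

The deep copy is the important line: without it, outcomes written by one mode would become memories for the next, and the comparison would no longer be fair.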

FailureDNA lifts first-action resolution well above the naive agent, cuts unsafe first actions sharply, resolves in fewer actions, and repeats **zero** historical failures and **zero** stale successes. The honest caveat I left in the open: in this small scenario set, a static always-inspect policy also scores well — which is exactly why the shortcut audit exists. FailureDNA's value isn't a magic action; it's that it *never repeats a known failure and never blindly reuses a stale fix as environments change* — the behavior that generalizes beyond a fixed benchmark.

The backend runs as a custom container on **Alibaba Cloud Function Compute** (FastAPI, port 9000), memory persists in **ApsaraDB RDS for PostgreSQL + pgvector** (HNSW), and the image lives in **ACR Personal Edition**. A few things bit, and are worth writing down for the next person:

- The `*.fcapp.run` domain forces downloads. Function Compute adds `Content-Disposition: attachment` to HTML and JSON responses, so a browser downloads your dashboard or health JSON instead of rendering it. I serve the UI from GitHub Pages and added a small […] `/health/ready`.
- The platform adds its own `Access-Control-*` headers (it even reflects the request origin). The app's only CORS responsibility is to return […].
- […] `/health/cors-debug` endpoint and a `build` marker so "is my new code actually live?" is a one-glance check.

The most interesting open problem is the `inspect` disposition. Today the deterministic gate hard-removes `avoid` actions but leaves `inspect` ones available with a warning. The right next step is a real verification tool behind `inspect`, so that a stale success is *checked* against the current environment, not just flagged. That keeps the thesis intact: let the model be creative where creativity helps, and let deterministic code (and real checks) hold the line where being wrong is expensive.
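As a rough idea of where such a verification step could start, the sketch below reports which environment fields drifted since the remembered success. It is purely illustrative: the function name and the dict-based environment shape are assumptions, and a real check would go on to probe the live system, not just compare recorded values.

```python
def drifted_fields(recorded: dict, current: dict) -> list[str]:
    """Return the environment keys whose values differ, sorted for stable output."""
    keys = set(recorded) | set(current)
    return sorted(k for k in keys if recorded.get(k) != current.get(k))
```

An `inspect` warning that names the drifted fields (for example, only `topology`) gives the next step something concrete to verify.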

**Try it:** [Live dashboard](https://prabhakaran-jm.github.io/failuredna/) · [API status](https://prabhakaran-jm.github.io/failuredna/api.html) · [GitHub (MIT)](https://github.com/prabhakaran-jm/failuredna)

Built with **Qwen Cloud** + **Alibaba Cloud Function Compute** and **RDS pgvector**.
