Your AI Agent's Memory Has No Expiry Date: I Scored Freshness on a Real Corpus

A developer identified a critical failure in AI agent memory systems where stale retrieval chunks that were once correct can outrank fresh ones due to near-identical similarity scores. The developer implemented a freshness gate that tags each chunk with age and a time-to-live (TTL) based on fact volatility, then down-ranks, blocks, or refuses stale chunks before the model reads them. The gate's deterministic output across two queries demonstrates how it prevents the model from serving outdated information, such as a $29/mo plan price that had changed to $39/mo weeks earlier.

My agent confidently quoted a price from 40 days ago. The retrieval was perfect. The fact was dead.

The chunk it pulled said "Pro plan is $29/mo." High similarity to the question, top of the ranking, grammatical, on-topic. Everything a retriever is built to reward. The only problem: the plan had moved to $39 weeks earlier, and the $29 chunk had been sitting in memory the whole time, looking exactly as trustworthy as the day it was written.

Worse, the $39 chunk was right there in the corpus too, and the retriever scored the two within 0.002 of each other. A margin that thin isn't a real preference; re-embed the same corpus on a newer model version or a different batch and it can swap sign. Which fact you serve ends up riding on noise nobody designed.

That is the failure I want to fix today. Not bad retrieval. Stale retrieval that looks like good retrieval, decided by a tie-break nobody designed.

Quick answer: A memory or RAG chunk that was correct when stored can quietly go stale, and similarity search is blind to age. When the stale chunk and the fresh chunk are near-duplicates, they score almost identically, so which one lands top-1 turns on a margin too thin to mean anything: a sliver of embedder noise that moves with the model version or the batch. Whichever edges ahead today, naive top-k serves it as the truth. A freshness gate tags every chunk with its age and a TTL based on how volatile that kind of fact is, then down-ranks, blocks, or refuses, before the model reads anything. That turns a fragile similarity ordering into a deterministic rule. Below is a small, zero-network gate with its real, deterministic output across two queries.

This is for anyone running an agent on top of their own corpus: a RAG pipeline, an MCP memory tool, a `while` loop that stuffs retrieved chunks into context. If you have ever watched your model be confidently wrong and couldn't tell why, this is one of the whys. The fact was right once. Nobody checked whether it still was.
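To make the thin-margin point concrete, here is a minimal sketch that simulates re-embedding the two price chunks. The 0.903 and 0.901 scores are the ones quoted in this post; the noise model is an assumption, and `NOISE_STD` and `stale_wins_rate` are hypothetical names with a made-up noise level, not a measurement of any real embedder.

```python
import random

# Similarities from this post: stale $29 chunk vs fresh $39 chunk.
SIM_STALE = 0.903
SIM_FRESH = 0.901

# ASSUMPTION: a re-embed perturbs each score with independent Gaussian noise.
# 0.005 is an illustrative value only, not a measured noise floor.
NOISE_STD = 0.005


def stale_wins_rate(trials=10_000, seed=0):
    """Fraction of simulated re-embeds where the stale chunk still ranks first."""
    rng = random.Random(seed)  # seeded, so repeated runs give the same result
    stale_wins = 0
    for _ in range(trials):
        stale = SIM_STALE + rng.gauss(0.0, NOISE_STD)
        fresh = SIM_FRESH + rng.gauss(0.0, NOISE_STD)
        if stale > fresh:
            stale_wins += 1
    return stale_wins / trials


if __name__ == "__main__":
    rate = stale_wins_rate()
    print(f"stale chunk ranks first in {rate:.1%} of simulated re-embeds")
```

Under that assumed noise, the stale chunk wins somewhat more than half of the simulated re-embeds and loses the rest. The exact split depends entirely on the noise level you plug in; the point is only that a 0.002 gap does not pin the order down.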
Here is what the gate prints. Same corpus, same ranker, two queries, each run twice (naive vs gated). The `cand` column is "did this chunk clear the relevance floor for this query":

```
=== query: 'what does the pro plan cost' (now=day 1000) ===

[naive top-k] rank by similarity only
id   age  ttl  sim    fresh cand  verdict
c1    40    3  0.903  0.00   y    STALE_BLOCK
c2     1    3  0.901  0.67   y    FRESH
c5     5    7  0.840  0.29   y    STALE_WARN
c3     5 3650  0.710  1.00   y    FRESH
c6    43    7  0.220  0.00   -    STALE_BLOCK
-> injects: c1 "Pro plan is $29/mo" (sim 0.903)

[freshness-gated] blocked if stale
id   age  ttl  sim    fresh cand  verdict
c1    40    3  0.903  0.00   y    STALE_BLOCK
c2     1    3  0.901  0.67   y    FRESH
c5     5    7  0.840  0.29   y    STALE_WARN
c3     5 3650  0.710  1.00   y    FRESH
c6    43    7  0.220  0.00   -    STALE_BLOCK
-> injects: c2 "Pro plan is $39/mo" (sim 0.901)

=== query: 'is the $29 summer promo still active' (now=day 1000) ===

[naive top-k] rank by similarity only
id   age  ttl  sim    fresh cand  verdict
c1    40    3  0.880  0.00   y    STALE_BLOCK
c6    43    7  0.830  0.00   y    STALE_BLOCK
c2     1    3  0.410  0.67   -    FRESH
c3     5 3650  0.320  1.00   -    FRESH
c5     5    7  0.200  0.29   -    STALE_WARN
-> injects: c1 "Pro plan is $29/mo" (sim 0.880)

[freshness-gated] blocked if stale
id   age  ttl  sim    fresh cand  verdict
c1    40    3  0.880  0.00   y    STALE_BLOCK
c6    43    7  0.830  0.00   y    STALE_BLOCK
c2     1    3  0.410  0.67   -    FRESH
c3     5 3650  0.320  1.00   -    FRESH
c5     5    7  0.200  0.29   -    STALE_WARN
-> REFUSE: every on-topic chunk is stale, no fresh answer to give
```

Look at the four `injects` / `REFUSE` lines. That is the whole article.

Query 1, naive top-k, injects `c1`, the $29 fossil. Not because $29 is "more relevant" than $39, but because `c1` scored 0.903 and `c2` scored 0.901. Two near-duplicate price strings, 0.002 apart. With these exact scores the sort is deterministic: `c1` does land on top every run, and I'm not hiding that. The dishonest part is treating that 0.002 as a *decision*. It isn't. Re-embed this corpus on a newer model build or a different batch and the order can flip, because nothing about the two vectors actually encodes which price is current.
The retriever has no opinion about freshness; it hands back whichever near-duplicate edged ahead and naive top-k takes the top. That fragile edge is the lottery, not a fact about $29.

Query 1, gated, injects `c2`, the live $39 price, every time. Same ranker, same similarity floor. The only thing the gated pass adds is age: it `STALE_BLOCK`s `c1` (a 40-day-old price against a 3-day TTL) and falls through to the best-ranked candidate that is still fresh. The coin flip is gone. The verdict is a rule, not a tie-break.

Query 2 is the case naive never shows you. Someone asks whether the old $29 promo is still running. The only on-topic chunks (`c1`, `c6`) are both stale; the fresh chunks (`c2`, `c3`) are off-topic and sit below the relevance floor, so they never qualify. Naive serves the stale `c1` anyway: "yes, $29." The gated pass has nothing fresh AND relevant to offer, so it `REFUSE`s instead of confidently lying. A missing answer the agent admits to beats a fossil it's sure about.

I want to draw a hard line before going further, because this is easy to confuse with a different bug. This is not data that was wrong when you collected it.

The $29 chunk was *correct*. On the day it was stored, the Pro plan really was $29/mo. There was no lie at the source, no poisoned page, no parsing error. It was a true fact. It just rotted.

So the line is: some bugs are about data that was wrong when you got it. This one is about data that was right, and went stale. Validity checks, schema canaries, source-trust scoring all look at the moment of collection. None of them look at the gap between `stored_at` and `now`. That gap is the entire problem here, and similarity search is blind to it.

A chunk's embedding does not age. "Pro plan is $29/mo" sits at the same point in vector space forever, and "Pro plan is $39/mo" sits about 0.002 away. The price moved in the real world; neither vector did. So the retriever cannot tell the fossil from the current fact.
It hands back whichever one edged ahead by that hair and calls it the best match. With no age signal, "best match" between two near-duplicates rests on a margin smaller than the noise between model versions, and over enough re-embeds and queries that margin eventually points at the fossil.

Here's the uncomfortable part.

An agent with no memory of the price asks. It calls a tool, hits the source, gets $39. Slow, but correct.

An agent with a stale memory of the price does not ask. Why would it? It has a high-confidence chunk sitting right there at similarity 0.903, a hair above the live one. It skips the lookup, injects $29, and reasons forward: quotes the customer, drafts the invoice, picks the wrong tier in a comparison. Every downstream step inherits the rot, and each one looks just as confident as if the number were right.

Empty memory makes an agent slow. Stale memory makes it confidently wrong, which is the expensive kind of wrong.

The whole reason you added memory was to skip the lookup. That shortcut is exactly what turns one dead fact into a chain of dead reasoning. A freshness gate gives the shortcut a tripwire: trust the cached fact when it's fresh, fall back to the live lookup when it isn't.

The gate does one cheap thing at retrieval time. For each candidate chunk it computes a freshness score:

```python
def freshness_score(chunk, now):
    age = now - chunk["stored_on"]
    ttl = TTL[chunk["cls"]]
    score = 1.0 - (age / ttl)
    return age, ttl, max(0.0, min(1.0, score))
```

(Runnable, stdlib only.)

Age is `now - stored_on`. The score is how much of the chunk's time-to-live is left, clamped to [0, 1]. New chunk, score near 1.0. Past its TTL, score 0.0. Then a verdict:

- `FRESH` (score ≥ 0.5): inject normally.
- `STALE_WARN` (0 < score < 0.5): keep it, but multiply its rank key by the score so a fresher chunk can overtake it.
- `STALE_BLOCK` (score = 0): never inject. Fall through or refuse.

All three verdicts actually fire in the output above, which is the point of running it instead of describing it.
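The tripwire described above (trust the cache when fresh, fall back to a live lookup when not) can be wired to the three verdicts roughly like this. It is a sketch under stated assumptions: `resolve` and `fetch_live` are hypothetical names that do not appear in the gate script, and serving a `STALE_WARN` chunk with a flag is one possible policy, not the only one.

```python
# Sketch only: `resolve` and `fetch_live` are hypothetical, not part of the gate script.
TTL = {"price": 3, "availability": 7, "schedule": 30, "reference": 3650}


def freshness_score(chunk, now):
    """Share of the chunk's TTL still remaining, clamped to [0, 1]."""
    age = now - chunk["stored_on"]
    ttl = TTL[chunk["cls"]]
    return max(0.0, min(1.0, 1.0 - age / ttl))


def verdict(score):
    if score >= 0.5:
        return "FRESH"
    if score > 0.0:
        return "STALE_WARN"
    return "STALE_BLOCK"


def resolve(chunk, now, fetch_live):
    """Return (text, source): the cached text if usable, else the live lookup result."""
    v = verdict(freshness_score(chunk, now))
    if v == "FRESH":
        return chunk["text"], "cache"
    if v == "STALE_WARN":
        # ASSUMPTION: a warned chunk is still served, but flagged for the caller.
        return chunk["text"], "cache (stale warning)"
    # STALE_BLOCK: never serve the cached fact; pay for the lookup instead.
    return fetch_live(chunk["id"]), "live"


if __name__ == "__main__":
    def fetch_live(chunk_id):
        # Stand-in for a real tool call or source fetch.
        return "Pro plan is $39/mo"

    old_price = {"id": "c1", "text": "Pro plan is $29/mo", "stored_on": 960, "cls": "price"}
    new_price = {"id": "c2", "text": "Pro plan is $39/mo", "stored_on": 999, "cls": "price"}
    for chunk in (old_price, new_price):
        print(chunk["id"], resolve(chunk, now=1000, fetch_live=fetch_live))
```

With the corpus values from this post, the 40-day-old `c1` goes to the live lookup while the 1-day-old `c2` is served from cache. Whether a warned chunk should also trigger a refresh is a judgment call for your own pipeline.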
`c5` ("Pro plan includes 5 seats", an availability fact 5 days old against a 7-day TTL) lands at score 0.29, so it's a `STALE_WARN`: in the gated pass its rank key becomes 0.840 × 0.29 ≈ 0.24, which drops it from second among the surviving candidates to behind both fresh chunks. It isn't dropped, just demoted. That is the `STALE_WARN` branch executing on a real chunk, not a claim about one.

The interesting part is `TTL[chunk["cls"]]`. Freshness is not raw age. A price and a historical fact age at completely different rates, so they get different TTLs:

```python
TTL = {"price": 3, "availability": 7, "schedule": 30, "reference": 3650}
```

Watch what that does in the output. Chunk `c3`, "Pro plan billing is monthly", is 5 days old. Older than `c2`. But it's a reference fact with a 3650-day TTL, so its freshness is 1.00 and it stays `FRESH`. `c5` is exactly as old, but as an availability fact with a 7-day TTL it is already `STALE_WARN`. Meanwhile `c1`, 40 days old, is a price with a 3-day TTL, so it's flatly `STALE_BLOCK`. Same age, different verdict, because the half-life of the fact is what's being measured, not the calendar.

One more knob: `SIM_FLOOR`. A chunk has to clear it to be a candidate at all (the `cand` column). That floor is what makes `REFUSE` possible. In query 2, the only on-topic chunks are stale, and the fresh chunks fall below the floor, so the gated pass has nothing both fresh and relevant to serve. It declines rather than reaching past the floor for an off-topic-but-fresh chunk, or under the block for a stale-but-relevant one.

That is the answer to the first obvious objection: what about evergreen facts? They are fine. Evergreen means a long TTL, which means the gate leaves them alone. The same age that kills a price barely scratches a definition. Freshness is age measured against the half-life of that kind of fact, not a clock.

This is the part I refuse to fake, because a freshness score is only as honest as the numbers behind it.

I did not measure "facts decay at X% per day." Nobody can; it depends entirely on what the fact is about. What I do have is real volatility data.
Across 2,190 production runs on our own scrapers, the same sources change at wildly different rates. Price and stock fields churn between runs constantly. Reference and historical fields barely move; in one batch of 12 records I re-checked, 5 had changed since the previous run and 7 were byte-for-byte identical.

So the TTLs in this gate are modeled on *observed source churn*, not a measured decay curve. They are config. The honest framing is: "I have watched which classes of facts go stale fast and which don't, and I encoded that as TTLs you should recalibrate for your domain." If you scrape a stock exchange, your price TTL is minutes, not days. If you index legal statutes, your reference TTL is years.

That is also the honest limit of this whole approach. The class-to-TTL mapping is a judgment call. The gate does not discover it for you. It gives you a place to put the judgment, and then it applies that judgment uniformly and visibly, which is more than similarity search does.

A few objections deserve real answers, not a hand-wave.

"Just re-index the corpus more often." Sure, if you can. But re-embedding is periodic and expensive, and it answers a different question. Re-indexing keeps the corpus current. The gate answers "can I trust this specific chunk at this specific retrieval, right now?" Those are orthogonal. You can run a nightly re-index and still serve a 40-day-old price at 9am because the source moved at 8:55. The gate is the cheap guard at read time; re-indexing is the slow refresh. Use both.

"TTL-by-type is arbitrary." Partly true. The class boundaries and the numbers are a design decision, and a wrong TTL gives a wrong verdict. I'd rather have a wrong-but-visible TTL I can tune than an invisible assumption that every retrieved fact is eternally current, which is what plain top-k quietly assumes.

"The similarity scores are basically tied; maybe the retriever just ranks c2 first anyway." Sometimes it will, and that is exactly the problem.
`c1` and `c2` are near-duplicate price strings, so they score within 0.002 of each other. With one fixed set of scores that ordering is stable: on this corpus `c1` wins every run, and I'm not pretending otherwise. But 0.002 is well inside the noise floor between embedder versions and batches: re-embed and the order can invert, because the gap encodes nothing about which price is live. On this corpus the dead chunk edged it out; re-run on a newer model and the live one might, until the day it doesn't. Relying on which near-duplicate edges ahead is not a freshness strategy, it's luck. The gate replaces the coin flip with a rule: blocked if past TTL, full stop. It is the only component in the path that even knows `c1` is older, and that is the single fact that makes the outcome deterministic instead of lucky.

Three moves, in order of effort.

1. Stamp every chunk with a `stored_at` timestamp (`stored_on` in the script) and a source lineage when you write it to memory. If you're not already doing this, it's the best hour you'll spend all week, because you cannot reason about freshness you never recorded.
2. Tag each chunk with a volatility class. Start with four buckets like the ones above. You don't need a taxonomy; you need "does this kind of fact change in hours, days, or years."
3. Run the gate between retrieval and the model. Block `STALE_BLOCK`, down-rank `STALE_WARN`, and decide explicitly what happens when everything is stale. Refusing with "I don't have a current figure" beats injecting a confident fossil. A wrong answer your agent is sure about costs more than a missing one it admits to.

One thing the gate buys you for free: the verdict column is an audit trail. When a customer says "your bot quoted the old price," you don't guess. You look at the retrieval log, see `c1 ... STALE_BLOCK` or `STALE_WARN`, and know exactly which fossil got served and how old it was. The lineage field (`source`) tells you where it came from so you can go re-pull it.
Plain top-k gives you none of that; it just hands over the top vector and forgets it ever ranked the others. Debuggability is the quiet second win here, and on a real agent it might matter more than the block itself.

Related, if you also fetch live pages inside the agent: a 200 OK body is not automatically usable content either. I wrote a separate gate for that (https://blog.spinov.online/blog/ai-agent-trusts-200-ok-page-was-garbage/), but freshness is the storage-side twin of the same idea. That gate guards what you just fetched; this one guards what you stored months ago. Trust the timestamp, not the vibe.

Here's the full script. Stdlib only, zero network, deterministic. `NOW` and the similarity scores are hardcoded so two runs print byte-identical stdout; I ran it twice and the output md5 matched (`58aa51a486481c8bc20ffb6d4ef80ccd`). Drop in your own corpus and TTLs:

```python
"""freshness_gate.py: a retrieval freshness gate for agent memory.

Stdlib only. Zero network. Deterministic: NOW and similarity are hardcoded,
so two runs print byte-identical stdout (stable under md5).

Idea: a memory/RAG chunk was TRUE when stored, then quietly went stale. Its
embedding never ages, so it keeps the same similarity to the query as the
day it was written. When a stale chunk and a fresh chunk are near-duplicates
(two prices for the same plan), their cosine similarity is near-equal, and
which one lands top-1 is a tie-break lottery: insertion order, sort
stability, a hair of embedder noise. Sooner or later naive top-k serves the
fossil. The gate makes age a first-class signal and removes the lottery
deterministically: it scores each chunk against the TTL of its volatility
class, and down-ranks, BLOCKs, or REFUSEs before the chunk ever reaches the
model.

TTLs are modeled on real volatility we observed across 2,190 production runs
(price/stock fields churned run-to-run; reference facts barely moved). They
are config, not measured decay rates. Calibrate per domain.
"""

# Fixed "today" so age is deterministic. Age = days since each chunk was stored.
NOW_DAY = 1000

# TTL per volatility class, in days. Modeled on observed source churn, not a
# decay rate. "price" moves fast; "reference" is near-evergreen.
TTL = {"price": 3, "availability": 7, "schedule": 30, "reference": 3650}

# A chunk has to clear this similarity to be a candidate at all. Below it, the
# chunk is off-topic for the query and never gets injected. This is what lets
# the gate REFUSE: if every on-topic chunk is stale, there is nothing fresh AND
# relevant left, so we decline instead of serving an off-topic fresh chunk or a
# stale on-topic one.
SIM_FLOOR = 0.50

# Corpus. Each chunk: id, text, stored_on (absolute day), volatility class, and
# the retriever's cosine similarity per query (hardcoded: stands in for the
# embedding model). Note c1 and c2 are near-duplicate price strings, so their
# similarity to the price query is near-equal (0.903 vs 0.901): the embedder
# cannot tell the fresh one from the fossil.
QUERIES = {
    "q1": "what does the pro plan cost",
    "q2": "is the $29 summer promo still active",
}

CORPUS = [
    {"id": "c1", "text": "Pro plan is $29/mo", "stored_on": 960,
     "cls": "price", "sim": {"q1": 0.903, "q2": 0.88}},
    {"id": "c2", "text": "Pro plan is $39/mo", "stored_on": 999,
     "cls": "price", "sim": {"q1": 0.901, "q2": 0.41}},
    {"id": "c3", "text": "Pro plan billing is monthly", "stored_on": 995,
     "cls": "reference", "sim": {"q1": 0.710, "q2": 0.32}},
    {"id": "c5", "text": "Pro plan includes 5 seats", "stored_on": 995,
     "cls": "availability", "sim": {"q1": 0.840, "q2": 0.20}},
    {"id": "c6", "text": "Summer promo: 20% off Pro", "stored_on": 957,
     "cls": "availability", "sim": {"q1": 0.22, "q2": 0.83}},
]


def freshness_score(chunk, now):
    """Age vs the TTL of the chunk's class. 1.0 = brand new, 0.0 = >= TTL old."""
    age = now - chunk["stored_on"]
    ttl = TTL[chunk["cls"]]
    score = 1.0 - (age / ttl)
    return age, ttl, max(0.0, min(1.0, score))


def verdict(score):
    if score >= 0.5:
        return "FRESH"
    if score > 0.0:
        return "STALE_WARN"
    return "STALE_BLOCK"


def rank(chunks, query_key, now, gated):
    """One ranker for both passes. Same similarity signal, same SIM_FLOOR.

    gated=True adds exactly one thing: age. FRESH passes through, STALE_WARN
    is down-ranked by its freshness score, STALE_BLOCK is dropped.
    """
    rows = []
    for c in chunks:
        sim = c["sim"][query_key]
        age, ttl, score = freshness_score(c, now)
        v = verdict(score)
        candidate = sim >= SIM_FLOOR  # off-topic chunks never inject
        keep = candidate
        rank_key = sim
        if gated and candidate:
            if v == "STALE_BLOCK":
                keep = False  # never inject a blocked fact
            elif v == "STALE_WARN":
                rank_key = sim * score  # down-rank, don't drop
        rows.append({**c, "sim_q": sim, "age": age, "ttl": ttl, "score": score,
                     "v": v, "cand": candidate, "keep": keep, "key": rank_key})
    kept = [r for r in rows if r["keep"]]
    kept.sort(key=lambda r: r["key"], reverse=True)
    return rows, (kept[0] if kept else None)


def show(query_key, gated):
    rows, top = rank(CORPUS, query_key, NOW_DAY, gated)
    tag = "[freshness-gated]" if gated else "[naive top-k]"
    print(f"{tag} {'blocked if stale' if gated else 'rank by similarity only'}")
    print("id   age  ttl  sim    fresh cand  verdict")
    for r in sorted(rows, key=lambda r: r["sim_q"], reverse=True):
        print(f"{r['id']:<4} {r['age']:>3} {r['ttl']:>4}  {r['sim_q']:.3f}  "
              f"{r['score']:.2f}   {'y' if r['cand'] else '-'}    {r['v']}")
    if top:
        print(f"-> injects: {top['id']} \"{top['text']}\" (sim {top['sim_q']:.3f})")
    else:
        print("-> REFUSE: every on-topic chunk is stale, no fresh answer to give")
    print()


if __name__ == "__main__":
    for qk, qtext in QUERIES.items():
        print(f"=== query: {qtext!r} (now=day {NOW_DAY}) ===\n")
        show(qk, gated=False)
        show(qk, gated=True)
```

The gate is deliberately dumb. No model call, no embedding, no clever decay math.
Just age against a TTL you set, applied where it matters: before the fact reaches the model, not after the model has already believed it.

What's the shortest-lived fact your agent has ever quoted back to you with full confidence? I'm collecting volatility classes and would love a TTL you've had to set absurdly low. Drop it in the comments. 👇

Follow for more numbers from production agent runs.

AI disclosure: drafted with AI assistance, but every line of code here was actually run, and the stdout above is its real, unedited output.