Building a Memory Agent That Actually Forgets (And the Three Bugs That Taught Me Why That's Hard)

A developer building a memory agent for the Global AI Hackathon Series discovered that implementing effective forgetting is harder than expected. The agent, built with Qwen Cloud, FastAPI, Neon Postgres, and Upstash Redis, initially scored memories incorrectly, treating a user's name as trivial and vague statements as more important than specific facts. After debugging three key issues, the developer learned that LLM-based scoring requires explicit calibration rules to prioritize meaningful information.

By Nidhi: Built for the Global AI Hackathon Series with Qwen Cloud, Track 1: MemoryAgent

When I signed up for Track 1, I figured memory was basically a vector database problem. Store embeddings, retrieve similar messages, done. The first prototype worked fine. Then I started actually trying to use it, and that's when things got interesting.

Most AI chat is stateless. Close the tab, and the next conversation starts from zero. That's a strange limitation for something we keep calling "intelligent": a person who knew you well wouldn't forget your name, your job, or what you told them you were working on three days ago.

So the plan was: build a memory layer underneath a chat agent. Every turn would recall relevant past context, respond with it in mind, and extract new facts worth remembering. Important things stick around. Trivial things expire.

That part came together quickly. FastAPI for the backend, split into two services: a memory core handling storage and intelligence, and a thin chat layer on top calling into it. Qwen Cloud for both the chat model and the embeddings. Neon Postgres with pgvector for the actual vector storage, Upstash Redis for caching, all of it deployed on Alibaba Cloud ECS. Within a couple of days I had a working chat UI that stored memories, scored them for importance, and recalled them on the next message.

It looked done. It was not done.

Getting there wasn't clean, either. Two FastAPI services that needed to actually talk to each other, a Postgres instance on Neon that needed pgvector enabled before anything would even insert, a Redis cache that silently no-ops if it can't connect instead of telling you it failed, and then the actual deploy: SSH into an Alibaba Cloud ECS box, realize the venv isn't activated because you forgot which environment you were in, watch both services crash with ModuleNotFoundError, activate it, restart, repeat. None of that shows up in an architecture diagram. All of it ate real time.

I started running real conversations through it: not test data, just talking to it the way a user would. At one point I told it my name. A day later, I asked what it remembered, and my name had a 24-hour expiration timer on it. Separately, I'd mentioned growing cherry tomatoes and basil, and that specific fact scored lower than the vague, generic statement "I'm an indoor gardener" that I'd said in the same conversation.

That's backwards in a way that matters. The entire value of a memory system is recalling specifics: "user grows cherry tomatoes" is useful; "user likes gardening" barely is. But the scoring prompt I'd written only gave Qwen three anchor points: trivial greetings at 0.0, vague preferences at 0.5, and core facts at 1.0. There was no guidance at all for "specific personal fact" or "person's own name," so it was filling in the gap with something closer to vibes than calibration.

The fix was to stop assuming the model would infer the right priorities and instead write them down explicitly: a name should never score below 0.6. A specific instance of something should never score lower than the general category it belongs to. Once I added those as hard calibration rules rather than implicit expectations, both problems disappeared.

The lesson that stuck with me: when you're using an LLM as a scoring function, it will dutifully fill in whatever gaps you leave in your rubric, in whatever direction is easiest, not necessarily the direction you'd want.

The second bug was sneakier because it looked like success. I'd ask the agent for something (a training plan, a morning routine) and then say "save this for me." It would reply with a perfectly confident confirmation: "I've updated my memory with your specific training plan." And then I'd check the actual memory panel. Nothing relevant was there. The bug was structural.

My extraction step (the part that pulls durable facts out of a conversation turn to store) only ever looked at the current turn: the user's message and the assistant's reply, nothing else. When the user's message was "save this," there was nothing in that extraction call's view of the world to actually save. The real content, the training plan, had been generated several messages earlier, in a different turn the extractor never saw.

The chat reply that generated the confirmation had full conversation history available to it, so it sounded confident and specific. The extraction call that was supposed to act on that confirmation had none of that context. Two parts of the same pipeline were operating with completely different amounts of information, and nothing in the architecture made that mismatch visible until I went looking for it by hand.

The fix was to feed recent conversation history into the extraction prompt too, with explicit instructions: if the user is pointing back at something said earlier rather than stating a new fact directly, resolve the reference using that history before deciding what to extract. After the fix, the same "save this" request correctly pulled out the actual content of whatever it was referring to.

What stuck with me here wasn't the fix; it was how convincing the failure was. A wrong answer that sounds tentative gets double-checked. A wrong answer delivered with total confidence usually doesn't. I only caught this because I happened to check the memory panel out of habit, not because anything in the system signaled a problem.

Smart Forget was supposed to be one of the more interesting pieces. Instead of blindly deleting memories the moment a timer expired, it would gather anything past its expiration window and ask Qwen to make one more judgment call: does this still matter, even though its time technically ran out? I built it, tested it, and every single run reported back reviewed: 0, deleted: 0.

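For reference, the intended Smart Forget flow can be sketched roughly as follows. This is a minimal illustration under assumptions, not the project's actual code: the Memory class, the smart_forget function, and the still_matters callback (a stand-in for the call that asks Qwen for a final judgment) are all hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass
class Memory:
    # Hypothetical shape of a stored memory, for illustration only.
    text: str
    expires_at: datetime


def smart_forget(
    memories: list[Memory],
    still_matters: Callable[[Memory], bool],
    now: Optional[datetime] = None,
) -> tuple[list[Memory], dict[str, int]]:
    """Review expired memories instead of deleting them blindly.

    still_matters stands in for the LLM judgment call; it returns True
    when an expired memory should be kept anyway.
    """
    now = now or datetime.now(timezone.utc)
    survivors: list[Memory] = []
    reviewed = deleted = 0
    for memory in memories:
        if memory.expires_at > now:
            # Not expired yet: keep without review.
            survivors.append(memory)
            continue
        reviewed += 1
        if still_matters(memory):
            survivors.append(memory)
        else:
            deleted += 1
    return survivors, {"reviewed": reviewed, "deleted": deleted}


now = datetime.now(timezone.utc)
memories = [
    Memory("user's name is Nidhi", now - timedelta(days=1)),
    Memory("user said hello", now - timedelta(days=1)),
    Memory("user grows cherry tomatoes", now + timedelta(days=7)),
]
# Toy judge: only the name is worth keeping past its expiry.
survivors, stats = smart_forget(memories, lambda m: "name" in m.text, now=now)
print(stats)  # {'reviewed': 2, 'deleted': 1}
```

With two expired memories in the input, a working run reports a non-zero reviewed count, which is what makes a persistent reviewed: 0 worth questioning.
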
I assumed that meant nothing had expired yet in my short test windows, which was plausible: TTLs were measured in days, and I'd only been testing for a few hours at a time.

It wasn't that. It was a crash, quietly swallowed by my own error handling. The database returns timestamps as timezone-aware datetime objects. My code was comparing them against datetime.utcnow(), which is timezone-naive. Python refuses to compare or subtract the two, throws a TypeError, and my try/except block caught that error and defaulted to "keep."

The scary part: nothing crashed loudly enough to notice. My own error handling turned a real failure into something that looked exactly like cautious, conservative behavior, when really the feature had never run a single actual comparison.

I only found this because I built a real test suite: not unit tests with mocks, but a script that makes live HTTP calls against the actual running deployment and checks real outcomes. And even then, the test initially passed, because 0 deleted wasn't technically wrong; it just wasn't meaningfully right either. The thing that actually caught it was going one level further: manually pushing a memory's expiration into the past directly in the database, triggering Smart Forget, and watching the server logs in real time. That's when the actual TypeError showed up, instead of being hidden behind a graceful-looking fallback.

The fix itself was small: switch to timezone-aware datetimes everywhere, and fix the underlying schema too so the mismatch couldn't quietly recur. The bigger fix was procedural: a passing test and a correctly-behaving system are not the same thing, and the gap between them is exactly where things like this hide.

None of these three bugs threw an error message I noticed, or failed a test on the first pass. Two of them gave me a confident-sounding reply while doing nothing underneath. The third one looked cautious and careful while actually just being broken.

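To make the third bug concrete, here is a minimal reproduction of the pattern. It is a sketch under assumptions rather than the project's code: the function names are hypothetical, and the naive "now" is built to mimic what datetime.utcnow() returns.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_expired_buggy(expires_at: datetime) -> bool:
    """Mirrors the failure mode: a naive 'now' plus a catch-all fallback."""
    try:
        # Naive UTC timestamp, equivalent to what datetime.utcnow() returns.
        naive_now = datetime.now(timezone.utc).replace(tzinfo=None)
        # Ordering an aware datetime against a naive one raises TypeError.
        return expires_at < naive_now
    except Exception:
        # The swallowed error: every memory silently looks "not expired".
        return False


def is_expired(expires_at: datetime, now: Optional[datetime] = None) -> bool:
    """Timezone-aware on both sides, and no blanket except to hide errors."""
    now = now or datetime.now(timezone.utc)
    return expires_at <= now


# A timezone-aware timestamp one day in the past, like the database returns.
yesterday = datetime.now(timezone.utc) - timedelta(days=1)

print(is_expired_buggy(yesterday))  # False: the TypeError was swallowed
print(is_expired(yesterday))        # True
```

Narrowing the except clause, or at least logging the exception before falling back, would have surfaced the TypeError on the first run instead of disguising it as a conservative "keep."
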
Turns out a smooth reply, a 200 status code, and a "0 deleted" that I assumed meant "nothing needed deleting" are not the same thing as "this is actually working."

What caught all three, every time: stop trusting the explanation and go check the data. Open the actual memory panel instead of reading the reply text. Tail the actual server logs instead of trusting the status code. Manufacture the edge case by hand instead of waiting for it to show up on its own. None of that is clever. It's just being willing to look one layer past whatever the system is telling you, instead of taking its word for it.

The project that came out the other side does what the brief asked: it remembers what matters, lets go of what doesn't, and reconciles contradictions instead of accumulating them forever. But the part I'd actually want to talk about in an interview isn't the architecture diagram. It's the three different ways a system can look like it's working when it isn't, and what it took to notice each one.

Built with Qwen Cloud qwen-plus and text-embedding-v3, FastAPI, Neon PostgreSQL with pgvector, Upstash Redis, and deployed on Alibaba Cloud ECS, for Track 1: MemoryAgent of the Global AI Hackathon Series with Qwen Cloud.