{"slug": "i-spent-a-week-fixing-my-chatbot-s-memory-here-s-what-worked", "title": "I spent a week fixing my chatbot's memory — here's what worked", "summary": "A developer spent a week fixing a customer support chatbot's memory issues after users complained it had \"the memory of a goldfish.\" After failing with simple truncation and conversation summarization, the engineer implemented a vector memory system using sentence-transformers embeddings to retrieve only the most relevant past messages. The approach allowed the chatbot to hold coherent multi-turn conversations without exceeding context window limits or significantly increasing API costs.", "body_md": "Two months ago, I shipped a customer support chatbot for my SaaS product. It worked great for the first three messages. Then it started forgetting what the user said earlier, repeating itself, and giving contradictory advice. Users noticed. One wrote: \"Your bot has the memory of a goldfish.\"\n\nI had hit the classic LLM context window wall. My initial implementation just stuffed the entire conversation history into the prompt. That worked until conversations grew beyond 4k tokens. Then I tried truncation, but that lost critical context. 
The problem felt unsolvable without either breaking the bank on bigger context windows or losing information.\n\nHere's what I tried, what failed, and the approach that finally let my chatbot hold coherent multi-turn conversations without blowing up my API costs.\n\nMy first attempt was embarrassingly simple:\n\n``` python\nimport openai  # legacy SDK (openai<1.0) interface\n\nmessages = []\nwhile True:\n    user_input = input()\n    messages.append({\"role\": \"user\", \"content\": user_input})\n    # Resend the full history on every turn\n    response = openai.ChatCompletion.create(\n        model=\"gpt-3.5-turbo\",\n        messages=messages\n    )\n    assistant_reply = response.choices[0].message.content\n    messages.append({\"role\": \"assistant\", \"content\": assistant_reply})\n    print(assistant_reply)\n```\n\nThis works for about 10-15 turns before hitting the 4,096-token limit. After that, the API throws an error. So I added truncation: keep only the last N messages.\n\nI set `max_messages=20` and dropped the oldest ones:\n\n``` python\ndef trim_messages(messages, max_messages=20):\n    # Keep only the most recent max_messages entries\n    return messages[-max_messages:]\n```\n\nNow the API never hit the limit, but the bot forgot everything from early in the conversation. A user would say \"I already told you my email is test@example.com\" and the bot would ask for it again. Worse, if the user corrected themselves earlier, the bot would revert to the wrong info after the correction scrolled off.\n\nThis is the goldfish problem. 
Truncation is simple but brutal.\n\nNext I tried summarizing the conversation history every few turns and injecting the summary as a system message:\n\n``` python\nimport openai\n\ndef summarize_history(messages):\n    history_text = \"\\n\".join([f\"{m['role']}: {m['content']}\" for m in messages])\n    response = openai.ChatCompletion.create(\n        model=\"gpt-3.5-turbo\",\n        messages=[\n            {\"role\": \"system\", \"content\": \"Summarize the key facts and context from this conversation.\"},\n            {\"role\": \"user\", \"content\": history_text}\n        ]\n    )\n    return response.choices[0].message.content\n```\n\nI called this every 10 turns and replaced the old history with the summary plus the last few raw messages. It worked better — the bot remembered key facts like names and preferences. But it added latency (an extra API call per summary) and cost. Also, summaries sometimes lost nuance. A user's frustrated tone or a subtle correction could get flattened.\n\nAfter reading about how RAG (Retrieval-Augmented Generation) works for documents, I realized I could apply the same idea to conversation history. Instead of keeping everything or summarizing everything, I would store each message as an embedding and retrieve only the most relevant past messages when generating a response.\n\nHere's the architecture:\n\nI used `sentence-transformers` for embeddings and a simple in-memory vector store (for production I'd use Pinecone or Chroma). 
The code looks like this:\n\n``` python\nfrom sentence_transformers import SentenceTransformer\nimport numpy as np\n\nclass VectorMemory:\n    def __init__(self, model_name='all-MiniLM-L6-v2'):\n        self.model = SentenceTransformer(model_name)\n        self.embeddings = []\n        self.messages = []\n        self.timestamps = []\n\n    def add(self, role, content, timestamp):\n        # Normalize so the dot product in retrieve() is cosine similarity\n        emb = self.model.encode(content, normalize_embeddings=True)\n        self.embeddings.append(emb)\n        self.messages.append({'role': role, 'content': content})\n        self.timestamps.append(timestamp)\n\n    def retrieve(self, query, top_k=5):\n        if not self.embeddings:\n            return []\n        query_emb = self.model.encode(query, normalize_embeddings=True)\n        scores = np.dot(self.embeddings, query_emb)\n        # Indices of the top_k highest scores, best match first\n        top_indices = np.argsort(scores)[-top_k:][::-1]\n        return [self.messages[i] for i in top_indices]\n```\n\nThen in the main loop:\n\n``` python\nimport time\n\nimport openai\n\nmemory = VectorMemory()\nrecent_window = []\n\nwhile True:\n    user_input = input(\"You: \")\n\n    # Retrieve relevant past context before storing the new message,\n    # so the query does not simply match itself\n    relevant = memory.retrieve(user_input, top_k=5)\n\n    memory.add('user', user_input, time.time())\n    recent_window.append({'role': 'user', 'content': user_input})\n    # Keep only recent 3 messages for immediate context\n    recent = recent_window[-3:]\n\n    # Build prompt\n    system_prompt = \"You are a helpful assistant. 
Use the retrieved context to answer.\"\n    context_messages = [{'role': 'system', 'content': system_prompt}]\n    # The chat API only accepts known roles, so pass retrieved messages as system context\n    context_messages += [{'role': 'system', 'content': f\"Earlier in this conversation, the {m['role']} said: {m['content']}\"} for m in relevant]\n    context_messages += recent\n\n    response = openai.ChatCompletion.create(\n        model='gpt-3.5-turbo',\n        messages=context_messages\n    )\n    assistant_reply = response.choices[0].message.content\n    memory.add('assistant', assistant_reply, time.time())\n    recent_window.append({'role': 'assistant', 'content': assistant_reply})\n    print(f\"Bot: {assistant_reply}\")\n```\n\nThis approach has a few key advantages:\n\nIt's not perfect. Here are the downsides I discovered:\n\n**Semantic retrieval isn't always right.** A user might ask \"What was my order number?\" and the retrieval correctly finds the earlier message with the order number. But if they ask \"Can you repeat that?\" the embedding might not match the earlier context. I added a fallback: if retrieval returns low similarity scores, include the last 10 messages raw.\n\n**Cold start problem.** For the first few messages, there's nothing to retrieve. That's fine — the recent window handles it.\n\n**Memory grows unbounded.** I had to implement a cleanup: delete embeddings older than 24 hours or beyond a max count (10,000 messages). For production, use a proper vector database with TTL.\n\n**Cost of embeddings.** Each message requires an embedding call. Using a local model like `all-MiniLM-L6-v2` is free and fast (CPU inference in ~10ms). If you use an API-based embedding service, costs add up. I stuck with local.\n\n**When not to use this:** If your conversations are always short (<10 turns), truncation is simpler and works fine. If your bot needs strict chronological recall (like \"what did I say last?\") rather than semantic recall, a sliding window with summarization might be better.\n\nI'd start with the vector memory approach from day one. 
The naive truncation cost me a week of user complaints and refactoring. I'd also build a small evaluation set of 50 test conversations with expected answers, so I could measure recall accuracy before shipping.\n\nAlso, I'd separate short-term and long-term memory. The last 5 messages should always be in the prompt (for immediate context), while retrieval only pulls from older history. That hybrid prevents the bot from missing the last thing the user just said.\n\nContext management is the hardest part of building a conversational AI that feels smart. It's not about the model — GPT-3.5 is plenty capable. It's about feeding it the right information at the right time. Vector-based retrieval turned my goldfish into an elephant.\n\nIf you're building a chatbot and hitting the context wall, don't reach for a bigger model. Reach for a better memory system.\n\nWhat's your approach to handling long conversations? I'd love to hear what's worked for you.", "url": "https://wpnews.pro/news/i-spent-a-week-fixing-my-chatbot-s-memory-here-s-what-worked", "canonical_source": "https://dev.to/__c1b9e06dc90a7e0a676b/i-spent-a-week-fixing-my-chatbots-memory-heres-what-worked-43dj", "published_at": "2026-05-30 08:01:58+00:00", "updated_at": "2026-05-30 08:11:22.145915+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-tools", "natural-language-processing", "ai-startups"], "entities": ["OpenAI", "GPT-3.5-turbo", "ChatGPT"], "alternates": {"html": "https://wpnews.pro/news/i-spent-a-week-fixing-my-chatbot-s-memory-here-s-what-worked", "markdown": "https://wpnews.pro/news/i-spent-a-week-fixing-my-chatbot-s-memory-here-s-what-worked.md", "text": "https://wpnews.pro/news/i-spent-a-week-fixing-my-chatbot-s-memory-here-s-what-worked.txt", "jsonld": "https://wpnews.pro/news/i-spent-a-week-fixing-my-chatbot-s-memory-here-s-what-worked.jsonld"}}