# I spent a week fixing my chatbot's memory — here's what worked

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/i-spent-a-week-fixing-my-chatbots-memory-heres-what-worked-43dj>
> Published: 2026-05-30 08:01:58+00:00

Two months ago, I shipped a customer support chatbot for my SaaS product. It worked great for the first three messages. Then it started forgetting what the user said earlier, repeating itself, and giving contradictory advice. Users noticed. One wrote: "Your bot has the memory of a goldfish."

I had hit the classic LLM context window wall. My initial implementation just stuffed the entire conversation history into the prompt. That worked until conversations grew beyond 4k tokens. Then I tried truncation, but that lost critical context. The problem felt unsolvable without either breaking the bank on bigger context windows or losing information.

Here's what I tried, what failed, and the approach that finally let my chatbot hold coherent multi-turn conversations without blowing up my API costs.

My first attempt was embarrassingly simple:

``` python
import openai  # legacy SDK (openai<1.0), which exposes ChatCompletion

messages = []
while True:
    user_input = input()
    messages.append({"role": "user", "content": user_input})
    # Send the full, ever-growing history on every turn
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages
    )
    assistant_reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_reply})
    print(assistant_reply)
```

This works for about 10-15 turns before hitting the 4096 token limit. After that, the API throws an error. So I added truncation: keep only the last N messages.

I set `max_messages=20` and dropped the oldest ones:

``` python
def trim_messages(messages, max_messages=20):
    return messages[-max_messages:]
```

Now the API never hit the limit, but the bot forgot everything from early in the conversation. A user would say "I already told you my email is test@example.com" and the bot would ask for it again. Worse, if the user had corrected themselves earlier, the bot would revert to the wrong info once the correction scrolled out of the window.

This is the goldfish problem. Truncation is simple but brutal.
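Trimming by message count also ignores how long each message is, so 20 long messages can still overflow the limit. A minimal sketch of trimming by an approximate token budget instead; the `estimate_tokens` heuristic (about 4 characters per token) and the `trim_to_budget` name are assumptions for illustration, and a real tokenizer would be more accurate. Note that it still drops the oldest facts, as the example shows:

``` python
def estimate_tokens(text):
    # Rough assumption: about 4 characters per token for English text.
    return max(1, len(text) // 4)


def trim_to_budget(messages, max_tokens=3000):
    """Keep the newest messages whose estimated size fits in max_tokens."""
    kept = []
    used = 0
    # Walk backwards so the most recent messages are kept first.
    for message in reversed(messages):
        cost = estimate_tokens(message["content"])
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "My email is test@example.com"},
    {"role": "assistant", "content": "x" * 400},
    {"role": "user", "content": "y" * 400},
]
# With a small budget the two long messages fit, but the email is dropped.
trimmed = trim_to_budget(history, max_tokens=200)
```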

Next I tried summarizing the conversation history every few turns and injecting the summary as a system message:

``` python
import openai

def summarize_history(messages):
    history_text = "\n".join([f"{m['role']}: {m['content']}" for m in messages])
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Summarize the key facts and context from this conversation."},
            {"role": "user", "content": history_text}
        ]
    )
    return response.choices[0].message.content
```

I called this every 10 turns and replaced the old history with the summary plus the last few raw messages. It worked better — the bot remembered key facts like names and preferences. But it added latency (an extra API call per summary) and cost. Also, summaries sometimes lost nuance. A user's frustrated tone or a subtle correction could get flattened.
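The "summary plus the last few raw messages" step can be sketched roughly like this. The `compact_history` name and its `max_messages` and `keep_recent` parameters are assumptions, not my exact code, and the summarizer is passed in as a callable so the sketch runs without an API key:

``` python
def compact_history(messages, summarize, max_messages=10, keep_recent=4):
    """Replace older messages with one summary message once history grows.

    `summarize` is any callable that maps a list of messages to a string,
    for example the summarize_history function above.
    """
    if len(messages) <= max_messages:
        return messages
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_message = {
        "role": "system",
        "content": f"Summary of the earlier conversation: {summarize(older)}",
    }
    return [summary_message] + recent


def fake_summarize(older):
    # Stand-in for the real LLM call, used only for this example.
    return f"{len(older)} earlier messages"


history = [{"role": "user", "content": f"message {i}"} for i in range(12)]
compacted = compact_history(history, fake_summarize)
```

On the next compaction the previous summary is itself part of `older`, so it gets summarized again. That is one way nuance can erode over a long conversation.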

After reading about how RAG (Retrieval Augmented Generation) works for documents, I realized I could apply the same idea to conversation history. Instead of keeping everything or summarizing everything, I would store each message as an embedding and retrieve only the most relevant past messages when generating a response.

Here's the architecture:

I used `sentence-transformers` for embeddings and a simple in-memory vector store (for production I'd use Pinecone or Chroma). The code looks like this:

``` python
from sentence_transformers import SentenceTransformer
import numpy as np

class VectorMemory:
    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.embeddings = []
        self.messages = []
        self.timestamps = []

    def add(self, role, content, timestamp):
        # Unit-length vectors make the dot product below a cosine similarity
        emb = self.model.encode(content, normalize_embeddings=True)
        self.embeddings.append(emb)
        self.messages.append({'role': role, 'content': content})
        self.timestamps.append(timestamp)

    def retrieve(self, query, top_k=5):
        if not self.embeddings:
            return []
        query_emb = self.model.encode(query, normalize_embeddings=True)
        scores = np.dot(np.array(self.embeddings), query_emb)
        # Indices of the top_k highest scores, best match first
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.messages[i] for i in top_indices]
```

Then in the main loop:

``` python
import time
import openai  # legacy SDK (openai<1.0)

memory = VectorMemory()
recent_window = []

while True:
    user_input = input("You: ")

    # Retrieve relevant past context before storing the new message,
    # so the query does not simply match itself
    relevant = memory.retrieve(user_input, top_k=5)
    memory.add('user', user_input, time.time())
    recent_window.append({'role': 'user', 'content': user_input})
    # Keep only recent 3 messages for immediate context
    recent = recent_window[-3:]

    # Build prompt. The chat API only accepts known roles such as system,
    # user, and assistant, so retrieved messages go into a system message
    system_prompt = "You are a helpful assistant. Use the retrieved context to answer."
    context_messages = [{'role': 'system', 'content': system_prompt}]
    if relevant:
        retrieved_text = "\n".join(f"{m['role']}: {m['content']}" for m in relevant)
        context_messages.append({'role': 'system', 'content': "Retrieved context:\n" + retrieved_text})
    context_messages += recent

    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=context_messages
    )
    assistant_reply = response.choices[0].message.content
    memory.add('assistant', assistant_reply, time.time())
    recent_window.append({'role': 'assistant', 'content': assistant_reply})
    print(f"Bot: {assistant_reply}")
```

This approach has a few key advantages:

It's not perfect. Here are the downsides I discovered:

**Semantic retrieval isn't always right.** A user might ask "What was my order number?" and the retrieval correctly finds the earlier message with the order number. But if they ask "Can you repeat that?" the embedding might not match the earlier context. I added a fallback: if retrieval returns low similarity scores, include the last 10 messages raw.
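The fallback can be sketched roughly as follows. This assumes the retriever returns `(similarity, message)` pairs rather than bare messages, and the `select_context` name and the `min_score=0.3` threshold are placeholders you would tune on your own data:

``` python
def select_context(scored_messages, recent_raw, min_score=0.3, top_k=5, fallback_n=10):
    """Pick confidently retrieved messages, or fall back to raw recent history.

    scored_messages: list of (similarity, message) pairs from the retriever.
    recent_raw: the full chronological message list.
    """
    ranked = sorted(scored_messages, key=lambda pair: pair[0], reverse=True)[:top_k]
    strong = [message for score, message in ranked if score >= min_score]
    if not strong:
        # Nothing matched well enough, e.g. "Can you repeat that?"
        return recent_raw[-fallback_n:]
    return strong


history = [{"role": "user", "content": f"m{i}"} for i in range(12)]
good = [(0.9, history[0]), (0.2, history[1])]
weak = [(0.1, history[0]), (0.05, history[1])]

confident = select_context(good, history)  # only the strong match
fallback = select_context(weak, history)   # last 10 raw messages
```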

**Cold start problem.** For the first few messages, there's nothing to retrieve. That's fine — the recent window handles it.

**Memory grows unbounded.** I had to implement a cleanup: delete embeddings older than 24 hours or beyond a max count (10,000 messages). For production, use a proper vector database with TTL.
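A minimal sketch of that cleanup, assuming each stored entry is a dict with a `timestamp` key and that entries are kept in chronological order. The `prune` name and the entry layout are illustrative; with the `VectorMemory` class above you would apply the same filter to its three parallel lists:

``` python
import time


def prune(entries, max_age_seconds=24 * 60 * 60, max_count=10_000, now=None):
    """Drop entries older than max_age_seconds, then cap the total count."""
    now = time.time() if now is None else now
    fresh = [e for e in entries if now - e["timestamp"] <= max_age_seconds]
    # Keep only the newest max_count entries.
    return fresh[-max_count:]


entries = [
    {"timestamp": 0, "content": "old"},
    {"timestamp": 90_000, "content": "a"},
    {"timestamp": 99_000, "content": "b"},
    {"timestamp": 100_000, "content": "c"},
]
# "old" is more than 24 hours before now; the count cap then drops "a".
pruned = prune(entries, max_count=2, now=100_000)
```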

**Cost of embeddings.** Each message requires an embedding call. Using a local model like `all-MiniLM-L6-v2` is free and fast (CPU inference in ~10ms). If you use an API-based embedding service, costs add up. I stuck with local.

**When not to use this:** If your conversations are always short (<10 turns), truncation is simpler and works fine. If your bot needs strict chronological recall (like "what did I say last?") rather than semantic recall, a sliding window with summarization might be better.

If I were starting over, I'd use the vector memory approach from day one. The naive truncation cost me a week of user complaints and refactoring. I'd also build a small evaluation set of 50 test conversations with expected answers, so I could measure recall accuracy before shipping.
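One way such an evaluation could look, as a rough sketch. The case format, the `recall_at_k` name, and the `last_k` baseline are all assumptions; the idea is to plug in any retriever and compare it against plain truncation:

``` python
def recall_at_k(cases, retrieve, k=5):
    """Fraction of cases where the expected fact appears in the top-k results.

    cases: list of dicts with "history" (list of message strings),
    "query" (str), and "expected" (substring that must be retrieved).
    retrieve: callable (query, history, k) -> list of message strings.
    """
    if not cases:
        return 0.0
    hits = 0
    for case in cases:
        results = retrieve(case["query"], case["history"], k)
        if any(case["expected"] in text for text in results):
            hits += 1
    return hits / len(cases)


def last_k(query, history, k):
    # Baseline retriever: plain truncation, ignores the query entirely.
    return history[-k:]


cases = [
    {"history": ["my email is test@example.com", "a", "b", "c"],
     "query": "what is my email", "expected": "test@example.com"},
    {"history": ["a", "b", "order 4417 is late"],
     "query": "which order", "expected": "4417"},
]
truncation_recall = recall_at_k(cases, last_k, k=2)
```

Here truncation with `k=2` recovers the order number but loses the email, which is the failure I would want a test suite to catch before users do.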

Also, I'd separate short-term and long-term memory. The last 5 messages should always be in the prompt (for immediate context), while retrieval only pulls from older history. That hybrid prevents the bot from missing the last thing the user just said.
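A sketch of that hybrid, under some assumptions: `build_prompt` and `split_memory` are hypothetical names, the retriever is injected as a callable, and `word_overlap_retrieve` is a toy stand-in for embedding search so the example runs without a model:

``` python
def split_memory(history, recent_n=5):
    """Split chronological history into (long_term, short_term)."""
    if recent_n <= 0:
        return list(history), []
    return history[:-recent_n], history[-recent_n:]


def build_prompt(history, retrieve, system_prompt, recent_n=5, top_k=3):
    long_term, short_term = split_memory(history, recent_n)
    query = short_term[-1]["content"] if short_term else ""
    # Retrieval only searches messages older than the recent window.
    retrieved = retrieve(query, long_term, top_k) if long_term else []
    prompt = [{"role": "system", "content": system_prompt}]
    if retrieved:
        notes = "\n".join(f"- {m['role']}: {m['content']}" for m in retrieved)
        prompt.append({"role": "system", "content": "Relevant earlier messages:\n" + notes})
    # The last recent_n messages always go in verbatim.
    return prompt + short_term


def word_overlap_retrieve(query, candidates, top_k):
    # Toy retriever: rank candidates by shared words with the query.
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(m["content"].lower().split())), m) for m in candidates]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:top_k]]


history = [
    {"role": "user", "content": "my order number is 4417"},
    {"role": "assistant", "content": "thanks, noted"},
    {"role": "user", "content": "the app crashes on login"},
    {"role": "assistant", "content": "try clearing the cache"},
    {"role": "user", "content": "that fixed it"},
    {"role": "assistant", "content": "great"},
    {"role": "user", "content": "one more thing"},
    {"role": "assistant", "content": "sure"},
    {"role": "user", "content": "what was my order number"},
]
prompt = build_prompt(history, word_overlap_retrieve, "You are a helpful assistant.")
```

Because the recent window and the retrieval pool never overlap, the same message is not sent twice and the latest user message is always present.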

Context management is the hardest part of building a conversational AI that feels smart. It's not about the model — GPT-3.5 is plenty capable. It's about feeding it the right information at the right time. Vector-based retrieval turned my goldfish into an elephant.

If you're building a chatbot and hitting the context wall, don't reach for a bigger model. Reach for a better memory system.

What's your approach to handling long conversations? I'd love to hear what's worked for you.
