# I almost gave up on my AI assistant — here’s how I fixed context handling

> Source: <https://dev.to/__c1b9e06dc90a7e0a676b/i-almost-gave-up-on-my-ai-assistant-heres-how-i-fixed-context-handling-40gl>
> Published: 2026-06-13 10:00:44+00:00

I’ve been building a personal AI assistant for the past few months. You know the kind: you chat with it, it remembers what you said, and it helps with tasks like summarizing emails, answering questions about your notes, or just being a sounding board.

It started as a weekend project. A few Python scripts, an OpenAI-compatible API endpoint, and a simple loop in the terminal. I was smug. "Look, I built an AI!" But then things got ugly.

The moment I started having longer conversations, the bot became useless. It would forget what I said three messages ago, contradict itself, or start repeating the same advice. I was throwing more and more tokens at the API, and my wallet was crying. Something had to change.

My first attempt was trivial: just append every new message to a list and send the whole history as the `messages` array to the API. That worked… for about 10 exchanges. Then token limits kicked in. The API started truncating the oldest messages, breaking the conversation flow.

I tried a sliding window approach—keep only the last N messages. Better, but the assistant lost the long-term context. If I asked it to "remind me of that book I mentioned yesterday," it had no idea. I was essentially lobotomizing my bot every few turns.
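A minimal sketch of the sliding-window idea shows why it loses long-term context. This is an illustration only; `SlidingWindow` and `max_messages` are hypothetical names, not code from the assistant itself.

```python
from collections import deque
from typing import Deque, Dict, List


class SlidingWindow:
    """Keeps only the most recent `max_messages` chat messages (hypothetical helper)."""

    def __init__(self, max_messages: int = 6):
        # A deque with maxlen silently drops the oldest item on overflow
        self.messages: Deque[Dict[str, str]] = deque(maxlen=max_messages)

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})

    def as_list(self) -> List[Dict[str, str]]:
        return list(self.messages)


window = SlidingWindow(max_messages=3)
window.add("user", "I'm reading Dune.")
window.add("assistant", "Great choice!")
window.add("user", "What's the weather?")
window.add("assistant", "Sunny.")
# The first message (the one mentioning the book) has already been dropped.
```

Once the window is full, every new message pushes out the oldest one, and nothing records what was lost.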

Another dead end was summarizing earlier parts of the conversation on every turn. That worked technically, but it added latency and cost. Each turn, I had to re-summarize the entire history. Not sustainable.

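I needed a system that could keep recent messages verbatim, hold on to long-term context from older messages, and do both without re-summarizing the entire history on every turn.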

This turned out to be a well-known pattern in conversational AI: **hierarchical context management**. I just didn't know the name then.

Here’s the high-level design:

```
[Messages]
  ├─ Recent (last 5-10 messages) → passed raw to the API
  └─ Older history → periodically summarized into a static summary string
```

The key insight is that you don’t need to summarize after every message. You only need to rotate the summary when the conversation has grown enough to push out important content. For my use case, I set a threshold: once the recent window exceeds 6 messages AND the oldest message in that window is older than X minutes, I trigger a summarization.
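The trigger condition described above can be sketched as a small predicate. This is a sketch under assumptions: each message carries a `timestamp` field, and `max_age_seconds` is a placeholder value, since the actual age threshold is not given here. Note that the class in this post checks the time since the last summarization rather than the age of the oldest message.

```python
import time
from typing import Dict, List, Optional


def should_summarize(
    recent: List[Dict],
    max_recent: int = 6,
    max_age_seconds: float = 300.0,  # assumed value; the real threshold is unspecified
    now: Optional[float] = None,
) -> bool:
    """Return True when the window is over size AND its oldest message is old enough."""
    if len(recent) <= max_recent:
        return False
    now = time.time() if now is None else now
    # Assumes messages are stored oldest-first with a "timestamp" key
    oldest_age = now - recent[0]["timestamp"]
    return oldest_age > max_age_seconds
```

Requiring both conditions avoids summarizing during a fast burst of short messages, where the window overflows but nothing in it is stale yet.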

Here’s the Python class that implements this:

``` python
import time
from typing import List, Dict, Optional

class ContextManager:
    def __init__(self, max_recent: int = 6, summary: str = ""):
        self.max_recent = max_recent
        self.summary = summary
        self.recent_messages: List[Dict] = []
        self.last_summary_time = time.time()

    def add_message(self, role: str, content: str):
        self.recent_messages.append({"role": role, "content": content})
        if len(self.recent_messages) > self.max_recent:
            self._maybe_summarize()

    def _maybe_summarize(self):
        # Summarize only if enough time has passed since the last rotation;
        # until then the recent window is allowed to grow past max_recent.
        if time.time() - self.last_summary_time < 60:
            return
        # Fold everything except the newest (max_recent - 2) messages into the summary
        keep = max(self.max_recent - 2, 1)
        older = self.recent_messages[:-keep]
        if older:
            # Include the previous summary so earlier context is carried forward
            carried = [{"role": "system", "content": self.summary}] if self.summary else []
            new_summary = self._summarize_messages(carried + older)
            self.summary = new_summary if new_summary else self.summary
            self.recent_messages = self.recent_messages[-keep:]
            self.last_summary_time = time.time()

    def _summarize_messages(self, msgs: List[Dict]) -> str:
        # This is where you call an LLM to produce a concise summary.
        # For minimal dependency, I used a simple concatenation + truncation,
        # but a real LLM call is better.
        text = "\n".join(m["content"] for m in msgs)
        # Naive fallback: keep the most recent 500 chars so new content is not
        # crowded out by the carried-over summary
        return text[-500:]

    def build_context(self, system_prompt: str) -> List[Dict]:
        system = {"role": "system", "content": f"{system_prompt}\nSummary of earlier conversation: {self.summary}"}
        return [system] + self.recent_messages
```

This class builds the context array that you send to the API. The system prompt now includes a compressed summary, and the recent messages are raw. The trade-off? The summary can lose nuance. But it’s good enough for 90% of use cases.
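The `_summarize_messages` placeholder says a real LLM call is better than truncation. One way that could look is sketched below. It is not the code the assistant runs: `summarize_with_llm`, `complete`, and `max_chars` are hypothetical names, and `complete` stands for any function that takes a prompt string and returns the model's text.

```python
from typing import Callable, Dict, List


def summarize_with_llm(
    previous_summary: str,
    msgs: List[Dict[str, str]],
    complete: Callable[[str], str],
    max_chars: int = 500,
) -> str:
    """Fold `msgs` into `previous_summary` using an injected completion function."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in msgs)
    prompt = (
        "Update the running summary of a conversation.\n"
        f"Current summary:\n{previous_summary or '(none)'}\n\n"
        f"New messages:\n{transcript}\n\n"
        f"Reply with an updated summary under {max_chars} characters."
    )
    summary = complete(prompt).strip()
    # Hard cap in case the model ignores the length instruction
    return summary[:max_chars]
```

Passing the completion function in as an argument keeps the summarizer independent of any particular SDK and makes it easy to test with a stub.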

Here’s how I hook it into an actual OpenAI-compatible API (I used the endpoint from `ai.interwestinfo.com` in my config):

``` python
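from openai import OpenAI

# base_url points the client at an OpenAI-compatible endpoint;
# the API key is read from the OPENAI_API_KEY environment variable
client = OpenAI(base_url="https://ai.interwestinfo.com/v1")  # my custom endpoint

context = ContextManager(max_recent=6)
# ... after some conversation
user_input = "What were we discussing about the book?"
context.add_message("user", user_input)

messages = context.build_context("You are a helpful assistant.")

# openai>=1.0 client API; the older openai.ChatCompletion interface was removed in 1.0
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages,
)
assistant_reply = response.choices[0].message.content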
context.add_message("assistant", assistant_reply)
```

This pattern worked for me. The bot now remembers key points from ten minutes ago, and I’m not going bankrupt on tokens.

`max_recent`, the threshold for summarization, and the summary length are all knobs you can turn. Start small and increase until you meet your quality/cost balance.

If I were to start over, I’d build the summarization step as an async background job. Right now, the `_maybe_summarize` call blocks the main thread when it triggers. Not a big deal for a CLI assistant, but for a web app with many concurrent users, that’s a problem.
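A background version could look roughly like the sketch below, using `asyncio`. Everything here is hypothetical (`AsyncContext`, the `summarize` coroutine, keeping half the window raw); it only illustrates scheduling the rotation as a task so that adding a message returns immediately.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional


class AsyncContext:
    """Hypothetical variant that summarizes in a background task."""

    def __init__(self, summarize: Callable[[str, List[Dict]], Awaitable[str]], max_recent: int = 6):
        self.summarize = summarize
        self.max_recent = max_recent
        self.summary = ""
        self.recent: List[Dict] = []
        self._task: Optional[asyncio.Task] = None

    def add_message(self, role: str, content: str) -> None:
        """Must be called from within a running event loop."""
        self.recent.append({"role": role, "content": content})
        busy = self._task is not None and not self._task.done()
        if len(self.recent) > self.max_recent and not busy:
            # Schedule the rotation and return immediately; the caller is not blocked
            self._task = asyncio.create_task(self._rotate())

    async def _rotate(self) -> None:
        keep = max(self.max_recent // 2, 1)
        older = self.recent[:-keep]
        new_summary = await self.summarize(self.summary, older)
        # Drop exactly the messages that were summarized; anything appended
        # while awaiting stays in the window.
        self.summary = new_summary
        self.recent = self.recent[len(older):]


async def demo() -> AsyncContext:
    async def fake_summarize(prev: str, msgs: List[Dict]) -> str:
        await asyncio.sleep(0)  # stand-in for a network call
        return (prev + " " + " ".join(m["content"] for m in msgs)).strip()

    ctx = AsyncContext(fake_summarize, max_recent=4)
    for i in range(5):
        ctx.add_message("user", f"m{i}")
    if ctx._task:
        await ctx._task
    return ctx
```

Only one rotation runs at a time, and a failed summarization leaves the window untouched. A production version would also want to log or retry when the task raises.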

I’d also pre-validate the summary length against the model’s token limit. In my current version, the summary can grow beyond the system prompt slot, causing the API to truncate the recent messages. I need to enforce a token budget.
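A token budget could be enforced with something like the sketch below. The 4-characters-per-token estimate is a rough assumption, and `estimate_tokens`, `trim_summary`, and `reserve_for_reply` are hypothetical names; a real tokenizer for the target model would be more accurate.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_summary(
    summary: str,
    recent: List[Dict[str, str]],
    budget: int,
    reserve_for_reply: int = 256,
) -> str:
    """Shorten the summary so summary + recent messages + reply fit in `budget` tokens."""
    used = sum(estimate_tokens(m["content"]) for m in recent) + reserve_for_reply
    available = budget - used
    if available <= 0:
        return ""  # no room left; the raw recent messages take priority
    if estimate_tokens(summary) <= available:
        return summary
    # Cut the summary down to the estimated number of characters that fit
    return summary[: available * 4]
```

The design choice here is that the summary gives way first, so the recent raw messages are never the part that gets cut.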

Finally, I’d persist the context to a database explicitly. Right now the context is in-memory. If the server restarts, the assistant forgets everything. A simple Redis store would fix that.
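Persistence can be sketched as a pair of save/load helpers over any store with `get` and `set` methods. The helper names and the JSON layout are assumptions for illustration. The redis-py `Redis` client exposes `set` and `get` with this shape, so it could be passed in as `store`; the `DictStore` below is only an in-memory stand-in for trying it out without a server.

```python
import json
from typing import Any, Dict, List, Tuple


def save_context(store: Any, key: str, summary: str, recent: List[Dict[str, str]]) -> None:
    """Serialize the context to JSON and write it under `key`."""
    store.set(key, json.dumps({"summary": summary, "recent": recent}))


def load_context(store: Any, key: str) -> Tuple[str, List[Dict[str, str]]]:
    """Read the context back; returns an empty context if the key is missing."""
    raw = store.get(key)
    if raw is None:
        return "", []
    data = json.loads(raw)  # json.loads accepts str or bytes
    return data["summary"], data["recent"]


class DictStore:
    """In-memory stand-in with the same get/set shape, for local testing."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def set(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str):
        return self._data.get(key)


store = DictStore()
save_context(store, "ctx:alice", "Talked about Dune.", [{"role": "user", "content": "hi"}])
restored_summary, restored_recent = load_context(store, "ctx:alice")
```

Saving after each assistant reply and loading on startup would be enough to survive a restart in this simple setup.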

I’m curious how other devs solve this. Do you use a fixed token window? A vector store? Or do you rely on the model’s internal memory (and pay the price)? Let me know in the comments—I’d love to compare notes.
