I almost gave up on my AI assistant — here’s how I fixed context handling

A developer built a personal AI assistant but struggled with context handling as conversations grew longer. They implemented a hierarchical context management system that keeps recent messages raw and periodically summarizes older history, solving token limits and memory issues.

I've been building a personal AI assistant for the past few months. You know the kind: you chat with it, it remembers what you said, and it helps with tasks like summarizing emails, answering questions about your notes, or just being a sounding board.

It started as a weekend project: a few Python scripts, an OpenAI-compatible API endpoint, and a simple loop in the terminal. I was smug. "Look, I built an AI!"

But then things got ugly. The moment I started having longer conversations, the bot became useless. It would forget what I said three messages ago, contradict itself, or start repeating the same advice. I was throwing more and more tokens at the API, and my wallet was crying. Something had to change.

My first attempt was trivial: just append every new message to a list and send the whole history as the `messages` array to the API. That worked… for about 10 exchanges. Then token limits kicked in. The API started truncating the oldest messages, breaking the conversation flow.

Next I tried a sliding window approach: keep only the last N messages. Better, but the assistant lost the long-term context. If I asked it to "remind me of that book I mentioned yesterday," it had no idea. I was essentially lobotomizing my bot every few turns.

Another dead end was summarizing the earlier parts of the conversation on every turn. That worked technically, but it added latency and cost, because each turn I had to re-summarize the entire history. Not sustainable.

I needed a system that could keep the most recent turns verbatim, hold on to long-term context, and do both without re-summarizing the entire history on every turn.

This turned out to be a well-known pattern in conversational AI: hierarchical context management. I just didn't know the name then.

Here's the high-level design:

```
Messages
├─ Recent (last 5-10 messages) → passed raw to the API
└─ Older history → periodically summarized into a static summary string
```

The key insight is that you don't need to summarize after every message. You only need to rotate the summary when the conversation has grown enough to push important content out of the recent window.
For my use case, I set a threshold: once the recent window exceeds 6 messages AND enough time has passed since the last summarization (60 seconds in the code below), I trigger a summarization. Here's the Python class that implements this:

```python
import time
from typing import Dict, List

SUMMARY_INTERVAL_SECONDS = 60  # minimum gap between summarizations
MAX_SUMMARY_CHARS = 500        # cap for the naive fallback summary


class ContextManager:
    def __init__(self, max_recent: int = 6, summary: str = ""):
        self.max_recent = max_recent
        self.summary = summary
        self.recent_messages: List[Dict[str, str]] = []
        self.last_summary_time = time.time()

    def add_message(self, role: str, content: str) -> None:
        self.recent_messages.append({"role": role, "content": content})
        if len(self.recent_messages) > self.max_recent:
            self._maybe_summarize()

    def _maybe_summarize(self) -> None:
        # Summarize only if we have overflow and enough time has passed
        if time.time() - self.last_summary_time < SUMMARY_INTERVAL_SECONDS:
            return
        # Keep the newest max_recent - 2 messages raw (4 with the default)
        # so the window has room to grow before the next rotation
        keep = max(self.max_recent - 2, 1)
        older = self.recent_messages[:-keep]
        if older:
            new_summary = self._summarize_messages(older)
            self.summary = new_summary or self.summary
            self.recent_messages = self.recent_messages[-keep:]
            self.last_summary_time = time.time()

    def _summarize_messages(self, msgs: List[Dict[str, str]]) -> str:
        # This is where you call an LLM to produce a concise summary.
        # For minimal dependencies I used simple concatenation + truncation,
        # but a real LLM call is better.
        # Fold in the existing summary so earlier history survives a rotation.
        parts = [self.summary] if self.summary else []
        parts.extend(m["content"] for m in msgs)
        text = "\n".join(parts)
        # Naive fallback: keep only the most recent MAX_SUMMARY_CHARS characters
        return text[-MAX_SUMMARY_CHARS:]

    def build_context(self, system_prompt: str) -> List[Dict[str, str]]:
        content = system_prompt
        if self.summary:
            content += f"\nSummary of earlier conversation: {self.summary}"
        return [{"role": "system", "content": content}] + self.recent_messages
```

This class builds the context array that you send to the API. The system prompt now includes a compressed summary, and the recent messages are passed raw. The trade-off? The summary can lose nuance. But in my experience it's good enough for about 90% of use cases.
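The summarization step in the class above is only a truncation fallback. As a rough sketch of what an LLM-backed version might look like, the function below folds the previous summary and the overflow messages into a single prompt. Both `summarize_with_llm` and the `complete` callable are hypothetical names that are not part of my original class; `complete` stands in for whatever function sends a messages list to your endpoint and returns the reply text.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def summarize_with_llm(
    previous_summary: str,
    older: List[Message],
    complete: Callable[[List[Message]], str],  # hypothetical: messages in, reply text out
    max_chars: int = 500,
) -> str:
    """Fold overflow messages into the running summary via an LLM call."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    prompt = [
        {
            "role": "system",
            "content": (
                "Update the running summary of a conversation. "
                "Keep names, decisions, and open questions. Be concise."
            ),
        },
        {
            "role": "user",
            "content": (
                f"Current summary:\n{previous_summary or '(none)'}\n\n"
                f"New messages:\n{transcript}"
            ),
        },
    ]
    try:
        summary = complete(prompt).strip()
    except Exception:
        # If the call fails, keep the old summary rather than lose it
        return previous_summary
    if not summary:
        return previous_summary
    # Hard cap so the summary cannot crowd out the recent messages
    return summary[:max_chars]
```

Passing the call in as a function keeps the summarizer independent of any particular client library, and it makes the logic easy to exercise with a stub before spending real tokens.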
Here's how I hook it into an actual OpenAI-compatible API (I used the endpoint from ai.interwestinfo.com in my config):

```python
import openai

# Legacy interface: this call style requires openai<1.0. On 1.x, create a
# client with OpenAI(base_url=...) and call client.chat.completions.create(...).

context = ContextManager(max_recent=6)

# ... after some conversation
user_input = "What were we discussing about the book?"
context.add_message("user", user_input)

messages = context.build_context("You are a helpful assistant.")
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=messages,
    api_base="https://ai.interwestinfo.com/v1",  # my custom endpoint
)

assistant_reply = response.choices[0].message.content
context.add_message("assistant", assistant_reply)
```

This pattern worked for me. The bot now remembers key points from ten minutes ago, and I'm not going bankrupt on tokens. `max_recent`, the summarization threshold, and the summary length are all knobs you can turn. Start small and increase until you hit your quality/cost balance.

If I were to start over, I'd build the summarization step as an async background job. Right now, the `_maybe_summarize` call blocks the main thread when it triggers. Not a big deal for a CLI assistant, but for a web app with many concurrent users, that's a problem.

I'd also pre-validate the summary length against the model's token limit. In my current version, the summary can grow beyond the system prompt slot, causing the API to truncate the recent messages. I need to enforce a token budget.

Finally, I'd make persistence to a database explicit. Right now the context lives in memory, so if the server restarts, the assistant forgets everything. A simple Redis store would fix that.

I'm curious how other devs solve this. Do you use a fixed token window? A vector store? Or do you rely on the model's context window and pay the price? Let me know in the comments. I'd love to compare notes.
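One follow-up on the token budget mentioned earlier. This is a minimal sketch of how I might enforce it, not something from my current code. `estimate_tokens` and `fit_to_budget` are hypothetical helpers, and the "about 4 characters per token" rule is a rough assumption for English text; a real tokenizer for your model would be more accurate.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    # Rough heuristic (assumption): ~4 characters per token for English text
    return max(1, len(text) // 4)


def fit_to_budget(system: Message, recent: List[Message], max_tokens: int) -> List[Message]:
    """Drop the oldest recent messages until system + recent fits in max_tokens."""
    budget = max_tokens - estimate_tokens(system["content"])
    kept: List[Message] = []
    # Walk from newest to oldest so the latest turns are the last to be dropped
    for msg in reversed(recent):
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    kept.reverse()  # restore chronological order
    return [system] + kept
```

The idea would be to run the output of `build_context` through something like this before each API call, so that an oversized summary costs you old recent messages in a predictable way instead of leaving the truncation up to the endpoint.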