99. Build a Chatbot With Memory

A developer built a chatbot with memory by including the entire conversation history in every prompt to an LLM API. This approach allows the chatbot to understand context across multiple turns, such as knowing that "there" refers to Paris after the capital of France was mentioned. However, the method has a hard limit based on the LLM's context window, which can fill up during long conversations and requires strategies like token estimation to manage.

You ask a chatbot: "What's the capital of France?" It says: "Paris." You ask: "What's the population there?" It says: "Where?" That's a stateless chatbot. Every message is treated as a completely new conversation. It has no idea what "there" refers to. It has no memory. Real conversation doesn't work like this. Context carries forward. References accumulate. The chatbot needs to know what came before. This post builds a chatbot with memory. One that knows what you said two messages ago, what topic you're discussing, and what decisions were made earlier. Every time you call an LLM API, it starts fresh. It has zero memory of previous calls. The only context it has is what you put in the current prompt. The trick that makes chatbots work: you include the entire conversation history in every prompt. Turn 1: USER: What's the capital of France? → Send to LLM: "User: What's the capital of France?" → LLM replies: "Paris" Turn 2: USER: What's the population there? → Send to LLM: "User: What's the capital of France? Assistant: Paris. User: What's the population there?" → LLM sees full context, knows "there" = Paris Turn 3: → Send EVERYTHING from turns 1, 2, and now 3 Every message appends to a growing list. That list goes into every subsequent prompt. The LLM can refer back to it because it's in the current context. Simple. But it has a hard limit: the context window. Every LLM has a maximum number of tokens it can process at once. GPT-3.5-turbo: 16k tokens. GPT-4: 128k tokens. LLaMA-7B: 4k tokens. A long conversation fills up that window. When the conversation exceeds the limit, you can't just include everything. You need a strategy. Estimate token count rough: 1 token ≈ 4 characters for English def estimate tokens text: str - int: return len text // 4 def estimate conversation tokens messages: list - int: total = 0 for msg in messages: total += estimate tokens msg 'content' total += 4 overhead per message role, formatting return total Show how fast a conversation fills up messages = example turns = "user", "Tell me about machine learning." , "assistant", "Machine learning is a field of artificial intelligence that enables computers to learn from data without being explicitly programmed. It includes supervised learning, where models are trained on labeled examples, unsupervised learning, where patterns are found without labels, and reinforcement learning, where agents learn through trial and error." , "user", "What about deep learning specifically?" , "assistant", "Deep learning is a subset of machine learning that uses neural networks with many layers. These networks learn hierarchical representations of data, making them especially powerful for images, audio, and text. The transformer architecture, introduced in 2017, has become the foundation for most modern deep learning systems." , "user", "Can you give me examples of real applications?" , "assistant", "Sure Real applications include image classification in medical diagnosis, natural language processing for translation and chatbots, recommendation systems on Netflix and Spotify, fraud detection in banking, and autonomous driving. Deep learning powers most of these through pattern recognition at scale." , print f"{'Turn':<6} {'New tokens':<14} {'Total tokens':<14} {'% of 4k limit'}" print "-" 50 for role, content in example turns: messages.append {'role': role, 'content': content} total = estimate conversation tokens messages new = estimate tokens content print f"{len messages :<6} {new:<14} {total:<14} {total/4000:.1%}" Output: Turn New tokens Total tokens % of 4k limit -------------------------------------------------- 1 12 16 0.4% 2 73 93 2.3% 3 13 110 2.8% 4 65 179 4.5% 5 15 198 5.0% 6 72 274 6.9% A long conversation about a complex topic can easily hit 2000-3000 tokens. Add RAG context and system prompts, and you're at the limit fast. Keep only the last N messages. Simple and effective. python from collections import deque from typing import List, Optional class SlidingWindowChatbot: def init self, model pipeline, window size: int = 10, system prompt: str = "You are a helpful assistant." : self.model = model pipeline self.window size = window size max messages to keep self.system prompt = system prompt self.history = deque maxlen=window size def chat self, user message: str - str: Add user message to history self.history.append {'role': 'user', 'content': user message} Build the prompt with history messages = {'role': 'system', 'content': self.system prompt} + list self.history Call the model using a simple text format for demo prompt = self. format prompt messages response = self.model prompt Add assistant response to history self.history.append {'role': 'assistant', 'content': response} return response def format prompt self, messages: List dict - str: formatted = "" for msg in messages: if msg 'role' == 'system': formatted += f"System: {msg 'content' }\n\n" elif msg 'role' == 'user': formatted += f"Human: {msg 'content' }\n" else: formatted += f"Assistant: {msg 'content' }\n" formatted += "Assistant:" return formatted def get history self - list: return list self.history def clear self : self.history.clear print "Conversation history cleared." Simulate a conversation using a mock model for demo def mock model prompt: str - str: In production: replace with real LLM call if "capital of france" in prompt.lower : return "The capital of France is Paris." elif "population" in prompt.lower and "paris" in prompt.lower : return "Paris has a population of approximately 2.1 million in the city proper, and about 12 million in the greater metropolitan area." elif "famous landmark" in prompt.lower : return "Paris is famous for the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe." elif "eiffel tower" in prompt.lower : return "The Eiffel Tower was built between 1887 and 1889, designed by engineer Gustave Eiffel. It stands 330 meters tall." else: return "I understand. Could you tell me more?" bot = SlidingWindowChatbot mock model, window size=6 Simulate multi-turn conversation turns = "What's the capital of France?", "What's the population there?", "What are some famous landmarks in that city?", "Tell me more about the Eiffel Tower.", "When was it built?", for user input in turns: print f"\nUser: {user input}" response = bot.chat user input print f"Bot: {response}" print f"\nHistory has {len bot.get history } messages max {bot.window size} " Output: User: What's the capital of France? Bot: The capital of France is Paris. User: What's the population there? Bot: Paris has a population of approximately 2.1 million in the city proper... User: What are some famous landmarks in that city? Bot: Paris is famous for the Eiffel Tower, the Louvre Museum... User: Tell me more about the Eiffel Tower. Bot: The Eiffel Tower was built between 1887 and 1889... User: When was it built? Bot: I understand. Could you tell me more? History has 6 messages max 6 The bot understands "there" Paris and "that city" Paris from context. The sliding window keeps the last 6 messages. When history gets long, summarize old messages and keep recent ones in full. python class SummaryMemoryChatbot: def init self, model pipeline, summarizer pipeline, max recent: int = 6, summary threshold: int = 10, system prompt: str = "You are a helpful assistant." : self.model = model pipeline self.summarizer = summarizer pipeline self.max recent = max recent self.threshold = summary threshold self.system = system prompt self.history = self.summary = "" compressed memory of older turns def maybe summarize self : if len self.history < self.threshold: return Summarize the oldest half of history n to summarize = len self.history // 2 old messages = self.history :n to summarize self.history = self.history n to summarize: Format old messages as text old text = "\n".join f"{m 'role' .title }: {m 'content' }" for m in old messages Summarize in production, call LLM to summarize new summary input = f"{self.summary}\n\n{old text}" if self.summary else old text self.summary = self. summarize new summary input print f" Memory Summarized {n to summarize} messages into summary" def summarize self, text: str - str: In production: call LLM with a summarization prompt Here: mock it return f" Summary of earlier conversation: The user asked about France, Paris, its population ~2.1M , and Paris landmarks including the Eiffel Tower. " def format prompt self - str: parts = f"System: {self.system}\n" if self.summary: parts.append f" Earlier conversation summary : {self.summary}\n" for msg in self.history -self.max recent: : role = "Human" if msg 'role' == 'user' else "Assistant" parts.append f"{role}: {msg 'content' }" parts.append "Assistant:" return "\n".join parts def chat self, user message: str - str: self.history.append {'role': 'user', 'content': user message} self. maybe summarize prompt = self. format prompt response = self.model prompt self.history.append {'role': 'assistant', 'content': response} return response def memory status self : print f"Summary: {'yes' if self.summary else 'none'}" print f"Recent messages in full: {min len self.history , self.max recent }" print f"Total history: {len self.history }" summary bot = SummaryMemoryChatbot mock model, None, max recent=6, summary threshold=8 for user input in turns 2: repeat to trigger summarization response = summary bot.chat user input summary bot.memory status Extract and store specific facts about the user or conversation entities. python import re from typing import Dict class EntityMemoryChatbot: def init self, model pipeline, system prompt: str = "You are a helpful assistant." : self.model = model pipeline self.system = system prompt self.history = self.entities: Dict str, str = {} entity store def extract entities self, message: str : Simplified entity extraction in production: use NER model or LLM patterns = { 'name': r" ?:my name is|I am|I'm \s+ A-Z a-z + ", 'location': r" ?:I live in|I'm from|I'm in \s+ A-Z a-z + ?:\s+ A-Z a-z + ? ", 'job': r" ?:I am a|I work as a|I'm a \s+ a-z + ?:\s+ a-z + ? ", 'topic': r" ?:I want to learn about|I'm studying|I need help with \s+ a-z\s + " } for entity type, pattern in patterns.items : match = re.search pattern, message, re.IGNORECASE if match: self.entities entity type = match.group 1 .strip def build entity context self - str: if not self.entities: return "" lines = "Known facts about the user:" for entity, value in self.entities.items : lines.append f" - {entity}: {value}" return "\n".join lines def format prompt self - str: parts = f"System: {self.system}" entity ctx = self. build entity context if entity ctx: parts.append entity ctx for msg in self.history -8: : role = "Human" if msg 'role' == 'user' else "Assistant" parts.append f"{role}: {msg 'content' }" parts.append "Assistant:" return "\n".join parts def chat self, user message: str - str: self. extract entities user message self.history.append {'role': 'user', 'content': user message} prompt = self. format prompt response = self.model prompt self.history.append {'role': 'assistant', 'content': response} return response Test entity memory def entity mock model prompt: str - str: if "name" in prompt.lower and "Alex" in prompt: return "Nice to meet you, Alex " elif "Alex" in prompt and "recommend" in prompt.lower : return "Based on your interest in machine learning, Alex, I'd recommend starting with Python and scikit-learn." elif "course" in prompt.lower : return "For machine learning, the Andrew Ng Coursera course is excellent for beginners." else: return "Tell me more about what you'd like to learn." entity bot = EntityMemoryChatbot entity mock model conversations = "Hi, my name is Alex.", "I want to learn about machine learning.", "Can you recommend something?", "Are there any courses?", for user input in conversations: print f"\nUser: {user input}" response = entity bot.chat user input print f"Bot: {response}" print f"\nExtracted entities: {entity bot.entities}" Output: User: Hi, my name is Alex. Bot: Nice to meet you, Alex User: I want to learn about machine learning. Bot: Tell me more about what you'd like to learn. User: Can you recommend something? Bot: Based on your interest in machine learning, Alex, I'd recommend starting with Python and scikit-learn. User: Are there any courses? Bot: For machine learning, the Andrew Ng Coursera course is excellent for beginners. Extracted entities: {'name': 'Alex', 'topic': 'machine learning'} The bot remembers the user's name and topic across all turns. python import openai import json from datetime import datetime class ProductionChatbot: def init self, system prompt: str = "You are a helpful AI assistant.", model: str = "gpt-3.5-turbo", max history: int = 20, max tokens: int = 500, temperature: float = 0.7 : self.client = openai.OpenAI self.model = model self.max history = max history self.max tokens = max tokens self.temperature = temperature self.history = self.system = system prompt self.created at = datetime.now def chat self, user message: str - str: self.history.append {'role': 'user', 'content': user message} Trim history if too long if len self.history self.max history: self.history = self.history -self.max history: Build message list for API messages = {'role': 'system', 'content': self.system} + self.history Call API response = self.client.chat.completions.create model=self.model, messages=messages, max tokens=self.max tokens, temperature=self.temperature, assistant message = response.choices 0 .message.content self.history.append {'role': 'assistant', 'content': assistant message} return assistant message def save conversation self, filepath: str : data = { 'created at': self.created at.isoformat , 'saved at': datetime.now .isoformat , 'model': self.model, 'system': self.system, 'messages': self.history } with open filepath, 'w' as f: json.dump data, f, indent=2 print f"Saved {len self.history } messages to {filepath}" def load conversation self, filepath: str : with open filepath, 'r' as f: data = json.load f self.history = data 'messages' self.system = data.get 'system', self.system print f"Loaded {len self.history } messages from {filepath}" def reset self : self.history = print "Conversation reset." def get stats self - dict: n user = sum 1 for m in self.history if m 'role' == 'user' n assistant = sum 1 for m in self.history if m 'role' == 'assistant' total chars = sum len m 'content' for m in self.history return { 'turns': n user, 'total messages': len self.history , 'estimated tokens': total chars // 4, 'history depth': len self.history } Usage bot = ProductionChatbot system prompt="You are a helpful ML tutor specializing in practical examples.", model="gpt-3.5-turbo", max history=20 response = bot.chat "Explain overfitting to me." print response bot.save conversation 'session 001.json' print "ProductionChatbot ready requires OPENAI API KEY " python from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory from langchain.chains import ConversationChain from langchain community.llms import HuggingFacePipeline from transformers import pipeline as hf pipeline Create LLM gen pipe = hf pipeline 'text-generation', model='gpt2', max new tokens=100 llm = HuggingFacePipeline pipeline=gen pipe Buffer memory: keeps all messages buffer memory = ConversationBufferMemory Summary memory: automatically summarizes when too long summary memory = ConversationSummaryMemory llm=llm Build conversation chain conversation = ConversationChain llm=llm, memory=buffer memory, verbose=False Chat result = conversation.predict input="Hello, my name is Alex." print f"Bot: {result :100 }..." result = conversation.predict input="What is my name?" print f"Bot: {result :100 }..." Inspect memory print f"\nMemory buffer:\n{buffer memory.buffer}" python import json import os class PersistentChatbot: def init self, model pipeline, session id: str, storage dir: str = './chat sessions', max history: int = 50 : self.model = model pipeline self.session id = session id self.storage dir = storage dir self.max history = max history self.history = self.metadata = {} os.makedirs storage dir, exist ok=True self. load session def session path self - str: return os.path.join self.storage dir, f"{self.session id}.json" def load session self : path = self. session path if os.path.exists path : with open path, 'r' as f: data = json.load f self.history = data.get 'history', self.metadata = data.get 'metadata', {} print f"Loaded session '{self.session id}' with {len self.history } messages" else: print f"New session '{self.session id}' started" def save session self : data = { 'session id': self.session id, 'last updated': datetime.now .isoformat , 'history': self.history, 'metadata': self.metadata } with open self. session path , 'w' as f: json.dump data, f, indent=2 def chat self, user message: str - str: self.history.append {'role': 'user', 'content': user message} if len self.history self.max history: self.history = self.history -self.max history: response = self.model self. format prompt self.history.append {'role': 'assistant', 'content': response} self. save session return response def format prompt self - str: parts = for msg in self.history -10: : role = "Human" if msg 'role' == 'user' else "Assistant" parts.append f"{role}: {msg 'content' }" parts.append "Assistant:" return "\n".join parts def list sessions self - list: sessions = for f in os.listdir self.storage dir : if f.endswith '.json' : sessions.append f.replace '.json', '' return sessions Usage persistent bot = PersistentChatbot mock model, session id='user alex 001' persistent bot.chat "What's the capital of France?" persistent bot.chat "What's the population there?" print f"\nSaved sessions: {persistent bot.list sessions }" print f"History length: {len persistent bot.history } messages" checklist = { "Memory management": "Does the bot remember context from 5+ turns ago?", "Does it handle coreferences correctly? 'there', 'it', 'they' ", "Does it avoid repeating information the user already gave?" , "Context window": "Does it handle very long conversations without breaking?", "Is there a graceful fallback when history is too long?", "Are summarized messages accurate and not lossy?" , "Conversation quality": "Does it stay on topic through the conversation?", "Does it refer to earlier decisions correctly?", "Does it handle topic switches gracefully?" , "Persistence": "Does it save conversations for later use?", "Can it resume from a previous session?", "Is the storage format readable and debuggable?" , "Edge cases": "What happens if the user asks about something not in memory?", "What happens if the user contradicts themselves?", "Does it handle very short or very long user messages?" } for category, items in checklist.items : print f"\n{category}:" for item in items: print f" {item}" | Memory type | When to use | How it works | |---|---|---| | Buffer all history | Short conversations | Keep all messages, pass everything | | Sliding window | Medium conversations | Keep last N messages only | | Summary memory | Long conversations | Summarize old messages, keep recent in full | | Entity memory | User-specific facts | Extract and store named entities | | Persistent memory | Multi-session chatbots | Save/load from disk or database | | Pattern | Code | |---|---| | Add to history | history.append {'role': 'user', 'content': msg} | | Trim history | history = history -max size: | | Build messages | {'role': 'system', 'content': system} + history | | Save session | json.dump {'history': history}, f | | Load session | history = json.load f 'history' | | LangChain buffer | ConversationBufferMemory | | LangChain summary | ConversationSummaryMemory llm=llm | Level 1: Build a SlidingWindowChatbot that talks to GPT-2 locally. Have a 10-turn conversation about a topic of your choice. Print the full history at the end. Verify the bot correctly references things from earlier turns. Level 2: Implement SummaryMemoryChatbot with a real summarization call. After every 8 turns, summarize the first half using a small T5 model. Test with a 20-turn conversation. Print the summary after it triggers. Is the summary accurate? Level 3: Build PersistentChatbot that stores conversations to disk. Start a conversation, close it, restart the program, load the session, and continue the conversation. Verify the bot remembers what was said in the previous session. Add a /history command that prints a summary of previous sessions. Final post, Post 100:OpenAI API: Build With GPT-4. API setup, chat completions, function calling, streaming, and cost management. The last post in the series wraps everything together.