You ask a chatbot: "What's the capital of France?"
It says: "Paris."
You ask: "What's the population there?"
It says: "Where?"
That's a stateless chatbot. Every message is treated as a completely new conversation. It has no idea what "there" refers to. It has no memory.
Real conversation doesn't work like this. Context carries forward. References accumulate. The chatbot needs to know what came before.
This post builds a chatbot with memory. One that knows what you said two messages ago, what topic you're discussing, and what decisions were made earlier.
Every time you call an LLM API, it starts fresh. It has zero memory of previous calls. The only context it has is what you put in the current prompt.
The trick that makes chatbots work: you include the entire conversation history in every prompt.
```text
Turn 1:
USER: What's the capital of France?
→ Send to LLM: "User: What's the capital of France?"
→ LLM replies: "Paris."

Turn 2:
USER: What's the population there?
→ Send to LLM:
   "User: What's the capital of France?
    Assistant: Paris.
    User: What's the population there?"
→ LLM sees full context, knows "there" = Paris

Turn 3:
→ Send EVERYTHING from turns 1, 2, and now 3
```
Every message appends to a growing list. That list goes into every subsequent prompt. The LLM can refer back to it because it's in the current context.
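Here is that loop as a minimal Python sketch. `llm` is a hypothetical stand-in for any function that takes a prompt string and returns a reply. The point is that the prompt is rebuilt from the full history on every turn:

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def render(history: List[Message]) -> str:
    # Flatten the whole conversation into one prompt string.
    lines = [f"{m['role'].title()}: {m['content']}" for m in history]
    return "\n".join(lines + ["Assistant:"])


def chat_turn(history: List[Message], user_text: str,
              llm: Callable[[str], str]) -> str:
    # 1. Append the new user message to the running history.
    history.append({"role": "user", "content": user_text})
    # 2. Send EVERYTHING so far; the model has no other memory.
    reply = llm(render(history))
    # 3. Store the reply so the next turn sees it too.
    history.append({"role": "assistant", "content": reply})
    return reply


if __name__ == "__main__":
    prompts_sent: List[str] = []

    def fake_llm(prompt: str) -> str:
        prompts_sent.append(prompt)  # record what the model actually saw
        return "Paris." if len(prompts_sent) == 1 else "About 2.1 million."

    history: List[Message] = []
    chat_turn(history, "What's the capital of France?", fake_llm)
    chat_turn(history, "What's the population there?", fake_llm)
    print(prompts_sent[1])  # the second prompt contains the first exchange
```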
Simple. But it has a hard limit: the context window.
Every LLM has a maximum number of tokens it can process at once, and it varies widely by model. GPT-3.5-turbo: 16k tokens (recent versions). GPT-4 Turbo: 128k tokens (the original GPT-4 had 8k). Llama 2 7B: 4k tokens.
A long conversation fills up that window. When the conversation exceeds the limit, you can't just include everything. You need a strategy.
```python
def estimate_tokens(text: str) -> int:
    return len(text) // 4


def estimate_conversation_tokens(messages: list) -> int:
    total = 0
    for msg in messages:
        total += estimate_tokens(msg['content'])
        total += 4  # overhead per message (role, formatting)
    return total


messages = []
example_turns = [
    ("user", "Tell me about machine learning."),
    ("assistant", "Machine learning is a field of artificial intelligence that enables computers to learn from data without being explicitly programmed. It includes supervised learning, where models are trained on labeled examples, unsupervised learning, where patterns are found without labels, and reinforcement learning, where agents learn through trial and error."),
    ("user", "What about deep learning specifically?"),
    ("assistant", "Deep learning is a subset of machine learning that uses neural networks with many layers. These networks learn hierarchical representations of data, making them especially powerful for images, audio, and text. The transformer architecture, introduced in 2017, has become the foundation for most modern deep learning systems."),
    ("user", "Can you give me examples of real applications?"),
    ("assistant", "Sure! Real applications include image classification in medical diagnosis, natural language processing for translation and chatbots, recommendation systems on Netflix and Spotify, fraud detection in banking, and autonomous driving. Deep learning powers most of these through pattern recognition at scale."),
]

print(f"{'Turn':<6} {'New tokens':<14} {'Total tokens':<14} {'% of 4k limit'}")
print("-" * 50)
for role, content in example_turns:
    messages.append({'role': role, 'content': content})
    total = estimate_conversation_tokens(messages)
    new = estimate_tokens(content)
    print(f"{len(messages):<6} {new:<14} {total:<14} {total/4000:.1%}")
```
Run as written, the 4-characters-per-token heuristic counts roughly 7 to 11 tokens for each user question and 75 to 90 for each assistant answer. After all six messages the running total is just under 300 tokens, about 7% of a 4k window.
A long conversation about a complex topic can easily hit 2,000 to 3,000 tokens. Add RAG context and a system prompt, and on a 4k-token model you're at the limit fast.
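To make "at the limit fast" concrete, here is a back-of-the-envelope budget, assuming you reserve room for the reply and count the system prompt and retrieved (RAG) context separately. The numbers in the example call are illustrative, not measurements:

```python
def history_budget(context_window: int, system_tokens: int,
                   rag_tokens: int, max_output_tokens: int) -> int:
    """Tokens left for conversation history after the fixed costs.

    The reply must fit in the same window as the prompt, so
    max_output_tokens is reserved up front.
    """
    remaining = context_window - system_tokens - rag_tokens - max_output_tokens
    # A negative budget means even an empty history won't fit.
    return max(remaining, 0)


# Illustrative: 4k-token model, 200-token system prompt,
# 1,500 tokens of retrieved context, 500-token reply.
print(history_budget(4096, 200, 1500, 500))  # 1896
```

Whatever is left after those fixed costs is all the room the conversation itself gets.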
The simplest strategy is a sliding window: keep only the last N messages and drop everything older. Simple and effective.
```python
from collections import deque
from typing import List


class SlidingWindowChatbot:
    def __init__(self, model_pipeline, window_size: int = 10,
                 system_prompt: str = "You are a helpful assistant."):
        self.model = model_pipeline       # any callable: prompt str -> reply str
        self.window_size = window_size    # max messages to keep
        self.system_prompt = system_prompt
        # deque(maxlen=...) silently drops the oldest message when full
        self.history = deque(maxlen=window_size)

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})
        # The system prompt lives outside the window, so it is never evicted
        messages = [
            {'role': 'system', 'content': self.system_prompt}
        ] + list(self.history)
        prompt = self._format_prompt(messages)
        response = self.model(prompt)
        self.history.append({'role': 'assistant', 'content': response})
        return response

    def _format_prompt(self, messages: List[dict]) -> str:
        formatted = ""
        for msg in messages:
            if msg['role'] == 'system':
                formatted += f"System: {msg['content']}\n\n"
            elif msg['role'] == 'user':
                formatted += f"Human: {msg['content']}\n"
            else:
                formatted += f"Assistant: {msg['content']}\n"
        formatted += "Assistant:"  # cue the model to answer next
        return formatted

    def get_history(self) -> list:
        return list(self.history)

    def clear(self):
        self.history.clear()
        print("Conversation history cleared.")


def mock_model(prompt: str) -> str:
    # Stand-in for a real LLM: answers the *latest* Human turn, but may use
    # earlier turns in the prompt as context (e.g. to know "there" = Paris).
    text = prompt.lower()
    last_user = text.rsplit("human:", 1)[-1]
    if "capital of france" in last_user:
        return "The capital of France is Paris."
    elif "population" in last_user and "paris" in text:
        return "Paris has a population of approximately 2.1 million in the city proper, and about 12 million in the greater metropolitan area."
    elif "famous landmark" in last_user:
        return "Paris is famous for the Eiffel Tower, the Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe."
    elif "eiffel tower" in last_user:
        return "The Eiffel Tower was built between 1887 and 1889, designed by engineer Gustave Eiffel. It stands 330 meters tall."
    else:
        return "I understand. Could you tell me more?"


bot = SlidingWindowChatbot(mock_model, window_size=6)
turns = [
    "What's the capital of France?",
    "What's the population there?",
    "What are some famous landmarks in that city?",
    "Tell me more about the Eiffel Tower.",
    "When was it built?",
]
for user_input in turns:
    print(f"\nUser: {user_input}")
    response = bot.chat(user_input)
    print(f"Bot: {response}")

print(f"\nHistory has {len(bot.get_history())} messages (max {bot.window_size})")
```
Output:

```text
User: What's the capital of France?
Bot: The capital of France is Paris.

User: What's the population there?
Bot: Paris has a population of approximately 2.1 million in the city proper...

User: What are some famous landmarks in that city?
Bot: Paris is famous for the Eiffel Tower, the Louvre Museum...

User: Tell me more about the Eiffel Tower.
Bot: The Eiffel Tower was built between 1887 and 1889...

User: When was it built?
Bot: I understand. Could you tell me more?

History has 6 messages (max 6)
```
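The mock only matches keywords, but the prompt it receives carries the context: the population answer works because "Paris" from the first exchange is still inside the window, which is exactly what a real model relies on to resolve "there" and "that city". The last turn falls through to the generic reply only because the mock has no rule for "it"; a real LLM would see the Eiffel Tower exchange in the window and answer. The sliding window keeps the last 6 messages, so by the end the first question has already been dropped.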
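The window above counts messages, so a single very long message can still overflow the context. A token-budget variant keeps as many of the newest messages as fit. This is a sketch that reuses the rough 4-characters-per-token heuristic from earlier plus about 4 tokens of per-message overhead (both approximations); the helper names are illustrative:

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_message_tokens(text: str) -> int:
    # ~4 characters per token, plus ~4 tokens for role and formatting.
    return len(text) // 4 + 4


def trim_to_budget(history: List[Message], budget: int) -> List[Message]:
    """Return the longest suffix of `history` whose estimated size fits `budget`.

    Walks backwards from the newest message. The newest message is always
    kept, even if it alone exceeds the budget.
    """
    kept: List[Message] = []
    used = 0
    for msg in reversed(history):
        cost = estimate_message_tokens(msg["content"])
        if kept and used + cost > budget:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "x" * 400},      # ~104 tokens
    {"role": "assistant", "content": "y" * 40},  # ~14 tokens
    {"role": "user", "content": "z" * 40},       # ~14 tokens
]
print([m["content"][0] for m in trim_to_budget(history, 50)])  # ['y', 'z']
```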
Summary memory handles longer conversations: when history gets long, summarize the old messages and keep the recent ones in full.
```python
class SummaryMemoryChatbot:
    def __init__(self, model_pipeline, summarizer_pipeline=None,
                 summary_threshold: int = 10,
                 system_prompt: str = "You are a helpful assistant."):
        self.model = model_pipeline
        # Any callable text -> summary string. A Hugging Face summarization
        # pipeline returns [{'summary_text': ...}], so wrap it, e.g.
        # lambda t: pipe(t)[0]['summary_text']. None = canned placeholder.
        self.summarizer = summarizer_pipeline
        self.threshold = summary_threshold
        self.system = system_prompt
        self.history = []   # messages not yet folded into the summary
        self.summary = ""   # compressed memory of older turns

    def _maybe_summarize(self):
        if len(self.history) < self.threshold:
            return
        # Fold the older half of the history into the running summary
        n_to_summarize = len(self.history) // 2
        old_messages = self.history[:n_to_summarize]
        self.history = self.history[n_to_summarize:]
        old_text = "\n".join(
            f"{m['role'].title()}: {m['content']}" for m in old_messages
        )
        # Feed the previous summary back in so older facts survive
        new_summary_input = f"{self.summary}\n\n{old_text}" if self.summary else old_text
        self.summary = self._summarize(new_summary_input)
        print(f"[Memory] Summarized {n_to_summarize} messages into summary")

    def _summarize(self, text: str) -> str:
        if self.summarizer is not None:
            return self.summarizer(text)
        # Placeholder so the demo runs without a second model
        return ("The user asked about France, Paris, its population (~2.1M), "
                "and Paris landmarks including the Eiffel Tower.")

    def _format_prompt(self) -> str:
        parts = [f"System: {self.system}\n"]
        if self.summary:
            parts.append(f"[Earlier conversation summary]: {self.summary}\n")
        # Send every unsummarized message. _maybe_summarize keeps this list
        # below the threshold, so no message falls between summary and prompt.
        for msg in self.history:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")
        parts.append("Assistant:")
        return "\n".join(parts)

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})
        self._maybe_summarize()
        prompt = self._format_prompt()
        response = self.model(prompt)
        self.history.append({'role': 'assistant', 'content': response})
        return response

    def memory_status(self):
        print(f"Summary: {'yes' if self.summary else 'none'}")
        print(f"Messages kept in full: {len(self.history)}")


summary_bot = SummaryMemoryChatbot(mock_model, summarizer_pipeline=None, summary_threshold=8)
for user_input in turns * 2:  # repeat the five questions to trigger summarization
    summary_bot.chat(user_input)
summary_bot.memory_status()
```
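A real summarizer has to merge new messages into the existing summary rather than start over each time. With a general-purpose LLM as the summarizer, one common approach is to ask it to *update* the running summary. Here is a sketch of such a prompt builder; the function name and the instruction wording are assumptions you would tune for your model:

```python
from typing import Dict, List


def build_summary_prompt(previous_summary: str,
                         new_messages: List[Dict[str, str]],
                         max_words: int = 120) -> str:
    """Build an instruction asking an LLM to fold new turns into a running summary."""
    transcript = "\n".join(
        f"{m['role'].title()}: {m['content']}" for m in new_messages
    )
    return (
        f"Update the summary of a conversation in at most {max_words} words.\n"
        "Keep names, numbers, and decisions; drop small talk.\n\n"
        f"Current summary:\n{previous_summary or '(none yet)'}\n\n"
        f"New messages:\n{transcript}\n\n"
        "Updated summary:"
    )


prompt = build_summary_prompt(
    "The user asked for the capital of France (Paris).",
    [{"role": "user", "content": "What's the population there?"},
     {"role": "assistant", "content": "About 2.1 million in the city proper."}],
)
print(prompt)
```

Passing the previous summary back in is what keeps facts from the first turns alive after several rounds of compression, though each round can still lose detail.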
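A third strategy is entity memory: extract specific facts about the user or the things being discussed, and store them outside the message history so they never scroll out of the window.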
```python
import re
from typing import Dict


class EntityMemoryChatbot:
    def __init__(self, model_pipeline,
                 system_prompt: str = "You are a helpful assistant."):
        self.model = model_pipeline
        self.system = system_prompt
        self.history = []
        self.entities: Dict[str, str] = {}  # entity store: type -> latest value

    def _extract_entities(self, message: str):
        # Deliberately naive regexes. With IGNORECASE they also misfire,
        # e.g. "I'm studying" matches the name pattern.
        patterns = {
            'name': r"(?:my name is|I am|I'm)\s+([A-Z][a-z]+)",
            'location': r"(?:I live in|I'm from|I'm in)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)?)",
            'job': r"(?:I am a|I work as a|I'm a)\s+([a-z]+(?:\s+[a-z]+)?)",
            'topic': r"(?:I want to learn about|I'm studying|I need help with)\s+([a-z\s]+)"
        }
        for entity_type, pattern in patterns.items():
            match = re.search(pattern, message, re.IGNORECASE)
            if match:
                self.entities[entity_type] = match.group(1).strip()

    def _build_entity_context(self) -> str:
        if not self.entities:
            return ""
        lines = ["Known facts about the user:"]
        for entity, value in self.entities.items():
            lines.append(f"  - {entity}: {value}")
        return "\n".join(lines)

    def _format_prompt(self) -> str:
        parts = [f"System: {self.system}"]
        entity_ctx = self._build_entity_context()
        if entity_ctx:
            parts.append(entity_ctx)  # facts survive even when history is trimmed
        for msg in self.history[-8:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")
        parts.append("Assistant:")
        return "\n".join(parts)

    def chat(self, user_message: str) -> str:
        self._extract_entities(user_message)
        self.history.append({'role': 'user', 'content': user_message})
        prompt = self._format_prompt()
        response = self.model(prompt)
        self.history.append({'role': 'assistant', 'content': response})
        return response


def entity_mock_model(prompt: str) -> str:
    # React to the latest Human turn; look up facts anywhere in the prompt.
    last_user = prompt.rsplit("Human:", 1)[-1].lower()
    if "my name is" in last_user and "Alex" in prompt:
        return "Nice to meet you, Alex!"
    elif ("recommend" in last_user and "Alex" in prompt
          and "machine learning" in prompt.lower()):
        return "Based on your interest in machine learning, Alex, I'd recommend starting with Python and scikit-learn."
    elif "course" in last_user:
        return "For machine learning, the Andrew Ng Coursera course is excellent for beginners."
    else:
        return "Tell me more about what you'd like to learn."


entity_bot = EntityMemoryChatbot(entity_mock_model)
conversations = [
    "Hi, my name is Alex.",
    "I want to learn about machine learning.",
    "Can you recommend something?",
    "Are there any courses?",
]
for user_input in conversations:
    print(f"\nUser: {user_input}")
    response = entity_bot.chat(user_input)
    print(f"Bot: {response}")

print(f"\nExtracted entities: {entity_bot.entities}")
```
Output:

```text
User: Hi, my name is Alex.
Bot: Nice to meet you, Alex!

User: I want to learn about machine learning.
Bot: Tell me more about what you'd like to learn.

User: Can you recommend something?
Bot: Based on your interest in machine learning, Alex, I'd recommend starting with Python and scikit-learn.

User: Are there any courses?
Bot: For machine learning, the Andrew Ng Coursera course is excellent for beginners.

Extracted entities: {'name': 'Alex', 'topic': 'machine learning'}
```
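The bot remembers the user's name and topic across all turns. Because the entity store is injected into every prompt, those facts stay available even after the messages that mentioned them scroll out of the 8-message window.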
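The regexes above are brittle. A common alternative is to ask the model itself to return the extracted facts as JSON and merge them into the store. Below is a sketch of just the merge step, assuming the model was instructed to reply with a flat JSON object (`merge_entities` is a hypothetical helper). Anything that does not parse is ignored rather than trusted:

```python
import json
from typing import Dict


def merge_entities(store: Dict[str, str], llm_output: str) -> Dict[str, str]:
    """Merge a JSON object of extracted facts into the entity store.

    Newer values overwrite older ones for the same key. Malformed output,
    non-object JSON, and non-string or empty values are skipped.
    """
    try:
        facts = json.loads(llm_output)
    except json.JSONDecodeError:
        return store  # model ignored the format; keep what we had
    if not isinstance(facts, dict):
        return store
    for key, value in facts.items():
        if isinstance(value, str) and value.strip():
            store[str(key).lower()] = value.strip()
    return store


store = {"name": "Alex"}
merge_entities(store, '{"topic": "machine learning", "job": ""}')
merge_entities(store, "Sure! Here are the facts...")  # not JSON: ignored
print(store)  # {'name': 'Alex', 'topic': 'machine learning'}
```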
```python
import json
from datetime import datetime

import openai  # requires openai>=1.0 and OPENAI_API_KEY in the environment


class ProductionChatbot:
    def __init__(
        self,
        system_prompt: str = "You are a helpful AI assistant.",
        model: str = "gpt-3.5-turbo",
        max_history: int = 20,
        max_tokens: int = 500,
        temperature: float = 0.7
    ):
        self.client = openai.OpenAI()  # reads OPENAI_API_KEY
        self.model = model
        self.max_history = max_history
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.history = []
        self.system = system_prompt
        self.created_at = datetime.now()

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})
        # Sliding window on message count; the system prompt is added separately
        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]

        messages = [
            {'role': 'system', 'content': self.system}
        ] + self.history

        response = self.client.chat.completions.create(
            model=self.model,
            messages=messages,
            max_tokens=self.max_tokens,
            temperature=self.temperature,
        )
        assistant_message = response.choices[0].message.content
        self.history.append({'role': 'assistant', 'content': assistant_message})
        return assistant_message

    def save_conversation(self, filepath: str):
        data = {
            'created_at': self.created_at.isoformat(),
            'saved_at': datetime.now().isoformat(),
            'model': self.model,
            'system': self.system,
            'messages': self.history
        }
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)
        print(f"Saved {len(self.history)} messages to {filepath}")

    def load_conversation(self, filepath: str):
        with open(filepath, 'r') as f:
            data = json.load(f)
        self.history = data['messages']
        self.system = data.get('system', self.system)
        print(f"Loaded {len(self.history)} messages from {filepath}")

    def reset(self):
        self.history = []
        print("Conversation reset.")

    def get_stats(self) -> dict:
        n_user = sum(1 for m in self.history if m['role'] == 'user')
        total_chars = sum(len(m['content']) for m in self.history)
        return {
            'turns': n_user,
            'total_messages': len(self.history),
            'estimated_tokens': total_chars // 4,  # same ~4 chars/token heuristic
            'history_depth': len(self.history)
        }


print("ProductionChatbot ready (requires OPENAI_API_KEY)")
```
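`self.history[-self.max_history:]` runs right after a user message is appended, so with an even `max_history` the window opens on an assistant reply whose question has been cut off. If you prefer to drop whole exchanges, a sketch like this trims from the front until the history starts with a user message (`trim_whole_turns` is a hypothetical helper, and this is a design choice rather than something the API requires):

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_whole_turns(history: List[Message], max_messages: int) -> List[Message]:
    """Keep at most `max_messages` recent messages, starting on a user message.

    Slices to the limit first, then drops leading non-user messages so the
    window never opens on an orphaned assistant reply.
    """
    # Guard: history[-0:] would return the whole list, not an empty one.
    trimmed = history[-max_messages:] if max_messages > 0 else []
    start = 0
    while start < len(trimmed) and trimmed[start]["role"] != "user":
        start += 1
    return trimmed[start:]


# Five messages ending on a new user question, limited to four.
msgs = [{"role": r, "content": str(i)} for i, r in
        enumerate(["user", "assistant", "user", "assistant", "user"])]
print([m["content"] for m in trim_whole_turns(msgs, 4)])  # ['2', '3', '4']
```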
```python
# Note: ConversationChain and ConversationBufferMemory are deprecated in recent
# LangChain releases; this snippet uses the older API.
from langchain.memory import ConversationBufferMemory, ConversationSummaryMemory
from langchain.chains import ConversationChain
from langchain_community.llms import HuggingFacePipeline
from transformers import pipeline as hf_pipeline

# GPT-2 is a small base model, not a chat model, so replies will be rough;
# the point here is what ends up in the memory buffer.
gen_pipe = hf_pipeline('text-generation', model='gpt2', max_new_tokens=100)
llm = HuggingFacePipeline(pipeline=gen_pipe)

buffer_memory = ConversationBufferMemory()  # stores every turn verbatim
conversation = ConversationChain(
    llm=llm,
    memory=buffer_memory,
    verbose=False
)

result = conversation.predict(input="Hello, my name is Alex.")
print(f"Bot: {result[:100]}...")
result = conversation.predict(input="What is my name?")
print(f"Bot: {result[:100]}...")

print(f"\nMemory buffer:\n{buffer_memory.buffer}")
```
```python
import json
import os
from datetime import datetime


class PersistentChatbot:
    def __init__(self, model_pipeline, session_id: str,
                 storage_dir: str = './chat_sessions',
                 max_history: int = 50):
        self.model = model_pipeline
        self.session_id = session_id
        self.storage_dir = storage_dir
        self.max_history = max_history
        self.history = []
        self.metadata = {}
        os.makedirs(storage_dir, exist_ok=True)
        self._load_session()  # resume automatically if this session exists

    def _session_path(self) -> str:
        return os.path.join(self.storage_dir, f"{self.session_id}.json")

    def _load_session(self):
        path = self._session_path()
        if os.path.exists(path):
            with open(path, 'r') as f:
                data = json.load(f)
            self.history = data.get('history', [])
            self.metadata = data.get('metadata', {})
            print(f"Loaded session '{self.session_id}' with {len(self.history)} messages")
        else:
            print(f"New session '{self.session_id}' started")

    def _save_session(self):
        data = {
            'session_id': self.session_id,
            'last_updated': datetime.now().isoformat(),
            'history': self.history,
            'metadata': self.metadata
        }
        with open(self._session_path(), 'w') as f:
            json.dump(data, f, indent=2)

    def chat(self, user_message: str) -> str:
        self.history.append({'role': 'user', 'content': user_message})
        if len(self.history) > self.max_history:
            self.history = self.history[-self.max_history:]
        response = self.model(self._format_prompt())
        self.history.append({'role': 'assistant', 'content': response})
        self._save_session()  # persist after every turn
        return response

    def _format_prompt(self) -> str:
        # Store up to max_history messages on disk, but send only the last 10
        parts = []
        for msg in self.history[-10:]:
            role = "Human" if msg['role'] == 'user' else "Assistant"
            parts.append(f"{role}: {msg['content']}")
        parts.append("Assistant:")
        return "\n".join(parts)

    def list_sessions(self) -> list:
        # Every *.json file in the storage directory is one session
        return [f[:-len('.json')] for f in os.listdir(self.storage_dir)
                if f.endswith('.json')]


# mock_model is the keyword-matching stand-in from the sliding-window example
persistent_bot = PersistentChatbot(mock_model, session_id='user_alex_001')
persistent_bot.chat("What's the capital of France?")
persistent_bot.chat("What's the population there?")

print(f"\nSaved sessions: {persistent_bot.list_sessions()}")
print(f"History length: {len(persistent_bot.history)} messages")
```
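`_save_session` rewrites the file on every turn, so a crash halfway through `json.dump` can leave a truncated, unreadable session. One common safeguard, sketched here as a standalone helper (the name `save_json_atomically` is hypothetical), is to write to a temporary file in the same directory and swap it in with `os.replace`, which POSIX guarantees is atomic when source and target are on the same filesystem:

```python
import json
import os
import tempfile
from typing import Any


def save_json_atomically(data: Any, path: str) -> None:
    """Write JSON to `path` so readers see either the old file or the new one."""
    # Same directory as the target, so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    # ".tmp" suffix: list_sessions() only picks up *.json files.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # make sure bytes hit disk before the swap
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)  # don't leave stray temp files behind
        raise


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "user_alex_001.json")
    save_json_atomically({"history": [{"role": "user", "content": "hi"}]}, target)
    with open(target) as f:
        print(json.load(f)["history"][0]["content"])  # hi
```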
```python
checklist = {
    "Memory management": [
        "Does the bot remember context from 5+ turns ago?",
        "Does it handle coreferences correctly? ('there', 'it', 'they')",
        "Does it avoid repeating information the user already gave?"
    ],
    "Context window": [
        "Does it handle very long conversations without breaking?",
        "Is there a graceful fallback when history is too long?",
        "Do summaries keep the facts that matter (names, numbers, decisions)?"
    ],
    "Conversation quality": [
        "Does it stay on topic through the conversation?",
        "Does it refer to earlier decisions correctly?",
        "Does it handle topic switches gracefully?"
    ],
    "Persistence": [
        "Does it save conversations for later use?",
        "Can it resume from a previous session?",
        "Is the storage format readable and debuggable?"
    ],
    "Edge cases": [
        "What happens if the user asks about something not in memory?",
        "What happens if the user contradicts themselves?",
        "Does it handle very short or very long user messages?"
    ]
}

for category, items in checklist.items():
    print(f"\n{category}:")
    for item in items:
        print(f"  [ ] {item}")
```
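Some of these checks can be automated without calling a real model. The trick is a fake model that records every prompt it receives, so you can assert on what the bot actually *sent*. `TinyWindowBot` below is a hypothetical minimal stand-in for any of the chatbots above that take a `model(prompt) -> str` callable:

```python
from typing import Callable, List, Tuple


class RecordingModel:
    """Fake LLM: remembers every prompt and returns a canned reply."""

    def __init__(self) -> None:
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return f"reply {len(self.prompts)}"


class TinyWindowBot:
    """Hypothetical minimal bot: keeps the last `window` messages."""

    def __init__(self, model: Callable[[str], str], window: int) -> None:
        self.model = model
        self.window = window
        self.history: List[Tuple[str, str]] = []

    def chat(self, text: str) -> str:
        self.history.append(("Human", text))
        self.history = self.history[-self.window:]
        prompt = "\n".join(f"{r}: {t}" for r, t in self.history) + "\nAssistant:"
        reply = self.model(prompt)
        self.history.append(("Assistant", reply))
        return reply


def remembers(bot_factory, fact: str, filler_turns: int) -> bool:
    """Is `fact`, said on turn 1, still in the prompt after N more turns?"""
    model = RecordingModel()
    bot = bot_factory(model)
    bot.chat(fact)
    for i in range(filler_turns):
        bot.chat(f"filler question {i}")
    return fact in model.prompts[-1]


print(remembers(lambda m: TinyWindowBot(m, window=20), "My name is Alex.", 5))  # True
print(remembers(lambda m: TinyWindowBot(m, window=6), "My name is Alex.", 5))   # False
```

The second check fails on purpose: a 6-message window cannot satisfy "remembers context from 5+ turns ago".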
| Memory type | When to use | How it works |
|---|---|---|
| Buffer (all history) | Short conversations | Keep all messages, pass everything |
| Sliding window | Medium conversations | Keep last N messages only |
| Summary memory | Long conversations | Summarize old messages, keep recent in full |
| Entity memory | User-specific facts | Extract and store named entities |
| Persistent memory | Multi-session chatbots | Save/load from disk or database |
| Pattern | Code |
|---|---|
| Add to history | `history.append({'role': 'user', 'content': msg})` |
| Trim history | `history = history[-max_size:]` |
| Build messages | `[{'role': 'system', 'content': system}] + history` |
| Save session | `json.dump({'history': history}, f)` |
| Load session | `history = json.load(f)['history']` |
| LangChain buffer | `ConversationBufferMemory()` |
| LangChain summary | `ConversationSummaryMemory(llm=llm)` |
Level 1: Build a `SlidingWindowChatbot` that talks to GPT-2 locally. Have a 10-turn conversation about a topic of your choice. Print the full history at the end. Verify the bot correctly references things from earlier turns.

Level 2: Implement `SummaryMemoryChatbot` with a real summarization call. After every 8 turns, summarize the first half using a small T5 model. Test with a 20-turn conversation. Print the summary after it triggers. Is the summary accurate?

Level 3: Build a `PersistentChatbot` that stores conversations to disk. Start a conversation, close it, restart the program, load the session, and continue the conversation. Verify the bot remembers what was said in the previous session. Add a `/history` command that prints a summary of previous sessions.
Final post, Post 100: OpenAI API: Build With GPT-4. It covers API setup, chat completions, function calling, streaming, and cost management, and wraps the whole series together.