How I Fixed My AI Chatbot's Timeout Nightmare

A developer spent three weeks debugging an AI chatbot that kept timing out in production, with 15% of requests failing due to slow API responses. The solution involved implementing streaming responses and a retry mechanism that could resume from the last received token, dramatically improving user experience.

I spent three weeks debugging an AI chatbot that kept timing out. It wasn't the API itself; it was how I was calling it. Here's what I learned.

Last quarter, I was building a customer support chatbot for a SaaS product. The idea was simple: users ask questions, an AI model returns natural language answers. We picked an AI API that seemed solid: decent latency, good accuracy. But in production, everything fell apart.

Users would type a question, wait... and wait... then get a 504 Gateway Timeout. Our logs showed that about 15% of requests were failing because the API response took longer than our 30-second timeout. Even when it worked, the answer arrived in one big chunk after 10-20 seconds. Users started leaving the chat mid-response.

This wasn't a theoretical problem. It was happening to real people, and my boss was not happy.

My first instinct was to crank up the timeout. I set it to 60 seconds. That just meant failures took longer. Users hated it more.

Next, I tried synchronous retries with exponential backoff. That made things worse: if the first attempt timed out, the retry also often timed out, and the whole request could take minutes. Plus, our server couldn't handle the backlog of pending requests. It started queueing, and memory usage spiked.

I considered switching to a different model, but our product was already tied to this API's unique fine-tuning. We were stuck.

I even tried polling: send the request, get a task ID, poll every second for the result. But the API didn't support async tasks. It required a single open connection.

At this point, I was ready to roll back to a simple FAQ lookup. Then I remembered a colleague mentioning "streaming" at a meetup. I hadn't paid attention, but now it sounded like a lifeline.

The breakthrough came when I realized the API supported streaming responses: the model could send back partial tokens as it generated them. Instead of waiting for the full answer, I could start displaying text to the user immediately.
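To make "partial tokens" concrete, here is a minimal sketch of pulling tokens out of a server-sent events (SSE) stream. The `data:` field, blank-line event separator, and `:` comment lines come from the SSE format itself; the JSON shape with a `token` key and the `[DONE]` end marker are assumptions for illustration, so check what your API actually sends. The `iter_sse_tokens` helper is hypothetical, not part of any library.

```python
import json
from typing import Iterable, Iterator


def iter_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "token" field of each SSE event found in a stream of text lines.

    Assumed payloads (hypothetical): data: {"token": "Hel"} ... data: [DONE]
    """
    data_parts = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                payload = "\n".join(data_parts)
                data_parts = []
                if payload == "[DONE]":  # assumed end-of-stream sentinel
                    return
                yield json.loads(payload).get("token", "")
            continue
        if line.startswith(":"):
            continue  # SSE comment line, often used as a keep-alive
        if line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips one optional space after the colon.
            data_parts.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) are ignored in this sketch.


sample = [
    'data: {"token": "Hel"}\n',
    "\n",
    ": keep-alive\n",
    'data: {"token": "lo"}\n',
    "\n",
    "data: [DONE]\n",
    "\n",
]
print("".join(iter_sse_tokens(sample)))  # prints: Hello
```

Because the parser works on plain lines, it can be tested without a network connection and then fed from whatever HTTP client delivers the stream.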
Streaming solved two problems: the long silent wait before any text appeared, and the timeouts that came with holding a request open until the full answer was ready.

But streaming alone wasn't enough. The connection would sometimes drop mid-stream. I needed a robust retry mechanism that could resume from the last received token.

Here's the approach I settled on, using aiohttp in Python: track the position of the last token received, retry a dropped connection up to three times with exponential backoff, and ask the API to resume from that position.

Not all APIs support resumption, but many do. If not, you can just restart the request: the user already saw some text, so the experience is still better than a timeout.

Here's a simplified version of what I wrote. It's async Python using aiohttp and asyncio.

```python
import asyncio
import json
from typing import AsyncIterator

import aiohttp


class AIStreamClient:
    def __init__(self, api_url: str, api_key: str):
        self.api_url = api_url
        self.api_key = api_key
        # Create the client inside a running event loop (see main() below).
        self.session = aiohttp.ClientSession()

    async def stream_completion(self, prompt: str) -> AsyncIterator[str]:
        """Stream tokens from the AI API with retry logic."""
        max_retries = 3
        base_delay = 0.1  # 100ms
        last_position = 0

        for attempt in range(max_retries):
            try:
                headers = {
                    "Authorization": f"Bearer {self.api_key}",
                    "Accept": "text/event-stream",
                }
                payload = {
                    "prompt": prompt,
                    "stream": True,
                    "resume_from": last_position,  # only if the API supports it
                }
                async with self.session.post(
                    self.api_url,
                    json=payload,
                    headers=headers,
                    timeout=aiohttp.ClientTimeout(total=30),
                ) as response:
                    # Treat HTTP errors (such as a 504) as retryable failures.
                    response.raise_for_status()
                    async for chunk in response.content:
                        text = chunk.decode("utf-8").strip()
                        if not text:
                            continue  # skip blank separator lines
                        # Assume each chunk is a JSON with "token" and "position".
                        # In reality, you'd parse SSE format.
                        data = json.loads(text)
                        token = data.get("token", "")
                        position = data.get("position", last_position)
                        # Only emit tokens we have not already shown.
                        if position > last_position:
                            yield token
                            last_position = position
                # The stream finished cleanly, so do not retry.
                return
            except (aiohttp.ClientError, asyncio.TimeoutError) as e:
                print(f"Stream error on attempt {attempt + 1}: {e}")
                if attempt == max_retries - 1:
                    raise
                # Exponential backoff: 0.1s, 0.2s, ... capped at 5s.
                delay = base_delay * 2 ** attempt
                await asyncio.sleep(min(delay, 5))

    async def close(self):
        await self.session.close()
```

How to use it:

```python
async def main():
    client = AIStreamClient(
        api_url="https://ai.interwestinfo.com/v1/completions",  # example API
        api_key="sk-...",
    )
    async for token in client.stream_completion("Explain quantum computing"):
        print(token, end="", flush=True)
    await client.close()


asyncio.run(main())
```

This is a proof-of-concept. In production, you'd handle partial tokens more carefully, parse SSE properly, and add backpressure if the user is typing new input while streaming.

Streaming isn't a silver bullet. Here's what I discovered:

When NOT to use streaming:

Hindsight is 20/20. If I could start over: httpx with built-in streaming and retries. I should have started there.

After deploying the streaming version, timeout errors dropped from 15% to less than 0.5%. User satisfaction scores went up, and I stopped getting paged at 2 AM. The code is now used across three microservices.

But I'm still paranoid. Every AI API is different, and production has a way of surprising you.

What's your setup look like? How do you handle unreliable AI responses? I'd love to hear what's worked or failed for you.
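A footnote on the caveat about new input arriving mid-stream: one simple option is to cancel the stale stream and start over. Strictly speaking this is cancellation rather than backpressure, and the sketch below is only an illustration. `fake_stream`, `render`, and `demo` are hypothetical stand-ins, with a timed generator in place of a real API call.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_stream(prompt: str) -> AsyncIterator[str]:
    """Hypothetical stand-in for a real token stream."""
    for word in f"answer to {prompt}".split():
        await asyncio.sleep(0.01)  # simulate generation latency
        yield word


async def render(prompt: str, shown: List[str]) -> None:
    """Append tokens to `shown` as they arrive, like a chat window would."""
    async for token in fake_stream(prompt):
        shown.append(token)


async def demo() -> List[str]:
    shown: List[str] = []
    first = asyncio.create_task(render("first question", shown))
    await asyncio.sleep(0.015)  # the user sends a new message mid-stream
    first.cancel()              # stop consuming the stale stream
    try:
        await first
    except asyncio.CancelledError:
        pass                    # expected: we cancelled it ourselves
    shown.clear()               # discard the stale partial answer
    await render("second question", shown)
    return shown


result = asyncio.run(demo())
print(" ".join(result))  # prints: answer to second question
```

With a real client, cancelling the task also closes the underlying HTTP response as the `async with` block unwinds, so the server stops being asked for tokens nobody will read.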