Building Instant Translation Assistance for Book Translations with Python and LLMs

LectuLibre, an AI-powered book translation platform, built an instant translation help feature that lets readers highlight any phrase and receive a context-aware, human-quality translation within seconds. The feature uses Server-Sent Events (SSE) with FastAPI and Claude 3 Haiku to stream translations token by token, preserving literary context by fetching the surrounding paragraphs. The team overcame prompt-engineering challenges around idioms and cultural references while keeping the time to first token under a second.

How we integrated real-time phrase translation feedback into our AI-powered book translation workflow, and what we learned about latency, context, and prompt engineering.

When we launched LectuLibre, our AI-powered book translation platform, users loved the quality of full-chapter translations. But they kept asking for something else: while reading a partially translated book, they'd stumble on an untranslated phrase or an awkward auto-translation and want a better version quickly, without leaving the page.

So we built Instant Translation Help, a feature that lets readers highlight any phrase and get a context-aware, human-quality translation within seconds, along with a brief explanation of the tricky parts. Here's how we built it, the technical challenges we faced, and the lessons we learned about stitching LLMs into a real-time reading experience.

Most web apps offer generic translation via API calls: send a sentence to Google Translate, get a result. But that doesn't work for literary texts. A phrase like "She let the cat out of the bag" needs to be translated idiomatically, and the appropriate rendering depends heavily on the surrounding paragraphs (is the tone formal? sarcastic? part of a metaphor chain?).
Our existing translation pipeline processes entire chapters in bulk with carefully crafted prompts, but for instant help we needed sub-second latency while preserving that same depth of context.

We chose Server-Sent Events (SSE) over WebSockets because the communication is one-directional (the server pushes translation tokens) and SSE is simpler to implement with FastAPI. The client (a React app) sends a POST request with four fields: `phrase`, `bookId`, `paraIndex`, and `targetLang`. Our backend retrieves the surrounding text from PostgreSQL (we store the original book in chunks), feeds a carefully assembled prompt to the LLM (Claude 3 Haiku, for speed), and streams the response back token by token.

We index each paragraph with its position. Given a highlighted phrase, we grab the paragraph containing it, plus one paragraph before and one after. This usually provides enough narrative context without blowing up the prompt size.

```python
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

# BookParagraph is the application's ORM model (defined elsewhere);
# this function relies on its book_id, index, and text columns.


async def get_context(book_id: str, para_index: int, db: AsyncSession) -> str:
    # Fetch surrounding paragraphs: the target plus one neighbor on each side
    stmt = (
        select(BookParagraph)
        .where(
            BookParagraph.book_id == book_id,
            BookParagraph.index.between(para_index - 1, para_index + 1),
        )
        .order_by(BookParagraph.index)
    )
    result = await db.execute(stmt)
    paragraphs = result.scalars().all()
    return "\n".join(p.text for p in paragraphs)
```

We needed a prompt that instructs the LLM to translate only the highlighted phrase, match the style of the surrounding text, and briefly explain any idiom, metaphor, or cultural reference. Here's the core prompt template:

```python
INSTANT_HELP_PROMPT = """
You are a literary translator. Below is the source text surrounding a highlighted phrase, the phrase itself, and the target language.

Translate the highlighted phrase into {target_lang} in a way that fits the style of the surrounding text. If the phrase contains an idiom, metaphor, or cultural reference, provide a natural equivalent and a one-sentence explanation in parentheses.

Output format:
Translation: [your translation]
Note: [explanation if needed]

Surrounding text:
{context}

Highlighted phrase: "{phrase}"

Translation:
"""
```

We found that Claude 3 Haiku respects this format almost always, and it omits the "Note" line when no explanation is needed.
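Since the response follows a fixed "Translation / Note" layout, whatever consumes the completed text can split it into two fields before rendering. The snippet below is a minimal sketch of that step, not our production parser: `InstantHelp` and `parse_instant_help` are hypothetical names, and it assumes the model may or may not repeat the "Translation:" label (the prompt already ends with it).

```python
import re
from typing import NamedTuple, Optional


class InstantHelp(NamedTuple):
    """Parsed model output (hypothetical container for illustration)."""
    translation: str
    note: Optional[str]


# A line that starts with "Note:" separates the translation from the explanation.
_NOTE_RE = re.compile(r"^\s*Note:\s*", re.MULTILINE)
# The model may or may not echo the "Translation:" label at the start.
_LABEL_RE = re.compile(r"^\s*Translation:\s*")


def parse_instant_help(raw: str) -> InstantHelp:
    # Split once, at the first line beginning with "Note:".
    parts = _NOTE_RE.split(raw, maxsplit=1)
    translation = _LABEL_RE.sub("", parts[0], count=1).strip()
    note = parts[1].strip() if len(parts) > 1 else ""
    # Treat an empty note the same as a missing one.
    return InstantHelp(translation, note or None)
```

With streaming, a parser like this would run on the accumulated text once the stream has finished; while tokens are still arriving, the raw text can be shown as-is.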
We built an async endpoint that yields SSE chunks. The client can start rendering the translation as tokens arrive, which feels instant.

```python
import json

import anthropic
from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse

router = APIRouter()

# async_session (a SQLAlchemy async session factory), get_context, and
# INSTANT_HELP_PROMPT are defined elsewhere in the application.


@router.post("/api/instant-help")
async def instant_help(request: Request):
    data = await request.json()
    phrase = data["phrase"]
    book_id = data["bookId"]
    para_index = data["paraIndex"]
    target_lang = data["targetLang"]

    async def event_generator():
        # Release the DB session before the (longer) LLM stream starts
        async with async_session() as db:
            context = await get_context(book_id, para_index, db)
        prompt = INSTANT_HELP_PROMPT.format(
            target_lang=target_lang, context=context, phrase=phrase
        )

        # Stream from Claude using the official Anthropic Python SDK
        async with anthropic.AsyncAnthropic() as client:
            stream = await client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=300,
                temperature=0.3,
                messages=[{"role": "user", "content": prompt}],
                stream=True,
            )
            async for event in stream:
                if event.type == "content_block_delta":
                    payload = json.dumps({"text": event.delta.text})
                    yield f"data: {payload}\n\n"
                elif event.type == "message_stop":
                    yield "data: [DONE]\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")
```

On the frontend, we consume these events as they stream in. One caveat: the browser's native EventSource API only issues GET requests, so a POST endpoint like this one has to be read through `fetch` with a streamed response body, or through an EventSource-style wrapper built on `fetch`. The round trip from click to first token takes about 400–600 ms for typical phrases.

Haiku is fast but not always perfect. We tried DeepSeek-V2 (slower, but better with idioms), but its latency crossed 2 seconds, killing the "instant" feel. We settled on Haiku for now, with a secondary, more detailed translation available on demand (which uses Claude 3 Opus in the background).

Each instant help call costs about $0.002 (input + output tokens). With thousands of users, that adds up. We implemented a Redis cache keyed on (`book_id`, `para_index`, `phrase`, `target_lang`).
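The cache key described above could be built along these lines. This is a sketch under stated assumptions rather than our exact implementation: the `instant-help:` prefix, the whitespace and Unicode normalization, the lowercased language code, and the helper names `make_cache_key` and `cached_translation` are all illustrative, and a plain dict stands in for Redis so the example runs without a server.

```python
import hashlib
import unicodedata
from typing import Callable, MutableMapping


def make_cache_key(book_id: str, para_index: int, phrase: str, target_lang: str) -> str:
    # Assumption: collapse whitespace and normalize Unicode so that visually
    # identical selections map to the same key. Case is preserved on purpose.
    normalized = " ".join(unicodedata.normalize("NFC", phrase).split())
    # Hash the phrase so the key stays short and free of special characters.
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:16]
    return f"instant-help:{book_id}:{para_index}:{target_lang.lower()}:{digest}"


def cached_translation(
    cache: MutableMapping[str, str],
    key: str,
    translate: Callable[[], str],
) -> str:
    # Return the cached value if present; otherwise compute and store it.
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = translate()
    cache[key] = result
    return result
```

Against a real Redis instance the dict lookups would become GET and SET calls, probably with an expiry so that stale translations age out.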
Repeated requests for the same phrase (e.g., multiple users reading the same book) are served from the cache instantly, reducing LLM calls by ~30% in our beta.

Experimentally, more context (2 paragraphs) significantly improved quality without adding too many tokens. But including an entire chapter led to slower responses and occasional off-topic interpretations. We keep the context at ~500 tokens on average.

Specifying an explicit output format ("Translation: ... Note: ...") reduced malformed responses by 90%. Small tweaks matter.

We're exploring a context window expansion that uses the entire chapter, but with aggressive summarization of the preceding paragraphs via a cheap model call. Also, fine-tuning a small open-source model on our translation style could bring costs close to zero.

If you've built similar inline AI features, how did you handle the cost/latency/quality triangle? We'd love to hear your approach in the comments.

Building LectuLibre has taught us that AI-powered tools shine when they fit seamlessly into the user's workflow. Instant translation help is that seam: a small feature that feels like magic because it respects the reader's flow.