How we integrated real-time phrase translation feedback into our AI-powered book translation workflow, and what we learned about latency, context, and prompt engineering.
When we launched LectuLibre, our AI-powered book translation platform, users loved the quality of full-chapter translations. But they kept asking for something else: while reading a partially translated book, they'd stumble on an untranslated phrase or an awkward auto-translation and want to quickly get a better version without leaving the page. So we built 即时翻译求助 (Instant Translation Help)—a feature that lets readers highlight any phrase and get a context-aware, human-quality translation within seconds, along with a brief explanation of tricky parts.
Here's how we built it, the technical challenges we faced, and the lessons we learned about stitching LLMs into a real-time reading experience.
Most web apps offer generic translation via API calls—send a sentence to Google Translate, get a result. But that doesn't work for literary texts. A phrase like "She let the cat out of the bag" needs to be translated idiomatically, and the appropriate rendering depends heavily on the surrounding paragraphs (is the tone formal? sarcastic? part of a metaphor chain?). Our existing translation pipeline processes entire chapters in bulk with carefully crafted prompts, but for instant help, we needed sub-second latency while preserving that same depth of context.
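We chose Server-Sent Events (SSE) over WebSockets because the communication is one-directional (the server pushes translation tokens) and SSE is simpler to implement with FastAPI. The client (a React app) sends a POST request with the highlighted phrase, the book ID, the index of the paragraph containing the phrase, and the target language.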
Our backend retrieves the surrounding text from PostgreSQL (we store the original book in chunks), feeds a carefully assembled prompt to the LLM (Claude 3 Haiku for speed), and streams the response back token-by-token.
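Before any database or LLM work happens, the POST body can be checked cheaply. The following is a minimal sketch, assuming the four JSON keys the endpoint reads (phrase, bookId, paraIndex, targetLang); the names InstantHelpRequest and parse_request and the 500-character limit are hypothetical and only there for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstantHelpRequest:
    """Hypothetical typed view of the POST body."""
    phrase: str
    book_id: str
    para_index: int
    target_lang: str


def parse_request(body: dict, max_phrase_chars: int = 500) -> InstantHelpRequest:
    """Validate the raw JSON body; raise ValueError on bad input.

    max_phrase_chars is an assumed limit, not a value from our production setup.
    """
    required = ("phrase", "bookId", "paraIndex", "targetLang")
    missing = [key for key in required if key not in body]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")

    phrase = str(body["phrase"]).strip()
    if not phrase or len(phrase) > max_phrase_chars:
        raise ValueError("phrase must be non-empty and reasonably short")

    para_index = body["paraIndex"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(para_index, bool) or not isinstance(para_index, int) or para_index < 0:
        raise ValueError("paraIndex must be a non-negative integer")

    return InstantHelpRequest(phrase, str(body["bookId"]), para_index, str(body["targetLang"]))
```

Rejecting malformed requests up front keeps bad input from ever reaching the paid LLM call.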
We index each paragraph with its position. Given a highlighted phrase, we grab the paragraph containing it, plus one paragraph before and after. This usually provides enough narrative context without blowing up the prompt size.
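from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

# BookParagraph is the ORM model for stored paragraphs (defined elsewhere).
async def get_context(book_id: str, para_index: int, db: AsyncSession) -> str:
    # Fetch the highlighted paragraph plus its immediate neighbours, in reading order.
    stmt = (
        select(BookParagraph)
        .where(
            BookParagraph.book_id == book_id,
            BookParagraph.index.between(para_index - 1, para_index + 1),
        )
        .order_by(BookParagraph.index)
    )
    result = await db.execute(stmt)
    paragraphs = result.scalars().all()
    return "\n".join(p.text for p in paragraphs)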
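We needed a prompt that instructs the LLM to translate the phrase in the style of the surrounding text, supply a natural equivalent plus a one-sentence explanation for idioms, metaphors, and cultural references, and answer in a fixed output format.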
Here's the core prompt template:
INSTANT_HELP_PROMPT = """
You are a literary translator. Below is the source text surrounding a highlighted phrase, the phrase itself, and the target language.
Translate the highlighted phrase into {target_lang} in a way that fits the style of the surrounding text.
If the phrase contains an idiom, metaphor, or cultural reference, provide a natural equivalent and a one-sentence explanation in parentheses.
Output format:
**Translation:** [your translation]
**Note:** [explanation if needed]
Surrounding text:
{context}
Highlighted phrase:
"{phrase}"
Translation:
"""
We found that Claude 3 Haiku respects this format almost always, and the "Note" part is omitted when not needed.
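Because the reply follows a fixed shape, it can be split into its two parts before rendering. This is a minimal sketch, assuming the reply uses the "**Translation:**" and "**Note:**" markers from the template; parse_help_output and HelpResult are hypothetical names, and the fallback for malformed replies is one possible choice, not necessarily what we ship.

```python
import re
from typing import NamedTuple, Optional


class HelpResult(NamedTuple):
    translation: str
    note: Optional[str]


# The translation is matched lazily so that an optional trailing note is split off.
_PATTERN = re.compile(
    r"\*\*Translation:\*\*\s*(?P<translation>.+?)"
    r"(?:\s*\*\*Note:\*\*\s*(?P<note>.+))?\s*$",
    re.DOTALL,
)


def parse_help_output(raw: str) -> HelpResult:
    """Split a model reply into translation and optional note."""
    match = _PATTERN.search(raw)
    if match is None:
        # Malformed reply: show the raw text rather than dropping it.
        return HelpResult(raw.strip(), None)
    note = match.group("note")
    return HelpResult(
        match.group("translation").strip(),
        note.strip() if note else None,
    )
```

With streaming, a parser like this would run on the accumulated text once the stream ends; partial text can be rendered as-is in the meantime.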
We built an async endpoint that yields SSE chunks. The client can start rendering the translation as tokens arrive, which feels instant.
import json

import anthropic
from fastapi import APIRouter, Request
from fastapi.responses import StreamingResponse

router = APIRouter()

# async_session is the app's SQLAlchemy async session factory (defined elsewhere).
@router.post("/api/instant-help")
async def instant_help(request: Request):
    data = await request.json()
    phrase = data["phrase"]
    book_id = data["bookId"]
    para_index = data["paraIndex"]
    target_lang = data["targetLang"]

    async def event_generator():
        # Load the surrounding paragraphs first; the session is not needed while streaming.
        async with async_session() as db:
            context = await get_context(book_id, para_index, db)
        prompt = INSTANT_HELP_PROMPT.format(
            target_lang=target_lang,
            context=context,
            phrase=phrase,
        )
        async with anthropic.AsyncAnthropic() as client:
            stream = await client.messages.create(
                model="claude-3-haiku-20240307",
                max_tokens=300,
                temperature=0.3,
                messages=[{"role": "user", "content": prompt}],
                stream=True,
            )
            async for event in stream:
                if event.type == "content_block_delta":
                    # Forward each text delta as one SSE "data:" frame.
                    payload = json.dumps({"text": event.delta.text})
                    yield f"data: {payload}\n\n"
                elif event.type == "message_stop":
                    yield "data: [DONE]\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")
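On the frontend, we consume these events with an EventSource-style client and render the text as it arrives. Note that the browser's native EventSource API only issues GET requests, so a POST endpoint like this one has to be read with fetch and a streaming response reader, or with a library that wraps that pattern. For typical phrases, the first token appears about 400–600 ms after the click.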
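The wire format is simple enough to parse by hand, which is handy for integration tests or scripts that call the endpoint outside the browser. Below is a minimal Python sketch of such a consumer, assuming the frames look exactly like the ones the endpoint yields (a JSON object with a "text" key, then a final "[DONE]" sentinel); iter_sse_text is a hypothetical helper, not part of our frontend.

```python
import json
from typing import Iterable, Iterator


def iter_sse_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from SSE lines until the [DONE] sentinel."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank frame separators, comments, other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)["text"]


# Usage with canned frames; a real client would feed lines from the HTTP response.
frames = [
    'data: {"text": "Hola"}\n',
    "\n",
    'data: {"text": " mundo"}\n',
    "\n",
    "data: [DONE]\n",
]
print("".join(iter_sse_text(frames)))  # prints "Hola mundo"
```

Since json.dumps escapes newlines inside strings, each delta stays on a single "data:" line, so line-by-line parsing is sufficient here.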
Haiku is fast but not always perfect. We tried DeepSeek-V2 (slower but better with idioms) but its latency crossed 2 seconds, killing the "instant" feel. We settled on Haiku for now, with a secondary more detailed translation available on demand (which uses Claude 3 Opus in the background).
Each instant help call costs about $0.002 (input + output tokens). With thousands of users, that adds up. We implemented a shared cache in Redis keyed on (book_id, para_index, phrase, target_lang). Repeated requests for the same phrase (e.g., multiple users reading the same book) are served from the cache instantly, which reduced LLM calls by ~30% in our beta.
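The lookup pattern can be sketched as follows. This is an illustration only: DictCache is an in-memory stand-in so the example runs without a Redis server, and cache_key, cached_translate, the "instant-help:" prefix, and the whitespace normalization of the phrase are all assumptions rather than our production code.

```python
import hashlib
import json
from typing import Callable, Dict, Optional


def cache_key(book_id: str, para_index: int, phrase: str, target_lang: str) -> str:
    """Build a fixed-length key from the four fields that identify a request."""
    raw = json.dumps([book_id, para_index, phrase.strip(), target_lang], ensure_ascii=False)
    digest = hashlib.sha256(raw.encode("utf-8")).hexdigest()
    return f"instant-help:{digest}"


class DictCache:
    """In-memory stand-in exposing the get/set subset a Redis client would provide."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._store.get(key)

    def set(self, key: str, value: str) -> None:
        self._store[key] = value


def cached_translate(
    cache: DictCache,
    translate: Callable[[], str],
    book_id: str,
    para_index: int,
    phrase: str,
    target_lang: str,
) -> str:
    """Return the cached translation, or call translate() once and store the result."""
    key = cache_key(book_id, para_index, phrase, target_lang)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = translate()
    cache.set(key, result)
    return result
```

Hashing the fields keeps keys short even for long highlighted phrases; in a streaming setup, the full text would be written to the cache once the stream completes.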
In our experiments, adding the two neighboring paragraphs as context noticeably improved quality without adding many tokens. Including an entire chapter, on the other hand, led to slower responses and occasional off-topic interpretations. We keep the context at ~500 tokens on average.
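One way to keep the context near that size is to treat it as a budget: always keep the paragraph containing the phrase, then add neighbors only while they fit. The sketch below is hypothetical; estimate_tokens uses a crude characters-divided-by-four heuristic instead of a real tokenizer, and the 500-token default simply mirrors the average quoted above.

```python
from typing import List


def estimate_tokens(text: str) -> int:
    """Crude token estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def build_context(paragraphs: List[str], target: int, budget: int = 500) -> str:
    """Keep the target paragraph, then add the previous and next one if they fit the budget."""
    chosen = {target}
    used = estimate_tokens(paragraphs[target])
    for offset in (-1, 1):  # previous paragraph first, then the next one
        idx = target + offset
        if 0 <= idx < len(paragraphs):
            cost = estimate_tokens(paragraphs[idx])
            if used + cost <= budget:
                chosen.add(idx)
                used += cost
    # Emit the selected paragraphs in reading order.
    return "\n".join(paragraphs[i] for i in sorted(chosen))
```

The target paragraph is never dropped, even if it alone exceeds the budget, since the highlighted phrase must always appear in the prompt.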
Adding an explicit "Output format: **Translation:** ... **Note:** ..." section to the prompt reduced malformed responses by 90%. Small tweaks matter.

We're exploring a context window expansion that uses the entire chapter, but with aggressive summarization of preceding paragraphs via a cheap model call. Also, fine-tuning a small open-source model on our translation style could bring costs close to zero. If you've built similar inline AI features, how did you handle the cost/latency/quality triangle? We'd love to hear your approach in the comments.
Building LectuLibre has taught us that AI-powered tools shine when they fit seamlessly into the user's workflow. Instant translation help is that seam—a small feature that feels like magic because it respects the reader's flow.