How We Translate 300-Page Books Using Claude Without Hitting Token Limits

LectuLibre built an AI-powered platform that translates entire books using large language models, overcoming token limits by implementing a sliding window chunking algorithm based on paragraphs with overlap. The system uses Claude's API with a FastAPI backend, splitting long documents into overlapping chunks of up to 180,000 tokens to preserve context and ensure high-quality translation.

Breaking long documents into overlapping chunks, preserving context, and reassembling with FastAPI

At LectuLibre, we've built an AI-powered platform that translates entire books (EPUBs and PDFs) using large language models. When we first hooked up Claude's API, we naively fed it a 300-page PDF in one request. It failed immediately. Claude 3 Opus has a 200K token window, but a 300-page book can easily run to 300K tokens or more. Even if we squeezed it in, the output would be truncated and the quality would degrade at the extremes of the context window.

So we faced a classic long-document problem: how do you translate a book that's larger than the model's context window? Here's the real approach we ended up with, the code we wrote, and the lessons we learned.

Claude 3 Opus and Haiku, like most LLMs, have a maximum context length: 200,000 tokens for Opus. A token is roughly ¾ of a word. A 300-page novel with ~75,000 words comes to about 100K tokens, so it should fit, right? But translations from English to Spanish can expand by 15–20%, and the prompt instructions, system message, and the user message itself all eat into that budget. Plus, giving the model full context would have meant sending the entire source text in every call. That's not feasible.

We could have tried a simple split: cut the book at arbitrary page boundaries and translate piecemeal. That fails spectacularly. The narrative breaks mid-sentence, and phrases like "the previous chapter" lose their referents. We needed a more intelligent chunking strategy.

We settled on a sliding window chunking algorithm based on paragraphs, with a generous overlap. Here's the idea:

- Split the text into paragraphs on `\n\n`.
- Build each chunk up to `max_chunk_tokens` (we used 180,000 to keep a safety margin), adding paragraphs one by one and counting tokens with `tiktoken`.
- When a chunk is full, start the next one with the last few paragraphs of the previous chunk (5 by default) so the model sees the preceding context.

This isn't perfect, since some chapters may still be split, but it preserves far more context than any fixed-size split. We built our translation pipeline inside a FastAPI background task.
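Before the full implementation, the sliding window can be seen on a toy input. The sketch below is an illustration only: it counts words instead of tokens to stay dependency-free, and `toy_chunks`, `max_words`, and `overlap` are hypothetical names that do not appear in the pipeline.

```python
from typing import List


def toy_chunks(paragraphs: List[str], max_words: int, overlap: int) -> List[List[str]]:
    """Group paragraphs into windows of at most max_words words (before overlap),
    repeating the last `overlap` paragraphs at the start of the next window."""
    chunks: List[List[str]] = []
    current: List[str] = []
    count = 0
    for para in paragraphs:
        n = len(para.split())  # word count stands in for a token count (assumption)
        if current and count + n > max_words:
            chunks.append(current)
            # Carry the tail of the finished window into the next one
            current = current[-overlap:] if overlap > 0 else []
            count = sum(len(p.split()) for p in current)
        current.append(para)
        count += n
    if current:
        chunks.append(current)
    return chunks


paras = ["one two three", "four five six", "seven eight nine", "ten eleven twelve"]
windows = toy_chunks(paras, max_words=6, overlap=1)
for w in windows:
    print(w)
```

With a six-word budget and one paragraph of overlap, each window starts with the last paragraph of the window before it.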
Here's the core chunking function:

```python
import tiktoken
from typing import List
from langchain_text_splitters import RecursiveCharacterTextSplitter


def chunk_by_paragraphs(text: str, max_tokens: int = 180000, overlap_paragraphs: int = 5) -> List[str]:
    """
    Split text into chunks of at most max_tokens tokens, using paragraphs as atomic
    units and overlapping the last overlap_paragraphs from the previous chunk.
    """
    enc = tiktoken.get_encoding("cl100k_base")  # approximates Claude's tokenizer
    paragraphs = text.split('\n\n')
    chunks = []
    current_chunk = []
    current_token_count = 0

    for para in paragraphs:
        para_tokens = len(enc.encode(para))
        # If a single paragraph exceeds the limit (rare), split it further
        if para_tokens > max_tokens:
            # Fall back to a recursive character splitter, measured in tokens
            pieces = RecursiveCharacterTextSplitter(
                chunk_size=max_tokens,
                chunk_overlap=100,
                length_function=lambda x: len(enc.encode(x)),
            ).split_text(para)
        else:
            pieces = [para]

        for p in pieces:
            p_tokens = len(enc.encode(p))
            if current_token_count + p_tokens > max_tokens and current_chunk:
                chunks.append('\n\n'.join(current_chunk))
                # Keep overlapping paragraphs (the slice copies the list;
                # guard against -0, which would keep the whole chunk)
                overlap = current_chunk[-overlap_paragraphs:] if overlap_paragraphs > 0 else []
                current_chunk = overlap
                current_token_count = sum(len(enc.encode(o)) for o in overlap)
            current_chunk.append(p)
            current_token_count += p_tokens

    if current_chunk:
        chunks.append('\n\n'.join(current_chunk))
    return chunks
```

Then we translate each chunk using Anthropic's Python SDK, with back-pressure and retry logic to handle rate limits:

```python
from anthropic import Anthropic, RateLimitError
import asyncio
from tenacity import retry, stop_after_attempt, wait_exponential


async def translate_chunk(client: Anthropic, chunk: str, target_lang: str) -> str:
    system_prompt = (
        f"You are a professional translator. Translate the following text from English to {target_lang}. "
        "Preserve all formatting, line breaks, and special characters. Do not add commentary."
    )

    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=60))
    async def _call():
        try:
            # Run the blocking SDK call in a worker thread
            response = await asyncio.to_thread(
                client.messages.create,
                model="claude-3-opus-20240229",
                max_tokens=4096,  # caps the length of the reply; longer output is cut off
                system=system_prompt,
                messages=[{"role": "user", "content": chunk}],
            )
            return response.content[0].text
        except RateLimitError:
            # Let tenacity handle the retry
            raise

    return await _call()
```

We use `asyncio.to_thread` because the `Anthropic` client is synchronous (the SDK also ships an `AsyncAnthropic` client), and in a FastAPI app we don't want to block the event loop. The `tenacity` library gives us exponential backoff for rate limits.

After translating all chunks in parallel with `asyncio.gather`, we merge them:

```python
def merge_chunks(translated_chunks: List[str], overlap_paragraphs: int = 5) -> str:
    """
    Concatenate translated chunks, removing the overlapping paragraphs
    except from the first chunk.
    """
    if not translated_chunks:
        return ""
    result = translated_chunks[0]
    for i in range(1, len(translated_chunks)):
        # Each subsequent chunk starts with overlap_paragraphs overlap paragraphs; skip them
        chunk_paragraphs = translated_chunks[i].split('\n\n')
        # We assume the translation preserved paragraph boundaries
        main_text = (
            chunk_paragraphs[overlap_paragraphs:]
            if len(chunk_paragraphs) > overlap_paragraphs
            else chunk_paragraphs
        )
        result += '\n\n' + '\n\n'.join(main_text)
    return result
```

We run the chunk translations concurrently, with a limit of 4 simultaneous calls to avoid hitting Anthropic's rate caps. For a 300-page book, we typically get 5–8 chunks of ~180K tokens each. With Claude 3 Opus, each chunk takes about 15–30 seconds to translate. Overall, a full-book translation completes in 2–5 minutes.

Cost: Claude 3 Opus is expensive.
At $15 per million input tokens, a 300-page book (~100K input tokens per chunk, ~8 chunks) costs around $12–15. We mitigated this by offering Claude 3 Haiku (cheaper, faster, but lower quality) and DeepSeek as alternatives. Users can choose.

Quality trade-offs: The overlap strategy works well for most texts, but sometimes a chapter ends exactly at a chunk boundary and the narrative flow feels a bit disjointed. We experimented with dynamic overlap based on chapter markers (e.g., force a split only at chapter headings), but that added complexity and didn't always align with token limits. We're sticking with paragraph-level overlap for now.

A few lessons from building this:

- `cl100k_base` is close to Claude's tokenizer but not identical. We saw a 5% discrepancy in token counts, so we kept a safety margin of 20K tokens below the limit.
- On rate limits, `tenacity` and a concurrency semaphore saved us.
- Splitting on `\n\n` works for prose, but tables, lists, and code blocks get mangled. We're now exploring a markdown-aware splitter.

LectuLibre's translation pipeline currently handles EPUBs and PDFs up to ~1000 pages. We've translated novels, technical manuals, and even a PhD thesis. The chunking approach has held up surprisingly well, but there's room for improvement: dynamic overlap detection, better table handling, and perhaps a two-stage translation where we first summarize each chunk's context.

If you're building a similar system, don't underestimate the merge logic. The chunking is easy; making the final output read like a single, coherent book is the real challenge.

What's your experience with long-form AI translation? Have you found a better chunking heuristic? We'd love to hear your thoughts in the comments.
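For readers who want to try the concurrency limit described in this article, one way to cap simultaneous calls is an `asyncio.Semaphore` wrapped around each task. This is a minimal sketch under stated assumptions, not necessarily the production code: `translate_all`, `fake_translate`, and `max_concurrent` are hypothetical names, and `fake_translate` stands in for the real API call so the example runs offline.

```python
import asyncio
from typing import Awaitable, Callable, List


async def translate_all(
    chunks: List[str],
    translate: Callable[[str], Awaitable[str]],
    max_concurrent: int = 4,
) -> List[str]:
    """Run translate() over every chunk, at most max_concurrent at a time.
    Results come back in the same order as the input chunks."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(chunk: str) -> str:
        # Only max_concurrent coroutines can hold the semaphore at once
        async with semaphore:
            return await translate(chunk)

    return list(await asyncio.gather(*(guarded(c) for c in chunks)))


async def fake_translate(chunk: str) -> str:
    # Hypothetical stand-in for the real translation call
    await asyncio.sleep(0.01)
    return chunk.upper()


results = asyncio.run(translate_all(["uno", "dos", "tres"], fake_translate))
print(results)
```

Because `asyncio.gather` preserves input order, the returned list can be passed straight to a merge step without re-sorting.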