{"slug": "how-we-translate-300-page-books-using-claude-without-hitting-token-limits", "title": "How We Translate 300-Page Books Using Claude Without Hitting Token Limits", "summary": "LectuLibre built an AI-powered platform that translates entire books using large language models, overcoming token limits by implementing a sliding window chunking algorithm based on paragraphs with overlap. The system uses Claude's API with a FastAPI backend, splitting long documents into overlapping chunks of up to 180,000 tokens to preserve context and ensure high-quality translation.", "body_md": "*Breaking long documents into overlapping chunks, preserving context, and reassembling with FastAPI*\n\nAt LectuLibre, we’ve built an AI‑powered platform that translates entire books—EPUBs and PDFs—using large language models. When we first hooked up Claude’s API, we naively fed it a 300‑page PDF in one request. It failed immediately. Claude 3 Opus has a 200K token window, but a 300‑page book can easily run to 300K tokens or more. Even if we squeezed it in, the output would be truncated and the quality would degrade at the extremes of the context window.\n\nSo we faced a classic long‑document problem: **how do you translate a book that’s larger than the model’s context window?** Here’s the real approach we ended up with, the code we wrote, and the lessons we learned.\n\nClaude 3 Opus and Haiku models (and most LLMs) have a maximum context length—200,000 tokens for Opus. A token is roughly ¾ of a word. A 300‑page novel with ~75,000 words translates to about 100K tokens, so it *should* fit, right? But translations from English to Spanish can expand by 15–20%, and the prompt instructions, system message, and the user message itself all eat into that budget. Plus, we needed to send the *entire* source text in every call to give the model full context. That’s not feasible.\n\nWe could have tried a simple split: cut the book at arbitrary page boundaries and translate piecemeal. 
That fails spectacularly. Narrative breaks mid‑sentence, and phrases like “the previous chapter” lose their referents. We needed a more intelligent chunking strategy.\n\nWe settled on a **sliding window chunking algorithm** based on paragraphs, with a generous overlap. Here’s the idea:\n\n1. Split the text into paragraphs (on `\\n\\n`).\n2. Build each chunk up to `max_chunk_tokens` (we used 180,000 to keep a safety margin), adding paragraphs one by one and counting tokens with `tiktoken`.\n3. Start each new chunk with the last few paragraphs of the previous one (five by default), so the model sees the preceding context.\n\nThis isn’t perfect (some chapters may still be split), but it preserves far more context than any fixed‑size split.\n\nWe built our translation pipeline inside a FastAPI background task. Here’s the core chunking function:\n\n``` python\nimport tiktoken\nfrom typing import List\nfrom langchain_text_splitters import RecursiveCharacterTextSplitter\n\ndef chunk_by_paragraphs(text: str, max_tokens: int = 180000, overlap_paragraphs: int = 5) -> List[str]:\n    \"\"\"\n    Split text into chunks of at most `max_tokens` tokens,\n    using paragraphs as atomic units and overlapping the last\n    `overlap_paragraphs` from the previous chunk.\n    \"\"\"\n    # cl100k_base is an OpenAI encoding; we use it only to approximate Claude's token counts\n    enc = tiktoken.get_encoding(\"cl100k_base\")\n    paragraphs = text.split('\\n\\n')\n    chunks = []\n    current_chunk = []\n    current_token_count = 0\n\n    for para in paragraphs:\n        para_tokens = len(enc.encode(para))\n        # If a single paragraph exceeds the limit (rare), split it further\n        if para_tokens > max_tokens:\n            # Fallback: recursive character splitting, measured in tokens\n            para_texts = RecursiveCharacterTextSplitter(\n                chunk_size=max_tokens, chunk_overlap=100,\n                length_function=lambda x: len(enc.encode(x))\n            ).split_text(para)\n            for p in para_texts:\n                p_tokens = len(enc.encode(p))\n                if current_token_count + p_tokens > max_tokens and current_chunk:\n                    chunks.append('\\n\\n'.join(current_chunk))\n                    overlap = 
current_chunk[-overlap_paragraphs:] if len(current_chunk) >= overlap_paragraphs else current_chunk\n                    current_chunk = overlap.copy()\n                    current_token_count = sum(len(enc.encode(p)) for p in overlap)\n                current_chunk.append(p)\n                current_token_count += p_tokens\n        else:\n            if current_token_count + para_tokens > max_tokens and current_chunk:\n                chunks.append('\\n\\n'.join(current_chunk))\n                # Keep overlapping paragraphs\n                overlap = current_chunk[-overlap_paragraphs:] if len(current_chunk) >= overlap_paragraphs else current_chunk\n                current_chunk = overlap.copy()\n                current_token_count = sum(len(enc.encode(p)) for p in overlap)\n            current_chunk.append(para)\n            current_token_count += para_tokens\n\n    if current_chunk:\n        chunks.append('\\n\\n'.join(current_chunk))\n    return chunks\n```\n\nThen we translate each chunk using Anthropic’s Python SDK, with back‑pressure and retry logic to handle rate limits:\n\n``` python\nfrom anthropic import Anthropic, RateLimitError\nimport asyncio\nfrom tenacity import retry, stop_after_attempt, wait_exponential\n\nasync def translate_chunk(client: Anthropic, chunk: str, target_lang: str) -> str:\n    system_prompt = f\"You are a professional translator. Translate the following text from English to {target_lang}. Preserve all formatting, line breaks, and special characters. 
Do not add commentary.\"\n\n    @retry(stop=stop_after_attempt(3), wait=wait_exponential(multiplier=1, min=4, max=60))\n    async def _call():\n        try:\n            response = await asyncio.to_thread(\n                client.messages.create,\n                model=\"claude-3-opus-20240229\",\n                max_tokens=4096,\n                system=system_prompt,\n                messages=[{\"role\": \"user\", \"content\": chunk}]\n            )\n            return response.content[0].text\n        except RateLimitError:\n            # Let tenacity handle the retry\n            raise\n    return await _call()\n```\n\nWe use `asyncio.to_thread` because we call the SDK’s synchronous `Anthropic` client; in a FastAPI app we don’t want to block the event loop. The `tenacity` library gives us exponential backoff for rate limits. After translating all chunks in parallel with `asyncio.gather`, we merge them:\n\n``` python\nfrom typing import List\n\ndef merge_chunks(translated_chunks: List[str], overlap_paragraphs: int = 5) -> str:\n    \"\"\"\n    Concatenate translated chunks, removing the overlapping paragraphs\n    except from the first chunk.\n    \"\"\"\n    if not translated_chunks:\n        return \"\"\n    result = translated_chunks[0]\n    for i in range(1, len(translated_chunks)):\n        # Each subsequent chunk starts with `overlap_paragraphs` overlap paragraphs; skip them\n        chunk_paragraphs = translated_chunks[i].split('\\n\\n')\n        # We assume the translation preserved paragraph boundaries\n        main_text = chunk_paragraphs[overlap_paragraphs:] if len(chunk_paragraphs) > overlap_paragraphs else chunk_paragraphs\n        result += '\\n\\n' + '\\n\\n'.join(main_text)\n    return result\n```\n\nWe run all chunk translations concurrently. For a 300‑page book, we typically get 5–8 chunks of ~180K tokens each. With Claude 3 Opus, each chunk takes about 15–30 seconds to translate. We impose a concurrency limit of 4 simultaneous calls to avoid hitting Anthropic’s rate caps. 
Overall, a full‑book translation completes in 2–5 minutes.\n\n**Cost**: Claude 3 Opus is expensive. At $15 per million input tokens, a 300‑page book (~100K input tokens per chunk, ~8 chunks) costs around $12–15. We mitigated this by offering Claude 3 Haiku (cheaper, faster, but lower quality) and DeepSeek as alternatives. Users can choose.\n\n**Quality trade‑offs**: The overlap strategy works well for most texts, but sometimes a chapter ends exactly at a chunk boundary and the narrative flow feels a bit disjointed. We experimented with dynamic overlap based on chapter markers (e.g., force a split only at chapter headings), but that added complexity and didn’t always align with token limits. We’re sticking with paragraph‑level overlap for now.\n\n- `cl100k_base` is close to Claude’s tokenizer but not identical. We saw a 5% discrepancy in token counts, so we kept a safety margin of 20K tokens below the limit.\n- `tenacity` and a concurrency semaphore saved us when we hit rate limits.\n- Splitting on `\\n\\n` works for prose, but tables, lists, and code blocks get mangled. We’re now exploring a markdown‑aware splitter.\n\nLectuLibre’s translation pipeline currently handles EPUBs and PDFs up to ~1000 pages. We’ve translated novels, technical manuals, and even a PhD thesis. The chunking approach has held up surprisingly well, but there’s room for improvement: dynamic overlap detection, better table handling, and perhaps a two‑stage translation where we first summarize each chunk’s context.\n\nIf you’re building a similar system, don’t underestimate the merge logic. The chunking is easy; making the final output read like a single, coherent book is the real challenge.\n\n**What’s your experience with long‑form AI translation? 
Have you found a better chunking heuristic?** We’d love to hear your thoughts in the comments.", "url": "https://wpnews.pro/news/how-we-translate-300-page-books-using-claude-without-hitting-token-limits", "canonical_source": "https://dev.to/jacob_gong/how-we-translate-300-page-books-using-claude-without-hitting-token-limits-4b93", "published_at": "2026-07-01 03:01:18+00:00", "updated_at": "2026-07-01 03:48:38.128187+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "developer-tools"], "entities": ["LectuLibre", "Claude", "FastAPI", "Claude 3 Opus", "Claude 3 Haiku", "tiktoken", "langchain"], "alternates": {"html": "https://wpnews.pro/news/how-we-translate-300-page-books-using-claude-without-hitting-token-limits", "markdown": "https://wpnews.pro/news/how-we-translate-300-page-books-using-claude-without-hitting-token-limits.md", "text": "https://wpnews.pro/news/how-we-translate-300-page-books-using-claude-without-hitting-token-limits.txt", "jsonld": "https://wpnews.pro/news/how-we-translate-300-page-books-using-claude-without-hitting-token-limits.jsonld"}}