{"slug": "parsing-and-rebuilding-epub-files-in-python-lessons-learned", "title": "Parsing and Rebuilding EPUB Files in Python: Lessons Learned", "summary": "LectuLibre built an AI-powered book translation service that parses and rebuilds EPUB files without breaking navigation or metadata. The team chose the ebooklib Python library but had to add extensive validation to handle malformed files, including missing container.xml and corrupted metadata. They iterate over spine items, extract paragraphs with BeautifulSoup, translate via Claude or DeepSeek, and replace text while preserving HTML structure.", "body_md": "*How we handle complex EPUB structures for AI translation without breaking navigation and metadata*\n\nAt [LectuLibre](https://lectulibre.com), we built an AI‑powered book translation service. Users upload an EPUB, and our pipeline translates the text using LLMs like Claude and DeepSeek. That sounds straightforward until you have to parse and rebuild a valid EPUB without mangling the table of contents, internal links, or styles.\n\nI’m sharing the real‑world challenge we faced, how we chose our tooling, and the ugly corners we discovered when dealing with real‑world EPUB files.\n\nAn EPUB is essentially a ZIP archive containing XHTML, CSS, images, and an OPF manifest. It’s a well‑defined standard (EPUB 3.2), but in practice publishers produce files that bend the rules: missing `container.xml`, inline styles that break after translation, and structural quirks that make parsing fragile.\n\nOur translation process needed to:\n\nStep 4 is the tricky part: the translated text can be longer or shorter, it may contain characters that need escaping, and the surrounding markup must remain intact.\n\n`ebooklib` with a Dose of Defensive Coding\n\nWe evaluated several Python libraries:\n\n- `epub` (pypub)\n- `lxml` + manual zip\n- `ebooklib`\n\nWe went with [ebooklib](https://github.com/aerkalov/ebooklib). 
It provides an object‑oriented model of the EPUB structure, lets us iterate over documents, and can write a new EPUB from the modified objects. The downside: its documentation is sparse and it can choke on malformed files. We had to layer on a lot of validation.\n\n``` python\nimport ebooklib\nfrom ebooklib import epub\n\ndef load_epub(epub_path: str) -> epub.EpubBook:\n    book = epub.read_epub(epub_path, {\"ignore_ncx\": True})\n    # Force title to be a string (some books have list titles)\n    if isinstance(book.title, list):\n        book.title = \" \".join(book.title)\n    return book\n```\n\nBut we quickly learned that `read_epub` can fail silently if the book’s metadata is corrupted. We added a custom validation step that checks for a valid OPF and at least one spine item.\n\n``` python\ndef validate_epub(book: epub.EpubBook):\n    if not book.opf:\n        raise ValueError(\"Missing OPF metadata\")\n    if len(list(book.spine)) == 0:\n        raise ValueError(\"No spine items found – EPUB is unreadable\")\n```\n\nAn EPUB’s content is stored in `epub.EpubHtml` objects. 
We iterate over all items in reading order (spine) and parse the body content with BeautifulSoup (Python’s built‑in `html.parser`) because ebooklib’s own `get_body_content()` returns raw bytes, and we need to extract text paragraph‑by‑paragraph while keeping the HTML structure.\n\n``` python\nfrom bs4 import BeautifulSoup\nimport html\n\ndef extract_paragraphs(item: epub.EpubHtml) -> list[dict]:\n    soup = BeautifulSoup(item.get_body_content(), \"html.parser\")\n    paragraphs = []\n    # Collect block-level elements that carry translatable text\n    for tag in soup.find_all([\"p\", \"h1\", \"h2\", \"h3\", \"h4\", \"h5\", \"h6\", \"li\"]):\n        clean_text = tag.get_text(strip=True)\n        if clean_text:\n            paragraphs.append({\n                \"tag\": tag,\n                \"original\": clean_text,\n                \"translated\": None\n            })\n    return paragraphs\n```\n\nWe keep a reference to the original BeautifulSoup `tag` object so we can later replace its text. This is memory‑heavy for large books but works for books under 10 MB (our VPS limit).\n\nFor each paragraph we call our translation API (Claude or DeepSeek). The tricky part is that some paragraphs are very short (headers) or contain entity references. We escape HTML entities before sending, and decode them afterward.\n\n``` python\nimport requests\n\n# API_KEY is assumed to be loaded elsewhere (e.g. from configuration)\n\ndef translate_text(text: str, source_lang: str, target_lang: str) -> str:\n    escaped = html.escape(text, quote=False)\n    response = requests.post(\n        \"https://api.lectulibre.com/v1/translate\",  # simplified\n        json={\"text\": escaped, \"source\": source_lang, \"target\": target_lang},\n        headers={\"Authorization\": f\"Bearer {API_KEY}\"},\n        timeout=60,\n    )\n    response.raise_for_status()  # surface HTTP errors instead of failing on .json()\n    translated = response.json()[\"translated\"]\n    return html.unescape(translated)\n```\n\nWe found that LLMs can sometimes add extra spaces or punctuation. 
We apply light post‑processing: we trim, normalize spaces, and ensure the translated text doesn’t break the containing tag’s structure.\n\nBack in the `extract_paragraphs` output, we replace each tag’s text with the translated text. Since `tag.string` is `None` whenever the tag has more than one child (for example, text mixed with inline elements), we must be careful. If the tag contains only a string, we replace it. If it contains mixed content, we replace the first text node only, which is a simplification that works for most books.\n\n``` python\ndef replace_text(tag, new_text: str):\n    if tag.string is not None and tag.find(True) is None:\n        # Simple case: tag has only a single text node\n        tag.string.replace_with(new_text)\n    else:\n        # Mixed content: replace only the first non-empty text node\n        for child in tag.children:\n            if isinstance(child, str) and child.strip():\n                child.replace_with(new_text)\n                break\n```\n\nAfter all replacements, we write the modified HTML back to the item.\n\n``` python\ndef update_item(item: epub.EpubHtml, paragraphs: list[dict]):\n    if not paragraphs:\n        return  # nothing to translate in this item\n    for p in paragraphs:\n        if p[\"translated\"]:\n            replace_text(p[\"tag\"], p[\"translated\"])\n    # Every tag belongs to the same parse tree, so the outermost parent\n    # is the BeautifulSoup object holding the whole body\n    soup = list(paragraphs[0][\"tag\"].parents)[-1]\n    item.set_content(str(soup).encode(\"utf-8\"))\n```\n\nA problem here: we must hand `set_content` bytes, and we must ensure the encoding is UTF‑8. Also, if the original file had an XML declaration or namespaces, we might lose them. We handle that by preserving the `item.media_type` and other metadata.\n\nOnce all items are updated, we write the book to a new file. 
We also add a modified‑date and update the language metadata.\n\n``` python\nimport uuid\n\ndef save_book(book: epub.EpubBook, output_path: str):\n    book.set_identifier(\"urn:uuid:\" + str(uuid.uuid4()))\n    book.add_metadata(\"DC\", \"language\", \"fr\")  # target language\n    epub.write_epub(output_path, book, {})\n```\n\nWe learned the hard way that `epub.write_epub` may fail if items reference resources (images, fonts) that aren’t properly registered in the manifest. We iterate over all items from the original book and add them to the manifest early to avoid missing dependency errors.\n\n**Broken Table of Contents**: After translation, the NCX/NAV files pointed to old file names or anchors that no longer existed because we had renamed items. We now never rename items; we only modify their content in place. If we must add new items (e.g., for footnotes), we update the TOC manually using `ebooklib.epub.Link` objects.\n\n**Inline CSS Overwrites**: Some books use inline styles like `font-size: 12pt`. When a translated paragraph becomes longer, it can overflow fixed‑height containers. We don’t modify CSS, but we added a warning for books with rigid styling and offer a “clean” version without fixed heights.\n\n**Performance**: For a 500‑page novel, the entire pipeline (parse, translate, rebuild) takes about 90 seconds on our VPS (4 vCPU, 8 GB RAM). The LLM calls dominate; we batch up to 5 paragraphs per request to reduce API overhead, at the cost of a slight dip in translation quality.\n\n**Memory**: Loading the entire EPUB and keeping BeautifulSoup trees in memory can spike to 300 MB for large books. We process one book at a time and use a queue to avoid concurrency issues.\n\n`ebooklib` is great but fragile\n\nWe’re exploring `pandoc` for pre‑conversion to a simpler intermediate format that’s easier to manipulate. However, the rebuild step becomes more complex. 
For now, `ebooklib` + `BeautifulSoup` serves our needs.\n\nIf you’re building an EPUB processing tool in Python, I hope these real‑world insights save you some of the debugging hours we spent. Got a better approach? I’d love to hear it in the comments!\n\n*Happy coding!*", "url": "https://wpnews.pro/news/parsing-and-rebuilding-epub-files-in-python-lessons-learned", "canonical_source": "https://dev.to/jacob_gong/parsing-and-rebuilding-epub-files-in-python-lessons-learned-5e6h", "published_at": "2026-06-27 03:00:46+00:00", "updated_at": "2026-06-27 04:03:51.687312+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "developer-tools"], "entities": ["LectuLibre", "ebooklib", "BeautifulSoup", "Claude", "DeepSeek", "Python", "lxml"], "alternates": {"html": "https://wpnews.pro/news/parsing-and-rebuilding-epub-files-in-python-lessons-learned", "markdown": "https://wpnews.pro/news/parsing-and-rebuilding-epub-files-in-python-lessons-learned.md", "text": "https://wpnews.pro/news/parsing-and-rebuilding-epub-files-in-python-lessons-learned.txt", "jsonld": "https://wpnews.pro/news/parsing-and-rebuilding-epub-files-in-python-lessons-learned.jsonld"}}