Parsing and Rebuilding EPUB Files in Python: Lessons Learned from Building an AI Translation Service

wpnews.pro

How we extract, translate, and reconstruct entire ebooks with Python while preserving every detail

At LectuLibre, we built a service that translates entire books using large language models. Our users upload EPUB files, and our backend pipeline parses them, extracts the text, sends it to an LLM for translation, and then rebuilds the EPUB with the translated content—all while preserving the original formatting, images, and metadata. This sounded straightforward until we looked inside a real EPUB.

EPUB is essentially a ZIP file containing a structured set of XHTML, CSS, and XML files. The content.opf file defines the reading order (spine), metadata, and manifest. The toc.ncx holds the table of contents. The actual text lives in XHTML documents, often split per chapter. To translate a book, we needed to: 1) reliably parse the EPUB, 2) locate all translatable text, 3) send it chunk by chunk to the LLM, and 4) rebuild the EPUB with the translated text while keeping every byte of the formatting intact.
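To make the container layout concrete, here is a minimal sketch that builds a tiny EPUB-like ZIP in memory and reads its spine back using only the standard library. The paths (EPUB/content.opf, ch1.xhtml, ch2.xhtml) and the helper names are assumptions for illustration, and the sample archive is not a complete, valid book. Locating the OPF through META-INF/container.xml is how real files are laid out, but their paths vary.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

def build_sample_epub() -> bytes:
    """Build a tiny, illustrative EPUB-like archive in memory."""
    container = (
        '<?xml version="1.0"?>'
        '<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="EPUB/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles></container>'
    )
    opf = (
        '<?xml version="1.0"?>'
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
        '<manifest>'
        '<item id="c1" href="ch1.xhtml" media-type="application/xhtml+xml"/>'
        '<item id="c2" href="ch2.xhtml" media-type="application/xhtml+xml"/>'
        '</manifest>'
        '<spine><itemref idref="c2"/><itemref idref="c1"/></spine>'
        '</package>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", container)
        zf.writestr("EPUB/content.opf", opf)
    return buf.getvalue()

def spine_hrefs(epub_bytes: bytes) -> list[str]:
    """Return manifest hrefs in spine (reading) order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # container.xml points at the OPF; never assume its location.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
    manifest = {
        item.get("id"): item.get("href")
        for item in opf.findall("opf:manifest/opf:item", OPF_NS)
    }
    return [
        manifest[ref.get("idref")]
        for ref in opf.findall("opf:spine/opf:itemref", OPF_NS)
    ]

print(spine_hrefs(build_sample_epub()))
```

Note that the spine, not the manifest, decides reading order: the sample lists chapter 2 before chapter 1 in the spine, and that is the order returned.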

We initially reached for ebooklib, the most popular Python library for EPUB manipulation. It worked great for simple EPUBs, until we threw a few hundred real-world files at it. We quickly hit issues:

- ebooklib didn’t fully preserve custom metadata or namespace-prefixed properties in the OPF.
- xmlns attributes were not always preserved, breaking rendering on some devices.
- ebooklib loaded everything into memory at once.

We could have used a heavyweight tool like Calibre’s command-line interface, but that introduced external dependencies and wasn’t as programmatically flexible. Instead, we decided to stick with ebooklib for high-level book structure and augment it with lxml for precise XML control.

Here’s the core approach we landed on:

1) Use ebooklib to get a list of items (documents, images, CSS).
2) Select the items of type ITEM_DOCUMENT (XHTML) and sometimes ITEM_NAVIGATION (NCX for titles).
3) Parse each one with lxml and extract text, while keeping a map of each text node to its parent element.
4) Rebuild the EPUB with ebooklib, manually ensuring the OPF and spine are correct.

Let’s dive into the code.

import ebooklib
from ebooklib import epub

book = epub.read_epub('original.epub')

# Keep only the items that can carry translatable text:
# XHTML documents and the NCX navigation file (chapter titles).
translatable_types = (ebooklib.ITEM_DOCUMENT, ebooklib.ITEM_NAVIGATION)
translatable_items = [
    item for item in book.get_items()
    if item.get_type() in translatable_types
]

We ignore images, fonts, and CSS—they don’t contain translatable text.

We need to extract text while remembering exactly where it came from. We use lxml.etree to parse the XHTML and walk the tree, collecting text nodes and their XPath locations:

from lxml import etree

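def extract_text_with_xpath(content):
    # HTMLParser tolerates malformed markup, but it discards XML namespaces.
    parser = etree.HTMLParser()
    root = etree.fromstring(content, parser)
    tree = etree.ElementTree(root)

    # List of (xpath, original_text, elem). For a tail entry, xpath points at
    # the parent while elem is the node whose .tail holds the text.
    text_mapping = []
    for elem in root.iter():
        # Comments and processing instructions have a non-string tag;
        # their .text is not book content, but their .tail is.
        is_element = isinstance(elem.tag, str)
        if is_element and elem.text and elem.text.strip():
            text_mapping.append((tree.getpath(elem), elem.text, elem))
        if elem.tail and elem.tail.strip():
            parent = elem.getparent()
            if parent is not None:
                text_mapping.append((tree.getpath(parent), elem.tail, elem))
    return text_mapping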

Pay attention to tail text: it’s the text that follows a closing tag, common in interleaved markup. Missing it leads to lost sentences.
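A small illustration of the text/tail split may help here. This sketch uses the standard library's xml.etree.ElementTree so it runs without extra dependencies; lxml elements expose the same .text and .tail attributes. The sample sentence is made up.

```python
import xml.etree.ElementTree as ET

# In "<p>Hello <b>brave</b> new world</p>", "Hello " is p.text,
# "brave" is b.text, and " new world" is b.tail (it follows </b>).
p = ET.fromstring("<p>Hello <b>brave</b> new world</p>")
b = p.find("b")

pieces = [p.text, b.text, b.tail]
print(pieces)

# Collecting only .text silently drops the end of the sentence.
text_only = "".join(e.text or "" for e in p.iter())
# itertext() walks both .text and .tail in document order.
with_tails = "".join(p.itertext())
print(repr(text_only))
print(repr(with_tails))
```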

We batch the collected text nodes into chunks that respect LLM token limits. For instance, we group consecutive text from the same XHTML document, aiming for ~3000 tokens per batch. We then send each chunk to our translation model (e.g., Claude 3.5 Sonnet) and receive a block of translated text. We split the translated block back into individual strings by comparing lengths (advanced: we use a diff algorithm to align original and translated sentences). This is simplified here for brevity.
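The batching step can be sketched roughly as follows. This is an illustration under stated assumptions, not our production code: estimate_tokens uses a crude four-characters-per-token heuristic, and both function names are hypothetical. A real pipeline would count tokens with the model's own tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)

def batch_by_token_budget(texts, max_tokens=3000):
    """Group consecutive strings so each batch stays within max_tokens.

    A single string larger than the budget gets a batch of its own.
    """
    batches, current, used = [], [], 0
    for text in texts:
        cost = estimate_tokens(text)
        # Start a new batch when adding this string would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches

sample = ["a" * 40, "b" * 40, "c" * 40, "d" * 400]
batches = batch_by_token_budget(sample, max_tokens=25)
print([len(batch) for batch in batches])
```

Because strings are never reordered or split, concatenating the batches gives back the original sequence, which keeps the later write-back step simple.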

Now we map translations back:

for (xpath, original, elem), translated_text in zip(text_mapping, translations):
    if elem.text and elem.text == original:
        elem.text = translated_text
    elif elem.tail and elem.tail == original:
        elem.tail = translated_text

new_content = etree.tostring(root, encoding='unicode', method='html')

We return the modified XHTML as a string, ready to replace the item’s content in the EPUB.
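Comparing strings to decide between .text and .tail can misfire when an element's text and tail happen to be identical. One possible alternative, sketched here, is to record the slot name at extraction time and write back by slot. The names collect_slots and apply_translations are hypothetical, the uppercase "translation" is a stand-in for the LLM, and the standard library's ElementTree is used so the example runs on its own.

```python
import xml.etree.ElementTree as ET

def collect_slots(root):
    """Return a list of (element, slot, original), slot being 'text' or 'tail'."""
    slots = []
    for elem in root.iter():
        if elem.text and elem.text.strip():
            slots.append((elem, "text", elem.text))
        # The root's tail lies outside the document, so skip it.
        if elem is not root and elem.tail and elem.tail.strip():
            slots.append((elem, "tail", elem.tail))
    return slots

def apply_translations(slots, translations):
    """Write each translation back into the exact slot it was read from."""
    if len(slots) != len(translations):
        raise ValueError("translation count does not match extracted strings")
    for (elem, slot, _original), translated in zip(slots, translations):
        setattr(elem, slot, translated)

# Three identical strings in three different slots.
root = ET.fromstring("<p>Same <i>Same</i>Same</p>")
slots = collect_slots(root)
apply_translations(slots, [original.upper() + "!" for _, _, original in slots])
result = ET.tostring(root, encoding="unicode")
print(result)
```

The explicit length check also surfaces a mismatch between extracted and translated strings, which zip() alone would hide by truncating.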

Here’s where ebooklib shines. We create a new EpubBook, set the same metadata (title, author, language), and add items:

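# original_book is the EpubBook returned by epub.read_epub();
# modified_content_map maps item file names to translated XHTML strings.
new_book = epub.EpubBook()
new_book.set_identifier(original_book.get_metadata('DC', 'identifier')[0][0])
new_book.set_title(original_book.get_metadata('DC', 'title')[0][0])
new_book.set_language(original_book.get_metadata('DC', 'language')[0][0])
# Copy every author; each one needs a distinct id in the OPF.
for i, (creator, _attrs) in enumerate(original_book.get_metadata('DC', 'creator')):
    new_book.add_author(creator, uid=f'creator{i}')

for item in original_book.get_items():
    if item.get_name() in modified_content_map:
        new_content = modified_content_map[item.get_name()]
        new_item = epub.EpubItem(
            uid=item.get_id(),
            file_name=item.get_name(),
            # media_type expects a MIME string; get_type() returns an integer constant.
            media_type=item.media_type,
            content=new_content.encode('utf-8')
        )
    else:
        new_item = item
    new_book.add_item(new_item)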

new_book.spine = original_book.spine
new_book.toc = original_book.toc

epub.write_epub('translated.epub', new_book, {})

But wait: this naive approach can corrupt the OPF. We found that ebooklib sometimes rewrites the spine order incorrectly if the original had complex nesting. To fix this, we manually post-process the written EPUB’s content.opf using lxml:

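import zipfile
from lxml import etree

OPF_PATH = 'EPUB/content.opf'  # ebooklib's default; META-INF/container.xml is authoritative

# Mode 'a' would append a second entry with the same name, so copy the
# archive into a new file and swap in the corrected OPF on the way.
with zipfile.ZipFile('translated.epub') as src, \
        zipfile.ZipFile('translated_fixed.epub', 'w') as dst:
    for info in src.infolist():
        data = src.read(info.filename)
        if info.filename == OPF_PATH:
            opf = etree.fromstring(data)
            spine = opf.find('.//{http://www.idpf.org/2007/opf}spine')
            # Spine corrections (not shown here) go at this point.
            data = etree.tostring(opf, xml_declaration=True, encoding='UTF-8')
        dst.writestr(info, data)  # reuses each entry's compression setting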

Yes, it’s ugly, but it saved us from countless validation errors.
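One cheap sanity check that can run before a full validator is to confirm that every spine itemref points at a manifest item. The sketch below is an assumption about what such a check might look like, not the exact one in our pipeline. The function name dangling_spine_refs and the sample OPF are made up, and the standard library parser is used to keep it dependency-free.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

def dangling_spine_refs(opf_xml: str) -> list[str]:
    """Return spine idrefs that do not match any manifest item id."""
    package = ET.fromstring(opf_xml)
    manifest_ids = {
        item.get("id")
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    spine_refs = [
        ref.get("idref")
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]
    return [ref for ref in spine_refs if ref not in manifest_ids]

# The second itemref deliberately points at a missing manifest id.
sample_opf = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="c1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="c1"/>
    <itemref idref="c9"/>
  </spine>
</package>"""

print(dangling_spine_refs(sample_opf))
```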

We benchmarked on a typical novel: 50 chapters, 350KB uncompressed. Parsing and extracting text: ~0.2 seconds. Rebuilding after translation: ~0.3 seconds. The LLM translation step dominates (around 45 seconds for the whole book), so we worked on parallelism for that part instead.
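Since the translation calls are network-bound, a thread pool is one straightforward way to run them in parallel. The following is a minimal sketch, assuming batches are independent of each other. translate_batch is a hypothetical stand-in that only tags its input (a real version would call the LLM API), and the worker count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

def translate_batch(batch):
    """Hypothetical stand-in for an LLM call; it just tags each string."""
    return [f"[translated] {text}" for text in batch]

def translate_all(batches, max_workers=4):
    """Translate batches concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves the order of its input iterable.
        return list(pool.map(translate_batch, batches))

batches = [["Chapter 1", "It was a dark night."], ["Chapter 2"]]
translated = translate_all(batches)
print(translated)
```

Keeping results in input order matters here, because the write-back step pairs translations with extracted strings by position.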

However, with larger educational texts containing hundreds of images and complex tables, memory usage spiked to over 500MB. We mitigated this by processing documents one by one and releasing them immediately.

A few things we learned along the way:

- Preserve xmlns="http://www.w3.org/1999/xhtml" and any custom namespaces on the <html> tag. lxml’s etree.tostring() with method='html' can drop them unless you explicitly add them back.
- Validate the output with epubcheck (via Python subprocess) to catch issues. False positives from custom metadata? We whitelist them after manual review.
- ebooklib is great for reading, but for writing, we ended up doing a lot of OPF and NCX manipulation ourselves to ensure compliance.
- Detect DRM-protected files by checking for the <encryption> element in META-INF/encryption.xml and gracefully reject them.

We’d love to know how others are managing complex EPUB manipulation in production. Have you found a more robust library than ebooklib? How do you deal with interactive EPUB3 elements (JavaScript, form fields) when translating? We’re still iterating on our pipeline and would appreciate any battle stories.
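For readers who want a starting point for the META-INF/encryption.xml check mentioned earlier, here is a rough sketch. The function and helper names are hypothetical and the sample archives are built in memory purely for illustration. Note that font obfuscation is also declared in encryption.xml, so a production check may need to look at the listed algorithms rather than reject every file that has EncryptedData entries.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

ENCRYPTION_PATH = "META-INF/encryption.xml"
ENCRYPTED_DATA = "{http://www.w3.org/2001/04/xmlenc#}EncryptedData"

def lists_encrypted_resources(epub_bytes: bytes) -> bool:
    """Return True if encryption.xml exists and lists any EncryptedData."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        if ENCRYPTION_PATH not in zf.namelist():
            return False
        root = ET.fromstring(zf.read(ENCRYPTION_PATH))
    return root.find(f".//{ENCRYPTED_DATA}") is not None

def make_archive(files: dict) -> bytes:
    """Illustrative helper: build a ZIP in memory from {name: content}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()

encryption_xml = (
    '<encryption xmlns="urn:oasis:names:tc:opendocument:xmlns:container" '
    'xmlns:enc="http://www.w3.org/2001/04/xmlenc#">'
    '<enc:EncryptedData><enc:CipherData>'
    '<enc:CipherReference URI="EPUB/ch1.xhtml"/>'
    '</enc:CipherData></enc:EncryptedData></encryption>'
)

plain = make_archive({"mimetype": "application/epub+zip"})
protected = make_archive({
    "mimetype": "application/epub+zip",
    ENCRYPTION_PATH: encryption_xml,
})

print(lists_encrypted_resources(plain), lists_encrypted_resources(protected))
```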

If you’re tackling similar problems or want to try translating your own eBooks, you can see the result of this work at LectuLibre. But most importantly, we hope this deep dive saves you a few late nights the next time you need to mess with EPUB internals.

source & further reading

dev.to (original article)