How We Built a Robust EPUB Parsing and Rebuilding Pipeline in Python

LectuLibre built a robust EPUB parsing and rebuilding pipeline in Python to translate entire books while preserving visual structure. The team used ebooklib for metadata handling and lxml for fast, namespace-aware XHTML manipulation, overcoming issues with broken markup, performance, and namespace chaos. The pipeline extracts translatable text segments, groups them into sentences for LLM translation, and reassembles the book with images, CSS, fonts, and layout intact.

Dealing with broken markup, embedded fonts, and namespace chaos while building LectuLibre's translation engine

At LectuLibre (https://lectulibre.com), we needed to translate entire EPUB books while preserving their exact visual structure. The core challenge: parse the EPUB, extract all translatable text, send it to an LLM, then reassemble the book with the translated content, leaving images, CSS, fonts, and layout untouched.

This turned out to be much harder than it looked. Here’s how we solved it, what broke, and what we learned.

An EPUB is a ZIP archive containing XHTML, CSS, images, and a few XML control files like container.xml and the OPF manifest. In theory, it’s a clean format. In practice, real-world EPUBs are a mess: