Show HN: Review-oriented DOCX extraction toolkit for Rust A Rust developer released `docx-review`, an open-source toolkit that extracts structured JSON from DOCX files, including tracked changes, comments, headers, and footnotes. The headless tool is designed for automation, review workflows, and AI pipelines that require more than plain text extraction from Word documents. docx-review is a review-oriented DOCX extraction toolkit for Rust. It reads .docx files directly from OOXML and produces structured JSON for: - document blocks - tracked changes - comments and comment threads - text anchors - headers, footers, footnotes, and endnotes The project is designed as headless infrastructure for automation, review workflows, AI pipelines, and downstream tools that need more than plain text. docx-review currently supports: - paragraphs - table cells as flat blocks - tracked changes: - insert - delete - replacement - move - format change - comments - comment anchors and anchored text - comment threading and resolved state from commentsExtended.xml - footnotes and endnotes - headers and footers - list item detection by nesting level - text spans with tracked-change markers crates/core docx-review-core - extraction library and normalized data model crates/cli docx-review - command-line interface Install the CLI from crates.io: cargo install docx-review-cli This installs the docx-review command. Basic extraction: docx-review extract document.docx Pretty JSON: docx-review extract document.docx --pretty Tracked changes only: docx-review extract document.docx --only track-changes --pretty Comments only: docx-review extract document.docx --only comments --pretty Read from stdin: cat document.docx | docx-review extract - JSONL output: docx-review extract document.docx --format jsonl Notes: --format jsonl with no --only emits one block JSON object per line. --only comments --format jsonl emits one comment per line. --only track-changes --format jsonl emits one tracked change per line. Track changes modes: docx-review extract document.docx --mode paired docx-review extract document.docx --mode raw docx-review extract document.docx --mode both Useful flags: --no-comments --no-text-spans --include-raw-ids -v , -vv Add the crate: dependencies docx-review-core = "0.1" php use docx review core::extract from path; fn main - Result< , docx review core::Error { let document = extract from path "document.docx" ?; println "blocks: {}", document.blocks.len ; println "tracked changes: {}", document.tracked changes.len ; println "comments: {}", document.comments.len ; Ok } With options: use docx review core::{extract from path with opts, ExtractOptions, TrackChangesMode}; fn main - Result< , docx review core::Error { let mut opts = ExtractOptions::default ; opts.track changes mode = TrackChangesMode::Both; opts.include raw ids = true; let document = extract from path with opts "document.docx", opts ?; println "raw changes: {}", document.raw changes.len ; Ok } At a high level, extraction returns: Document.blocks - the normalized textual structure of the document Document.tracked changes - review-oriented change records Document.comments - comments, anchors, replies, and resolved state Document.raw changes - optional raw tracked changes when TrackChangesMode::Both is used - optional raw tracked changes when blocks are the main content surface. Comments and tracked changes are linked back to blocks by id. The current implementation is focused on review semantics and structural extraction. Designed for: - review metadata extraction from real Word documents - tracked changes and comment workflows - structural stories outside document.xml Not the current focus: - editing - image extraction - full numbering style reconstruction Run the CLI locally: cargo run -p docx-review-cli -- extract path/to/document.docx --pretty