docx-review
is a review-oriented DOCX extraction toolkit for Rust.
It reads .docx
files directly from OOXML and produces structured JSON for:
- document blocks
- tracked changes
- comments and comment threads
- text anchors
- headers, footers, footnotes, and endnotes
The project is designed as headless infrastructure for automation, review workflows, AI pipelines, and downstream tools that need more than plain text.
docx-review
currently supports:
-
paragraphs
-
table cells as flat blocks
-
tracked changes:
-
insert
-
delete
-
replacement
-
move
-
format change
-
comments
-
comment anchors and anchored text
-
comment threading and resolved state from
commentsExtended.xml -
footnotes and endnotes
-
headers and footers
-
list item detection by nesting level
-
text spans with tracked-change markers
crates/core
docx-review-core
- extraction library and normalized data model
crates/cli
docx-review
- command-line interface
Install the CLI from crates.io:
cargo install docx-review-cli
This installs the docx-review
command.
Basic extraction:
docx-review extract document.docx
Pretty JSON:
docx-review extract document.docx --pretty
Tracked changes only:
docx-review extract document.docx --only track-changes --pretty
Comments only:
docx-review extract document.docx --only comments --pretty
Read from stdin:
cat document.docx | docx-review extract -
JSONL output:
docx-review extract document.docx --format jsonl
Notes:
--format jsonl
with no--only
emits one block JSON object per line.--only comments --format jsonl
emits one comment per line.--only track-changes --format jsonl
emits one tracked change per line.
Track changes modes:
docx-review extract document.docx --mode paired
docx-review extract document.docx --mode raw
docx-review extract document.docx --mode both
Useful flags:
--no-comments
--no-text-spans
--include-raw-ids
-v
,-vv
Add the crate:
[dependencies]
docx-review-core = "0.1"
php
use docx_review_core::extract_from_path;
fn main() -> Result<(), docx_review_core::Error> {
let document = extract_from_path("document.docx")?;
println!("blocks: {}", document.blocks.len());
println!("tracked changes: {}", document.tracked_changes.len());
println!("comments: {}", document.comments.len());
Ok(())
}
With options:
use docx_review_core::{extract_from_path_with_opts, ExtractOptions, TrackChangesMode};
fn main() -> Result<(), docx_review_core::Error> {
let mut opts = ExtractOptions::default();
opts.track_changes_mode = TrackChangesMode::Both;
opts.include_raw_ids = true;
let document = extract_from_path_with_opts("document.docx", opts)?;
println!("raw changes: {}", document.raw_changes.len());
Ok(())
}
At a high level, extraction returns:
Document.blocks
- the normalized textual structure of the document
Document.tracked_changes
- review-oriented change records
Document.comments
- comments, anchors, replies, and resolved state
Document.raw_changes
- optional raw tracked changes when
TrackChangesMode::Both
is used
- optional raw tracked changes when
blocks
are the main content surface. Comments and tracked changes are linked back to blocks by id.
The current implementation is focused on review semantics and structural extraction.
Designed for:
- review metadata extraction from real Word documents
- tracked changes and comment workflows
- structural stories outside
document.xml
Not the current focus:
- editing
- image extraction
- full numbering style reconstruction
Run the CLI locally:
cargo run -p docx-review-cli -- extract path/to/document.docx --pretty