exploit(dynamic behavioural sandbox, YARA, ClamAV, CVE patterns, polyglot & embedded-executable detection, JS AST deobfuscation),
document-integrity tampering(signature forensics, shadow documents, DocMDP/FieldMDP, trailer-chain & XRef integrity),
content-integrity / semantic-determinism(value/appearance V/AP divergence, font glyph remapping, OCR text-layer poisoning, /Alt & /ActualText AI-prompt injection, reading-order ambiguity β where a file shows one thing to a human and another to a parser/LLM), and neutral structure (including PDF 2.0 / ISO 32000-2 constructs β Associated Files /AF, unencrypted-wrapper / encrypted-payload detection, document-part hierarchy /DPartRoot, tagged-PDF namespaces) β backed by LightGBM + SHAP ML anomaly detection, six-parser differential comparison, fully offline threat intelligence (URLhaus Β· MalwareBazaar Β· ThreatFox Β· FeodoTracker Β· OpenPhish β 6.4M+ indicators, zero external API calls), and TLSH campaign attribution.
47 independent forensic passes with MITRE ATT&CK mapping, 24-tab browser including π€ AI Forensic Report (Qwen 2.5 1.5B β self-hosted, zero third-party AI), and raw forensics view.
π How it works β methodology, engines & comparisons β π¬ The research β Semantic Nondeterminism, proven across 24,824 PDFs β
π Case study β the property in the wild: 16,971 DOJ Epstein PDFs β
%%EOF
markers (exploit PDFs often carry multiple), audits cross-reference table depth, linearisation flags, and excessive filter chains used for obfuscation. Also detects linearized first-page object overridesβ incremental updates that re-define an existing Page 1 object (same OID) to inject JavaScript or actions. Renderers that fast-path Page 1 via the linearization hint table never re-evaluate the override on initial render, making the injected content invisible until the page is refreshed.
File-level polyglot detectionβ checks whether a recognised format magic signature (JPEG
FF D8 FF
, ZIP PK\x03\x04
, PNG, GIF, Gzip, OLE, RIFF) appears in the bytes before the %PDF-
header; ISO 32000 Β§7.5.2 NOTE 1 permits arbitrary pre-header bytes for legitimate reasons (e.g., PostScript DSC prefix), but a recognised format signature at byte 0 is characteristic of deliberate polyglot construction to bypass format-based email security gateways that classify files by their first bytes. PDF 2.0 (ISO 32000-2) structuresβ records the
/DPartRoot
document-part hierarchy (Β§14.12, PDF/VT) and tagged-PDF /Namespaces
(Β§14.7.4); both are neutral structural features, the latter part of the accessibility/semantic layer that reality-drift attacks target./JavaScript
/Launch
/OpenAction
/EmbeddedFile
/JBIG2Decode
/XFA
/RichMedia
, NOP sleds (%u9090
%u4141
), heapspray fills, and dangerous JS APIs: `eval()`
`unescape()`
`collab.getIcon()`
`util.printf()`
./S /Launch
, /S /JavaScript
, /RichMedia
, /XFA
). Reports exact xref numbers of suspicious objects.Producer
and Creator
fields for known exploit-tool strings (Metasploit, Canvas, Core Impact), flags missing metadata β a hallmark of crafted exploits β and scans XMP streams for embedded script references./JBIG2Decode
usage β the codec exploited in CVE-2009-0658 and CVE-2010-0188 β and for abnormally large /Widths
arrays used in historic heap-overflow attacks against Acrobat's font engine.0x0C
/ 0x0D
fill patterns).exiftool
for deep metadata extraction complementing PyMuPDF's view. Detects exploit-kit fingerprints in Creator/Producer/Author fields (Metasploit, msfvenom, Canvas, Core Impact), independently confirms XFA forms, and surfaces embedded attachment flags visible only via EXIF/XMP metadata layers. Results feed into the Correlation Engine.qpdf --check
to validate cross-reference tables, trailer dictionaries, and overall document structure. Intentionally malformed or "damaged" PDFs β where xref tables are deliberately broken β are a hallmark of exploit kits designed to hide objects from basic parsers while still rendering in vulnerable viewers. PDF 2.0 unencrypted-wrapper detection(ISO 32000-2 Β§7.6.7) β flags encrypted-payload documents: a clear cover page carrying an
/AF
file whose /AFRelationship
is /EncryptedPayload
(optionally with a /Collection
wrapper view). The real content is sealed inside an encrypted attachment that no static engine can read β a deliberate content-hiding construct graded on the document-integrity (tampering) axis, not a benign attachment. Results feed into the Correlation Engine.24 custom YARA rules targeting PDF-specific attack signatures: classic heap-spray patterns (
%u9090
, 0x0c0c
fills), CVE-specific byte patterns (CVE-2009-0658, CVE-2008-2992, CVE-2010-1240, CVE-2018-4990, CVE-2021 XFA, **CVE-2024-41869** use-after-free,
**CVE-2024-45112** type confusion), JavaScript shellcode s (
eval
+unescape
), hex-obfuscated keywords, auto-open executable combos, XFA+script exploits, Cobalt Strike beacon signatures, PowerShell encoded commands, Unicode obfuscation, and multi-layer encoder chains. Provides byte-level corroboration independent of PyMuPDF parsing. Results feed into the Correlation Engine./Launch
, unescape
, getIcon
, printf
, eval
, /EmbeddedFile
), and reports JavaScript object locations. Where our other engines parse bytes and structure, PeePDF provides a full second-opinion parse. Results feed into the Correlation Engine.executesthe PDF. Renders through
six independent enginesβ Ghostscript (PostScript + JS interpreter), MuPDF, Poppler, LibreOffice Draw (OLE/macro paths), Chromium PDFium (Chrome browser engine β the dominant modern PDF viewer), and pdf.js/Node (Firefox engine) β each inside a Linux process namespace with its own isolated network stack, PID space, and mount point. All syscalls are captured by
strace
. Detects: outbound network connections (beaconing in an isolated namespace is definitively malicious), anonymous executable memory mappings (the runtime signature of shellcode), unauthorised process spawning (shell execution from a CVE), filesystem escape attempts, and excessive fork/clone calls (process bombs). Static analysis sees the PDF's structure; this engine sees what it does.
Pdf.Exploit.*
family (CVE-2009-0927, CVE-2009-4324, Exploit.PDF-JS, and many more). Where the other engines use heuristics and structural analysis to catch zero-days, ClamAV provides authoritative signature intelligence for known samples. A match here means the file is a confirmed known threat.Bayesian contextual scoring adjusts risk based on document origin β a
JasperReports
or Microsoft Word
creator is dampened; a Metasploit/msfvenom creator is amplified. IsolationForest provides unsupervised anomaly detection from the very first scan β flags documents whose 38-feature vector is statistically unusual compared to the scanned population.
RandomForest + LightGBM classifiers activate once β₯10 labeled samples accumulate; bootstrap pseudo-labeling (unlabeled scans with raw_score β€ 5 treated as pseudo-benign) enables supervised activation from the first real malicious scan.
Explainable ML reports the top feature contributions via SHAP for each scan. Every scan is persisted to PostgreSQL. User feedback (false positive / confirmed threat) feeds directly into retraining.
six independent PDF parsers β
MuPDF(
mutool
), Poppler(
pdfinfo
/pdfdetach
), Ghostscript,
qpdf,
pdfminer, and
pdf.jsβ against the same file and cross-compares
8 structural dimensions: page count, object count, PDF version, JavaScript presence, encryption status, AcroForm presence, embedded file count, and OpenAction. Malicious PDFs abuse broken xref tables, hidden incremental updates, and duplicate object numbers so that one parser recovers hidden exploit objects that another ignores entirely. Seven discrepancy checks (Critical/High/Medium) flag page mismatches, JS visibility gaps, encryption oracle indicators, version header confusion attacks, hidden form action trees, and attachment count discrepancies β all invisible to single-parser scanners. A hard 30-second timeout guards the entire engine so no malformed PDF can stall the scan.
File-level polyglotβ checks whether a recognised format magic signature (
FF D8 FF
JPEG, PK\x03\x04
ZIP, PNG, GIF, Gzip, OLE, RIFF) appears in the bytes beforethe
%PDF-
header. ISO 32000 Β§7.5.2 NOTE 1 allows arbitrary bytes before %PDF-
for legitimate reasons, but a recognised format signature here is characteristic of deliberate polyglot construction. Attackers use JPEG+PDF and ZIP+PDF polyglots to bypass format-based routing in email security gateways that classify files by their first bytes β the gateway sees a JPEG image, the PDF payload executes in the viewer. Stream-level polyglotβ scans every PDF stream (raw and decompressed) for embedded executable magic:
PK\x03\x04
(ZIP/JAR), MZ
+PE header validation (Windows PE), \x7fELF
+class byte (Linux ELF), \xcf\xfa\xed\xfe
(Mach-O x64), `\xca\xfe\xba\xbe`
(Java class), `\xd0\xcf\x11\xe0`
(OLE/CFBF β Office binary), RAR, 7-Zip, `\x00asm`
(WebAssembly), HTML/XHTML, and embedded PostScript. JAR detection requires META-INF/MANIFEST.MF. PE detection requires the DOS e_lfanew pointer to resolve to a valid PE signature β eliminating false positives from font tables and binary streams. ELF detection validates the class byte (32-bit or 64-bit). Mid-stream scanning catches payloads prefixed by junk bytes./JS
strings and JavaScript-bearing compressed streams β then parses each through the Acorn parser to build a full Abstract Syntax Tree. Instead of scanning text for keywords, the AST walker detects
meaning:
`eval()`
and `execScript()`
dynamic execution entry points; `String.fromCharCode()`
arrays that assemble shellcode from integer sequences at runtime; unescape()
decode chains that two-stage-deliver encoded payloads; numeric arrays of 150+ elements (the structural signature of heap spray); and new Function(string)
dynamic code construction. These patterns are completely invisible to regex-based scanners but trivially visible at the AST level.fully offline local databasesβ
URLhaus(abuse.ch malware payload hashes + URLs),
**MalwareBazaar**(SHA-256 hash reputation + family labels),
**ThreatFox**(IOC indicators), and
FeodoTracker + OpenPhish(C2 IPs + phishing URLs) β totalling 6.4M+ indicators stored in local PostgreSQL tables.
Zero external API calls per scan, zero rate-limit dependency, zero data exfiltration. Databases are updated from public feeds on a periodic schedule. A hash match raises a critical indicator and auto-labels the scan. URLs extracted from the PDF are checked against URLhaus URL and domain feeds and the OpenPhish phishing URL list. No hash, URL, or file byte is transmitted externally.
ByteRange coverage integrityβ per ISO 32000 Β§12.8.1, offsets are measured from the
%PDF-
header (files may carry bytes before it): o1
must be 0 (non-zero leaves the file header unsigned), both segments must stay within file bounds, the inner gap between segments must contain only the /Contents
blob, and o2+l2
must reach at least the %%EOF
marker. Shadow document detection vs full-save rewrite detectionβ when
o2+l2 < %%EOF
, the engine inspects the unsigned trailing region: if it contains execution vectors (/JavaScript
, /Launch
) it is a shadow attack (CVE-2019-14980 class); if it contains xref/trailer/startxref
structure without active content, it is a full-save rewriteβ a PDF viewer rebuilt the entire file with new byte offsets, invalidating all existing signatures while leaving the visual signature appearance intact (used by DocuSign and similar tools). Both are flagged at high severity with the distinction clearly reported.
β all-zero blobs, sub-32-byte blobs, and missing DER SEQUENCE headers all indicate a structurally signed but cryptographically empty document.
/Contents
structural validationSubFilter deprecationβ
adbe.pkcs7.sha1
(SHA-1 collision attack surface), adbe.x509.rsa_sha1
(no embedded cert chain), and unknown SubFilters. Weak digest algorithm detectionβ MD5 and SHA-1 enable collision-assisted shadow document forgery. Post-signature object injection detection flags execution vectors added in incremental updates after signing.
urgency/deception phrase detection(30+ patterns: "login required", "verify your account", "prize notification", "limited time");
brand impersonation keywords (Microsoft, Apple, Amazon, PayPal, DocuSign, Adobe, DHL, IRS, and others);
credential harvesting via AcroForm β detects
SubmitForm
actions combined with password-type form fields, the structural signature of a phishing PDF form; QR code decodingβ renders every PDF page as an image and extracts all embedded QR codes via
zbarimg
, then checks decoded URLs for suspicious domains (IP addresses, non-HTTPS schemes, URL shorteners). High urgency phrase density combined with brand impersonation is scored as a high-confidence phishing indicator.pdfdetach
(Poppler) to extract every embedded file attachment, then forensically analyses each: magic byte detection for Windows PE (
MZ
), Linux ELF (`\x7fELF`
), OLE/CFBF (`\xd0\xcf`
β Office binary), OOXML archives, script files (.bat
, .ps1
, .vbs
, .sh
), and 7-Zip/RAR containers; VBA macro detection in OOXML β inspects
xl/vbaProject.bin
, word/vbaProject.bin
, and ppt/vbaProject.bin
entries inside Office attachments; strings extraction on executable payloads to surface suspicious API calls, IP addresses, or command-line arguments. A PDF carrying a PE executable is a confirmed dropper β scored critical regardless of other indicators.
TLSH(Trend Locality Sensitive Hash) of the full PDF β a similarity-preserving hash where similar files produce similar hashes, enabling fuzzy matching unlike SHA-256. TLSH score <30 indicates near-identical files (same exploit kit generation); <100 indicates the same malware campaign family. The hash is compared against every confirmed-malicious PDF previously scanned and stored in the PostgreSQL database. A cluster match surfaces the associated campaign context. The structural fingerprint (object counts, stream sizes, action types) is also computed for samples too small for TLSH. Falls back to
ssdeep
if TLSH is unavailable.JavaScript on fields(
/A
and /AA
actions β JS fires on focus, blur, keystroke, validate, or calculate events, invisible during static review); hidden fields(NoExport flag β fields not shown to the user but present in submitted data);
password-type fields(credential harvesting indicators); SubmitForm exfiltration targetsβ the URL(s) to which all field data is POSTed;
/AA additional-action JS triggers on field objects (secondary execution vector independent of /OpenAction);
calculation order (/CO)β adversaries reorder field calculations to chain JS evaluations across fields, enabling multi-step payload staging hidden inside form arithmetic.
Value / Appearance Stream (V/AP) divergence detectionβ five rendering-free checks:
/NeedAppearances true
(ISO 32000 Β§12.7.2 β stale AP flag, critical when a digital signature is present since signed bytes cover the stale AP while the viewer renders regenerated content); checkbox and radio /V
vs /AS
key mismatch (definitive, zero-false-positive β the displayed state and the stored data value structurally disagree); text, listbox, and combobox field AP stream text extraction with font encoding remap(decompresses
/AP /N
, extracts Tj
/TJ
operators, resolves any /Encoding /Differences
table in the AP font so byte codes are translated to their rendered glyphs before comparison β catches font-remap evasion where a custom encoding makes byte 0x31 render as β9β instead of β1β; compares rendered text to /V
; listbox multi-select /V
arrays joined β catches βI agree to $1,000β displayed as βI agree to $10β, or dropdown showing Option A while /V
holds Option B); image-based AP stream detection(AP stream invokes an image XObject via
Do
with no text operators β the field displays a rasterised image while /V
holds a different binding value inside the signed byte range; not comparable without image recognition, flagged as high severity requiring manual review); and blank AP stream detection (AP stream draws no content, hiding the field value completely). All enumerated field /V
values are collected and passed to Engine γΆ (JS Behavioral Emulation) so doc.getField()
returns real file values during JavaScript execution. Results feed into the Correlation Engine.%%EOF
boundary and extracts per-revision metadata: author, producer, modification date, and new/modified/deleted object counts for each revision. Detects author identity changes between revisions,
execution vectors injected after the original document was created, and
large object injections in the final revision β the structural signature of automated exploit staging. Results feed into the Correlation Engine.
dangerous URI schemes(
javascript:
, data:
, file://
, vbscript:
); JavaScript action triggers on annotation interaction;
/Launch actions that spawn arbitrary programs;
GoToR remote links that open external files; and
SubmitForm actions that exfiltrate form data. Annotation-borne payloads are completely invisible to scanners that only analyse raw bytes or page content streams. Results feed into the Correlation Engine.
PDF 2.0 Associated Files (/AF)β enumerates the ISO 32000-2 Β§7.11.4 Associated Files mechanism at document and page level, recording each
/AFRelationship
type (Source
, Data
, EncryptedPayload
, Alternative
, Supplement
). Associated files are a modern attachment surface a legacy /EF
-only walk misses; the attached streams themselves are passed to the Embedded File engine. Catalogues the full PDF action infrastructure: Named JavaScript Registry(
/Names /JavaScript
subtree β persistent JS objects callable by name from any action); /AA Additional Actions(event-driven triggers on page open/close, print, save, field events);
/OpenAction classification(JavaScript, Launch, GoToR, URI, GoTo);
/Perms cryptographic permission restrictions; and
UR3 usage-rights signatures used to exploit extended viewer features.
Deep DocMDP forensicsβ parses the
/P
permission level from /TransformParams
(1 = no changes, 2 = form fill-ins, 3 = annotations and form fill-ins β the most exploitable); flags missing or out-of-spec /P
values; checks /SigFlags
AppendOnly bit; validates presence of /ByteRange
and /Reference
array; detects incremental updates that violate the MDP constraint (any JavaScript is prohibited under all /P
levels); and flags multiple /DocMDP
entries (validator confusion spoofing). FieldMDP per-signature field lock(ISO 32000 Β§12.8.2.4, "File MDP") β unlike DocMDP which is document-level, FieldMDP locks specific named form fields per approval signature and can be selectively permissive: flags
Action=Include
with an empty /Fields
array (locks no fields despite appearing to certify), Action=Exclude
with named fields (those fields are explicitly notlocked, leaving them modifiable post-signing), and incremental updates that contain form-field modifications (
/Widget
, /AcroForm
) after a FieldMDP signature is in place β a constraint violation detectable across Acrobat and pdf.js where validators differ on whether they check field names against the locked set. Results feed into the Correlation Engine.dangerous PostScript execution operators:
exec
(dynamic code execution), `run`
(file execution), `token`
(string-to-code eval), `setpagedevice`
(PostScript-to-system passthrough β bridges to the PostScript interpreter from PDF context). Also detects ICC color profile abuseβ malformed
/ICCBased
profiles of anomalous size exploit heap buffer overflows (CVE-2021-21017 class). Flags content bombs: streams exceeding 5 MB that may exhaust parser memory or conceal data in oversized objects. Results feed into the Correlation Engine.
/ObjStm
stream. Scanners that only search raw bytes will miss any object inside a compressed container. This engine decompresses every /ObjStm
in the document and re-scans the decompressed content: JavaScript,
/Launch actions,
/EmbeddedFile references, and
high-entropy payloads(entropy >7.5 bits) that suggest encrypted content hidden inside compressed object bundles. Complements the Stream Inspector (Engine 3) with object-container-specific forensics. Results feed into the Correlation Engine.
/J#61vaScript
is syntactically identical to /JavaScript
to every PDF parser, but bypasses every raw-byte keyword scanner. This engine decodes all hex escapes in name tokens and checks whether the decoded name matches a dangerous keyword β JavaScript, Launch, OpenAction, EmbeddedFile, XFA, SubmitForm, JBIG2Decode, and 10 more. Also detects
whitespace-split keywords(e.g.
/Ja\nvaScript
β valid PDF whitespace normalised by parsers but invisible to substring search), formfeed byte injection(0x0C β a valid PDF whitespace delimiter used instead of space to confuse tools that only accept 0x20/0x09/0x0A as delimiters), and
null byte injection in header regions(0x00 bytes outside binary streams β evasion technique against C-string comparison scanners). Finds obfuscation that Engine 2 (byte patterns) misses because it searches for literal keyword bytes. Results feed into the Correlation Engine.
FormCalc scripts(
`xfa.host.exec()`
, `Url.resolve()`
, `xfa.host.openURL()`
), **submit actions**(
`<submit action="http://...">`
β direct form field exfiltration), initialize event auto-execution(fires on open without /OpenAction, bypassing standard detection), and
remote template inheritance(
mergeMode="matchTemplate"
pulling malicious XDP from a remote server). Fills a genuine gap in the entire PDF forensics industry.directed graph of the entire action space and reasons about its topology. Graph analysis:
cycles(AβBβA crash loops in vulnerable readers),
depth >5(deep action chains indicate automated exploit construction),
fan-in(single JavaScript object triggered by >8 distinct events β trigger maximization for reliability), and
dead action nodes(JavaScript/Launch objects with no inbound edges from /Root traversal β sleeper payloads reachable only through vulnerabilities). Computes graph density and edge count. Architecturally unique β no existing tool does this.
default-OFF layers containing annotations with URI/JS actions (hidden until user interaction β social engineering trap),
**Never-state layers**(
`/View [/Never]`
β content invisible to all viewer UI but accessible to the PDF rendering engine), screen/print divergence(visible on screen but not in print, or vice versa β hides content from PDF review while delivering it during printing), and
viewer-conditional layers(activate only on specific
app.viewerType
or app.viewerVersion
β targeted attack indicator). Including VMRay β it sandboxes the default view, not OCG state variants.page.get_text("rawdict")
to access raw rendering data: Rendering mode 3(invisible β clips to outline only, not rendered or printed β hides machine-readable content from visual review),
mode 7(fill+clip β invisible fill, part of accessibility tree β screen reader sees it, human doesn't),
white text on white background(visually invisible but present in copy-paste and accessibility tree). Unicode scan:
U+202E RIGHT-TO-LEFT OVERRIDE(makes
invoice_FDP.exe
display as invoice_exe.PDF
), U+200B ZERO WIDTH SPACE in field names breaking security tool matching,
U+00AD SOFT HYPHEN in URLs evading pattern matchers, and
homograph domain attacks using Cyrillic/Greek lookalike characters in URLs.
/Prev
pointer chain at the binary level. For each revision boundary: /ID array mutation(document fingerprint changes mid-document = document substitution attack),
/Encrypt changes(encryption added or removed after signing β not covered by Engine 21's byte-range check alone),
/Root OID change(entire document catalog swapped between revisions while retaining a valid signature over the original β Shadow Attack / PDF signature bypass),
/Size shrinkage(incremental update that reduces declared object count β objects being hidden), and
post-%%EOF data(raw bytes after final %%EOF marker β invisible to all PDF-aware parsers, used for payload staging or steganography). No commercial tool does this at the raw trailer-chain level.
**CCITTFaxDecode**: K<-1 (OOB in some decoders), Columns/Rows >65535 (integer overflow).
**JBIG2Decode**: JBIG2Globals presence (shared segment attack β CVE-2009-0658).
DCTDecode: ColorTransform/ColorSpace mismatch (color channel confusion in decoders).
LZWDecode EarlyChange=0: historically exploited code path in Acrobat's LZW implementation.
RunLengthDecode decompression bomb: extreme compression ratio.
Multi-filter chains(β₯3 chained filters β legitimate PDFs essentially never need this, used purely to evade single-layer scanners).
ASCIIHex+Flate double-encoding obfuscation.
Crypt /Identity bypassβ nominally encrypted but identity-transform, bypasses scanners that skip encrypted content.
post-%%EOF data(any bytes after the final %%EOF not part of a valid incremental update β raw executables appended are invisible to all PDF-aware parsers),
high entropy in non-stream zones(>7.2 bits/byte in comment blocks or trailer area β indicates binary data hidden as PDF comments),
entropy cliffs between revisions(sudden entropy jump at an incremental update boundary not corresponding to a new compressed stream = injected encrypted payload), and
near-zero entropy in compressed streams(nearly all-zero FlateDecode input that expands to massive output β decompression bomb precursor). Zone averages computed for header, comment, stream, non-stream, and post-EOF regions.
Chi-square LSB test: in natural images, LSBs are near-uniform; in LSB stego they are exactly uniform (ratio <0.005 from 50/50) β flags stego candidates with statistical confidence.
Tracking beacons: 1Γ2px images that are fetched when the PDF is rendered, revealing opener IP, timestamp, and user agent to the attacker.
JPEG EXIF/COM anomalies: APP1/COM segments >512 bytes in simple images β common steganography carrier used by APT groups to embed C2 configuration in logo images.
Colorspace mismatches: declared /DeviceGray but actual RGB (extra channels used for data hiding).
Duplicate visual content with different raw bytes (per-recipient tracking watermarks). Requires β₯2 simultaneous indicators to flag, controlling false positive rate.
pdfaid:conformance
, pdfaid:part
, GTS_PDFXVersion
and cross-checks against PDF/A prohibitions: JavaScript(forbidden in all PDF/A variants),
**/Launch actions**(forbidden),
**embedded files**(forbidden in PDF/A-1),
**encryption**(forbidden in PDF/A),
**URI actions**(forbidden in PDF/A-1). Also detects
fake compliance markersβ raw strings claiming compliance without proper XMP metadata (string injection to confuse parsers). A finding of "Claims PDF/A-1b but contains JavaScript" is unambiguous evidence of deliberate DLP bypass.
vm.runInNewContext
with a complete PDF Acrobat API stub: `app.launchURL()`
, `doc.submitForm()`
, `doc.getURL()`
, `doc.mailDoc()`
, `app.openDoc()`
. from the PDF (collected by Engine γ) β not a hardcoded empty string β so conditional exploitation chains such as
doc.getField(name)
returns the actual /V
valueif (doc.getField('status').value == 'approved') { app.launchURL(c2) }
are correctly followed; SUBMIT_FORM
events carry real field content. doc.numFields
reflects the true field count. Captures: LAUNCH_URL(C2 beacon URL decoded at runtime),
SUBMIT_FORM(credential exfiltration target URL),
GET_URL(network fetch), MAIL_DOC(email exfiltration target),
DYNAMIC_FUNCTION(runtime code generation payload). Six-pass multi-eval resolution unwraps nested deobfuscation chains. Catches deobfuscated URLs and payloads invisible to static AST analysis, in under 1 second, with no OS fingerprinting evasion vectors.
subroutine call depth >10(nested
callsubr
/callgsubr
β relevant CVE classes: depth exhaustion), missing(program terminates without proper end operator β parser overread),
endchar
seac OOB(standard encoding accented character operator with character codes >127 β out-of-bounds indexing),
othersubr argument overflow(>16 arguments), and oversized CharString programs(>2000 bytes β legitimate glyphs are almost always <500). Identifies the specific glyph name and OID for each finding.
Phantom objects: referenced by OID in another object but absent from xref β parser-confusion attack where some readers synthesize the missing object.
Orphan objects: present in xref, not reachable from /Root β staged payload waiting for a vulnerability to make it reachable (especially dangerous when type is JavaScript, Action, or EmbeddedFile).
Xref F-entry on live objects: object marked as free in xref but still referenced β exploits free-entry dereferencing bugs in vulnerable readers.
OID type confusion: same OID used for different /Type objects across revisions (type confusion exploits object recycling).
Stream /Length falsification: declared /Length doesn't match actual stream size β causes buffer over/underread in decoders (broad CVE class).
/StructTreeRoot object graph for
/Alt alt-text on images and figures, and
/ActualText override properties on text spans. Detects 20+ prompt injection patterns:
ignore previous instructions
, ignore prior instructions
, act as
, you are now
, jailbreak
, system prompt
, disregard
, override
, and more. Critical findings: injection strings in alt-text cause LLMs processing accessibility metadata to follow attacker instructions while showing clean visual content.
/ActualText semantic override: when ActualText differs substantially from the visual glyph sequence, the PDF presents different content to screen readers and AI text extractors than to sighted readers. Checks for
anomalous heading density(>30% of structure elements are headers β used to manipulate document outline-based context extraction). Maps to MITRE ATT&CK T1027 (Obfuscated Files/Information).
hidden text layerβ content embedded via invisible text mode (rendering mode 3) or synthetic text-behind-image that is invisible to readers but extracted verbatim by AI ingestion pipelines and search engines. This engine targets genuinely
scanned pages β a single dominant image covering >60% of the page (a page scan), not a design collage of many small graphics over vector text β that also carry substantial extracted text (>30 characters), renders each such page using PyMuPDF at
1Γ (~100 DPI), runs Tesseract OCR, and computes
Jaccard word-set similarity between the OCR transcript and the PDF text extraction. A similarity below
0.30 on any page is flagged as a
critical mismatch, indicating that the document's visual content and its machine-readable text layer contain fundamentally different information. To avoid false positives on figure and photo pages, the check runs only where the rendered image itself OCRs to substantial real text; a page whose image legitimately yields little or no text is excluded rather than scored as a mismatch (the blank-overlay and invisible-text variants are caught by the dedicated invisible-text and prompt-injection engines). Common in CVE-class phantom text injection, AI training data poisoning, and DLP bypass. Maps to MITRE ATT&CK T1036 (Masquerading).
Overlapping text blocks(>30% bounding-box area overlap between two text blocks on the same page) β parsers choose different merge/ordering strategies;
Multi-column layouts(β₯2 X-coordinate clusters each containing β₯3 text blocks with >30% vertical span overlap) β linear extraction merges columns row-by-row instead of column-by-column, producing nonsensical interleaved content;
RTL/LTR direction mixing(10β90% of characters have RTL Unicode bidirectional class on the same page) β bidirectional text reordering rules produce different output depending on parser BiDi algorithm version. All three patterns are used in contracts, compliance documents, and AI training datasets to make the same PDF file present different information to a legal reviewer (visual rendering) versus an AI system (text extraction). Maps to MITRE ATT&CK T1059 (Command and Scripting Interpreter β document interpretation ambiguity subclass).
combinationsthat are orders of magnitude more serious than their parts.
DocMDP compound patterns: DocMDP bypass + JavaScript (certified document carrying an execution vector β critical); DocMDP bypass + weak algorithm (collision-assisted forgery surface β critical); DocMDP P=3 + JavaScript (annotation-triggered execution under certification); multiple DocMDP transforms + signature (validator confusion spoofing); ByteRange not-from-zero + MDP (file header left unsigned under certification); all-zero or sub-32-byte /Contents (structurally signed, cryptographically empty).
FieldMDP compound patterns: FieldMDP Include+empty fields + any active content (locking-nothing certification with live payload); FieldMDP Exclude bypass + incremental form modification (selectively unlocked field modified post-signing); FieldMDP + JavaScript in incremental update (execution vector under per-field certification).
Signature patterns: ByteRange gap + JavaScript (shadow attack confirmed); ByteRange full-save rewrite + active content (cryptographically unsigned document with live payload); post-signature revision + execution vector.
Polyglot patterns: file-level polyglot (JPEG/ZIP header before %PDF-) + active content. Other engines: XFA auto-execute + submit exfiltration, OCG hidden layer + active content, RLO + phishing indicators, trailer mutation + signature forensics, entropy cliff + high-risk content, compliance fraud + JavaScript (confirmed DLP bypass), stego beacon + network indicators, phantom objects + JavaScript reachability. Classic: /OpenAction+JS, JBIG2+JS CVE, eval()+unescape() shellcode, heapspray+JS. Multi-engine vote weighting: indicators confirmed by multiple engines receive logarithmically scaled bonus scoring. 60+ compound patterns.
self-hosted Qwen 2.5 1.5B Instruct Q4_K_M LLM synthesises the structured scan output into a concise forensic verdict. The model outputs seven structured fields:
threat_verdict(MALICIOUS / SUSPICIOUS / LIKELY_CLEAN / CLEAN),
confidence,
executive_summary(one sentence), key_findings(signal + severity + MITRE ID),
observed_techniques(MITRE ATT&CK IDs and names),
recommended_actions, and
false_positive_note. Verdict is PHP-enforced against the engine-computed risk_score so it cannot contradict the quantitative analysis β the AI adds narrative and correlation, not classification override. Runs on an isolated
remotellm
node over an encrypted WireGuard VPN tunnel β zero third-party AI, no data leaves your infrastructure. Typical latency: ~15β25 s at ~13 tokens/s (CPU-only, no GPU). Served by
llama-server
(llama.cpp) as a systemd service.10 MB limitβ covers 99% of real-world malicious PDFs (most are under 5 MB). Need to scan larger files?
Enterprise deploymentremoves all size limits.