Show HN: A PDF analysis tool for parser and representation differences

A new PDF analysis tool detects parser and representation differences in documents, using 47 forensic passes with MITRE ATT&CK mapping and LightGBM machine learning to identify exploits, tampering, and semantic nondeterminism. The tool performs differential comparisons across six parsers, analyzes polyglot file structures and PDF 2.0 constructs, and runs fully offline with 6.4 million threat indicators from sources including URLhaus and MalwareBazaar. A self-hosted AI forensic report feature using Qwen 2.5 1.5B provides automated analysis without third-party API calls.

Findings are graded on four axes: exploit (dynamic behavioural sandbox, YARA, ClamAV, CVE patterns, polyglot & embedded-executable detection, JS AST deobfuscation); document-integrity tampering (signature forensics, shadow documents, DocMDP/FieldMDP, trailer-chain & XRef integrity); content-integrity / semantic-determinism (value/appearance V/AP divergence, font glyph remapping, OCR text-layer poisoning, /Alt & /ActualText AI-prompt injection, reading-order ambiguity: cases where a file shows one thing to a human and another to a parser/LLM); and neutral structure, including PDF 2.0 / ISO 32000-2 constructs (Associated Files /AF, unencrypted-wrapper / encrypted-payload detection, document-part hierarchy /DPartRoot, tagged-PDF namespaces).

These are backed by LightGBM + SHAP ML anomaly detection, six-parser differential comparison, fully offline threat intelligence (URLhaus · MalwareBazaar · ThreatFox · FeodoTracker · OpenPhish; 6.4M+ indicators, zero external API calls), and TLSH campaign attribution. The tool offers 47 independent forensic passes with MITRE ATT&CK mapping, a 24-tab browser including the 🤖 AI Forensic Report (Qwen 2.5 1.5B, self-hosted, zero third-party AI), and a raw forensics view.
📖 How it works: methodology, engines & comparisons → /pdf-malware-scanner.php
🔬 The research: Semantic Nondeterminism, proven across 24,824 PDFs → /research.php
📄 Case study: the property in the wild, 16,971 DOJ Epstein PDFs → /pdf-epstein-forensics.php

Counts %%EOF markers (exploit PDFs often carry multiple) and audits cross-reference table depth, linearisation flags, and excessive filter chains used for obfuscation. Also detects linearized first-page object overrides: incremental updates that re-define an existing Page 1 object (same OID) to inject JavaScript or actions. Renderers that fast-path Page 1 via the linearization hint table never re-evaluate the override on initial render, making the injected content invisible until the page is refreshed.

File-level polyglot detection: checks whether a recognised format magic signature (JPEG FF D8 FF, ZIP PK\x03\x04, PNG, GIF, Gzip, OLE, RIFF) appears in the bytes before the %PDF- header. ISO 32000 §7.5.2 NOTE 1 permits arbitrary pre-header bytes for legitimate reasons (e.g., a PostScript DSC prefix), but a recognised format signature at byte 0 is characteristic of deliberate polyglot construction to bypass format-based email security gateways that classify files by their first bytes.

PDF 2.0 (ISO 32000-2) structures: records the /DPartRoot document-part hierarchy (§14.12, PDF/VT) and tagged-PDF /Namespaces (§14.7.4). Both are neutral structural features; the latter is part of the accessibility/semantic layer that reality-drift attacks target.

Byte-pattern scan for /JavaScript, /Launch, /OpenAction, /EmbeddedFile, /JBIG2Decode, /XFA, /RichMedia; NOP sleds (%u9090, %u4141); heap-spray fills; and dangerous JS APIs: eval, unescape, collab.getIcon, util.printf.

Flags /S /Launch, /S /JavaScript, /RichMedia, and /XFA. Reports exact xref numbers of suspicious objects.

Checks Producer and Creator fields for known exploit-tool strings (Metasploit, Canvas, Core Impact), flags missing metadata (a hallmark of crafted exploits), and scans XMP streams for embedded script references.
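The file-level polyglot check described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function name, the signature table, and the 1024-byte search window for the %PDF- header are all assumptions made for the example.

```python
# Magic signatures to look for at byte 0 (illustrative subset of the formats
# named in the text; the real engine's table is not published here).
MAGIC_SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",
    b"PK\x03\x04": "ZIP",
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"\x1f\x8b": "Gzip",
    b"\xd0\xcf\x11\xe0": "OLE",
    b"RIFF": "RIFF",
}


def detect_file_polyglot(data: bytes, search_window: int = 1024):
    """Return (header_offset, format_name).

    header_offset is where %PDF- was found (-1 if absent from the window).
    format_name is the recognised format at byte 0, or None when the
    pre-header bytes carry no known signature (e.g. a PostScript prefix).
    """
    header_offset = data.find(b"%PDF-", 0, search_window)
    if header_offset <= 0:
        # -1: no header in the window; 0: nothing precedes the header.
        return header_offset, None
    for magic, name in MAGIC_SIGNATURES.items():
        if data.startswith(magic):
            return header_offset, name
    return header_offset, None
```

A JPEG prefix in front of the header is reported with its format name, while a plain PostScript comment prefix yields a header offset but no format, matching the distinction the text draws between legitimate pre-header bytes and a recognised signature.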
Checks for /JBIG2Decode usage (the codec exploited in CVE-2009-0658 and CVE-2010-0188) and for abnormally large /Widths arrays used in historic heap-overflow attacks against Acrobat's font engine.

0x0C / 0x0D fill patterns.

Runs exiftool for deep metadata extraction, complementing PyMuPDF's view. Detects exploit-kit fingerprints in Creator/Producer/Author fields (Metasploit, msfvenom, Canvas, Core Impact), independently confirms XFA forms, and surfaces embedded attachment flags visible only via EXIF/XMP metadata layers. Results feed into the Correlation Engine.

Runs qpdf --check to validate cross-reference tables, trailer dictionaries, and overall document structure. Intentionally malformed or "damaged" PDFs, where xref tables are deliberately broken, are a hallmark of exploit kits designed to hide objects from basic parsers while still rendering in vulnerable viewers.

PDF 2.0 unencrypted-wrapper detection (ISO 32000-2 §7.6.7): flags encrypted-payload documents, i.e. a clear cover page carrying an /AF file whose /AFRelationship is /EncryptedPayload (optionally with a /Collection wrapper view). The real content is sealed inside an encrypted attachment that no static engine can read. This is a deliberate content-hiding construct graded on the document-integrity tampering axis, not a benign attachment. Results feed into the Correlation Engine.

24 custom YARA rules targeting PDF-specific attack signatures: classic heap-spray patterns (%u9090, 0x0c0c fills), CVE-specific byte patterns (CVE-2009-0658, CVE-2008-2992, CVE-2010-1240, CVE-2018-4990, CVE-2021 XFA, CVE-2024-41869 use-after-free, CVE-2024-45112 type confusion), JavaScript shellcode loaders (eval + unescape), hex-obfuscated keywords, auto-open executable combos, XFA+script exploits, Cobalt Strike beacon signatures, PowerShell encoded commands, Unicode obfuscation, and multi-layer encoder chains. Provides byte-level corroboration independent of PyMuPDF parsing. Results feed into the Correlation Engine.
PeePDF flags /Launch, unescape, getIcon, printf, eval, and /EmbeddedFile, and reports JavaScript object locations. Where our other engines parse bytes and structure, PeePDF provides a full second-opinion parse. Results feed into the Correlation Engine.

This engine executes the PDF. It renders through six independent engines: Ghostscript (PostScript + JS interpreter), MuPDF, Poppler, LibreOffice Draw (OLE/macro paths), Chromium PDFium (Chrome browser engine, the dominant modern PDF viewer), and pdf.js/Node (Firefox engine), each inside a Linux process namespace with its own isolated network stack, PID space, and mount point. All syscalls are captured by strace. Detects: outbound network connections (beaconing in an isolated namespace is definitively malicious), anonymous executable memory mappings (the runtime signature of shellcode), unauthorised process spawning (shell execution from a CVE), filesystem escape attempts, and excessive fork/clone calls (process bombs). Static analysis sees the PDF's structure; this engine sees what it does.

Signature matching against the Pdf.Exploit. family (CVE-2009-0927, CVE-2009-4324, Exploit.PDF-JS, and many more). Where the other engines use heuristics and structural analysis to catch zero-days, ClamAV provides authoritative signature intelligence for known samples. A match here means the file is a confirmed known threat.

Bayesian contextual scoring adjusts risk based on document origin: a JasperReports or Microsoft Word creator is dampened; a Metasploit/msfvenom creator is amplified. IsolationForest provides unsupervised anomaly detection from the very first scan, flagging documents whose 38-feature vector is statistically unusual compared to the scanned population. RandomForest + LightGBM classifiers activate once ≥10 labeled samples accumulate; bootstrap pseudo-labeling (unlabeled scans with raw score ≤ 5 treated as pseudo-benign) enables supervised activation from the first real malicious scan. Explainable ML reports the top feature contributions via SHAP for each scan.
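The activation rule for the supervised classifiers can be illustrated with a small gating function. The thresholds (10 labeled samples, raw score ≤ 5 as pseudo-benign) come from the text; the `Scan` type, the function names, and the requirement that both classes be present before training are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_LABELED = 10        # from the text: classifiers activate at >= 10 labeled samples
PSEUDO_BENIGN_MAX = 5   # from the text: unlabeled raw score <= 5 counts as pseudo-benign


@dataclass
class Scan:
    """Hypothetical record of one persisted scan."""
    raw_score: float
    label: Optional[str] = None  # "malicious", "benign", or None if unlabeled


def training_labels(scans: List[Scan]) -> List[int]:
    """Return 1 (malicious) / 0 (benign) for every scan usable in training."""
    labels = []
    for scan in scans:
        if scan.label == "malicious":
            labels.append(1)
        elif scan.label == "benign":
            labels.append(0)
        elif scan.raw_score <= PSEUDO_BENIGN_MAX:
            labels.append(0)  # bootstrap pseudo-label
        # Unlabeled scans with a higher score are left out of training.
    return labels


def supervised_ready(scans: List[Scan]) -> bool:
    """True once enough labels exist and both classes are represented (assumed)."""
    labels = training_labels(scans)
    return len(labels) >= MIN_LABELED and len(set(labels)) == 2
```

With nine low-scoring unlabeled scans already stored, the first confirmed malicious scan brings the count to ten and supplies the second class, which is one reading of "supervised activation from the first real malicious scan".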
Every scan is persisted to PostgreSQL. User feedback (false positive / confirmed threat) feeds directly into retraining.

Runs six independent PDF parsers (MuPDF mutool, Poppler pdfinfo / pdfdetach, Ghostscript, qpdf, pdfminer, and pdf.js) against the same file and cross-compares 8 structural dimensions: page count, object count, PDF version, JavaScript presence, encryption status, AcroForm presence, embedded file count, and OpenAction. Malicious PDFs abuse broken xref tables, hidden incremental updates, and duplicate object numbers so that one parser recovers hidden exploit objects that another ignores entirely. Seven discrepancy checks (Critical/High/Medium) flag page mismatches, JS visibility gaps, encryption oracle indicators, version header confusion attacks, hidden form action trees, and attachment count discrepancies, all invisible to single-parser scanners. A hard 30-second timeout guards the entire engine so no malformed PDF can stall the scan.

File-level polyglot: checks whether a recognised format magic signature (FF D8 FF JPEG, PK\x03\x04 ZIP, PNG, GIF, Gzip, OLE, RIFF) appears in the bytes before the %PDF- header. ISO 32000 §7.5.2 NOTE 1 allows arbitrary bytes before %PDF- for legitimate reasons, but a recognised format signature here is characteristic of deliberate polyglot construction. Attackers use JPEG+PDF and ZIP+PDF polyglots to bypass format-based routing in email security gateways that classify files by their first bytes: the gateway sees a JPEG image, the PDF payload executes in the viewer.

Stream-level polyglot: scans every PDF stream (raw and decompressed) for embedded executable magic: PK\x03\x04 (ZIP/JAR), MZ + PE header validation (Windows PE), \x7fELF + class byte (Linux ELF), \xcf\xfa\xed\xfe (Mach-O x64), \xca\xfe\xba\xbe (Java class), \xd0\xcf\x11\xe0 (OLE/CFBF, Office binary), RAR, 7-Zip, \x00asm (WebAssembly), HTML/XHTML, and embedded PostScript. JAR detection requires META-INF/MANIFEST.MF.
PE detection requires the DOS e_lfanew pointer to resolve to a valid PE signature, eliminating false positives from font tables and binary streams. ELF detection validates the class byte (32-bit or 64-bit). Mid-stream scanning catches payloads prefixed by junk bytes.

Extracts /JS strings and JavaScript-bearing compressed streams, then parses each through the Acorn parser to build a full Abstract Syntax Tree. Instead of scanning text for keywords, the AST walker detects meaning: eval and execScript dynamic execution entry points; String.fromCharCode arrays that assemble shellcode from integer sequences at runtime; unescape decode chains that two-stage-deliver encoded payloads; numeric arrays of 150+ elements (the structural signature of heap spray); and new Function(string) dynamic code construction. These patterns are easy to miss with regex-based scanning but plainly visible at the AST level.

Threat intelligence uses fully offline local databases: URLhaus (abuse.ch malware payload hashes + URLs), MalwareBazaar (SHA-256 hash reputation + family labels), ThreatFox (IOC indicators), and FeodoTracker + OpenPhish (C2 IPs + phishing URLs), totalling 6.4M+ indicators stored in local PostgreSQL tables. Zero external API calls per scan, zero rate-limit dependency, zero data exfiltration. Databases are updated from public feeds on a periodic schedule. A hash match raises a critical indicator and auto-labels the scan. URLs extracted from the PDF are checked against URLhaus URL and domain feeds and the OpenPhish phishing URL list. No hash, URL, or file byte is transmitted externally.

ByteRange coverage integrity: per ISO 32000 §12.8.1, offsets are measured from the %PDF- header (files may carry bytes before it). o1 must be 0 (non-zero leaves the file header unsigned), both segments must stay within file bounds, the inner gap between segments must contain only the /Contents blob, and o2+l2 must reach at least the %%EOF marker.
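The four ByteRange coverage rules can be expressed as a small checker. This is a sketch under stated assumptions: the function name, the finding codes, and the way the caller supplies the /Contents span and the %%EOF position are all hypothetical, and locating those offsets in a real file is left out.

```python
def check_byte_range(byte_range, pdf_len, eof_end, contents_span):
    """Return finding codes for a signature /ByteRange [o1 l1 o2 l2].

    All offsets are relative to the %PDF- header. pdf_len is the byte count
    from the header to the end of the file, eof_end is the offset just past
    the final %%EOF marker, and contents_span is the (start, end) span of the
    /Contents value including its delimiters.
    """
    o1, l1, o2, l2 = byte_range
    findings = []
    if o1 != 0:
        # A non-zero first offset leaves the file header unsigned.
        findings.append("header-unsigned")
    if min(o1, l1, o2, l2) < 0 or o1 + l1 > pdf_len or o2 + l2 > pdf_len:
        findings.append("out-of-bounds")
    if (o1 + l1, o2) != tuple(contents_span):
        # The unsigned gap must cover the /Contents blob and nothing else.
        findings.append("gap-not-contents-only")
    if o2 + l2 < eof_end:
        # Bytes after the signed range but before %%EOF are unsigned.
        findings.append("trailing-region-unsigned")
    return findings
```

An empty list means the signed range covers everything except the signature blob itself; a "trailing-region-unsigned" finding is the case that would then need the trailing bytes inspected.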
Shadow document detection vs full-save rewrite detection: when o2+l2 < %%EOF, the engine inspects the unsigned trailing region. If it contains execution vectors (/JavaScript, /Launch) it is a shadow attack (CVE-2019-14980 class); if it contains xref/trailer/startxref structure without active content, it is a full-save rewrite: a PDF viewer rebuilt the entire file with new byte offsets, invalidating all existing signatures while leaving the visual signature appearance intact (used by DocuSign and similar tools). Both are flagged at high severity with the distinction clearly reported.

/Contents structural validation: all-zero blobs, sub-32-byte blobs, and missing DER SEQUENCE headers all indicate a structurally signed but cryptographically empty document. SubFilter deprecation: adbe.pkcs7.sha1 (SHA-1 collision attack surface), adbe.x509.rsa_sha1 (no embedded cert chain), and unknown SubFilters. Weak digest algorithm detection: MD5 and SHA-1 enable collision-assisted shadow document forgery. Post-signature object injection detection flags execution vectors added in incremental updates after signing.

Phishing checks: urgency/deception phrase detection (30+ patterns: "login required", "verify your account", "prize notification", "limited time"); brand impersonation keywords (Microsoft, Apple, Amazon, PayPal, DocuSign, Adobe, DHL, IRS, and others); credential harvesting via AcroForm, which detects SubmitForm actions combined with password-type form fields, the structural signature of a phishing PDF form; and QR code decoding, which renders every PDF page as an image, extracts all embedded QR codes via zbarimg, then checks decoded URLs for suspicious domains (IP addresses, non-HTTPS schemes, URL shorteners). High urgency phrase density combined with brand impersonation is scored as a high-confidence phishing indicator.
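The URL checks applied to decoded QR codes (IP-address hosts, non-HTTPS schemes, URL shorteners) might look like the following sketch. The shortener list and the flag names are illustrative assumptions, not the tool's actual data.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative shortener hosts only; the tool's real list is not shown in the text.
SHORTENERS = {"bit.ly", "tinyurl.com", "t.co"}


def url_flags(url: str):
    """Return a list of suspicion flags for one decoded URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    flags = []
    if parts.scheme.lower() != "https":
        flags.append("non-https")
    try:
        ipaddress.ip_address(host)  # raises ValueError for ordinary hostnames
        flags.append("ip-host")
    except ValueError:
        pass
    if host in SHORTENERS:
        flags.append("shortener")
    return flags
```

A URL such as `http://192.168.1.5/login` collects two flags, while an ordinary HTTPS link to a named host collects none.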
Uses pdfdetach (Poppler) to extract every embedded file attachment, then forensically analyses each: magic byte detection for Windows PE (MZ), Linux ELF (\x7fELF), OLE/CFBF (\xd0\xcf, Office binary), OOXML archives, script files (.bat, .ps1, .vbs, .sh), and 7-Zip/RAR containers; VBA macro detection in OOXML, which inspects xl/vbaProject.bin, word/vbaProject.bin, and ppt/vbaProject.bin entries inside Office attachments; and strings extraction on executable payloads to surface suspicious API calls, IP addresses, or command-line arguments. A PDF carrying a PE executable is a confirmed dropper, scored critical regardless of other indicators.

Computes the TLSH (Trend Micro Locality Sensitive Hash) of the full PDF: a similarity-preserving hash where similar files produce similar hashes, enabling fuzzy matching, unlike SHA-256. A TLSH score below 30 indicates near-identical files (same exploit kit generation); below 100 indicates the same malware campaign family. The hash is compared against every confirmed-malicious PDF previously scanned and stored in the PostgreSQL database. A cluster match surfaces the associated campaign context. A structural fingerprint (object counts, stream sizes, action types) is also computed for samples too small for TLSH. Falls back to ssdeep if TLSH is unavailable.

Form-field checks: JavaScript on fields (/A and /AA actions: JS fires on focus, blur, keystroke, validate, or calculate events, invisible during static review); hidden fields (NoExport flag: fields not shown to the user but present in submitted data); password-type fields (credential harvesting indicators); SubmitForm exfiltration targets (the URLs to which all field data is POSTed); /AA additional-action JS triggers on field objects (a secondary execution vector independent of /OpenAction); and calculation order /CO (adversaries reorder field calculations to chain JS evaluations across fields, enabling multi-step payload staging hidden inside form arithmetic).
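The VBA macro check on Office attachments described earlier in this section reduces to looking for three well-known entry names inside the OOXML ZIP container. A minimal sketch using only the standard library follows; the function name is hypothetical, and real attachments would first have to be extracted (the text uses pdfdetach for that).

```python
import io
import zipfile

# Entry names taken from the text.
VBA_ENTRIES = {"xl/vbaProject.bin", "word/vbaProject.bin", "ppt/vbaProject.bin"}


def find_vba_projects(attachment: bytes):
    """Return the VBA project entries present in an OOXML attachment.

    Returns an empty list when the bytes are not a readable ZIP archive.
    """
    if not attachment.startswith(b"PK\x03\x04"):
        return []
    try:
        with zipfile.ZipFile(io.BytesIO(attachment)) as archive:
            return sorted(VBA_ENTRIES.intersection(archive.namelist()))
    except zipfile.BadZipFile:
        return []
```

Only the archive's member names are read, so no attachment content is decompressed or executed by this check.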
Value / Appearance Stream (V/AP) divergence detection uses five rendering-free checks:

/NeedAppearances true (ISO 32000 §12.7.2): stale AP flag, critical when a digital signature is present since signed bytes cover the stale AP while the viewer renders regenerated content.

Checkbox and radio /V vs /AS key mismatch (definitive, zero-false-positive): the displayed state and the stored data value structurally disagree.

Text, listbox, and combobox field AP stream text extraction with font encoding remap: decompresses /AP /N, extracts Tj / TJ operators, and resolves any /Encoding /Differences table in the AP font so byte codes are translated to their rendered glyphs before comparison. This catches font-remap evasion where a custom encoding makes byte 0x31 render as “9” instead of “1”. The rendered text is then compared to /V (listbox multi-select /V arrays are joined). This catches “I agree to $1,000” displayed as “I agree to $10”, or a dropdown showing Option A while /V holds Option B.

Image-based AP stream detection: the AP stream invokes an image XObject via Do with no text operators, so the field displays a rasterised image while /V holds a different binding value inside the signed byte range. This is not comparable without image recognition and is flagged as high severity requiring manual review.

Blank AP stream detection: the AP stream draws no content, hiding the field value completely.

All enumerated field /V values are collected and passed to Engine ㊶ (JS Behavioral Emulation) so doc.getField returns real file values during JavaScript execution. Results feed into the Correlation Engine.

Splits the file at each %%EOF boundary and extracts per-revision metadata: author, producer, modification date, and new/modified/deleted object counts for each revision. Detects author identity changes between revisions, execution vectors injected after the original document was created, and large object injections in the final revision, the structural signature of automated exploit staging. Results feed into the Correlation Engine.
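Splitting a file into revisions at %%EOF boundaries and looking for execution vectors that first appear in a later revision can be sketched as below. This is a raw-byte illustration only: the function names are hypothetical, and a real engine would also need to look inside compressed streams and parse per-revision metadata, which this sketch does not attempt.

```python
import re

# A %%EOF marker plus optional trailing spaces and one end-of-line sequence.
EOF_RE = re.compile(rb"%%EOF[ \t]*(?:\r\n|\r|\n)?")
VECTORS = (b"/JavaScript", b"/Launch")


def split_revisions(data: bytes):
    """Return (start, end) byte ranges, one per %%EOF-terminated revision."""
    revisions, start = [], 0
    for match in EOF_RE.finditer(data):
        revisions.append((start, match.end()))
        start = match.end()
    return revisions


def late_vectors(data: bytes):
    """Return execution-vector keywords absent from revision 1 but present later."""
    revisions = split_revisions(data)
    if len(revisions) < 2:
        return []
    first_end = revisions[0][1]
    first, later = data[:first_end], data[first_end:revisions[-1][1]]
    return [v.decode() for v in VECTORS if v in later and v not in first]
```

For a two-revision file whose second revision adds a /JavaScript action, `late_vectors` returns `['/JavaScript']`, which corresponds to the "injected after the original document was created" case in the text.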
Flags dangerous URI schemes (javascript:, data:, file://, vbscript:); JavaScript action triggers on annotation interaction; /Launch actions that spawn arbitrary programs; GoToR remote links that open external files; and SubmitForm actions that exfiltrate form data. Annotation-borne payloads are invisible to scanners that only analyse raw bytes or page content streams. Results feed into the Correlation Engine.

PDF 2.0 Associated Files (/AF): enumerates the ISO 32000-2 §7.11.4 Associated Files mechanism at document and page level, recording each /AFRelationship type (Source, Data, EncryptedPayload, Alternative, Supplement). Associated files are a modern attachment surface that a legacy /EF-only walk misses; the attached streams themselves are passed to the Embedded File engine.

Catalogues the full PDF action infrastructure: Named JavaScript Registry (/Names /JavaScript subtree: persistent JS objects callable by name from any action); /AA Additional Actions (event-driven triggers on page open/close, print, save, field events); /OpenAction classification (JavaScript, Launch, GoToR, URI, GoTo); /Perms cryptographic permission restrictions; and UR3 usage-rights signatures used to exploit extended viewer features.

Deep DocMDP forensics: parses the /P permission level from /TransformParams (1 = no changes, 2 = form fill-ins, 3 = annotations and form fill-ins, the most exploitable); flags missing or out-of-spec /P values; checks the /SigFlags AppendOnly bit; validates presence of /ByteRange and the /Reference array; detects incremental updates that violate the MDP constraint (any JavaScript is prohibited under all /P levels); and flags multiple /DocMDP entries (validator confusion / spoofing).
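The /P handling in the DocMDP checks can be illustrated with a tiny assessor. The dictionary-based input, the function name, and the finding strings are assumptions for this sketch; the spec default of 2 for a missing /P comes from ISO 32000, and the JavaScript rule is taken from the text above.

```python
# /P levels as described in the text.
DOCMDP_LEVELS = {
    1: "no changes",
    2: "form fill-ins",
    3: "annotations and form fill-ins",
}


def assess_docmdp(transform_params: dict, later_revision_has_js: bool):
    """Return (effective_p, findings) for one DocMDP /TransformParams dict.

    transform_params is a plain dict standing in for the parsed PDF dictionary
    (a hypothetical shape); later_revision_has_js is supplied by the caller.
    """
    findings = []
    p = transform_params.get("/P")
    if p is None:
        findings.append("missing /P (spec default is 2)")
        p = 2
    elif p not in DOCMDP_LEVELS:
        findings.append(f"out-of-spec /P value: {p!r}")
    if later_revision_has_js:
        # Per the text, JavaScript is prohibited under all /P levels.
        findings.append("JavaScript added after certification")
    return p, findings
```

Keeping the effective level and the findings separate lets a caller report a missing /P as an anomaly while still evaluating later revisions against the default level.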
FieldMDP per-signature field lock (ISO 32000 §12.8.2.4, "FieldMDP"): unlike DocMDP, which is document-level, FieldMDP locks specific named form fields per approval signature and can be selectively permissive. Flags Action=Include with an empty /Fields array (locks no fields despite appearing to certify), Action=Exclude with named fields (those fields are explicitly not locked, leaving them modifiable post-signing), and incremental updates that contain form-field modifications (/Widget, /AcroForm) after a FieldMDP signature is in place, a constraint violation detectable across Acrobat and pdf.js, where validators differ on whether they check field names against the locked set. Results feed into the Correlation Engine.

Flags dangerous PostScript execution operators: exec (dynamic code execution), run (file execution), token (string-to-code eval), and setpagedevice (PostScript-to-system passthrough, which bridges to the PostScript interpreter from PDF context). Also detects ICC color profile abuse: malformed /ICCBased profiles of anomalous size exploit heap buffer overflows (CVE-2021-21017 class). Flags content bombs: streams exceeding 5 MB that may exhaust parser memory or conceal data in oversized objects. Results feed into the Correlation Engine.

Objects can be stored inside a compressed /ObjStm stream. Scanners that only search raw bytes will miss any object inside a compressed container. This engine decompresses every /ObjStm in the document and re-scans the decompressed content for JavaScript, /Launch actions, /EmbeddedFile references, and high-entropy payloads (entropy above 7.5 bits) that suggest encrypted content hidden inside compressed object bundles. Complements the Stream Inspector (Engine 3) with object-container-specific forensics. Results feed into the Correlation Engine.

PDF name tokens may contain #XX hex escapes, so /J#61vaScript is syntactically identical to /JavaScript to every PDF parser, but bypasses every raw-byte keyword scanner.
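Decoding hex escapes in name tokens is straightforward to sketch. In the example below the keyword set is limited to the seven names the text lists, and the function names and the return shape are assumptions; only names that actually contain an escape are reported, since plain keywords are already visible to byte scanners.

```python
import re

# The seven keywords named in the text; the full list is longer.
DANGEROUS = {
    "JavaScript", "Launch", "OpenAction", "EmbeddedFile",
    "XFA", "SubmitForm", "JBIG2Decode",
}

# A name token: "/" followed by regular characters (no whitespace or delimiters).
NAME_RE = re.compile(rb"/([^\x00\t\n\x0c\r ()<>\[\]{}/%]+)")
HEX_ESCAPE_RE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_name(raw: bytes) -> str:
    """Replace every #XX escape in a raw name token with the byte it encodes."""
    decoded = HEX_ESCAPE_RE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    return decoded.decode("latin-1")


def find_obfuscated_names(data: bytes):
    """Return (offset, raw_name, decoded_name) for escaped dangerous names."""
    hits = []
    for match in NAME_RE.finditer(data):
        raw = match.group(1)
        if b"#" not in raw:
            continue  # unescaped names are left to literal keyword scanning
        name = decode_name(raw)
        if name in DANGEROUS:
            hits.append((match.start(), raw.decode("latin-1"), name))
    return hits
```

Run over `<< /S /J#61vaScript /JS (x) >>`, this reports the escaped token at offset 6 with its decoded form `JavaScript`.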
This engine decodes all hex escapes in name tokens and checks whether the decoded name matches a dangerous keyword: JavaScript, Launch, OpenAction, EmbeddedFile, XFA, SubmitForm, JBIG2Decode, and 10 more. Also detects whitespace-split keywords (e.g. /Ja\nvaScript: valid PDF whitespace normalised by parsers but invisible to substring search), formfeed byte injection (0x0C, a valid PDF whitespace delimiter used instead of space to confuse tools that only accept 0x20/0x09/0x0A as delimiters), and null byte injection in header regions (0x00 bytes outside binary streams, an evasion technique against C-string comparison scanners). Finds obfuscation that Engine 2 (byte patterns) misses because that engine searches for literal keyword bytes. Results feed into the Correlation Engine.

FormCalc scripts xfa.host.exec, Url.resolve, xfa.host.openURL, submit actions