# PDFs Don't Have One Meaning: Measuring Semantic Drift Across 24,824 Files

> Source: <https://pqpdf.com/research.php>
> Published: 2026-06-14 17:24:26+00:00

# Immutable Bytes, Mutable Meaning

A PDF isn't one document: it's a rendering program over an object graph, and each parser, renderer, signature validator, or AI pipeline decides for itself what the file says. This is the single entry point to one argument, built in layers and measured across 24,824 real PDFs in three separate corpora.

Document AI assumes semantic determinism. *The PDF format never provided it.*

Ground truth, retrieval correctness, reproducible evaluation, hallucination control — every one quietly assumes a
document means the same thing to every reader. For PDFs that assumption is false, and now that machines do the reading it is a
**measurable failure**, not a curiosity. You don't have to care about PDFs to care about that.

Immutable bytes do not guarantee *immutable meaning*.

Imagine an LLM trained on a document **no human has ever seen**. Nothing was hidden and nothing was hacked — the
PDF simply held several valid machine-readable realities, the ingestion pipeline picked one, and enshrined it as ground truth.
That is not a thought experiment; it is what the studies below measure.

PDF was engineered to guarantee one thing: **visual fidelity** — that a page looks the
same on every screen and printer. It never promised **semantic determinism** — that
every system reading the file extracts the same meaning from it. For thirty years that gap was
invisible, because humans read the pixels and the assumption “one parser, one truth” was never tested.

Machines changed the stakes. RAG knowledge bases, LLM training corpora, compliance pipelines and
legal-discovery systems now read the **object graph**, not the page — and the gap becomes a
security and correctness problem: forms signed while their value and their appearance disagree,
knowledge bases that silently enshrine the wrong reality, evidence that reads one way to a person
and another to a tool. This program names that property, measures how often it occurs, and shows
what it enables.

## Why this surfaced now

What changed is who the **readers** are. For thirty years a human read the rendered page, and the gap between pixel and payload never mattered. Now machines read the *structure* at scale, and a latent property of the format becomes an active security and correctness problem.

For thirty years we assumed a document had one meaning, because a human read the page. Now machines read the structure — and many PDFs never had a single meaning at all.

## One file, as many documents as there are readers

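Hand the same file to a parser, a renderer, a signature validator and an AI pipeline, and each can read a **different document**. None is malfunctioning; the PDF format guarantees they'll agree on how the page *looks*, never on what it *says*.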

## See the gap in one file

The claim sounds abstract until you watch a single file say two things at once.
None of the files below is corrupt. None trips a malware alert. Each is simply
**a different document depending on who — or what — is reading it.**

A font's **ToUnicode** map points the glyph drawn as **9** at the character **1**. The page is honest to the eye and lying to the machine, and a digital signature over the bytes certifies both readings. Proven and diagrammed in [The Illusion of Immutability](/pdf-illusion-of-immutability.php).
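To make the mechanism concrete, here is a minimal Python sketch. It does not parse a real PDF: `GLYPH_SHAPES` and `TOUNICODE` are hypothetical stand-ins for a font's glyph outlines and its ToUnicode map, and the character codes are invented for illustration. It only shows how a renderer and a text extractor can read the same codes and return different strings.

```python
# Toy model, not a PDF parser: every name and code here is illustrative.
# A content stream stores character codes. The renderer looks up the glyph
# shape for each code; the text extractor looks up the ToUnicode mapping.

# What each code draws on the page (stand-in for glyph outlines).
GLYPH_SHAPES = {0x01: "9", 0x02: "5", 0x03: "0"}

# What each code claims to mean (stand-in for a ToUnicode map).
# Code 0x01 is drawn as "9" but mapped to "1".
TOUNICODE = {0x01: "1", 0x02: "5", 0x03: "0"}


def rendered_text(codes):
    """What a human sees: the shapes painted on the page."""
    return "".join(GLYPH_SHAPES[c] for c in codes)


def extracted_text(codes):
    """What a machine reads: the mapped character for each code."""
    return "".join(TOUNICODE[c] for c in codes)


def remapped_codes(codes):
    """Codes whose drawn shape and mapped character disagree."""
    return sorted({c for c in codes if GLYPH_SHAPES[c] != TOUNICODE[c]})


codes = [0x01, 0x02, 0x03, 0x03]
print(rendered_text(codes))   # 9500
print(extracted_text(codes))  # 1500
print(remapped_codes(codes))  # [1]
```

A real check would need to compare rendered glyphs (for example via OCR of the rasterized page) against the extracted text, since a font file does not label its own shapes the way this toy table does.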

A form field carries **two independent representations**: its stored value (/V) and its rendered appearance (/AP). The certificate covers the file's bytes, including both, so the signature stays **valid** while the value disagrees with its own appearance. See [PDF Forms as Executable Security Boundaries](/pdf-form-security.php).
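The following Python sketch illustrates the idea under loud simplifications. The `field` dictionary, the `ap_text` key and both amounts are hypothetical; a real /AP entry is an appearance stream of drawing operators, not a string, and a SHA-256 digest stands in here for a real signature. The point is only that an integrity check over the bytes covers both representations and says nothing about whether they agree.

```python
import hashlib
import json

# Hypothetical simplification: "V" is the stored value, "ap_text" stands in
# for the text that the field's appearance stream would paint.
field = {"name": "amount", "V": "100.00", "ap_text": "900.00"}


def stored_value(f):
    """What a form-data extractor reads (the /V side)."""
    return f["V"]


def displayed_value(f):
    """What a viewer shows the signer (the /AP side)."""
    return f["ap_text"]


def is_divergent(f):
    """True when the stored value and the displayed value disagree."""
    return stored_value(f) != displayed_value(f)


# The digest is computed over the serialized bytes, which contain both
# representations. It is equally valid whether or not they match.
file_bytes = json.dumps(field, sort_keys=True).encode("utf-8")
digest = hashlib.sha256(file_bytes).hexdigest()

print(is_divergent(field))  # True
print(len(digest))          # 64
```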

| Parser | Pages | PDF ver. | JavaScript | Encrypted | AcroForm |
|---|---|---|---|---|---|
| MuPDF | 1 | 1.4 | None | No | No |
| Poppler | 1 | 1.4 | Present | No | No |
| Ghostscript | 1 | — | None | — | — |
| qpdf | 1 | 1.4 | — | No | — |
| pdfminer | 1 | — | Present | No | No |
| pdf.js | 1 | — | Present | No | — |

In the file scanned above, the JavaScript sits in the **/Names/JavaScript** tree. Three parsers find it, two report **none**, and one has no path to check. The **—** cells show the tools don't even agree on which properties they can *see*. A malware scanner built on MuPDF or Ghostscript waves the file through as clean. The real scanner output for all eleven crafted files is in [Parser Disagreement: Six Parsers, Eleven Divergences](/pdf-parser-disagreement.php).
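The table can be turned into a small disagreement check. The Python sketch below copies the six rows as data and reports which properties get conflicting answers and which parsers are blind to a property. The dictionary layout and function names are assumptions made for illustration, not the scanner's actual output format.

```python
# Rows copied from the table above. None marks a dash cell, meaning the
# tool reported nothing for that property.
REPORTS = {
    "MuPDF":       {"pages": 1, "version": "1.4", "javascript": "None",    "encrypted": "No", "acroform": "No"},
    "Poppler":     {"pages": 1, "version": "1.4", "javascript": "Present", "encrypted": "No", "acroform": "No"},
    "Ghostscript": {"pages": 1, "version": None,  "javascript": "None",    "encrypted": None, "acroform": None},
    "qpdf":        {"pages": 1, "version": "1.4", "javascript": None,      "encrypted": "No", "acroform": None},
    "pdfminer":    {"pages": 1, "version": None,  "javascript": "Present", "encrypted": "No", "acroform": "No"},
    "pdf.js":      {"pages": 1, "version": None,  "javascript": "Present", "encrypted": "No", "acroform": None},
}


def disagreements(reports):
    """Map each contested property to {answer: [parsers giving it]}."""
    contested = {}
    properties = sorted({p for row in reports.values() for p in row})
    for prop in properties:
        answers = {}
        for parser, row in reports.items():
            value = row.get(prop)
            if value is not None:
                answers.setdefault(value, []).append(parser)
        if len(answers) > 1:
            contested[prop] = answers
    return contested


def blind_parsers(reports, prop):
    """Parsers that reported nothing at all for a property."""
    return [name for name, row in reports.items() if row.get(prop) is None]


print(disagreements(REPORTS))
# {'javascript': {'None': ['MuPDF', 'Ghostscript'], 'Present': ['Poppler', 'pdfminer', 'pdf.js']}}
print(blind_parsers(REPORTS, "javascript"))  # ['qpdf']
```

Treating a dash cell as "no answer" rather than "no" is the design choice that matters: a scanner that reads silence as a clean result inherits the blind spot of whichever parser it is built on.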

Same bytes. Same hash. A valid signature. *Three different documents.*

[See it on your own file: open the PDF Forensics Scanner →](/tools/scan.php)

Every exhibit above is what the engine sees on a real upload. Drop in a PDF and watch the three verdict axes — threat, deception, structural — resolve in seconds, with the parser-disagreement and reality-drift signals called out. Zero-retention: nothing is stored.

## Findings, visualized

[🧷 Start here · the synthesis](/pdf-not-the-document.php)

## The PDF Is Not the Document — 24,824 PDFs, Three Corpora

**Finding —** One finding across an adversarial detection set, a real-world benign control, and the entire 16,971-PDF Epstein release: a PDF is a stack of representations that can disagree, and malware is only one axis.

The synthesis over the entire program below and the three corpora behind it — 24,824 real PDFs measured separately, never blended. Detection without reputation, a 0.34% false-positive rate, 18.6% human-vs-machine drift at scale, metadata that survives the strip in orphaned objects, and a numbering sequence only 2.4% complete. The strongest evidence that document forensics and malware scanning are different disciplines.

Read the full study →

**Malware is only one axis.** The scanner grades threat, deception and structural integrity *independently*: a file can carry zero malware and still rank high on deception. Collapsing the three into a single “risk score” is exactly how a clean-looking, lying document slips through.

*Illustrative of the verdict model.*
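A short Python sketch shows why collapsing the axes loses information. Everything numeric here is an assumption: the source does not describe the scanner's scale, thresholds or scoring, so the 0-to-1 scores, the 0.7 threshold and the plain average are hypothetical choices made only to illustrate the argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    # Hypothetical 0.0-1.0 scores, one per axis named in the text.
    threat: float
    deception: float
    structural: float

    def collapsed(self):
        """A single blended risk score (plain average, for illustration)."""
        return (self.threat + self.deception + self.structural) / 3

    def flagged_axes(self, threshold=0.7):
        """Axes that cross the threshold when judged independently."""
        axes = {
            "threat": self.threat,
            "deception": self.deception,
            "structural": self.structural,
        }
        return [name for name, score in axes.items() if score >= threshold]


THRESHOLD = 0.7  # assumed cut-off, not taken from the scanner

# No malware at all, but the file says different things to different readers.
clean_but_lying = Verdict(threat=0.0, deception=0.9, structural=0.3)

print(clean_but_lying.collapsed() >= THRESHOLD)  # False
print(clean_but_lying.flagged_axes(THRESHOLD))   # ['deception']
```

The blended score stays under the threshold while the deception axis alone exceeds it, which is the failure the paragraph above describes.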

## One property. PDF is the proof.

Parser disagreement, reality drift, V/AP divergence and the rest aren't separate topics — and they aren't really about PDF.
They're independent lines of evidence for one general property — **Semantic Nondeterminism**: identical bytes
that yield multiple valid semantic interpretations across different consumers, despite nothing in the file having changed.
PDF is simply where it can be **measured at scale**, because PDF exposes the object graph the divergence hides in.
The format is the proof; the property is the point.

[Parser disagreement](/pdf-parser-disagreement.php)

[Reality drift](/pdf-reality-drift.php)

[V/AP divergence](/pdf-form-security.php)

[OCR-layer divergence](/pdf-reality-drift.php)

[Accessibility-tree divergence](/pdf-reality-drift.php)

[ToUnicode remapping](/pdf-illusion-of-immutability.php)

[AI-ingestion failure](/pdf-ai-ingestion.php)

Seven independent lines of evidence — every one measured in PDF, the document format that exposes its own object graph. The same assumption lives unmeasured under search, e-discovery, AI ingestion and compliance.

## The argument, in five layers

Read top to bottom and the case builds from first principles to field measurement: what a PDF is, the one property it lacks, the four mechanisms that exploit the gap, the prevalence at scale, and a real-world application. Each study stands alone — together they define the discipline.

## Foundational framing — what a PDF actually is

Before any attack, a reset of first principles. A PDF is not the document; it is a rendering **program over an object graph**. Separate the file from the document and the central illusion collapses: identical bytes, an unbroken hash, even a valid signature say something about a container — not about what any given system will read as truth.

## The unifying theory — one root cause

Every failure mode below is a symptom of a single structural fact. PDF was engineered to guarantee **visual fidelity** — that a page looks the same everywhere. It never promised **semantic determinism**: that every extractor reads the same meaning. This is the keystone that ties the threads together.

## Mechanism threads — how one file forks into many

Four distinct structural mechanisms, each turning a single file into different documents for different readers. These are not variations on one bug; they are independent routes to the same outcome — **what this PDF is depends on who is reading it**.

[⚖️ Parser discrepancy · Keyword injection · Structural ambiguity · Read the study →](/pdf-parser-disagreement.php)

### Parser Disagreement: Six Parsers, Eleven Divergences

**Finding —** 11 crafted PDFs run through six production parsers — every file produced a different reading. Same bytes, different document.

MuPDF, Poppler, Ghostscript, qpdf, pdfminer and pdf.js, each in isolated namespaces, disagree on page count, text, JavaScript presence and structure for the same file — the basis of parser-discrepancy attacks.

[📋 V/AP divergence · DocMDP · AcroForm · Read the study →](/pdf-form-security.php)

### PDF Forms as Executable Security Boundaries

**Finding —** A digital signature can certify a form while /V (the value) and /AP (the appearance) disagree — what gets signed is not what gets read.

Form fields carry two independent representations. V/AP divergence, NeedAppearances, DocMDP and FieldMDP certification mean a "signed" document can render one value and store another.

[🌀 Reality drift · OCR / text-layer · Accessibility · Read the study →](/pdf-reality-drift.php)

### PDF Reality Drift

**Finding —** One file, many realities: 43 of 44 IRS tax forms drift between the rendered page and the extracted text layer.

Thirteen structural drift vectors — OCR layers, accessibility trees, incremental revisions, ToUnicode remaps, optional-content groups — make legitimate, professionally produced PDFs among the most semantically unstable of all.

[🤖 AI poisoning · V/AP divergence · Parser discrepancy · RAG / LLM · Read the study →](/pdf-ai-ingestion.php)

### PDF Structural Problems in AI Ingestion Pipelines

**Finding —** AI ingestion can be poisoned by the document itself — the model ingests text the human reader never sees.

When V/AP divergence and parser disagreement reach a RAG knowledge base or an LLM training corpus, single-parser extraction silently picks one reality and enshrines it as ground truth — quietly poisoning what the model learns or retrieves.
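One way to avoid silently picking a reality is to cross-check several extractors before ingestion. The Python sketch below is a minimal illustration of that idea, not the method used in the study. The two extractors are hypothetical stubs that return fixed strings; in practice each would wrap a different parser or read a different layer of the file.

```python
from typing import Callable, Dict

# An extractor is any function from raw PDF bytes to text.
Extractor = Callable[[bytes], str]


def normalize(text: str) -> str:
    """Collapse whitespace so layout noise is not counted as disagreement."""
    return " ".join(text.split())


def cross_check(pdf_bytes: bytes, extractors: Dict[str, Extractor]) -> dict:
    """Run every extractor and report whether they agree on the text."""
    readings = {name: normalize(fn(pdf_bytes)) for name, fn in extractors.items()}
    consistent = len(set(readings.values())) == 1
    return {"consistent": consistent, "readings": readings}


def ingest_or_quarantine(pdf_bytes: bytes, extractors: Dict[str, Extractor]) -> str:
    """Send single-reading files onward; hold back files that fork."""
    result = cross_check(pdf_bytes, extractors)
    return "ingest" if result["consistent"] else "quarantine"


# Hypothetical stubs standing in for two real parsers.
def stub_value_reader(_: bytes) -> str:
    return "Total due:  100.00"


def stub_appearance_reader(_: bytes) -> str:
    return "Total due: 900.00"


extractors = {"value": stub_value_reader, "appearance": stub_appearance_reader}
print(ingest_or_quarantine(b"%PDF-1.4 placeholder", extractors))  # quarantine
```

Exact string equality after whitespace normalization is deliberately strict. A production pipeline would likely need a tolerance for harmless differences such as ligatures or reading order, and that tolerance is itself a judgment about which divergences matter.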

## Empirical scale — does it hold in the wild?

A theory is only as good as its measurement. The mechanisms above are stress-tested against a large, multi-domain corpus to answer the question that decides whether this is a curiosity or a property of the ecosystem: **how often does it actually happen?** The corpus-level numbers from that measurement **consolidate and supersede** the per-study prevalence figures in the earlier deep-dives. The mechanisms those studies document still stand; the authoritative counts live here and in the synthesis above.

## Field application — a high-stakes case

The machinery, applied to a socially loaded, contested real-world disclosure where “what the document says” genuinely matters. This is the exhibit showing that the work is not academic: multi-engine, multi-layer forensics yields a different answer than opening the file and squinting.

## Why this is its own subject

The thirty-year PDF research canon — rendering fidelity, compression, digital signatures, malware,
OCR, accessibility, PDF/A conformance — shares one unexamined assumption: that a PDF
*has* a single, knowable content, and that one parser reading it yields the truth. No standards
body, AI lab, or academic group routinely measures whether different engines extract the
**same meaning** from the same bytes. We measured it. They don't. **And if semantic determinism cannot be
assumed for the world's most common document format, then AI, search, compliance, and evidentiary systems built on top of
documents must treat semantic determinism as a property to be verified, not assumed.**

Across 24,824 real PDFs the same file routinely produces different documents — different page
counts, different text, different JavaScript visibility, a value that disagrees with its own signed
appearance. That isn't a crafted edge case; it's a property of the ecosystem, latent for as long as
humans read the page and now active because machines read the structure. The threat model shifts
from *“is this file malware?”* to *“what realities can this file present, to whom, and who will
believe which one?”* — and that question is the subject this program exists to map.

The finding is about PDF; the assumption it breaks is not. The assumption that a document means one thing regardless of who
reads it is the unstated foundation of **digital forensics, e-discovery, search and information retrieval, compliance, and now AI
training and retrieval**. Semantic determinism is the assumption they all share; PDF is the proof it was never guaranteed.
Name it, measure it, and you can start hardening *ground truth* everywhere a document is ingested.

## Frequently asked

### What is semantic nondeterminism?

Semantic nondeterminism is the property whereby identical bytes — an unchanged file, same hash — yield multiple valid machine-readable interpretations across different consumers. It is technical and falsifiable: same input, different extracted meaning, nothing in the file changed. This research measures it in PDF, the document format whose object graph exposes the divergence.

### Is this only a PDF problem?

The finding is measured in PDF; the assumption it breaks is not. PDF is simply where the property is testable at scale, because the format exposes its own object graph. The assumption that a document means one thing regardless of who reads it is the unstated foundation of search, e-discovery, AI ingestion and compliance. We have proven it fails in PDF, so those systems must treat semantic determinism as a property to be verified, not assumed.

### Why does it matter for AI and RAG pipelines?

When a single PDF carries more than one valid reading, single-parser extraction silently picks one and enshrines it as ground truth. The text a model trains or retrieves on can differ from the page a human reviewed — so retrieval correctness, reproducible evaluation and hallucination control all rest on an assumption the document never guaranteed.

### How common is it, really?

Common enough to be a property of the ecosystem, not an edge case. Six production parsers disagree on roughly one file in three (502 of 1,572). 43 of 44 IRS tax forms drift between the rendered page and the extracted text layer. 18.6% of the 16,971-PDF DOJ Epstein release reads differently to a machine than to a person. Detection is by analysis, not reputation — a 0.34% false-positive rate on a 6,281-PDF real-world control.
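The figures in this answer can be checked with a few lines of Python arithmetic. The dictionary keys are descriptive labels chosen for this sketch. The three counts quoted here sum to the headline 24,824, which suggests that 1,572 is the size of the third corpus; that is an inference from the arithmetic, not a statement in the text. The absolute counts derived from percentages are approximations, since only the percentages are given.

```python
# Counts quoted in the answer above; key names are labels for this sketch.
corpus_sizes = {
    "parser_comparison": 1572,
    "real_world_control": 6281,
    "doj_epstein_release": 16971,
}

total = sum(corpus_sizes.values())
print(total)  # 24824

# "Roughly one file in three": 502 of 1,572.
parser_disagreement_pct = round(502 / 1572 * 100, 1)
print(parser_disagreement_pct)  # 31.9

# 43 of 44 IRS tax forms.
irs_drift_pct = round(43 / 44 * 100, 1)
print(irs_drift_pct)  # 97.7

# Approximate file counts implied by the quoted percentages.
approx_drifting = round(0.186 * corpus_sizes["doj_epstein_release"])
approx_false_positives = round(0.0034 * corpus_sizes["real_world_control"])
print(approx_drifting)         # 3157
print(approx_false_positives)  # 21
```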

### Doesn't a hash or digital signature already prevent this?

No. A hash proves the bytes are unchanged; a signature certifies them. Neither constrains meaning. A form field can be signed while its stored value disagrees with its rendered appearance; a font can draw the character 9 while text extraction returns 1. The signature stays valid and the document still says two different things to two different readers.

[PDF Forensics Scanner](/tools/scan.php) · [how the engine works](/pdf-malware-scanner.php)
