Detect AI-Generated PDFs: What Works and What Does Not

wpnews.pro

Originally published at [htpbe.tech]. The version on htpbe.tech stays in sync with the latest detection algorithm; refer to it for the canonical text.

Accounts payable teams are receiving receipts generated by ChatGPT plugins. HR platforms are seeing payslips rendered by Python scripts. Insurance claims contain repair estimates that no shop ever issued. The documents look correct. The logos match. The numbers are plausible.

The question is: what can actually be detected, and what cannot?

The honest answer requires separating two things that are often confused under the phrase “AI-generated document detection.”

When people ask how to detect an AI-generated document, they usually mean one of two distinct things:

Content classification asks: was the text in this document written by an AI language model? This is what tools like GPTZero and Turnitin’s AI detector do. They analyze writing style, token probability distributions, and linguistic patterns to estimate whether a human or a model produced the text.

Structural forensics asks: was this PDF file generated by a real institutional system, or did it come from a headless browser, a PDF library, or a consumer tool? This is what HTPBE does. It reads the binary structure of the file — producer metadata, xref patterns, font embedding, object numbering — and checks whether those patterns match how legitimate institutional software generates documents.

These are not the same problem. A document can contain AI-written text and still come from a real corporate system. A document can contain entirely human-written text and still have been rendered by Puppeteer an hour ago. The structural check and the content check answer different questions.

HTPBE does structural forensics. It does not classify text. This article explains what that distinction means in practice, what the structural approach reliably catches, and where its limits are.

When an AI tool generates a PDF, it must render that PDF using some software. The rendering layer almost always leaves a producer fingerprint.

The most common rendering paths for AI-generated documents in fraud scenarios:

Headless browsers (Chrome Headless, Puppeteer, Playwright) are used when a fraudster builds an HTML template, often copied from a legitimate document they scanned or photographed, and renders it to PDF using a browser. Chrome Headless has a characteristic producer string: Chromium, Chrome, or a Puppeteer-generated variant that typically includes the Chrome version. These strings are recognizable and are cross-referenced against known institutional producers.

Python and Node.js PDF libraries (ReportLab, PDFKit, jsPDF, fpdf2, WeasyPrint) are used when someone generates a document programmatically, either directly or as part of an AI tool’s export pipeline. ReportLab’s producer string is ReportLab PDF Library. PDFKit’s is PDFKit. jsPDF writes jsPDF. None of these strings appear in documents genuinely issued by banks, payroll processors, or insurance carriers.

wkhtmltopdf is an older HTML-to-PDF tool that remains common in automated document generation pipelines. Its producer string is wkhtmltopdf.

Online “AI document generators” that export to PDF typically use one of the tools above internally. The producer field reflects the underlying renderer, not the AI layer on top.
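As a rough illustration of the fingerprinting described above, the producer strings named in this section can be grouped by rendering path. The family names and the substring matching are assumptions made for this sketch; they are not HTPBE's actual rules or database.

```python
from typing import Optional

# Hypothetical grouping of the producer strings named in this article.
# Keys are illustrative family names; values are lowercase substrings to look for.
RENDERER_FAMILIES = {
    "headless_browser": ("chromium", "chrome", "puppeteer", "playwright"),
    "pdf_library": ("reportlab", "pdfkit", "jspdf", "fpdf", "weasyprint"),
    "html_to_pdf": ("wkhtmltopdf",),
}


def renderer_family(producer: Optional[str]) -> Optional[str]:
    """Return the rendering path a Producer string points to, or None if unrecognized."""
    if not producer:
        return None
    lowered = producer.lower()
    for family, needles in RENDERER_FAMILIES.items():
        if any(needle in lowered for needle in needles):
            return family
    return None


if __name__ == "__main__":
    for sample in ("Chromium (Chrome 124.0)", "ReportLab PDF Library", "wkhtmltopdf"):
        print(sample, "->", renderer_family(sample))
```

Substring matching is deliberately loose so that version suffixes do not break the lookup. The cost is that it can only recognize tools someone has already listed.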

When HTPBE analyzes a submitted PDF, it compares the Producer field against a database of known institutional generators: the software that real banks, payroll platforms, accounting systems, and government agencies use to produce documents. A mismatch between the claimed document type and the actual rendering software is a modification marker.

A payslip generated by ReportLab does not look like a payslip generated by Sage Payroll or ADP Workforce Now at the structural level. Both may look identical visually. The binary layer tells a different story.
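The mismatch check can be pictured as an allowlist keyed by claimed document type. This is only a sketch: the table below is hypothetical, and its entries are placeholders built from the product names mentioned above, not verified Producer values from those products.

```python
from typing import Optional

# Hypothetical allowlist. These substrings are placeholders for illustration,
# not the Producer values that Sage Payroll or ADP Workforce Now actually write.
EXPECTED_PRODUCERS = {
    "payslip": ("sage payroll", "adp workforce now"),
}


def producer_matches_claim(doc_type: str, producer: Optional[str]) -> Optional[bool]:
    """Compare a Producer string with the generators expected for a document type.

    Returns True or False when the type has an allowlist, and None when it does
    not, so callers can tell "mismatch" apart from "nothing to compare against".
    """
    expected = EXPECTED_PRODUCERS.get(doc_type)
    if expected is None:
        return None
    lowered = (producer or "").lower()
    return any(name in lowered for name in expected)
```

Returning None for unlisted types keeps the check from flagging documents, such as cover letters, that have no institutional baseline.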

Below is a real API response for a payslip submitted to a lending platform. The file was generated by a Puppeteer-based AI document tool and submitted as proof of income.

{
  "status": "inconclusive",
  "modification_confidence": "none",
  "modification_markers": [],
  "creator": null,
  "producer": "Chromium (Chrome 124.0)",
  "origin": { "type": "consumer_software", "software": null },
  "creation_date": null,
  "modification_date": null,
  "xref_count": 1
}

The verdict is inconclusive, not modified. There is no evidence this file was edited after creation, because it was never edited. It was created in its current form, in a single render pass, by a headless browser.

The producer field is Chromium (Chrome 124.0). A payslip from a real employer does not come from a headless Chrome instance. The origin type is consumer_software. The creation_date field is null because Puppeteer does not set it by default.

This is the correct interpretation of inconclusive in an AI fraud context: the document shows no modification markers because it was never a real document that was later modified. It was fabricated from nothing. The absence of institutional metadata is itself the signal.
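To make that reading concrete, here is a small sketch that turns the sample response above into a list of named signals. The signal names are invented for this example, and treating xref_count == 1 as "single render pass" is an assumption based on the explanation above, not documented API behavior.

```python
def fabrication_signals(result: dict) -> list:
    """List the absence-of-institutional-metadata signals found in one response."""
    signals = []
    producer = (result.get("producer") or "").lower()
    if "chromium" in producer or "chrome" in producer:
        signals.append("headless_browser_producer")
    if result.get("creation_date") is None:
        signals.append("missing_creation_date")
    if (result.get("origin") or {}).get("type") == "consumer_software":
        signals.append("consumer_origin")
    # Assumption: one xref section and no markers means the file was written once.
    if result.get("xref_count") == 1 and not result.get("modification_markers"):
        signals.append("single_render_pass")
    return signals


# The sample response from this article, as a Python dict.
SAMPLE = {
    "status": "inconclusive",
    "modification_confidence": "none",
    "modification_markers": [],
    "creator": None,
    "producer": "Chromium (Chrome 124.0)",
    "origin": {"type": "consumer_software", "software": None},
    "creation_date": None,
    "modification_date": None,
    "xref_count": 1,
}

if __name__ == "__main__":
    print(fabrication_signals(SAMPLE))
```

None of these signals proves fraud on its own. They only become meaningful once the document's claimed origin is taken into account.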

An inconclusive result from HTPBE means: this document was created by consumer or non-institutional software, and we cannot determine whether it was modified after creation because there is no institutional baseline to compare against.

For user-generated documents (cover letters, personal statements, forms the applicant completed themselves), inconclusive is expected and is not a fraud signal. A person who writes their cover letter in Google Docs and exports it to PDF will produce an inconclusive result. That is correct behavior.

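For documents that claim institutional origin, inconclusive is a strong fraud signal. The reasoning: institutional issuers produce these documents with their own systems, not with consumer tools or headless browsers.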

If your workflow receives documents that claim to be bank statements, payslips, or official certificates, and those documents return inconclusive with a consumer or headless-browser producer, do not accept them. The document’s own metadata contradicts its claimed origin.

import os
import httpx

# The API key is read from the environment so it never lands in source control.
API_KEY = os.environ["HTPBE_API_KEY"]
BASE_URL = "https://api.htpbe.tech/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Document types that should only ever come from an issuer's own systems.
INSTITUTIONAL_DOC_TYPES = {"bank_statement", "payslip", "tax_certificate", "insurance_policy"}

# Lowercase substrings of producer strings left by browsers and PDF libraries.
CONSUMER_PRODUCERS = {
    "chromium", "chrome", "puppeteer", "playwright",
    "reportlab", "pdfkit", "jspdf", "fpdf", "wkhtmltopdf",
    "weasyprint",
}

def verify_document(pdf_url: str, doc_type: str) -> dict:
    """Submit a PDF for analysis and map the verdict to an accept or reject action."""
    # Submit the file by URL; the response carries the ID of the check.
    r = httpx.post(f"{BASE_URL}/analyze", headers=HEADERS, json={"url": pdf_url}, timeout=30)
    r.raise_for_status()
    check_id = r.json()["id"]

    # Fetch the analysis result for that check.
    r2 = httpx.get(f"{BASE_URL}/result/{check_id}", headers=HEADERS, timeout=30)
    r2.raise_for_status()
    result = r2.json()

    # Evidence of editing after creation is always a rejection.
    if result["status"] == "modified":
        return {"action": "reject", "reason": "post_creation_modification", "check_id": check_id}

    # For institutional document types, inconclusive is itself a fraud signal.
    if result["status"] == "inconclusive" and doc_type in INSTITUTIONAL_DOC_TYPES:
        producer = (result.get("producer") or "").lower()
        is_consumer_origin = any(tool in producer for tool in CONSUMER_PRODUCERS)
        reason = "ai_or_consumer_origin" if is_consumer_origin else "missing_institutional_metadata"
        return {"action": "reject", "reason": reason, "check_id": check_id}

    return {"action": "accept", "check_id": check_id}

Being clear about the limits of this approach matters. Overstating what structural forensics catches creates false confidence.

Printed and re-scanned AI documents. If someone generates a PDF with an AI tool, prints it, and scans it back to PDF, the structural fingerprints are gone. The scanner produces a new PDF, with its own producer and its own structure, containing image pages. The analysis will return inconclusive (scanned origin), which is technically correct but loses the AI-rendering signal. This is a known limitation and requires a different layer: image quality analysis, font rendering artifact detection, or manual review.

Sophisticated producer spoofing. The Producer field is a plain text string. A determined attacker who knows the detection approach can hardcode a string like Adobe PDF Library 15.0 or Oracle PDF Renderer into their fake document generator. This would defeat producer-based detection. Countering it requires checking multiple structural signals together (object numbering patterns, font embedding methods, XMP metadata consistency) rather than relying on the producer string alone. HTPBE runs multiple analysis layers, but a sophisticated attacker who specifically targets the detection system can evade individual signals.
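One of those signals, XMP metadata consistency, can be sketched with the standard library alone. The example assumes the Info-dictionary Producer and the raw XMP packet have already been extracted from the file by some other means. The function names are hypothetical, and this is not a description of how HTPBE implements the check.

```python
import html
import re
from typing import Optional

# XMP can carry the producer as an element or as an attribute on rdf:Description.
_XMP_PRODUCER = re.compile(
    r'<pdf:Producer>(.*?)</pdf:Producer>|pdf:Producer="(.*?)"', re.DOTALL
)


def xmp_producer(xmp_packet: str) -> Optional[str]:
    """Pull the pdf:Producer value out of an XMP packet, if one is present."""
    match = _XMP_PRODUCER.search(xmp_packet)
    if match is None:
        return None
    raw = match.group(1) if match.group(1) is not None else match.group(2)
    value = html.unescape(raw).strip()
    return value or None


def producers_agree(info_producer: Optional[str], xmp_packet: Optional[str]) -> Optional[bool]:
    """True or False when both values exist, None when there is nothing to compare."""
    xmp_value = xmp_producer(xmp_packet) if xmp_packet else None
    if not info_producer or not xmp_value:
        return None
    return info_producer.strip() == xmp_value
```

A spoofed Info string paired with an XMP packet written by the real renderer would return False here. Many generators write no XMP at all, so None is a common and uninformative outcome, which is why a check like this only helps in combination with others.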

AI text pasted into Word then exported to PDF. If someone uses an AI to write text, pastes it into Microsoft Word, and exports to PDF, the resulting file looks like any Word-to-PDF export. The origin is consumer (Word), which is inconclusive but not alarming on its own for documents expected to come from Word. This case requires content-layer analysis.

Documents generated by the same software as legitimate issuers. If a fraudster gains access to Sage Payroll, generates a payslip for a fake employee, and exports it, the structural signals will look legitimate. The file came from the right software. Detecting this requires checking the content with the issuer — structural forensics alone cannot distinguish a real Sage payslip from a fraudulent one generated on a compromised Sage account.

No single layer catches everything. The approach that covers the most ground combines:

Structural forensics (HTPBE) handles the file layer: modified documents, consumer-origin documents submitted as institutional, headless-browser renders, and PDF-library-generated fakes. This runs first — it is fast, cost-effective, and catches the majority of operational fraud. See the AI-generated document detection page for a complete breakdown of what the file layer covers.

Content classification (GPTZero, Originality.ai, or a fine-tuned classifier for your document type) handles the text layer: detecting AI-written prose in documents where the writing itself is the fraud signal — reference letters, employment checks, academic submissions.

Issuer verification handles the ground-truth layer: contacting the bank, payroll provider, or issuing authority to confirm the document was actually issued. This is costly at scale but appropriate for high-value decisions.
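The three layers can be chained so that the cheap check runs first and the costly one runs last. The sketch below is one possible arrangement under stated assumptions: every field name on the submission dict, the 0.9 score cutoff, and the layer functions themselves are hypothetical stand-ins for real integrations.

```python
from typing import Callable, Optional

# A layer inspects a submission and returns a rejection reason, or None to pass.
Layer = Callable[[dict], Optional[str]]


def structural_layer(doc: dict) -> Optional[str]:
    # Stands in for the file-layer verdict, for example an HTPBE status.
    if doc.get("structural_status") == "modified":
        return "post_creation_modification"
    if doc.get("structural_status") == "inconclusive" and doc.get("claims_institutional_origin"):
        return "non_institutional_origin"
    return None


def content_layer(doc: dict) -> Optional[str]:
    # Stands in for a text classifier score in [0, 1]; 0.9 is an arbitrary cutoff.
    if doc.get("text_is_the_evidence") and doc.get("ai_text_score", 0.0) >= 0.9:
        return "ai_written_text"
    return None


def issuer_layer(doc: dict) -> Optional[str]:
    # Only high-value decisions pay for confirmation with the issuer.
    if doc.get("high_value") and not doc.get("issuer_confirmed"):
        return "issuer_could_not_confirm"
    return None


LAYERS = [("structural", structural_layer), ("content", content_layer), ("issuer", issuer_layer)]


def run_layers(doc: dict, layers=LAYERS) -> dict:
    """Run the layers in order and stop at the first rejection."""
    for name, check in layers:
        reason = check(doc)
        if reason is not None:
            return {"action": "reject", "layer": name, "reason": reason}
    return {"action": "accept", "layer": None, "reason": None}
```

Stopping at the first rejection means a file that fails the structural check never reaches the classifier or the issuer.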

The practical sequence for a lending or HR platform processing document submissions starts with the structural check: reject modified results and inconclusive results with a consumer producer for institutional document types. This eliminates the majority of fraudulent submissions without manual effort.

Accounts payable teams processing invoice and receipt submissions from vendors or employees: the primary AI fraud vector is fabricated receipts and invoices generated by AI tools. Structural forensics catches headless-browser and PDF-library renders before they enter the approval queue.

HR platforms and background check providers: AI-generated reference letters, diploma supplements, and employment verification documents are increasingly common. Producer field analysis alone is not sufficient here (the text also matters), but it catches the lowest-effort fabrications: documents rendered by the wrong software for their claimed origin.

Insurance claims operations: repair estimates, medical bills, and supporting documentation submitted by claimants are a high-fraud category. AI tools reduce the effort required to fabricate a plausible-looking estimate. Structural forensics identifies documents that did not come from the claimed issuer’s systems.

Lending and fintech compliance teams: bank statements and payslips are the most-targeted document types. The structural check is a necessary first layer before any income or asset fraud-detection workflow. See the PDF authenticity API documentation and pricing.

source & further reading

dev.to: original article