{"slug": "detect-ai-generated-pdfs-what-works-and-what-does-not", "title": "Detect AI-Generated PDFs: What Works and What Does Not", "summary": "HTPBE, a structural forensics tool, distinguishes between content classification and structural analysis for detecting AI-generated PDFs. It identifies documents rendered by headless browsers, Python libraries, or other non-institutional software by examining binary file structures like producer metadata and font embedding. The tool catches mismatches between claimed document types and actual rendering software, but does not classify text content.", "body_md": "Originally published at htpbe.tech. The version on htpbe.tech stays in sync with the latest detection algorithm; refer to it for the canonical text.\n\nAccounts payable teams are receiving receipts generated by ChatGPT plugins. HR platforms are seeing payslips rendered by Python scripts. Insurance claims contain repair estimates that no shop ever issued. The documents look correct. The logos match. The numbers are plausible.\n\nThe question is: what can actually be detected, and what cannot?\n\nThe honest answer requires separating two things that are often confused under the phrase “AI-generated document detection.”\n\nWhen people ask how to detect an AI-generated document, they usually mean one of two distinct things:\n\n**Content classification** asks: was the text in this document written by an AI language model? This is what tools like GPTZero and Turnitin’s AI detector do. They analyze writing style, token probability distributions, and linguistic patterns to estimate whether a human or a model produced the text.\n\n**Structural forensics** asks: was this PDF file generated by a real institutional system, or did it come from a headless browser, a PDF library, or a consumer tool? This is what HTPBE does. 
It reads the binary structure of the file (producer metadata, xref patterns, font embedding, object numbering) and checks whether those patterns match how legitimate institutional software generates documents.\n\nThese are not the same problem. A document can contain AI-written text and still come from a real corporate system. A document can contain entirely human-written text and still have been rendered by Puppeteer an hour ago. The structural check and the content check answer different questions.\n\nHTPBE does structural forensics. It does not classify text. This article explains what that distinction means in practice, what the structural approach reliably catches, and where its limits are.\n\nWhen an AI tool generates a PDF, it must render that PDF using some software. The rendering layer almost always leaves a producer fingerprint.\n\nThe most common rendering paths for AI-generated documents in fraud scenarios:\n\n**Headless browsers** (Chrome Headless, Puppeteer, Playwright) are used when a fraudster builds an HTML template, often copied from a legitimate document they scanned or photographed, and renders it to PDF using a browser. Chrome Headless has a characteristic producer string: `Chromium`, `Chrome`, or a Puppeteer-generated variant that typically includes the Chrome version. These strings are recognizable and are cross-referenced against known institutional producers.\n\n**Python and Node.js PDF libraries** (ReportLab, PDFKit, jsPDF, fpdf2, WeasyPrint) are used when someone generates a document programmatically, either directly or as part of an AI tool’s export pipeline. ReportLab’s producer string is `ReportLab PDF Library`. PDFKit’s is `PDFKit`. jsPDF writes `jsPDF`. None of these strings appear in documents genuinely issued by banks, payroll processors, or insurance carriers.\n\n**wkhtmltopdf** is an older HTML-to-PDF tool that remains common in automated document generation pipelines. 
Its producer string is `wkhtmltopdf`.\n\n**Online “AI document generators”** that export to PDF typically use one of the tools above internally. The producer field reflects the underlying renderer, not the AI layer on top.\n\nWhen HTPBE analyzes a submitted PDF, it compares the `Producer` field against a database of known institutional generators: the software that real banks, payroll platforms, accounting systems, and government agencies use to produce documents. A mismatch between the claimed document type and the actual rendering software is a fraud signal.\n\nA payslip generated by ReportLab does not look like a payslip generated by Sage Payroll or ADP Workforce Now at the structural level. Both may look identical visually. The binary layer tells a different story.\n\nBelow is a real API response for a payslip submitted to a lending platform. The file was generated by a Puppeteer-based AI document tool and submitted as proof of income.\n\n```\n{\n  \"status\": \"inconclusive\",\n  \"modification_confidence\": \"none\",\n  \"modification_markers\": [],\n  \"creator\": null,\n  \"producer\": \"Chromium (Chrome 124.0)\",\n  \"origin\": { \"type\": \"consumer_software\", \"software\": null },\n  \"creation_date\": null,\n  \"modification_date\": null,\n  \"xref_count\": 1\n}\n```\n\nThe verdict is `inconclusive`, not `modified`. There is no evidence this file was edited after creation, because it was never edited. It was created in its current form, in a single render pass, by a headless browser.\n\nThe `producer` field is `Chromium (Chrome 124.0)`. A payslip from a real employer does not come from a headless Chrome instance. The `origin.type` is `consumer_software`. `creation_date` is null because Puppeteer does not set it by default.\n\nThis is the correct interpretation of `inconclusive` in an AI fraud context: the document shows no modification markers because it was never a real document that was later modified. 
It was fabricated from nothing. The absence of institutional metadata is itself the signal.\n\n`inconclusive` from HTPBE means: this document was created by consumer or non-institutional software, and we cannot determine whether it was modified after creation because there is no institutional baseline to compare against.\n\nFor user-generated documents (cover letters, personal statements, forms the applicant completed themselves), `inconclusive` is expected and is not a fraud signal. A person who writes their cover letter in Google Docs and exports it to PDF will produce an `inconclusive` result. That is correct behavior.\n\nFor documents that claim institutional origin, `inconclusive` is a strong fraud signal. The reasoning: banks, payroll processors, and insurance carriers issue documents from their own systems, so a file rendered by consumer or non-institutional software is unlikely to be what it claims to be.\n\nIf your workflow receives documents that claim to be bank statements, payslips, or official certificates, and those documents return `inconclusive` with a consumer or headless-browser producer, do not accept them. The document’s own metadata contradicts its claimed origin.\n\n```python\nimport os\nimport httpx\n\nAPI_KEY = os.environ[\"HTPBE_API_KEY\"]\nBASE_URL = \"https://api.htpbe.tech/v1\"\nHEADERS = {\"Authorization\": f\"Bearer {API_KEY}\"}\n\nINSTITUTIONAL_DOC_TYPES = {\"bank_statement\", \"payslip\", \"tax_certificate\", \"insurance_policy\"}\n\nCONSUMER_PRODUCERS = {\n    \"chromium\", \"chrome\", \"puppeteer\", \"playwright\",\n    \"reportlab\", \"pdfkit\", \"jspdf\", \"fpdf\", \"wkhtmltopdf\",\n    \"weasyprint\",\n}\n\ndef verify_document(pdf_url: str, doc_type: str) -> dict:\n    # Step 1: submit\n    r = httpx.post(f\"{BASE_URL}/analyze\", headers=HEADERS, json={\"url\": pdf_url}, timeout=30)\n    r.raise_for_status()\n    check_id = r.json()[\"id\"]\n\n    # Step 2: retrieve\n    r2 = httpx.get(f\"{BASE_URL}/result/{check_id}\", headers=HEADERS, timeout=30)\n    r2.raise_for_status()\n    result = r2.json()\n\n    # Route based on verdict + document type\n    if result[\"status\"] == \"modified\":\n        
return {\"action\": \"reject\", \"reason\": \"post_creation_modification\", \"check_id\": check_id}\n\n    if result[\"status\"] == \"inconclusive\" and doc_type in INSTITUTIONAL_DOC_TYPES:\n        producer = (result.get(\"producer\") or \"\").lower()\n        is_consumer_origin = any(tool in producer for tool in CONSUMER_PRODUCERS)\n        reason = \"ai_or_consumer_origin\" if is_consumer_origin else \"missing_institutional_metadata\"\n        return {\"action\": \"reject\", \"reason\": reason, \"check_id\": check_id}\n\n    return {\"action\": \"accept\", \"check_id\": check_id}\n```\n\nBeing clear about the limits of this approach matters. Overstating what structural forensics catches creates false confidence.\n\n**Printed and re-scanned AI documents.** If someone generates a PDF with an AI tool, prints it, and scans it back to PDF, the structural fingerprints are gone. The scanner produces a new PDF, with its own producer and its own structure, containing image pages. The analysis will return `inconclusive` (scanned origin), which is technically correct but loses the AI-rendering signal. This is a known limitation and requires a different layer: image quality analysis, font rendering artifact detection, or manual review.\n\n**Sophisticated producer spoofing.** The `Producer` field is a plain text string. A determined attacker who knows the detection approach can hardcode a string like `Adobe PDF Library 15.0` or `Oracle PDF Renderer` into their fake document generator. This would defeat producer-based detection. Countering it requires checking multiple structural signals together (object numbering patterns, font embedding methods, XMP metadata consistency) rather than relying on the producer string alone. 
HTPBE runs multiple analysis layers, but a sophisticated attacker who specifically targets the detection system can evade individual signals.\n\n**AI text pasted into Word then exported to PDF.** If someone uses an AI to write text, pastes it into Microsoft Word, and exports to PDF, the resulting file looks like any Word-to-PDF export. The origin is `consumer_software` (Word), which is `inconclusive` but not alarming on its own for documents expected to come from Word. This case requires content-layer analysis.\n\n**Documents generated by the same software as legitimate issuers.** If a fraudster gains access to Sage Payroll, generates a payslip for a fake employee, and exports it, the structural signals will look legitimate. The file came from the right software. Detecting this requires checking the content with the issuer; structural forensics alone cannot distinguish a real Sage payslip from a fraudulent one generated on a compromised Sage account.\n\nNo single layer catches everything. The approach that covers the most ground combines:\n\n**Structural forensics (HTPBE)** handles the file layer: modified documents, consumer-origin documents submitted as institutional, headless-browser renders, and PDF-library-generated fakes. This runs first: it is fast, cost-effective, and catches the majority of operational fraud. See the [AI-generated document detection](https://htpbe.tech/use-cases/ai-generated-document-detection) page for a complete breakdown of what the file layer covers.\n\n**Content classification** (GPTZero, Originality.ai, or a fine-tuned classifier for your document type) handles the text layer: detecting AI-written prose in documents where the writing itself is the fraud signal, such as reference letters, employment checks, and academic submissions.\n\n**Issuer verification** handles the ground-truth layer: contacting the bank, payroll provider, or issuing authority to confirm the document was actually issued. 
This is costly at scale but appropriate for high-value decisions.\n\nThe practical sequence for a lending or HR platform processing document submissions starts with the structural check: reject `modified` results and `inconclusive` results with a consumer producer for institutional document types. This eliminates the majority of fraudulent submissions without manual effort.\n\n**Accounts payable teams** processing invoice and receipt submissions from vendors or employees: the primary AI fraud vector is fabricated receipts and invoices generated by AI tools. Structural fraud detection catches headless-browser and PDF-library renders before they enter the approval queue.\n\n**HR platforms and background check providers**: AI-generated reference letters, diploma supplements, and employment verification documents are increasingly common. `producer` field analysis alone is not sufficient here (the text also matters), but it catches the lowest-effort fabrications: documents rendered by the wrong software for their claimed origin.\n\n**Insurance claims operations**: repair estimates, medical bills, and supporting documentation submitted by claimants are a high-fraud category. AI tools reduce the effort required to fabricate a plausible-looking estimate. Structural forensics identifies documents that did not come from the claimed issuer’s systems.\n\n**Lending and fintech compliance teams**: bank statements and payslips are the most-targeted document types. The structural check is a necessary first layer before any income or asset verification workflow. 
See the [PDF authenticity API documentation and pricing](https://htpbe.tech/api).", "url": "https://wpnews.pro/news/detect-ai-generated-pdfs-what-works-and-what-does-not", "canonical_source": "https://dev.to/iurii_rogulia/detect-ai-generated-pdfs-what-works-and-what-does-not-1efg", "published_at": "2026-06-22 10:00:40+00:00", "updated_at": "2026-06-22 10:09:50.438216+00:00", "lang": "en", "topics": ["ai-safety", "ai-tools", "developer-tools"], "entities": ["HTPBE", "GPTZero", "Turnitin", "Chrome Headless", "Puppeteer", "Playwright", "ReportLab", "PDFKit"], "alternates": {"html": "https://wpnews.pro/news/detect-ai-generated-pdfs-what-works-and-what-does-not", "markdown": "https://wpnews.pro/news/detect-ai-generated-pdfs-what-works-and-what-does-not.md", "text": "https://wpnews.pro/news/detect-ai-generated-pdfs-what-works-and-what-does-not.txt", "jsonld": "https://wpnews.pro/news/detect-ai-generated-pdfs-what-works-and-what-does-not.jsonld"}}