{"slug": "how-to-automate-file-renaming-with-ai-and-ocr", "title": "How to Automate File Renaming with AI and OCR", "summary": "A developer has built a pipeline that automates file renaming by reading the actual content inside documents and images rather than relying on metadata or manual naming conventions. The system uses OCR for text-heavy files like PDFs and invoices, and a vision model for photos and general images, then feeds extracted data into a structured filename template. The approach solves the common problem of meaningless filenames like `scan_001.pdf` or `IMG_4382.jpg` that break search and downstream automation.", "body_md": "*Give your files names that actually describe what is inside them.*\n\n**TL;DR**\n\nOpen a folder of scanned documents and you will almost certainly find the same pattern: `scan_001.pdf`, `document_final_v3.docx`, `IMG_4382.jpg`. The names tell you nothing. Search is broken. Downstream scripts that pattern-match on filenames fail silently. Someone on the team ends up doing the renaming manually, which works fine at ten files a week and falls apart at a hundred.\n\nA naming convention helps, but only if everyone follows it. Getting a team to consistently name files `{type}_{vendor}_{date}_{id}` requires discipline that degrades under deadline pressure.\n\nThe better approach is to read what is inside each file and generate the name from that content. This tutorial walks you through building that pipeline.\n\nContent-aware renaming is not about reading file metadata. Metadata fields like creation date or the author field in an EXIF block are often wrong, empty, or filled with camera defaults that tell you nothing useful.\n\nReading the document's actual content requires two different approaches depending on what you are working with:\n\n**Text documents** (PDFs, scanned invoices, contracts): OCR converts the rendered pixels or embedded text into a raw string. 
An LLM then pulls structured fields out of that string: document type, vendor name, date, reference number.\n\n**Photos and general images** (receipts photographed on a phone, ID cards, site photos): OCR returns sparse or useless text on these. A vision model describes what is actually in the frame. The same LLM extraction step then processes that description.\n\nBoth paths produce the same output: a small dictionary of fields you feed into a filename template.\n\n**Pipeline Architecture**\n\nHere is the full flow in plain text:\n\n```\nInput file\n    │\n    ├─ PDF or text-heavy image\n    │       └─ extract_text()    OCR or embedded text extraction\n    │               └─ extract_fields()    LLM structured prompt\n    │\n    └─ Photo / general image\n            └─ describe_image()    vision model\n                    └─ extract_fields()    same LLM step\n    │\n    └─ build_filename()    template + sanitize\n            └─ rename file on disk\n```\n\nThe two content paths share the field extraction and naming steps. Only the ingestion step differs based on what kind of file you are processing.\n\nFor PDFs with embedded text, **pdfplumber** is faster and more accurate than rendering pages to images and running OCR on them. For scanned PDFs and standalone image files, **pytesseract** wraps Tesseract and handles most common formats.\n\n``` python\nimport pdfplumber\nimport pytesseract\nfrom PIL import Image\nfrom pathlib import Path\n\ndef extract_text(file_path: str) -> str:\n    \"\"\"\n    Extract text from a PDF (embedded or scanned) or image file.\n    Returns raw text string. 
Returns empty string on failure.\n    \"\"\"\n    path = Path(file_path)\n    suffix = path.suffix.lower()\n\n    if suffix == \".pdf\":\n        # Try embedded text first: faster and more accurate\n        try:\n            with pdfplumber.open(file_path) as pdf:\n                pages = [page.extract_text() or \"\" for page in pdf.pages]\n                combined = \"\\n\".join(pages).strip()\n                if len(combined) > 50:\n                    return combined\n        except Exception:\n            pass\n\n        # Fall back to OCR on the first page\n        try:\n            import pdf2image\n            images = pdf2image.convert_from_path(\n                file_path, first_page=1, last_page=1\n            )\n            if images:\n                return pytesseract.image_to_string(images[0])\n        except Exception:\n            return \"\"\n\n    elif suffix in {\".jpg\", \".jpeg\", \".png\", \".tiff\", \".bmp\"}:\n        try:\n            img = Image.open(file_path)\n            return pytesseract.image_to_string(img)\n        except Exception:\n            return \"\"\n\n    return \"\"\n```\n\n**Install:** `pip install pdfplumber pytesseract pdf2image Pillow`\n\nTesseract itself requires a system install. On macOS: `brew install tesseract`. On Ubuntu or Debian: `apt install tesseract-ocr`. The OCR fallback uses `pdf2image`, which also needs Poppler installed: `brew install poppler` on macOS, `apt install poppler-utils` on Ubuntu or Debian.\n\nRaw OCR output is noisy. Whitespace errors, page headers, footer fragments, and encoding artifacts all land in the string. An LLM handles this noise well if your prompt is tight and returns structured JSON.\n\n``` python\nimport json\nimport openai\n\nclient = openai.OpenAI()  # reads OPENAI_API_KEY from environment\n\nEXTRACTION_PROMPT = \"\"\"\nYou are a document classifier. Given raw OCR text, extract exactly these fields as JSON:\n\n- \"doc_type\": one of: invoice, contract, receipt, report, id_card, other\n- \"vendor_or_party\": the main company or person name. Use \"unknown\" if not found.\n- \"date\": the primary date in YYYY-MM-DD format. 
Use \"unknown\" if not found.\n- \"identifier\": invoice number, contract ID, or reference number. Use \"unknown\" if not found.\n\nReturn ONLY valid JSON with these four keys. No explanation. No markdown fences.\n\nOCR text:\n{text}\n\"\"\"\n\ndef extract_fields(raw_text: str) -> dict:\n    \"\"\"\n    Send OCR text to GPT and return structured filename fields.\n    Falls back to a safe default dict on any parse failure.\n    \"\"\"\n    fallback = {\n        \"doc_type\": \"file\",\n        \"vendor_or_party\": \"unknown\",\n        \"date\": \"unknown\",\n        \"identifier\": \"unknown\",\n    }\n\n    if not raw_text.strip():\n        return fallback\n\n    prompt = EXTRACTION_PROMPT.format(text=raw_text[:3000])\n\n    try:\n        response = client.chat.completions.create(\n            model=\"gpt-4o-mini\",\n            messages=[{\"role\": \"user\", \"content\": prompt}],\n            temperature=0,\n        )\n        content = response.choices[0].message.content.strip()\n        result = json.loads(content)\n        for key in fallback:\n            if key not in result:\n                result[key] = \"unknown\"\n        return result\n    except Exception:\n        return fallback\n```\n\n`temperature=0` keeps output as consistent as possible across repeated runs on the same document, although it does not strictly guarantee identical results. The 3,000-character cap limits token spend. For most invoices and contracts, the fields you need appear on the first page, well within that cap.\n\nIf you prefer Anthropic's API, the swap is small. Replace `openai.OpenAI()` with `anthropic.Anthropic()` and use `client.messages.create(model=\"claude-3-haiku-20240307\", ...)`. That call requires a `max_tokens` argument, and the reply text is read from `response.content[0].text` instead of `response.choices[0].message.content`. The prompt itself works without changes.\n\nWith structured fields in hand, building the filename is a sanitize-and-join operation. 
The two things that will trip you up are illegal filesystem characters and name collisions.\n\n``` python\nimport re\nimport hashlib\n\ndef sanitize(value: str) -> str:\n    \"\"\"Strip illegal filesystem characters and collapse whitespace.\"\"\"\n    # Tolerate None or non-string values (e.g. a JSON null from the LLM)\n    value = str(value or \"\").strip()\n    value = re.sub(r'[<>:\"/\\\\|?*\\x00-\\x1f]', \"\", value)\n    value = re.sub(r\"\\s+\", \"_\", value)\n    return value[:50]\n\ndef build_filename(fields: dict, original_path: str) -> str:\n    \"\"\"\n    Assemble a descriptive filename from extracted fields.\n    Format: {doc_type}_{vendor}_{date}_{identifier}{ext}\n    Example: invoice_Acme_Corp_2024-03-15_INV-0042.pdf\n    Unknown fields are dropped. Falls back to original stem + hash on total failure.\n    \"\"\"\n    ext = Path(original_path).suffix.lower()\n\n    parts = [\n        sanitize(fields.get(\"doc_type\", \"file\")),\n        sanitize(fields.get(\"vendor_or_party\", \"\")),\n        sanitize(fields.get(\"date\", \"\")),\n    ]\n\n    identifier = sanitize(fields.get(\"identifier\", \"\"))\n    if identifier and identifier.lower() != \"unknown\":\n        parts.append(identifier)\n\n    clean = [p for p in parts if p and p.lower() != \"unknown\"]\n\n    if not clean:\n        stem = Path(original_path).stem\n        h = hashlib.md5(stem.encode()).hexdigest()[:6]\n        clean = [sanitize(stem), h]\n\n    return \"_\".join(clean) + ext\n\ndef rename_file(original_path: str, dry_run: bool = True) -> str:\n    \"\"\"\n    Full text-document pipeline. 
Set dry_run=False to rename on disk.\n    \"\"\"\n    raw_text = extract_text(original_path)\n    fields = extract_fields(raw_text)\n    new_name = build_filename(fields, original_path)\n\n    original = Path(original_path)\n    new_path = original.parent / new_name\n\n    if dry_run:\n        print(f\"  [dry run] {original.name} -> {new_name}\")\n    else:\n        original.rename(new_path)\n        print(f\"  renamed:  {original.name} -> {new_name}\")\n\n    return str(new_path)\n```\n\nRun with `dry_run=True` first. Print the full preview list, spot-check a few filenames against the actual documents, then re-run with `dry_run=False` once you are satisfied.\n\nFor photos, OCR returns little or nothing useful. A vision model can describe scene content and read visible text in context, which is exactly what you need for a photograph of a receipt or a scanned ID card.\n\n``` python\nimport base64\n\ndef describe_image(file_path: str) -> str:\n    \"\"\"\n    Use GPT-4o vision to describe an image for filename generation.\n    Returns a plain-text description that extract_fields() can process.\n    \"\"\"\n    path = Path(file_path)\n    with open(file_path, \"rb\") as f:\n        image_data = base64.b64encode(f.read()).decode(\"utf-8\")\n\n    ext = path.suffix.lower().lstrip(\".\")\n    mime = \"image/jpeg\" if ext in (\"jpg\", \"jpeg\") else f\"image/{ext}\"\n\n    try:\n        response = client.chat.completions.create(\n            model=\"gpt-4o\",\n            messages=[\n                {\n                    \"role\": \"user\",\n                    \"content\": [\n                        {\n                            \"type\": \"text\",\n                            \"text\": (\n                                \"Describe this image in 2-3 sentences for the purpose of \"\n                                \"generating a descriptive filename. 
Include: the document \"\n                                \"or scene type, any visible names, dates, or reference \"\n                                \"numbers, and the primary subject. Be specific.\"\n                            ),\n                        },\n                        {\n                            \"type\": \"image_url\",\n                            \"image_url\": {\n                                \"url\": f\"data:{mime};base64,{image_data}\",\n                                \"detail\": \"low\",\n                            },\n                        },\n                    ],\n                }\n            ],\n            max_tokens=200,\n        )\n        return response.choices[0].message.content.strip()\n    except Exception:\n        return \"\"\n\ndef rename_photo(file_path: str, dry_run: bool = True) -> str:\n    \"\"\"Photo pipeline: vision description -> field extraction -> rename.\"\"\"\n    description = describe_image(file_path)\n    fields = extract_fields(description)\n    new_name = build_filename(fields, file_path)\n\n    original = Path(file_path)\n    new_path = original.parent / new_name\n\n    if dry_run:\n        print(f\"  [dry run] {original.name} -> {new_name}\")\n    else:\n        original.rename(new_path)\n        print(f\"  renamed:  {original.name} -> {new_name}\")\n\n    return str(new_path)\n```\n\nThe `\"detail\": \"low\"` setting on the image URL cuts vision API costs by roughly 75% compared to the default. You do not need high-resolution analysis to figure out that a photo contains an Office Depot receipt dated February 28.\n\n**Low-quality scans.** Tesseract accuracy drops quickly below 150 DPI. Before passing a scanned image to pytesseract, convert it to grayscale with `img.convert(\"L\")`, apply `ImageFilter.SHARPEN`, and scale up to 300 DPI if the image is smaller. 
Even basic preprocessing recovers meaningful accuracy on borderline scans.\n\n**Multi-language documents.** Tesseract defaults to English. If your pipeline processes documents in French, German, or Japanese, pass the language code explicitly: `pytesseract.image_to_string(img, lang=\"fra\")`. Documents that mix languages are harder to handle reliably and often need a detection step first. The `langdetect` library covers this well.\n\n**Handwriting.** Tesseract handles printed text reliably. Cursive or informal handwriting produces garbage output. For those files, skip OCR entirely and route directly to the vision path, which handles handwriting considerably better (though with some inconsistency on complex script).\n\n**All-unknown extractions.** When OCR returns junk and the LLM cannot extract any real fields, your pipeline produces a useless name. Track these cases: write them to a review list instead of renaming, so a human can handle the outliers without the rest of the batch stalling.\n\n**Name collisions.** Two invoices from the same vendor on the same date will produce identical filenames. The `build_filename` function above uses a hash fallback on empty fields. You can extend it to also check whether the target path already exists and append a counter when it does.\n\nThe pipeline above is around 130 lines. API costs are low: `gpt-4o-mini` extraction runs under $0.01 per document at typical invoice length. Vision calls for photos run a bit higher (around $0.01 to $0.03 each with `detail: low`), but that is manageable at moderate volume.\n\nThe maintenance cost is higher than it looks at first. Preprocessing edge cases, handling new document types, managing API key rotation, and wiring in a watch folder or webhook all take real time. 
If you are automating your own personal workflow, the code above is a solid foundation to build on.\n\nIf you are deploying this for a team or integrating it into a document management system, dedicated tools are worth a look. [renamer.ai](https://renamer.ai/) handles the OCR and vision path with a REST API and takes the maintenance overhead off your plate. Filebot is strong for media libraries with rule-based naming. AWS Textract makes sense if you are already on AWS and processing at high volume. The right call depends on whether owning the infrastructure is an asset or a cost for your situation.\n\nA content-aware renaming pipeline that:\n\nStart with dry-run mode on a folder of test files. Review the preview output against what you actually expected. Once the extraction is working the way you want, flip to live mode and process in batches.\n\nThe four functions here are designed to stay independent, so you can swap in a different OCR library, point at a different model, or change the filename format without touching the rest.", "url": "https://wpnews.pro/news/how-to-automate-file-renaming-with-ai-and-ocr", "canonical_source": "https://dev.to/tighnarizerda/how-to-automate-file-renaming-with-ai-and-ocr-1hd4", "published_at": "2026-05-27 11:25:33+00:00", "updated_at": "2026-05-27 11:40:42.429681+00:00", "lang": "en", "topics": ["artificial-intelligence", "computer-vision", "natural-language-processing", "ai-tools", "large-language-models"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/how-to-automate-file-renaming-with-ai-and-ocr", "markdown": "https://wpnews.pro/news/how-to-automate-file-renaming-with-ai-and-ocr.md", "text": "https://wpnews.pro/news/how-to-automate-file-renaming-with-ai-and-ocr.txt", "jsonld": "https://wpnews.pro/news/how-to-automate-file-renaming-with-ai-and-ocr.jsonld"}}