How to Automate File Renaming with AI and OCR

A developer has built a pipeline that automates file renaming by reading the actual content inside documents and images rather than relying on metadata or manual naming conventions. The system uses OCR for text-heavy files like PDFs and invoices, and a vision model for photos and general images, then feeds extracted data into a structured filename template. The approach solves the common problem of meaningless filenames like `scan_001.pdf` or `IMG_4382.jpg` that break search and downstream automation.

Give your files names that actually describe what is inside them.

Open a folder of scanned documents and you will almost certainly find the same pattern: `scan_001.pdf`, `document_final_v3.docx`, `IMG_4382.jpg`. The names tell you nothing. Search is broken. Downstream scripts that pattern-match on filenames fail silently. Someone on the team ends up doing the renaming manually, which works fine at ten files a week and falls apart at a hundred.

A naming convention helps, but only if everyone follows it. Getting a team to consistently name files `{type}_{vendor}_{date}_{id}` requires discipline that degrades under deadline pressure. The better approach is to read what is inside each file and generate the name from that content. This tutorial walks you through building that pipeline.

Content-aware renaming is not about reading file metadata. Metadata fields like creation date or the author field in an EXIF block are often wrong, empty, or filled with camera defaults that tell you nothing useful. Reading the document's actual content requires two different approaches depending on what you are working with:

- Text documents (PDFs, scanned invoices, contracts): OCR converts the rendered pixels or embedded text into a raw string. An LLM then pulls structured fields out of that string: document type, vendor name, date, reference number.
- Photos and general images (receipts photographed on a phone, ID cards, site photos): OCR returns sparse or useless text on these. A vision model describes what is actually in the frame. The same LLM extraction step then processes that description.

Both paths produce the same output: a small dictionary of fields you feed into a filename template.
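To make that shared output concrete, here is a minimal sketch of the field dictionary both paths are expected to hand to the naming step. `FilenameFields` and `normalize_fields` are hypothetical names introduced only for illustration; the four keys follow the extraction prompt used later in this tutorial.

```python
from typing import TypedDict


class FilenameFields(TypedDict):
    """Hypothetical type for the dictionary both content paths produce."""
    doc_type: str
    vendor_or_party: str
    date: str
    identifier: str


REQUIRED_KEYS = ("doc_type", "vendor_or_party", "date", "identifier")


def normalize_fields(raw: dict) -> FilenameFields:
    """Coerce an arbitrary dict into the four expected string fields.

    Missing, null, or empty values become "unknown" so later steps can drop them.
    """
    cleaned = {}
    for key in REQUIRED_KEYS:
        value = raw.get(key)
        text = str(value).strip() if value is not None else ""
        cleaned[key] = text or "unknown"
    return FilenameFields(**cleaned)
```

Running every extraction result through a helper like this means the filename builder never has to guess whether a key exists or holds a string.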
Pipeline Architecture

Here is the full flow in plain text:

```
Input file
│
├─ PDF or text-heavy image
│   └─ extract text (OCR or embedded text extraction)
│       └─ extract fields (LLM structured prompt)
│
└─ Photo / general image
    └─ describe image (vision model)
        └─ extract fields (same LLM step)
                │
                └─ build filename (template + sanitize)
                    └─ rename file on disk
```

The two content paths share the field extraction and naming steps. Only the ingestion step differs based on what kind of file you are processing.

For PDFs with embedded text, pdfplumber is faster and more accurate than rendering pages to images and running OCR on them. For scanned PDFs and standalone image files, pytesseract wraps Tesseract and handles most common formats.

```python
import pdfplumber
import pytesseract
from PIL import Image
from pathlib import Path


def extract_text(file_path: str) -> str:
    """
    Extract text from a PDF (embedded or scanned) or image file.
    Returns raw text string. Returns empty string on failure.
    """
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix == ".pdf":
        # Try embedded text first: faster and more accurate
        try:
            with pdfplumber.open(file_path) as pdf:
                pages = [page.extract_text() or "" for page in pdf.pages]
            combined = "\n".join(pages).strip()
            # Very short results usually mean a scanned PDF with no text layer
            if len(combined) > 50:
                return combined
        except Exception:
            pass

        # Fall back to OCR on the first page
        try:
            import pdf2image

            images = pdf2image.convert_from_path(
                file_path, first_page=1, last_page=1
            )
            if images:
                return pytesseract.image_to_string(images[0])
        except Exception:
            return ""

    elif suffix in {".jpg", ".jpeg", ".png", ".tiff", ".bmp"}:
        try:
            img = Image.open(file_path)
            return pytesseract.image_to_string(img)
        except Exception:
            return ""

    return ""
```

Install: `pip install pdfplumber pytesseract pdf2image Pillow`

Tesseract itself requires a system install. On macOS: `brew install tesseract`. On Ubuntu or Debian: `apt install tesseract-ocr`. The pdf2image fallback also depends on Poppler being installed (`brew install poppler` or `apt install poppler-utils`).

Raw OCR output is noisy. Whitespace errors, page headers, footer fragments, and encoding artifacts all land in the string.
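A light normalization pass before field extraction can trim some of that noise and avoid spending the character budget on it. The sketch below is an optional addition rather than part of the original pipeline: `clean_ocr_text` is a hypothetical helper, and the rules it applies (folding compatibility characters, dropping control characters, collapsing spaces, removing blank lines) are assumptions about what is safe to discard.

```python
import re
import unicodedata


def clean_ocr_text(raw_text: str) -> str:
    """Normalize raw OCR output without trying to interpret it."""
    # Fold compatibility characters (ligatures, full-width forms) to plain equivalents
    text = unicodedata.normalize("NFKC", raw_text)
    lines = []
    # splitlines() also breaks on form feeds, which Tesseract emits between pages
    for line in text.splitlines():
        # Drop remaining control characters (tabs are kept and collapsed below)
        line = re.sub(r"[\x00-\x08\x0e-\x1f\x7f]", "", line)
        # Collapse runs of spaces and tabs, then trim the edges
        line = re.sub(r"[ \t]+", " ", line).strip()
        if line:
            lines.append(line)
    return "\n".join(lines)
```

It deliberately leaves headers and footers alone, since telling them apart from real content is a job better left to the extraction model.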
An LLM handles this noise well if your prompt is tight and asks for structured JSON.

```python
import json
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from environment

EXTRACTION_PROMPT = """
You are a document classifier. Given raw OCR text, extract exactly these fields as JSON:

- "doc_type": one of: invoice, contract, receipt, report, id_card, other
- "vendor_or_party": the main company or person name. Use "unknown" if not found.
- "date": the primary date in YYYY-MM-DD format. Use "unknown" if not found.
- "identifier": invoice number, contract ID, or reference number. Use "unknown" if not found.

Return ONLY valid JSON with these four keys. No explanation. No markdown fences.

OCR text:
{text}
"""


def extract_fields(raw_text: str) -> dict:
    """
    Send OCR text to GPT and return structured filename fields.
    Falls back to a safe default dict on any parse failure.
    """
    fallback = {
        "doc_type": "file",
        "vendor_or_party": "unknown",
        "date": "unknown",
        "identifier": "unknown",
    }
    if not raw_text.strip():
        return fallback

    # Cap the input to limit token spend
    prompt = EXTRACTION_PROMPT.format(text=raw_text[:3000])
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        content = response.choices[0].message.content.strip()
        result = json.loads(content)
        # Valid JSON that is not an object (a list, a bare string) is unusable
        if not isinstance(result, dict):
            return fallback
        for key in fallback:
            if key not in result:
                result[key] = "unknown"
        return result
    except Exception:
        return fallback
```

`temperature=0` minimizes variation across repeated runs on the same document, although it does not guarantee identical output every time. The 3,000-character cap limits token spend. For most invoices and contracts, the fields you need appear on the first page, well within that cap.

If you prefer Anthropic's API, the swap is straightforward. Replace `openai.OpenAI()` with `anthropic.Anthropic()` and use `client.messages.create(model="claude-3-haiku-20240307", ...)`. Note that `max_tokens` is a required argument there and the reply text is read from `response.content[0].text`. The prompt itself works without changes.

With structured fields in hand, building the filename is a sanitize-and-join operation.
The two things that will trip you up are illegal filesystem characters and name collisions.

```python
import re
import hashlib
from pathlib import Path


def sanitize(value: str) -> str:
    """Collapse whitespace and strip illegal filesystem characters."""
    # Tolerate None or non-string values coming back from the LLM
    value = str(value or "").strip()
    # Collapse whitespace first so newlines and tabs become separators
    value = re.sub(r"\s+", "_", value)
    value = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", value)
    return value[:50]


def build_filename(fields: dict, original_path: str) -> str:
    """
    Assemble a descriptive filename from extracted fields.
    Format: {doc_type}_{vendor}_{date}_{identifier}{ext}
    Example: invoice_Acme_Corp_2024-03-15_INV-0042.pdf
    Unknown fields are dropped. Falls back to original stem + hash on total failure.
    """
    ext = Path(original_path).suffix.lower()
    parts = [
        sanitize(fields.get("doc_type", "file")),
        sanitize(fields.get("vendor_or_party", "")),
        sanitize(fields.get("date", "")),
    ]
    identifier = sanitize(fields.get("identifier", ""))
    if identifier and identifier.lower() != "unknown":
        parts.append(identifier)

    # Drop empty and "unknown" parts
    clean = [p for p in parts if p and p.lower() != "unknown"]
    if not clean:
        stem = Path(original_path).stem
        h = hashlib.md5(stem.encode()).hexdigest()[:6]
        clean = [sanitize(stem), h]

    return "_".join(clean) + ext


def rename_file(original_path: str, dry_run: bool = True) -> str:
    """
    Full text-document pipeline.
    Set dry_run=False to rename on disk.
    """
    raw_text = extract_text(original_path)
    fields = extract_fields(raw_text)
    new_name = build_filename(fields, original_path)

    original = Path(original_path)
    new_path = original.parent / new_name

    if dry_run:
        print(f"[dry run] {original.name} -> {new_name}")
    else:
        original.rename(new_path)
        print(f"renamed: {original.name} -> {new_name}")

    return str(new_path)
```

Run with `dry_run=True` first. Print the full preview list, spot-check a few filenames against the actual documents, then re-run with `dry_run=False` once you are satisfied.

For photos, OCR returns little or nothing useful. A vision model can describe scene content and read visible text in context, which is exactly what you need for a photograph of a receipt or a scanned ID card.
```python
import base64

# Path and client are the ones defined in the earlier snippets


def describe_image(file_path: str) -> str:
    """
    Use GPT-4o vision to describe an image for filename generation.
    Returns a plain-text description that extract_fields can process.
    """
    path = Path(file_path)
    with open(file_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    ext = path.suffix.lower().lstrip(".")
    mime = "image/jpeg" if ext in ("jpg", "jpeg") else f"image/{ext}"

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": (
                                "Describe this image in 2-3 sentences for the purpose of "
                                "generating a descriptive filename. Include: the document "
                                "or scene type, any visible names, dates, or reference "
                                "numbers, and the primary subject. Be specific."
                            ),
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:{mime};base64,{image_data}",
                                "detail": "low",
                            },
                        },
                    ],
                }
            ],
            max_tokens=200,
        )
        return response.choices[0].message.content.strip()
    except Exception:
        return ""


def rename_photo(file_path: str, dry_run: bool = True) -> str:
    """Photo pipeline: vision description -> field extraction -> rename."""
    description = describe_image(file_path)
    fields = extract_fields(description)
    new_name = build_filename(fields, file_path)

    original = Path(file_path)
    new_path = original.parent / new_name

    if dry_run:
        print(f"[dry run] {original.name} -> {new_name}")
    else:
        original.rename(new_path)
        print(f"renamed: {original.name} -> {new_name}")

    return str(new_path)
```

The `"detail": "low"` setting on the image URL cuts vision API costs by roughly 75% compared to the default. You do not need high-resolution analysis to figure out that a photo contains an Office Depot receipt dated February 28.

Low-quality scans. Tesseract accuracy drops quickly below 150 DPI. Before passing a scanned image to pytesseract, convert it to grayscale with `img.convert("L")`, apply `ImageFilter.SHARPEN`, and scale up to 300 DPI if the image is smaller. Even basic preprocessing recovers meaningful accuracy on borderline scans.
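Those three preprocessing steps could be sketched with Pillow as follows. `preprocess_for_ocr` is a hypothetical helper, and the `assumed_dpi` default of 150 is an assumption used only when the file carries no DPI metadata, which is common for phone photos and screenshots.

```python
from PIL import Image, ImageFilter


def preprocess_for_ocr(
    img: Image.Image, target_dpi: int = 300, assumed_dpi: int = 150
) -> Image.Image:
    """Grayscale, upscale low-DPI scans, and sharpen before OCR."""
    gray = img.convert("L")

    # Read the recorded DPI if present; fall back to the assumed value otherwise
    dpi_info = img.info.get("dpi")
    try:
        current_dpi = float(dpi_info[0]) if dpi_info else float(assumed_dpi)
    except (TypeError, ValueError, IndexError):
        current_dpi = float(assumed_dpi)

    # Upscale only; images already at or above the target are left as they are
    if 0 < current_dpi < target_dpi:
        scale = target_dpi / current_dpi
        new_size = (round(gray.width * scale), round(gray.height * scale))
        gray = gray.resize(new_size, Image.LANCZOS)

    return gray.filter(ImageFilter.SHARPEN)
```

In the image branch of `extract_text`, you would then call `pytesseract.image_to_string(preprocess_for_ocr(img))` instead of passing `img` directly. Sharpening runs last so that it works on the resized pixels.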
Multi-language documents. Tesseract defaults to English. If your pipeline processes documents in French, German, or Japanese, pass the language code explicitly: `pytesseract.image_to_string(img, lang="fra")`. The matching Tesseract language data must be installed for this to work. Documents that mix languages are harder to handle reliably and often need a detection step first. The langdetect library covers this well.

Handwriting. Tesseract handles printed text reliably. Cursive or informal handwriting produces garbage output. For those files, skip OCR entirely and route directly to the vision path, which handles handwriting considerably better (though with some inconsistency on complex script).

All-unknown extractions. When OCR returns junk and the LLM cannot extract any real fields, your pipeline produces a useless name. Track these cases: write them to a review list instead of renaming, so a human can handle the outliers without the rest of the batch stalling.

Name collisions. Two invoices from the same vendor on the same date will produce identical filenames if neither has an identifier. The `build_filename` function above only uses a hash fallback when every field is empty. You can extend it to also check whether the target path already exists and append a counter when it does.

The pipeline above is around 130 lines. API costs are low: gpt-4o-mini extraction runs under $0.01 per document at typical invoice length. Vision calls for photos run a bit higher (around $0.01 to $0.03 each with `detail: low`), but that is manageable at moderate volume.

The maintenance cost is higher than it looks at first. Preprocessing edge cases, handling new document types, managing API key rotation, and wiring in a watch folder or webhook all take real time.

If you are automating your own personal workflow, the code above is a solid foundation to build on. If you are deploying this for a team or integrating it into a document management system, dedicated tools are worth a look.
renamer.ai (https://renamer.ai/) handles the OCR and vision path with a REST API and takes the maintenance overhead off your plate. FileBot is strong for media libraries with rule-based naming. AWS Textract makes sense if you are already on AWS and processing at high volume. The right call depends on whether owning the infrastructure is an asset or a cost for your situation.

The result is a content-aware renaming pipeline that handles text documents through OCR and photos through a vision model.

Start with dry-run mode on a folder of test files. Review the preview output against what you actually expected. Once the extraction is working the way you want, flip to live mode and process in batches. The functions here are designed to stay independent, so you can swap in a different OCR library, point at a different model, or change the filename format without touching the rest.
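A batch driver for that dry-run-then-live workflow might look like the sketch below. It also folds in the review list and collision counter suggested in the edge cases. `unique_path`, `is_all_unknown`, and `plan_batch` are hypothetical helpers, and the `extract` and `build_name` parameters are stand-ins for whatever you wire together from `extract_text`, `describe_image`, `extract_fields`, and `build_filename`.

```python
from pathlib import Path
from typing import Callable


def unique_path(target: Path, taken: set) -> Path:
    """Append _2, _3, ... until the name is free on disk and in this batch."""
    candidate = target
    counter = 2
    while candidate.exists() or candidate in taken:
        candidate = target.with_name(f"{target.stem}_{counter}{target.suffix}")
        counter += 1
    taken.add(candidate)
    return candidate


def is_all_unknown(fields: dict) -> bool:
    """True when no vendor, date, or identifier was extracted."""
    keys = ("vendor_or_party", "date", "identifier")
    return all(
        str(fields.get(k) or "unknown").lower() == "unknown" for k in keys
    )


def plan_batch(
    folder: str,
    extract: Callable[[str], dict],
    build_name: Callable[[dict, str], str],
) -> tuple:
    """Return (renames, review): planned (old, new) pairs and files for a human."""
    renames, review, taken = [], [], set()
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        fields = extract(str(path))
        if is_all_unknown(fields):
            review.append(path)  # leave the file untouched
            continue
        target = path.parent / build_name(fields, str(path))
        if target == path:
            continue  # already has the right name
        renames.append((path, unique_path(target, taken)))
    return renames, review


def apply_plan(renames: list) -> None:
    """Live mode: perform the renames that the dry run previewed."""
    for old, new in renames:
        old.rename(new)
```

Printing `renames` and `review` gives you the dry-run preview; calling `apply_plan(renames)` afterwards is the live step. Planning first and renaming second also means an API failure halfway through a batch leaves the folder unchanged.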