# How to Automate File Renaming with AI and OCR

> Source: <https://dev.to/tighnarizerda/how-to-automate-file-renaming-with-ai-and-ocr-1hd4>
> Published: 2026-05-27 11:25:33+00:00

*Give your files names that actually describe what is inside them.*

**TL;DR**

Open a folder of scanned documents and you will almost certainly find the same pattern: `scan_001.pdf`, `document_final_v3.docx`, `IMG_4382.jpg`. The names tell you nothing. Search is broken. Downstream scripts that pattern-match on filenames fail silently. Someone on the team ends up doing the renaming manually, which works fine at ten files a week and falls apart at a hundred.

A naming convention helps, but only if everyone follows it. Getting a team to consistently name files `{type}_{vendor}_{date}_{id}` requires discipline that degrades under deadline pressure.

The better approach is to read what is inside each file and generate the name from that content. This tutorial walks you through building that pipeline.

Content-aware renaming is not about reading file metadata. Metadata fields like creation date or the author field in an EXIF block are often wrong, empty, or filled with camera defaults that tell you nothing useful.

Reading the document's actual content requires two different approaches depending on what you are working with:

**Text documents** (PDFs, scanned invoices, contracts): OCR converts the rendered pixels or embedded text into a raw string. An LLM then pulls structured fields out of that string: document type, vendor name, date, reference number.

**Photos and general images** (receipts photographed on a phone, ID cards, site photos): OCR returns sparse or useless text on these. A vision model describes what is actually in the frame. The same LLM extraction step then processes that description.

Both paths produce the same output: a small dictionary of fields you feed into a filename template.

**Pipeline Architecture**

Here is the full flow in plain text:

```
Input file
    │
    ├─ PDF or text-heavy image
    │       └─ extract_text()    OCR or embedded text extraction
    │               └─ extract_fields()    LLM structured prompt
    │
    └─ Photo / general image
            └─ describe_image()    vision model
                    └─ extract_fields()    same LLM step
    │
    └─ build_filename()    template + sanitize
            └─ rename file on disk
```

The two content paths share the field extraction and naming steps. Only the ingestion step differs based on what kind of file you are processing.
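The routing decision in the diagram can be sketched as a small dispatcher. This is a minimal sketch, not code from the original pipeline: the name `choose_path`, the suffix sets, and the 50-character threshold are assumptions chosen for illustration.

```python
from pathlib import Path

# Hypothetical routing helper; the names and the threshold are illustrative assumptions.
PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tiff", ".bmp"}

def choose_path(file_path: str, ocr_text: str = "", min_chars: int = 50) -> str:
    """Return "text", "vision", or "skip" for a given file."""
    suffix = Path(file_path).suffix.lower()
    if suffix in PDF_SUFFIXES:
        return "text"
    if suffix in IMAGE_SUFFIXES:
        # A text-heavy scan yields plenty of OCR output; a photo usually does not.
        return "text" if len(ocr_text.strip()) >= min_chars else "vision"
    return "skip"
```

Passing the OCR result in as an argument keeps the decision cheap to test: an image is only sent to the vision model when OCR has already come back sparse.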

For PDFs with embedded text, **pdfplumber** is faster and more accurate than rendering pages to images and running OCR on them. For scanned PDFs and standalone image files, **pytesseract** wraps Tesseract and handles most common formats.

``` python
import pdfplumber
import pytesseract
from PIL import Image
from pathlib import Path

def extract_text(file_path: str) -> str:
    """
    Extract text from a PDF (embedded or scanned) or image file.
    Returns raw text string. Returns empty string on failure.
    """
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix == ".pdf":
        # Try embedded text first: faster and more accurate
        try:
            with pdfplumber.open(file_path) as pdf:
                pages = [page.extract_text() or "" for page in pdf.pages]
                combined = "\n".join(pages).strip()
                if len(combined) > 50:
                    return combined
        except Exception:
            pass

        # Fall back to OCR on the first page
        try:
            import pdf2image
            images = pdf2image.convert_from_path(
                file_path, first_page=1, last_page=1
            )
            if images:
                return pytesseract.image_to_string(images[0])
        except Exception:
            return ""

    elif suffix in {".jpg", ".jpeg", ".png", ".tiff", ".bmp"}:
        try:
            img = Image.open(file_path)
            return pytesseract.image_to_string(img)
        except Exception:
            return ""

    return ""
```

**Install:** `pip install pdfplumber pytesseract pdf2image Pillow`

Tesseract itself requires a system install. On macOS: `brew install tesseract`. On Ubuntu or Debian: `apt install tesseract-ocr`.
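Because the Python package is only a wrapper, it can help to confirm that the Tesseract binary is actually on your `PATH` before starting a batch. The check below is a minimal sketch using only the standard library; the helper name `tesseract_available` is an assumption, not part of pytesseract.

```python
import shutil

def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the given executable can be found on PATH."""
    return shutil.which(binary) is not None

if __name__ == "__main__":
    if not tesseract_available():
        print("Tesseract not found on PATH; install it before running OCR.")
```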

Raw OCR output is noisy. Whitespace errors, page headers, footer fragments, and encoding artifacts all land in the string. An LLM handles this noise well if your prompt is tight and returns structured JSON.

``` python
import json
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from environment

EXTRACTION_PROMPT = """
You are a document classifier. Given raw OCR text, extract exactly these fields as JSON:

- "doc_type": one of: invoice, contract, receipt, report, id_card, other
- "vendor_or_party": the main company or person name. Use "unknown" if not found.
- "date": the primary date in YYYY-MM-DD format. Use "unknown" if not found.
- "identifier": invoice number, contract ID, or reference number. Use "unknown" if not found.

Return ONLY valid JSON with these four keys. No explanation. No markdown fences.

OCR text:
{text}
"""

def extract_fields(raw_text: str) -> dict:
    """
    Send OCR text to GPT and return structured filename fields.
    Falls back to a safe default dict on any parse failure.
    """
    fallback = {
        "doc_type": "file",
        "vendor_or_party": "unknown",
        "date": "unknown",
        "identifier": "unknown",
    }

    if not raw_text.strip():
        return fallback

    prompt = EXTRACTION_PROMPT.format(text=raw_text[:3000])

    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        content = response.choices[0].message.content.strip()
        result = json.loads(content)
        for key in fallback:
            if key not in result:
                result[key] = "unknown"
        return result
    except Exception:
        return fallback
```

`temperature=0` keeps output as consistent as possible across repeated runs on the same document, though the API does not guarantee identical results every time. The 3,000-character cap limits token spend. For most invoices and contracts, the fields you need appear on the first page, well within that cap.
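Even with a prompt that forbids markdown fences, models occasionally wrap their JSON in one anyway, and `json.loads` then fails and the pipeline falls back to defaults. A tolerant parser is one way to recover those replies. This is a minimal sketch; `parse_json_reply` is a hypothetical helper, not part of the original code.

```python
import json

FENCE = "`" * 3  # built this way to avoid a literal fence inside this example

def parse_json_reply(content: str) -> dict:
    """Parse a model reply as a JSON object, tolerating a markdown fence.

    Returns an empty dict when the reply is not a JSON object.
    """
    text = content.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with any language tag) and the closing fence.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    try:
        result = json.loads(text)
    except json.JSONDecodeError:
        return {}
    return result if isinstance(result, dict) else {}
```

Inside `extract_fields()`, you could call this in place of the bare `json.loads(content)` and treat an empty dict as a parse failure.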

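If you prefer Anthropic's API, the swap is straightforward. Replace `openai.OpenAI()` with `anthropic.Anthropic()` and use `client.messages.create(model="claude-3-haiku-20240307", ...)`. Note that the Messages API requires a `max_tokens` argument and returns the reply in `response.content[0].text` rather than `response.choices[0].message.content`. The prompt itself works without changes.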
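A sketch of the Anthropic call might look like the following. The helper name `ask_claude` and the 300-token budget are assumptions; the client is passed in so the function does not depend on how you construct it.

```python
def ask_claude(client, prompt: str, model: str = "claude-3-haiku-20240307") -> str:
    """Send one user prompt through the Anthropic Messages API and return the reply text."""
    response = client.messages.create(
        model=model,
        max_tokens=300,  # required by the Messages API; 300 is an assumed budget
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    # Replies arrive as a list of content blocks; the first block holds the text here.
    return response.content[0].text.strip()

# Usage (requires `pip install anthropic` and ANTHROPIC_API_KEY in the environment):
#   import anthropic
#   client = anthropic.Anthropic()
#   content = ask_claude(client, EXTRACTION_PROMPT.format(text=raw_text[:3000]))
```

The returned string can then go through the same `json.loads` step as the OpenAI reply.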

With structured fields in hand, building the filename is a sanitize-and-join operation. The two things that will trip you up are illegal filesystem characters and name collisions.

``` python
import re
import hashlib

def sanitize(value) -> str:
    """Strip illegal filesystem characters and collapse whitespace."""
    # Coerce first: the model may return null or a number instead of a string.
    value = str(value or "").strip()
    value = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", value)
    value = re.sub(r"\s+", "_", value)
    return value[:50]

def build_filename(fields: dict, original_path: str) -> str:
    """
    Assemble a descriptive filename from extracted fields.
    Format: {doc_type}_{vendor}_{date}_{identifier}{ext}
    Example: invoice_Acme_Corp_2024-03-15_INV-0042.pdf
    Unknown fields are dropped. Falls back to original stem + hash on total failure.
    """
    ext = Path(original_path).suffix.lower()

    parts = [
        sanitize(fields.get("doc_type", "file")),
        sanitize(fields.get("vendor_or_party", "")),
        sanitize(fields.get("date", "")),
    ]

    identifier = sanitize(fields.get("identifier", ""))
    if identifier and identifier.lower() != "unknown":
        parts.append(identifier)

    clean = [p for p in parts if p and p.lower() != "unknown"]

    if not clean:
        stem = Path(original_path).stem
        h = hashlib.md5(stem.encode()).hexdigest()[:6]
        clean = [sanitize(stem), h]

    return "_".join(clean) + ext

def rename_file(original_path: str, dry_run: bool = True) -> str:
    """
    Full text-document pipeline. Set dry_run=False to rename on disk.
    """
    raw_text = extract_text(original_path)
    fields = extract_fields(raw_text)
    new_name = build_filename(fields, original_path)

    original = Path(original_path)
    new_path = original.parent / new_name

    if dry_run:
        print(f"  [dry run] {original.name} -> {new_name}")
    else:
        original.rename(new_path)
        print(f"  renamed:  {original.name} -> {new_name}")

    return str(new_path)
```

Run with `dry_run=True` first. Print the full preview list, spot-check a few filenames against the actual documents, then re-run with `dry_run=False` once you are satisfied.
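To get that preview for a whole folder in one pass, you can collect the proposed names without touching the disk. This is a minimal sketch: `preview_renames` and its `make_name` callback are hypothetical, and the callback stands in for whatever produces a name from a path.

```python
from pathlib import Path
from typing import Callable

def preview_renames(
    folder: str, make_name: Callable[[Path], str]
) -> list[tuple[str, str]]:
    """Collect (old_name, new_name) pairs for every file in folder.

    Nothing is renamed; the caller decides what to do with the pairs.
    """
    pairs = []
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            pairs.append((path.name, make_name(path)))
    return pairs

# Usage with the functions from this tutorial:
#   plan = preview_renames(
#       "inbox",
#       lambda p: build_filename(extract_fields(extract_text(str(p))), str(p)),
#   )
#   for old, new in plan:
#       print(f"{old} -> {new}")
```

Returning the pairs rather than printing them also makes it easy to spot two files that would end up with the same new name.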

For photos, OCR returns little or nothing useful. A vision model can describe scene content and read visible text in context, which is exactly what you need for a photograph of a receipt or a scanned ID card.

``` python
import base64

def describe_image(file_path: str) -> str:
    """
    Use GPT-4o vision to describe an image for filename generation.
    Returns a plain-text description that extract_fields() can process.
    """
    path = Path(file_path)
    with open(file_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    ext = path.suffix.lower().lstrip(".")
    mime = "image/jpeg" if ext in ("jpg", "jpeg") else f"image/{ext}"

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": (
                                "Describe this image in 2-3 sentences for the purpose of "
                                "generating a descriptive filename. Include: the document "
                                "or scene type, any visible names, dates, or reference "
                                "numbers, and the primary subject. Be specific."
                            ),
                        },
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:{mime};base64,{image_data}",
                                "detail": "low",
                            },
                        },
                    ],
                }
            ],
            max_tokens=200,
        )
        return response.choices[0].message.content.strip()
    except Exception:
        return ""

def rename_photo(file_path: str, dry_run: bool = True) -> str:
    """Photo pipeline: vision description -> field extraction -> rename."""
    description = describe_image(file_path)
    fields = extract_fields(description)
    new_name = build_filename(fields, file_path)

    original = Path(file_path)
    new_path = original.parent / new_name

    if dry_run:
        print(f"  [dry run] {original.name} -> {new_name}")
    else:
        original.rename(new_path)
        print(f"  renamed:  {original.name} -> {new_name}")

    return str(new_path)
```

The `"detail": "low"` setting on the image URL cuts vision API costs by roughly 75% compared to the default. You do not need high-resolution analysis to figure out that a photo contains an Office Depot receipt dated February 28.

**Low-quality scans.** Tesseract accuracy drops quickly below 150 DPI. Before passing a scanned image to pytesseract, convert it to grayscale with `img.convert("L")`, apply `ImageFilter.SHARPEN`, and scale up to 300 DPI if the image is smaller. Even basic preprocessing recovers meaningful accuracy on borderline scans.
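Those three steps can be sketched with Pillow as follows. The function name `preprocess_for_ocr` and the `assumed_dpi` fallback are assumptions: many scans carry no DPI metadata, so the sketch has to guess a resolution in that case.

```python
from PIL import Image, ImageFilter

def preprocess_for_ocr(
    img: Image.Image, target_dpi: int = 300, assumed_dpi: int = 150
) -> Image.Image:
    """Grayscale, sharpen, and upscale an image before OCR."""
    gray = img.convert("L").filter(ImageFilter.SHARPEN)
    # Use the DPI stored in the file if present; otherwise fall back to a guess.
    dpi = float(img.info.get("dpi", (assumed_dpi, assumed_dpi))[0]) or float(assumed_dpi)
    if dpi < target_dpi:
        scale = target_dpi / dpi
        new_size = (round(gray.width * scale), round(gray.height * scale))
        gray = gray.resize(new_size, Image.LANCZOS)
    return gray

# Usage: text = pytesseract.image_to_string(preprocess_for_ocr(Image.open(path)))
```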

**Multi-language documents.** Tesseract defaults to English. If your pipeline processes documents in French, German, or Japanese, pass the language code explicitly: `pytesseract.image_to_string(img, lang="fra")`. The matching Tesseract language data must be installed. Documents that mix languages are harder to handle reliably and often need a detection step first. The `langdetect` library covers this well, but it works on text, so it needs a rough first-pass OCR string to analyze.
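Language detectors such as `langdetect` report two-letter ISO 639-1 codes, while Tesseract expects its own three-letter codes and accepts several joined with `+` (for example `eng+fra`). A small translation helper bridges the two. This is a minimal sketch: `tesseract_lang` and the four-entry mapping are assumptions, and you would extend the mapping for the languages you actually process.

```python
# Assumed mapping from ISO 639-1 codes to Tesseract language codes.
ISO_TO_TESSERACT = {"en": "eng", "fr": "fra", "de": "deu", "ja": "jpn"}

def tesseract_lang(iso_codes: list[str], default: str = "eng") -> str:
    """Build a Tesseract lang argument such as "eng+fra" from ISO 639-1 codes.

    Unknown codes are skipped; duplicates are dropped while preserving order.
    """
    codes = []
    for iso in iso_codes:
        code = ISO_TO_TESSERACT.get(iso.lower())
        if code and code not in codes:
            codes.append(code)
    return "+".join(codes) if codes else default

# Usage: pytesseract.image_to_string(img, lang=tesseract_lang(["en", "fr"]))
```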

**Handwriting.** Tesseract handles printed text reliably. Cursive or informal handwriting produces garbage output. For those files, skip OCR entirely and route directly to the vision path, which handles handwriting considerably better (though with some inconsistency on complex script).

**All-unknown extractions.** When OCR returns junk and the LLM cannot extract any real fields, your pipeline produces a useless name. Track these cases: write them to a review list instead of renaming, so a human can handle the outliers without the rest of the batch stalling.
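One way to track those cases is to test the extracted fields before renaming and append the file to a CSV instead. This is a minimal sketch: `needs_review`, `log_for_review`, and the `needs_review.csv` filename are hypothetical names, and the rule used here (every descriptive field is unknown) is an assumption you may want to tighten.

```python
import csv
from pathlib import Path

# doc_type is ignored on purpose: a type alone does not make a useful name.
DESCRIPTIVE_KEYS = ("vendor_or_party", "date", "identifier")

def needs_review(fields: dict) -> bool:
    """True when none of the descriptive fields carries a real value."""
    return all(
        str(fields.get(key) or "unknown").strip().lower() == "unknown"
        for key in DESCRIPTIVE_KEYS
    )

def log_for_review(file_path: str, review_csv: str = "needs_review.csv") -> None:
    """Append the file path to a CSV that a human can work through later."""
    is_new = not Path(review_csv).exists()
    with open(review_csv, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["file_path"])
        writer.writerow([file_path])
```

In `rename_file()`, the check would sit between `extract_fields()` and `build_filename()`: if `needs_review(fields)` is true, log the path and return without renaming.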

**Name collisions.** Two invoices from the same vendor on the same date will produce identical filenames when no identifier is extracted. The `build_filename` function above only adds a hash when every field is empty, so it does not prevent this, and on Unix `Path.rename` silently replaces an existing file. You can extend the pipeline to check whether the target path already exists and append a counter when it does.
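A counter-based check could look like this. It is a minimal sketch: `unique_path` is a hypothetical helper, and the `_2`, `_3` suffix style is an assumption.

```python
from pathlib import Path

def unique_path(target: Path) -> Path:
    """Return target if it is free, otherwise add _2, _3, ... before the extension."""
    if not target.exists():
        return target
    counter = 2
    while True:
        candidate = target.with_name(f"{target.stem}_{counter}{target.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1

# Usage inside rename_file():
#   new_path = unique_path(original.parent / new_name)
#   original.rename(new_path)
```

The check and the rename are two separate steps, so this is safe for a single process working through a folder but not for several processes writing to the same directory at once.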

The pipeline above is around 130 lines. API costs are low: `gpt-4o-mini` extraction runs under $0.01 per document at typical invoice length. Vision calls for photos run a bit higher (around $0.01 to $0.03 each with `detail: low`), but that is manageable at moderate volume.

The maintenance cost is higher than it looks at first. Preprocessing edge cases, handling new document types, managing API key rotation, and wiring in a watch folder or webhook all take real time. If you are automating your own personal workflow, the code above is a solid foundation to build on.

If you are deploying this for a team or integrating it into a document management system, dedicated tools are worth a look. [renamer.ai](https://renamer.ai/) handles the OCR and vision path with a REST API and takes the maintenance overhead off your plate. Filebot is strong for media libraries with rule-based naming. AWS Textract makes sense if you are already on AWS and processing at high volume. The right call depends on whether owning the infrastructure is an asset or a cost for your situation.

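What you now have is a content-aware renaming pipeline that extracts text from PDFs and scans with OCR, describes photos with a vision model, pulls structured fields out of either with an LLM, and builds a sanitized filename from those fields.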

Start with dry-run mode on a folder of test files. Review the preview output against what you actually expected. Once the extraction is working the way you want, flip to live mode and process in batches.

The four functions here are designed to stay independent, so you can swap in a different OCR library, point at a different model, or change the filename format without touching the rest.
