# When the Cleanup Code Becomes the Project

> Source: <https://dev.to/awjudd/when-the-cleanup-code-becomes-the-project-4h1k>
> Published: 2026-05-22 12:47:20+00:00

Tesseract can't do handwriting. Time to spend money.

AWS Textract. Cloud service, built-in handwriting detection, pay per page. If I'm paying for it, the output should at least be usable.

## Textract

``` python
import boto3
from pathlib import Path

def extract_document(image_path):
    client = boto3.client('textract')

    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    response = client.detect_document_text(
        Document={'Bytes': image_bytes}
    )

    lines = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            lines.append({
                'text': block['Text'],
                'confidence': block['Confidence']
            })

    return lines

if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        for line in extract_document(image_file):
            print(f"  [{line['confidence']:5.1f}%] {line['text']}")
```

Confidence scores are a nice touch. Accuracy is better than Tesseract's, maybe 40-60% on a good document.

But "better" isn't "good enough." "2 1/4 cups flour" comes back as "2 1/4 c fleur." "1 tsp baking soda" becomes "1 tso bokrig sado."
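The confidence scores can at least narrow down what a human has to look at. A minimal sketch, assuming the list-of-dicts shape that `extract_document` returns; the `split_by_confidence` name and the 80% cutoff are made up for illustration, not anything Textract recommends.

```python
def split_by_confidence(lines, threshold=80.0):
    """Split OCR lines into (trusted, needs_review) around a confidence cutoff.

    `threshold` is an arbitrary illustrative value, not a Textract default.
    """
    trusted, needs_review = [], []
    for line in lines:
        # Textract reports confidence as a percentage from 0 to 100
        if line['confidence'] >= threshold:
            trusted.append(line)
        else:
            needs_review.append(line)
    return trusted, needs_review


# Hypothetical sample data, not real Textract output
sample = [
    {'text': '2 1/4 c fleur', 'confidence': 62.3},
    {'text': 'Preheat oven to 350', 'confidence': 97.1},
]
trusted, needs_review = split_by_confidence(sample)
```

High confidence is not the same as correct, so this only sorts the pile. It doesn't fix anything in it.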

## The Real Problem

Even when it gets the words right, Textract doesn't know what any of it *means*. Flat text. Lines in reading order. My documents have titles, ingredient lists, instruction paragraphs, notes in margins. Textract sees none of that. Just characters on a page.

So now I'm writing parsers.

``` python
import boto3
import json
import re
from pathlib import Path

def extract_document(image_path):
    client = boto3.client('textract')

    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    response = client.detect_document_text(
        Document={'Bytes': image_bytes}
    )

    lines = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            lines.append({
                'text': block['Text'],
                'confidence': block['Confidence']
            })

    return lines

def parse_structured_data(raw_lines):
    title = None
    items = []
    instructions = []

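    # Quantity (whole number, fraction, or mixed number like "2 1/4"), then unit, then item.
    # The earlier character-class version could not match "2 1/4 cups flour".
    quantity_pattern = r'^(\d+(?:\s+\d+/\d+|/\d+)?)\s*(cups?|tbsp?|tsp|oz|lbs?|g|ml|c)\s+(.+)'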

    for line in raw_lines:
        text = line['text'].strip()
        match = re.match(quantity_pattern, text, re.IGNORECASE)

        if match:
            items.append({
                'quantity': match.group(1),
                'unit': match.group(2),
                'item': match.group(3)
            })
        elif not title:
            title = text
        else:
            instructions.append(text)

    return {'title': title, 'items': items, 'instructions': instructions}

if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        raw_lines = extract_document(image_file)
        result = parse_structured_data(raw_lines)
        print(json.dumps(result, indent=2))
```

Works on 30% of the documents. The other 70% break at least one assumption. Title not on line one. Quantities written backwards. Abbreviations I've never seen. Crossed-out text mixed into the content. Multi-line entries split apart.

Every new document, a new edge case. Every new edge case, another `if`, another regex.
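To make the failure modes concrete, here is a condensed sketch of the same three-way branching run on a few made-up lines. The quantity pattern is a variant that accepts mixed numbers such as `2 1/4`; the `classify` helper and the sample lines are hypothetical, not taken from a real document.

```python
import re

# Variant of the quantity pattern that also accepts mixed numbers like "2 1/4"
QUANTITY = re.compile(
    r'^(\d+(?:\s+\d+/\d+|/\d+)?)\s*(cups?|tbsp?|tsp|oz|lbs?|g|ml|c)\s+(.+)',
    re.IGNORECASE,
)

def classify(texts):
    """Same branching as the parser: item, else first-seen title, else instruction."""
    title, items, instructions = None, [], []
    for text in texts:
        if QUANTITY.match(text.strip()):
            items.append(text)
        elif not title:
            title = text
        else:
            instructions.append(text)
    return title, items, instructions

# Made-up lines: a margin note comes first, and one unit is garbled by OCR
title, items, instructions = classify([
    "From Aunt May",       # hypothetical note sitting above the real title
    "Banana Bread",
    "2 1/4 cups flour",
    "1 tso bokrig sado",   # "tsp" misread, so the pattern no longer matches
])
```

The note becomes the title, the real title becomes an instruction, and the garbled ingredient lands in the instructions too. Two broken assumptions, three misfiled lines.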

## Saturday Night

Here's where I'm at:

- Pre-processing with 6 configurable parameters
- 200+ lines of regex and heuristics
- 70% of documents still need a human
- Accuracy I'm being generous calling 30%

The parser is now more work than just typing things by hand.

And every time I fix one document's output, three others break. The heuristics are fragile. Interconnected. Basically untestable because no two documents look alike.

One document has a crossed-out line. Original text scratched out, correction written above. Any person glances at it and reads the correction. Half a second.

Textract returns both lines. Jumbled. My parser doesn't know what a strikethrough is. Teaching it would mean analyzing the spatial layout of ink strokes. That's not a text problem anymore. That's a computer vision problem.
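Textract does attach a `Geometry.BoundingBox` to every block (`Left`, `Top`, `Width`, `Height`, all as fractions of the page), so a crude first step is possible: flag pairs of lines that nearly touch vertically and overlap horizontally, which is roughly how a correction written above a scratched-out line looks. A rough sketch under that assumption; `find_stacked_lines`, the `_line` helper, and the `max_gap` value are hypothetical, and the sample blocks are invented.

```python
def find_stacked_lines(blocks, max_gap=0.005):
    """Return (upper_text, lower_text) pairs of LINE blocks that nearly touch.

    `max_gap` is the largest vertical gap, as a fraction of page height,
    still treated as "stacked". The value is a guess, not a tuned setting.
    """
    lines = [b for b in blocks if b['BlockType'] == 'LINE']
    lines.sort(key=lambda b: b['Geometry']['BoundingBox']['Top'])

    pairs = []
    for upper, lower in zip(lines, lines[1:]):
        up = upper['Geometry']['BoundingBox']
        lo = lower['Geometry']['BoundingBox']
        # Gap between the bottom of the upper box and the top of the lower one
        gap = lo['Top'] - (up['Top'] + up['Height'])
        # Horizontal overlap: the two boxes share some x-range
        overlaps = (up['Left'] < lo['Left'] + lo['Width']
                    and lo['Left'] < up['Left'] + up['Width'])
        if gap <= max_gap and overlaps:
            pairs.append((upper['Text'], lower['Text']))
    return pairs


def _line(text, left, top, width, height):
    """Build a minimal fake LINE block for the example below."""
    return {'BlockType': 'LINE', 'Text': text,
            'Geometry': {'BoundingBox': {'Left': left, 'Top': top,
                                         'Width': width, 'Height': height}}}

# Hypothetical blocks, shaped like the entries in response['Blocks']
blocks = [
    _line('1 tsp baking soda', 0.10, 0.300, 0.30, 0.03),     # correction written above
    _line('1 tbsp baking powder', 0.10, 0.325, 0.35, 0.03),  # scratched-out original
    _line('Mix until smooth', 0.10, 0.500, 0.30, 0.03),
]
stacked = find_stacked_lines(blocks)
```

This only finds candidates. It can't tell which of the two lines is crossed out, and tightly spaced ordinary lines would trip it too. Deciding that still means looking at the ink.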

I'm a full day in. The system I'm building reads worse than I do, and the code to make it slightly less bad is growing faster than the documents it's supposed to process.
