When the Cleanup Code Becomes the Project

The article describes the author's attempt to use AWS Textract to extract text from handwritten recipe documents, finding that while the service provides confidence scores, its accuracy is poor (40-60%) and it fails to understand document structure. The author then writes increasingly complex parsing code to handle edge cases like strikethroughs and varied formatting, but concludes that the parser has become more work than manually typing the data, as fixing one document's output often breaks others.

Tesseract can't do handwriting. Time to spend money.

AWS Textract. Cloud service, built-in handwriting detection, pay per page. If I'm paying for it, the output should at least be usable.

Textract

```python
import boto3
from pathlib import Path


def extract_document(image_path):
    client = boto3.client('textract')

    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    response = client.detect_document_text(Document={'Bytes': image_bytes})

    # Keep only LINE blocks, with the confidence Textract reports for each
    lines = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            lines.append({
                'text': block['Text'],
                'confidence': block['Confidence']
            })
    return lines


if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        for line in extract_document(image_file):
            print(f"  {line['confidence']:5.1f}%  {line['text']}")
```

Confidence scores are a nice touch. Accuracy is better, maybe 40-60% on a good document. But "better" isn't "good enough."

"2 1/4 cups flour" comes back as "2 1/4 c fleur." "1 tsp baking soda" becomes "1 tso bokrig sado."

The Real Problem

Even when it gets the words right, Textract doesn't know what any of it means. Flat text. Lines in reading order.

My documents have titles, ingredient lists, instruction paragraphs, notes in margins. Textract sees none of that. Just characters on a page.

So now I'm writing parsers.
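
Before the full script, a minimal sketch of the core idea in isolation: match a leading quantity, then a known unit, and treat whatever is left as the item. The exact pattern and the helper name `split_quantity_line` are illustrative assumptions, not a fixed design. The sample strings are the ones quoted above.

```python
import re

# Illustrative pattern (an assumption): a whole number, fraction, or mixed
# number, then a known unit, then the item text.
QUANTITY_PATTERN = re.compile(
    r'^(\d+(?:\s+\d+/\d+|/\d+)?)\s*(cups?|tbsp?|tsp|oz|lbs?|g|ml|c)\s+(.+)$',
    re.IGNORECASE,
)


def split_quantity_line(text):
    """Return a quantity/unit/item dict, or None if the line doesn't match."""
    match = QUANTITY_PATTERN.match(text.strip())
    if match is None:
        return None
    quantity, unit, item = match.groups()
    return {'quantity': quantity, 'unit': unit, 'item': item}


if __name__ == "__main__":
    samples = [
        "2 1/4 cups flour",    # what was written
        "2 1/4 c fleur",       # what Textract returned
        "1 tsp baking soda",   # what was written
        "1 tso bokrig sado",   # what Textract returned
    ]
    for sample in samples:
        print(f"{sample!r:22} -> {split_quantity_line(sample)}")
```

The garbled flour line still parses, just with the wrong item. The garbled baking soda line doesn't match at all, because "tso" isn't a unit the pattern knows, so a parser built this way has to file it somewhere else.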
```python
import boto3
import json
import re
from pathlib import Path


def extract_document(image_path):
    client = boto3.client('textract')

    with open(image_path, 'rb') as f:
        image_bytes = f.read()

    response = client.detect_document_text(Document={'Bytes': image_bytes})

    lines = []
    for block in response['Blocks']:
        if block['BlockType'] == 'LINE':
            lines.append({
                'text': block['Text'],
                'confidence': block['Confidence']
            })
    return lines


def parse_structured_data(raw_lines):
    title = None
    items = []
    instructions = []

    # Quantity (whole number, fraction, or mixed number), unit, then the item
    quantity_pattern = (
        r'^(\d+(?:\s+\d+/\d+|/\d+)?)\s*'
        r'(cups?|tbsp?|tsp|oz|lbs?|g|ml|c)\s+(.+)$'
    )

    for line in raw_lines:
        text = line['text'].strip()
        match = re.match(quantity_pattern, text, re.IGNORECASE)
        if match:
            items.append({
                'quantity': match.group(1),
                'unit': match.group(2),
                'item': match.group(3)
            })
        elif not title:
            # The first line that isn't an ingredient is assumed to be the title
            title = text
        else:
            # Everything else is treated as an instruction
            instructions.append(text)

    return {'title': title, 'items': items, 'instructions': instructions}


if __name__ == "__main__":
    for image_file in sorted(Path("images").glob("*")):
        print(f"\n--- {image_file.name} ---")
        raw_lines = extract_document(image_file)
        result = parse_structured_data(raw_lines)
        print(json.dumps(result, indent=2))
```

Works on 30% of the documents. The other 70% break at least one assumption.

Title not on line one. Quantities written backwards. Abbreviations I've never seen. Crossed-out text mixed into the content. Multi-line entries split apart.

Every new document, a new edge case. Every new edge case, another `if`, another regex.

Saturday Night

Here's where I'm at:

- Pre-processing with 6 configurable parameters
- 200+ lines of regex and heuristics
- 70% of documents still need a human
- Accuracy I'm being generous in calling 30%

The parser is now more work than just typing things by hand. And every time I fix one document's output, three others break. The heuristics are fragile. Interconnected. Basically untestable because no two documents look alike.

One document has a crossed-out line. Original text scratched out, correction written above.
Any person glances at it and reads the correction. Half a second.

Textract returns both lines. Jumbled. My parser doesn't know what a strikethrough is. Teaching it would mean analyzing the spatial layout of ink strokes. That's not a text problem anymore. That's a computer vision problem.

I'm a full day in. The system I'm building reads worse than I do, and the code to make it slightly less bad is growing faster than the documents it's supposed to process.