{"slug": "when-the-cleanup-code-becomes-the-project", "title": "When the Cleanup Code Becomes the Project", "summary": "The article describes the author's attempt to use AWS Textract to extract text from handwritten recipe documents, finding that while the service provides confidence scores, its accuracy is poor (40-60%) and it fails to understand document structure. The author then writes increasingly complex parsing code to handle edge cases like strikethroughs and varied formatting, but concludes that the parser has become more work than manually typing the data, as fixing one document's output often breaks others.", "body_md": "Tesseract can't do handwriting. Time to spend money.\n\nAWS Textract. Cloud service, built-in handwriting detection, pay per page. If I'm paying for it, the output should at least be usable.\n\n## Textract\n\n``` python\nimport boto3\nfrom pathlib import Path\n\ndef extract_document(image_path):\n    client = boto3.client('textract')\n\n    with open(image_path, 'rb') as f:\n        image_bytes = f.read()\n\n    response = client.detect_document_text(\n        Document={'Bytes': image_bytes}\n    )\n\n    lines = []\n    for block in response['Blocks']:\n        if block['BlockType'] == 'LINE':\n            lines.append({\n                'text': block['Text'],\n                'confidence': block['Confidence']\n            })\n\n    return lines\n\nif __name__ == \"__main__\":\n    for image_file in sorted(Path(\"images\").glob(\"*\")):\n        print(f\"\\n--- {image_file.name} ---\")\n        for line in extract_document(image_file):\n            print(f\"  [{line['confidence']:5.1f}%] {line['text']}\")\n```\n\nConfidence scores are a nice touch. 
Accuracy is better - maybe 40-60% on a good document.\n\nBut \"better\" isn't \"good enough.\" \"2 1/4 cups flour\" comes back as \"2 1/4 c fleur.\" \"1 tsp baking soda\" becomes \"1 tso bokrig sado.\"\n\n## The Real Problem\n\nEven when it gets the words right, Textract doesn't know what any of it *means*. Flat text. Lines in reading order. My documents have titles, ingredient lists, instruction paragraphs, notes in margins. Textract sees none of that. Just characters on a page.\n\nSo now I'm writing parsers.\n\n``` python\nimport boto3\nimport json\nimport re\nfrom pathlib import Path\n\ndef extract_document(image_path):\n    client = boto3.client('textract')\n\n    with open(image_path, 'rb') as f:\n        image_bytes = f.read()\n\n    response = client.detect_document_text(\n        Document={'Bytes': image_bytes}\n    )\n\n    lines = []\n    for block in response['Blocks']:\n        if block['BlockType'] == 'LINE':\n            lines.append({\n                'text': block['Text'],\n                'confidence': block['Confidence']\n            })\n\n    return lines\n\ndef parse_structured_data(raw_lines):\n    title = None\n    items = []\n    instructions = []\n\n    # Quantity: whole number (2), fraction (1/2), or mixed number (2 1/4)\n    quantity_pattern = r'^(\\d+(?:\\s+\\d+/\\d+|/\\d+)?)\\s*(cups?|tbsp?|tsp|oz|lbs?|g|ml|c)\\s+(.+)'\n\n    for line in raw_lines:\n        text = line['text'].strip()\n        match = re.match(quantity_pattern, text, re.IGNORECASE)\n\n        if match:\n            items.append({\n                'quantity': match.group(1),\n                'unit': match.group(2),\n                'item': match.group(3)\n            })\n        elif not title:\n            title = text\n        else:\n            instructions.append(text)\n\n    return {'title': title, 'items': items, 'instructions': instructions}\n\nif __name__ == \"__main__\":\n    for image_file in sorted(Path(\"images\").glob(\"*\")):\n        print(f\"\\n--- {image_file.name} ---\")\n        raw_lines = extract_document(image_file)\n        result = 
parse_structured_data(raw_lines)\n        print(json.dumps(result, indent=2))\n```\n\nWorks on 30% of the documents. The other 70% break at least one assumption. Title not on line one. Quantities written backwards. Abbreviations I've never seen. Crossed-out text mixed into the content. Multi-line entries split apart.\n\nEvery new document, a new edge case. Every new edge case, another `if`, another regex.\n\n## Saturday Night\n\nHere's where I'm at:\n\n- Pre-processing with 6 configurable parameters\n- 200+ lines of regex and heuristics\n- 70% of documents still need a human\n- Accuracy I'm being generous calling 30%\n\nThe parser is now more work than just typing things by hand.\n\nAnd every time I fix one document's output, three others break. The heuristics are fragile. Interconnected. Basically untestable because no two documents look alike.\n\nOne document has a crossed-out line. Original text scratched out, correction written above. Any person glances at it and reads the correction. Half a second.\n\nTextract returns both lines. Jumbled. My parser doesn't know what a strikethrough is. Teaching it would mean analyzing the spatial layout of ink strokes. That's not a text problem anymore. That's a computer vision problem.\n\nI'm a full day in. 
The system I'm building reads worse than I do, and the code to make it slightly less bad is growing faster than the documents it's supposed to process.", "url": "https://wpnews.pro/news/when-the-cleanup-code-becomes-the-project", "canonical_source": "https://dev.to/awjudd/when-the-cleanup-code-becomes-the-project-4h1k", "published_at": "2026-05-22 12:47:20+00:00", "updated_at": "2026-05-22 13:05:34.843847+00:00", "lang": "en", "topics": ["artificial-intelligence", "cloud-computing", "developer-tools", "machine-learning"], "entities": ["Tesseract", "AWS Textract", "boto3"], "alternates": {"html": "https://wpnews.pro/news/when-the-cleanup-code-becomes-the-project", "markdown": "https://wpnews.pro/news/when-the-cleanup-code-becomes-the-project.md", "text": "https://wpnews.pro/news/when-the-cleanup-code-becomes-the-project.txt", "jsonld": "https://wpnews.pro/news/when-the-cleanup-code-becomes-the-project.jsonld"}}