I Shipped One Messy Python Script. Here's the 10-Point Checklist That Got It There.

A developer outlines a 10-point checklist for shipping AI-generated Python scripts reliably, based on a typical messy script with hardcoded paths and bare exception handling. The checklist covers error handling, configuration management, input validation, logging, and testing, aiming to close the gap between "runs on my machine" and production-ready code.

AI can generate a functional Python script in approximately ninety seconds. The script executes successfully, and the user proceeds to the next task. However, the script often persists beyond its initial use. It may include a hard-coded Downloads path or a comment such as "# edit before running". Subsequently, someone might duplicate it as script_v2_FINAL.py. Weeks later, a teammate may attempt to use the script with a slightly different CSV file, resulting in a raw traceback or, more problematically, the script silently producing an incomplete file and exiting with a success code.

That gap between "runs on my machine" and "I'd let someone else run this" is the actual engineering work. And AI almost never closes it for you, because you never asked. That's an observation from watching a lot of generated code.

This post provides a detailed analysis of a deliberately typical, disorganized script. I will identify each defect and demonstrate the necessary modifications to prepare it for distribution. By the conclusion, readers will have a ten-point checklist applicable to their own code, requiring only Python and pytest.

The "before": a script that works exactly once

The following is the initial script. It aggregates expenses from a CSV file by category and flags entries exceeding a specified limit. Although it functions as intended, this apparent success can be misleading. The script is abbreviated; the overall structure remains representative.
    import csv

    INPUT = "/Users/me/Downloads/expenses.csv"  # <- edit this before running
    LIMIT = 500

    print("starting...")
    rows = []
    try:
        f = open(INPUT)
        r = csv.reader(f)
        next(r)  # skip header
        for row in r:
            rows.append(row)
    except:
        print("error reading file")

    for row in rows:
        cat = row[1]
        amt = float(row[2])
        # ...total by category, flag over LIMIT, print a report...

    out = open("report.txt", "w")

Count the landmines:

- A hardcoded home path (/Users/me/Downloads/expenses.csv) you must edit in source to use the tool.
- A bare except: that swallows every error, including bugs in your own code, and then keeps running on an empty rows list.
- Positional column access (row[1], row[2]) that explodes on a missing column and gives no clue which one.
- float(row[2]) that crashes the whole run on one malformed cell.
- print() for everything: diagnostics, results, and the "FLAGGED" lines all jammed into stdout together.
- A non-atomic write (out = open("report.txt", "w")) that can leave a partial report behind if the run crashes midway.
- No command-line interface and no --help.

None of this is exotic. This is what working AI-generated Python looks like. Let's ship it.

The checklist (run this on any script)

This is the evaluation rubric I apply before sharing code with others. Each criterion is scored from 0 to 2, for a maximum of twenty points. Apply strict standards; a "partially" addressed item receives a score of 1. Each entry presents a Failure and the corresponding Fix for efficient review.

1. Error handling. Failure: a bare except: (automatic 0), or raw tracebacks on bad input. Fix: wrap I/O, parsing, and network calls in specific exceptions; failures produce actionable messages and a nonzero exit code.
2. Secrets & config. Failure: hardcoded keys, tokens, or home paths. Fix: config comes from arguments or the environment. Grep your own code for api_key =, password =, token =, sk-, Bearer, and absolute home paths before you ship.
3. Inputs & validation. Failure: the script assumes well-formed input. Fix: check every external input. What happens on an empty file? A missing column? A path with spaces?
4. Logging & observability. Failure: print() for diagnostics, or total silence on failure. Fix: logging with levels; user output separated from debug noise.
5. Tests. Failure: none (most scripts). Fix: a pytest suite covering the happy path and at least three failure modes, with all tests running green.
6. Dependency hygiene. Failure: undeclared or unpinned deps, dead imports. Fix: declare dependencies with version bounds; imports match declared deps.
7. Interface & UX. Failure: values you edit in source to run it. Fix: a real CLI (--help, exit codes) or a documented API.
8. Packaging & install. Failure: "clone it and run python script.py and hope." Fix: pip install . works, an entry point is defined, it runs from any directory.
9. Documentation. Failure: no runnable example. Fix: a README with one-line purpose, install, a copy-pasteable example, and expected output.
10. Portability. Failure: open() without encoding= falls back to the platform's default encoding and breaks when that default does not match the file; OS-specific paths written into the code. Fix: pass encoding= explicitly and build paths with pathlib.

A critical rule for effective use of this checklist is that every finding must reference a specific file and line number. General statements such as "improve error handling" are insufficient. Instead, precise observations like "open() at line 14 crashes on a missing file" are required. Vague findings never get fixed.

Now the three fixes that matter most.

Fix 1: bare except → specific exceptions + real exit codes

The guiding principle is concise: always catch specific exceptions rather than broad ones, and handle exceptions at the program's boundaries rather than within core logic. Internal code should raise exceptions without handling them directly.
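A minimal sketch of that principle follows. The names LimitError, parse_limit, and main are hypothetical illustrations, not code from the shipped project: the inner function raises a specific exception, and only main turns it into a message on stderr and an exit code (0 for success, 1 for an expected runtime failure, 2 for a usage error).

```python
import sys


class LimitError(Exception):
    """Hypothetical expected failure; the message tells the user what to fix."""


def parse_limit(text: str) -> float:
    # Core logic: catch only the specific exception we expect, then re-raise
    # it as a user-facing error. No printing and no sys.exit() in here.
    try:
        value = float(text)
    except ValueError:
        raise LimitError(f"limit must be a number, got {text!r}") from None
    if value <= 0:
        raise LimitError(f"limit must be positive, got {value}")
    return value


def main(argv: list[str]) -> int:
    # Boundary: the only place that maps failures to messages and exit codes.
    if len(argv) != 1:
        print("usage: limit-check LIMIT", file=sys.stderr)
        return 2  # usage error
    try:
        limit = parse_limit(argv[0])
    except LimitError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return 1  # expected runtime failure
    print(f"limit accepted: {limit}")
    return 0  # success
```

In a real script the entry point would end with sys.exit(main(sys.argv[1:])), so the integer returned by main becomes the process exit code. Anything that is not a LimitError is deliberately left uncaught.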
Only the main function should convert errors into user-facing messages and exit codes. Exit codes must be meaningful: 0 for success, 1 for runtime failure, and 2 for usage errors. Otherwise, shell pipelines and scheduled jobs that depend on the script cannot tell a failed run from a successful one.

Here's the validation from the shipped core.py. Notice it names the missing column instead of dying on row[1]:

    import csv
    from pathlib import Path


    class AppError(Exception):
        """Expected, user-facing failure. Message says what to do, not just what broke."""


    REQUIRED_COLUMNS = ("date", "category", "amount")


    def read_expenses(path: Path) -> list[dict[str, str]]:
        if not path.is_file():
            raise AppError(f"input file not found: {path}")
        with open(path, newline="", encoding="utf-8") as f:
            reader = csv.DictReader(f)
            # fieldnames is None when the file has no header row at all
            if reader.fieldnames is None:
                raise AppError(f"{path} is empty - expected a header row")
            missing = [c for c in REQUIRED_COLUMNS if c not in reader.fieldnames]
            if missing:
                raise AppError(f"{path} is missing required column(s): {', '.join(missing)}")
            return list(reader)

A bad row no longer kills the run: it's skipped with a warning and counted. And the top of main does the translating: it calls read_expenses inside a try block and catches only AppError. Unexpected exceptions are allowed to propagate and display a full traceback, because a swallowed crash is undiagnosable, while a loud one tells you exactly what to fix. Don't wrap main in a catch-all that prints "something went wrong."

That output write also became atomic (write to a .tmp file, then rename it into place), so a crash never leaves a corrupt report behind.

Fix 2: print for everything → leveled logging on stderr

A common misconception is to conflate program output with logging. If the script is intended to print a report to standard output, use print for that purpose.
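To make that split concrete, here is a minimal sketch. The names emit_report and configure_stderr_logging, the logger name, and the sample data are all hypothetical; the point is only that print() carries the product to stdout while the logging module carries diagnostics to stderr.

```python
import logging
import sys

log = logging.getLogger("expense_report_demo")  # hypothetical logger name


def configure_stderr_logging() -> None:
    # Diagnostics get their own handler, bound explicitly to stderr.
    handler = logging.StreamHandler(sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.propagate = False  # keep the demo independent of root logger config


def emit_report(totals: dict[str, float]) -> None:
    # Diagnostic: goes through logging, so it lands on stderr.
    log.info("writing %d categories", len(totals))
    for category, amount in sorted(totals.items()):
        # Program output: the actual result, printed to stdout.
        print(f"{category},{amount:.2f}")
```

After calling configure_stderr_logging() and emit_report({"food": 12.5, "travel": 600}), redirecting stdout to a file captures only the two CSV rows; the INFO line still shows up in the terminal.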
Logging, which provides diagnostic information, should be directed to standard error. Mix the two streams and you break script.py > out.csv for every user.

So the "starting..." and "read N rows" lines became leveled logs to stderr, wired to a verbosity flag (quiet by default, -v for milestones, -vv for debug):

    import logging
    import sys


    def setup_logging(verbosity: int = 0) -> None:
        # 0 -> WARNING, 1 (-v) -> INFO, 2 or more (-vv) -> DEBUG
        level = [logging.WARNING, logging.INFO, logging.DEBUG][min(verbosity, 2)]
        logging.basicConfig(level=level, handlers=[logging.StreamHandler(sys.stderr)])

Maintain discipline in logging levels: detailed information for each row should be logged at the DEBUG level; summary actions such as "wrote report.csv" should use INFO; and warnings like "skipped 3 rows with missing amount" should use WARNING. INFO-level messages should occur only a constant number of times per execution, never within per-row loops. Additionally, use the lazy formatting style, log.debug("row %d amount=%s", i, amt), to avoid unnecessary string formatting when the log level is not active.

Fix 3: loose script → installable, tested CLI

The hardcoded INPUT path was replaced by an argparse CLI with --help, --limit, --output, --force, -v, and --version. The loose file became a src/ layout package with a pyproject.toml declaring an entry point:

    [project.scripts]
    expense-report = "expense_report.cli:main"

That's what turns "clone it and run python script.py and hope" into a real command. And the pytest suite covers the happy path plus the failure modes the original couldn't survive: missing file, empty file, missing column, non-numeric amount (skipped, not fatal), and refusing to overwrite an existing report without --force.

I verified the whole chain in a fresh virtual environment before writing this:

    python3 -m venv .venv && .venv/bin/pip install ".[dev]"
    .venv/bin/python -m pytest        # 16 passed
    .venv/bin/expense-report --help

All sixteen tests pass, the installation is clean, and the script runs correctly from any directory. That is evidence of reliability rather than an assertion of it.

The checklist provided is intended for your use, and the preceding example demonstrates the complete methodology. Evaluate one of your completed scripts against the ten categories, address the most critical issues first, and verify each fix by reproducing the corresponding failure. The determination of whether a script "needs work" is subjective, but a script with low scores in error handling, tests, and packaging is one you shouldn't hand to anyone yet.

If you'd rather your AI assistant enforce this loop instead of doing it by hand, I packaged the discipline as eight Claude Code skills: the scored audit (ship-check), plus harden-errors, add-logging, make-cli, add-tests, package-it, write-readme, and a release-prep gate, along with the full before/after sample project you just read through. Full disclosure: I built it, and it's a paid kit ($19) at jackiecole.gumroad.com/l/lcscdf. The checklist in this post stands on its own either way.
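One detail from Fix 1 is described in this post only in words: the atomic write. A minimal sketch follows, assuming a hypothetical helper named write_report_atomically; it is one way to do "write to a temp file, then rename it into place" with the standard library, not the shipped project's code.

```python
import os
import tempfile
from pathlib import Path


def write_report_atomically(path: Path, text: str) -> None:
    """Hypothetical helper: replace path with text without exposing a partial file."""
    # Create the temp file next to the destination so the final rename
    # stays on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # os.replace overwrites the destination; the rename is atomic on POSIX.
        os.replace(tmp_name, path)
    except BaseException:
        # On any failure, remove the temp file and leave the old report alone.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

If the process dies before os.replace runs, the previous report.txt is still intact and only a stray .tmp file may remain.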