Making AI-Generated Code Fail Gracefully
If your app generates code with an LLM and executes it, you already know the dirty secret: it fails a lot. Not catastrophically — just wrong method names, bad assumptions about state, off-by-one stuff. The kind of errors a human would fix in 10 seconds. The question is what your user sees when that happens.
The Problem
Version 1 of my app showed users raw Python tracebacks when a generated script failed. Something like:
Script execution failed:
Traceback (most recent call last):
  File "<string>", line 3, in <module>
    items = timeline.GetItemsInTrack("video", 1)
AttributeError: 'Timeline' object has no attribute 'GetItemsInTrack'
The LLM got the method name wrong — it's GetItemListInTrack, not GetItemsInTrack. An easy fix. But my users are video editors, not Python developers. That traceback means nothing to them except "it broke."
The Fix: Silent Self-Correction
Instead of showing the error, I send it back to the LLM with context:
"The previous script failed with: AttributeError: 'Timeline' object has no attribute 'GetItemsInTrack'. Generate a corrected script."
The LLM sees its own mistake, fixes the method name, and the corrected script runs. The user sees:
"Hmm, let me try that a different way..."
Then 2 seconds later:
"✓ Set opacity to 50% on 12 clips"
They never see the error. It just works on the second attempt.
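Here's a minimal sketch of the capture step, assuming the generated script is run with Python's built-in exec. The names run_script and build_correction_prompt are hypothetical, and the sandboxing the app relies on is left out entirely. It keeps only the final line of the traceback, which is the part quoted in the prompt above.

```python
import traceback
from typing import Optional


def run_script(source: str, env: dict) -> Optional[str]:
    """Run generated code; return the last traceback line on failure, else None.

    Assumption: plain exec() stands in for the app's restricted environment.
    """
    try:
        exec(compile(source, "<generated>", "exec"), env)
    except Exception as exc:
        # format_exception_only gives e.g. "AttributeError: ... no attribute ...".
        # Syntax errors produce several lines; the last one names the error.
        return traceback.format_exception_only(type(exc), exc)[-1].strip()
    return None


def build_correction_prompt(error: str) -> str:
    """Wrap the error in the retry prompt wording quoted in the article."""
    return f"The previous script failed with: {error}. Generate a corrected script."
```

Sending only the last line keeps the prompt short. Whether the full traceback would help the model more is something to test rather than assume.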
The Implementation (High Level)
The retry loop is simple:
LLM generates a script
Script fails (validation or execution)
Send the error message back to the LLM as a new prompt
LLM generates a corrected script
Try again (up to 2 retries)
If all retries fail, show a friendly message suggesting simpler commands
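The steps above can be sketched roughly like this. It's an illustration under assumptions, not the app's actual code: generate, execute, and notify are hypothetical callables standing in for the LLM call, the script runner, and the chat UI. It also assumes generate keeps the conversation history, so the correction prompt doesn't need to repeat the original request.

```python
from typing import Callable, Optional

RETRY_MESSAGE = "Hmm, let me try that a different way..."
FALLBACK_MESSAGE = "I couldn't get that to work. Try a simpler version of your request."


def run_with_retries(
    request: str,
    generate: Callable[[str], str],            # prompt -> script (hypothetical LLM call)
    execute: Callable[[str], Optional[str]],   # script -> error text, or None on success
    notify: Callable[[str], None],             # shows a message to the user
    max_retries: int = 2,
) -> bool:
    """Try the request once, then retry up to max_retries times on failure."""
    prompt = request
    for attempt in range(max_retries + 1):
        script = generate(prompt)
        error = execute(script)
        if error is None:
            return True
        if attempt < max_retries:
            # The user sees a conversational note, never the error itself.
            notify(RETRY_MESSAGE)
            prompt = (
                f"The previous script failed with: {error}. "
                "Generate a corrected script."
            )
    notify(FALLBACK_MESSAGE)
    return False
```

Keeping max_retries as a parameter makes it easy to tune the retry count against real failure data.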
The key insight: LLMs are surprisingly good at fixing their own mistakes when you show them the exact error. The success rate on retry is much higher than on the first attempt because the error message narrows the solution space.
Friendly Validation Messages
Not all failures are execution errors. Some scripts get rejected before they run because they violate sandbox rules (my app runs generated code in a restricted environment). Instead of showing "Script contains blocked import: 'os'", the user sees:
"That operation would need external libraries that aren't available. Try rephrasing — most operations work with the built-in tools."
Different failure modes get different messages. The user gets guidance on what to try next, not a technical explanation of why it broke.
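As a rough sketch of how that mapping could look, here is a pre-run check built on Python's ast module. The blocklist, the failure-kind names, and both function names are assumptions for illustration, and the blocked-import wording is lightly adapted from the message above. A static import check like this is not a sandbox on its own. It only shows the idea of turning a technical rejection into a user-facing hint.

```python
import ast
from typing import Optional

# Hypothetical blocklist; the real sandbox rules aren't described in detail.
BLOCKED_MODULES = {"os", "subprocess", "socket"}

FRIENDLY_MESSAGES = {
    "blocked_import": (
        "That operation would need external libraries that aren't available. "
        "Try rephrasing. Most operations work with the built-in tools."
    ),
}
DEFAULT_MESSAGE = "Try a simpler version of your request."


def validate(source: str) -> Optional[str]:
    """Return a failure kind ("syntax", "blocked_import") or None if the script passes."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return "syntax"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        # Compare the top-level package, so "os.path" counts as "os".
        if any(name.split(".")[0] in BLOCKED_MODULES for name in names):
            return "blocked_import"
    return None


def friendly_message(kind: str) -> str:
    """Map a failure kind to user-facing guidance, with a generic fallback."""
    return FRIENDLY_MESSAGES.get(kind, DEFAULT_MESSAGE)
```

The technical detail (which import, which line) can still go back to the LLM for a retry, while the user only ever sees the friendly version.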
What I Learned
Users don't care about errors — they care about results. If you can fix it silently, fix it silently.
LLMs are good debuggers of their own output. Feeding the error back works 70-80% of the time on the first retry.
Three retries is the sweet spot. One isn't enough (sometimes the fix introduces a new error). Two catches most errors. The third is for the last 10-20% of failures, where the script's logic has to be rethought.
Friendly messages need to be actionable. "Something went wrong" is useless. "Try a simpler version of your request" gives the user a next step.
The QThread signal collision was the real bug. I spent hours debugging why retries weren't working before realizing Qt's built-in finished signal was shadowing my custom one in packaged builds. Renamed it and everything clicked. If you're subclassing QThread — don't name your signals finished or started.
The UX Difference
Before: 30% of commands showed a traceback. Users assumed the app was broken.
After: 90%+ of commands succeed (including retries). The 10% that fail get a conversational message. Users assume the app is smart but has limits — which is exactly right.
Building Cutting Room AI — natural language video editing for DaVinci Resolve Studio.
Available now FOR FREE: NickValenciaTech.com