Making AI-Generated Code Fail Gracefully

wpnews.pro


Source: dev.to · Published 2026-05-31T02:55Z · Topic: large-language-models

A developer building an AI-powered video editing tool implemented silent self-correction for LLM-generated code failures, achieving a 90%+ success rate. Instead of showing users raw Python tracebacks, the system feeds error messages back to the LLM for automatic retries, with a 70-80% success rate on the first retry. The approach transforms user experience from seeing cryptic errors to receiving friendly messages like "Hmm, let me try that a different way..." followed by successful execution.

If your app generates code with an LLM and executes it, you already know the dirty secret: it fails a lot. Not catastrophically — just wrong method names, bad assumptions about state, off-by-one stuff. The kind of errors a human would fix in 10 seconds. The question is what your user sees when that happens.

The Problem

Version 1 of my app showed users raw Python tracebacks when a generated script failed. Something like:

Script execution failed:
Traceback (most recent call last):
  File "", line 3, in
    items = timeline.GetItemsInTrack("video", 1)
AttributeError: 'Timeline' object has no attribute 'GetItemsInTrack'

The LLM got the method name wrong — it's GetItemListInTrack, not GetItemsInTrack. An easy fix. But my users are video editors, not Python developers. That traceback means nothing to them except "it broke."

The Fix: Silent Self-Correction

Instead of showing the error, I send it back to the LLM with context:

"The previous script failed with: AttributeError: 'Timeline' object has no attribute 'GetItemsInTrack'. Generate a corrected script."

The LLM sees its own mistake, fixes the method name, and the corrected script runs. The user sees:

"Hmm, let me try that a different way..."

Then 2 seconds later:

"✓ Set opacity to 50% on 12 clips"

They never see the error. It just works on the second attempt.
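A minimal sketch of how that correction prompt could be assembled in Python. The helper names (summarize_error, build_correction_prompt) and the stand-in Timeline class are hypothetical and not taken from the app; only the prompt wording comes from the example above. The idea is to pass along just the final "ErrorType: message" line rather than the whole traceback.

```python
import traceback


def summarize_error(exc: BaseException) -> str:
    """Return only the final 'ErrorType: message' part of an exception."""
    return "".join(traceback.format_exception_only(type(exc), exc)).strip()


def build_correction_prompt(error_summary: str) -> str:
    """Wrap the error summary in the retry prompt wording quoted above."""
    return (
        f"The previous script failed with: {error_summary}. "
        "Generate a corrected script."
    )


class Timeline:
    """Hypothetical stand-in that, like the real object, lacks GetItemsInTrack."""


try:
    Timeline().GetItemsInTrack("video", 1)
except AttributeError as exc:
    prompt = build_correction_prompt(summarize_error(exc))
    print(prompt)
```

In a real app the prompt would probably also carry the original user request and the failed script, so the model has something to correct.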

The Implementation (High Level)

The retry loop is simple:

LLM generates a script

Script fails (validation or execution)

Send the error message back to the LLM as a new prompt

LLM generates a corrected script

Try again (up to 2 retries)

If all retries fail, show a friendly message suggesting simpler commands

The key insight: LLMs are surprisingly good at fixing their own mistakes when you show them the exact error. The success rate on retry is much higher than the first attempt because the error message narrows the solution space.
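The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the app's actual code: generate and execute are hypothetical callables standing in for the LLM call and the sandboxed runner, the give-up wording is invented, and including the original request in the correction prompt is my own choice. The retry cap is a parameter, defaulting to the two retries from the step list.

```python
from typing import Callable, Tuple

RETRY_NOTICE = "Hmm, let me try that a different way..."
# Hypothetical wording for the final fallback message.
GIVE_UP_NOTICE = "I couldn't get that to work. Try a simpler version of your request."


def run_with_retries(
    request: str,
    generate: Callable[[str], str],   # hypothetical LLM call: prompt -> script
    execute: Callable[[str], None],   # hypothetical runner: raises on failure
    notify: Callable[[str], None] = print,
    max_retries: int = 2,
) -> Tuple[bool, int]:
    """Generate and run a script, feeding errors back. Returns (ok, attempts)."""
    prompt = request
    for attempt in range(1, max_retries + 2):
        script = generate(prompt)
        try:
            execute(script)
            return True, attempt
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            if attempt <= max_retries:
                notify(RETRY_NOTICE)  # the user sees this, never the error
            prompt = (
                f"{request}\n\nThe previous script failed with: "
                f"{type(exc).__name__}: {exc}. Generate a corrected script."
            )
    notify(GIVE_UP_NOTICE)
    return False, max_retries + 1


def fake_generate(prompt: str) -> str:
    # Stand-in for the LLM: wrong name first, corrected once it sees the error.
    if "previous script failed" in prompt:
        return "result = items_in_track()"
    return "result = items_on_track()"


def fake_execute(script: str) -> None:
    # Stand-in for the sandbox: only items_in_track exists.
    exec(script, {"items_in_track": lambda: []})


messages = []
ok, attempts = run_with_retries("list clips", fake_generate, fake_execute, messages.append)
print(ok, attempts, messages)
```

Passing notify as a callable keeps the loop independent of the UI layer, so the same function can drive a status label, a chat bubble, or a test.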

Friendly Validation Messages

Not all failures are execution errors. Some scripts get rejected before they run because they violate sandbox rules (my app runs generated code in a restricted environment). Instead of showing "Script contains blocked import: 'os'", the user sees:

"That operation would need external libraries that aren't available. Try rephrasing — most operations work with the built-in tools."

Different failure modes get different messages. The user gets guidance on what to try next, not a technical explanation of why it broke.
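One way such a pre-execution check could map failure categories to messages, as a sketch only. The blocklist contents, category names, and the syntax-error and fallback wording are all hypothetical; the blocked-import message is adapted from the one quoted above. A static import scan like this is not a sandbox on its own and would sit in front of the real restricted environment.

```python
import ast
from typing import Optional

BLOCKED_IMPORTS = {"os", "subprocess", "socket"}  # hypothetical blocklist

FRIENDLY_MESSAGES = {
    "blocked_import": (
        "That operation would need external libraries that aren't available. "
        "Try rephrasing. Most operations work with the built-in tools."
    ),
    # Hypothetical wording for a second failure mode.
    "syntax_error": "I tripped over my own script. Try asking again in other words.",
}
FALLBACK_MESSAGE = "Try a simpler version of your request."


def validate(script: str) -> Optional[str]:
    """Return a failure category, or None if the script passes the checks."""
    try:
        tree = ast.parse(script)
    except SyntaxError:
        return "syntax_error"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        # Compare the top-level package, so "os.path" counts as "os".
        if any(name.split(".")[0] in BLOCKED_IMPORTS for name in names):
            return "blocked_import"
    return None


def friendly_message(category: str) -> str:
    """Translate an internal failure category into user-facing guidance."""
    return FRIENDLY_MESSAGES.get(category, FALLBACK_MESSAGE)


category = validate("import os\nos.remove('clip.mov')")
if category is not None:
    print(friendly_message(category))
```

Keeping the technical category internal means it can still be logged or fed back to the LLM, while the user only ever sees the friendly text.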

What I Learned

Users don't care about errors — they care about results. If you can fix it silently, fix it silently.

LLMs are good debuggers of their own output. Feeding the error back works 70-80% of the time on the first retry.

Three retries is the sweet spot. One isn't enough (sometimes the fix introduces a new error). Two catches most errors. Three is for that last 10-20% that need complex logic reevaluations.

Friendly messages need to be actionable. "Something went wrong" is useless. "Try a simpler version of your request" gives the user a next step.

The QThread signal collision was the real bug. I spent hours debugging why retries weren't working before realizing Qt's built-in finished signal was shadowing my custom one in packaged builds. Renamed it and everything clicked. If you're subclassing QThread — don't name your signals finished or started.

The UX Difference

Before: 30% of commands showed a traceback. Users assumed the app was broken.

After: 90%+ of commands succeed (including retries). The remaining commands, fewer than 10%, get a conversational message. Users assume the app is smart but has limits, which is exactly right.
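As a rough sanity check, the figures quoted in this article fit together: a 30% first-attempt failure rate combined with a 70-80% fix rate on the first retry lands just above 90%. The sketch below assumes every retry fixes the same share of remaining failures, which is a simplification of mine and not something the article measures.

```python
def success_rate(first_try_failure: float, retry_fix_rate: float, retries: int) -> float:
    """Overall success rate if each retry fixes a constant share of failures."""
    still_failing = first_try_failure * (1 - retry_fix_rate) ** retries
    return 1 - still_failing


low = success_rate(0.30, 0.70, retries=1)   # about 0.91
high = success_rate(0.30, 0.80, retries=1)  # about 0.94
print(f"{low:.2f} to {high:.2f}")
```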

Building Cutting Room AI — natural language video editing for DaVinci Resolve Studio.

Available now FOR FREE: NickValenciaTech.com
