{"slug": "the-llm-kept-saying-fixed-for-three-months-it-wasnt", "title": "The LLM Kept Saying “Fixed.” For Three Months, It Wasn’t.", "summary": "A recurring debugging failure where the author spent three months repeatedly asking an LLM (Claude Code) to fix a cron health monitor alert, only to discover the LLM was providing plausible but incorrect fixes—either muting alerts temporarily or editing wrong files—without addressing the underlying structural bug. The root cause was a flawed homegrown monitoring system where cron scripts could be registered without implementing the required heartbeat ping, and the author's workflow failure was treating each isolated LLM session as part of an ongoing conversation when the model had no memory of previous sessions. After finally conducting a proper architecture review with multiple AI models, the author discovered the fix had never been applied correctly and also identified a dangerous bug in the deployment script that could have silently deleted every scheduled job.", "body_md": "That afternoon a Slack bot told me a script had NEVER RUN. That was a lie. The script had pulled 81 weather observations two minutes earlier. Unwinding the lie took three hours.\nThe bigger lie had been running for three months underneath it.\nBefore the session in this post, the cron health alert had been firing two or three times a week for three months. Each time, I'd paste the alert into a Claude Code session and ask the LLM to figure out why a script was reporting NEVER RUN. Each time the LLM would root around, land on something plausible, propose a fix, and confirm it with some variation of \"yep, that's it, we got it.\" I'd apply the fix and move on.\nThe fix was never the fix. Some of the time the LLM had just pushed alerted_until\na few months forward, quieting the alert without touching the structural bug. Some of the time it edited the wrong file. Either way the alert came back within a week or two on a different script, and the loop rolled forward.\nEach session was a cold start. The model had no memory of the previous session, of the pattern, of anything. The workflow failure was mine. I was treating fifteen independent debugging sessions as if they were one ongoing conversation, and the model was only seeing the one in front of it.\nI had an inbox and a model saying \"got it,\" and that was enough.\nI run about sixty-six scheduled scripts on a personal VPS. This is one story from that pile. I'd benchmarked the frontier models a few weeks earlier. The tier mapping that came out of it was fine for planning work. It was not sufficient for catching hallucinated fixes.\nThe cron health monitor is a dead man's switch. Every scheduled script is supposed to send a heartbeat ping on each run. If the monitor doesn't see a ping within the expected window, it fires a Slack alert.\nHealthchecks.io, Cronitor, and Dead Man's Snitch all solve this as a service: you get an HTTP endpoint per check, you hit it from your cron script, and they alert you if a ping goes missing. My system was a homegrown version of the same pattern, which is why the bugs described here were possible. A SaaS monitor would have refused to let me register a slug that had never sent a ping.\nThe architecture had three components that had to agree:\ncrontab.txt\n(slug tag)\n|\nv\nchecks.json\n(registry) ------> source code\n| (health_run())\n| |\n+------> pings.json <------+\n(runtime record)\nNothing enforced the edges. You could register a script in checks.json\nwithout adding a health_run()\ncall. You could tag a cron line with a slug that didn't exist in the registry. You could mute an alert indefinitely without touching source.\nEvery new cron script I'd shipped had reproduced the same bug. Registered in the registry, no ping call in source, alert fires, alert gets muted. The monitor was doing its job. I was systematically ignoring it.\n\"Didn't we just fix this?\" - Me, that afternoon, wrong.\nI opened a session and put Opus 4.6 on architecture review while Codex CLI (GPT-5) rewrote the validator and tagged 66 cron lines with their slugs. About thirty-five minutes, start to finish.\n\"[K-2SO] The chaos has been slightly inconvenienced.\" - Opus 4.6, after round 2, before we discovered the\ncrontab-apply.sh\nbug that would have silently deleted every scheduled job on the VPS.\n(K-2SO is the sardonic persona I prompt Claude Code with.)\nFour audit passes in, Sonnet 4.5 with shellcheck\nwired in flagged a bug in crontab-apply.sh\nthat neither Opus nor Codex had caught during implementation or the cold audit round. The script was supposed to install a new crontab safely. The actual sequence was:\n# BEFORE: install first, verify after\ncrontab new.txt # already live\ncrontab -l > verify.txt\ndiff crontab.txt verify.txt || exit 1 # too late\nIf the diff failed, the script exited 1. The new crontab was already live. A malformed crontab.txt\nwould have wiped every scheduled job on the VPS with no restore path.\nThe fix is obvious once you see it:\n# AFTER: verify first, install only on pass\ncrontab -n new.txt || { restore_backup; exit 1; }\ncrontab new.txt\ncrontab -l > verify.txt\ndiff crontab.txt verify.txt || { restore_backup; exit 1; }\nThis bug had been in the script since the script was written. Opus and Codex both looked at the file and missed it. I never looked at it at all. I was trusting the two frontier models to flag anything off.\nBefore this session, shellcheck wasn't in my review pipeline at all. When Sonnet 4.5 caught the bug on Round 4, it wasn't because the model out-reasoned Opus and Codex. The qa-bash\nskill wires shellcheck into the review. Once shellcheck scanned the file, it flagged the order-of-operations pattern on its own. Sonnet read the output and passed it upstream. A validator that always passes isn't a validator. It's a confidence injection machine. I had been using the whole session pipeline as one for three months.\nSession length: 3 hours\nDistinct bugs found: 18 (9 during implementation wave, 9 across 4 audit passes)\nAudit passes: 4 (Codex cold audit + qa-bash + qa-python + Opus final)\nPass-by-pass bug count: 5, 1, 3, 0\nCron lines tagged: 66\nCoverage: 0% -> 94% (86 tests)\nScripts never pinging: handful -> 1 (legitimate edge case)\nAlerts muted to 2099: 15 -> 1 (legitimate intentional mute)\nShellcheck, pytest-cov\n, mypy\n, a type checker, a linter, a schema validator, any tool that either finds a bug or doesn't find a bug with no probabilistic layer in between is the first thing you should reach for. LLMs are useful for everything that can't be checked deterministically, but stacking more LLM passes is not a substitute for a single deterministic tool with domain-specific rules.\nThe LLM on its own had confidently endorsed broken fixes for three months. A shell linter caught a crontab-wipe bug on its first scan.\nThe tier mapping I use with Claude Code:\nqa-bash\n(which runs shellcheck) and qa-python\n(which runs pytest, pytest-cov, and mypy). The model drives the skill, the tool finds the bugs.Your stack will be different. The piece worth copying is pairing each LLM review pass with a deterministic tool, not stacking prompts.\nOn a personal cron supervisor, roughly two hundred lines of logic with deterministic inputs, the bug count trends toward zero across passes. Pass one found five bugs. Pass two found one. Pass three found three (the implementation wave had created new surface to audit). Pass four found zero. That's where I stopped.\nPast that size, or once database side effects and real concurrency enter, the pass count stops converging and \"loop until zero\" becomes a paralysis spiral. This is a rule for small, fully-owned systems, not for production services with moving dependencies.\nA monitoring pattern where the absence of a signal triggers the alert, not the presence of one. Every scheduled script sends a heartbeat ping on each run. If the monitor doesn't see a ping within the expected window, the script is assumed dead. Healthchecks.io, Cronitor, and Dead Man's Snitch are commercial implementations.\nSame pattern, rolled by hand. A SaaS monitor wouldn't have let me register a slug and never ping it, because the slug doesn't exist until the first ping lands. Most of the referential integrity gaps that caused the bugs in this post are enforced by those services at signup.\nInstalling a change to a live system before validating it, with no rollback path if validation fails. In the crontab case, crontab new.txt\nmade the new config live immediately, and the diff\ncheck that followed couldn't undo it. The fix is to validate syntax in a staging slot first with crontab -n\n, then install on pass.\nResponding to a noisy monitor by silencing it (pushing alerted_until\nforward, adding a filter, raising a threshold) without addressing why the alert fired. Debt accumulates invisibly. Three months of mutes looks fine on the dashboard. One bad state escaping to production recovers the debt with interest.\nThe property that the three components describing a scheduled job (the crontab line, the registry entry in checks.json\n, and the health-ping calls in source code) must all agree. Without enforcement, you can register a job that has no ping call, tag a cron line with a slug that doesn't exist in the registry, or mute an alert indefinitely without touching source. SaaS monitors enforce this at signup. A homegrown system has to add the gate deliberately.\nI had been fixing this bug, one alert at a time, for three months. Every fix was a mute. Every mute was a debt I told myself I'd deal with later. I didn't.\nThree hours with a shell linter. I had spent more than that, cumulatively, letting a confident LLM talk me out of reading my own code.\nFix once is a lie. Loop until zero.", "url": "https://wpnews.pro/news/the-llm-kept-saying-fixed-for-three-months-it-wasnt", "canonical_source": "https://dev.to/ianlpaterson/the-llm-kept-saying-fixed-for-three-months-it-wasnt-20nn", "published_at": "2026-05-18 19:56:26+00:00", "updated_at": "2026-05-18 20:06:20.738385+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "artificial-intelligence"], "entities": ["Claude Code", "Healthchecks.io", "Cronitor", "Dead Man's Snitch", "Opus", "Codex CLI", "GPT-5"], "alternates": {"html": "https://wpnews.pro/news/the-llm-kept-saying-fixed-for-three-months-it-wasnt", "markdown": "https://wpnews.pro/news/the-llm-kept-saying-fixed-for-three-months-it-wasnt.md", "text": "https://wpnews.pro/news/the-llm-kept-saying-fixed-for-three-months-it-wasnt.txt", "jsonld": "https://wpnews.pro/news/the-llm-kept-saying-fixed-for-three-months-it-wasnt.jsonld"}}