A Unicode space killed my cron for 2.5 hours

wpnews.pro

My cron job died for two and a half hours on June 16. Nothing logged. No exception trace anywhere. The only thing that noticed was another cron job, and it noticed because a file's modification time had stopped moving.

The killer was one character: U+202F, a NARROW NO-BREAK SPACE. It arrived in the body of a single inbound email — pasted from someone's word processor, typographer whitespace intact. My poller's print()

statement tried to write it to a Windows console running cp1252, and the whole process died before it could log a single line about why.

I want to walk through this one because the bug is small and the failure mode is exactly the kind of thing a test suite will never catch.

I run 17 side projects out of a single workspace. Email is one of the inputs. A Python script called imap_poller.py

runs every 15 minutes from a local cron. It pulls unread mail from a local IMAP cache called VizMail, classifies each message with Claude Haiku, and creates a ticket on the right project's Kanban board. Pass-through senders skip the LLM. Recipient-based routing — when the email is addressed to one of my per-project contact addresses — skips the LLM too. Everything else falls through to Haiku for triage.

The poller logs every run to .agents/imap-poller/memory.md

. The file has a "Run Log (last 20 entries)" section that gets appended on each invocation, even when there are no new emails. That self-logging is important. It's not just a debug aid. It is the signal another piece of infrastructure uses to decide whether the poller is alive.

The cron line is boring:

*/15 * * * *  python C:\workspace\imap_poller.py --verbose

I run with --verbose

because the output gets captured by the cron runner and stored. When something misroutes, I want to see the email subject and the classification decision in the captured log. That decision, running verbose in production, is what made the bug possible.

Here is the relevant slice of main()

before the fix:

def main():
    parser = argparse.ArgumentParser(...)
    parser.add_argument("--verbose", action="store_true", ...)
    args = parser.parse_args()

    emails = fetch_unprocessed_emails()
    ...
    for em in emails:
        if args.verbose:
            print(f"From: {em.get('from', em.get('sender', ''))}")
            print(f"Subject: {subject}")
            body = em.get("body", em.get("text", em.get("snippet", "")))
            print(f"Body preview: {str(body)[:200]}")
        ...

At 17:30 UTC, the loop hit an inbound email whose body contained a U+202F. On Linux, this is uneventful. On a Windows console with the default cp1252 codepage, the encoder hits this character and raises:

UnicodeEncodeError: 'charmap' codec can't encode character ' '
in position 12: character maps to <undefined>

The exception bubbled up out of print()

. It killed the process before any except-clause could catch it (the verbose print

runs at the top of the per-email loop, before the try

block that wraps ticket creation). The cron runner saw a non-zero exit, didn't see any stdout because the exception fired in the middle of writing to stdout, and moved on. Fifteen minutes later, the next run hit the same email (VizMail still flagged it unprocessed, since nothing had marked it processed) and died the same way. And again. And again. Ten dead runs back to back, none of them visible from the outside.

The output that should have been there on every run, the line [+] #<id> on <project>: <title>

followed by _append_run_log()

writing a new entry to memory.md

, was simply missing. The poller looked exactly like a poller with zero new mail, except it wasn't running at all.

Tests didn't catch this for a reason that is, in retrospect, obvious. My test suite runs under pytest

with stdout captured by pytest's capture mechanism, which uses an internal buffer with UTF-8 encoding. My fixtures contain ASCII email bodies. The CI environment, if I had one for this script, would default to UTF-8 anyway. The only environment where print()

writes to a real cp1252-backed terminal is the one environment that matters: production.

You can sometimes guess at the existence of bugs like this from a code review. You cannot stumble onto them from testing alone. The shape of the bug is "production-only configuration interacts with production-only data," and tests by definition exclude both.

I would have noticed eventually. Probably the next morning, when I opened the boards and saw nothing from the previous evening's email. But by then it would have been 12+ hours, and one of those silent emails was a prospect lead.

What actually caught it was a second cron job, imap-poller-watchdog

, that runs at the top of every hour. The staleness check at its core:

$memoryFile = "C:\workspace\.agents\imap-poller\memory.md"
$thresholdH = 2
$file       = Get-Item $memoryFile -ErrorAction SilentlyContinue
$ageHours   = ((Get-Date) - $file.LastWriteTime).TotalHours

if ($ageHours -le $thresholdH) { exit 0 }

That is the entire health check. It does not parse the file. It does not ping an endpoint. It checks the modification time on a markdown file and asks: has my poller written anything in the last two hours? If the answer is no, it POST

s a ticket to the orchestrator's board titled IMAP poller silencieux depuis Xh

and tags me on it.

At 20:00 UTC, two and a half hours into the outage, that watchdog finally crossed the threshold. The ticket appeared on my board. Five minutes later I had the cause. Six minutes later I had the fix. The next cron window processed the email that had been waiting since 17:30.

The key insight here is the kind of signal the watchdog uses. It does not watch for failures. A "watch for failures" check would have seen nothing, because the poller wasn't producing any. It watches for the absence of expected activity. That's a much harder signal to fake, and it covers a much wider class of failures, including the one where your process dies before it can tell you it died.

Two lines, at the top of main()

:

def main():
    sys.stdout.reconfigure(encoding='utf-8', errors='replace')
    sys.stderr.reconfigure(encoding='utf-8', errors='replace')
    ...

That's it. reconfigure()

was added to TextIOWrapper

in Python 3.7 specifically for this case. It changes the encoding of an already-open stream without you having to swap the file object. After these two lines, stdout encodes arbitrary Unicode as UTF-8 and writes the bytes to the console. The console may render them as boxes or question marks depending on its font, but the program does not crash.

Two design choices worth flagging:

errors='replace'

instead of 'strict'

. Strict would still raise on characters the encoder cannot handle, which for UTF-8 is basically nothing. replace

is cheap insurance for the case where I redirect stdout to something unusual. The cost of a ?

showing up in a debug log is zero. The cost of another silent crash is the watchdog ticket I just wrote a postmortem on.

Reconfiguring stdout at the application boundary, not switching to a logging library. I considered the latter for about ninety seconds. The script is 400 lines. The only consumer of its output is the cron runner's stdout capture, and adding a logging dependency would have introduced more code than the bug itself. The fix has to be proportional to the bug. Two lines is proportional.

The commit hash is 7ef6da4

if anyone is curious; the watchdog ticket is still on my board, marked Done, with the postmortem attached.

The fix is fine. The watchdog is fine. What I got wrong is upstream.

Running --verbose

in production was the choice that turned a benign data anomaly into an outage. The flag exists for debugging. It writes data that comes from external systems (email bodies) to stdout. Any time you do that on Windows, or on a Linux container with a degenerate LANG

, you are one weird character away from this exact failure. The right move is either structured logging, where the writer controls the encoding. Or drop --verbose

from the cron line entirely and rerun it manually when I want to debug.

I left --verbose

in the cron line. The encoding fix means it cannot crash for this reason anymore, and the captured verbose output remains useful when classification misroutes a message. But it is a calibrated decision now instead of an accident.

The bigger lesson is the watchdog itself. I built it earlier this year, after a different gap where the poller went silent and I didn't notice for over a day. Until I built it, I had been monitoring exactly the wrong thing: tickets created, errors logged, exit codes. All of those go to zero in both healthy and dead states. What separates the two is the modification time on a file the script itself writes. Every cron job in this workspace now has, or will get, the same kind of staleness check. They are cheap to write and they catch the failure mode where everything looks fine except nothing is happening.

If you only take one thing from this post, take this: do not check whether your background job failed. Check whether it ran.

The orchestrator that runs the poller, the watchdog, the boards, and the 17 projects is open source: github.com/Ekioo/KittyClaw — MIT, star if useful.

The same poller routes prospect emails for ekioo.com, the consulting site whose leads were 2.5 hours late landing in the inbox that evening. They got there. The system caught itself before I did.

If you've written a similar "did this run recently" watchdog for a cron of your own, I'd be curious how you tune the threshold. Two hours felt right for a 15-minute interval. I have no rigorous reason for that beyond "long enough to ignore a single missed run, short enough to catch a real failure during business hours."

source & further reading

dev.to — original article How to run a team of AI marketing agents from Slack PQC migration does not start by replacing an algorithm I Spent 10x Longer Debugging AI Code Than Writing It — Here's What Changed

A Unicode space killed my cron for 2.5 hours

Run your AI side-project on zahid.host