ShareBox displays shared folders as a Netflix-style grid with TMDB posters. The problem: folder names come from torrents. A folder named Naruto.INTEGRALE.MULTI.VFF.1080p.BluRay.x264-AMB3R needs to match "Naruto" on TMDB, not "Naruto Shippuden" and not "Naruto the Movie". And a folder named Vol 1 must definitely not match "Kill Bill: Volume 1".
Basic regex + TMDB search works for 80% of cases. For the remaining 20%, I built a 3-pass AI pipeline (Claude Haiku via CLI) with a cron every 30 minutes. Here's each pass in detail, the exact prompts, and iterations measured on 290 real entries.
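To make the "basic regex" step concrete, here is a minimal Python sketch of the kind of cleanup involved. It is not the author's implementation: the function name clean_release_name, the tag list, and the cut-at-first-tag strategy are all assumptions chosen for illustration, and a real tag list would be much longer.

```python
import re

# Hypothetical, deliberately short list of release tags (assumption).
NOISE_TAGS = re.compile(
    r"\b(INTEGRALE|COMPLETE|MULTI|VFF|VF|VOSTFR|FRENCH|"
    r"2160p|1080p|720p|480p|BluRay|WEB-?DL|WEBRip|HDTV|DVDRip|"
    r"x264|x265|H264|HEVC)\b",
    re.IGNORECASE,
)
YEAR = re.compile(r"\b(19\d{2}|20\d{2})\b")


def clean_release_name(name):
    """Return (title, year) guessed from a torrent-style folder name."""
    # Dots and underscores act as word separators in release names.
    text = re.sub(r"[._]+", " ", name)

    # Everything from the first technical tag onward is noise.
    cut = len(text)
    tag = NOISE_TAGS.search(text)
    if tag:
        cut = tag.start()

    # Take the first year that is not at position 0, so that a title
    # which itself starts with a year is not emptied out.
    year = None
    for m in YEAR.finditer(text):
        if m.start() > 0:
            year = int(m.group(1))
            cut = min(cut, m.start())
            break

    title = re.sub(r"\s+", " ", text[:cut]).strip(" -([")
    return title, year
```

The resulting title and year would then be passed to a TMDB search. Names that survive this cleanup but still match the wrong entry are the cases the AI passes are meant to catch.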
The architecture is layered, cheapest to most expensive:
extract_title_year() cleans the name, searches TMDB, and takes the first result with a poster. Free, instant, correct ~80% of the time.
The first prompt was simple: "extract the proper movie title for a TMDB search." Tested on 290 real names, it produced 72 false skips: the AI considered "Naruto.INTEGRALE", "Pokemon La Series", and "Despicable Me COLLECTION" to be non-titles and marked them skip=true.
The fix: explicit rules about what to keep vs. skip, a "when in doubt, skip=false" rule, and instructions to translate known English titles to French. Result: 72 → 41 skips. 31 improvements, zero regressions.
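Counting improvements and regressions between two prompt versions amounts to a diff over the benchmark. The sketch below assumes each run is stored as a dict mapping folder name to its skip flag; that storage format and the function name compare_skips are hypothetical, not taken from the original pipeline.

```python
def compare_skips(v1, v2):
    """Diff two prompt runs over the same benchmark names.

    v1, v2: dicts mapping folder name -> skip flag (bool).
    Returns (improvements, regressions) as sorted lists of names.
    """
    # Skipped by the old prompt, kept by the new one.
    improvements = sorted(n for n in v1 if v1[n] and not v2[n])
    # Kept by the old prompt, skipped by the new one.
    regressions = sorted(n for n in v1 if not v1[n] and v2[n])
    return improvements, regressions


# Tiny illustrative sample, not the real 290-entry benchmark.
v1 = {"Naruto.INTEGRALE": True, "Pokemon La Series": True, "sample.nfo": True}
v2 = {"Naruto.INTEGRALE": False, "Pokemon La Series": False, "sample.nfo": True}
improved, regressed = compare_skips(v1, v2)
```

Both lists still need a quick manual review, since a newly kept name is only an improvement if it really is a title.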
The verification prompt sent {name, TMDB title} pairs and asked for correct: true/false. On 247 entries, it flagged 55 as incorrect. But 46 of those were false negatives: correct matches wrongly flagged.
The AI didn't know that S01 → "Season 1" is a correct match: it's a TMDB season poster, not a generic match. Same for all 34 Simpsons seasons, 11 Walking Dead seasons, 4 Batman seasons.
The fix: a "Special cases — do NOT mark as incorrect" section explaining that season folders matched to season titles are correct, and translations/saga names are fine. Result: 55 → 9 incorrects. All 9 are real problems. Zero false negatives.
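The season rule can also be written as a deterministic check, which is handy for auditing the AI's verdicts on the benchmark. This is not how the pipeline handles it (the fix described above lives in the prompt). The function name is_season_match and both patterns are assumptions for illustration.

```python
import re

# Hypothetical patterns: "S01" in the folder name, "Season 1" (or the
# French "Saison 1") in the TMDB title.
SEASON_IN_NAME = re.compile(r"\bS(\d{1,2})\b", re.IGNORECASE)
SEASON_IN_TITLE = re.compile(r"\b(?:Season|Saison)\s+(\d{1,2})\b", re.IGNORECASE)


def is_season_match(folder_name, tmdb_title):
    """True when the folder's season number equals the matched season title's."""
    name = re.sub(r"[._]+", " ", folder_name)
    in_name = SEASON_IN_NAME.search(name)
    in_title = SEASON_IN_TITLE.search(tmdb_title)
    if not (in_name and in_title):
        return False
    return int(in_name.group(1)) == int(in_title.group(1))
```

Any pair the AI flags as incorrect while this check returns True is a likely false negative worth inspecting.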
When pass 2 detects a false positive and suggests "Naruto" as a better title, we search TMDB. Problem: TMDB returns results by popularity. "Naruto" → Naruto Shippuden (more popular). Taking the first result reproduces the error.
The solution: get 15 TMDB candidates (via the multi, tv, and movie endpoints) and send the full list to the AI with the filename for context. The AI picks {"idx": 1}, which is Naruto (2002), the original series. The word "INTEGRALE" in the filename helps it understand this is the complete series, not a spin-off.
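A minimal sketch of the candidate-list step, assuming the three TMDB searches have already been run and their result lists are passed in as plain dicts. The field names (id, media_type, name, title, first_air_date, release_date) follow TMDB's search responses; the function names, the zero-based numbering, and the prompt wording are assumptions, not the author's exact prompt.

```python
def merge_candidates(multi, tv, movie, limit=15):
    """Merge results of the multi, tv and movie searches, without duplicates."""
    merged, seen = [], set()
    for default_type, results in ((None, multi), ("tv", tv), ("movie", movie)):
        for r in results:
            # Only multi-search results carry media_type themselves.
            media_type = r.get("media_type") or default_type
            if media_type not in ("tv", "movie"):
                continue  # multi search also returns people
            key = (media_type, r["id"])
            if key in seen:
                continue
            seen.add(key)
            date = r.get("first_air_date") or r.get("release_date") or ""
            merged.append({
                "type": media_type,
                "id": r["id"],
                "title": r.get("name") or r.get("title") or "",
                "year": date[:4],
            })
    return merged[:limit]


def format_candidates(filename, candidates):
    """Build the numbered list sent to the model (illustrative wording)."""
    lines = [f"Filename: {filename}", "Candidates:"]
    for idx, c in enumerate(candidates):
        lines.append(f'{idx}: {c["title"]} ({c["year"] or "?"}) [{c["type"]}]')
    lines.append('Answer with JSON only: {"idx": N}')
    return "\n".join(lines)


# Made-up ids and dates, for illustration only.
multi = [
    {"id": 1, "media_type": "tv", "name": "Naruto Shippuden", "first_air_date": "2007-02-15"},
    {"id": 2, "media_type": "tv", "name": "Naruto", "first_air_date": "2002-10-03"},
    {"id": 3, "media_type": "person", "name": "Someone"},
]
tv = [{"id": 2, "name": "Naruto", "first_air_date": "2002-10-03"}]
movie = [{"id": 9, "title": "Naruto the Movie", "release_date": "2004-08-21"}]
cands = merge_candidates(multi, tv, movie)
prompt = format_candidates("Naruto.INTEGRALE.MULTI.VFF.1080p.BluRay.x264-AMB3R", cands)
```

Showing the year and media type next to each title gives the model what the popularity ranking hides: which entry is the original series and which is a sequel or film.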
A gotcha: Claude sometimes adds explanations after the JSON, breaking parsing. Fix: extract {"idx": N} via regex instead of full JSON parsing.
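The regex extraction can be sketched in a few lines. This is a Python illustration of the idea rather than the original code (the post's mention of json_decode() suggests the real pipeline is PHP); the function name extract_idx and the bounds check are assumptions.

```python
import re

# Matches {"idx": N} anywhere in the reply, tolerating extra whitespace.
IDX_RE = re.compile(r'\{\s*"idx"\s*:\s*(-?\d+)\s*\}')


def extract_idx(reply, n_candidates):
    """Return the chosen index, or None if it is absent or out of range."""
    m = IDX_RE.search(reply)
    if not m:
        return None
    idx = int(m.group(1))
    # Reject indexes that do not point at a real candidate.
    return idx if 0 <= idx < n_candidates else None
```

Because the pattern only looks for the one expected object, surrounding code fences or trailing explanations no longer matter. Returning None for a missing or out-of-range index lets the caller keep the previous match instead of crashing.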
| Prompt | Before | After | Improvement |
|---|---|---|---|
| Pass 1 (extraction) | 72 false skips | 41 | -43% |
| Pass 2 (verification) | 55 flagged as incorrect | 9 (all real) | -84% |
| Pass 3 (candidate pick) | 4 parse failures | 0 | -100% |
Measure before iterating. Without 290 real entries as a benchmark, I would have iterated blindly. The numbers showed that 84% of what pass 2 v1 flagged (46 of 55) was actually correct, which is impossible to see without real data.
Edge cases dominate. 46 of the 55 entries flagged by pass 2 were false negatives, and they came from one pattern: season folders. One line in the prompt ("seasons matched to Season N are CORRECT") eliminated 84% of the flags. The 80/20 rule applies to prompts too.
Parsing matters as much as the prompt. A perfect prompt is useless if parsing breaks. The AI adds text, code fences, explanations. Regex extraction is more reliable than json_decode().
Layered architecture reduces costs. Free regex handles 80%. AI only runs on the remaining 20%. Pass 3 (the most expensive) only fires when pass 2 detects a problem — 9 times out of 290 entries.
The best prompt isn't the one with the most instructions — it's the one that precisely describes edge cases. "When in doubt, skip=false" and "seasons are CORRECT" are worth more than 20 lines of generic rules.