Why This System Looks the Way It Does — Recoverflow's 6-Day Design Journey

Recoverflow spent four days on architecture design before writing any code for its AI collections system, prioritizing what the system should not do over what it can do. The team shifted the design center from gross invoice amount to actual outstanding balance, a decision that propagates through five agents and defines the sweet spot for small business owners. The success metric is customer relationship preservation and owner willingness to recommend, not just money recovered.

The day after submitting to the hackathon, I finally had a moment to look back at the whole system. Something surprising turned up: in these 7 days, we spent more time thinking about architecture than writing code, by a factor of two.

Jun 9-11 PRE-EVENT ← 4 days, pure design
Jun 12 KICKOFF
Jun 13-15 BUILD ← 3 days, writing code
Jun 16-18 POLISH ← 3 days, demo + submission

Days 9-12, those 4 solid days, went to design alone. By Day 13, when we started writing code, the system's spine was already standing. Writing code turned out to be the least painful part, because in the previous 4 days I'd already nailed down "what NOT to do."

This post is that full design journey: why I picked these designs, what problems we hit, how we fixed them, what we added. Long-form, grab water.

Before I built anything, I asked myself one question: "Can I build an AI collections system that small business owners would actively recommend to their friends?"

Not "can I build a multi-agent demo," not "can I wire up all 9 sponsors," not "can I ship something submittable in 7 days." The question was: can I build something a small business owner would actively recommend to their friends?

The question is provocative, because AI collections, as a category, is reputation poison in the eyes of most small business owners. Nobody recommends "I used an AI robot to bother my customers." But I thought it could work if the AI genuinely respected boundaries in the right places, didn't step on landmines, and actually solved a pain the owner couldn't solve alone.

This question had two design implications.

First: the entire system's design center of gravity shifted from "what can it do" to "what should it NOT do." What cases not to take (D-031: Lite mode for < $3K means we let go and don't chase), what words not to say (D-019: LLM never writes bank numbers), when not to continue (D-035: stops before any 988 welfare signal), what hours not to call (TCPA-compliant windows + US federal holidays).
Every line is a clear "no."

Second: the success metric was no longer "amount of money recovered." It was "customer relationship stays whole + owner feels safe recommending us to friends." These two metrics sometimes fight. Money recovered but the customer felt bullied by a robot: failure. No money recovered but the customer said "they were professional and we had a real conversation": success. Once you know which side matters more, the whole prompt is written differently.

My aunt is in finance. She told me one thing that made me redesign the whole system. She said that in most cross-border collection disputes, the amount isn't actually that big. A lot of customers say "I already paid the 50% deposit," "the rest I'll pay 3 months after goods arrive," "we amended the SOW midway, can you confirm?"

From the gross invoice view, it's a $47K invoice. From the actual outstanding view, it might be $23K. A US lawyer opens the file, sees $47K, and says "too small, cross-border litigation costs over $30K, the math doesn't work for you." But what we're chasing is $23K, squarely in our sweet spot ($3K-$40K).

That moment I realized the entire industry is using the wrong number to define the sweet spot. I made this D-038: we chase outstanding, not gross. This one decision propagates through 5 agents:

Pre-flight → routes by "outstanding" (three paths: Lite / In-Spot / Attorney)
Reconciler → settles by "outstanding" (partial / full / overpaid / unmatched)
Diplomat → email body cites "outstanding," not "gross"
Escalator → demand letter draft cites "outstanding"
AAA Specialist → formal attorney letter cites "outstanding"

Every routing boundary has automated checks pinning it, so nobody accidentally lets "gross" leak through. It's a one-word difference between "gross" and "outstanding," and that's the difference between a system that survives in the mid-market and one that doesn't. This is a tool built specifically to make life easier for finance people chasing unpaid balances.
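To make D-038 concrete, here is a minimal sketch of routing on outstanding balance rather than gross. The function and constant names are hypothetical, not Recoverflow's actual code. The $3K and $40K boundaries come from the sweet spot described above; sending anything above $40K to the Attorney path is my assumption, and the real Pre-flight also weighs contract clauses and the behavior label, which this sketch leaves out.

```python
from enum import Enum


class Path(Enum):
    LITE = "lite"
    IN_SPOT = "in_spot"
    ATTORNEY = "attorney_recommended"


# Boundaries taken from the post's sweet spot ($3K-$40K), in whole dollars.
LITE_CEILING = 3_000
IN_SPOT_CEILING = 40_000


def outstanding_balance(gross_invoice: int, amount_paid: int) -> int:
    """Outstanding = gross invoice minus what was already paid (never below zero)."""
    return max(gross_invoice - amount_paid, 0)


def route_by_outstanding(outstanding: int) -> Path:
    """Pick a path from the outstanding balance only (hypothetical simplification)."""
    if outstanding < LITE_CEILING:
        return Path.LITE
    if outstanding <= IN_SPOT_CEILING:
        return Path.IN_SPOT
    # Assumption: anything above the sweet spot goes to the Attorney path.
    return Path.ATTORNEY


# The $47K invoice with $24K already paid is a $23K case, not a $47K case.
case_path = route_by_outstanding(outstanding_balance(47_000, 24_000))
gross_path = route_by_outstanding(47_000)
```

Routed on gross, the same invoice would land outside the sweet spot; routed on outstanding, it stays In-Spot. That gap is what the automated boundary checks are there to pin.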
Most multi-agent systems do this: write one giant prompt, hand it to a "super-agent," and pray it doesn't make a mistake. We went the opposite way. This is something I've believed about AI from the start: specialist agents doing one job have a lower error rate than generalists.

Doers can be ordinary. Guardians must be rock-solid. Why? In collections, one wrong move (an FDCPA-violating line, an AI-hallucinated bank account, an unconsented escalation) permanently destroys customer trust. Trust is this industry's asset. Guardians are there to protect it.

And beyond the AI guardians, every outbound action has to go through a human. This is the principle I hold to: AI is a tool that helps us. The decision authority should stay with humans.

The system has three layers.

The gatekeeper is one agent: Pre-flight. It reads the contract and decides which of the three paths the case takes: Lite, In-Spot, or Attorney-Recommended. The decision input is D-038 outstanding balance + contract clauses (governing law, arbitration, late fee, deposit %). Get this wrong once and the whole pipeline runs down the wrong path. So Pre-flight is the gatekeeper and has to get this right first.

The Doers are 5 agents: Investigator, Diplomat, Voice, Payment, Escalator. Their work is "ordinary": write emails, look up data, dial phones, run flow. Each agent does one thing, then waits for the Guardians to review. Doers themselves don't send any email, letter, or phone call out. They produce drafts, hand them to Concierge, and wait for approval.

The Guardians are 3 agents: Concierge, Tone Coach, AAA Specialist.

Concierge is the "single outbound choke point": every outbound action comes out through here. The operator presses APPROVE / REJECT / REVISE in Slack. The AI itself can't send anything outside on its own.

Tone Coach is the "cross-agent tone gatekeeper": it scans Diplomat / Voice drafts in real time and blocks any FDCPA-violating sentence (Claude does the tone judgment).

AAA Specialist is the "Day-65 dynamic join": normally not in the Room.
When a case ages past 60 days, Escalator uses Band Platform's tools.lookup_peers + tools.add_participant to pull AAA in to draft the legal letter. The reason this role exists this way is future expansion: when a contract is parsed and we discover which state's law governs it, we can pull in the specialist agent for that state, and extend to non-US jurisdictions too.

The three-layer design philosophy: when Doers make mistakes, Guardians catch them; when Guardians make mistakes, the whole system stops. This sounds paranoid, but collections is exactly that fragile. One wrong word and you've burned it, so no single agent can be allowed to talk to the customer alone.

After writing the three-layer architecture, I read it back myself and felt it was still abstract. Let me run one real case through the whole thing, so you can see how the 9 agents hand off.

Step 1: Drop in the contract
The operator drops the contract PDF into the system. Pre-flight catches it and parses it with Gemini 2.5 Flash-Lite: governing law (Texas / California / New York), late fee clause, arbitration clause, deposit %, any SOW amendment. It writes everything into case state.

Step 2: Drop in the invoice
The operator attaches the invoice PDF. The system auto-matches it to the contract, computes the outstanding balance (invoice amount − deposit paid), and locks in the due date. Case state updates.

Step 3: Build the calendar reminders
Based on the due date, the system schedules Day −7 pre-notice, Day 7 friendly reminder, Day 30 firmer, Day 55 voice call, Day 60+ Escalator, Day 65 AAA dynamic join. Each node is cron-triggered, so the operator doesn't have to schedule anything manually.

Step 4: Investigator runs customer background
Investigator (running Featherless's Qwen) pulls the customer's past 6 months of records: did they pay on time? Were there disputes? What's their reply pattern? It then tags a behavioral label: reliable payer / slow payer / dispute history / silent after 30d / etc.
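The reminder cadence from Step 3 can be sketched as a list of day offsets applied to the due date. This is an illustration under assumptions, not the system's scheduler: the names are hypothetical, "Day 60+" is pinned to Day 60 as the earliest date, and the cron wiring is left out.

```python
from datetime import date, timedelta

# Offsets in days relative to the invoice due date, as listed in Step 3.
# "escalator" is "Day 60+" in the post; 60 is used here as the earliest day.
CADENCE = [
    (-7, "pre_notice"),
    (7, "friendly_reminder"),
    (30, "firmer_reminder"),
    (55, "voice_call"),
    (60, "escalator"),
    (65, "aaa_dynamic_join"),
]


def build_schedule(due_date: date) -> list[tuple[date, str]]:
    """Return (calendar date, stage name) pairs for one case."""
    return [(due_date + timedelta(days=offset), stage) for offset, stage in CADENCE]


# Hypothetical case with a due date of 2026-06-01.
schedule = build_schedule(date(2026, 6, 1))
for when, stage in schedule:
    print(when.isoformat(), stage)
```

Keeping the cadence as data rather than scattered conditionals means a stage can be added or shifted without touching the code that fires it.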
Step 5: Pre-flight decides which path the whole case takes
What Pre-flight is doing here is "overall case routing," not the tone strength of letters (tone is what Tone Coach manages per stage, covered in Step 7). Pre-flight combines outstanding + behavior label + governing law and routes the entire case to one of three paths: Lite, In-Spot, or Attorney-Recommended.

Put another way, Pre-flight is asking "do I want to walk this case through the full set?" Tone Coach is asking, later, "does this letter read right?" Two agents managing completely different layers.

Step 6: Day −7 pre-notice
Before the due date, send one friendly heads-up: "Just wanted to flag this invoice is due in 7 days. Let me know if there's anything blocking it." Diplomat drafts, Tone Coach reviews, Concierge pushes to Slack, operator presses APPROVE, the email ships.

Step 7: Day 7 friendly / Day 30 firmer (THIS is where tone strength gets routed)
The due date passes with no payment. Day 7 sends a friendly reminder, Day 30 sends a firmer one. THIS is the "tone strength" routing: it escalates with the timeline stage, not with Pre-flight. Tone Coach uses Claude Haiku on every letter: "kindly remind" is too soft and gets bounced back for rewrites; "we will sue you" is too hard and gets blocked as an FDCPA violation. Every letter carries a paylink.

Step 8: Day 55 Sarah dials
At Day 55 with no response, Voice Agent goes through ElevenLabs ConvAI + Twilio to dial. First it checks the customer's local time: is it in the TCPA-allowed window (Tue/Wed/Thu 10-11 or 14-15)? Is it a US federal holiday? If it's outside the window, the call is deferred until the next valid slot. You also don't have to stay up to call on US time anymore.

Sarah's possible call branches:

Step 9: Day 60+ Escalator + Day 65 AAA dynamic join
Still no response, so Escalator drafts a demand letter (running Featherless's Llama 70B).
Day 65, through Band Platform's tools.lookup_peers + tools.add_participant, AAA Specialist is pulled into the Room. This is the first time the lawyer agent appears in the conversation. AAA uses Claude Sonnet to polish the letter's wording.

Step 10: Paylink + USDC settlement
Every outbound letter carries a paylink. D-019's key invariant is right here: LLM never writes bank account numbers. The paylink is a hard-coded URL; the customer clicks into a Next.js checkout page that reads real account / SWIFT / Stripe / USDC wallet info directly from payment_methods.json. LLM never touches banking info.

The reason I included USDC payment here is that I see AI handling finance as an inevitable trend. Since this system is AI, having USDC payment configured is necessary. In the future this could evolve into every AI with X402 or some identity-verification mechanism handling agent-to-agent payments themselves.

Customer picks USDC → transfers from their wallet to Recoverflow's Circle ARC wallet → receipt poller scans every 10 seconds, sees the new transaction, triggers the Payment agent for reconciliation.

Step 11: 4 settlement states auto-judged
The payment is compared against the outstanding balance and lands in one of four states: partial, full, overpaid, or unmatched.

Step 12: Notify + audit
Whatever the state, the system writes to the audit trail (audit_trail.jsonl, append-only, 21,000+ rows), Resend emails the human a receipt notice (bilingual ZH/EN), and if it's a "FULL" the customer also gets a thank-you confirmation.

End to end, the human pressed APPROVE / REJECT in Slack maybe a few times, but the 9 agents processed everything from contract parsing all the way to USDC arrival. Every step left an audit trail, every outbound went through human review, every edge case got caught by an invariant. That's what "weak doers, strong guardians" actually looks like.

After the three-layer architecture was set, what tormented me most was the Voice Agent.
It evolved three times in 7 days (V1 one-way notification → V2 two-way dialogue → V3 with the Phase 3b branch), each version broken in a different way. V2 was the one where I called myself, said "I'll email you in 2 days" to Sarah, and she treated it as evasion and escalated. That moment was when I realized the "ideal conversation" you write in a prompt is very different from real human conversation.

The full bug story and the birth of Phase 3b I wrote up in another post, 《AI Called Me Back — Recoverflow Dev Diary Day 2: Two Hours with the Voice Agent》 https://judyailab.com/en/posts/2026-06-16-recoverflow-day2-voice-agent/ . Go read that one if you want the full play-by-play.

What I want to add here is that what Voice Agent really made me realize is that ideal scenarios always lose to real scenarios. That insight led me directly to the next section.

After Phase 3b was added, I asked J: could there be other edge cases also being misread as "evasion" and escalated? We reviewed the voice agent's whole possible conversation space and listed 26 edge cases. Of these, 4 hidden ones are, I think, the keys to whether this system "destroys the customer relationship or not":

Buyer pays half, promises "the rest next month." A lot of collection systems don't know how to handle this: is it "done" or "not done"? Our handling: sub-cycle tracking. Diplomat restarts outreach for the remaining balance, and the cadence resets. It maxes out at 2 sub-cycles before escalation. The design inspiration here is from my aunt's experience. She said "a customer who pays half is usually willing but cash-flow stuck. Give them space and you'll actually receive the rest."

Buyer realizes they're talking to a voice agent and gets emotional / starts swearing. Our handling: Tone Coach blocks any FDCPA-violating response immediately. ConvAI's 16-reason escalation enum tags this case as "customer hostile persistent." Concierge pages a human within minutes. Sarah AI will never respond emotionally to the customer.
Buyer mentions self-harm, depression, "I really can't anymore" on the call. Our handling: set_anomaly_halt(case_id, "welfare") fires BEFORE Concierge is paged. The case freezes. The 988 Suicide & Crisis Lifeline gets mentioned in the conversation. 3 dedicated unit tests pin this ordering. Life before debt. Encoded in code, not in prompt.

Buyer mentions Chapter 11 (US bankruptcy protection). Our handling: Sarah gently confirms, then escalates with reason "customer mentions bankruptcy." This triggers legal claim-filing deadline awareness. Concierge pulls human counsel in.

These 4 cases weren't on my mind when I was writing the prompt. They came from imagining what I would say if I were the real person picking up that phone. This is also one of my strengths: putting myself in someone else's shoes and rehearsing all the things that could happen. Mom just passed, business folded, customer ghosted, cash flow snapped. What does a person say in those moments? What's an OK response, and what's a response that breaks them?

If this AI steps on a landmine in any of those 4 moments, my mom will not recommend it to a friend.

Writing this far, I've compiled what I learned in those 7 days.

One: "What NOT to do" is much harder to think through than "what to do." Days 9-12 we spent 4 days listing "forbidden / exception / red line." Days 13-15, only 3 days of code, because once the previous layer was clear, wiring up was fast. If you find yourself spending too much time on "what to do," it's usually because the layer above isn't sharp enough.

Two: Weak doers, strong guardians is the spine of this design. The most common failure mode in multi-agent systems is the "super-agent": shove every judgment into one prompt and pray. We went the opposite way: doers ordinary, guardians ironclad. Once the layering holds, you stop chasing every doer prompt because the guardians will catch their mistakes.

Three: Iron rules must be "unbreakable hard rules," not "best practice suggestions."
988 welfare freeze, D-019 LLM not allowed to write bank numbers, Tone Coach FDCPA blocking: these aren't suggestions, they're hard-wired. "Best practices" get lost to "let me skip this just for today"; iron rules don't.

Four: Real scenarios always beat ideal scenarios. The Voice V2 breakage was found by me actually calling myself. The 4 hidden cases came from me imagining being the human picking up that phone. No simulate-conversation API, no mock test, no dry-run can substitute for actually walking through it once, in someone's shoes.

Five: Design time isn't wasted time. I'd assumed a 7-day hackathon "should" be 6 days of code + 1 day of submission. The actual ratio was 4 days design + 3 days code + 3 days polish. Looking back, that ratio was right: by day 4 of design, I already knew exactly what every line of the next 3 days of code would look like.

The moment we submitted to the hackathon, the whole system had 9 agents, 21,000+ rows of audit trail, 482 tests all green, 5 real ARC-TESTNET USDC settlements, and all 7 sponsors actually running.

But looking back, what I'm actually proud of isn't those numbers. It's how this system reacts in those 4 hidden cases. Customer emotionally collapses, mentions 988, mentions bankruptcy, pays half and promises the rest next month. I hope my mom would recommend this system to those same-trade peers of hers who are also stuck. If she actually does that one day, then these 7 days were worth it.

There's still a lot we haven't written into the system: the pricing model is empty, parts of the architecture are still incomplete, there are corrections still pending... But the iron rules are standing. The rest can take its time.

Design time isn't wasted time.

Originally published at Judy AI Lab.