{"slug": "why-this-system-looks-the-way-it-does-recoverflow-s-6-day-design-journey", "title": "Why This System Looks the Way It Does — Recoverflow's 6-Day Design Journey", "summary": "Recoverflow spent four days on architecture design before writing any code for its AI collections system, prioritizing what the system should not do over what it can do. The team shifted the design center from gross invoice amount to actual outstanding balance, a decision that propagates through five agents and defines the sweet spot for small business owners. The success metric is customer relationship preservation and owner willingness to recommend, not just money recovered.", "body_md": "The day after submitting to the hackathon, I finally had a moment to look back at the whole system.\n\nTurned out something surprising.\n\nIn these 7 days, **we spent more time thinking about architecture than writing code — by a factor of two**.\n\n```\nJun 9-11   PRE-EVENT   ← 4 days, pure design\nJun 12     KICKOFF\nJun 13-15  BUILD       ← 3 days, writing code\nJun 16-18  POLISH      ← 3 days, demo + submission\n```\n\nDays 9-12, those 4 solid days, I was doing this:\n\n**By Day 13 when we started writing code, the system's spine was already standing**.\n\nWriting code turned out to be the least painful part — because in the previous 4 days I'd already nailed down \"what NOT to do.\"\n\nThis post is that full design journey. Why I picked these designs, what problems we hit, how we fixed them, what we added.\n\nLong-form, grab water.\n\nBefore I built anything, I asked myself one question.\n\n\"Can I build an AI collections system that small business owners would actively recommend to their friends?\"\n\nNot \"can I build a multi-agent demo,\" not \"can I wire up all 9 sponsors,\" not \"can I ship something submittable in 7 days.\"\n\nThe question was — **can I build something a small business owner would actively recommend to their friends?**\n\nThe question is provocative. 
Because AI collections, as a category, is reputation poison in the eyes of most small business owners. Nobody recommends \"I used an AI robot to bother my customers.\"\n\nBut I thought it could work — if the AI genuinely respected boundaries in the right places, didn't step on landmines, and actually solved a pain the owner couldn't solve alone.\n\nThis question had two design implications:\n\n**First**: the entire system's design center of gravity shifted from \"what can it do\" to \"**what should it NOT do**.\"\n\nWhat cases not to take (D-031 Lite mode for < $3K means we let go and don't chase), what words not to say (D-019 LLM never writes bank numbers), when not to continue (D-035 halts the case on any 988 welfare signal), what hours not to call (TCPA-compliant windows + US federal holidays).\n\nEvery line is a clear \"no.\"\n\n**Second**: the success metric was no longer \"amount of money recovered.\" It was \"customer relationship stays whole + owner feels safe recommending us to friends.\"\n\nThese two metrics sometimes fight.\n\nMoney recovered but the customer felt bullied by a robot — failure.\n\nNo money recovered but the customer said \"they were professional and we had a real conversation\" — success.\n\nOnce you know which side matters more, the whole prompt is written differently.\n\nMy aunt is in finance. 
She told me one thing that made me redesign the whole system.\n\nShe said — most cross-border collection disputes, **the amount isn't actually that big**.\n\nA lot of customers say \"I already paid the 50% deposit,\" \"the rest I'll pay 3 months after goods arrive,\" \"we amended the SOW midway, can you confirm?\"\n\nFrom the gross invoice view — it's a $47K invoice.\n\nFrom the actual outstanding view — it might be $23K.\n\n**A US lawyer opens the file, sees $47K, and says \"too small, cross-border litigation costs over $30K, the math doesn't work for you.\"**\n\nBut what we're chasing is $23K — squarely in our sweet spot ($3K-$40K).\n\nThat moment I realized — **the entire industry is using the wrong number to define the sweet spot**.\n\nI made this D-038 — we chase outstanding, not gross.\n\nThis one decision propagates through 5 agents:\n\n```\nPre-flight     → Routes by \"outstanding\" (Lite / In-Spot / Attorney three paths)\nReconciler     → Settles by \"outstanding\" (partial / full / overpaid / unmatched)\nDiplomat       → Email body cites \"outstanding,\" not \"gross\"\nEscalator      → Demand letter draft cites \"outstanding\"\nAAA Specialist → Formal attorney letter cites \"outstanding\"\n```\n\nEvery routing boundary has automated checks pinning it, so nobody accidentally lets \"gross\" leak through.\n\n**It's a one-word difference between \"gross\" and \"outstanding\" — and that's the difference between a system that survives in the mid-market and one that doesn't**.\n\nThis is a tool built specifically to make life easier for finance people chasing tails.\n\nMost multi-agent systems do this — write one giant prompt, hand it to a \"super-agent,\" and pray it doesn't make a mistake.\n\nWe went the opposite way. This is something I've believed about AI from the start — specialist agents doing one job have a lower error rate than generalists.\n\n**Doers can be ordinary. 
Guardians must be rock-solid.**\n\nWhy?\n\nIn collections, one wrong move — an FDCPA-violating line, an AI-hallucinated bank account, an unconsented escalation — permanently destroys customer trust.\n\nTrust is this industry's asset. Guardians are there to protect it. And beyond the AI guardians, every outbound action has to go through a human.\n\nThis is the principle I hold to: AI is a tool that helps us. The decision authority should stay with humans.\n\nThe system has three layers:\n\nOne agent: **Pre-flight**.\n\nIt reads the contract and decides which of the three paths the case takes — Lite, In-Spot, or Attorney-Recommended.\n\nThe decision input is D-038 outstanding balance + contract clauses (governing law, arbitration, late fee, deposit %).\n\nGet this wrong once — the whole pipeline runs down the wrong path. So Pre-flight is the gatekeeper — it has to get this right first.\n\n5 agents: Investigator, Diplomat, Voice, Payment, Escalator.\n\nTheir work is \"ordinary\" — write emails, look up data, dial phones, run flow.\n\nEach agent does one thing, then waits for the Guardians to review. Doers themselves don't send any email, letter, or phone call out. They produce drafts, hand them to Concierge, and wait for approval.\n\n3 agents: Concierge, Tone Coach, AAA Specialist.\n\n**Concierge** is the \"single outbound choke point\" — every outbound action comes out through here. The operator presses APPROVE / REJECT / REVISE in Slack. The AI itself can't send anything outside on its own.\n\n**Tone Coach** is the \"cross-agent tone gatekeeper\" — it scans Diplomat / Voice drafts in real time and blocks any FDCPA-violating sentence (Claude does the tone judgment).\n\n**AAA Specialist** is the \"Day-65 dynamic join\" — normally not in the Room. When a case ages past 60 days, Escalator uses Band Platform's `tools.lookup_peers + tools.add_participant` to pull AAA in to draft the legal letter. 
The reason this role exists this way: thinking about future expansion, when a contract is parsed and we discover which state's law governs it, we can pull in the specialist agent for that state — extending to non-US jurisdictions too.\n\n**The three-layer design philosophy: when Doers make mistakes, Guardians catch them; when Guardians make mistakes, the whole system stops**.\n\nThis sounds paranoid, but collections is exactly that fragile. One wrong word and you've burned it — no single agent can be allowed to talk to the customer alone.\n\nAfter writing the three-layer architecture, I read it back myself and felt it was still abstract. Let me run one real case through the whole thing — that way you see how the 9 agents hand off.\n\n**Step 1 — Drop in the contract**\n\nThe operator drops the contract PDF into the system. Pre-flight catches it, parses with Gemini 2.5 Flash-Lite — governing law (Texas / California / New York), late fee clause, arbitration clause, deposit %, any SOW amendment. Writes everything into case state.\n\n**Step 2 — Drop in the Invoice**\n\nThe operator attaches the invoice PDF. The system auto-matches it to the contract, computes outstanding balance (invoice amount − deposit paid), and locks in the due date. Case state updates.\n\n**Step 3 — Build the calendar reminders**\n\nBased on the due date, the system schedules Day −7 pre-notice, Day 7 friendly reminder, Day 30 firmer, Day 55 voice call, Day 60+ Escalator, Day 65 AAA dynamic join. Each node is cron-triggered — the operator doesn't have to schedule anything manually.\n\n**Step 4 — Investigator runs customer background**\n\nInvestigator (running Featherless's Qwen) pulls the customer's past 6 months of records — did they pay on time? Were there disputes? What's their reply pattern? 
It then tags the customer with a behavioral label: reliable_payer / slow_payer / dispute_history / silent_after_30d / etc.\n\n**Step 5 — Pre-flight decides which path the whole case takes**\n\nWhat Pre-flight is doing here is \"overall case routing\" — **not** the tone strength of letters (tone is what Tone Coach manages per stage, covered in Step 7).\n\nPre-flight combines outstanding + behavior label + governing law and routes the entire case to one of three paths:\n\nPut another way, Pre-flight is asking \"do I want to walk this case through the full set?\" Tone Coach is asking, later, \"does this letter read right?\" Two agents managing completely different layers.\n\n**Step 6 — Day −7 pre-notice**\n\nBefore the due date, send one friendly heads-up: \"Just wanted to flag this invoice is due in 7 days. Let me know if there's anything blocking it.\" Diplomat drafts, Tone Coach reviews, Concierge pushes to Slack, operator presses APPROVE, the email ships.\n\n**Step 7 — Day 7 friendly / Day 30 firmer (THIS is where tone strength gets routed)**\n\nDue date passes with no payment — Day 7 sends a friendly reminder, Day 30 sends a firmer one. **THIS is the \"tone strength\" routing — it escalates with the timeline stage, not with Pre-flight**. Tone Coach uses Claude Haiku on every letter — \"kindly remind\" is too soft and gets bounced back for rewrites; \"we will sue you\" is too hard and gets blocked as an FDCPA violation. Every letter carries a paylink.\n\n**Step 8 — Day 55 Sarah dials**\n\nAt Day 55 with no response, the Voice Agent (Sarah) goes through ElevenLabs ConvAI + Twilio to dial. First it checks the customer's local time — is it in the TCPA-allowed window (Tue/Wed/Thu 10-11 or 14-15)? Is it a US federal holiday? If it's outside the window, the call is deferred until the next valid slot. (You also don't have to stay up to call US time anymore!)\n\nSarah's possible call branches:\n\n**Step 9 — Day 60+ Escalator + Day 65 AAA dynamic join**\n\nStill no response — Escalator drafts a demand letter (running Featherless's Llama 70B). 
Day 65, through Band Platform's `tools.lookup_peers + tools.add_participant`, AAA Specialist is pulled into the Room — this is the first time the lawyer agent appears in the conversation. AAA uses Claude Sonnet to polish the letter's wording.\n\n**Step 10 — Paylink + USDC settlement**\n\nEvery outbound letter carries a paylink. D-019's key invariant is right here — **LLM never writes bank account numbers**. The paylink is a hard-coded URL; the customer clicks through to a Next.js checkout page that reads real account / SWIFT / Stripe / USDC wallet info directly from `payment_methods.json`. The LLM never touches banking info.\n\nThe reason I included USDC payment here is — I see AI handling finance as an inevitable trend. Since this system is AI, having USDC payment configured is necessary. In the future this could evolve into every AI with X402 or some identity-verification mechanism handling agent-to-agent payments themselves.\n\nCustomer picks USDC → Transfers from their wallet to Recoverflow's Circle ARC wallet → Receipt poller scans every 10 seconds, sees the new transaction, triggers the Payment agent for reconciliation.\n\n**Step 11 — 4 settlement states auto-judged**\n\nCompare against outstanding balance:\n\n**Step 12 — Notify + audit**\n\nWhatever the state, the system writes to the audit trail (`audit_trail.jsonl`, append-only, 21,000+ rows), Resend emails the human a receipt notice (bilingual ZH/EN), and if it's a \"FULL\" the customer also gets a thank-you confirmation.\n\nEnd to end, the human pressed APPROVE / REJECT in Slack maybe a few times, but the 9 agents processed everything from contract parsing all the way to USDC arrival.\n\nEvery step left an audit trail, every outbound went through human review, every edge case got caught by an invariant.\n\nThat's what \"weak doers, strong guardians\" actually looks like.\n\nAfter the three-layer architecture was set, what tormented me most was the Voice Agent. 
It evolved three times in 7 days — V1 one-way notification → V2 two-way dialogue → V3 with the Phase 3b branch — each version broken in a different way.\n\nV2 was the one where I called myself, said \"I'll email you in 2 days\" to Sarah, and she treated it as evasion and escalated. That moment was when I realized — the \"ideal conversation\" you write in a prompt is very different from real human conversation.\n\nThe full bug story and the birth of Phase 3b I wrote up in another post — [《AI Called Me Back — Recoverflow Dev Diary Day 2: Two Hours with the Voice Agent》](https://judyailab.com/en/posts/2026-06-16-recoverflow-day2-voice-agent/) — go read that one if you want the full play-by-play.\n\nWhat I want to add here is — what Voice Agent really made me realize is that **ideal scenarios always lose to real scenarios**. That insight led me directly to the next section.\n\nAfter Phase 3b was added, I asked J: could there be other edge cases also being misread as \"evasion\" and escalated?\n\nWe reviewed the whole voice agent's possible conversation space and listed 26 edge cases.\n\nOf these, 4 hidden ones I think are the keys to whether this system \"destroys the customer relationship or not\":\n\nBuyer pays half, promises \"the rest next month.\"\n\nA lot of collection systems don't know how to handle this — is it \"done\" or \"not done\"?\n\nOur handling: sub-cycle tracking. Diplomat restarts outreach for the remaining balance, cadence resets. Maxes out at 2 sub-cycles before escalation.\n\nThe design inspiration here is from my aunt's experience. She said \"a customer who pays half is usually willing but cash-flow stuck — give them space and you'll actually receive the rest.\"\n\nBuyer realizes they're talking to a voice agent and gets emotional / starts swearing.\n\nOur handling: Tone Coach blocks any FDCPA-violating response immediately. 
ConvAI's 16-reason escalation enum tags this case as \"customer_hostile_persistent.\" Concierge pages a human within minutes.\n\n**Sarah (AI) will never respond emotionally to the customer**.\n\nBuyer mentions self-harm, depression, \"I really can't anymore\" on the call.\n\nOur handling: `set_anomaly_halt(case_id, \"welfare\")` fires BEFORE Concierge is paged. The case freezes. The 988 Suicide & Crisis Lifeline gets mentioned in the conversation. 3 dedicated unit tests pin this ordering.\n\n**Life before debt. Encoded in code, not in prompt**.\n\nBuyer mentions Chapter 11 (US bankruptcy protection).\n\nOur handling: Sarah gently confirms, escalates with reason \"customer_mentions_bankruptcy.\" This triggers legal claim-filing deadline awareness. Concierge pulls human counsel in.\n\nThese 4 cases weren't on my mind when I was writing the prompt.\n\nThey were — **what I imagined I would say if I were the real person picking up that phone**. (This is also one of my strengths — putting myself in someone else's shoes and rehearsing all the things that could happen.)\n\nMom just passed, business folded, customer ghosted, cash flow snapped — what does a person say in those moments? What's an OK response, and what's a response that breaks them?\n\nIf this AI steps on a landmine in any of those 4 moments, my mom will not recommend it to a friend.\n\nWriting this far, I've compiled what I learned in those 7 days.\n\n**One**: \"**What NOT to do**\" is much harder to think through than \"what to do.\" Days 9-12 we spent 4 days listing \"forbidden / exception / red line.\" Days 13-15, only 3 days of code — because once the previous layer was clear, wiring up was fast. If you find yourself spending too much time on \"what to do,\" it's usually because the layer above isn't sharp enough.\n\n**Two**: **Weak doers, strong guardians** is the spine of this design. 
The most common failure mode in multi-agent systems is the \"super-agent\" — shove every judgment into one prompt and pray. We went the opposite way — doers ordinary, guardians ironclad. Once the layering holds, you stop chasing every doer prompt because the guardians will catch their mistakes.\n\n**Three**: **Iron rules must be \"unbreakable hard rules,\" not \"best practice suggestions\"**. 988 welfare freeze, D-019 LLM not allowed to write bank numbers, Tone Coach FDCPA blocking — these aren't suggestions, they're hard-wired. \"Best practices\" get lost to \"let me skip this just for today\"; iron rules don't.\n\n**Four**: **Real scenarios always beat ideal scenarios**. I found out Voice V2 was broken by actually calling myself. The 4 hidden cases came from me imagining being the human picking up that phone. No simulate-conversation API, no mock test, no dry-run can substitute for actually walking through it once, in someone's shoes.\n\n**Five**: **Design time isn't wasted time**. I'd assumed a 7-day hackathon \"should\" be 6 days of code + 1 day of submission. The actual ratio was 4 days design + 3 days code + 3 days polish. Looking back, that ratio was right — by day 4 of design, I already knew exactly what every line of the next 3 days of code would look like.\n\nThe moment we submitted to the hackathon, the whole system had 9 agents, 21,000+ rows of audit trail, 482 tests all green, 5 real ARC-TESTNET USDC settlements, all 7 sponsors actually running.\n\nBut looking back — what I'm actually proud of isn't those numbers. 
It's **how this system reacts in those 4 hidden cases**.\n\nCustomer emotionally collapses, mentions 988, mentions bankruptcy, pays half and promises the rest next month — I hope my mom would recommend this system to those same-trade peers of hers who are also stuck.\n\nIf she actually does that one day, then these 7 days were worth it.\n\nThere's still a lot we haven't written into the system — the pricing model is empty, parts of the architecture are still incomplete, there are corrections still pending...\n\nBut the iron rules are standing. The rest can take its time.\n\n**Design time isn't wasted time**.\n\n*Originally published at Judy AI Lab. Visit for more articles on AI engineering and development.*", "url": "https://wpnews.pro/news/why-this-system-looks-the-way-it-does-recoverflow-s-6-day-design-journey", "canonical_source": "https://dev.to/judy_miranttie/why-this-system-looks-the-way-it-does-recoverflows-6-day-design-journey-5a7o", "published_at": "2026-06-20 01:00:28+00:00", "updated_at": "2026-06-20 01:06:46.716396+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-agents", "ai-products", "ai-startups", "natural-language-processing"], "entities": ["Recoverflow", "TCPA", "LLM", "AAA"], "alternates": {"html": "https://wpnews.pro/news/why-this-system-looks-the-way-it-does-recoverflow-s-6-day-design-journey", "markdown": "https://wpnews.pro/news/why-this-system-looks-the-way-it-does-recoverflow-s-6-day-design-journey.md", "text": "https://wpnews.pro/news/why-this-system-looks-the-way-it-does-recoverflow-s-6-day-design-journey.txt", "jsonld": "https://wpnews.pro/news/why-this-system-looks-the-way-it-does-recoverflow-s-6-day-design-journey.jsonld"}}