{"slug": "i-let-claude-code-run-a-month-of-my-business-books-it-reconciled-200-and-11", "title": "I Let Claude Code Run a Month of My Business Books. It Reconciled 200 Transactions and Miscategorized 11.", "summary": "A developer handed over a month's worth of business bookkeeping to Claude Code, which reconciled 200 transactions and miscategorized 11, achieving a 94.5% accuracy rate. The agent built a reusable pipeline that handled bulk categorization well but made systematic errors on ambiguous intent, missing tax rules, and edge cases requiring human judgment.", "body_md": "I run a few small businesses, which means once a month I sit down with a bank export and an accounting platform and turn a pile of transactions into something a tax office will accept. It is the part of self-employment nobody warns you about. The coding is fun. The books are not.\n\nSo this month I handed the books to Claude Code and watched.\n\nThe result, up front: it reconciled 200 transactions and miscategorized 11. That is a 94.5% hit rate, which sounds great until you remember that the 11 wrong ones were the difference between a clean filing and a letter from the tax office. This is the story of where the agent shined, where it quietly lied to me, and the exact line I now draw between what it runs and what I sign.\n\nMy first instinct was the obvious one. Paste a CSV into a chat and say \"categorize these.\" I did that for about ten rows before I stopped, because it was the wrong shape of work.\n\nA throwaway categorization is a thing you ask for once and then have to babysit forever. What I actually wanted was a process I could re-run next month with a different CSV and trust a little more each time. So I told Claude Code to build the bookkeeping pipeline, not to do the bookkeeping.\n\nThat distinction matters more than it sounds. When you ask an agent to *build the thing that does the work*, you get a script you can read, a set of rules you can correct, and an audit trail you can point at later. When you ask it to *do the work*, you get an answer and a shrug. One is an asset. The other is a chore you now share with a robot.\n\nThe pipeline it wrote was unglamorous and exactly right:\n\nStep 4 is the one that saved me. More on that in a second.\n\nThe reconciliation itself was genuinely good. Matching 200 bank rows against receipts is the kind of tedious pattern-matching that humans are bad at precisely because it is boring. You zone out around row 40 and start rubber-stamping. The agent does not zone out.\n\nIt correctly handled the cases I expected to trip it: a subscription that renewed on a slightly different day, a refund that showed up as a negative line, a vendor whose name on the bank statement bore no resemblance to the name on the invoice. For the bulk of the month, \"AWS-style charge goes to infrastructure, coffee receipt goes to meetings\" was handled without me touching anything.\n\nThis is not a niche experience anymore. A January 2026 Deloitte study found 63% of finance organizations have fully deployed AI somewhere in their operations, and the pattern that keeps winning is the boring one: let the model categorize at volume, then have a human review the output. Machines do the reading, people do the signing. I arrived at the same split independently, which I choose to read as validation rather than as me being unoriginal.\n\nHere is the uncomfortable part. The 11 mistakes were not random noise. They clustered in three places, and all three are places where the agent had no way to know what it did not know.\n\n**Intent it couldn't see (5 rows).** A laptop I bought is a business expense if I use it for work and a personal purchase if I do not. The receipt looks identical either way. The agent categorized every device as a business asset because that is the statistically likely call, and for two of them it was wrong. No amount of context in the CSV would have told it otherwise. That information lives in my head.\n\n**Rules it didn't have (4 rows).** Tax categories are not universal logic; they are local law. A meal with a client and a meal alone are deductible to different degrees depending on rules the agent was never given. It made a reasonable, confident, wrong guess. Confidence is the dangerous part. A wrong answer delivered with a hedge is easy to catch. A wrong answer delivered cleanly slides right through.\n\n**Edge cases that needed a human (2 rows).** A single payment that covered two unrelated things, split across categories. The agent picked one. A person would have asked.\n\nNotice what is not on this list: arithmetic. It never added wrong, never lost a row, never double-counted. The failures were all judgment, not math. Which is the whole point. The agent is a tireless clerk, not an accountant, and the moment I treated it like an accountant was the moment it would have cost me money.\n\nAfter this month I have a rule, and it is not \"trust the agent\" or \"don't trust the agent.\" It is narrower than that.\n\nThe agent runs everything that is high-volume and low-judgment: pulling data, matching receipts, drafting categories, flagging doubt. I personally sign off on everything that is low-volume and high-stakes: every flagged row, a spot-check of the confident ones, and every category that touches a tax outcome.\n\nThe flagging mechanism is what makes this tractable. Because I told the agent to surface its own uncertainty rather than bury it, my review was not \"re-check 200 rows.\" It was \"check the 18 it wasn't sure about, then sample the rest.\" Seven of the 11 errors were already sitting in its own flagged pile. The other four I caught on the sample. That is the difference between an agent that helps and an agent that just moves the work somewhere you can't see it.\n\nOne thing I will say loudly, because the internet is full of people who won't: the screenshot-friendly version of this post would be \"AI did my taxes in an afternoon.\" That version is a lie of omission. The honest version is \"AI did 94.5% of my taxes and I did the 5.5% that could get me audited.\" The second one is less viral and considerably more useful.\n\nIf you want to try this on your own books, three things carried the result:\n\nI spent years learning that the bottleneck in any process is rarely the part everyone optimizes. With bookkeeping, everyone wants to automate the data entry. The data entry was never the hard part. The hard part is the eleven rows that need a human who knows what the business actually did, and an agent that is honest enough to point at them.\n\nI write more about this kind of human-and-agent division of labor in [Claude Code Mastery](https://kenimoto.dev/books/claude-code-mastery?utm_source=devto&utm_medium=article&utm_campaign=claude-code-bookkeeping), where a full chapter goes into using coding agents for financial and business work without handing over the parts that bite.\n\nNext month the agent runs the books again. I will still read every flagged row. Let's keep it interesting.", "url": "https://wpnews.pro/news/i-let-claude-code-run-a-month-of-my-business-books-it-reconciled-200-and-11", "canonical_source": "https://dev.to/kenimo49/i-let-claude-code-run-a-month-of-my-business-books-it-reconciled-200-transactions-and-513d", "published_at": "2026-06-25 13:00:00+00:00", "updated_at": "2026-06-25 13:14:23.141057+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-agents", "large-language-models"], "entities": ["Claude Code", "Deloitte"], "alternates": {"html": "https://wpnews.pro/news/i-let-claude-code-run-a-month-of-my-business-books-it-reconciled-200-and-11", "markdown": "https://wpnews.pro/news/i-let-claude-code-run-a-month-of-my-business-books-it-reconciled-200-and-11.md", "text": "https://wpnews.pro/news/i-let-claude-code-run-a-month-of-my-business-books-it-reconciled-200-and-11.txt", "jsonld": "https://wpnews.pro/news/i-let-claude-code-run-a-month-of-my-business-books-it-reconciled-200-and-11.jsonld"}}