Self-improving agents are AI's next act

OpenAI and Thrive Holdings built a tax-preparation AI agent that cut document processing time by a third while achieving up to 97% accuracy. The system then self-improved by using practitioner corrections to iteratively refine its Codex-based harness: within six weeks, the share of returns with at least 75% field completion rose from 25% to 86%, and eventually 90% of returns reached 100% correct field completion. The three-part feedback loop is designed to be generalizable to other industries, including accounting and audit workflows.

Thrive Holdings used AI to cut tax document prep time by a third while maintaining up to 97% accuracy. But the real story is what happened next: The system kept improving itself.

OpenAI forward-deployed engineers (FDEs) worked with Thrive Holdings’ engineers to build Tax AI, an agent that helps automate the process of preparing 1040 and 1041 tax returns. Initially, Tax AI was built for simpler work, such as ingesting W-2s and 1099s, but as the tax season went on, it was able to self-improve and handle more complex tasks, Arthur Fernandes Araujo, an OpenAI FDE lead on the project, told The Deep View.

"If you think about AI as an ever more capable coworker that is augmenting you, you want a coworker that is capable of, given some input on what it did incorrectly, having the memory of it and not repeating the error," Araujo said.

OpenAI quantified those improvements by measuring the percentage of tax returns with accurate field completion, meaning all boxes on the document were filled out correctly. At launch, only a quarter of tax returns had 75% or higher field completion. Within six weeks, 86% of returns hit that mark. Eventually, the system improved even further, with 90% of returns hitting 100% correct field completion, OpenAI said in a release.

So how does it work? To get better, the system uses a three-part loop:

- First, a practitioner reveals errors and steers what the product learns.
- Then the system tracks the full process beyond inputs and outputs to convert the corrections into evals.
- Finally, a Codex-driven improvement loop allows the system to build on the evals.

For instance, in the blog post, OpenAI outlined a rental property income example in which its Tax AI system must extract Schedule E fields from messy source materials, such as handwritten notes, emails, and spreadsheets, then map them to a tax engine for practitioner review.
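To make the field-completion metric and the correction-to-eval step concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not OpenAI's or Thrive Holdings' implementation: the `Correction` schema, the field names, the function names, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """One practitioner fix to a single field (hypothetical schema)."""
    return_id: str
    field: str
    agent_value: str
    corrected_value: str


def field_completion(total_fields: int, corrections: list[Correction]) -> float:
    """Fraction of fields on one return that needed no correction."""
    wrong = len({c.field for c in corrections})
    return (total_fields - wrong) / total_fields


def share_meeting(rates: list[float], threshold: float) -> float:
    """Share of returns whose field completion is at or above the threshold."""
    return sum(r >= threshold for r in rates) / len(rates)


def to_eval_cases(corrections: list[Correction]) -> list[dict]:
    """Turn each correction into a regression check on the corrected value."""
    return [
        {"return_id": c.return_id, "field": c.field, "expected": c.corrected_value}
        for c in corrections
    ]


# Made-up sample data: three returns with eight fields each.
corrections = [
    Correction("r1", "schedule_e.rents_received", "1200", "12000"),
    Correction("r2", "schedule_e.taxes", "", "850"),
    Correction("r2", "schedule_e.insurance", "40", "400"),
]
fields_per_return = {"r1": 8, "r2": 8, "r3": 8}

# Per-return field completion, computed from the corrections logged against it.
rates = {
    rid: field_completion(n, [c for c in corrections if c.return_id == rid])
    for rid, n in fields_per_return.items()
}

print(rates)
print(share_meeting(list(rates.values()), 0.75))  # share at 75% or higher
print(len(to_eval_cases(corrections)))            # one eval case per correction
```

In this toy data, a return counts toward the 75% mark when at least three-quarters of its fields needed no practitioner fix, and every logged correction becomes a check that a later version of the harness can be run against.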
When errors are found, practitioner corrections are captured as structured data, grouped into recurring failure patterns, and fed to Codex as scoped engineering tasks.

In this project, what’s "self-improving" is the harness around the model, not the underlying model itself. The tax agent is built on OpenAI’s Codex harness, and it’s the harness that’s being iteratively improved based on practitioner feedback. Because the Codex harness is open source, other developers can also build similar self-improving agents on top of it.

Also notable is that the same three-part design can apply to workflows in other domains and industries and is "quite generalizable," John de Wasseige, one of the FDE leads on this project, told The Deep View. For example, OpenAI says it is working with Thrive Holdings to apply it to accounting workflows, such as booking, audit and operational workflows.

Our Deeper View

Ultimately, one of the biggest challenges with any AI model is accuracy, a concern that becomes even more critical in enterprise use cases. For these models to earn genuine trust, they need to perform as closely as possible to a human expert. One hallmark of real-world training is the ability to correct someone's behavior so they don't repeat the same mistake. The Tax AI system appears to mimic that dynamic, and its potential to extend that feedback loop to other forms of knowledge work makes it a noteworthy advancement in the space. OpenAI said the broader purpose of publishing the article was to provide developers with a blueprint for building on these ideas and a constructive mindset to push the technology forward.