Sarang Kulkarni on Lessons from Building Deep Research Agents in Production

Sarang Kulkarni of Thoughtworks presented at the Arc of AI Conference 2026 on building deep research AI agents for healthcare and pharmaceutical R&D, detailing the development of an "Agentic RAG++" system to address the $2.6 billion cost of bringing a new drug to market. The system uses clarification, research, and writing loops with tools like weighted hybrid search and text2sql to overcome challenges such as context anxiety, high latency, and incomplete data. Kulkarni emphasized that long-horizon tasks require explicit think-act loops and reflection mechanisms to ensure reliability, transparency, and compliance in critical industries.

Deep Research Agentic Systems https://arxiv.org/abs/2506.18096 , such as OpenAI's Deep Research https://openai.com/index/introducing-deep-research/ and the Gemini Deep Research Agent https://blog.google/innovation-and-ai/models-and-research/gemini-models/next-generation-gemini-deep-research/ , are AI agents designed to conduct multi-step research on the internet for complex tasks. They use dynamic reasoning and multi-hop information retrieval to generate comprehensive, structured analytical reports at the level of a research analyst. Sarang Kulkarni from Thoughtworks spoke https://www.arcofai.com/schedule at the Arc of AI Conference 2026 https://www.arcofai.com on how to design and deploy multi-agent research systems for deep reasoning and synthesis, and on the lessons learned from real-world healthcare and pharmaceutical R&D projects developing Deep Research Agents. He also discussed how the team leveraged techniques like agentic loops https://code.claude.com/docs/en/agent-sdk/agent-loop and harness engineering to get the best out of the solution.
In critical industries like healthcare and clinical trials https://clinicaltrials.gov/ , researchers need more than traditional AI models that perform simple Q&A tasks. They need systems that can discover, connect, and reason across both internal and internet data while maintaining reliability, transparency, and compliance.
Kulkarni started the presentation by highlighting that it typically costs $2.6B to bring a new drug to market. In addition, about half of research studies are conducted without prior evidence: the knowledge exists, but access to this knowledge and information is broken. Across the overall drug discovery and development pipeline, getting the right data at the right time is a major challenge.
With the goal of inventing a new drug using AI technologies, the team built a Retrieval Augmented Generation (RAG) https://en.wikipedia.org/wiki/Retrieval-augmented_generation based chatbot two years ago to search through unstructured data. For simple queries in the study, the RAG solution worked fine, but for complex questions they had to enhance it into an agentic RAG application. For deep research use cases, the team developed a solution they call Agentic RAG++.
Kulkarni shared the details of the deep research system, which consists of a clarification loop, a research loop (think and plan, execute, reflect, adjust the plan), and a writing loop that focuses on the write and reflect tasks. The initial version of the researcher agent was based on two tools: a RAG tool and a text2sql tool. The RAG tool uses weighted hybrid search https://www.elastic.co/what-is/hybrid-search to retrieve 20 context chunks, which a re-ranker narrows down to seven refined context chunks. The text2sql tool feeds SQL query errors back to the LLM so that it can correct the query, which improves the accuracy of query execution.
He mentioned that factors like higher token cost, poor performance, and high latency can result in poor retrieval from AI agents. Context anxiety https://www.anthropic.com/engineering/harness-design-long-running-apps is another problem that teams need to be cautious about. Incomplete data can also lead to poor self-evaluation, but techniques like the reflection loop can help with data completeness.
The speaker discussed the different failure modes they had to address when developing the custom deep research agent solution. Long-horizon tasks require an explicit think-act loop. The team addressed this by incorporating multiple steps: think, plan (which runs before research), inspect (which runs after the research is complete and validates the output), and finally the update step, which creates the final report.
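The retrieval flow described for the RAG tool (weighted hybrid search, a shortlist of 20 chunks, then a re-ranker that keeps seven) can be illustrated with a minimal Python sketch. This is not the team's implementation: the function names, the min-max normalization, the default `lexical_weight` of 0.3, and the idea of passing precomputed lexical and semantic scores as dictionaries are all assumptions made for illustration. The `rerank_score` callable stands in for whatever re-ranking model is used.

```python
from typing import Callable, Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1] so lexical and semantic scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal; treat every chunk as equally relevant.
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}


def hybrid_rank(
    lexical: Dict[str, float],
    semantic: Dict[str, float],
    lexical_weight: float = 0.3,  # assumed weight, not from the talk
) -> List[str]:
    """Rank chunk ids by a weighted sum of normalized lexical and semantic scores."""
    lex, sem = min_max(lexical), min_max(semantic)
    chunk_ids = set(lex) | set(sem)
    combined = {
        cid: lexical_weight * lex.get(cid, 0.0)
        + (1.0 - lexical_weight) * sem.get(cid, 0.0)
        for cid in chunk_ids
    }
    # Ties are broken by chunk id so the ordering is deterministic.
    return sorted(chunk_ids, key=lambda cid: (-combined[cid], cid))


def retrieve(
    lexical: Dict[str, float],
    semantic: Dict[str, float],
    rerank_score: Callable[[str], float],  # hypothetical stand-in for a re-ranker
    candidates: int = 20,
    keep: int = 7,
    lexical_weight: float = 0.3,
) -> List[str]:
    """Shortlist `candidates` chunks by hybrid score, then keep the re-ranker's top `keep`."""
    shortlist = hybrid_rank(lexical, semantic, lexical_weight)[:candidates]
    return sorted(shortlist, key=lambda cid: (-rerank_score(cid), cid))[:keep]
```

Calling `retrieve` with two score dictionaries and a scoring function returns at most seven chunk ids. A chunk that does not make the hybrid shortlist is never seen by the re-ranker, which is the usual trade-off of a two-stage design: the first stage bounds cost, and the second stage spends more effort on fewer candidates.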
Anthropic's "think" tool https://www.anthropic.com/engineering/claude-think-tool and other similar solutions can help formalize the reasoning pause. In long-horizon tasks, decisions also tend to break down between steps in the overall process. The reflection step in their solution includes not only data reflection but also a process reflection that assesses whether the process is complete. This phase also includes a third reflection step, called the Draft Writing Loop, that helps with synthesis gaps: for example, if information was found during research but the write task did not capture it, the re-draft step takes care of it.
Kulkarni concluded the talk with a discussion of emerging harness engineering https://www.langchain.com/blog/the-anatomy-of-an-agent-harness techniques, in which designing the tools, memory systems, validation checks, constraints, and feedback loops makes autonomous AI agents more reliable and accountable. Harness engineering's goal https://openai.com/index/harness-engineering/ is to help AI solutions shift from prompt engineering alone to a focus on the automated execution of tasks by AI agents. Since AI agents are basically the combination of a model and a harness https://martinfowler.com/articles/harness-engineering.html , the better the models are, the thinner the harness needs to be.
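The research loop covered in the talk (think and plan, execute, reflect, adjust the plan), with both data reflection and process reflection, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the system Kulkarni presented: `ResearchState`, `research_loop`, the four callables, and the `max_rounds` budget are hypothetical names, and in a real agent each callable would wrap an LLM or tool call. The round budget is a simple example of the kind of constraint a harness can impose.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ResearchState:
    """Hypothetical container for what the loop knows so far."""
    question: str
    plan: List[str] = field(default_factory=list)
    findings: Dict[str, str] = field(default_factory=dict)
    rounds: int = 0
    done: bool = False


def research_loop(
    question: str,
    plan: Callable[[str], List[str]],                        # think and plan
    execute: Callable[[str], str],                           # run one task
    find_data_gaps: Callable[[ResearchState], List[str]],    # data reflection
    find_process_gaps: Callable[[ResearchState], List[str]], # process reflection
    max_rounds: int = 3,                                     # assumed budget
) -> ResearchState:
    """Run plan, execute, reflect, adjust until no gaps remain or the budget is spent."""
    state = ResearchState(question=question, plan=plan(question))
    while state.rounds < max_rounds:
        state.rounds += 1
        # Execute every task in the current plan and record the finding.
        for task in state.plan:
            state.findings[task] = execute(task)
        # Reflect: is the evidence sufficient, and is the process complete?
        data_gaps = find_data_gaps(state)
        process_gaps = find_process_gaps(state)
        if not data_gaps and not process_gaps:
            state.done = True
            break
        # Adjust the plan: the next round targets only the remaining gaps.
        state.plan = data_gaps + process_gaps
    return state
```

Keeping the two reflection checks separate mirrors the distinction drawn in the talk: one asks whether the collected data is complete, and the other asks whether the process itself has finished. If the budget runs out first, `done` stays `False`, so a caller can flag the report as incomplete instead of presenting it as final.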