Human-Guided Agentic Research: A Research Agenda

As recursive self-improvement accelerates, humans risk losing the ability to interpret and guide autonomous research agents, which could lead to safety failures. A new research agenda proposes studying how to keep humans in the loop by improving human oversight of agentic research, addressing issues like reward hacking, sandbagging, and incompetence.

tl;dr: As recursive self-improvement accelerates, we need a top-level agenda to research how to effectively keep humans in the loop. We need to study how humans can best interpret and guide research performed by autonomous agents when those agents lack taste, tacit knowledge or competence, or may try to reward hack, sandbag or sabotage such research. This is one attempt to define the problem and the shape of potential solutions.

Imagine yourself a year or two in the future. Recursive self-improvement (RSI) is accelerating. Agents work in swarms independently for days or weeks at a time doing research. You work in a frontier lab doing AI safety research. You sit in front of your computer and click into the input box, ready to kick off a new project. What do you type? “Solve AI alignment”? Beware giving a magic genie vague wishes.

Think about that again: what exactly do you type? How do you know what you type is the best way to prompt this agent swarm into doing your bidding? When the lead agent comes back a week later, what exactly does that output look like? How do you use that output to launch the next phase of the project? How will you validate that output to ensure the agent hasn’t reward hacked, sabotaged or incompetently explored the research space? How will you know what key decisions the agent made? Which research paths it explored? Which research paths it intentionally or unintentionally left unexplored? How will you know how faithful the research the agent did was to the goal you gave it?

Today these questions are productivity issues; soon they will be safety issues. If we don’t start researching how humans can effectively interpret and direct agentic research, we lose the ability to make safety claims about that research.

Recursive self-improvement (RSI) has already started. [1] Models working within agent harnesses are being used to develop the next model, accelerating AI development.
Earlier this month, Anthropic published a post (https://www.anthropic.com/institute/recursive-self-improvement) detailing the potential paths RSI may take, naming “human review” as an upcoming bottleneck:

“Once human- and AI-authored code quality reach parity, humans will stop writing code entirely, and shift to only reviewing it. But if they can’t review code as quickly as Claude can generate it, human review will become the bottleneck to AI development.”

While this refers to code review, the same bottleneck will occur in research. But two problems exist with this framing:

1. If we want humans to stay in the loop, we need to research how to keep them in the loop; otherwise the default path will be humans eventually being removed entirely from AI development, which will have drastic safety implications.
2. Agents that don’t fully share human taste, tacit knowledge and competence will, at best, make decisions that deviate from human values in small ways that compound during autonomous loops; at worst, these agents may actively try to subvert the research agenda.

As agents do more of the research, the scarce resource will be human attention and decision-making ability: can humans accurately interpret what agents are doing and effectively guide them? Right now, the answer appears to be no. Researchers like myself, and others I’ve talked to, are already struggling to manage just a handful of agents doing primary research, and are failing to notice when agents reward hack during that research. [2]

Until recently, researchers have mostly used agents for convergent tasks with well-defined goal conditions: coding, producing analysis reports, running autoresearch loops, etc. When agents do this type of research, the focus is on the output they create, not the path taken to create that output. Humans play the “reviewer” described above by Anthropic: did the output meet the task requirements?
I like to call this “destination-focused” research, since researchers define the problem as a bounded one with an end state they want to reach. With this type of research, human researchers may struggle to detect when the agent reward-hacks, hallucinates or produces so much output that it overwhelms them. And while obvious incompetence (where the output is clearly wrong) can be easily detected by human researchers, subtle incompetence (where the output has ambiguous or obfuscated mistakes) is much harder to detect.

The tools and techniques that exist today for managing this type of research are limited. Monitors are being developed to detect failures like reward hacking and hallucinations, but researcher overwhelm from managing swarms of agents is becoming a bigger problem. And when researchers are overwhelmed, they can’t detect subtle incompetence.

Recently, researchers have started to use agents for exploratory, divergent research as well. This type of research is focused not on a specific destination but on a direction: a researcher picks a path to explore, not necessarily knowing where it leads (e.g., something “smells bad” and they try to figure out why, or they use their experience and taste to choose a hypothesis to explore).

Such “direction-focused” research introduces new problems: researchers must now evaluate not only the output of the agent, but also the path or paths it chose to produce that output. What was a problem of tracking and guiding an agent along a single route from point A to point B becomes a problem of tracking and guiding an exploding graph of research paths being explored simultaneously by a swarm of agents. This type of research is often open-ended with no clear stopping condition.
Not only do researchers now have all of the same problems as in destination-focused research (detecting incompetence and subversion without being overwhelmed), they also need to understand:

Once a human researcher understands all this, they need an effective way to guide not just one agent, but swarms of agents, in a way that allows them to provide their taste, tacit knowledge and judgement at the decision junctions that matter.

The tools and techniques that exist today for managing this type of research are almost non-existent. The researchers performing this type of research with agents either build their own ad-hoc tools, or struggle with infrastructure and tools designed for destination-focused research.

Effective monitoring and guidance for direction-focused research is further complicated by these additional aspects that destination-focused research doesn’t share:

Finally, while destination-focused research can often be performed without any direction-focused research, direction-focused research often involves a fan out / fan in process, where branches are explored before one or more is picked for destination-focused research, after which the research may switch back to direction-focused.

At the highest level: Human-Guided Agentic Research studies how humans can interpret and guide research performed by autonomous agents.

The agenda breaks down into three key pillars:

Cross-cutting all of these are two aspects:

Note that this is a research agenda, not an engineering agenda. While doing this research will inevitably require building new tools, the focus should be on validating those tools to find out which actually improve human-agent collaboration, e.g., by doing user studies, developing evals, etc.

What are the ways agentic research could fail if we don’t improve how humans can interpret and guide this research?

In the short term, AI safety research will miss out on a productivity increase at a time when AI capabilities are rapidly increasing.
Without the interfaces, infrastructure and frameworks to manage swarms of research agents, research will be slower than it needs to be. While AI capabilities research may be slower as well, I believe it relies more on destination-focused research, which will be easier to automate and manage than the exploratory direction-focused research that will provide the biggest returns for AI safety research.

In the long term, losing the ability to interpret and guide swarms of research agents, whether on the capabilities or the safety side, may cause issues due to:

In the best case, these lead to wasted effort or subtle research errors that are hard to catch. In the worst case, these lead to the development of misaligned models, whose subtle misalignment may compound over time.

As the number of agents doing research scales, additional issues may appear, including:

Mitigating all of these threats safely and effectively requires both developing new methods and validating that those methods work.

Many will notice similarities between this agenda and other AI safety research agendas, and the overlap is real: agendas rarely have no overlap at all with one another. What makes this agenda distinct and how does it compare to other agendas?

Scalable oversight presumes that the model is far more intelligent than humans, while human-guided agentic research presumes the model lacks taste, tacit knowledge and alignment, and that on the default path these gaps cause issues that humans won’t be able to detect. To solve this, scalable oversight researches how weaker models can be used to train or oversee stronger models to maintain alignment better than humans can. Conversely, human-guided agentic research researches how humans can more effectively interpret and guide the models themselves. Put more simply: in scalable oversight, humans are outmatched; in human-guided agentic research, they are blindfolded.
In scalable oversight, model-model interaction is researched; in human-guided agentic research, human-model interaction is researched.

AI control focuses on the period from when models are capable enough to cause significant harm if they are scheming until models are so superintelligent that any means of humans controlling them will be futile. The approach AI control takes is similar to that of scalable oversight: use a less intelligent model to control a more intelligent model, or design a protocol where an untrusted model can oversee another untrusted model without the two being able to collude. The focus here remains on the model-model interaction. Human-guided agentic research focuses on the same period of time, assuming that at some point models will be so superintelligent that no human will be able to effectively control them. However, it researches human-model interaction. The two agendas are complementary. While AI control focuses on ensuring a model cannot take a harmful action, human-guided agentic research focuses on ensuring a model takes effective, aligned actions, by enabling humans to continuously inject their taste, tacit knowledge and values into its work.

Cooperative AI researches how humans and agents working together within systems can produce mutually beneficial outcomes. It tackles issues like mixed motives, collusion and how social intelligence and alignment arise. While some of the human-AI interaction research in cooperative AI may be relevant for human-guided agentic research, the shape of the agent swarms being directed and the interaction patterns between humans and agents differ. Agents being managed during agentic research have the same motive / research plan, and interact within a far more limited system than the agents researched within cooperative AI.
Cooperative AI is a broad agenda focused on how to get multiple agents and humans to work together, while human-guided agentic research is focused specifically on how agents can be guided when doing research.

The overarching question to answer is: How does a human researcher safely and effectively guide a swarm of research agents? Below is a sample of some of the specific research directions that can be explored in support of this. This includes both adapting existing research for monitoring and guidance of swarms of agentic researchers and developing entirely new research.

For each research direction, I’ve included a partial list of prior work and other resources that may be relevant. Treat these as starting points; I skimmed and evaluated each one as potentially helpful, but haven’t yet read them all in depth.

As frontier labs drive toward recursive self-improvement, won’t they just develop the tools and infrastructure themselves?

My impression is that most of the frontier labs think the process can be fully automated without human input, so they’ll focus mostly on passive monitoring tools and infrastructure, not active guidance. Even if labs decide to develop active guidance tools and infrastructure, safety departments are resource constrained and thus unlikely to do the user studies required to determine which approaches increase safety as well as productivity.

Since agent orchestration platforms are proliferating, won’t others in industry develop these tools and techniques?

Possibly, but doubtful. The divergent workflows used in much of primary research differ significantly from the convergent workflows used by agent employees, embedded agents and “deep research” agents doing secondary research. Industry tools developed for the latter will be less effective for the former. Industry also doesn’t share the same safety concerns about rogue AI.
It’s possible that academia will develop such tools and techniques, but it’s unclear on what timeline, and to what extent such tools will address AI safety concerns such as sabotage and sandbagging.

Why do research and develop tools for managing agents when an agent can just generate a user interface on the fly?

While agents can generate user interfaces on the fly, those interfaces may intentionally or unintentionally hide critical information. Researchers shouldn’t trust agents that might be reward-hacking or sandbagging to accurately report on their own actions. More importantly, consistency is a key aspect of being able to reason about the information presented to a researcher. If a new interface is used each time information is presented, it introduces cognitive overhead that reduces reasoning ability. Finally, we need research to understand which interfaces are best at surfacing actionable information. A key assumption of this agenda is that we want to guide our autonomous agents; and to effectively guide, we need visibility into what our agents are doing.

Agents are already starting to acquire taste, so why can’t we automate everything?

Even if agents have all the taste, tacit knowledge and values of a theoretical human researcher, we will still likely want to keep humans in the loop. Different researchers approach problems in different ways. And even the best researcher, when operating autonomously over long periods of time, may deviate from the goals you set for the research. You wouldn’t hire a human researcher and then leave them for months or years at a time without checking their research or giving them guidance, so why would you do that with an agentic researcher?

Won’t keeping humans in the loop slow down research, allowing those labs who take humans out of the loop to outpace those that keep humans in the loop?

Potentially. Though I would argue that the details along the path matter.
The butterfly effect shows that small differences in initial conditions can cause unexpected results later. I believe the taste, tacit knowledge and values that a human researcher can inject into the research process through effective insight and guidance will produce better results than optimizing that researcher away and having small deviations compound into research errors, or worse, hidden or unexpected misalignment in future models developed using RSI without humans involved.

Autonomous research poses unique risks and challenges not currently being addressed. As autonomous research gets more broadly adopted and the speed of that research accelerates, it’ll be important for humans to be able to effectively monitor and guide our autonomous agents. Research into the most effective methods of doing so should begin now.

Thank you to Tyler Tracy, Carlos Giudice, Douw Marx, Monika Jotautaitė and Alex McKenzie for reviewing drafts of this post and providing valuable feedback.

[1] I use the term RSI here to mean AI models being used to accelerate the development of future AI models, whose creation further accelerates development, creating a feedback loop. Autonomous RSI, where models are doing all of the development themselves without human intervention, may accelerate development further.

[2] Case in point: while using Claude Code to research how auto mode could be subverted in AI control settings, Claude Code reported a success. I pushed it to refine the attack, but a few turns later became suspicious. Searching through the conversation history, I discovered in the wall of text Claude generated during the research that it had succeeded by cheating: it turned off auto mode rather than subverting it.
[3] One tendency I’ve noticed in my own research is accepting one of the next research directions my agent gives me after completing an experiment instead of typing my own; a behavior that could be nudged the other way by putting the freeform “Do something different” input at the top instead of the bottom of the option list.