How AI Agents Work: An Architectural Deep Dive

An analysis of the patterns, infrastructure, and trade-offs behind the systems that have redefined what large language models can do

A new architectural analysis argues that production AI agent systems are dominated by operational infrastructure rather than AI decision logic: by one widely circulated reading of Claude Code's leaked source code, only about 1.6% of the codebase constitutes AI reasoning, while the remaining 98.4% handles context management, tool design, and reliability engineering. The research finds that agent architecture has converged around a small set of well-understood patterns centered on the ReAct loop, in which a large language model reasons, calls external tools, observes results, and repeats. It also warns that 95% of enterprise AI pilots deliver zero measurable ROI and that Gartner predicts 40% of agentic AI projects will be scrapped by 2027 due to rising costs and integration complexity. The findings underscore that the field's primary bottleneck is now evaluation methodology, not model capability, as roughly half of SWE-bench-passing code changes would not be accepted by real software maintainers.

Executive Summary

The term “AI agent” has become one of the most overloaded in modern tech, but at its core it refers to a simple pattern: a large language model (LLM) connected to external tools and operating in a loop where it reasons about what to do, calls a tool, observes the result, and repeats until the task is complete. This pattern, known as ReAct after the 2022 paper “ReAct: Synergizing Reasoning and Acting in Language Models,” has become the foundation of nearly every production AI agent today. What makes agents work well is not the model itself but the surrounding infrastructure: how context windows are managed across thousands of tool calls, how tools are designed for non-deterministic consumers, and how safety boundaries are enforced.
A widely circulated claim has become the defining statistic in this space: Claude Code’s leaked source code revealed that only about 1.6% of its codebase constitutes AI decision logic, with the remaining 98.4% being operational infrastructure [3]. This figure is disputed: critics argue it misinterprets how the Liu et al. paper categorizes different kinds of code, and that the distinction between “AI logic” and “infrastructure” is itself an interpretive choice rather than a fact about the code. Regardless of the exact percentage, the underlying intuition holds: production agent systems are dominated by operational engineering.

The architecture has evolved through several identifiable layers:

- The ReAct loop (Thought → Action → Observation) interleaves reasoning traces with external actions so the model can induce, track, and update plans while interacting with real data sources.
- Tool use connects the model to APIs, files, databases, and other systems. The key insight is that tools must be designed specifically for agents, i.e., non-deterministic consumers, not just wrapped as API endpoints.
- Memory comes in two forms: short-term (in-context learning, bounded by the context window) and long-term (external vector stores, accessed via Retrieval-Augmented Generation).
- Planning and composition patterns (orchestrator-workers, evaluator-optimizer, parallelization) allow agents to handle complex multi-step tasks.
- Multi-agent systems delegate subtasks to specialized workers, accepting sharply higher token costs in exchange for dramatic gains in capability on open-ended problems.
- Observability (distributed tracing via OpenTelemetry GenAI semantic conventions, infinite loop detection, cost attribution, and session replay) has emerged as a critical operational layer. Without it, debugging non-deterministic agent behavior is nearly impossible.

The most important finding from this research is that agent architecture has converged around a small set of well-understood patterns.
The competition between framework vendors (LangChain, CrewAI, OpenAI’s SDKs, Anthropic’s Agent SDK) is largely about ergonomics. Real engineering effort goes into context management, tool design, and reliability, areas where the best practitioners have accumulated significant domain knowledge.

A second important finding is that the gap between agent benchmarks and real-world performance is much wider than commonly assumed: 95% of enterprise AI pilots deliver zero measurable ROI [25], and roughly half of SWE-bench-passing PRs would not be merged by real maintainers [17]. The field’s primary bottleneck is now evaluation methodology, not model capability [21].

A third finding: the “agent winter” critique has empirical backing. Enterprise adoption has been slower and more cautious than early hype suggested, with Gartner predicting that 40% of agentic AI projects will be scrapped by 2027, citing “rising costs, unclear business value, and integration complexity,” and PwC identifying integration complexity (67%), lack of monitoring (58%), and unclear escalation paths (52%) as the top causes of pilot failure.

1. Definitions: What Is an “Agent” and How Does It Differ from Other AI Systems?

The word “agent” has a long history in computer science. The classic definition from Russell and Norvig’s Artificial Intelligence: A Modern Approach describes an agent as anything that perceives its environment through sensors and acts upon that environment through actuators. This is a broad definition; a thermostat is technically an agent.

In the modern AI literature, the term has narrowed. Anthropic defines agents as “systems where LLMs dynamically direct their own processes and tool usage,” distinguishing them from workflows: systems where LLMs and tools are orchestrated through predefined code paths.
This distinction matters: a customer support bot that follows a decision tree of prompts is a workflow; one that decides on its own whether to query a knowledge base, check a user’s account history, or ask for clarification is an agent.

The key property that makes something “agentic” is autonomy in tool selection and task decomposition. An autonomous system chooses which tools to use and in what order; it breaks complex goals into subgoals without explicit human instruction for each step.

A related term, copilot, refers to systems that assist a human operator but do not operate independently. ChatGPT, GitHub Copilot, and Cursor are copilots: they generate suggestions but require the user to approve and execute each action. Claude Code occupies an interesting middle ground: it can autonomously edit files and run commands in a sandbox, but permission modes (plan, default, auto) control how much autonomy it has.

2. The ReAct Pattern: Core Architecture

The single most important pattern in agent design is ReAct (short for “Reasoning and Acting”), introduced by Yao et al. at Google Research and Princeton University in October 2022 [1]. Before ReAct, reasoning (chain-of-thought prompting) and acting (action plan generation) had been studied as separate capabilities. The paper’s central insight was that interleaving them creates a synergy: reasoning traces help the model induce, track, and update action plans, while actions enable interaction with external sources of information.

How the Loop Works

The ReAct loop is deceptively simple:

```
while not done:
    thought = model(reasoning_trace + available_tools)
    if thought is a tool call:
        result = execute_tool(thought.tool, thought.args)
        observation = format_result(result)
        append observation to reasoning_trace
    else:
        return thought
```

In practice, the “thought” that the model generates can be either a natural-language reasoning step or a structured tool call. The model alternates between these two types of outputs.
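The control flow of the loop can be exercised without any model API by substituting a scripted stand-in for the LLM. The sketch below is a toy under stated assumptions: `scripted_model`, `lookup`, `TOOLS`, and the string-based trace format are all hypothetical names invented for illustration, and a real agent would replace `scripted_model` with an actual model call. It also adds a step limit, which the pseudocode leaves implicit.

```python
from typing import Callable, Union

# A "thought" is either a final answer (str) or a tool call (tool name, argument).
ToolCall = tuple[str, str]
Thought = Union[str, ToolCall]


def lookup(key: str) -> str:
    """Hypothetical tool: a tiny in-memory knowledge base."""
    return {"capital of France": "Paris"}.get(key, "not found")


TOOLS: dict[str, Callable[[str], str]] = {"lookup": lookup}


def scripted_model(trace: list[str]) -> Thought:
    """Stand-in for an LLM: calls the tool once, then answers from the observation."""
    observations = [line for line in trace if line.startswith("Observation:")]
    if not observations:
        return ("lookup", "capital of France")
    return "Final answer: " + observations[-1].removeprefix("Observation: ")


def react_loop(question: str, model: Callable[[list[str]], Thought], max_steps: int = 5) -> str:
    trace = [f"Question: {question}"]
    for _ in range(max_steps):
        thought = model(trace)
        if isinstance(thought, tuple):
            # Tool call: execute it deterministically, then record the observation.
            tool_name, arg = thought
            result = TOOLS[tool_name](arg)
            trace.append(f"Action: {tool_name}({arg})")
            trace.append(f"Observation: {result}")
        else:
            # Plain text: treat it as the final answer and stop.
            return thought
    return "Stopped: step limit reached"


answer = react_loop("What is the capital of France?", scripted_model)
print(answer)
```

Because the model is just a callable that receives the trace, swapping in a different policy (or a real LLM client) does not change the loop itself, which is the separation the pattern relies on.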
Each iteration adds both a reasoning trace and an observation (the result of the previous action) to the context window.

Why It Works

There are three reasons ReAct outperforms its predecessors:

- Error correction: Chain-of-thought reasoning alone is vulnerable to error propagation. If the model makes a mistake in step 2, every subsequent step compounds that error. By interleaving actions (like Wikipedia lookups), the agent can detect and correct mistakes early.
- Information grounding: The ReAct paper showed that on question-answering tasks (HotpotQA) and fact verification (FEVER), ReAct “overcomes issues of hallucination and error propagation prevalent in chain-of-thought reasoning by interacting with a simple Wikipedia API” [1].
- Interpretability: Because the agent’s thought process is visible, failures are debuggable. You can see exactly where the model went wrong. Was it the initial plan? A tool call with wrong arguments? An incorrect interpretation of the result?

A Minimal ReAct Implementation

Below is a minimal implementation of the ReAct loop using OpenAI’s function calling API, illustrating how the pattern translates from theory to code:

```python
import json

import openai

# Define tools as JSON schemas the model understands
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_wikipedia",
            "description": "Search Wikipedia for relevant information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Perform arithmetic calculation",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression to evaluate"}
                },
                "required": ["expression"],
            },
        },
    },
]


# Tool implementations (executed by deterministic code, not the model)
def search_wikipedia(query: str) -> str:
    """Actual Wikipedia API call"""
    ...  # real implementation (must return a string for the tool message)


def calculate(expression: str) -> str:
    # Simplified for illustration; eval is unsafe on untrusted input
    return str(eval(expression))


tool_functions = {"search_wikipedia": search_wikipedia, "calculate": calculate}

# The ReAct loop
messages = [
    {"role": "user", "content": "What is the capital of France and what's its population squared?"}
]
max_iterations = 10

for _ in range(max_iterations):
    response = openai.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    msg = response.choices[0].message

    if msg.tool_calls:
        # Model wants to call tools: append its turn to history once (the "Thought" phase)
        messages.append(msg)
        for tool_call in msg.tool_calls:
            # Execute the tool deterministically
            func_name = tool_call.function.name
            func_args = json.loads(tool_call.function.arguments)
            result = tool_functions[func_name](**func_args)
            # Append the observation back to history, keyed to the call it answers
            messages.append({"role": "tool", "content": result, "tool_call_id": tool_call.id})
    else:
        # No tool call; model has a final answer
        print(msg.content)
        break
```

This code illustrates the core separation: the model decides what to do (which tool to call and with what arguments), while deterministic Python code handles the execution. The conversation history grows with each iteration (thought, action, observation) until the model produces a final answer rather than a tool call.

Performance

The ReAct paper reported significant improvements: on ALFWorld (a synthetic household task environment), ReAct outperformed imitation and reinforcement learning methods by an absolute success rate of 34%. On WebShop (an online shopping environment with 1.18 million products), it beat baselines by 10% in success rate. These results were achieved with only one or two in-context examples.

Mechanistic Analysis: Why Interleaving Works and When It Does Not

The ReAct paper’s claim of “synergy” between reasoning and acting has been both validated and challenged by subsequent research.
Understanding why interleaving helps at the model level requires examining what actually happens inside a transformer during an agent loop.

The functional explanation. At the behavioral level, interleaving creates a dynamic feedback loop: each tool output becomes new input for the next reasoning step, allowing the model to continuously update its understanding of the task. Choices are informed by both internal logic (pre-trained knowledge) and external results (tool outputs). This reduces hallucination because the model does not have to rely solely on parametric memory.

The transformer-level explanation. When a model generates a tool call and then receives the tool’s output appended to its context, several things happen at the attention level:

- Attention re-weighting: The newly appended tool output tokens are visible to every subsequent generation step. The model’s attention heads can redistribute their focus across the entire context, including the original prompt, prior reasoning traces, and the fresh observation. This allows the model to “reconsider” earlier decisions in light of new information.
- KV cache growth: Each iteration adds tokens to the key-value cache. Unlike single-pass chain-of-thought (where the entire reasoning trace is generated in one uninterrupted decoding run), ReAct involves multiple separate inference calls. Each call attends over the full, growing context, recomputing it unless the serving stack reuses a cached prefix. The model therefore conditions on prior information together with the new observation rather than continuing a single uninterrupted stream.
- Activation reset: Each new inference call computes fresh activations for the tokens it generates (the KV cache for earlier tokens may persist, but the residual stream for new tokens is computed with the observation in context). This gives the model an opportunity to “reset” its reasoning trajectory based on the new observation, rather than being locked into a single uninterrupted generation where early mistakes propagate.
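The practical consequence of multiple inference calls over a growing context can be sketched with a few lines of arithmetic. The example below is illustrative only: `count_tokens` is a crude whitespace stand-in for a real tokenizer, `simulate_context_growth` is a hypothetical helper, and the step strings are made up. It shows that each call sees the whole history again, so the total number of tokens processed across calls grows faster than the context itself (ignoring any prefix caching).

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def simulate_context_growth(prompt: str, steps: list[tuple[str, str]]) -> list[int]:
    """Return the context size (in tokens) seen by each inference call.

    `steps` is a list of (model_output, tool_observation) pairs. Call k sees the
    prompt plus every output and observation produced by the earlier calls.
    """
    context = prompt
    sizes = []
    for model_output, observation in steps:
        sizes.append(count_tokens(context))  # what this call attends over
        context += " " + model_output + " " + observation
    sizes.append(count_tokens(context))  # final call that produces the answer
    return sizes


prompt = "What is the capital of France?"
steps = [
    ("Thought: find the capital. Action: search_wikipedia France",
     "Observation: Paris is the capital"),
    ("Thought: now the population. Action: search_wikipedia Paris",
     "Observation: population figure returned here"),
]

sizes = simulate_context_growth(prompt, steps)
total_processed = sum(sizes)
print("context per call:", sizes)
print("total tokens processed across calls:", total_processed)
```

With a constant number of tokens added per step, the per-call context grows linearly while the cumulative total grows roughly quadratically in the number of steps, which is one reason context management dominates the engineering effort in long-running agents.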
This mechanism (multiple independent inference calls over a growing context) is fundamentally different from single-pass chain-of-thought, where all reasoning tokens are generated in one continuous decoding run with no external input. In CoT, an error in step 2 is hard to correct because the model never sees external feedback; in ReAct, each tool output provides a grounding signal that can redirect subsequent reasoning.

The pattern-matching hypothesis. Critically, some researchers argue that ReAct’s effectiveness may be overstated. A 2025 study from the Artificiality Institute found that ReAct-style interleaving “does not significantly benefit” LLM performance in controlled experiments, and that “placebo guidance” (random reasoning traces) yielded results comparable to strong reasoning traces [33]. The study found that:

- Replacing specific wording in examples with synonyms caused significant performance drops, revealing heavy dependence on exact phrasing rather than genuine reasoning
- Performance decayed sharply as similarity between example and query tasks decreased
- When guidance was weak or irrelevant, interleaving provided no measurable benefit over direct action generation

This suggests that ReAct may exploit the model’s pattern-matching capabilities (recognizing the Thought → Action → Observation template from training data) rather than enabling genuine deliberative reasoning. The “synergy” observed in the original ReAct paper may partially reflect the model’s ability to follow a structured template it has seen during pre-training, rather than a fundamental improvement in reasoning capability.

When ReAct helps and when it does not.
The evidence suggests ReAct provides the most benefit when:

- External information is genuinely needed (fact-lookup tasks where parametric memory is insufficient)
- Tool outputs provide clear corrective signals (e.g., error messages that specify what went wrong)
- Few-shot examples closely match the target task domain

ReAct provides less benefit when:

- The task can be solved from parametric memory alone (simple knowledge questions)
- Tool outputs are noisy or ambiguous (the model cannot distinguish signal from noise)
- The reasoning trace adds no new information beyond what the tool output already provides

3. How Models Learn to Be Agents: Training Methodology

Before examining how agents use tools at runtime, it is essential to understand how models acquire agent capabilities during training. Function calling and tool use are not emergent properties of scaling; they require deliberate post-training. As the RLHF Book states, tool usage “is a skill that language models need to be trained to have” [28]. This section covers three layers of agent capability development: supervised fine-tuning on tool-use trajectories, preference optimization for tool selection, and reinforcement learning from environment feedback.

Supervised Fine-Tuning on Tool-Use Trajectories

The foundational technique for teaching models to use tools is supervised fine-tuning (SFT) on datasets of tool-use trajectories. A trajectory is a sequence of interleaved messages and tool calls that represents a complete agent interaction:

User: "What's the weather in Tokyo?"
Assistant (reasoning):