LangChain Interrupt: Agents Moved Into the Runtime | Focused

LangChain shifted its focus from model capabilities to runtime infrastructure at its Interrupt conference, releasing LangSmith Engine to turn agent execution traces into automated debugging and improvement cycles. The company also introduced SmithDB, which stores agent traces whose data structures differ fundamentally from those of web application traces, and Deep Agents v0.6, whose features enable cost-efficient routing of routine tool work to smaller models while reserving frontier models for complex reasoning.

Interrupt felt different this year. Less model worship. More runtime. The more useful conversations at the conference took a practical turn.

Agents work best when the workflow is built agentically from the start. On the ground, though, many existing enterprise processes are simply wrapped around a model: buttons are pressed, forms are filled out, and the model's output is copied and pasted into another field or application. That falls apart after the demo.

The better approach is to build the workflow agentically in the first place. The parts that should be deterministic should be handled by software. The LLM can then be used for judgment, synthesis, planning, dealing with ambiguity, and setting priorities. The harness and static code should be equipped with tools that have contracts, with state, and with a way for the agent to recover from failures. They should also give enough visibility that when the agent does something weird, someone can actually debug it.

The release I kept coming back to was LangSmith Engine https://www.langchain.com/blog/introducing-langsmith-engine . This is the loop people have been trying to describe for improving agents over time.
The new engine watches agent execution traces. It clusters failures and turns them into issues. It analyzes the harness's production code to diagnose the root cause of a problem. It writes PRs for fixes. It proposes online evaluators. It moves failing production traces into offline eval sets for further improvement.

Production behavior becomes evidence. That evidence becomes an issue for the system to fix. That fix becomes an evaluator watching for the failure to come back. That is cool.

This changes observability. The longstanding view of a trace as merely a receipt of an application's actions is becoming obsolete. In agent systems, traces become evidence for the next cycle of improvement, whether that is harness tuning, updated prompts, additional context, alternative models, new evaluators, or the repair of workflows.

Storage makes a similar point, as showcased by SmithDB https://www.langchain.com/blog/introducing-smithdb . Agent traces have a fundamentally different shape from the traces teams are accustomed to following through web applications. They have a different event density, individual spans that take longer to execute than those in a web trace, and deep, wide span trees with highly variable timing. A single agent run can include tool calls, context retrieved during the run, model or evaluator output, files opened or written, user-provided feedback, or, rarely, even a full research project.

An agent trace is different from an application trace. Instead of helping to debug why an application is slow or failing, it helps explain why an agent decided to take a particular action in the first place. That is a different data problem.
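
To make the "different shape" point concrete, here is a minimal sketch of an agent run as a span tree. The `Span` schema, the span kinds, and the numbers are all made up for illustration; this is not SmithDB's data model, just one way to picture deep, wide trees with mixed span types and uneven timings.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in an agent run (hypothetical schema, not SmithDB's)."""
    name: str
    kind: str          # e.g. "model", "tool", "retrieval", "feedback"
    duration_s: float
    children: list[Span] = field(default_factory=list)


def depth(span: Span) -> int:
    """Number of levels in the span tree rooted at `span`."""
    return 1 + max((depth(c) for c in span.children), default=0)


def max_fan_out(span: Span) -> int:
    """Largest number of direct children under any span in the tree."""
    return max([len(span.children)] + [max_fan_out(c) for c in span.children])


def kinds(span: Span) -> Counter:
    """Count spans by kind across the whole tree."""
    total = Counter({span.kind: 1})
    for child in span.children:
        total += kinds(child)
    return total


# An invented run: durations range from a fraction of a second to minutes.
run = Span("research_task", "agent", 412.0, [
    Span("plan", "model", 9.5),
    Span("search_docs", "tool", 31.0, [
        Span("fetch_context", "retrieval", 4.2),
        Span("rank_results", "model", 6.8),
    ]),
    Span("write_report", "tool", 120.0),
    Span("user_rating", "feedback", 0.1),
])

print(depth(run), max_fan_out(run), dict(kinds(run)))
```

Even this toy run mixes model calls, tool calls, retrieval, and human feedback in one tree, which is the kind of record a store built around short, uniform web request spans was not designed for.
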
The same theme shows up in Deep Agents v0.6 https://www.langchain.com/blog/deep-agents-0-6 . The interesting parts are not flashy: code interpreter, programmatic tool calling, typed streaming, DeltaChannel checkpoint storage, and harness profiles for different models. This means lower-cost models such as Kimi can handle routine tool work like summarization, search, and extraction. That type of work should not burn frontier tokens. Frontier models, in turn, handle the harder reasoning, where the ceiling of the model is the ultimate determinant. A team might use Kimi here, Qwen there, DeepSeek somewhere else, and reserve Opus or GPT for when the work actually needs it.

Just because a model sits in a box inside a managed platform doesn't mean that changing it is free. A swap can have cascading effects on prompts and how a team writes them, on tool calls and how the system calls tools programmatically, on typed streaming, and on how DeltaChannel checkpointing behaves. Failure modes appear in entirely new parts of the system, and cost and latency have entirely different trade spaces. A team that is tweaking the harness or static code to improve an agent wants to know how those tweaks interact with changes to the model.

The question that recurred throughout the conference: what to put in the model, what to put in the harness, what to put in static code, and what to put in the runtime?

Platforms like Managed Deep Agents https://www.langchain.com/blog/introducing-managed-deep-agents , Context Hub https://www.langchain.com/blog/introducing-context-hub , and Sandboxes https://www.langchain.com/blog/langsmith-sandboxes-generally-available all provide ways to manage durable threads and other important runtime structure. Files, skills, subagents, and versioned context. Safe code execution and human approval flows.
Tracing and checkpoints. And, of course, a memory that lives somewhere other than in the vibes of the person interacting with the system.

MCP will connect agents to tools within an organization's own stack. This is different from A2A, where agents interact with other agents. Some of those will be user agents, and others will be digital employees with their own permissions, budgets, policy boundaries, and audit trails. This will require different auth models than current approaches. In many cases, existing agents are either essentially useless or, worse, frighteningly powerful in their current form.

LLM Gateway https://www.langchain.com/blog/introducing-llm-gateway covers the control side: spend limits; PII redaction; tracking of policy events, such as whether spend limits are set, what they are, and who updated them; trace continuity, so that after a trace is sent to an external tool, the original trace can still be viewed in the same view as the trace the external tool generates; and eventually the same type of controls for external tools as for MCP, meaning tool gateway controls.

My takeaway from Interrupt is that the agent conversation is getting more honest. The hard part is putting the LLM in the right place to exercise judgment and surrounding it with code that enables traceability, limits, evals, and improvement. That is what made Interrupt feel important. The ecosystem is moving toward LLM-powered agents as runtime participants within actual software systems, instead of as magic employees. Good. That is where the work is.
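
As a footnote to the gateway discussion, here is a rough sketch of what a spend limit with policy-event tracking could look like in plain code. Everything here is hypothetical: `SpendGuard`, `PolicyEvent`, the actor names, and the dollar amounts are invented for illustration, and none of it is the LLM Gateway API.

```python
from dataclasses import dataclass, field


@dataclass
class PolicyEvent:
    """An auditable record of a policy decision or change (hypothetical)."""
    actor: str
    action: str
    detail: str


@dataclass
class SpendGuard:
    """Hypothetical per-agent budget check; not the LLM Gateway API."""
    limit_usd: float
    spent_usd: float = 0.0
    events: list = field(default_factory=list)

    def set_limit(self, actor: str, new_limit_usd: float) -> None:
        # Record who changed the limit and what it changed to.
        self.events.append(
            PolicyEvent(actor, "limit_updated", f"{self.limit_usd} -> {new_limit_usd}")
        )
        self.limit_usd = new_limit_usd

    def charge(self, actor: str, cost_usd: float) -> bool:
        """Allow the call only if it fits in the remaining budget."""
        if self.spent_usd + cost_usd > self.limit_usd:
            self.events.append(PolicyEvent(actor, "call_blocked", f"cost {cost_usd}"))
            return False
        self.spent_usd += cost_usd
        return True


guard = SpendGuard(limit_usd=1.00)
first = guard.charge("research-agent", 0.75)    # fits within the budget
second = guard.charge("research-agent", 0.50)   # would exceed 1.00, so blocked
guard.set_limit("admin@example.com", 2.00)      # the change itself is logged
third = guard.charge("research-agent", 0.50)    # fits after the update
```

The point of the sketch is the event list: the limit is deterministic code wrapped around the model, and every block or change leaves a record that someone can audit later.
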