As organizations rush to move AI into production, they’re finding that the tools they rely on to monitor traditional software don’t translate cleanly to AI systems. The reason is fundamental: AI doesn’t fail as software does. It doesn’t throw clean error codes or follow predictable execution paths. It drifts, hallucinates, and degrades in ways that are often subtle, intermittent, and hard to reproduce.
The result is a growing gap between what teams think observability should provide and what current tools actually deliver. The uncomfortable truth? The AI observability tools we have today are built for yesterday’s problems.
To understand where the industry is headed, we need to look at where it is today and why that’s not enough.
Today’s AI observability landscape is dominated by one concept: evaluation.
Most tools focus on scoring model outputs after the fact. They rely on test datasets, human graders, or, increasingly, “LLM-as-a-judge” approaches to determine whether a system is behaving correctly. These evaluation pipelines are useful and can provide a baseline for model quality, helping teams benchmark improvements.
But they do share a critical limitation. They’re static, offline, and backward-looking.
Evaluations tell you how a model performed on a predefined set of inputs. But they don’t tell you what’s happening in production, where inputs are unpredictable and context can shift. You need to capture long-running interactions, multi-step workflows, and the behavior of systems composed of multiple models and tools as a part of your evals.
Even when teams use human-in-the-loop feedback, it can be tough to scale. High-quality feedback requires domain expertise, consistency, and time, each of which is in short supply in most engineering organizations. You also need deep knowledge of the models themselves and how they’re working in production to help identify and provide feedback around the source of the error. Was it a lack of context? A bad retrieval-augmented generation (RAG) implementation? The model itself? Or bad feedback poisoning the results?
Some progress is being made. OpenTelemetry (OTel) and LLM tracing are emerging as early attempts to bring runtime visibility into AI systems. But these are still just first steps, and the core issue remains: you can’t understand AI systems by evaluating them after the fact. You need to observe them as they operate.
As AI systems move into production, observability becomes more about managing risk. The attack surface has expanded dramatically, with teams now dealing with:
In response, a new category of “guardrail” tools has emerged. These systems aim to monitor inputs and outputs in real time, flagging or blocking unsafe behavior. In theory, they provide a safety layer that sits between users and models.
In practice, however, the picture is more complicated.
Most guardrails today are reactive. They rely on predefined rules or classifiers that attempt to catch known patterns. But AI systems are inherently open-ended, and adversarial inputs evolve quickly. What works today may fail tomorrow.
There’s also a deeper issue: guardrails operate on the assumption that you already have sufficient visibility into the system. In reality, many teams lack the underlying telemetry needed to understand how and why a failure occurred in the first place.
This creates a gap between what guardrails promise (real-time protection) and what they can reliably deliver. Closing that gap requires something more foundational than filtering inputs and outputs. It requires rethinking observability itself.
The next wave of AI is clearly about autonomous agents. Instead of single inference calls, we’re seeing systems that orchestrate multiple models, interact with external tools and APIs, and execute multi-step workflows over extended periods of time.
These systems don’t just generate outputs; they make decisions. And that changes the observability problem entirely.
Just as containers required orchestration platforms like Kubernetes to become manageable at scale, AI agents will require their own observability and control layer. That layer must go beyond tracking inputs and outputs. It needs to capture:
In many ways, this is similar to what we saw with the evolution of cloud-native observability. We moved from simple metrics to a combination of logs, metrics, and traces to understand distributed systems.
Now we need the equivalent for agentic systems.
As AI becomes embedded across the software development life cycle, from code generation to testing to operations, observability is evolving into a system of truth that feeds both humans and machines. AI agents can only build, debug, and improve systems if they have access to rich, high-fidelity production context. Observability is what provides that context.
There’s a fundamental trust problem at the heart of AI observability. If an AI agent is responsible for reporting its own behavior, how do you know that behavior is being reported accurately?
Traditional observability relies heavily on instrumentation within the application layer. But instrumentation can be incomplete, misconfigured, inadvertently bypassed, or simply incorrect.
This problem becomes more acute as AI systems begin generating their own code. Agents don’t think like human engineers when it comes to instrumentation, nor should they be expected to. But the result is a growing need for independent, out-of-band observability.
This is where kernel-level approaches, such as eBPF, become critical. By operating at the kernel level, eBPF enables teams to:
More importantly, eBPF provides a trusted source of truth. In high-stakes environments where compliance, security, and reliability are non-negotiable, this independence is essential. You need telemetry that’s not influenced by the systems it observes.
If current tools fall short, what comes next? The answer is a shift in how we think about observability. First, we need behavioral anomaly detection for AI systems. Traditional observability focuses on latency, errors, and resource utilization. But AI systems require a different lens to detect when behavior deviates from expectations, even when no explicit “error” occurs.
Second, we need tamper-proof audit trails. As AI systems take on more responsibility, you have to be able to reconstruct decisions. Teams need to understand what happened and, more importantly, why. And they need to trust that the data hasn’t been altered.
Third, observability must become dynamic and adaptive. Static dashboards and predefined metrics won’t cut it. AI systems operate in constantly changing environments, and observability must be able to:
Finally, observability must integrate directly into AI workflows. It’s no longer enough to surface insights to human operators. The same telemetry must be consumable by AI agents feeding back into development, debugging, and optimization loops.
We are still early in the evolution of AI observability. Most of today’s tools are extensions of existing paradigms adapted for AI, but not fundamentally redesigned for it. Predictably, they solve parts of the problem, but not the whole.
The next generation of these systems will look very different. They’ll treat observability as a core layer that enables AI systems to operate safely, efficiently, and autonomously. The teams that succeed will be those that recognize this shift early.
Ultimately, in a world of non-deterministic systems, long-running workflows, and autonomous agents, one thing becomes clear: AI reliability strongly correlates with your observability layer.
—
New Tech Forum** provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to *** doug_dineley@foundryco.com.*