What Is an Agent Registry? (And What We Broke Before We Had One)

About four months into our agentic AI buildout, our head of security asked a question I couldn't answer: "Can you give me a list of every AI agent running in production, what systems they have access to, and what version of each is currently deployed?"

I had a rough mental model. I knew about the agents my team had built. I had a vague idea of what the data engineering team had shipped. The product team had recently added two agents I'd heard about secondhand.

I spent the better part of a day pulling together a spreadsheet. By the time I finished, one of the agents I'd listed had already been replaced by a newer version. Two of them had been granted access to an internal API I hadn't known about.

The spreadsheet was outdated before I sent it.

That was our forcing function for building a proper agent registry. This post is what I wish I'd read before that conversation happened.

An agent registry is a centralized catalog of AI agents — a single source of truth that tracks every agent deployed in your organization, its capabilities, its integrations, its ownership, and its current state.

The analogy that landed for me: it's to agents what a container registry (Docker Hub, ECR, GCR) is to container images. When you have three containers running, you don't need a registry — you know what you have. When you have 40 containers across six teams, you need a registry to know what's running, who owns it, what version is deployed, and what depends on what.

Agents are the same. At two or three agents, a shared Notion doc is sufficient. At 14 agents across three teams, you need infrastructure that tracks state, not a doc that someone last edited last month.

A registry stores metadata for each agent:

The last one is what distinguishes a registry from a spreadsheet: it's not just a catalog, it's the enforcement point for agent-to-agent communication.
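As a rough illustration of what a single registry entry could hold, here is a minimal Python sketch. The field names (`owner`, `integrations`, `allowed_callers`, and so on) are assumptions drawn from the definition above, not the schema of any particular product, and the in-memory `AgentRegistry` class is a hypothetical stand-in for a real service.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentRecord:
    """One registry entry. All field names are illustrative assumptions."""
    name: str
    owner: str                       # team accountable for the agent
    version: str                     # version currently deployed
    framework: str                   # e.g. "langgraph", "crewai", "custom-http"
    capabilities: list[str] = field(default_factory=list)
    integrations: list[str] = field(default_factory=list)   # systems it can reach
    status: str = "active"           # current state
    allowed_callers: set[str] = field(default_factory=set)  # agents allowed to invoke it


class AgentRegistry:
    """In-memory stand-in for a real registry service."""

    def __init__(self) -> None:
        self._agents: dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        self._agents[record.name] = record

    def get(self, name: str) -> Optional[AgentRecord]:
        return self._agents.get(name)


registry = AgentRegistry()
registry.register(AgentRecord(
    name="ticket-triage",
    owner="platform-team",
    version="1.4.2",
    framework="langgraph",
    capabilities=["classify_ticket"],
    integrations=["jira"],
    allowed_callers={"support-orchestrator"},
))
```

The `allowed_callers` field is the part a spreadsheet can't do much with: it only matters if something reads it at call time and acts on it.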

We ran without a registry for longer than we should have. Here's what actually broke.

Shadow agents. Three separate teams had independently built agents that called our internal data API. None of them knew about the others. When we introduced rate limits on that API, two of the agents started failing intermittently — and we spent a week debugging what we thought was a data API problem before realizing the actual problem was three agents competing for quota we'd only budgeted for one.

Version confusion at 2am. An agent went into production with a bug. We rolled back. The rollback was applied to one environment but not the other. For six hours, our staging environment had the fixed version and production had the broken one, because there was no single source of truth for which version was where. The incident took longer to resolve than it should have because different team members were looking at different version references.

The offboarding gap. When an engineer left the team, we revoked their credentials for the systems we knew about. Three weeks later, a contractor reported that an internal Jira webhook was still firing from an agent the departed engineer had built. The agent wasn't registered anywhere. It was running on infrastructure the engineer had stood up personally, using credentials that were never on the offboarding checklist because nobody knew the agent existed.

M×N integration hell. Each new agent that needed to call tools had to build its own integration with each tool. Eight agents, six tools: 48 potential integration points, each with its own credential management, error handling, and retry logic. When a tool API changed, we had to manually find and update every agent that used it.
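The arithmetic is worth spelling out. This small sketch compares point-to-point integrations with the case where every agent and every tool is registered once against a shared layer; the function names are made up for illustration.

```python
def direct_integrations(agents: int, tools: int) -> int:
    """Every agent integrates with every tool itself: M x N."""
    return agents * tools


def shared_registrations(agents: int, tools: int) -> int:
    """Every agent and every tool is registered once: M + N."""
    return agents + tools


print(direct_integrations(8, 6))   # 48
print(shared_registrations(8, 6))  # 14
```

Adding a ninth agent costs six more integrations in the first model and one more registration in the second.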

The registry fixes all four of these. Shadow agents can't exist if registration is a prerequisite for deployment. Version state is tracked centrally. Offboarding is "revoke this agent's access in the registry." M×N integrations collapse to M+N registrations: each tool is registered once, and each agent points to the registry.
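Two of those fixes, registration as a deployment prerequisite and central version state, can be sketched in a few lines. This is a toy model under stated assumptions: `deployed_versions`, `record_deployment`, and `version_drift` are hypothetical names, and a real setup would put this check in the deploy pipeline rather than in a module-level dictionary.

```python
class UnregisteredAgentError(Exception):
    """Raised when a deployment is attempted for an unknown agent."""


# Hypothetical registry state: agent name -> deployed version per environment.
deployed_versions: dict[str, dict[str, str]] = {
    "ticket-triage": {"staging": "1.4.2", "production": "1.4.1"},
}


def record_deployment(agent: str, environment: str, version: str) -> None:
    """Refuse to record a deployment for an agent that was never registered."""
    if agent not in deployed_versions:
        raise UnregisteredAgentError(f"{agent!r} must be registered before deployment")
    deployed_versions[agent][environment] = version


def version_drift(agent: str) -> bool:
    """True when environments disagree about which version is live."""
    return len(set(deployed_versions[agent].values())) > 1


print(version_drift("ticket-triage"))  # True
record_deployment("ticket-triage", "production", "1.4.2")
print(version_drift("ticket-triage"))  # False
```

Calling `record_deployment("shadow-agent", "production", "0.1.0")` raises `UnregisteredAgentError`, which is the whole point: the deploy fails loudly instead of succeeding quietly.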

Worth being explicit, because I conflated some things early on.

It's not a deployment platform. The registry tracks what's running, but it doesn't run the agents. Deployment is a separate concern — Kubernetes, a container orchestrator, whatever your team uses. The registry is the catalog; deployment is the execution layer.

It's not an orchestration framework. LangGraph, CrewAI, AutoGen — those handle how agents coordinate with each other. The registry handles what agents exist and whether they're authorized to talk to each other at all. These are complementary, not competing.

It's not an MCP server list. An MCP server registry catalogs available tools. An agent registry catalogs available agents. Both are useful. Both are needed. TrueFoundry calls the combination of the two a unified MCP and Agents Registry — one place where you can see both the tools agents can use and the agents themselves. That unification matters because the governance question is really "which agents can call which tools" — you need both catalogs to answer it.

It's not just a spreadsheet. The spreadsheet version of an agent catalog is a snapshot. A proper registry is stateful — it connects to your observability layer and shows live performance, not last-week's-update performance. When TrueFoundry's registry shows you an agent's success rate, it's pulling from real-time telemetry, not a manually updated field.

The pattern that made everything cleaner: every agent registers with the gateway using the Model Context Protocol. Once registered, the agent looks like a standard MCP endpoint to every other agent in the system. A LangGraph agent and a CrewAI agent and a custom HTTP service all appear as the same kind of thing to the orchestrator — they're all just callable endpoints with a defined schema.
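To make "the same kind of thing" concrete, here is a minimal sketch of a uniform endpoint shape. It does not use the MCP SDK or any real framework API; `AgentEndpoint`, `FunctionAgent`, and `call` are hypothetical names, and the schema check is deliberately simplistic.

```python
from typing import Any, Callable, Protocol


class AgentEndpoint(Protocol):
    """The only shape the orchestrator needs to know about."""
    name: str
    input_schema: dict[str, str]

    def invoke(self, payload: dict[str, Any]) -> dict[str, Any]: ...


class FunctionAgent:
    """Wraps any callable (a graph, a crew, an HTTP client) behind that shape."""

    def __init__(self, name: str, input_schema: dict[str, str],
                 fn: Callable[[dict[str, Any]], dict[str, Any]]) -> None:
        self.name = name
        self.input_schema = input_schema
        self._fn = fn

    def invoke(self, payload: dict[str, Any]) -> dict[str, Any]:
        # Reject payloads that are missing fields declared in the schema.
        missing = [key for key in self.input_schema if key not in payload]
        if missing:
            raise ValueError(f"{self.name}: missing fields {missing}")
        return self._fn(payload)


def call(endpoint: AgentEndpoint, payload: dict[str, Any]) -> dict[str, Any]:
    """The orchestrator's view: it never asks which framework is behind the endpoint."""
    return endpoint.invoke(payload)


summarizer = FunctionAgent(
    "summarizer", {"text": "string"},
    lambda p: {"summary": p["text"][:20]},
)
print(call(summarizer, {"text": "hello world"}))  # {'summary': 'hello world'}
```

The wrapped callable could just as well be a function that posts to a custom HTTP service; the orchestrator's code would not change.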

This is what solves the M×N problem architecturally. Each tool is registered once. Each agent is registered once. The registry maps which agents can call which tools. Agents don't need to know how to integrate with Jira or Slack or your internal data API directly — they call the registry endpoint, and the registry handles routing, credentials, and access control.
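A toy version of that routing layer might look like the following. `ToolGateway` and its methods are invented for this sketch, and credential handling is reduced to a comment; the point is only that the tool handler exists once and the agent-to-tool mapping is checked in one place.

```python
from typing import Any, Callable

ToolHandler = Callable[[dict[str, Any]], Any]


class ToolGateway:
    """Hypothetical routing layer: tools are registered once, grants are checked centrally."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolHandler] = {}
        self._grants: dict[str, set[str]] = {}

    def register_tool(self, name: str, handler: ToolHandler) -> None:
        # In a real system the handler would also own the tool's credentials and retries.
        self._tools[name] = handler

    def grant(self, agent: str, tool: str) -> None:
        self._grants.setdefault(agent, set()).add(tool)

    def call(self, agent: str, tool: str, args: dict[str, Any]) -> Any:
        if tool not in self._tools:
            raise KeyError(f"unknown tool {tool!r}")
        if tool not in self._grants.get(agent, set()):
            raise PermissionError(f"{agent!r} may not call {tool!r}")
        return self._tools[tool](args)


gateway = ToolGateway()
gateway.register_tool("jira.create_issue", lambda args: f"created: {args['title']}")
gateway.grant("ticket-triage", "jira.create_issue")

print(gateway.call("ticket-triage", "jira.create_issue", {"title": "Login fails"}))
# created: Login fails
```

When the Jira API changes, only the one registered handler needs updating; the agents keep calling the same tool name.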

The other pattern that mattered: the registry as the access control enforcement point. Before this, access control for agent-to-agent calls lived in application code — each agent decided for itself whether to accept a call. That's as reliable as it sounds. Moving access control to the registry layer means it's enforced centrally, consistently, and not dependent on each individual agent implementation being correct.

After the security audit incident, we evaluated a few options and landed on TrueFoundry's Agent Registry. Here's specifically what mattered.

Unified agent and MCP catalog. Every agent and every tool visible in one place. When the security team asks "which agents have access to the internal data API," the answer is a query, not a two-day investigation.
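Conceptually, "the answer is a query" means something like the lookup below. This is not TrueFoundry's API; the `grants` mapping and `agents_with_access` function are hypothetical, and the agent and tool names are invented.

```python
# Hypothetical catalog state: agent name -> tools it has been granted.
grants: dict[str, set[str]] = {
    "ticket-triage": {"jira", "slack"},
    "report-builder": {"internal-data-api"},
    "forecast-agent": {"internal-data-api", "slack"},
}


def agents_with_access(grants: dict[str, set[str]], tool: str) -> list[str]:
    """Return every agent granted access to the given tool, sorted by name."""
    return sorted(agent for agent, tools in grants.items() if tool in tools)


print(agents_with_access(grants, "internal-data-api"))
# ['forecast-agent', 'report-builder']
```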

Framework-agnostic registration. We have agents on LangGraph, one on CrewAI, and two custom HTTP services. The registry handles all of them through a standard registration interface. Once registered, governance policies apply regardless of what framework built the agent — the same RBAC rules, the same audit trail, the same access policies.

Live performance tracking. The registry shows each agent's success rate, average latency, and last error pulled from the observability layer. We set a routing rule: for production code changes, only route to agents with >90% success rate on the latest eval run. The registry enforces this automatically rather than requiring a human to check before deploying.
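The routing rule itself is simple enough to sketch. The `EvalResult` type, the agent names, and the numbers are made up; only the "greater than 90% on the latest eval run" threshold comes from our setup.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    agent: str
    success_rate: float  # fraction of passing cases on the latest eval run


def routable_agents(results: list[EvalResult], threshold: float = 0.90) -> list[str]:
    """Keep only agents whose latest success rate is strictly above the threshold."""
    return [r.agent for r in results if r.success_rate > threshold]


latest = [
    EvalResult("coder-v3", 0.94),
    EvalResult("coder-v2", 0.90),   # exactly at the threshold, so excluded
    EvalResult("coder-exp", 0.71),
]
print(routable_agents(latest))  # ['coder-v3']
```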

A2A communication via MCP. When an agent needs to call another agent, it goes through the registry. The registry checks whether the calling agent is authorized to invoke the target agent, handles the call, and logs the interaction with both agent identities. The over-privileged sub-agent problem — where a spawned agent inherits more permissions than it should — is closed at the registry layer.
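Here is a minimal sketch of that check-then-log flow, assuming an explicit allow-list of caller and target pairs. `allowed_calls`, `audit_log`, and `invoke_agent` are hypothetical names; a real registry would persist the log and resolve identities from credentials rather than trusting a string.

```python
from datetime import datetime, timezone
from typing import Any, Callable

# Hypothetical policy: explicit (caller, target) pairs. Nothing is inherited.
allowed_calls: set[tuple[str, str]] = {("support-orchestrator", "ticket-triage")}
audit_log: list[dict[str, str]] = []


def invoke_agent(caller: str, target: str,
                 handlers: dict[str, Callable[[dict[str, Any]], Any]],
                 payload: dict[str, Any]) -> Any:
    """Authorize the call, log both identities, then dispatch to the target."""
    authorized = (caller, target) in allowed_calls
    audit_log.append({
        "caller": caller,
        "target": target,
        "decision": "allow" if authorized else "deny",
        "at": datetime.now(timezone.utc).isoformat(),
    })
    if not authorized:
        raise PermissionError(f"{caller!r} is not authorized to invoke {target!r}")
    return handlers[target](payload)


handlers = {"ticket-triage": lambda p: {"label": "bug", "ticket": p["id"]}}
print(invoke_agent("support-orchestrator", "ticket-triage", handlers, {"id": "T-1"}))
# {'label': 'bug', 'ticket': 'T-1'}
```

Because permissions are keyed on the calling agent's own identity, a sub-agent spawned by `support-orchestrator` gets nothing unless it has its own entry. Denied calls are logged too, so the audit trail shows attempts as well as successes.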

The tradeoff: TrueFoundry is Kubernetes-native, so there's real infrastructure investment if you're not already on K8s. For a team of 5 with 3 agents, a YAML file is probably enough. The inflection point for us was around 10 agents across multiple teams with compliance requirements.

The honest answer: you need a registry before you think you do, and you only find out how much earlier once you've gone without one.

Some concrete signals:

What pushed you toward building or adopting a registry — and what does your current agent catalog look like? Curious whether most teams are still on the spreadsheet version or if the registry infrastructure has actually caught up to the agent deployment pace. Drop it in the comments.

source & further reading

dev.to: original article
