{"slug": "meet-pxi-the-ai-engineering-agent-inside-phoenix", "title": "Meet PXI: the AI engineering agent inside Phoenix", "summary": "Arize AI launched PXI (Phoenix Intelligence), an open-source AI engineering agent integrated into its Phoenix observability platform, at the Arize:Observe conference. PXI helps developers debug traces, build evaluators, optimize prompts, and run experiments using existing telemetry and documentation, reducing time to first useful result compared to general coding agents.", "body_md": "*Co-Authored by Mikyo King, Head of Open Source & Roger Yang, Software Engineer & Nancy Chauhan, Developer Relations Engineer & Anthony Powell, Open Source Engineer.*\n\nWe’re excited to talk about PXI (Phoenix Intelligence, pronounced “pixie”), the AI engineering agent built into Phoenix. We launched it at [Arize:Observe](https://arize.com/observe-2026/), our AI agent evals conference. PXI works alongside you, helping you understand traces, debug failures, create evaluators, optimize prompts, and run experiments. It’s open source, in beta, and you can [use it in Phoenix right now](https://arize.com/docs/phoenix/pxi).\n\n**TL;DR**\n\n- PXI is an AI engineering agent built into Phoenix.\n- It debugs traces, builds evaluators, optimizes prompts, and runs experiments from the context you are already viewing.\n- It works from your own telemetry, datasets, prompts, and docs, so you never paste context in.\n- It keeps you in the loop: it asks before it acts and stages changes for your approval.\n- PXI can capture its own runs as Phoenix traces you can audit.\n\n*PXI working in the Phoenix playground. You tell it what you want, and it writes the prompt, builds an evaluator, and runs the experiment for you.*\n\n**Why put an agent inside Phoenix?**\n\nYou probably already have a coding agent like Claude Code open in another window. 
Building a second one into your observability platform only makes sense if it clears a real bar.\n\nPXI has to be more useful than opening a general coding agent and pasting in a handful of links. A bare agent session can do almost anything, but it starts from zero every time, so the time, tokens, retries, and false starts add up. It does not know your data, your docs, or the vocabulary it needs to ask the right question.\n\nAn agent built into Phoenix starts with all of that. The documentation and your telemetry are already present, and Phoenix is already the system of record for your prompts, datasets, experiments, and annotations, so the agent has far less to rediscover and can build on what you did yesterday. PXI turns Phoenix from a place where you inspect traces, evaluations, prompts, and experiments into one where an agent uses that same context to help you debug, evaluate, and improve your application. What we cared about was time to a first useful result. An agent that is scoped to Phoenix, and can already see the page you are on, gets there quickly.\n\nWe built PXI as a harness and added capabilities to it one at a time, rather than as a fixed set of features. Which capabilities are in reach depends on where you are in Phoenix, so the agent can do different things on a trace than on a prompt or a dataset. Each section below walks through one of those capabilities. They all attach to the same structure, shown in the diagram below, which we return to fully populated at the end of the post.\n\n*How PXI is put together. A server does the reasoning, the Vercel Data Streaming Protocol connects the front end and back end, and tools run either in the browser or on the backend. Contexts tell the agent which page you are on. 
Every capability in this post attaches to this same structure, and the final section shows it fully populated.*\n\n**We started with a terminal**\n\nBefore we built a single purpose-built tool, we gave PXI a working shell.\n\nWe took `just-bash`, a bash emulator written in TypeScript, and ran it entirely in the browser. Inside it we mounted a virtual filesystem with two parts: `/phoenix`, a read-only directory holding the context of whatever page you are looking at, and `/home/user/workspace`, a writable scratch space the agent can use freely.\n\nGive a model a shell and a filesystem and it stops behaving like a chatbot and starts behaving like a coding agent: it explores, pipes output through `jq`, truncates long results, and writes intermediate files to keep track. Because the environment was real enough, it fell back on those habits on its own, without our teaching them.\n\nA shell on its own is only a sandbox. We made it useful by giving the shell a custom CLI binary, `phoenix-gql`, which queries the Phoenix GraphQL API directly. GraphQL is the main data path for the reads PXI makes through the browser shell: the same API the Phoenix UI calls is the one the agent queries, so those reads go through one contract. It also meant we did not have to build a separate permission system for them. `phoenix-gql` calls the same authenticated endpoint the Phoenix UI uses, so the agent reads under your own session, against exactly the data you can already see. By default it runs queries only, so PXI can read your data freely but cannot modify it.\n\n*The shell runs entirely in the browser. phoenix-gql queries the same authenticated GraphQL API the UI uses, under your session, so these reads need no permission system of their own.*\n\n*PXI answering a data question on a live project: 262 traces, 685 spans, 68% of them LLM spans. It also shows the commands it ran, two phoenix-gql queries and a jq aggregation. 
No tool for this existed; it used the shell.*\n\nBecause PXI always has a shell, a query client, and a scratch disk underneath it, it can reach any data it wants to see, even where we have not built a dedicated tool for a task, and trim that data down to fit its own context window. A shell and a CLI were enough for PXI to analyze and troubleshoot almost anything in Phoenix, and a lot of the later capabilities build on that.\n\n**One protocol between front end and back end**\n\nWe decided early that PXI’s front end and back end would talk through one streaming protocol, the Vercel Data Streaming Protocol. Because that protocol is the contract between them, PXI is not locked to any particular model or framework. Today it runs pydantic-ai on the server and the Vercel AI SDK on the client, but either side could change without breaking the other.\n\nTool calls travel over that same protocol. As the agent decides to call a tool, that intent streams to the browser as a typed chunk that says where the tool should run, on the client or on the server. That is what lets PXI reason on the server while some of its tools run in the browser.\n\n*PXI’s chat streaming rides on this one protocol. Today it is pydantic-ai and the Vercel AI SDK, and because the protocol is the contract between them, either end can be swapped, even the client. Each streamed tool chunk carries a tool_execution_environment flag (client or server), so PXI can reason on the server while some of its tools run in the browser.*\n\n**Giving PXI the docs**\n\nRight after the shell, we hit a domain modeling problem. PXI could query anything, but it did not yet understand the Phoenix domain: what a project is, what an evaluator is, how a span relates to a trace. 
A coding agent reads the README to learn a codebase; PXI needed the Phoenix docs.\n\nThe docs live in Mintlify, which serves them over a hosted MCP server, and PXI searches them with a single `search_phoenix` tool.\n\nA weekly cron audits the docs for gaps, so as the docs improve, what PXI retrieves improves with them.\n\n*The docs live in Mintlify and are served over a hosted MCP that handles retrieval. Because we keep the docs current, what PXI retrieves stays current with them.*\n\n**Caching the system prompt**\n\nPhoenix has a lot of surface area, and PXI’s instructions reflect it: a base persona, instructions for every tool, and whatever skills are loaded. That whole prompt goes to the model on every turn, so we rely on prompt caching to keep it efficient. The model reuses its cached work for the stable part of the prompt instead of reprocessing it each time, which keeps cost and latency down.\n\nPrompt caching only helps if the start of the prompt stays identical from turn to turn. We build the prompt on the backend from separate pieces, one for the base persona and one for each tool, and keep the stable pieces at the start. As long as that prefix does not change, the model can reuse the cached version instead of reprocessing it.\n\nWhen you load a skill mid-conversation, instead of splicing it into the system prompt and reshuffling everything after it, we append it at the end. The prefix before it stays identical, so the cache still holds, and you pay only the skill’s tokens rather than a re-read of the whole prompt. It is also why cache-read and cache-write counts appear on every PXI trace later.\n\n*Loading a skill appends it at the end, so the prefix in front of it stays identical and the cache still holds. You pay only the skill’s tokens.*\n\n**How PXI sees and acts on the Phoenix UI**\n\nA coding agent works on the files in front of it. 
PXI works on the Phoenix UI you are looking at, so it needs a way to see what is on the page and act on it.\n\n**Seeing the page: the context protocol**\n\nAs you move around Phoenix, the active context changes with what you are looking at: a project, a trace, a span, the playground, a dataset. The agent always receives the context of the page you’re on, which lets it interpret references like “this trace” or “that experiment” without you having to explain them. Each context also exposes the actions available on that page as tools the agent can call. When you navigate away, that context and its tools go away with it.\n\n**Acting on the page: dispatching into the UI’s state**\n\nIf contexts are how PXI sees the page, dispatching into the store is how it acts on it. Phoenix’s front-end state lives in a Zustand store, the small library that holds the app’s UI state. Rather than building a separate command channel between PXI and every widget, the agent’s tools dispatch the same actions the UI already dispatches. Say “set the time range to the last five minutes,” and PXI calls a tool that dispatches `setTimeRange` into the store. The time-range selector updates because it was already subscribed to that store. We did not wire up any two-way communication; the UI reacts to state changes the way it always has, and the agent became one more source of those changes.\n\nStack enough of those tools together and the agent can drive the product end to end: open the evaluator slide-over, write the evaluator code, run the test, read the results, adjust, and save. It operates the same interface you do.\n\n*PXI’s tools dispatch into the same store the UI uses. There is no two-way wiring: one arrow into the store, and the subscribed UI re-renders on its own.*\n\n**Generative UI: when text isn’t the best answer**\n\nSometimes the clearest reply is not prose. PXI can render a chart inline in the chat, a bar or line chart of what it found, instead of a wall of numbers. 
Rather than describe a result and point you at a screen, the agent shows it where you are already looking. When the best response is visual, it renders a chart; when the task requires changing the UI, it takes the action directly.\n\n*Ask for a trend and PXI renders the chart inline, right in the conversation, instead of handing back a wall of numbers.*\n\n**An agent you can trust to act**\n\nOnce an agent can drive your UI and write to your data, acting without asking is not acceptable. We built two ways for PXI to pause before it changes anything.\n\n**Elicitation: ask before you act**\n\nWe gave PXI a tool whose only job is to ask a clarifying question. If you say “make me an evaluator,” a good engineer asks “an evaluator for what, and graded how?” before writing a line. Elicitation lets the agent do the same: it pauses, surfaces a small set of choices in the UI, and waits for your answer before committing to a direction.\n\n*Ask for something underspecified and PXI stops to ask. A vague “make me an evaluator” triggers an ask_user elicitation: a three-question flow with multiple-choice options, awaiting your answer before it commits to a direction.*\n\n**Permissions: acknowledge before you commit**\n\nBy default, PXI does not commit persistent changes on its own. When it wants to make a real change, such as editing a prompt, writing an evaluator, or annotating a batch of spans, it stages the change and shows you a diff, and commits only after you approve. That approval is a setting you control: each request defaults to manual approval, and you can switch it to bypass in PXI’s settings once you trust it.\n\nThis is also where the split between server and browser matters. The agent reasons on the server, but the change is committed in the browser, in front of you, with your approval.\n\n**PXI is traced by Phoenix**\n\nWhen you launch an agent, you have no data on how people use it, which makes it hard to know what to improve. 
Our answer was to point Phoenix at PXI.\n\nPXI can capture conversations as Phoenix traces, controlled by both system settings and per-browser preferences. By default the system settings allow neither local persistence nor remote export, so an administrator turns it on and each user opts in. When recording is enabled, every turn PXI takes becomes a Phoenix trace: every tool call, every model call, and every token (input, output, cache reads, cache writes) is instrumented following OpenInference conventions and shows up as spans you can open and inspect. When you give a PXI response a thumbs-up, that feedback flows straight onto the trace. The agent we built to debug your LLM app is itself a fully observable LLM app.\n\nThat closed a loop we needed. With no usage data at launch, the team reviewed PXI’s own traces by hand to see where it was going wrong. And because we work with a coding agent of our own, the workflow was tight: when PXI got something wrong, we would copy the trace, hand it to the coding agent, and have it fix the bug.\n\n*A PXI run opened as a Phoenix trace in the pxi_phoenix_cloud project. The span tree shows PXIAgent.iter calling render_generative_ui and claude-opus-4-6, with total cost ($0.34), latency (24.7s), and the run’s full input and output.*\n\n**Composable by design: skills, sub-agents, tools**\n\nEven with a cached prompt, we cannot fit all of Phoenix into it, so we made PXI’s knowledge and capabilities composable, loaded on demand. Each piece below could be its own post, so here is the short version.\n\n**Tools**\n\nBy now you have seen most of PXI’s tool surface, spread across the sections above. Pulled together, PXI has bash tools (the in-browser shell and phoenix-gql), client tools that dispatch into the UI’s state, server and MCP tools (the docs search and Phoenix’s own MCP), the elicitation and approval tools that keep a human in the loop, and tools loaded on demand by skills. 
A single metadata flag on each one decides whether it runs in the browser or on the server, so they all compose under one model even though some act on your screen and some act on the backend.\n\n**Skills**\n\nSkills are bundles of task-specific instructions: how to do open coding on traces, how to build an evaluator, how to optimize a prompt. They are plain markdown with frontmatter, and PXI loads them two ways: automatically, when a task looks relevant, and explicitly, when you type `/skill-name` the way you would in a coding agent. Because they are plain files, you can read exactly what each skill loads into the context window.\n\n**Sub-agents**\n\nWhen you are sifting through a lot of data, the orchestrator’s context window fills up and quality degrades. So PXI can spawn sub-agents, each with a fresh context window and a narrow task, that report a distilled answer back to the main agent. In one run, we watched it decompose “figure out the topology of my traces” into three specialized sub-agents on its own. It is new, and it is the start of the long-horizon work we are most interested in.\n\n**Web access**\n\nPXI can also reach the open web, not through a scraping service but through the models’ own native web capabilities, behind a setting. Point it at a prompting guide, tell it to write a prompt that follows those principles, and it will read the guide and do it. The capability comes from web access plus a good guide, with no prompt-optimization feature behind it.\n\n**Why the order was the point**\n\nWe built PXI’s primitives before any of its features, and those primitives are what made the opening run possible. PXI could optimize a prompt from an empty playground, discover the schema, build evaluators, and sleep while it waited for an experiment, even though we never built that workflow. The primitives are a real shell, a query client, a scratch disk, and the docs. 
Everything above them is a convenience, and when one is missing, the agent does not get stuck; it falls back to the primitives and works it out.\n\nThe same pattern repeats whenever PXI reaches past what we built: new kinds of capability appear from prompting alone, with no feature behind them. When a prompt turns out to be a reliable workflow, we promote it into a first-class skill or tool. The base layer makes the agent capable from day one; turning a proven workflow into a tool makes that task faster.\n\nAll of it, the primitives and the capabilities layered on top, fits together like this:\n\n*The full system in one view. The browser runs the shell and tools and holds the page context; the server assembles the prompt, defines the tools (including MCP), and hosts the skills; the external services are the LLM provider you bring and the Mintlify docs MCP, which exposes search_phoenix. The chat streams between browser and server over the Vercel Data Streaming Protocol, and runs can be traced by Phoenix when recording is on.*\n\n**Open, hackable, and yours to run**\n\nPXI ships with a curated set of frontier models, including claude-opus-4-8, gpt-5.5, and gemini-3.5-flash, as sensible defaults. You can also bring your own key, point it at a different provider, or run it against a local model. PXI leans hard on tool calling, so a local model’s results will vary, but nothing stops you from trying, and if you make it work, send the system-prompt changes upstream. Because it is open source, you can inspect PXI, configure it, or disable it entirely.\n\nPhoenix is an open source platform for AI engineering, observability, and evaluation. PXI builds on the same foundation: traces, evals, prompts, datasets, annotations, feedback, GraphQL, the CLI, and OpenInference conventions, all available in the open.\n\nPXI is in beta today. It will make mistakes, and long-running sessions still need better context management. 
Even so, the context humans use to debug AI systems is becoming context agents can use to help improve them.\n\nWant to see it in action? Here’s the full walkthrough, where PXI investigates a real agent.\n\n*If you’re building around these problems, we’d love to hear what you try, what breaks, and what you want PXI to do next.*", "url": "https://wpnews.pro/news/meet-pxi-the-ai-engineering-agent-inside-phoenix", "canonical_source": "https://arize.com/blog/meet-pxi/", "published_at": "2026-06-18 22:00:18+00:00", "updated_at": "2026-06-18 22:04:20.507519+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "ai-tools", "ai-research"], "entities": ["Arize AI", "Phoenix", "PXI", "Mikyo King", "Roger Yang", "Nancy Chauhan", "Anthony Powell", "Claude Code"], "alternates": {"html": "https://wpnews.pro/news/meet-pxi-the-ai-engineering-agent-inside-phoenix", "markdown": "https://wpnews.pro/news/meet-pxi-the-ai-engineering-agent-inside-phoenix.md", "text": "https://wpnews.pro/news/meet-pxi-the-ai-engineering-agent-inside-phoenix.txt", "jsonld": "https://wpnews.pro/news/meet-pxi-the-ai-engineering-agent-inside-phoenix.jsonld"}}