{"slug": "terminal-apps-need-a-dom", "title": "Terminal Apps Need a DOM", "summary": "ConductorOne released agent-tui, an open-source tool that gives terminal applications a queryable DOM-like interface for AI agents. The tool runs programs on a PTY, exposes rendered screens as text with stable references, and enables agents to snapshot, type, and wait for named states. It solves the problem of AI tools needing to parse terminal output designed for humans.", "body_md": "When we were building [Squire](https://www.c1.ai/blog/squire-agentic-first-ephemeral-dev-environments-at-conductorone), C1's software factory, we hit a slightly absurd problem: the AI tools were also built for humans.\n\nSquire could give an agent work. But Claude Code, Codex, Pi, and similar AI harnesses present themselves as terminal apps first. Their live interface is a TUI made for a person: a prompt, a streaming response, approval screens, file-change panes, and a cursor waiting for the next instruction. Another agent can type into that interface. It still needs to know whether the response is done, whether an approval screen appeared, or whether the cursor has returned to the prompt.\n\nThat is the problem agent-tui solves. It runs the target program on a PTY, keeps the terminal state alive in a daemon, exposes the rendered screen as text or an outline with stable refs, and lets a client snapshot, press keys, and wait for named screen state. It gives terminal apps the same kind of queryable surface that made browser automation useful.\n\nagent-tui is open source. We are publishing it under the Apache-2.0 license at [github.com/ConductorOne/agent-tui](https://github.com/ConductorOne/agent-tui). The design comes from our experience with [agent-browser](https://github.com/vercel-labs/agent-browser): give the agent something it can query instead of a pile of pixels. agent-tui applies that idea to terminal apps.\n\nOne common Squire pattern is an orchestration agent: a coding harness receives the assignment, then drives another harness to do the work. 
In this demo, OpenAI's Codex uses agent-tui to drive the [Pi harness](https://pi.dev/) through a real terminal session.\n\n## Agents driving agents\n\nagent-tui starts the outer Codex TUI, waits for `@codex.input`, types the task, and presses enter. Codex then runs the command sequence below, which starts a second agent-tui daemon around Pi.\n\nThe Pi side is just another agent-tui session:\n\n```\nagent-tui daemon run\nagent-tui spawn -- pi --offline --no-extensions --no-context-files --no-skills\nagent-tui wait --ref '@pi.input'\nagent-tui type --to '@pi.input' 'Reply with the token formed by joining INNER, AGENT, and OK with `_`.'\nagent-tui press --to '@pi.input' '<cr>'\nagent-tui snapshot --mode text\n```\n\nThe result is two live screens: one where Codex receives the task, and one where Pi answers inside the nested session.\n\nVercel's [AI SDK harnesses](https://ai-sdk.dev/v7/docs/ai-sdk-harnesses/overview) frame agent CLIs as provider-specific surfaces, not one generic wrapper. agent-tui takes the same approach to terminal screens: keep each app's shape, then expose the parts an agent can query.\n\n## Why terminal apps need structure\n\nMost useful terminal programs were built for humans, not machines.\n\n`htop`, `vim`, `lazygit`, `psql`, language REPLs, and newer agent CLIs such as Claude Code and Codex have different jobs. They share one automation problem: the live interface is a terminal session. It owns a PTY, redraws a grid of cells, and expects a person to infer what changed. Some tools expose batch modes. Many do not. Even when a batch mode exists, it is often a different interface from the human session you need to observe, interrupt, or steer.\n\nAn agent can write bytes to a terminal easily. The hard part is knowing what happened after those bytes landed. 
A full-screen program may repaint in place, move the cursor, enter an alternate screen, update one field, and never print a clean line that says \"ready.\"\n\nThe usual choices make bad contracts with the terminal. Escape-sequence parsing treats the byte stream as the API. Rendered-text scraping throws away state. Sleeping between keystrokes punts the problem to the scheduler, which means the script works until CI is slow or one prompt lands in a state you did not match.\n\n## Give the terminal a DOM\n\nVim is a good stress test: it is a full-screen editor, not a command that prints lines. Here agent-tui drives a real Vim session through refs instead of sleeps.\n\nIn the recording, the left pane issues agent-tui commands and the right pane is the Vim PTY. The driver waits for the buffer, reads the mode, enters insert mode, writes `hello world`, saves `hello-world.txt`, and checks the file contents.\n\nRaw terminal text is a bad handle for this job. agent-tui exposes an outline: a tree of screen regions with roles and stable refs. A ref is a name for something on the screen. It lets an agent say \"the Vim mode indicator\" instead of \"row 24, column 1.\"\n\n```\nagent-tui spawn -- vim notes.md\nagent-tui wait --ref '@vim.buffer'        # vim has rendered\nagent-tui --json snapshot --select '@vim.mode' | jq -c '.data.outline.nodes[0]'\n{\"durable\":true,\"ref\":\"@vim.mode\",\"role\":\"mode\",\"value\":\"normal\"}\n```\n\n`@vim.mode` is durable. It names the same part of the screen whether the value is `normal`, `insert`, or something else. Refs can be queried with a small selector language: `[role=buffer][focused]`, `@vim.mode[value=insert]`, `@tmux.pane[%2]`.\n\nThe built-in `vim` and shell adapters emit named refs such as `@vim.mode` and `@shell.prompt`. A TOML manifest can teach agent-tui the regions of another app without adding Rust code. 
With no adapter at all, the generic adapter still groups the screen into coarse regions and gives them refs:\n\n```\nagent-tui --json snapshot --mode outline | jq -c '[.data.outline.nodes[] | {ref, role}]'\n[{\"ref\":\"@e1\",\"role\":\"meters\"},{\"ref\":\"@e2\",\"role\":\"table\"},{\"ref\":\"@e3\",\"role\":\"footer\"}]\n```\n\nWe patterned this after browser automation, where [agent-browser](https://github.com/vercel-labs/agent-browser) has worked well for our own agents. A browser agent should not click pixel 412, 308; it should click the button. A terminal agent should not depend on a fixed row when agent-tui can name the mode, prompt, focused pane, or table.\n\n## Wait for screen state\n\nA snapshot only helps if you know when to take it.\n\n`sleep 0.2` is a guess about scheduling, terminal redraw, and the program under test. It will be too long on a fast machine and too short on a loaded runner. Worse, it is not connected to the state you care about.\n\nThe `wait` subcommand is tied to screen state. It can wait for a ref to appear, for a ref to disappear, for a selector value, for a regex, for an event sequence, or for the child process to exit. For screens with no named ref yet, a client can take a snapshot, keep its screen hash, send input, and wait until the rendered grid changes.\n\nHere is a complete `vim` edit with no sleep:\n\n```\nagent-tui spawn -- vim todo.txt\nagent-tui wait --ref '@vim.buffer'              # 1. wait for the buffer to exist\nagent-tui press i                               # 2. enter insert mode\nagent-tui wait --ref '@vim.mode[value=insert]'  # 3. wait for the mode to flip\nagent-tui type 'review the draft'               # 4. type\nagent-tui press '<esc>'                         # 5. leave insert\nagent-tui wait --ref '@vim.mode[value=normal]'  # 6. wait for the mode to flip back\nagent-tui press ':wq<cr>'                       # 7. 
save and quit\n```\n\nEach `wait` blocks until the next observable transition. After `press i`, the script does not assume vim is ready for text. It waits until the parsed mode is `insert`. After `<esc>`, it waits until the mode is `normal` again.\n\nRefs avoid a common false positive. A regex can match the literal word `insert` in the buffer. A wait on `@vim.mode[value=insert]` watches Vim's parsed mode field. It is not looking at arbitrary screen text.\n\nYou can wait for absence too. `wait --ref '@vim.cmdline[focused]' --gone` blocks until the command prompt closes. For terminal tests, that is usually the difference between \"probably done\" and \"the UI state changed.\"\n\n## Fallback to rendered text\n\nNot every app has an adapter or a useful state signal. `htop` is a good example: it has no JSON mode, and the useful output is often just the rendered screen.\n\n```\nagent-tui spawn -- htop\nagent-tui wait --idle 500\nagent-tui --json snapshot --mode text | jq -r .data.text\n0[         0.0%]   4[         0.0%]   8[          0.0%] 12[          0.0%]\n    1[******* 100.0%]   5[         0.0%]   9[          0.0%] 13[          0.0%]\n  Mem[|||||#*@@@@@@@@@@@@@@@16.0G/124G] Tasks: 29, 117 thr, 0 kthr; 7 running\n  Swp[                           0K/0K] Load average: 0.68 0.82 0.69\n\n    PID USER       PRI  NI  VIRT   RES   SHR S  CPU%-MEM%   TIME+  Command\n  50386 user        20   0  4900  3336  2456 R 160.0  0.0  0:00.02 htop\n```\n\nUse `wait --idle 500` when an app has no better signal. It waits for the screen to stop changing, so it is still tied to terminal output instead of a fixed delay after input. Use refs and selectors when the app has structure. 
Use idle when it does not.\n\n## What about tmux and expect?\n\nThe obvious question is whether this is just `tmux send-keys` plus `capture-pane`, or a wrapper around `expect`.\n\nThey are useful, but they stop at a lower level.\n\n`tmux capture-pane` gives you text from the rendered grid. It does not give you roles, named regions, or durable handles. `tmux send-keys` can write input, but it has no opinion about what screen state should follow.\n\n`expect` is line-oriented. It is excellent for programs that print prompts and lines. It is a poor fit for a full-screen ncurses app that repaints a cell grid in place. There is no line that says \"vim's mode indicator is now insert.\" The information is on the screen, but it is not in the stream in the form `expect` wants.\n\nagent-tui sits above the byte stream and below the agent. It reads the terminal state, assigns names to parts of the screen, and lets the caller wait on those names.\n\n## Use stdout when stdout is enough\n\nNot every command needs a live terminal. If a program already has a non-interactive mode, the right interface is still stdin, stdout, stderr, and an exit code.\n\nThe `run` subcommand exists for commands that should return data. It gives agents a typed, logged wrapper for those calls while PTY automation stays focused on live screens.\n\n```\nagent-tui run -- gh api /repos/ConductorOne/agent-tui \\\n  --jq '{repo: .full_name, lang: .language, default_branch: .default_branch}'\n{\"default_branch\": \"main\", \"lang\": \"Rust\", \"repo\": \"ConductorOne/agent-tui\"}\n```\n\nThe result is plain data. It can be piped into `jq`, fed to another step, or asserted on in a test. `run` also fronts AI CLIs that expose non-interactive modes (`claude -p`, `codex exec`, `pi --print`, `opencode run`), so one model's answer can become another step's input without screen-scraping a prompt. 
`ask` is a short wrapper over that path.\n\nLive AI CLI sessions have both paths. For a one-shot answer, use the data path. For the human terminal session, use the PTY path. Provider manifests expose the prompt and response as screen regions:\n\n```\nagent-tui spawn -- claude\nagent-tui wait --ref '@claude.input[focused]'\nagent-tui type --to '@claude.input' 'write a jq filter for this JSON'\nagent-tui press --to '@claude.input' '<cr>'\nagent-tui wait --ref '@claude.response[name~=/jq/]'\n```\n\nThe adapter names the prompt and response from rendered cells. It does not read a provider transcript API. Knowing that a streaming answer is final across every AI CLI still needs provider-specific events or side channels. The split matters because it keeps the two cases separate: use `run` when the child is already a data-producing process, and use `spawn`, `snapshot`, `press`, and `wait` when the child is an interactive screen.\n\n## Capture artifacts\n\nA live session also produces files. The daemon records each pane to asciicast-v3 under `$XDG_STATE_HOME/agent-tui/<session>/<pane>.cast`. That file works with the [asciinema](https://www.asciinema.org/) ecosystem: `asciinema play` can play it, and renderers such as `agg` can turn it into a GIF for docs.\n\nagent-tui uses the same cast as a test input:\n\n```\ncast=\"${XDG_STATE_HOME:-$HOME/.local/state}/agent-tui/default/p1.cast\"\nasciinema play \"$cast\"\nagent-tui replay \"$cast\" --expect-snapshot expected.snap\n```\n\n`replay` does not start the original program. It feeds the recorded output bytes into a fresh terminal engine and compares the rendered snapshot. A demo session can become a regression test input.\n\nFor screenshots, `snapshot` can render the current grid to PNG. 
`--annotate` draws boxes and labels for matching refs; `--chrome` adds a frame for a README or blog image.\n\n```\nagent-tui snapshot --mode outline \\\n  --png vim-mode.png \\\n  --annotate '@vim.*' \\\n  --chrome 'vim todo.txt'\n```\n\nUse the cast when time matters. Use the PNG when the current frame matters, optionally with the screen regions named on top of it.\n\n## Design choices\n\n**The daemon owns the PTY.** `agent-tui spawn` starts the program under a daemon that serves screen state over a Unix socket. Later CLI calls connect to that socket, run one action, print a result, and exit. The terminal process keeps running between calls, so an agent can drive the same session one step at a time: snapshot, act, wait, snapshot again. Separate sessions isolate parallel work.\n\n**Stable refs and selectors.** Coordinates are easy to produce and hard to trust. A resize, a theme change, or a status line can move the thing you meant. Refs give the model and the test harness a stable vocabulary: `@vim.mode[value=insert]`, `@shell.prompt`, `[role=buffer][focused]`. New apps can be taught with a TOML manifest that maps screen regions to roles.\n\n## A machine interface for PTY apps\n\nA terminal program is already a state machine. The problem is that the state is presented as a screen.\n\nagent-tui gives agents a way to observe that screen, name parts of it, act on it, and wait for it to change. It does not require every program to grow a `--json` flag. It does not pretend a cell grid is a clean API. It gives the grid enough structure for a script to wait on facts instead of timing.\n\nTerminal apps can be machine-readable and still work for humans. agent-tui is open source at [github.com/ConductorOne/agent-tui](https://github.com/ConductorOne/agent-tui). 
Bring the terminal apps your agents struggle with: adapters, bugs, and PRs are welcome.", "url": "https://wpnews.pro/news/terminal-apps-need-a-dom", "canonical_source": "https://www.c1.ai/engineering/agent-tui-structured-terminal-access-for-ai-agents", "published_at": "2026-07-01 06:55:19+00:00", "updated_at": "2026-07-01 07:19:42.484201+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "ai-tools"], "entities": ["ConductorOne", "Squire", "Claude Code", "Codex", "Pi", "agent-tui", "Vercel", "OpenAI"], "alternates": {"html": "https://wpnews.pro/news/terminal-apps-need-a-dom", "markdown": "https://wpnews.pro/news/terminal-apps-need-a-dom.md", "text": "https://wpnews.pro/news/terminal-apps-need-a-dom.txt", "jsonld": "https://wpnews.pro/news/terminal-apps-need-a-dom.jsonld"}}