# Terminal Apps Need a DOM

> Source: <https://www.c1.ai/engineering/agent-tui-structured-terminal-access-for-ai-agents>
> Published: 2026-07-01 06:55:19+00:00

When we were building [Squire](https://www.c1.ai/blog/squire-agentic-first-ephemeral-dev-environments-at-conductorone), C1's software factory, we hit a slightly absurd problem: the AI tools were also built for humans.

Squire could give an agent work. But Claude Code, Codex, Pi, and similar AI harnesses present themselves as terminal apps first. Their live interface is a TUI made for a person: a prompt, a streaming response, approval screens, file-change panes, and a cursor waiting for the next instruction. Another agent can type into that interface. It still needs to know whether the response is done, whether an approval screen appeared, or whether the cursor has returned to the prompt.

That is the problem agent-tui solves. It runs the target program on a PTY, keeps the terminal state alive in a daemon, exposes the rendered screen as text or an outline with stable refs, and lets a client snapshot, press keys, and wait for named screen state. It gives terminal apps the same kind of queryable surface that made browser automation useful.

agent-tui is open source. We are publishing it under the Apache-2.0 license at [github.com/ConductorOne/agent-tui](https://github.com/ConductorOne/agent-tui). The design comes from our experience with [agent-browser](https://github.com/vercel-labs/agent-browser): give the agent something it can query instead of a pile of pixels. agent-tui applies that idea to terminal apps.

One common Squire pattern is an orchestration agent: a coding harness receives the assignment, then drives another harness to do the work. In this demo, OpenAI's Codex uses agent-tui to drive the [Pi harness](https://pi.dev/) through a real terminal session.

## Agents driving agents

agent-tui starts the outer Codex TUI, waits for `@codex.input`, types the task, and presses enter. Codex then runs the command sequence below, which starts a second agent-tui daemon around Pi.

The Pi side is just another agent-tui session:

```
agent-tui daemon run
agent-tui spawn -- pi --offline --no-extensions --no-context-files --no-skills
agent-tui wait --ref '@pi.input'
agent-tui type --to '@pi.input' 'Reply with the token formed by joining INNER, AGENT, and OK with `_`.'
agent-tui press --to '@pi.input' '<cr>'
agent-tui snapshot --mode text
```

The result is two live screens: one where Codex receives the task, and one where Pi answers inside the nested session.

Vercel's [AI SDK harnesses](https://ai-sdk.dev/v7/docs/ai-sdk-harnesses/overview) frame agent CLIs as provider-specific surfaces, not one generic wrapper. agent-tui takes the same approach to terminal screens: keep each app's shape, then expose the parts an agent can query.

## Why terminal apps need structure

Most useful terminal programs were built for humans, not machines.

`htop`, `vim`, `lazygit`, `psql`, language REPLs, and newer agent CLIs such as Claude Code and Codex have different jobs. They share one automation problem: the live interface is a terminal session. It owns a PTY, redraws a grid of cells, and expects a person to infer what changed. Some tools expose batch modes. Many do not. Even when a batch mode exists, it is often a different interface from the human session you need to observe, interrupt, or steer.

An agent can write bytes to a terminal easily. The hard part is knowing what happened after those bytes landed. A full-screen program may repaint in place, move the cursor, enter an alternate screen, update one field, and never print a clean line that says "ready."

The usual choices make bad contracts with the terminal. Escape-sequence parsing treats the byte stream as the API. Rendered-text scraping throws away state. Sleeping between keystrokes punts the problem to the scheduler, which means the script works until CI is slow or one prompt lands in a state you did not match.

## Give the terminal a DOM

Vim is a good stress test: it is a full-screen editor, not a command that prints lines. Here agent-tui drives a real Vim session through refs instead of sleeps.

In the recording, the left pane issues agent-tui commands and the right pane is the Vim PTY. The driver waits for the buffer, reads the mode, enters insert mode, writes `hello world`, saves `hello-world.txt`, and checks the file contents.

Raw terminal text is a bad handle for this job. agent-tui exposes an outline: a tree of screen regions with roles and stable refs. A ref is a name for something on the screen. It lets an agent say "the Vim mode indicator" instead of "row 24, column 1."

```
agent-tui spawn -- vim notes.md
agent-tui wait --ref '@vim.buffer'        # vim has rendered
agent-tui --json snapshot --select '@vim.mode' | jq -c '.data.outline.nodes[0]'
{"durable":true,"ref":"@vim.mode","role":"mode","value":"normal"}
```

`@vim.mode` is durable. It names the same part of the screen whether the value is `normal`, `insert`, or something else. Refs can be queried with a small selector language: `[role=buffer][focused]`, `@vim.mode[value=insert]`, `@tmux.pane[%2]`.

The built-in `vim` and shell adapters emit named refs such as `@vim.mode` and `@shell.prompt`. A TOML manifest can teach agent-tui the regions of another app without adding Rust code. With no adapter at all, the generic adapter still groups the screen into coarse regions and gives them refs:

```
agent-tui --json snapshot --mode outline | jq -c '[.data.outline.nodes[] | {ref, role}]'
[{"ref":"@e1","role":"meters"},{"ref":"@e2","role":"table"},{"ref":"@e3","role":"footer"}]
```
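The generic refs carry no app-specific meaning, so a script may prefer to look a region up by its role instead of hard-coding `@e2`. A minimal sketch, assuming the JSON shape shown above (`.data.outline.nodes[]` with `ref` and `role` fields) and that `jq` is installed. `refs_for_role` is a hypothetical helper name, not part of agent-tui:

```shell
# refs_for_role ROLE: print the ref of every outline node whose role is ROLE.
# Assumes the outline JSON shape shown in the article; the helper name is ours.
refs_for_role() {
  agent-tui --json snapshot --mode outline |
    jq -r --arg role "$1" '.data.outline.nodes[] | select(.role == $role) | .ref'
}
```

With the `htop` outline above, `refs_for_role table` would print `@e2`. An unknown role prints nothing, which a caller can treat as "region not on screen."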

We patterned this after browser automation, where [agent-browser](https://github.com/vercel-labs/agent-browser) has worked well for our own agents. A browser agent should not click pixel 412, 308; it should click the button. A terminal agent should not depend on a fixed row when agent-tui can name the mode, prompt, focused pane, or table.

## Wait for screen state

A snapshot only helps if you know when to take it.

`sleep 0.2` is a guess about scheduling, terminal redraw, and the program under test. It will be too long on a fast machine and too short on a loaded runner. Worse, it is not connected to the state you care about.

The wait subcommand is tied to screen state. It can wait for a ref to appear, for a ref to disappear, for a selector value, for a regex, for an event sequence, or for the child process to exit. For screens with no named ref yet, a client can take a snapshot, keep its screen hash, send input, and wait until the rendered grid changes.
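The snapshot-and-compare idea for screens without refs can be approximated on the client side. This is a sketch under stated assumptions: it checksums the output of `agent-tui snapshot --mode text` with `cksum` as a stand-in for agent-tui's own screen hash, whose field name and wait flag the article does not show. `screen_sum` and `wait_for_change` are hypothetical helper names. The short pause only paces the polling; the loop exits when the rendered text changes, not when a timer expires.

```shell
# screen_sum: checksum of the currently rendered text.
# cksum is a client-side stand-in for the screen hash mentioned above.
screen_sum() {
  agent-tui snapshot --mode text | cksum
}

# wait_for_change BASELINE [TIMEOUT_SECONDS]
# Returns 0 as soon as the screen checksum differs from BASELINE,
# or 1 if it never changes within the timeout (default 10 seconds).
wait_for_change() {
  baseline=$1
  tries=$(( ${2:-10} * 10 ))          # one check every 0.1 seconds
  while [ "$tries" -gt 0 ]; do
    [ "$(screen_sum)" != "$baseline" ] && return 0
    sleep 0.1
    tries=$((tries - 1))
  done
  return 1
}
```

A caller would record `before=$(screen_sum)`, send input with `agent-tui press`, and then run `wait_for_change "$before"`. When agent-tui's built-in wait can express the condition, prefer it over this loop.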

Here is a complete `vim` edit with no sleep:

```
agent-tui spawn -- vim todo.txt
agent-tui wait --ref '@vim.buffer'              # 1. wait for the buffer to exist
agent-tui press i                               # 2. enter insert mode
agent-tui wait --ref '@vim.mode[value=insert]'  # 3. wait for the mode to flip
agent-tui type 'review the draft'               # 4. type
agent-tui press '<esc>'                          # 5. leave insert
agent-tui wait --ref '@vim.mode[value=normal]'  # 6. wait for the mode to flip back
agent-tui press ':wq<cr>'                        # 7. save and quit
```

Each command waits for the next observable transition. After `press i`, the script does not assume vim is ready for text. It waits until the parsed mode is `insert`. After `<esc>`, it waits until the mode is `normal` again.

Refs avoid a common false positive. A regex can match the literal word `insert` in the buffer. A wait on `@vim.mode[value=insert]` watches Vim's parsed mode field. It is not looking at arbitrary screen text.

You can wait for absence too. `wait --ref '@vim.cmdline[focused]' --gone` blocks until the command prompt closes. For terminal tests, that is usually the difference between "probably done" and "the UI state changed."
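The act-then-wait pairing in the Vim walkthrough can be folded into a small helper so that no keypress is sent without a matching wait. A sketch using only the `press`, `type`, and `wait --ref` forms shown above; `press_and_wait` and `insert_text` are hypothetical names, and the `&&` chaining assumes agent-tui exits non-zero when a wait fails, which the article does not state:

```shell
# press_and_wait KEY SELECTOR: send one key, then block until SELECTOR matches.
press_and_wait() {
  agent-tui press "$1" && agent-tui wait --ref "$2"
}

# insert_text TEXT: steps 2 through 6 of the Vim edit as one call.
# Stops at the first failing step (assumes non-zero exit on failure).
insert_text() {
  press_and_wait i '@vim.mode[value=insert]' &&
    agent-tui type "$1" &&
    press_and_wait '<esc>' '@vim.mode[value=normal]'
}
```

After `agent-tui wait --ref '@vim.buffer'`, a script could call `insert_text 'review the draft'` and then save with `agent-tui press ':wq<cr>'`.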

## Fallback to rendered text

Not every app has an adapter or a useful state signal. `htop` is a good example: it has no JSON mode, and the useful output is often just the rendered screen.

```
agent-tui spawn -- htop
agent-tui wait --idle 500
agent-tui --json snapshot --mode text | jq -r .data.text
    0[         0.0%]   4[         0.0%]   8[          0.0%] 12[          0.0%]
    1[******* 100.0%]   5[         0.0%]   9[          0.0%] 13[          0.0%]
  Mem[|||||#*@@@@@@@@@@@@@@@16.0G/124G] Tasks: 29, 117 thr, 0 kthr; 7 running
  Swp[                           0K/0K] Load average: 0.68 0.82 0.69

    PID USER       PRI  NI  VIRT   RES   SHR S  CPU%-MEM%   TIME+  Command
  50386 user        20   0  4900  3336  2456 R 160.0  0.0  0:00.02 htop
```

Use `wait --idle 500` when an app has no better signal. It waits for the screen to stop changing, so it is still tied to terminal output instead of a fixed delay after input. Use refs and selectors when the app has structure. Use idle when it does not.
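That rule of thumb can be written down as one helper that prefers a ref and falls back to idle. A sketch using only the `wait --ref`, `wait --idle 500`, and `snapshot --mode text` forms shown in this article; `screen_when_ready` is a hypothetical name, and the early returns assume a failed wait exits non-zero:

```shell
# screen_when_ready [SELECTOR]: print the rendered text once the app is ready.
# With a selector, wait on structure; without one, wait for the screen to settle.
screen_when_ready() {
  if [ -n "${1:-}" ]; then
    agent-tui wait --ref "$1" || return 1     # structured signal
  else
    agent-tui wait --idle 500 || return 1     # fallback: screen stopped changing
  fi
  agent-tui snapshot --mode text
}
```

For a shell session that would be `screen_when_ready '@shell.prompt'`; for `htop`, plain `screen_when_ready`.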

## What about tmux and expect?

The obvious question is whether this is just `tmux send-keys` plus `capture-pane`, or a wrapper around `expect`.

They are useful, but they stop at a lower level.

`tmux capture-pane` gives you text from the rendered grid. It does not give you roles, named regions, or durable handles. `tmux send-keys` can write input, but it has no opinion about what screen state should follow.

`expect` is line-oriented. It is excellent for programs that print prompts and lines. It is a poor fit for a full-screen ncurses app that repaints a cell grid in place. There is no line that says "vim's mode indicator is now insert." The information is on the screen, but it is not in the stream in the form `expect` wants.

agent-tui sits above the byte stream and below the agent. It reads the terminal state, assigns names to parts of the screen, and lets the caller wait on those names.

## Use stdout when stdout is enough

Not every command needs a live terminal. If a program already has a non-interactive mode, the right interface is still stdin, stdout, stderr, and an exit code.

The `run` subcommand exists for commands that should return data. It gives agents a typed, logged wrapper for those calls while PTY automation stays focused on live screens.

```
agent-tui run -- gh api /repos/ConductorOne/agent-tui \
  --jq '{repo: .full_name, lang: .language, default_branch: .default_branch}'
{"default_branch": "main", "lang": "Rust", "repo": "ConductorOne/agent-tui"}
```

The result is plain data. It can be piped into `jq`, fed to another step, or asserted on in a test. `run` also fronts AI CLIs that expose non-interactive modes (`claude -p`, `codex exec`, `pi --print`, `opencode run`), so one model's answer can become another step's input without screen-scraping a prompt. `ask` is a short wrapper over that path.

Live AI CLI sessions have both paths. For a one-shot answer, use the data path. For the human terminal session, use the PTY path. Provider manifests expose the prompt and response as screen regions:

```
agent-tui spawn -- claude
agent-tui wait --ref '@claude.input[focused]'
agent-tui type --to '@claude.input' 'write a jq filter for this JSON'
agent-tui press --to '@claude.input' '<cr>'
agent-tui wait --ref '@claude.response[name~=/jq/]'
```

The adapter names the prompt and response from rendered cells. It does not read a provider transcript API. Knowing that a streaming answer is final across every AI CLI still needs provider-specific events or side channels. The split matters because it keeps the two cases separate: use `run` when the child is already a data-producing process, and use `spawn`, `snapshot`, `press`, and `wait` when the child is an interactive screen.
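The Pi and Claude sequences in this article follow the same wait, type, submit shape, so the PTY path can be wrapped once. A sketch that assumes the input region is named `@<app>.input`, as it is for the `codex`, `pi`, and `claude` refs shown here; other manifests may name it differently. `send_prompt` is a hypothetical helper, and the chaining assumes a failed step exits non-zero:

```shell
# send_prompt APP PROMPT: wait for @APP.input, type PROMPT into it, and submit.
# The @APP.input naming is taken from the codex, pi, and claude examples only.
send_prompt() {
  input="@$1.input"
  agent-tui wait --ref "$input" &&
    agent-tui type --to "$input" "$2" &&
    agent-tui press --to "$input" '<cr>'
}
```

After `agent-tui spawn -- claude`, a call such as `send_prompt claude 'write a jq filter for this JSON'` would replace the three middle commands of the block above. Deciding when the response is complete remains a separate wait.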

## Capture artifacts

A live session also produces files. The daemon records each pane to asciicast-v3 under `$XDG_STATE_HOME/agent-tui/<session>/<pane>.cast`. That file works with the [asciinema](https://www.asciinema.org/) ecosystem: `asciinema play` can play it, and renderers such as `agg` can turn it into a GIF for docs.

agent-tui uses the same cast as a test input:

```
cast="${XDG_STATE_HOME:-$HOME/.local/state}/agent-tui/default/p1.cast"
asciinema play "$cast"
agent-tui replay "$cast" --expect-snapshot expected.snap
```

`replay` does not start the original program. It feeds the recorded output bytes into a fresh terminal engine and compares the rendered snapshot. A demo session can become a regression test input.
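Turning recorded sessions into a regression suite could look like the loop below. This is a sketch with two assumptions the article does not confirm: that `agent-tui replay` exits non-zero when the rendered snapshot differs from the expected one, and a hypothetical layout where each `name.cast` sits next to its `name.snap`. `check_casts` is our name for the helper:

```shell
# check_casts DIR: replay every DIR/*.cast against DIR/<name>.snap.
# Returns 0 only if every replay matched (assumes non-zero exit on mismatch).
check_casts() {
  failed=0
  for cast in "$1"/*.cast; do
    [ -e "$cast" ] || continue            # no casts: the glob stayed literal
    snap="${cast%.cast}.snap"
    if agent-tui replay "$cast" --expect-snapshot "$snap"; then
      echo "ok   $cast"
    else
      echo "FAIL $cast"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

Because replay needs only the cast file, a loop like this can run in CI without installing the recorded program.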

For screenshots, `snapshot` can render the current grid to PNG. `--annotate` draws boxes and labels for matching refs; `--chrome` adds a frame for a README or blog image.

```
agent-tui snapshot --mode outline \
  --png vim-mode.png \
  --annotate '@vim.*' \
  --chrome 'vim todo.txt'
```

Use the cast when time matters. Use the PNG when the current frame matters, optionally with the screen regions named on top of it.

## Design choices

**The daemon owns the PTY.** `agent-tui spawn` starts the program under a daemon that serves screen state over a Unix socket. Later CLI calls connect to that socket, run one action, print a result, and exit. The terminal process keeps running between calls, so an agent can drive the same session one step at a time: snapshot, act, wait, snapshot again. Separate sessions isolate parallel work.

**Stable refs and selectors.** Coordinates are easy to produce and hard to trust. A resize, a theme change, or a status line can move the thing you meant. Refs give the model and the test harness a stable vocabulary: `@vim.mode[value=insert]`, `@shell.prompt`, `[role=buffer][focused]`. New apps can be taught with a TOML manifest that maps screen regions to roles.

## A machine interface for PTY apps

A terminal program is already a state machine. The problem is that the state is presented as a screen.

agent-tui gives agents a way to observe that screen, name parts of it, act on it, and wait for it to change. It does not require every program to grow a `--json` flag. It does not pretend a cell grid is a clean API. It gives the grid enough structure for a script to wait on facts instead of timing.

Terminal apps can be machine-readable and still work for humans. agent-tui is open source at [github.com/ConductorOne/agent-tui](https://github.com/ConductorOne/agent-tui). Bring the terminal apps your agents struggle with: adapters, bugs, and PRs are welcome.
