How Port of Context eliminated production agent failures with goose, Arcade.dev, and Code Mode

Port of Context eliminated production agent failures by implementing Code Mode, a sandboxed execution path within the open-source agent harness goose, achieving 100% task delivery versus 56% without it. Code Mode roughly halved the per-run cost to $0.20, cut input token usage roughly threefold, and eliminated the rate-limit failures that plagued the standard tool-use loop.

Port of Context needed a GTM intelligence agent that could search multiple key terms in real time across data-intensive sources like GitHub and Hacker News. Before committing to an execution path, they wanted to test whether the Code Mode solution they built into goose would result in a more efficient and robust agent. Code Mode delivered every task at roughly half the per-run cost with zero rate-limit failures.

About Port of Context

Port of Context (http://portofcontext.com) builds the context engineering and execution layer for production agentic AI. They contributed Code Mode to goose’s sandboxed-execution path.

The GTM agent we needed

Our goal was to flag recent GitHub issues, pull requests, and Hacker News threads relevant to the problems we’re trying to solve. This would unearth key GTM signals in real time:

- Outreach: developers actively hitting MCP or agent-building hurdles, surfaced in real time for prime outreach opportunities.
- Positioning: what people are building, what’s getting traction, and which adjacent products are gaining ground in our space.
- Content: trending topics that inform our editorial direction.

How we wanted the GTM agent to run

- Search specific key terms, not broad keywords, to surface only highly relevant threads from the past 7 days.
- Run multiple searches per day on a schedule, covering a broad range of key terms across data-heavy sources, without sacrificing reliability or efficiency.
- Deliver findings as a structured digest posted to a dedicated Slack channel the team can scan in seconds.

How we built the GTM agent

We built the GTM agent on three pieces:

- A custom-built MCP server, “gtmdata”, which exposes five tools the agent can call to search GitHub issues, pull requests, and repos; search Hacker News; and post Slack messages: gtmAgentSearchGithubIssues, gtmAgentSearchGithubPullRequests, gtmAgentSearchGithubRepositories, gtmAgentSearchHackernews, and gtmAgentPostSlackMessage.
- Arcade.dev, the MCP runtime, which gave us typed, reliable tools, managed authorization, and policy enforcement through per-request OAuth primitives (GitHub(scopes=…), Slack(scopes=…)) that handled auth without us writing a single line of token management. The same controls govern every agent, on any model, with no lock-in.
- goose, AAIF’s open source agent harness, which takes the user prompt and dispatches tool calls.

The execution choice we needed to validate

goose offers two ways to run tool calls:

- The standard tool-use loop: the agent issues one tool call at a time, the JSON result returns to the model’s context as a message, the model reasons about it in natural language, and decides the next call.
- Code Mode: a sandboxed execute_typescript path Port of Context contributed to goose. The agent writes a small TypeScript program, calls MCP tools as typed function calls inside the sandbox, and only the final result returns to the conversation. Tool result payloads stay inside the sandbox.

Our hypothesis was that the sandboxed Code Mode execution we built would be more reliable and more token-efficient under exactly this kind of multi-source, continuous workload.

The experiment

We ran the same prompt with three different models, in a fresh goose session each time:

Use gtm agent to look up all mentions of the specific term “code mode” across sources over the past 7 days and send a digest to the slack channel.

- Models: Sonnet 4-5, Opus 4-6, GPT-5-4
- Execution: Code Mode ON, Code Mode OFF (goose’s standard tool-use loop)
- Runs per model: 6 (3 ON, 3 OFF)

Everything else (the architecture, the MCP server, the upstream APIs, the Slack channel) was held constant.

The results

- 100% delivery with Code Mode ON, 56% with it OFF. Every rate-limit failure in the experiment was an OFF-mode run.
- Average cost per run: $0.20 ON vs $0.41 OFF, roughly 2× cheaper.
- Average input tokens per run: 31.8k ON vs 93.6k OFF, roughly 3× more efficient.
- Cheapest correct result in the dataset: $0.056, a Code Mode ON run.

Why the gaps exist

In Code Mode, the agent writes a single sandboxed TypeScript program that batches every MCP call into one Promise.all and chains the result straight into a Slack post. Tool result payloads stay inside the sandbox; only the final summary returns to the model. That’s the key contract: the agent reads result.total_count as a typed field rather than parsing it out of a JSON-as-message blob, and four search payloads totaling 100k+ tokens never reach the model’s context window. The sandbox consumes them. In a real run, the whole thing collapses into a short reasoning trace, a single execute_typescript call, and a Slack post.

With Code Mode OFF, the same task plays out as a back-and-forth tool-use loop instead. The agent searches GitHub Issues, then GitHub PRs, then GitHub Repositories, then Hacker News, and each tool call returns a 15–30 KB JSON payload that re-enters the model’s context as a message. By the time the agent compiles the digest and posts to Slack, the next inference call is carrying all four payloads in input context, and every prior tool result comes along for the ride. That’s where the 3× input-token gap comes from, and where most OFF runs hit the per-minute rate cap.

Code Mode doesn’t change what the agent does. It changes where the data lives during the run, and that single shift is what closes the cost and reliability gap.

Bonus observations on the model axis

Sonnet 4-5 consistently failed to grasp the prompt’s phrase-search intent. In every delivery, it passed “code mode” through unquoted to GitHub Search, which ANDs the two words, so the search returned ~25,000 issues containing both words rather than mentions of the phrase “code mode”. The model reported the broad totals as if they were the real count. None of Sonnet’s four deliveries flagged the distinction.
One run summed the four broad totals into a single headline figure of 208,107 mentions.

Opus 4-6 grasped the phrase-search intent in every delivery. It either escaped the quotes properly into a phrase-quoted API call, or recognized noise in unfiltered results mid-flight and applied a client-side phrase-validation pass, visible in the agent’s reasoning trace: “Now I have all the data. Let me carefully review the results to find actual mentions of the specific term ‘code mode’ as a distinct phrase, not just ‘code’ and ‘mode’ appearing separately.”

GPT-5-4 always sent phrase-quoted queries to the API, but two of three GPT-5-4 ON runs had a downstream framing bug: the agent reported the array length (50, the server’s safe-limit cap) instead of the real total count, so the digest claimed “Total Mentions: 122” instead of the actual ~4,640.

Three models, three different failure profiles on the same prompt. Sonnet built confident lists from broken inputs. Opus self-corrected mid-flight when it spotted the noise. GPT got the API call right but mis-framed the counts twice.

Appendix. How Code Mode works under the hood, and how pctx’s flavor differs

Four teams have shipped a Code Mode style execution path in the last year. Cloudflare coined the term and ships it on Dynamic Workers. Anthropic ships Programmatic Tool Calling (PTC) on the Claude API, where Claude writes Python in a Code Execution container. Pydantic AI ships Monty, a deny-by-default Python interpreter written in Rust. Port of Context ships pctx, which is the implementation we ran in goose for this experiment.

The premise is the same in all four. The model emits one program that calls tools as typed functions inside a sandbox. The sandbox swallows the intermediate payloads. Only the final reduce returns to the model’s context.
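
As a rough illustration of that shared premise, the sketch below shows the shape of program a model might emit in Code Mode. Everything named here is an assumption made for illustration: `searchIssues`, `searchHackerNews`, `postSlackMessage`, and the `SearchResult` shape are local stubs standing in for the typed MCP bindings a sandbox would inject, not the actual pctx, goose, or gtmdata API.

```typescript
// Hypothetical result shape; the real gtmdata schemas are not shown in this post.
interface SearchResult {
  total_count: number; // server-side total for the query
  items: { title: string; url: string }[]; // one (possibly capped) page of hits
}

// Stub tools standing in for sandbox-injected MCP bindings (assumed names).
async function searchIssues(query: string): Promise<SearchResult> {
  return {
    total_count: 2,
    items: [
      { title: `Issue mentioning ${query}`, url: "https://example.com/issue/1" },
      { title: `Another issue on ${query}`, url: "https://example.com/issue/2" },
    ],
  };
}

async function searchHackerNews(query: string): Promise<SearchResult> {
  return {
    total_count: 1,
    items: [{ title: `HN thread on ${query}`, url: "https://example.com/hn/1" }],
  };
}

const posted: string[] = [];
async function postSlackMessage(text: string): Promise<{ ok: boolean }> {
  posted.push(text); // a real binding would call Slack; the stub only records
  return { ok: true };
}

// The "one program" the model writes: fan out, reduce, post, return a summary.
async function runDigest(term: string): Promise<string> {
  const query = `"${term}"`; // phrase-quote so the two words are not matched separately
  const [issues, hn] = await Promise.all([
    searchIssues(query),
    searchHackerNews(query),
  ]);
  const lines = [...issues.items, ...hn.items].map((i) => `- ${i.title} (${i.url})`);
  const total = issues.total_count + hn.total_count;
  await postSlackMessage(`Digest for ${query}: ${total} mentions\n${lines.join("\n")}`);
  // Only this short string would re-enter the model's context.
  return `Posted digest: ${total} mentions across 2 sources`;
}
```

The point of the sketch is the return value: the full item lists are used to build the Slack message inside the program, while the string handed back to the caller carries only the headline count.
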

The differences are in what language the program is written in, how the sandbox is enforced, how strongly the model is coupled to a specific runtime, and how visible the run is to the agent that wrote it.

The shape of the loop

| Mode | What happens on a turn | What re-enters the model’s context |
| --- | --- | --- |
| Code Mode OFF (classic tool-use loop) | Model emits one tool call. Result returns into context as a message. Model reasons over it, picks the next call. Repeat per tool. | Every tool result, every turn. Prior results carry forward in input context. |
| Code Mode ON (sandbox execution) | Model writes one program. Sandbox runs all tools, often in parallel (Promise.all, asyncio.gather). Sandbox holds the intermediate payloads. Sandbox returns the reduce. | One short final result. The 100 KB of search payloads stay in the sandbox. |

The four Code Mode implementations side by side

| Aspect | pctx | Cloudflare Code Mode | Anthropic PTC | Pydantic Monty |
| --- | --- | --- | --- | --- |
| Sandbox language | TypeScript on Deno | TypeScript on V8 Dynamic Workers | Python in the Code Execution container | A Python subset interpreted by a Rust binary |
| Pre-flight check | TypeScript type-checks the program before any tool fires. Errors come back as TypeScript diagnostics, not as runtime crashes. | None. Errors surface as Worker runtime exceptions. | None. Errors surface as Python tracebacks inside the container. | ty, by Astral, type-checks the program in under 100 ms before any tool fires. Errors come back as Python type hints, not as runtime crashes. |
| Sandbox model | Deno permission flags, 10 s timeout, network allow-list scoped to MCP hosts. | V8 isolate with globalOutbound: null. External fetch and connect are blocked. The only outbound paths are MCP bindings the host wires in. | Anthropic-managed container with heavier isolation. The model never sees raw tool payloads; only the final code output enters the conversation. | Deny-by-default Python. open, eval, exec, import do not exist. os and sys are stubs. Filesystem, network, env are external functions the host controls. |
| LLM coupling | BYO-LLM. Any model that speaks MCP. We ran the same workload across Sonnet 4-5, Opus 4-6, and GPT-5-4 without changing a line. | Tied to the Cloudflare Agents SDK on Workers. | Claude API only. | Tied to Pydantic AI’s agent abstraction. |
| Tool disclosure modes | Three: catalog (dynamic discovery), filesystem (model greps a virtual FS of typed tool definitions), sidecar (upstream tools surfaced as MCP tools next to execute_typescript). | One. Bindings to registered MCP servers. | One. In-container tool registry. | One. Sandbox calls wired via toolsets. |
| Visibility into a run | Full diagnostic record returned to the agent: stdout, return value, type-check output, runtime errors with line and column. | Worker logs surfaced back to the agent run. Arcade.dev logs all tool calls and provides the governance layer. | Tool results stripped from context. Only the final code output is visible to the agent. | Host-controlled. Whatever the host wires back into the agent. |

Why BYO-LLM matters for this experiment

A core motivation for the case study was running the same workload across three model families. Anthropic’s PTC is structurally Claude-only, so we could not have measured GPT-5-4’s framing bug or Sonnet’s phrase-search miss in the same harness. Cloudflare Dynamic Workers run as part of the Workers platform; portability across other agent harnesses is possible but not the default. Monty assumes a Pydantic AI agent on top. With pctx the agent harness is goose, the execution is pctx, and the model is a config knob. The three failure modes we surfaced (Sonnet confidently summing broken counts, Opus catching its own noise mid-flight, GPT mis-framing array length as total) are only visible because the model was the only thing changing across runs.

Where this leaves the comparison

Code Mode is a good pattern regardless of vendor.
Every implementation cuts input tokens by a factor of 2 to 3 and removes the rate-limit cliff the classic tool-use loop hits on data-heavy workloads. The implementation choices change what you can build on top:

- Cloudflare is the right pick if your agent already lives on Workers and you want sandboxed execution as a platform primitive next to your MCP bindings.
- Anthropic PTC is the right pick if you are on the Claude API, building one model deep, and want the lowest-friction path that needs no extra infrastructure.
- Pydantic AI’s Monty is the right pick if you are inside Pydantic AI and want microsecond-startup execution of trusted Python with a tight security model.
- pctx is the right pick when you need BYO-LLM (any model, any harness that speaks MCP), pre-flight type safety so the agent learns from errors before paying for them, and a permissioned sandbox that runs the same in development and in production.

Those are the constraints that matter for production agentic AI, and they are the ones that drove us to build it.
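
To make the input-token and rate-limit argument concrete, here is a back-of-the-envelope accounting model. It is a sketch, not a measurement: it assumes every inference call re-reads the whole conversation so far, and the function names and any sizes passed in are illustrative placeholders rather than figures from the experiment.

```typescript
// Toy model: every inference call re-sends the full conversation as input.
// All numbers passed in are illustrative assumptions, not measured values.

/** Classic loop: each tool result joins the context before the next call. */
function classicLoopInputTokens(promptTokens: number, payloadTokens: number[]): number {
  let context = promptTokens;
  let totalInput = 0;
  for (const payload of payloadTokens) {
    totalInput += context; // the call that decides the next tool
    context += payload; // the tool result re-enters the context
  }
  totalInput += context; // the final call that writes the digest
  return totalInput;
}

/** Code Mode: one call to write the program, one to read the short result. */
function codeModeInputTokens(promptTokens: number, summaryTokens: number): number {
  const writeProgram = promptTokens;
  const readResult = promptTokens + summaryTokens;
  return writeProgram + readResult;
}

/** Largest single request in the classic loop: prompt plus every payload. */
function classicLoopPeakContext(promptTokens: number, payloadTokens: number[]): number {
  return promptTokens + payloadTokens.reduce((a, b) => a + b, 0);
}
```

With a prompt of s tokens and four equal payloads of p tokens each, the classic loop's five calls read s, s+p, s+2p, s+3p, and s+4p tokens, for 5s + 10p in total, while the Code Mode path reads 2s plus the summary. The model ignores the model's own output (tool calls, the program it wrote) and any retries, so it shows the shape of the gap rather than its measured size.
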

Sources

- pctx Code Mode docs: https://github.com/portofcontext/pctx/blob/main/docs/code-mode.md
- Cloudflare Code Mode (origin post): https://blog.cloudflare.com/code-mode/
- Cloudflare Code Mode MCP and Dynamic Workers: https://blog.cloudflare.com/code-mode-mcp/, https://blog.cloudflare.com/dynamic-workers/
- Anthropic Programmatic Tool Calling: https://platform.claude.com/docs/en/agents-and-tools/tool-use/programmatic-tool-calling
- Anthropic advanced tool use overview: https://www.anthropic.com/engineering/advanced-tool-use
- Pydantic Monty announcement: https://pydantic.dev/articles/pydantic-monty
- Pydantic Monty repo: https://github.com/pydantic/monty

Where this goes next

Code Mode is now the GTM Agent’s production execution path. Next on the roadmap: more sources beyond GitHub and Hacker News, more keyword sets, and scheduled runs through goose.

We’ll also be making the GTM agent itself public, so any team building MCP servers for sandboxed-execution clients can study, fork, or extend it. The typed-output-schema design baked into the server (full input and output schemas, returned through Arcade’s runtime) is what lets Code Mode chain calls together as cleanly as it does, and it generalizes to anyone building on the same stack.

GTM Agent live on GitHub: https://github.com/portofcontext/gtm_agent

Code Mode in the official MCP client best-practices guide: http://modelcontextprotocol.io
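
As a small footnote to the typed-output-schema point, the sketch below shows why a typed total field matters when a server caps the page of results it returns. The `SearchOutput` shape and both helper names are assumptions for illustration, not the gtmdata server's actual schema.

```typescript
// Hypothetical output schema for a search tool; field names are assumptions.
interface SearchOutput {
  total_count: number; // matches on the server for the whole query
  items: { title: string }[]; // one page of hits, capped by the server
}

/** Headline figure for a digest: read the typed total, not the page length. */
function totalMentions(results: SearchOutput[]): number {
  return results.reduce((sum, r) => sum + r.total_count, 0);
}

/** The framing bug: counting the returned rows instead of the real total. */
function pageLengthSum(results: SearchOutput[]): number {
  return results.reduce((sum, r) => sum + r.items.length, 0);
}

/** Build a capped page of placeholder hits for experimenting with the helpers. */
function fakePage(total: number, cap: number): SearchOutput {
  const size = Math.min(total, cap);
  return {
    total_count: total,
    items: Array.from({ length: size }, (_, i) => ({ title: `hit ${i + 1}` })),
  };
}
```

Whenever a source has more matches than the cap, the two helpers diverge, which is the same class of mistake as reporting an array length as the total.
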