This Month in Agentic Coding: May 2026

The cost of frontier proprietary models increased significantly in May 2026, with Gemini 3.5 Flash priced three times higher than its predecessor and GPT 5.5 costing double GPT 5.4, while open-weight models like DeepSeek and MiMo cut prices to roughly 10-30 times cheaper. Both Codex and Claude Code introduced a new /goal command for sustained task execution, and Anthropic doubled Claude Code's rate limits following a SpaceX compute partnership. A new DeepSWE benchmark for longer-horizon software engineering tasks was released, with GPT 5.5 topping the rankings ahead of Opus 4.8.

Welcome to the first edition of ACW Monthly Brief. It's one email to catch you up on all the meaningful developments in agentic coding from the past month.

Reading time: ~20 minutes

In this issue, we will cover:

- Executive summary for the entire month
- Agentic coding trends that I observed over the course of the last month
- Models and benchmarks: models released last month, including Opus 4.8 and Gemini 3.5 Flash, where these models sit on the agentic coding benchmarks, and an intro to the newly released DeepSWE benchmark
- Tool updates and new features in agentic coding tools like Claude Code, Codex, and Antigravity
- Interesting projects people are building with agentic coding
- Best agentic coding workflows worth trying from across the web
- Reading list for the month with 5 must-reads
- Miscellaneous updates and some fun stuff

Non-AI disclaimer: I have written every single sentence in this article myself. See my reasoning here https://www.agenticcodingweekly.com/p/about-agentic-coding-weekly-newsletter . If you spot any problems with word choices or writing cadence, it's solely because I am bad at writing. Unlike your AI agents, if you tell me about those issues or your feedback by replying to this email, I'll remember them and put them into practice without you having to say "make no mistakes".

1. Executive Summary

- Tokens are getting both more expensive and cheaper. The frontier proprietary models got more expensive. Gemini 3.5 Flash is 3x more expensive than its predecessor. GPT 5.5 was already 2x GPT 5.4, and Opus 4.7 was already ~1.4x Opus 4.6 because it burns more tokens. On the other hand, open-weight models like DeepSeek and MiMo cut prices and are roughly 10-30x cheaper.
- There were five major model releases: Opus 4.8, Gemini 3.5 Flash, Composer 2.5, Qwen 3.7 Max, and the open-weight Step 3.7 Flash. Most were incremental updates. Gemini 3.5 Flash was disappointing, costing more than 3.1 Pro to run the Artificial Analysis suite while still ranking below it.
- Both Codex and Claude Code added a /goal command that keeps the agent working toward a verifiable objective instead of stopping after a single turn.
- Google dropped the IDE part with Antigravity 2.0, turning it into an agent manager like Claude and Codex desktop. Antigravity CLI is replacing Gemini CLI, and Gemini CLI will stop working from June 18, 2026.
- Anthropic doubled the 5-hour rate limits for Claude Code after a SpaceX compute partnership, and there is a promotion with 50% higher weekly limits running until July 13. They also formalized the subscription policy so that from June 15 your plan covers only interactive TUI use. There will be separate credits for programmatic and third-party use.
- Claude Code added a new dynamic workflow feature to turn a large multi-agent task into a JavaScript orchestration script. Instead of Claude manually coordinating work turn-by-turn, Claude writes a workflow script for the task, the runtime executes it, and many subagents can be fanned out across phases such as discovery, implementation, review, and synthesis.
- A new benchmark, DeepSWE, was released for longer-horizon SWE tasks. GPT 5.5 tops this benchmark and ranks higher than Opus 4.8.
It's worth a look, but it doesn't reflect the experience you'd actually get from Claude Code or Codex CLI, so I wouldn't advise picking your day-to-day tools based on just this benchmark.

2. Trends Over the Month

Trend 1: Tokens are getting both more expensive and cheaper

Gemini 3.5 Flash is three times more expensive than its predecessor model in the Flash series. GPT 5.5, released near the end of April, was twice the price of GPT 5.4. Opus 4.7, released around mid-April, was technically the same price as Opus 4.6, but it consumed more tokens, so it was around 1.4 times more expensive in practice.

Enterprises are now paying API pricing for both OpenAI and Anthropic along with $20 per user. They are not getting the benefits that individual users get from $200 or $100 subsidized subscriptions. So costs are increasing, but on the other hand, open-weight models like DeepSeek and MiMo have lowered their pricing and are offering models that are 10-30x cheaper.

Given the reports of Uber going past their entire annual AI budget just in May and Microsoft canceling their Claude usage, it'll be interesting to see how this increase in pricing is going to affect AI adoption in enterprises. Are there going to be stricter caps on AI usage, or are companies going to move toward open-weight models?

Trend 2: Making coding agents work longer

There have been ongoing efforts since last year to make coding agents work for longer. This month, we saw both Codex and Claude Code add the /goal command, which prevents them from stopping after just one turn so that they continue working until the stated goal is reached. The goal command is similar to the Ralph loop concept, but now it is built into the harnesses.

Trend 3: No AI lab has an IDE product

Well, to be fair, only Google had an IDE product with Antigravity 1.0, but with the release of 2.0, they abandoned the IDE part, and it's now an agent manager similar to Claude Desktop and Codex Desktop.
Given the importance and the market fit the AI labs have found for coding, it is interesting to see that none of them offer an IDE product. Do they truly see no value worth pursuing in building a good IDE? Or do they truly feel that IDEs are not going to be the future? We're only gonna interact with our codebases through agents, not IDEs?

It is also kind of interesting that we have seen demos from different labs where they use a collection of agents to build an operating system from scratch, build a C compiler, or port Bun from Zig to Rust. But nobody has tried to build an IDE that provides an excellent UX and is not a resource-heavy Electron app.

3. Models and Benchmarks

There were five major releases last month: four proprietary and one open-weight. The releases were mostly incremental updates, except Gemini 3.5 Flash, which was a disappointment in terms of both pricing and performance.

Opus 4.8 https://www.anthropic.com/news/claude-opus-4-8 May 28 - An incremental upgrade over Opus 4.7 with no change in pricing. Anthropic says 4.8 is more honest and more likely to flag uncertainty than to claim unsupported progress. It's roughly four times less likely than 4.7 to let flaws in code it wrote pass unremarked. Based on personal vibes, I am finding it more usable than 4.7, which felt like a downgrade from 4.6. In the 4.8 announcement post, they also said that they are preparing to release Claude Mythos for all customers in the "coming weeks".

Gemini 3.5 Flash https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/ May 19 - Beats Gemini 3.1 Pro on almost all benchmarks. Priced at $1.5 / $9 per million input / output tokens, it's 3x more expensive than the last Flash model, 3 Flash, and close to 3.1 Pro's $2 / $12. It's quite fast, at over 200 tokens per second, but also more verbose. Running the Artificial Analysis test suite costs $1551 for 3.5 Flash vs $892 for 3.1 Pro, while 3.5 Flash still ranked lower than 3.1 Pro.
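
To put those Gemini numbers side by side, here is a rough back-of-the-envelope sketch in Python. It assumes both models ran the suite with the same input/output token mix and ignores caching, neither of which the figures above tell us, so treat the result as an estimate rather than a measurement. The variable names are mine.

```python
# Per-million-token prices (input, output) in USD, as quoted above.
flash_35_price = (1.5, 9.0)
pro_31_price = (2.0, 12.0)

# Reported cost in USD to run the Artificial Analysis suite.
flash_35_cost = 1551.0
pro_31_cost = 892.0

# 3.5 Flash is 25% cheaper per token on both input and output.
price_ratio_in = flash_35_price[0] / pro_31_price[0]
price_ratio_out = flash_35_price[1] / pro_31_price[1]

# Yet the full run cost noticeably more.
cost_ratio = flash_35_cost / pro_31_cost

# ASSUMPTION: same input/output mix for both models, no cache discounts.
# Under that assumption, token volume has to make up the difference.
implied_token_ratio = cost_ratio / price_ratio_out

print(f"price ratio (in / out): {price_ratio_in:.2f} / {price_ratio_out:.2f}")
print(f"cost ratio: {cost_ratio:.2f}")
print(f"implied token ratio: {implied_token_ratio:.2f}")
```

Under these assumptions, 3.5 Flash would have needed roughly 2.3x the tokens of 3.1 Pro to end up about 1.74x as expensive, which would be consistent with it being the more verbose model.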
Composer 2.5 https://cursor.com/blog/composer-2-5 May 18 - Cursor's further fine-tuned version of Kimi K2.5. Benchmarks put it close to Opus 4.7 and GPT 5.5, but it is priced much lower at $0.50/M input and $2.50/M output tokens. Context window is 200k tokens.

Qwen 3.7 Max https://qwen.ai/blog?id=qwen3.7 May 20 - Latest release in the Max series, the proprietary variant of the Qwen models. It scores 69.7% on Terminal Bench 2.0 and 60.6% on SWE Bench Pro. Pricing is $2.5 / $7.5 per million input / output tokens, and the context size is 1 million tokens.

Step 3.7 Flash https://static.stepfun.com/blog/step-3.7-flash/ May 29 - Open-weight model from StepFun. 198B Mixture-of-Experts (MoE) with 11B active parameters. 256k context window. It loves to use the word "wait" in its reasoning. I tried the car wash test with this model, and this is a short excerpt of the model's reasoning from the test: "Wait no, wait—wait, no, wait: if you drive, you get in your car, drive 50m to the car wash, then... wait, but if you drive to the car wash, you're bringing the car with you, which is the point, right? Wait wait, no, wait the goal is to wash the car. Oh right ". It got the answer right though. You can check the full model reasoning and response in the GitHub Gist https://gist.github.com/primaprashant/d73a5d941ecd9cc2aa2352a00498f36d . I'll wait.

Other than the new model releases, DeepSeek and MiMo reduced their prices.

Originally, DeepSeek V4 Pro was available at 75% off through the API until May 31st. With the discount, the pricing was $0.435 / $0.87 per million input / output tokens. They made the discount permanent https://x.com/deepseek ai/status/2057854261699195173 . So, the API pricing will stay at $0.435 / $0.87 permanently.

MiMo V2.5 Pro was priced at $1 / $3 per million input / output tokens. They have discounted the price starting May 27 https://platform.xiaomimimo.com/docs/en-US/news/v2.5-price-update to match the DeepSeek V4 Pro pricing of $0.435 / $0.87.
They also reduced cache-hit input pricing by 98%, bringing it down from $0.2 per million tokens to $0.0036. For agentic coding tasks, since most of the tokens consumed are cache reads, this model is truly cheap.

Turning to the benchmarks, here's how the newly released Opus 4.8 and Gemini 3.5 Flash did on Simon Willison's pelican question, "Generate an SVG of a pelican riding a bicycle":

Opus 4.8 xhigh - Pelican riding a bicycle
Gemini 3.5 Flash xhigh - Pelican riding a bicycle

And for comparison, this is what Opus 4.7 and GPT 5.5 produced:

Opus 4.7 xhigh - Pelican riding a bicycle
GPT 5.5 xhigh - Pelican riding a bicycle

As for the current state of agentic coding benchmarks, these are the current standings of the top proprietary and open-weight models. The model names in bold were newly released last month.

| Model | SWE-bench Pro | Terminal-Bench 2.1 | Pricing per 1M tokens (input / output) |
|---|---|---|---|
| **Claude Opus 4.8** | | 66.1% | $5 / $25 |
| **Qwen 3.7 Max** | 60.6% | 69.7% | $2.5 / $7.5 |
| **Step 3.7 Flash** | 56.3% | 59.6% | $0.2 / $1.15 |
| **Gemini 3.5 Flash** | 55.1% | 76.2% | $1.5 / $9 |
| **Composer 2.5** | - | 69.3% | $0.5 / $2.5 |
| Claude Opus 4.7 | 64.3% | 66.1% | $5 / $25 |
| GPT 5.5 | 58.6% | | $5 / $30 |
| Kimi K2.6 | 58.6% | - | $0.95 / $4 |
| GLM 5.1 | 58.4% | - | $1.4 / $4.4 |
| MiMo V2.5 Pro | 57.2% | - | $0.435 / $0.87 |
| DeepSeek V4 Pro | 55.4% | - | $0.435 / $0.87 |
| Gemini 3.1 Pro | 54.2% | 70.3% | $2 / $12 |

Talking about benchmarks, there is a new one in town. You might have seen something like the bar plot below floating around. Let's look into what it is first and then where it falls short.

DeepSWE Benchmark

DeepSWE benchmark leaderboard

This is a benchmark https://deepswe.datacurve.ai/ for longer-horizon tasks than SWE-Bench Pro, with tasks that are more diverse and contamination-free. It contains a total of 113 tasks from 91 open-source repositories across five languages. About 90% of the tasks are in TypeScript, Go, and Python, at roughly 30% each. The remaining two languages are JavaScript and Rust.
This benchmark differs a lot from SWE Bench in terms of how the tasks are defined and how they are evaluated. Instead of giving the model the full GitHub issue, where there is a lot of additional context, it gives only a small task prompt, and the model needs to figure out the rest itself. Then the implementation is evaluated by a handwritten set of tests. In SWE Bench, tests are derived from the repository state after the issue was fixed and the commit was merged. Passing all the tests in the repository after the relevant commit was merged doesn't accurately tell us whether the implementation was correct or whether it just satisfied all the tests without actually solving the problem. During the implementation, each model gets the same system prompt and only a bash tool.

The obvious limitation is that it doesn't test the true quality of the model or the kind of experience you would get in day-to-day implementation. There are two reasons for this. First, every model uses the same mini-SWE-agent harness, which is very, very minimal. They did this to keep the comparison fair and to test only the capability of the model, not the harness. But this also means that since each model is not getting all the tools it was trained with, we cannot definitively say the result reflects the true capability of the model. Similarly, models are not getting their own custom system prompt, so it's hard to say the performance matches the true capability of the model. Second, the harness is just as important as the model these days, and for these implementation tasks, the models are not getting their purpose-built harness. In my opinion, the benchmark doesn't reflect the experience you would actually get with these models when using Claude Code CLI or Codex CLI.

Because of the limited number of samples and the limitations of the harness, I'd say it's not the ultimate benchmark for picking your day-to-day agentic coding tools and models just yet.

Speaking of tools...

4. Tool Updates

Antigravity

Along with 3.5 Flash, Google also launched Antigravity 2.0 https://antigravity.google/blog/introducing-google-antigravity-2-0 at Google I/O. AGY 2.0 is a parallel agent manager similar to Claude and Codex desktop. There's no IDE inside 2.0. They also announced Antigravity CLI https://antigravity.google/blog/introducing-google-antigravity-cli , a closed-source coding agent which uses the same harness as Antigravity 2.0 and is replacing Gemini CLI. Gemini CLI will stop working from June 18, 2026 https://developers.googleblog.com/an-important-update-transitioning-gemini-cli-to-antigravity-cli/ . The 2.0 auto-update was poorly handled and broke people's workflows https://www.0xsid.com/blog/antigravity-bait-n-switch without any warning.

Dynamic Workflows in Claude Code

Along with Opus 4.8, Anthropic also released a new workflow feature https://claude.com/blog/introducing-dynamic-workflows-in-claude-code to turn a large multi-agent task into a JavaScript orchestration script. Instead of Claude manually coordinating work turn-by-turn, Claude writes a workflow script for the task, the runtime executes it, and many subagents can be fanned out across phases such as discovery, implementation, review, and synthesis.

To trigger a dynamic workflow, we can run a saved workflow command like /deep-research, include the word workflow in our prompt, or enable /effort ultracode, which allows Claude to decide when to use workflows automatically.

From my understanding, this dynamic workflow is different from using subagents or agent teams because those are coordinated by Claude, and all the task context stays in the main agent's context window or in a shared task list. In a workflow, the task context is in the JavaScript code, and there is no main agent coordinating everything.

As for use cases, the docs say to use it for tasks that need more agents than one conversation can coordinate, or when we want the orchestration codified as a script that we can read and rerun.
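
To make the fan-out idea concrete, here is a toy sketch of what such an orchestration script might look like. It is written in Python with asyncio purely for illustration; real dynamic workflows are JavaScript scripts, and this is not Claude Code's API. `run_subagent` is a hypothetical stub that fakes a result instead of launching an agent, and the phase names are simply taken from the description above.

```python
import asyncio

# Phase names taken from the feature description; the rest is illustrative.
PHASES = ["discovery", "implementation", "review", "synthesis"]


async def run_subagent(phase: str, task: str, context: list[str]) -> str:
    """HYPOTHETICAL stand-in for launching one subagent; returns a fake result."""
    await asyncio.sleep(0)  # yield to the event loop, as a real call would
    return f"{phase}:{task} (saw {len(context)} prior results)"


async def run_workflow(tasks: list[str]) -> dict[str, list[str]]:
    """Run the phases in order, fanning out one subagent per task in each phase."""
    results: dict[str, list[str]] = {}
    context: list[str] = []  # outputs of the previous phase
    for phase in PHASES:
        # The last phase collapses everything into a single summarizing agent.
        phase_tasks = ["summary"] if phase == "synthesis" else tasks
        outputs = await asyncio.gather(
            *(run_subagent(phase, t, context) for t in phase_tasks)
        )
        results[phase] = list(outputs)
        context = list(outputs)  # state lives in the script, not in an agent
    return results


results = asyncio.run(run_workflow(["module_a", "module_b", "module_c"]))
for phase, outputs in results.items():
    print(phase, outputs)
```

The point of the sketch is that the coordination state (`context`, `results`) lives in ordinary script variables rather than in a main agent's context window, which is how I understand the difference from subagents and agent teams.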
Of course, all of this consumes a shit ton of tokens. The recent rewrite of Bun from Zig to Rust was done using dynamic workflows. Boris, creator of Claude Code, also shared a few tasks for which he has been using workflows in this HN comment https://news.ycombinator.com/item?id=48312433 .

Agent View in Claude Code

A new way to control and manage multiple Claude Code sessions from a single screen https://code.claude.com/docs/en/agent-view by running claude agents. GUI-based IDEs and ADEs like Zed, Cursor, Codex, and Claude desktop apps have had this for a while. Now it's time for TUIs to get it.

/goal Command

Both Codex https://developers.openai.com/codex/use-cases/follow-goals and Claude Code https://code.claude.com/docs/en/goal added the /goal slash command for goal-directed autonomous work. Instead of stopping after each response, Codex/Claude Code keeps working toward a concrete objective. To use it, run /goal with a verifiable objective. For example, /goal all tests in test/auth pass and the lint step is clean.

To decide if the goal has been achieved, after Claude finishes responding, Claude sends the goal and conversation to a Haiku model, which returns a yes-or-no decision and a short reason. A "yes" stops the loop; a "no" sends the reason back to Claude, which is asked to keep working. I'd imagine something similar happens in Codex as well.

Claude Code Limits and Subscription Changes

Anthropic partnered with SpaceX https://www.anthropic.com/news/higher-limits-spacex to increase compute capacity and announced higher Claude usage limits. They have doubled the five-hour rate limits for Claude Code on Pro, Max, Team, and seat-based Enterprise plans. Default weekly limits are unchanged though, so this mostly helps with bursty sessions rather than total weekly usage. However, for a short period until July 13, they have also increased the weekly limits by 50% https://x.com/ClaudeDevs/status/2054639777685934564 .
After all the fiasco with using Claude subscriptions with third-party tools like openclaw and OpenCode a couple of months ago, Anthropic has also formalized the policy for such use cases. Starting June 15, 2026, a Claude subscription will cover only interactive (TUI) use of Claude Code https://support.claude.com/en/articles/15036540-use-the-claude-agent-sdk-with-your-claude-plan . We will get a separate monthly credit equal to our subscription amount for programmatic usage like claude -p, third-party apps using the Claude Agent SDK, or tools like openclaw. You have to manually claim the credits once https://x.com/ClaudeDevs/status/2054610158458904769 , though; after that, they refresh automatically each cycle.

Quick Hits

- llama.cpp now supports MTP https://github.com/ggml-org/llama.cpp/pull/22673 (Multi-Token Prediction) based speculative decoding. If you run Qwen3.6 27B or Qwen3.6 35B-A3B locally, you can expect about a 2x speed-up in token throughput.
- Codex is now available in the ChatGPT mobile app https://openai.com/index/work-with-codex-from-anywhere/ . Requires the Codex desktop app running on one of your machines. Not the Codex CLI, the Codex app.
- xAI launched their own proprietary CLI coding agent, Grok Build CLI https://x.ai/news/grok-build-cli . Early beta, available only to SuperGrok Heavy subscribers (the $300/month plan).
- antirez built a specialized native inference engine for DeepSeek V4 Flash https://github.com/antirez/ds4 optimized for Apple Silicon Metal. Clean, minimal codebase worth reading. Runs at 2-bit quants on 128GB Macs, 4-bit on 256GB.

And now, before we move on, I found this on Reddit https://old.reddit.com/r/vibecoding/comments/1tc2isb/claude this will take 2 weeks me hold my beer/ .

5. What People are Building

Interesting projects where agentic coding played a major role:

The Emacsification of Software https://sockpuppet.org/blog/2026/05/12/emacsification/ - Agentic coding allows people to build hyper-specific software for themselves, i.e., Emacsification.
The author built a markdown viewer specific to their needs.

I Let AI Build a Tool to Help Me Figure Out What Was Waking Me Up at Night https://martin.sh/i-let-ai-build-a-tool-to-help-me-figure-out-what-was-waking-me-up-at-night/ - A bit over-engineered, but who am I to judge. The project is pretty cool, and the journey is often more important than the destination.

6. Workflows to Try This Month

5 best workflow patterns I found on HN/Reddit/Twitter/Web:

When exploring a new idea or tool, my go-to prompt is: In a single index.html, no dependencies, sparse styling, create an app that