# [AINews] All Model Labs are now Agent Labs

> Source: <https://www.latent.space/p/ainews-all-model-labs-are-now-agent>
> Published: 2026-05-23 04:21:17+00:00

### a quiet day lets us tie together a few quotes as all model labs become agent labs

Ahead of OpenAI’s [likely IPO filing](https://aitoolsrecap.com/Blog/openai-ipo-2026-valuation-timeline-what-investors-need-to-know) next week, Greg makes the latest in a series of comments where [Model Labs are increasingly also building Agents](https://www.latent.space/p/agent-labs) as the product:

The quote is a big reversal of stance from a position ~uniformly held by anyone who worked at **Team Big Model**, including [his previous head of OpenAI Labs](https://x.com/CoreAutoAI/status/2056442820022747444):

This comes with the shuttering of AI21’s model team, which is now pivoting to agents:

and even the venerable DeepSeek is now building a “Harness team” for the first time:

The “Systems over Models” people will take this as validation of what they have been saying all along, with one nuance: models cotrained with harnesses do open the door to closing off access to models even further. If you can effectively posttrain a model to only meaningfully perform with your closed-source agent, then you get to funnel the majority of users to your agent at the expense of your model/API co-opetition.

But that’s a topic for a much larger discussion…

AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

**AI Twitter Recap**

**Agent Products, Harnesses, and the Shift Beyond “Just the Model”**

**The product surface is moving up-stack**: A recurring theme was that model quality alone is no longer the moat; the winning product is increasingly **model + harness + workflow + UI + memory + economics**. [@gdb](https://x.com/gdb/status/2057670776803996110) put it bluntly: “the model alone is no longer the product,” while [@dzhng](https://x.com/dzhng/status/2057748510947082539) argued top-tier products need **model <> harness <> product symbiosis**. The same pattern shows up in practice: [@signulll](https://x.com/signulll/status/2057850735048458639) framed ambient AI and agentic AI as the new seam of computing interfaces, and [@teortaxesTex](https://x.com/teortaxesTex/status/2057770692112798209) noted that harness research still risks converging on “replicate Claude Code” instead of exploring broader interfaces.

**Coding-agent product differentiation is becoming concrete**: OpenAI shipped another substantial Codex update via [“codex thursday no. 6”](https://x.com/ajambrosino/status/2057716220963803577) with **appshots, /goal improvements, remote computer use while locked, annotation mode, plugin sharing, and analytics**. [@gdb](https://x.com/gdb/status/2057802037757157838) separately highlighted **Appshots**, while users reported meaningful workflow shifts: [@gdb](https://x.com/gdb/status/2057704270531903811) said it’s hard to remember coding before Codex, and [@reach_vb](https://x.com/reach_vb/status/2057830243201622368) said they haven’t opened an IDE in over a month. But product rough edges remain: [@theo](https://x.com/theo/status/2057960907997876412) praised **T3 Code’s remote feature** as ahead of alternatives, then contrasted it with buggy remote workflows in Codex in a follow-up [post](https://x.com/theo/status/2057961165175873930). On the Claude side, [@ClaudeDevs](https://x.com/ClaudeDevs/status/2057946803685974482) expanded **auto mode** to the Pro plan and added **Sonnet 4.6** support; [@_mohansolo](https://x.com/_mohansolo/status/2057910616153882949) also had to clarify and patch IDE support in **Antigravity 2.0** after user backlash.

**Model Performance, Cost Curves, and Frontier Competition**

**DeepSeek’s pricing move was the biggest market signal**: [@deepseek_ai](https://x.com/deepseek_ai/status/2057854261699195173) made the **75% DeepSeek-V4-Pro discount permanent**, triggering strong reactions because it materially changes the **cost/performance frontier**. [@ArtificialAnlys](https://x.com/ArtificialAnlys/status/2058021452465799403) quantified first-party pricing at **$0.435/M input, $0.87/M output, $0.0036/M cached input**, estimating a blended **~$0.18/M** and placing V4 Pro on the Pareto frontier for intelligence vs run cost. They estimate running their Intelligence Index on V4 Pro costs **~3x less than Gemini 3.1 Pro Preview, ~12x less than GPT-5.5, and ~19x less than Claude Opus 4.7**. Community reaction centered on DeepSeek’s push toward “**intelligence too cheap to meter**,” as [@scaling01](https://x.com/scaling01/status/2057835507858518178) put it. [@Yuchenj_UW](https://x.com/Yuchenj_UW/status/2057855546460676410) and [@kimmonismus](https://x.com/kimmonismus/status/2057868472965640194) both emphasized the magnitude of the cut.

**Gemini Flash improved, but usage feedback was mixed**: [@OfficialLoganK](https://x.com/OfficialLoganK/status/2057682092583227881) reported **Gemini 3.5 Flash** making major progress over **3.1 Pro on GDPval**, claiming Flash is now “competing at the frontier,” and [@Designarena](https://x.com/Designarena/status/2057885688125968660) placed it **16th overall** on Design Arena, a **16-position jump** from Gemini 3 Flash Preview. But several builders pushed back on usefulness vs benchmark gains: [@Alezander907](https://x.com/Alezander907/status/2057686331380359566) saw only slight browser-agent improvement at higher cost, [@giffmana](https://x.com/giffmana/status/2057714729762627950) argued this isn’t “Flash progress” if the brand still implies cheapness, and [@jeremyphoward](https://x.com/jeremyphoward/status/2057923197639840033) said the model feels optimized to **max evals rather than cooperate with humans**. That aligns with broader eval skepticism from [@HamelHusain](https://x.com/HamelHusain/status/2057875320011882923), who argued current tooling underweights qualitative, HITL judgment.

**Qwen and Chinese frontier models keep compressing the race**: The official [@Alibaba_Qwen](https://x.com/Alibaba_Qwen/status/2057767604048240987) teasers and a long third-party review from [@ZhihuFrontier](https://x.com/ZhihuFrontier/status/2057772126162354660) portrayed **Qwen3.7-Max** as a meaningful step up, especially in **instruction following, context reliability, and stability**, while still suffering from **verbosity and high token usage**. Elsewhere, [@scaling01](https://x.com/scaling01/status/2057937081070944709) claimed recent ALE-Bench runs show Chinese models like **Kimi-K2.6, DeepSeek-V4, GLM-5.1** outperforming several Western releases in that setting. [@ArtificialAnlys](https://x.com/ArtificialAnlys/status/2057914437156409577) also reported **Cursor Composer 2.5** as **3–18x cheaper than Opus 4.7** and **5–32x cheaper than GPT-5.5** on Coding Agent benchmarks, with notably lower token use.
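To make the DeepSeek-V4-Pro numbers above concrete, here is a minimal sketch of how a blended per-token price falls out of the three quoted rates. The token mix is an assumption chosen for illustration (a cache-heavy agent workload); the source does not say which weighting Artificial Analysis used, so this should not be read as their method.

```python
# Prices quoted in the text, in USD per million tokens.
PRICE_PER_M = {"input": 0.435, "output": 0.87, "cached_input": 0.0036}


def blended_price(mix):
    """Weighted-average USD price per million tokens for a token mix.

    `mix` maps each token type to its share of total tokens; shares must sum to 1.
    """
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"token shares must sum to 1, got {total}")
    return sum(PRICE_PER_M[kind] * share for kind, share in mix.items())


# Hypothetical cache-heavy workload (an assumption, not a published weighting).
assumed_mix = {"cached_input": 0.70, "input": 0.20, "output": 0.10}
print(f"${blended_price(assumed_mix):.3f} per million tokens")
```

Under this assumed mix the blend lands close to the quoted ~$0.18/M, which shows how strongly the cached-input rate pulls the average down; a workload with little cache reuse would blend to a much higher figure.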

**Protocols, Infra, and Agent Runtime Tooling**

**MCP’s new release candidate is a substantive protocol simplification**: [@dsp_](https://x.com/dsp_/status/2057780712187580924) announced the **MCP 2026-07-28 release candidate**, with the key change that the protocol is now **stateless**: **no handshake, no session ID, and any request can hit any server instance**. The RC also introduces **first-class extensions** like **MCP Apps** and **Tasks**, plus auth hardening and a clearer deprecation policy. For infra teams, statelessness is a big operational shift: easier scaling, simpler load balancing, fewer sticky-session concerns.

**Sandboxes and managed execution are becoming first-class primitives**: [@_philschmid](https://x.com/_philschmid/status/2057833963633418426) demoed **Gemini Managed Agents + Interactions API** to give an agent a secure hosted Linux sandbox with memory and code execution. [@CoreWeave](https://x.com/CoreWeave/status/2057852737073942634) launched **CoreWeave Sandboxes** in public preview for **RL, agent tool use, and model eval**, while [@cnakazawa](https://x.com/cnakazawa/status/2057823910574588238) released **Cloudsail** for per-task Cloudflare sandboxes with shell, Codex, and GitHub access without exposing tokens. At the orchestration layer, [@skypilot_org](https://x.com/skypilot_org/status/2057854003648598312) argued **RL doesn’t work on Slurm** because modern RL is a multi-service system with heterogeneous hardware and recovery needs.

**Open-source harnesses and memory layers are proliferating**: [@NVIDIAAI](https://x.com/NVIDIAAI/status/2057855521193881773) open-sourced **AI-Q agent skills** for portable deep-research pipelines that can plug into arbitrary harnesses. [@Teknium](https://x.com/Teknium/status/2057880570160701852) added **Bitwarden support** for key management in Hermes and later restored **256K context** for **Grok Build v0.1** in Hermes [here](https://x.com/Teknium/status/2057930638632812642). [@shannholmberg](https://x.com/shannholmberg/status/2057821004676956586) described a **shared-memory “gBrain” layer** under Hermes agents, with typed folders and read-first access for specialist agents. [@aakashadesara](https://x.com/aakashadesara/status/2057809590616461399) updated **CTOP** to support **Devin** and a CLI for listing, searching, and killing agent sessions.
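The operational claim about stateless MCP above (any request can hit any server instance, so no sticky sessions) can be illustrated with a toy sketch. The message shape, the `echo` method, and the `Instance` class are all hypothetical and are not the MCP wire format or any real SDK; the point is only that a handler which reads nothing but the request behaves identically on every replica.

```python
import itertools
import json


def handle_request(request: dict) -> dict:
    """Answer a request using only what the request itself carries.

    No session table or handshake state is consulted, so every
    replica running this function gives the same answer.
    """
    if request["method"] == "echo":
        return {"id": request["id"], "result": request["params"]["text"]}
    return {"id": request["id"], "error": f"unknown method: {request['method']}"}


class Instance:
    """A hypothetical server replica that keeps no per-client state."""

    def __init__(self, name: str):
        self.name = name

    def serve(self, raw: str) -> str:
        return json.dumps(handle_request(json.loads(raw)))


# A trivial round-robin "load balancer" with no sticky sessions.
replicas = itertools.cycle([Instance("a"), Instance("b"), Instance("c")])

requests = [
    json.dumps({"id": i, "method": "echo", "params": {"text": f"msg-{i}"}})
    for i in range(4)
]
responses = [json.loads(next(replicas).serve(raw)) for raw in requests]
print(responses)
```

Because consecutive requests from the same client land on different replicas and still succeed, the balancer needs no affinity rules, which is the scaling benefit the paragraph describes.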

**Research: RL, Distillation, Architectures, and Evaluation**

**RL post-training and reward design are under active reconsideration**: [@RyanBoldi](https://x.com/RyanBoldi/status/2057847412819906658) introduced **Vector Policy Optimization (VPO)**, arguing scalar reward collapse during RL can sabotage test-time scaling. VPO instead optimizes **vector-valued rewards**, improving search performance even on the original scalar objective. [@lateinteraction](https://x.com/lateinteraction/status/2057854814395019623) framed this as a way to train LLMs for more diverse environments and goals, while [@FeiziSoheil](https://x.com/FeiziSoheil/status/2057889865362993561) connected it to broader moves toward **structured feedback** instead of a single reward number. Separately, [@jsuarez](https://x.com/jsuarez/status/2057828106023703037) teased a solution to a long-standing RL problem involving extreme sparsity, with initial sweeps showing SOTA on one internal environment.

**Agent compilation/distillation is emerging as a serious economic idea**: [@dair_ai](https://x.com/dair_ai/status/2057846601843146760) highlighted a paper showing a **full agentic workflow** (multi-step calls, tool use, scratchpads, decision structure) can be **distilled into weights** and run at **~100x lower inference cost** while preserving near-frontier quality. This is one of the clearest technical arguments yet for compiling expensive runtime agent loops into cheaper deployable models.

**Architecture work remains lively beyond vanilla transformers**: [@ChunyuanDeng](https://x.com/ChunyuanDeng/status/2057826955236462715) introduced **LT2**, a **linear-time looped transformer** combining sparse and linear attention to make looping practical, along with a distilled **Ouro-hybrid-1.4B**. [@ZyphraAI](https://x.com/ZyphraAI/status/2057854519732847029) shared work extending **Equilibrium Propagation** beyond energy-based models toward biologically realistic neurons. On MoE, [@Jianlin_S](https://x.com/Jianlin_S/status/2057719868917793221) proposed **Moving Quantile Balancing** for **sequence-level load balancing without a loss penalty**. Meanwhile [@allen_ai](https://x.com/allen_ai/status/2057838486204326078) launched **ArtifactLinker**, which predicts which benchmarks a model is likely to set SOTA on before running them, a useful meta-eval tool amid growing benchmark sprawl.

**Math and reasoning capability discourse shifted again**: [@cozyblaze265065](https://x.com/cozyblaze265065/status/2057739317649588558) reported **99.46%** on a multi-digit multiplication experiment using **gpt-5.5** with medium reasoning and no tools, and [@teortaxesTex](https://x.com/teortaxesTex/status/2057826903721951273) noted modern LLMs can now do **100-digit multiplication** without tools. That’s not a complete theory of reasoning, but it further weakens old “autoregression can’t do arithmetic” talking points.
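The VPO item above turns on the idea that collapsing rewards to a single number throws away information. The sketch below is not the VPO algorithm, whose details the text does not give; it is a toy illustration, with hypothetical rollouts and reward axes, of how a scalar average can make very different behaviours indistinguishable while the reward vectors keep them apart.

```python
def dominates(a, b):
    """True if vector `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Names of candidates whose reward vectors no other candidate dominates."""
    return [
        name
        for name, vec in candidates.items()
        if not any(
            dominates(other, vec)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical rollouts scored on two made-up axes: (correctness, brevity).
rollouts = {
    "A": (1.0, 0.0),
    "B": (0.0, 1.0),
    "C": (0.5, 0.5),
    "D": (0.4, 0.4),
}

# Scalar collapse: average the components into one reward number.
scalar = {name: sum(vec) / len(vec) for name, vec in rollouts.items()}

print(scalar)
print(pareto_front(rollouts))
```

Here A, B, and C receive the same scalar reward even though they trade off the two axes in opposite ways, so a scalar-only learner has no signal to prefer or preserve any of them; the vector view still separates all three and discards only D, which C beats on both axes.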

**Multimodal Systems: Video, Speech, World Models, and Imaging**

**Google’s I/O stack pushed toward persistent agents and world simulators**: [@Google](https://x.com/Google/status/2057841803550683336) introduced **Gemini Spark**, a **24/7 personal AI agent** for recurring tasks, skills, and workflows. [@GoogleDeepMind](https://x.com/GoogleDeepMind/status/2057842131142590512) also launched **Project Genie + Street View**, letting users turn real U.S. locations into interactive worlds; follow-up posts confirm rollout to **Google AI Ultra** subscribers via Google Labs. The multimodal side was reinforced by [@Google](https://x.com/Google/status/2057881884219035752) announcing **Gemini Omni** for conversational video creation/editing and custom avatars, while [@emollick](https://x.com/emollick/status/2057874739817808223) emphasized the significance of a **fully multimodal** system that can natively edit video.

**Runway and image/video tooling keep raising editability**: [@runwayml](https://x.com/runwayml/status/2057826728769134599) released **Aleph 2.0**, supporting **multishot sequences up to 30s at 1080p** with targeted edits that preserve the rest of the scene. [@CuriousRefuge](https://x.com/CuriousRefuge/status/2057920807389806699) highlighted **SeeDance 2 Stitcher** for seamlessly extending AI-generated cinematic clips using Omni-generated continuations.

**Speech and image generation saw notable jumps**: [@ArtificialAnlys](https://x.com/ArtificialAnlys/status/2057878247782908109) ranked **Cartesia Sonic-3.5** as the new **#1 TTS model** on their Speech Arena, citing an **Elo of 1218**, support for **42 languages**, and strong naturalness/transcript following. Cartesia claims **82ms end-to-end first audio** in production [here](https://x.com/cartesia/status/2057880195403800633). In image generation, [@wildmindai](https://x.com/wildmindai/status/2057797994242523317) flagged Tencent’s **Z-Image 6B** as a **pixel-space generator** with **no VAE**, **1K resolution**, and a transfer framework for converting Flux/SD models; related ecosystem work included Pixal3D demos from [@victormustar](https://x.com/victormustar/status/2057752615396557225) and training support for **Z-Image L2P 1k** in AI Toolkit from [@ostrisai](https://x.com/ostrisai/status/2057931161889095928).

**Security, Cyber, and Policy Pressure**

**Cybersecurity is quickly becoming a proving ground for advanced agents**: [@AnthropicAI](https://x.com/AnthropicAI/status/2057909102542549503) said **Project Glasswing** and partners found **more than ten thousand high- or critical-severity vulnerabilities** in essential software within a month, and explicitly warned the industry will need to adapt to the volume of vulnerabilities that models like **Claude Mythos Preview** can find. Security productization is following: [@perplexity_ai](https://x.com/perplexity_ai/status/2057869990536360334) open-sourced **Bumblebee**, a read-only scanner for macOS/Linux to detect risky packages, extensions, and AI tool configs; [@AravSrinivas](https://x.com/AravSrinivas/status/2057873563156402448) said enterprise deployment will require **agentic sandboxes** plus continuous security engineering.

**US immigration policy changes triggered sharp backlash from AI leaders**: Several high-engagement posts argued a proposed rule forcing green-card applicants to apply from outside the US would directly damage the AI talent pipeline. See [@Nick_Davidov](https://x.com/Nick_Davidov/status/2057842593850118286), [@AndrewYNg](https://x.com/AndrewYNg/status/2057907324380217821), [@theo](https://x.com/theo/status/2057911377151582437), [@garrytan](https://x.com/garrytan/status/2057958284410380793), and [@togelius](https://x.com/togelius/status/2057912236262453607). The common argument: the rule punishes **legal high-skill immigrants**, undermines startups and research, and harms US competitiveness in AI.

**Top tweets (by engagement)**

[@deepseek_ai on making the V4-Pro discount permanent](https://x.com/deepseek_ai/status/2057854261699195173): the clearest single market signal in this batch around **LLM inference economics**.

[@gdb on “the model alone is no longer the product”](https://x.com/gdb/status/2057670776803996110): concise articulation of the current **agent/harness product thesis**.

[@AnthropicAI on Glasswing finding 10,000+ high- or critical-severity vulnerabilities](https://x.com/AnthropicAI/status/2057909102542549503): one of the strongest data points for **AI-driven cyber capability** moving into production.

[@dsp_ on MCP 2026-07-28 RC](https://x.com/dsp_/status/2057780712187580924): important protocol update, with **stateless MCP** plus first-class extensions.

[@GoogleDeepMind on Project Genie + Street View](https://x.com/GoogleDeepMind/status/2057842131142590512): notable step toward **consumer-facing world models**.

[@cursor_ai on opening the Cursor SDK for custom agents](https://x.com/cursor_ai/status/2057913121558413770): relevant for teams building on top of coding-agent infrastructure.

**AI Reddit Recap**

**/r/LocalLlama + /r/localLLM Recap**
