[AINews] All Model Labs are now Agent Labs

All major AI model labs are pivoting to agent-focused products, with OpenAI's Greg Brockman reversing the industry's previous stance that the model alone is the product. AI21 has shuttered its model team to focus on agents, and DeepSeek is building a "Harness team" for the first time, signaling a fundamental industry shift where model quality is no longer the primary competitive advantage. The move toward "model + harness + workflow" products could allow companies to close access to models by requiring proprietary agent software for meaningful performance.

A quiet day lets us tie together a few quotes as all model labs become agent labs.

Ahead of OpenAI’s likely IPO filing https://aitoolsrecap.com/Blog/openai-ipo-2026-valuation-timeline-what-investors-need-to-know next week, Greg makes the latest in a series of comments showing that Model Labs are increasingly also building Agents https://www.latent.space/p/agent-labs as the product.

The quote is a big reversal of stance from a position ~uniformly held by anyone who worked at Team Big Model, including his previous head of OpenAI Labs https://x.com/CoreAutoAI/status/2056442820022747444.

This comes with the shuttering of AI21’s model team, which is now pivoting to agents, and even the venerable DeepSeek is now building a “Harness team” for the first time.

The “Systems over Models” people will take this as a point of validation of what they have been saying all along, except for one nuance: cotraining models with harnesses opens the door to closing access to models even further. If you can effectively posttrain a model to only meaningfully perform with your closed source agent, then you get to funnel the majority of users to your agent at the expense of your model/API co-opetition. But that’s a topic of a much larger discussion…

AI News for 5/4/2026-5/5/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies.

AI Twitter Recap

Agent Products, Harnesses, and the Shift Beyond “Just the Model”

The product surface is moving up-stack: A recurring theme was that model quality alone is no longer the moat; the winning product is increasingly model + harness + workflow + UI + memory + economics.
@gdb https://x.com/gdb/status/2057670776803996110 put it bluntly: “the model alone is no longer the product,” while @dzhng https://x.com/dzhng/status/2057748510947082539 argued top-tier products need model < harness < product symbiosis. The same pattern shows up in practice: @signulll https://x.com/signulll/status/2057850735048458639 framed ambient AI and agentic AI as the new seam of computing interfaces, and @teortaxesTex https://x.com/teortaxesTex/status/2057770692112798209 noted that harness research still risks converging on “replicate Claude Code” instead of exploring broader interfaces.

Coding-agent product differentiation is becoming concrete: OpenAI shipped another substantial Codex update via “codex thursday no. 6” https://x.com/ajambrosino/status/2057716220963803577 with appshots, /goal improvements, remote computer use while locked, annotation mode, plugin sharing, and analytics. @gdb https://x.com/gdb/status/2057802037757157838 separately highlighted Appshots, while users reported meaningful workflow shifts: @gdb https://x.com/gdb/status/2057704270531903811 said it’s hard to remember coding before Codex, and @reach_vb https://x.com/reach_vb/status/2057830243201622368 said they haven’t opened an IDE in over a month. But product rough edges remain: @theo https://x.com/theo/status/2057960907997876412 praised T3 Code’s remote feature as ahead of alternatives, then contrasted it with buggy remote workflows in Codex in a follow-up post https://x.com/theo/status/2057961165175873930. On the Claude side, @ClaudeDevs https://x.com/ClaudeDevs/status/2057946803685974482 expanded auto mode to the Pro plan and added Sonnet 4.6 support; @_mohansolo https://x.com/_mohansolo/status/2057910616153882949 also had to clarify and patch IDE support in Antigravity 2.0 after user backlash.

Model Performance, Cost Curves, and Frontier Competition

DeepSeek’s pricing move was the biggest market signal: @deepseek_ai https://x.com/deepseek_ai/status/2057854261699195173 made the 75% DeepSeek-V4-Pro discount permanent, triggering strong reactions because it materially changes the cost/performance frontier. @ArtificialAnlys https://x.com/ArtificialAnlys/status/2058021452465799403 quantified first-party pricing at $0.435/M input, $0.87/M output, $0.0036/M cached input, estimating a blended ~$0.18/M and placing V4 Pro on the Pareto frontier for intelligence vs run cost. They estimate running their Intelligence Index on V4 Pro costs ~3x less than Gemini 3.1 Pro Preview, ~12x less than GPT-5.5, and ~19x less than Claude Opus 4.7. Community reaction centered on DeepSeek’s push toward “intelligence too cheap to meter,” as @scaling01 https://x.com/scaling01/status/2057835507858518178 put it. @Yuchenj_UW https://x.com/Yuchenj_UW/status/2057855546460676410 and @kimmonismus https://x.com/kimmonismus/status/2057868472965640194 both emphasized the magnitude of the cut.

Gemini Flash improved, but usage feedback was mixed: @OfficialLoganK https://x.com/OfficialLoganK/status/2057682092583227881 reported Gemini 3.5 Flash making major progress over 3.1 Pro on GDPval, claiming Flash is now “competing at the frontier,” and @Designarena https://x.com/Designarena/status/2057885688125968660 placed it 16th overall on Design Arena, a 16-position jump from Gemini 3 Flash Preview. But several builders pushed back on usefulness vs benchmark gains: @Alezander907 https://x.com/Alezander907/status/2057686331380359566 saw only slight browser-agent improvement at higher cost, @giffmana https://x.com/giffmana/status/2057714729762627950 argued this isn’t “Flash progress” if the brand still implies cheapness, and @jeremyphoward https://x.com/jeremyphoward/status/2057923197639840033 said the model feels optimized to max evals rather than cooperate with humans.
That aligns with broader eval skepticism from @HamelHusain https://x.com/HamelHusain/status/2057875320011882923, who argued current tooling underweights qualitative, HITL judgment.

Qwen and Chinese frontier models keep compressing the race: The official @Alibaba_Qwen https://x.com/Alibaba_Qwen/status/2057767604048240987 teasers and a long third-party review from @ZhihuFrontier https://x.com/ZhihuFrontier/status/2057772126162354660 portrayed Qwen3.7-Max as a meaningful step up, especially in instruction following, context reliability, and stability, while still suffering from verbosity and high token usage. Elsewhere, @scaling01 https://x.com/scaling01/status/2057937081070944709 claimed recent ALE-Bench runs show Chinese models like Kimi-K2.6, DeepSeek-V4, and GLM-5.1 outperforming several Western releases in that setting. @ArtificialAnlys https://x.com/ArtificialAnlys/status/2057914437156409577 also reported Cursor Composer 2.5 as 3–18x cheaper than Opus 4.7 and 5–32x cheaper than GPT-5.5 on Coding Agent benchmarks, with notably lower token use.

Protocols, Infra, and Agent Runtime Tooling

MCP’s new release candidate is a substantive protocol simplification: @dsp_ https://x.com/dsp_/status/2057780712187580924 announced the MCP 2026-07-28 release candidate, with the key change that the protocol is now stateless: no handshake, no session ID, and any request can hit any server instance. The RC also introduces first-class extensions like MCP Apps and Tasks, plus auth hardening and a clearer deprecation policy. For infra teams, statelessness is a big operational shift: easier scaling, simpler load balancing, fewer sticky-session concerns.

Sandboxes and managed execution are becoming first-class primitives: @_philschmid https://x.com/_philschmid/status/2057833963633418426 demoed Gemini Managed Agents + Interactions API to give an agent a secure hosted Linux sandbox with memory and code execution.
@CoreWeave https://x.com/CoreWeave/status/2057852737073942634 launched CoreWeave Sandboxes in public preview for RL, agent tool use, and model eval, while @cnakazawa https://x.com/cnakazawa/status/2057823910574588238 released Cloudsail for per-task Cloudflare sandboxes with shell, Codex, and GitHub access without exposing tokens. At the orchestration layer, @skypilot_org https://x.com/skypilot_org/status/2057854003648598312 argued RL doesn’t work on Slurm because modern RL is a multi-service system with heterogeneous hardware and recovery needs.

Open-source harnesses and memory layers are proliferating: @NVIDIAAI https://x.com/NVIDIAAI/status/2057855521193881773 open-sourced AI-Q agent skills for portable deep-research pipelines that can plug into arbitrary harnesses. @Teknium https://x.com/Teknium/status/2057880570160701852 added Bitwarden support for key management in Hermes and later restored 256K context for Grok Build v0.1 in Hermes here https://x.com/Teknium/status/2057930638632812642. @shannholmberg https://x.com/shannholmberg/status/2057821004676956586 described a shared-memory “gBrain” layer under Hermes agents, with typed folders and read-first access for specialist agents. @aakashadesara https://x.com/aakashadesara/status/2057809590616461399 updated CTOP to support Devin and a CLI for listing, searching, and killing agent sessions.

Research: RL, Distillation, Architectures, and Evaluation

RL post-training and reward design are under active reconsideration: @RyanBoldi https://x.com/RyanBoldi/status/2057847412819906658 introduced Vector Policy Optimization (VPO), arguing scalar reward collapse during RL can sabotage test-time scaling. VPO instead optimizes vector-valued rewards, improving search performance even on the original scalar objective.
@lateinteraction https://x.com/lateinteraction/status/2057854814395019623 framed this as a way to train LLMs for more diverse environments and goals, while @FeiziSoheil https://x.com/FeiziSoheil/status/2057889865362993561 connected it to broader moves toward structured feedback instead of a single reward number. Separately, @jsuarez https://x.com/jsuarez/status/2057828106023703037 teased a solution to a long-standing RL problem involving extreme sparsity, with initial sweeps showing SOTA on one internal environment.

Agent compilation/distillation is emerging as a serious economic idea: @dair_ai https://x.com/dair_ai/status/2057846601843146760 highlighted a paper showing a full agentic workflow (multi-step calls, tool use, scratchpads, decision structure) can be distilled into weights and run at ~100x lower inference cost while preserving near-frontier quality. This is one of the clearest technical arguments yet for compiling expensive runtime agent loops into cheaper deployable models.

Architecture work remains lively beyond vanilla transformers: @ChunyuanDeng https://x.com/ChunyuanDeng/status/2057826955236462715 introduced LT2, a linear-time looped transformer combining sparse and linear attention to make looping practical, along with a distilled Ouro-hybrid-1.4B. @ZyphraAI https://x.com/ZyphraAI/status/2057854519732847029 shared work extending Equilibrium Propagation beyond energy-based models toward biologically realistic neurons. On MoE, @Jianlin_S https://x.com/Jianlin_S/status/2057719868917793221 proposed Moving Quantile Balancing for sequence-level load balancing without a loss penalty. Meanwhile @allen_ai https://x.com/allen_ai/status/2057838486204326078 launched ArtifactLinker, which predicts which benchmarks a model is likely to set SOTA on before running them, a useful meta-eval tool amid growing benchmark sprawl.

Math and reasoning capability discourse shifted again: @cozyblaze265065 https://x.com/cozyblaze265065/status/2057739317649588558 reported 99.46% on a multi-digit multiplication experiment using gpt-5.5 with medium reasoning and no tools, and @teortaxesTex https://x.com/teortaxesTex/status/2057826903721951273 noted modern LLMs can now do 100-digit multiplication without tools. That’s not a complete theory of reasoning, but it further weakens old “autoregression can’t do arithmetic” talking points.

Multimodal Systems: Video, Speech, World Models, and Imaging

Google’s I/O stack pushed toward persistent agents and world simulators: @Google https://x.com/Google/status/2057841803550683336 introduced Gemini Spark, a 24/7 personal AI agent for recurring tasks, skills, and workflows. @GoogleDeepMind https://x.com/GoogleDeepMind/status/2057842131142590512 also launched Project Genie + Street View, letting users turn real U.S. locations into interactive worlds; follow-up posts confirm rollout to Google AI Ultra subscribers via Google Labs. The multimodal side was reinforced by @Google https://x.com/Google/status/2057881884219035752 announcing Gemini Omni for conversational video creation/editing and custom avatars, while @emollick https://x.com/emollick/status/2057874739817808223 emphasized the significance of a fully multimodal system that can natively edit video.

Runway and image/video tooling keep raising editability: @runwayml https://x.com/runwayml/status/2057826728769134599 released Aleph 2.0, supporting multishot sequences up to 30s at 1080p with targeted edits that preserve the rest of the scene. @CuriousRefuge https://x.com/CuriousRefuge/status/2057920807389806699 highlighted SeeDance 2 Stitcher for seamlessly extending AI-generated cinematic clips using Omni-generated continuations.

Speech and image generation saw notable jumps: @ArtificialAnlys https://x.com/ArtificialAnlys/status/2057878247782908109 ranked Cartesia Sonic-3.5 as the new #1 TTS model on their Speech Arena, citing an Elo of 1218, support for 42 languages, and strong naturalness/transcript following. Cartesia claims 82ms end-to-end first audio in production here https://x.com/cartesia/status/2057880195403800633. In image generation, @wildmindai https://x.com/wildmindai/status/2057797994242523317 flagged Tencent’s Z-Image 6B as a pixel-space generator with no VAE, 1K resolution, and a transfer framework for converting Flux/SD models; related ecosystem work included Pixal3D demos from @victormustar https://x.com/victormustar/status/2057752615396557225 and training support for Z-Image L2P 1k in AI Toolkit from @ostrisai https://x.com/ostrisai/status/2057931161889095928.

Security, Cyber, and Policy Pressure

Cybersecurity is quickly becoming a proving ground for advanced agents: @AnthropicAI https://x.com/AnthropicAI/status/2057909102542549503 said Project Glasswing and partners found more than ten thousand high- or critical-severity vulnerabilities in essential software within a month, and explicitly warned the industry will need to adapt to the volume of vulnerabilities that models like Claude Mythos Preview can find. Security productization is following: @perplexity_ai https://x.com/perplexity_ai/status/2057869990536360334 open-sourced Bumblebee, a read-only scanner for macOS/Linux to detect risky packages, extensions, and AI tool configs; @AravSrinivas https://x.com/AravSrinivas/status/2057873563156402448 said enterprise deployment will require agentic sandboxes plus continuous security engineering.

US immigration policy changes triggered sharp backlash from AI leaders: Several high-engagement posts argued a proposed rule forcing green-card applicants to apply from outside the US would directly damage the AI talent pipeline.
See @Nick_Davidov https://x.com/Nick_Davidov/status/2057842593850118286, @AndrewYNg https://x.com/AndrewYNg/status/2057907324380217821, @theo https://x.com/theo/status/2057911377151582437, @garrytan https://x.com/garrytan/status/2057958284410380793, and @togelius https://x.com/togelius/status/2057912236262453607. The common argument: the rule punishes legal high-skill immigrants, undermines startups and research, and harms US competitiveness in AI.

Top tweets by engagement

@deepseek_ai on making the V4-Pro discount permanent https://x.com/deepseek_ai/status/2057854261699195173: the clearest single-market signal in this batch around LLM inference economics.

@gdb on “the model alone is no longer the product” https://x.com/gdb/status/2057670776803996110: concise articulation of the current agent/harness product thesis.

@AnthropicAI on Glasswing finding 10,000+ high- or critical-severity vulnerabilities https://x.com/AnthropicAI/status/2057909102542549503: one of the strongest data points for AI-driven cyber capability moving into production.

@dsp_ on MCP 2026-07-28 RC https://x.com/dsp_/status/2057780712187580924: important protocol update, with stateless MCP plus first-class extensions.

@GoogleDeepMind on Project Genie + Street View https://x.com/GoogleDeepMind/status/2057842131142590512: notable step toward consumer-facing world models.

@cursor_ai on opening the Cursor SDK for custom agents https://x.com/cursor_ai/status/2057913121558413770: relevant for teams building on top of coding-agent infrastructure.
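
A quick aside on the DeepSeek-V4-Pro numbers quoted earlier from @ArtificialAnlys: a "blended" price per million tokens is just a weighted average of the input, cached-input, and output prices. The sketch below shows that arithmetic. The three prices come from the recap above; the token mix and the names `PRICES`, `blended_price`, and `example_mix` are hypothetical and chosen only for illustration, since the recap does not state the weighting Artificial Analysis uses.

```python
# Prices in USD per million tokens, as quoted in the recap above.
PRICES = {"input": 0.435, "cached_input": 0.0036, "output": 0.87}


def blended_price(mix, prices=PRICES):
    """Return the weighted-average USD price per million tokens.

    `mix` maps each token type to its share of total tokens.
    The shares are an assumption supplied by the caller and must sum to 1.
    """
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"token shares must sum to 1, got {total}")
    return sum(prices[kind] * share for kind, share in mix.items())


# Hypothetical cache-heavy agent workload (NOT Artificial Analysis' weighting).
example_mix = {"cached_input": 0.70, "input": 0.20, "output": 0.10}

if __name__ == "__main__":
    print(f"blended: ${blended_price(example_mix):.4f} per million tokens")
```

With this assumed mix the result is about $0.1765 per million tokens, which lands near the quoted ~$0.18/M, but other mixes would too, so treat it as a way to sanity-check the order of magnitude rather than a reconstruction of the published estimate. The useful takeaway is how sensitive the blended figure is to the cache-hit share: shifting tokens from cached input to fresh input or output moves the average quickly.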