[AINews] All Model Labs are now Agent Labs

wpnews.pro

a quiet day lets us tie together a few quotes as all model labs become agent labs

Ahead of OpenAI’s likely IPO filing next week, Greg makes the latest in a series of comments where Model Labs are increasingly also building Agents as the product:

The quote is a big reversal of stance from a position ~uniformly held by anyone who worked at ** Team Big Model**, including

his previous head of OpenAI Labs: This comes with the shuttering of AI21’s model team, which is now pivoting to agents:

and even the venerable DeepSeek is now building a “Harness team” for the first time:

The “Systems over Models” people will take this as a point of validation of what they have been saying all along… except for the nuance that models cotrained with harnesses does open the door for closing access to models even further — if you can effectively posttrain a model to only meaningfully perform with your closed source agent, then you get to funnel the majority of users to your agent at the expense of your model/API co-opetition.

But that’s a topic of a much larger discussion…

AI News for 5/4/2026-5/5/2026. We checked 12 subreddits,

[544 Twitters]and no further Discords.[AINews’ website]lets you search all past issues. As a reminder,[AINews is now a section of Latent Space]. You can[opt in/out]of email frequencies!

AI Twitter Recap

Agent Products, Harnesses, and the Shift Beyond “Just the Model”

The product surface is moving up-stack: A recurring theme was that model quality alone is no longer the moat; the winning product is increasingly** model + harness + workflow + UI + memory + economics**.@gdbput it bluntly: “the model alone is no longer the product,” while@dzhngargued top-tier products needmodel <> harness <> product symbiosis. The same pattern shows up in practice:@signulllframed ambient AI and agentic AI as the new seam of computing interfaces, and@teortaxesTexnoted that harness research still risks converging on “replicate Claude Code” instead of exploring broader interfaces.Coding-agent product differentiation is becoming concrete: OpenAI shipped another substantial Codex update via“codex thursday no. 6”withappshots, /goal improvements, remote computer use while locked, annotation mode, plugin sharing, and analytics.@gdbseparately highlighted** Appshots**, while users reported meaningful workflow shifts:@gdbsaid it’s hard to remember coding before Codex, and@reach_vbsaid they haven’t opened an IDE in over a month. But product rough edges remain:@theopraisedT3 Code’s remote feature as ahead of alternatives, then contrasted it with buggy remote workflows in Codex in a follow-uppost. On the Claude side,@ClaudeDevsexpandedauto mode to the Pro plan and addedSonnet 4.6 support;@_mohansoloalso had to clarify and patch IDE support inAntigravity 2.0 after user backlash.

Model Performance, Cost Curves, and Frontier Competition

DeepSeek’s pricing move was the biggest market signal:@deepseek_aimade the** 75% DeepSeek-V4-Pro discount permanent**, triggering strong reactions because it materially changes the** cost/performance frontier**.@ArtificialAnlysquantified first-party pricing at**$0.435/M input, $0.87/M output, $0.0036/M cached input**, estimating a blended**~$0.18/M** and placing V4 Pro on the Pareto frontier for intelligence vs run cost. They estimate running their Intelligence Index on V4 Pro costs**~3x less than Gemini 3.1 Pro Preview, ~12x less than GPT-5.5, and ~19x less than Claude Opus 4.7**. Community reaction centered on DeepSeek’s push toward “** intelligence too cheap to meter**,” as@scaling01put it.@Yuchenj_UWand@kimmonismusboth emphasized the magnitude of the cut.Gemini Flash improved, but usage feedback was mixed:@OfficialLoganKreported** Gemini 3.5 Flashmaking major progress over 3.1 Pro on GDPval**, claiming Flash is now “competing at the frontier,” and@Designarenaplaced it16th overall on Design Arena, a16-position jump from Gemini 3 Flash Preview. But several builders pushed back on usefulness vs benchmark gains:@Alezander907saw only slight browser-agent improvement at higher cost,@giffmanaargued this isn’t “Flash progress” if the brand still implies cheapness, and@jeremyphowardsaid the model feels optimized tomax evals rather than cooperate with humans. That aligns with broader eval skepticism from@HamelHusain, who argued current tooling underweights qualitative, HITL judgment.Qwen and Chinese frontier models keep compressing the race: The official@Alibaba_Qwenteasers and a long third-party review from@ZhihuFrontierportrayedQwen3.7-Max as a meaningful step up, especially ininstruction following, context reliability, and stability, while still suffering from** verbosity and high token usage**. Elsewhere,@scaling01claimed recent ALE-Bench runs show Chinese models likeKimi-K2.6, DeepSeek-V4, GLM-5.1 outperforming several Western releases in that setting.@ArtificialAnlysalso reportedCursor Composer 2.5 as3–18x cheaper than Opus 4.7 and5–32x cheaper than GPT-5.5 on Coding Agent benchmarks, with notably lower token use.

Protocols, Infra, and Agent Runtime Tooling

MCP’s new release candidate is a substantive protocol simplification:@dsp_announced the** MCP 2026-07-28 release candidate**, with the key change that the protocol is now** stateless**:** no handshake, no session ID, and any request can hit any server instance**. The RC also introduces** first-class extensionslike MCP Appsand Tasks**, plus auth hardening and a clearer deprecation policy. For infra teams, statelessness is a big operational shift: easier scaling, simpler load balancing, fewer sticky-session concerns.Sandboxes and managed execution are becoming first-class primitives:@_philschmiddemoed** Gemini Managed Agents + Interactions APIto give an agent a secure hosted Linux sandbox with memory and code execution.@CoreWeavelaunchedCoreWeave Sandboxes** in public preview forRL, agent tool use, and model eval, while@cnakazawareleased** Cloudsailfor per-task Cloudflare sandboxes with shell, Codex, and GitHub access without exposing tokens. At the orchestration layer,@skypilot_orgarguedRL doesn’t work on Slurm** because modern RL is a multi-service system with heterogeneous hardware and recovery needs.Open-source harnesses and memory layers are proliferating:@NVIDIAAIopen-sourced** AI-Q agent skillsfor portable deep-research pipelines that can plug into arbitrary harnesses.@TekniumaddedBitwarden support** for key management in Hermes and later restored256K context forGrok Build v0.1 in Hermeshere.@shannholmbergdescribed ashared-memory “gBrain” layer under Hermes agents, with typed folders and read-first access for specialist agents.@aakashadesaraupdatedCTOP to supportDevin and a CLI for listing, searching, and killing agent sessions.

Research: RL, Distillation, Architectures, and Evaluation

RL post-training and reward design are under active reconsideration:@RyanBoldiintroduced** Vector Policy Optimization (VPO), arguing scalar reward collapse during RL can sabotage test-time scaling. VPO instead optimizes vector-valued rewards**, improving search performance even on the original scalar objective.@lateinteractionframed this as a way to train LLMs for more diverse environments and goals, while@FeiziSoheilconnected it to broader moves towardstructured feedback instead of a single reward number. Separately,@jsuarezteased a solution to a long-standing RL problem involving extreme sparsity, with initial sweeps showing SOTA on one internal environment.Agent compilation/distillation is emerging as a serious economic idea:@dair_aihighlighted a paper showing a** full agentic workflow**—multi-step calls, tool use, scratchpads, decision structure—can be** distilled into weightsand run at~100x lower inference cost** while preserving near-frontier quality. This is one of the clearest technical arguments yet for compiling expensive runtime agent loops into cheaper deployable models.Architecture work remains lively beyond vanilla transformers:@ChunyuanDengintroduced** LT2**, a** linear-time looped transformercombining sparse and linear attention to make looping practical, along with a distilled Ouro-hybrid-1.4B**.@ZyphraAIshared work extending** Equilibrium Propagationbeyond energy-based models toward biologically realistic neurons. On MoE,@Jianlin_SproposedMoving Quantile Balancing** forsequence-level load balancing without a loss penalty. Meanwhile@allen_ailaunched** ArtifactLinker**, which predicts which benchmarks a model is likely to set SOTA on before running them—a useful meta-eval tool amid growing benchmark sprawl.Math and reasoning capability discourse shifted again:@cozyblaze265065reported** 99.46%on a multi-digit multiplication experiment using gpt-5.5with medium reasoning and no tools, and@teortaxesTexnoted modern LLMs can now do100-digit multiplication** without tools. That’s not a complete theory of reasoning, but it further weakens old “autoregression can’t do arithmetic” talking points.

Multimodal Systems: Video, Speech, World Models, and Imaging

Google’s I/O stack pushed toward persistent agents and world simulators:@Googleintroduced** Gemini Spark**, a** 24/7 personal AI agentfor recurring tasks, skills, and workflows.@GoogleDeepMindalso launchedProject Genie + Street View**, letting users turn real U.S. locations into interactive worlds; follow-up posts confirm rollout to** Google AI Ultrasubscribers via Google Labs. The multimodal side was reinforced by@GoogleannouncingGemini Omni** for conversational video creation/editing and custom avatars, while@emollickemphasized the significance of afully multimodal system that can natively edit video.Runway and image/video tooling keep raising editability:@runwaymlreleased** Aleph 2.0**, supporting** multishot sequences up to 30s at 1080pwith targeted edits that preserve the rest of the scene.@CuriousRefugehighlightedSeeDance 2 Stitcher** for seamlessly extending AI-generated cinematic clips using Omni-generated continuations.Speech and image generation saw notable jumps:@ArtificialAnlysranked** Cartesia Sonic-3.5as the new#1 TTS model** on their Speech Arena, citing anElo of 1218, support for** 42 languages**, and strong naturalness/transcript following. Cartesia claims** 82ms end-to-end first audioin productionhere. In image generation,@wildmindaiflagged Tencent’sZ-Image 6B** as apixel-space generator withno VAE,** 1K resolution**, and a transfer framework for converting Flux/SD models; related ecosystem work included Pixal3D demos from@victormustarand training support forZ-Image L2P 1k in AI Toolkit from@ostrisai.

Security, Cyber, and Policy Pressure

Cybersecurity is quickly becoming a proving ground for advanced agents:@AnthropicAIsaid** Project Glasswingand partners found more than ten thousand high- or critical-severity vulnerabilitiesin essential software within a month, and explicitly warned the industry will need to adapt to the volume of vulnerabilities that models likeClaude Mythos Preview** can find. Security productization is following:@perplexity_aiopen-sourcedBumblebee, a read-only scanner for macOS/Linux to detect risky packages, extensions, and AI tool configs;@AravSrinivassaid enterprise deployment will requireagentic sandboxes plus continuous security engineering.US immigration policy changes triggered sharp backlash from AI leaders: Several high-engagement posts argued a proposed rule forcing green-card applicants to apply from outside the US would directly damage the AI talent pipeline. See@Nick_Davidov,@AndrewYNg,@theo,@garrytan, and@togelius. The common argument: the rule punisheslegal high-skill immigrants, undermines startups and research, and harms US competitiveness in AI.

Top tweets (by engagement) @deepseek_ai on making the V4-Pro discount permanent— the clearest single-market signal in this batch aroundLLM inference economics.@gdb on “the model alone is no longer the product”— concise articulation of the currentagent/harness product thesis.@AnthropicAI on Glasswing finding 10,000+ critical vulnerabilities— one of the strongest data points forAI-driven cyber capability moving into production.@dsp_ on MCP 2026-07-28 RC— important protocol update:stateless MCP plus first-class extensions.@GoogleDeepMind on Project Genie + Street View— notable step towardconsumer-facing world models.@cursor_ai on opening the Cursor SDK for custom agents— relevant for teams building on top of coding-agent infrastructure.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

Keep reading with a 7-day free trial #

Subscribe to Latent.Space to keep reading this post and get 7 days of free access to the full post archives.

source & further reading

latent.space — original article 5 Trends That Defined AI Engineering at World’s Fair 2026 [AINews] Codex usage up >10x in 6 months to 7M users, +1M in the past ~day; did Codex overtake Claude Code?? [AINews] OpenAI launches GPT 5.6 Sol/Terra/Luna, Codex becomes ChatGPT superapp

[AINews] All Model Labs are now Agent Labs

a quiet day lets us tie together a few quotes as all model labs become agent labs

Keep reading with a 7-day free trial #

Run your AI side-project on zahid.host