A quiet day before Google I/O lets us amplify a notable blog post.
It is the day before Google I/O, where the next major Gemini releases are expected to be previewed, so it will probably be a quiet week from competitors. Still, Anthropic and OpenAI both had minor wins today, and Cursor shipped its first SpaceXAI model with some nice detail on synthetic data, reward hacking, and continued pretraining with Muon. The most likely lasting title story from today, however, is Vlad Feinberg’s (understandably Google/TPU-centric) notes on job preparation, specifically for pretraining.
In particular, he references last year’s Scaling handbook from DeepMind, and kernel work is an important part: The biggest bottleneck and innermost loop of all LLM work is performance work that makes abstract, logical changes to the LLM practical to run. Every project needs people who can tune the LLMs at the kernel level. It is a skill you can pick up and is the most direct path into the labs.
There’s a surprise mention of DSLs for kernel dev, of which there is a concise history:
Surprisingly for someone working at this level of the stack, he also calls out agent work like autoresearch and AlphaEvolve. He ends with a deceptively simple exercise, but the real hiring test is in the bottom paragraphs:
Derive Chinchilla laws for this; see how they differ for dense vs MoE architectures. Code your solution from scratch in jax by hand if you actually want the learning experience.
Next, assuming you used jax.lax.ragged_dot for the MoE layer; write a pallas kernel that beats ragged dot for F > D by fusing the up/down projections. Find a setting where you notice a measurable forward pass speedup and explain why it’s there.
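As a starting point for the first part of the exercise, here is a minimal sketch of the dense compute-optimal split. It assumes two widely used rules of thumb that the post does not state: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and the roughly 20 tokens-per-parameter ratio commonly attributed to the Chinchilla paper. The function name and default are hypothetical. The MoE case (for example, whether N should count total or active parameters) is deliberately left open, since that is what the exercise asks you to work out.

```python
import math


def dense_compute_optimal(c_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens) under C ~= 6 * N * D.

    Assumptions (not from the post): the 6ND cost approximation and a
    fixed tokens-per-parameter ratio of about 20.
    """
    assert c_flops > 0 and tokens_per_param > 0
    # Substitute D = r * N into C = 6 * N * D and solve for N.
    n_params = math.sqrt(c_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Example budget: 6 * 70e9 params * 1.4e12 tokens (a Chinchilla-sized run).
budget = 6.0 * 70e9 * 1.4e12
n_params, n_tokens = dense_compute_optimal(budget)
print(f"params ~ {n_params:.3g}, tokens ~ {n_tokens:.3g}")
```

Fitting the ratio from your own training runs, instead of hard-coding 20, is the part that turns this into an actual scaling-law derivation.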
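For readers who have not used it, jax.lax.ragged_dot multiplies contiguous groups of rows by a per-group weight matrix, which is what a token-sorted MoE layer needs. The pure-Python reference below sketches only that semantics; it is not JAX code and not a Pallas kernel. The names ragged_dot_ref and moe_ffn_ref are hypothetical, ReLU is a stand-in activation, and the reference assumes the group sizes sum exactly to the number of rows. Note that the unfused version materializes the full [tokens, F] intermediate between the two projections, which is presumably the round trip a fused up/down kernel tries to avoid.

```python
def ragged_dot_ref(lhs, rhs, group_sizes):
    """Reference semantics: lhs is [M][K], rhs is [G][K][N], output is [M][N].

    Rows of lhs are split into G contiguous groups; every row in group g
    is multiplied by rhs[g].
    """
    assert len(rhs) == len(group_sizes), "one weight matrix per group"
    assert sum(group_sizes) == len(lhs), "simplification: groups cover all rows"
    out, row = [], 0
    for weights, size in zip(rhs, group_sizes):
        n_cols = len(weights[0])
        for _ in range(size):
            x = lhs[row]
            assert len(x) == len(weights), "inner dimensions must match"
            out.append([sum(x[k] * weights[k][j] for k in range(len(x)))
                        for j in range(n_cols)])
            row += 1
    return out


def moe_ffn_ref(tokens, w_up, w_down, group_sizes):
    """Unfused expert FFN: up projection, activation, down projection."""
    hidden = ragged_dot_ref(tokens, w_up, group_sizes)        # [T][F]
    hidden = [[max(0.0, h) for h in r] for r in hidden]       # ReLU stand-in
    return ragged_dot_ref(hidden, w_down, group_sizes)        # [T][D]


# Three tokens sorted by expert: two routed to expert 0, one to expert 1.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                 # D = 2
w_up = [[[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]],                   # expert 0, D x F
        [[1.0, 1.0, 0.0], [0.0, 0.0, -1.0]]]                  # expert 1, D x F
w_down = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],               # expert 0, F x D
          [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]]               # expert 1, F x D
result = moe_ffn_ref(tokens, w_up, w_down, group_sizes=[2, 1])
print(result)
```

A reference like this is mainly useful as a correctness oracle when comparing a hand-written kernel against the library path on small inputs.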
If you can teach this to the rest of the community, we’d love to feature you as a workshop speaker.
AI News for 5/16/2026-5/18/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!
AI Twitter Recap
Coding Agents, Agent Ops, and the Move from Chat to Automation
Agent infrastructure is converging on observability + automation loops: Several posts point to a maturing stack for production agents. LangSmith Engine is framed as the missing CI/CD loop for agents, automatically detecting failures from production traces, clustering issues, and drafting fixes/evals, with LangChain also highlighting SmithDB as a purpose-built data layer for agent observability/eval workloads with low-latency querying over large traces and self-hosting/multi-cloud requirements (@krishdpi, @LangChain). In parallel, Cognition launched Devin Auto-Triage, positioning it as an always-on “first responder” for bugs, alerts, and incidents with long-term memory, manager/subagent structure, and PR generation; early users like Modal describe it as more useful than typical homegrown triage automations (@cognition, @walden_yan, @russelljkaplan). The common pattern is less “chat with an agent” and more persistent automation tied to traces, memory, and evals.
Operational patterns for coding agents are getting more concrete: Anthropic published best practices for running Claude Code across multi-million-line monorepos, legacy systems, and microservices, while adding prompt cache diagnostics and making Fast mode default to Opus 4.7 for lower-latency coding workflows (@ClaudeDevs, @ClaudeDevs, @ClaudeDevs). OpenAI expanded Codex workflows with a Zoom plugin, mobile/desktop remote execution, and “keep your Mac awake” support so longer-running jobs continue from the phone app (@coreyching, @OpenAIDevs). Microsoft pushed remote control for GitHub Copilot CLI and VS Code to GA (@code). Across these, the product direction is clear: background execution, remote supervision, and agent fan-out, not just interactive completions.
Practitioners are converging on the same mental model: constrain, verify, decompose: François Chollet’s framing of coding agents as “blind squirrels” that need carefully placed verifiable constraints succinctly matches a broader shift toward harness-centric engineering (@fchollet). Related advice includes using asserts heavily in Python/ML code to fail fast (@gabriberton), building both end-to-end and incremental evals for long-running agents (@palashshah), and structuring multi-agent systems in staged maturity levels rather than maximizing agent count prematurely (@shannholmberg). The practical consensus: agent quality depends more on verification surfaces, decomposition, and feedback loops than on prompt cleverness alone.
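To make the "asserts to fail fast" advice concrete, here is a small sketch of the style in plain Python. It is illustrative only: checked_softmax is a hypothetical helper, not code from any of the cited posts, and the specific checks and tolerance are assumptions about what is worth guarding.

```python
import math


def checked_softmax(logits):
    """Softmax over a list of floats, with asserts guarding each stage."""
    # Fail at the boundary, before a bad value can propagate downstream.
    assert len(logits) > 0, "logits must be non-empty"
    assert all(math.isfinite(x) for x in logits), f"non-finite logit in {logits}"

    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Check the invariant the caller relies on.
    assert abs(sum(probs) - 1.0) < 1e-9, "probabilities must sum to 1"
    return probs


print(checked_softmax([2.0, 1.0, 0.1]))
```

The same checks give a coding agent a verifiable constraint: a NaN produces an immediate, located failure instead of a silently wrong result several steps later.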
Model Releases, Ranking Shifts, and Frontier Coding Models
Cursor’s Composer 2.5 is the standout model launch in this batch: Cursor announced Composer 2.5 as its strongest model yet, emphasizing better sustained work on long-running tasks and more reliable instruction following, then disclosed a deeper strategic move: training a much larger model from scratch with “SpaceXAI,” using 10× more total compute and access to Colossus 2’s million H100-equivalents (@cursor_ai, @cursor_ai). Community reactions centered on its efficiency/cost-performance profile and strong coding quality, with users calling it a major step up from Composer 2 and noting better collaboration behavior in messages/updates, not just raw benchmark gains (@mntruell, @jonas_nelle, @kimmonismus).
Alibaba’s Qwen line continues to climb: Qwen3.7 Preview landed on Arena with Qwen3.7 Max Preview at #13 overall in text, including #7 Math, #9 Expert, #9 Software & IT, and #10 Coding; Qwen3.7 Plus Preview reached #16 overall in vision, making Alibaba the #6 lab in text and #5 in vision by Arena’s counts (@arena, @Alibaba_Qwen). That reinforces the broader trend of Chinese labs steadily improving across both general and specialist arenas rather than only headline chat benchmarks.
Open model and multimodal releases continue below the mega-frontier: ByteDance open-sourced Lance, described as a unified multimodal model for image/video understanding, generation, and editing, with 3B video + 3B image + 3B decoder components (@bdsqlsz). Perplexity released a small open multilingual ColBERT model as a continued-training variant of pplx-embed-0.6b, with notes on using the MaxSim kernel (@bo_wangbo). These are not frontier-scale launches, but they are technically meaningful because they target retrieval quality and native multimodal unification, two areas where open tooling still matters.
Inference, Deployment, and Local/Enterprise Serving
Local inference got a notable speed boost via MTP in llama.cpp: Georgi Gerganov announced MTP support for the Qwen3.6 family in llama.cpp, calling it a significant milestone for local AI (@ggerganov). Follow-on reports showed meaningful throughput gains, including a Qwen3.6-27B dense jump from roughly 25 tok/s to 45 tok/s (a reported +78%) on an A10G using draft-MTP flags (@victormustar). This matters because it narrows the usability gap between local and hosted coding/general assistants on commodity hardware.
Enterprise/on-prem deployment momentum remains strong: Hugging Face and Dell promoted one-click access to models including Kimi K2.6, DeepSeek V4 Pro/Flash, GLM 5.1, and MiniMax M2.7 through Dell Enterprise Hub, optimized for PowerEdge XE9780 with NVIDIA B300 (@jeffboudier). Clement Delangue argued that on-prem/local AI based on open-source models will be an important answer to GPU shortages, with advantages in cost, latency, and safety/data control (@ClementDelangue).
Cross-hardware inference optimization is becoming more sophisticated: Zyphra published end-to-end inference benchmarks on AMD Instinct MI355X, claiming strong outperformance over AMD’s baseline and a narrowed gap to NVIDIA B200 when serving Kimi K2.6, GLM 5.1, and DeepSeek V3.2 (@ZyphraAI). Complementing that, Quentin Anthony posted a useful thread on why benchmarking needs to distinguish hardware ceilings vs current software state, arguing that many cross-stack comparisons conflate vendor maxes, achievable GEMM performance, and software maturity (@QuentinAnthon15). For infra engineers, that’s a strong reminder to treat benchmark charts as stack-dependent snapshots, not absolute truths.
Research: MoEs, RL/Data Mixing, Architecture Search, and Agent Evaluation
Several papers this week focused on better training signals rather than bigger models: A summary of LeCun/Timor et al.’s “On Training in Imagination” highlighted that in model-based RL, smoother world/reward models with low Lipschitz constants tighten error bounds; reward models often scale faster than dynamics models; and many noisy reward labels can beat fewer high-quality ones, while biased rewards are especially dangerous (@TheTuringPost). A separate thread on Pedagogical RL argued that even correct reasoning traces can be poor training data if they are too surprising relative to the student policy; the method uses a privileged teacher plus spike-aware rewards and surprisal-gated imitation to generate trajectories the student can actually learn from (@blc_16, @NoahZiems).
Architecture and scaling studies remain highly actionable: Meta’s AIRA work on agentic neural architecture discovery drew attention because it beats Llama 3.2 at 350M, 1B, and 3B scales within a 24-hour compute budget by splitting search into a planning agent (AIRA-Compose) and an implementation agent (AIRA-Design) (@omarsar0, @dair_ai). Separately, “Slicing and Dicing MoEs” reports training 2,000+ MoE LMs and concludes that much of the design space reduces to expert size and expert count rather than the noisier discourse around MoE configuration knobs (@margs_li).
Data selection and eval methodology are emerging as first-class research problems: On-Policy Mix targets the unsolved problem of finding the right data mix as data distributions keep shifting, with applicability across pretraining, midtraining, and instruction tuning (@michahu8). On evals, Cameron Wolfe published a guide to agent evaluation, and a longer Zhihu summary argued that the agent era requires measuring delegation intelligence (when to search, code, reason, or call tools) rather than only static knowledge or internal chain-of-thought prowess (@cwolferesearch, @ZhihuFrontier). That aligns closely with current product practice: the hard part is increasingly tool choice and verification policy, not text-only reasoning.
Ecosystem Moves: SDKs, Revenue Capture, and Open Tooling
Anthropic acquired Stainless: Anthropic announced the acquisition of Stainless, the SDK and MCP server platform that has powered Anthropic SDKs since early API days (@AnthropicAI). Strategically, this points to continued vertical integration around developer ergonomics, SDK generation, and protocol surfaces, not just model quality.
Revenue concentration around foundation model providers appears to be increasing: One post claimed that Anthropic and OpenAI’s share of AI model/application revenues generated by 34 top AI startups is rising, a signal that the ecosystem may be consolidating economically even as model choices proliferate (@amir).
Tooling and deployment curation remains in demand: The Turing Post’s roundup of 13 open-source tools for foundation model deployment (including vLLM, TGI, SGLang, llama.cpp, Ollama, BentoML, Kubeflow, MLflow and others) was one of the more practically useful curation posts in the set (@TheTuringPost). Meanwhile, Papers With Code is being revived with AI-agent-assisted parsing of methods, leaderboards, and SOTA tracking, underscoring renewed focus on research discoverability (@NielsRogge).
Top Tweets (by engagement)
Cursor’s Composer 2.5 + bigger training push: The highest-signal high-engagement product news was Composer 2.5 and Cursor’s disclosure that it is training a much larger model from scratch with 10× more compute (@cursor_ai, @cursor_ai).
OpenAI/Anthropic product updates with developer impact: Sam Altman said ChatGPT improved significantly with the latest update (@sama), while Anthropic shipped Fast mode defaulting to Opus 4.7 and prompt cache diagnostics in Claude Console (@ClaudeDevs, @ClaudeDevs).
Enduring research/engineering framing: Richard Sutton’s 26-word condensation of the Bitter Lesson (focus on methods for creating knowledge that scale with compute, like search and learning) was among the most engaged research-adjacent posts and resonated with many of the week’s themes around agent harnesses, search, and verifier-driven systems (@RichardSSutton).
AI Reddit Recap
/r/LocalLlama + /r/localLLM Recap
1. LLM Safety Benchmarks and Abliteration Forensics