[AINews] Founders and Forward Deployed Engineers

Anthropic released Claude Opus 4.8, which independent benchmarks show as an incremental improvement with mixed results: better efficiency and less over-agentic behavior in coding, but regressions in content faithfulness and chart parsing. The company also shipped mid-conversation system instruction updates that do not break the prompt cache, a significant change for long-running agent sessions, though API pricing remains a major complaint from developers.

A quiet day lets us highlight the new AIE WF focuses.

Most people are still digesting the massive Anthropic news https://www.latent.space/p/ainews-anthropic-raises-965b-series from yesterday. We’re taking the opportunity to solicit the leading AI FDEs https://ai.engineer/cfp in the world for AIE’s new Forward Deployed Engineer track, mirroring similar pushes from both OpenAI DeployCo https://www.latent.space/p/ainews-thinking-machines-native-interaction and Anthropic DeployCo https://www.blackstone.com/news/press/anthropic-partners-with-blackstone-hellman-friedman-and-goldman-sachs-to-launch-enterprise-ai-services-firm/ , as well as for AIE’s new Founders program, where we are doing our version of the Startup Battlefield, a competitive pitch contest anchored by Y Combinator’s Garry Tan and Howie Liu’s $10 million Hyperagent https://x.com/howietl/status/2057823823526014990 contest. If you are keen, sign up and book your hotel today; see https://www.ai.engineer/worldsfair/2026 for venue details.

AI News for 5/28/2026-5/29/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space.
AI Twitter Recap

Claude Opus 4.8 Rollout, Benchmark Friction, and API Ergonomics

Opus 4.8 landed into a noisy, mixed eval landscape: multiple independent benches converged on “incremental but not dominant.” @arena https://x.com/arena/status/2060160804767584512 pushed 200+ frontend/code tests comparing Opus 4.8 against prior Opus variants, Gemini, and GLM; @theo https://x.com/theo/status/2060172445592789064 reported that CursorBench shows it as more efficient but slightly worse than 4.7, within margin of error; @jerryjliu0 https://x.com/jerryjliu0/status/2060196252642648427 and @llama_index https://x.com/llama_index/status/2060165358569337102 found small gains on tables/layout but regressions on content faithfulness/charts in document parsing; @scaling01 https://x.com/scaling01/status/2060335738172911766 said there was no progress on ALE-Bench and separately flagged interesting failure modes on LisanBench. On the positive side, @jeremyphoward https://x.com/jeremyphoward/status/2060195641847107722 found 4.8 less over-agentic and more cooperative than 4.7/GPT-5.5 in coding, while @leo_linsky https://x.com/leo_linsky/status/2060205310871326894 called it a tangible product improvement over prior Anthropic releases.

Anthropic also shipped useful platform-level changes: @ClaudeDevs https://x.com/ClaudeDevs/status/2060432688281251998 announced mid-conversation system instructions that do not break the prompt cache, plus authoritative mid-conversation system-role updates, which matters for long-running agent sessions and cost control. But pricing remains a major complaint: @jeremyphoward https://x.com/jeremyphoward/status/2060198836963061998 argued Anthropic has done little for API affordability, preferring GPT-5.5 partly because subscription/API economics are easier to justify. Overall takeaway: 4.8 looks like a meaningful quality-of-life release for real use, not a clean benchmark reset.
Agent Harnesses, Multi-Turn RL Bugs, and the Infrastructure Around Autonomy

A subtle but important RL failure mode got called out: @ClementDelangue https://x.com/ClementDelangue/status/2060175330665508917 highlighted a Hugging Face deep-dive on why many tool-using, multi-turn RL training loops are silently broken. The core bug: decoding model output, parsing tool calls, then re-tokenizing the updated conversation can change tokenization, so gradients are applied to sequences the model never actually sampled. The proposed fix is a strict “Token-In, Token-Out” rule: never re-encode sampled tokens; keep a single token buffer across turns. @johnschulman2 https://x.com/johnschulman2/status/2060392679528337714 reinforced the broader point that renderers are foundational infrastructure between messages and tokens, with failure modes spanning train/test mismatch, caching inefficiency, and prompt injection risk.

Harness design is becoming its own optimization discipline: @omarsar0 https://x.com/omarsar0/status/2060371848010019001 surfaced work on Effective Feedback Compute (EFC), claiming raw token/tool counts explain agent success poorly while EFC reaches R² up to 0.99, implying harness quality matters more than gross activity. This lines up with productized tuning efforts like @LangChain https://x.com/LangChain/status/2060349231722852680 , where Deep Agents v0.6 makes harness profiles first-class to get strong performance from Qwen/Kimi/DeepSeek at 20x+ lower cost than frontier APIs, and @hwchase17 https://x.com/hwchase17/status/2060355016989585919 explicitly framing it as “different models need different prompts/tools.” @vllm_project https://x.com/vllm_project/status/2060208480292843720 shipped native weight syncing APIs and improved pause/resume for async RL, and later added fastokens https://x.com/vllm_project/status/2060414393666679229 , a Rust BPE tokenizer to reduce CPU tokenization bottlenecks in long-context/agentic workloads.
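
To make the “Token-In, Token-Out” rule from this section concrete, here is a minimal sketch. It is not the Hugging Face implementation: the vocabulary, the greedy `encode`/`decode` helpers, and the `TokenBuffer` class are all hypothetical stand-ins for a real BPE tokenizer and rollout buffer. It shows how a decode-then-re-encode round trip can yield different token ids than the model sampled, and how a single append-only buffer avoids that.

```python
from dataclasses import dataclass, field

# Toy vocabulary (hypothetical): "ab" exists as a merged token next to "a" and "b".
VOCAB = ["a", "b", "ab", " ", "<tool>", "</tool>"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(text: str) -> list[int]:
    """Greedy longest-match tokenizer standing in for a real BPE tokenizer."""
    ids, pos = [], 0
    by_length = sorted(VOCAB, key=len, reverse=True)
    while pos < len(text):
        match = next((t for t in by_length if text.startswith(t, pos)), None)
        if match is None:
            raise ValueError(f"cannot tokenize at position {pos}: {text[pos:]!r}")
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def decode(ids: list[int]) -> str:
    return "".join(VOCAB[i] for i in ids)


@dataclass
class TokenBuffer:
    """Single token buffer kept across turns; sampled ids are never re-encoded."""
    ids: list[int] = field(default_factory=list)
    trainable: list[bool] = field(default_factory=list)  # True for model-sampled tokens

    def append_sampled(self, sampled_ids: list[int]) -> None:
        self.ids.extend(sampled_ids)
        self.trainable.extend([True] * len(sampled_ids))

    def append_text(self, text: str) -> None:
        # Only new, non-sampled text (prompts, tool results) is tokenized.
        new_ids = encode(text)
        self.ids.extend(new_ids)
        self.trainable.extend([False] * len(new_ids))


# The model samples "a" then "b" as two separate tokens.
sampled = [TOKEN_TO_ID["a"], TOKEN_TO_ID["b"]]

# Broken loop: decode the output, then re-tokenize the conversation text.
retokenized = encode(decode(sampled))

# Token-in, token-out: keep the sampled ids, append the tool result as new tokens.
buffer = TokenBuffer()
buffer.append_sampled(sampled)
buffer.append_text("<tool>b</tool>")

print("sampled:    ", sampled)
print("retokenized:", retokenized)
print("buffer ids: ", buffer.ids)
```

In this toy case the re-tokenized sequence collapses the two sampled tokens into the single merged token, so a loss computed on it would score a sequence the policy never produced. The `trainable` mask is one possible way to keep tool-result tokens out of the loss; that design choice is an assumption of this sketch.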
Debate is shifting from “single vs multi-agent” to where the abstraction pays: @OfirPress https://x.com/OfirPress/status/2060352260723392658 argued current multi-agent systems are mostly speedups, not capability unlocks; @scaling01 https://x.com/scaling01/status/2060363050272653625 took the opposite view, expecting swarm-style training to yield better planning and superintelligence-like behavior. Either way, the practical trend is clear: more teams are building around agent observability, traces, and continual improvement loops, e.g. @Vtrivedy10 https://x.com/Vtrivedy10/status/2060406006329278970 on mining production traces for SFT/distillation and long-horizon continual learning.

Open Models, Local AI, and the OSS Toolchain Tightening Up

Local-first and open-weight momentum continues to rise: @LangChain https://x.com/LangChain/status/2060405874993115532 said 1 in 3 AI teams ran an open-weights model in April 2026, up from 1 in 5 nine months earlier; @EpochAIResearch https://x.com/EpochAIResearch/status/2060451576779886942 estimated open-weight models now lag frontier proprietary models by about four months. On the toolchain side, @ggerganov https://x.com/ggerganov/status/2060394400237109567 launched llama.app, giving llama.cpp an official website, a unified installer, and a single llama entrypoint aimed at easier local deployment and third-party agent integration. @ollama https://x.com/ollama/status/2060428074102206496 announced OpenJarvis as a local-first personal AI via Ollama, explicitly tied to Stanford/Hazy’s “Intelligence Per Watt” framing.

Open infrastructure is getting more enterprise-shaped: @ClementDelangue https://x.com/ClementDelangue/status/2060378354931388837 noted that ~50% of models and datasets on Hugging Face are now private, rising with HF’s storage/buckets offering; this is an important correction to the idea that HF is only public OSS infrastructure.
@abidlabs https://x.com/abidlabs/status/2060404002341462044 showed Hugging Face Jobs replacing GitHub runners for CPU/serverless GPU CI. @DSPyOSS https://x.com/DSPyOSS/status/2060186371902587119 , @dbreunig https://x.com/dbreunig/status/2060187833084870746 , and others shipped a redesigned DSPy docs/front page ahead of a coming 4.0, focused on onboarding into programmable AI systems rather than pure prompting.

Licensing and permissiveness are becoming strategic levers: @kimmonismus https://x.com/kimmonismus/status/2060458698930016378 highlighted NVIDIA moving its four open model families to Linux Foundation OpenMDW-1.1, reducing legal fragmentation across weights/code/docs/data. New permissive data releases also matter: @keshigeyan https://x.com/keshigeyan/status/2060398262591668315 introduced GPIC, a 100M-pair permissive image corpus plus 1M-pair benchmark for visual generation, with explicit research + commercial usability.

Google/OpenAI Product Surface Expands: Managed Agents, Gemini Spark/Omni, and Codex on Windows

Google is widening the “managed agent” stack from API to consumer product: @_philschmid https://x.com/_philschmid/status/2060359976325992528 showed Managed Agents in the Gemini API: a single API call provisioning a sandboxed Linux environment with code execution, web access, and file I/O. On the consumer side, @GeminiApp https://x.com/GeminiApp/status/2060405496872579115 rolled out Gemini Spark to U.S. AI Ultra subscribers as a 24/7 personal agent that can operate across a user’s digital ecosystem under direction. Google also kept pushing Gemini Omni multimodal generation/editing demos (example https://x.com/alexanderchen/status/2060322611586834518 , product thread https://x.com/GeminiApp/status/2060473816393150965 ) and announced Google Flow Agent for creative workflows in video/film production (thread https://x.com/Google/status/2060473826362732611 ).
OpenAI’s Codex is moving closer to a persistent remote dev operator: @OpenAI https://x.com/OpenAI/status/2060428604727771421 and @OpenAIDevs https://x.com/OpenAIDevs/status/2060429591655927942 added computer use on Windows, including remote steering from the ChatGPT mobile app. Follow-on UX improvements included stable identicons for background agents and search across prior chat content (@OpenAIDevs https://x.com/OpenAIDevs/status/2060478367921831936 ); @reach_vb https://x.com/reach_vb/status/2060430024537178215 summarized broader Codex updates around Windows control, mobile remote access, and profile/task stats. Separately, OpenAI updated gpt-5.5 instant to reduce sycophancy and improve factuality and multilingual performance, per @michpokrass https://x.com/michpokrass/status/2060219759682330970 .

This all points to more vertically integrated agent stacks: model + harness + sandbox + UI + remote control + pricing/quotas. Google is smoothing quotas on Gemini (@joshwoodward https://x.com/joshwoodward/status/2060171610922058142 ); OpenAI is expanding Codex’s operating surface; Cursor added auto-review mode with subagent-based approval routing (tweet https://x.com/cursor_ai/status/2060406013098897765 ). The common pattern is less “chatbot,” more managed execution environment with policy and memory.

Research and Systems Papers Worth Attention

Search, retrieval, and memory: @TheTuringPost https://x.com/TheTuringPost/status/2060194173505155358 highlighted Bidirectional Evolutionary Search (BES) from Harvard/MIT, combining forward search with backward decomposition and evolutionary operators; reported gains include Llama-3.2-3B-Instruct on MuSiQue going from 4.0% to 7.0%. In retrieval, @_reachsumit https://x.com/_reachsumit/status/2060214762626306512 pointed to Latent Terms, showing sparse BM25-ready features can be extracted from frozen dense retrievers via SAEs.
@topk_io https://x.com/topk_io/status/2060383255153569938 open-sourced Iso-ModernColBERT for more efficient late-interaction inference.

Continual learning and belief/state management: @HuggingPapers https://x.com/HuggingPapers/status/2060312560323182657 summarized BeliefTrack, claiming optimized belief-state management cuts long-horizon reasoning failures by 70%+. @AndrewLampinen https://x.com/AndrewLampinen/status/2060460827199599026 argued the continual learning field over-focused on interference instead of positive transfer; @victor207755822 https://x.com/victor207755822/status/2060315686329778432 presented a second DeliAutoResearch SKILL paper focused on self-iteration and CL.

Multimodal/world models/robotics: NVIDIA-affiliated work included γ-World, a generative multi-agent world model streaming at 24 FPS (tweet https://x.com/fangfu0830/status/2060233093894869499 ), and minWM, a real-time interactive video world model framework (tweet https://x.com/_akhaliq/status/2060392729473860026 ). In robotics, @_akhaliq https://x.com/_akhaliq/status/2060388349425119540 shared Qwen-VLA, and @inventorOli https://x.com/inventorOli/status/2060357909561622885 demoed Robostral’s language-following and manipulation improvements. For always-on proactive agents, @dair_ai https://x.com/dair_ai/status/2060373102119555191 surfaced work replacing LLM wake-up decisions with a 220MiB temporal-graph encoder, gaining +16.7 mean F1 while running 4–83x faster.

Top tweets by engagement

OpenAI / biology: @OpenAI on Rosalind Biodefense https://x.com/OpenAI/status/2060376598642405492 announced trusted-access biology tooling for public health and biodefense.
Google / consumer agents: @GeminiApp on Spark https://x.com/GeminiApp/status/2060405496872579115 rolled out its always-on personal agent to AI Ultra users in the U.S.
OpenAI / dev tools: @OpenAI on Codex Windows support https://x.com/OpenAI/status/2060428604727771421 and @OpenAIDevs https://x.com/OpenAIDevs/status/2060429591655927942 expanded computer use to Windows plus mobile remote steering.
llama.cpp UX milestone: @ggerganov https://x.com/ggerganov/status/2060394400237109567 launched llama.app with a unified installer and CLI entrypoint for local AI.
HF / RL correctness: @ClementDelangue https://x.com/ClementDelangue/status/2060175330665508917 amplified the Token-In, Token-Out warning for multi-turn RL with tools.
Open vs closed timing gap: @EpochAIResearch https://x.com/EpochAIResearch/status/2060451576779886942 estimated open-weight models are now about 4 months behind the frontier.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Local LLM Performance: MoE Releases, Quants, VRAM Savings

StepFun 3.7 Flash https://www.reddit.com/r/LocalLLaMA/comments/1tqloii/stepfun_37_flash/ (Activity: 637): StepFun released Step 3.7 Flash https://static.stepfun.com/blog/step-3.7-flash/ , a multimodal MoE with 196B total parameters, 11B active, and a built-in 1.8B ViT, advertised for high-throughput agent workflows (up to 400 TPS) and reportedly runnable locally with ~128GB RAM. Reported benchmarks position it unusually strongly for a flash-class/local model: SWE-Bench Pro 56.26%, DeepSearchQA F1 92.82%, HLE w/tools 47.2, plus large gains over Step 3.5 Flash on Terminal-Bench, Toolathlon, ClawEval, and other agentic/tool-use tasks. Direct model artifacts are available on Hugging Face in BF16 https://huggingface.co/stepfun-ai/Step-3.7-Flash/ , FP8 https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8 , NVFP4 https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4 , and GGUF https://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF , with day-0 llama.cpp support (PR https://github.com/ggml-org/llama.cpp/pull/23845 ) and related MTP work in llama.cpp #23274.
Commenters characterize the model as technically odd: its hidden/thinking traces are described as nearly incoherent, but final answers can be “perfect” and competitive with much larger 1TB models; one user says the prior Step 3.5 “infinite thinking” issue appears fixed. There is cautious enthusiasm around local deployment, especially for users with 4x3090-class hardware, and appreciation that StepFun upstreamed llama.cpp support instead of only maintaining a fork.

StepFun released multiple Step-3.7-Flash checkpoints on Hugging Face: BF16 Step-3.7-Flash https://huggingface.co/stepfun-ai/Step-3.7-Flash/ , FP8 Step-3.7-Flash-FP8 https://huggingface.co/stepfun-ai/Step-3.7-Flash-FP8 , NVFP4 Step-3.7-Flash-NVFP4 https://huggingface.co/stepfun-ai/Step-3.7-Flash-NVFP4 , and GGUF Step-3.7-Flash-GGUF https://huggingface.co/stepfun-ai/Step-3.7-Flash-GGUF . One user reports the prior Step 3.5 Flash “infinite thinking” issue appears fixed, making 3.7 more usable despite still having an odd intermediate reasoning style.

There is day-0 llama.cpp enablement via StepFun’s upstream PR, ggml-org/llama.cpp #23845 https://github.com/ggml-org/llama.cpp/pull/23845 , in contrast with Step 3.5’s fork-based support. A separate community PR for MTP support exists at ggml-org/llama.cpp #23274 https://github.com/ggml-org/llama.cpp/pull/23274 , though commenters note it needs updating for Step 3.7 and current master.

A vLLM nightly test of the NVFP4 checkpoint on 2x Pro 6k with 64 concurrent shallow-context requests reached about 2200 tok/s. The reported config used --tensor-parallel-size 2, --enable-expert-parallel, --quantization modelopt, --kv-cache-dtype fp8, --reasoning-parser step3p5, and StepFun tool-call parsing; vLLM reported a GPU KV cache size of 1,667,645 tokens and a max concurrency of 6.36x for 262,144 tokens/request.
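
As a quick sanity check on the vLLM numbers reported for the NVFP4 test, the max-concurrency figure looks like the KV cache capacity divided by the per-request context length. The sketch below uses only the two values quoted in the report; the variable names are illustrative, and the division is an assumption about how the figure was derived, not a statement about vLLM internals.

```python
# Figures quoted from the commenter's vLLM log.
kv_cache_tokens = 1_667_645      # reported GPU KV cache size, in tokens
tokens_per_request = 262_144     # reported maximum tokens per request

# Assumed relationship: how many full-length requests fit in the KV cache at once.
max_concurrency = kv_cache_tokens / tokens_per_request
print(f"max concurrency: {max_concurrency:.2f}x")  # prints "max concurrency: 6.36x"
```

The 64 concurrent requests in the test exceed this figure only because they were shallow-context; full 262,144-token requests would be limited to roughly six at a time under this estimate.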