[AINews] Google I/O 2026: Gemini 3.5 Flash, Omni (NanoBanana for Video), Spark (background agents), and Antigravity 2.0

At Google I/O 2026, the company announced Gemini 3.5 Flash is now generally available, positioning it as its strongest model for agentic and coding workloads with 1M-token context and four thinking levels. Google also introduced Gemini Omni for multimodal video generation and editing, alongside Antigravity 2.0, a broader agent stack spanning desktop, CLI, SDK, and API. The announcements signal Google's push to reposition Gemini as both a consumer AI surface and a developer platform, with the company reporting 900 million monthly Gemini users and processing 3.2 quadrillion tokens per month.

Google has been busy

The full keynote livestream (https://www.youtube.com/watch?v=wYSncx9zLIU&pp=ygUJZ29vZ2xlIGlv) ran 2 hours, but as usual The Verge has the best supercut, down to 30 minutes, which is well worth watching to get a narrative sense.

The mainline Gemini 3.5 Flash is GA today (very nice compared to some staged rollouts) and is sold as a decent step up even compared to 3.1 Pro, with 3.5 Pro coming next month. Perhaps more impressive were the Gemini Live Voice, Omni Video, and Google Pics/Flow Images/VFX/music modalities, where Google demonstrated industry-leading capabilities and latency, all presumably made possible by industry-leading hardware and models. Per longstanding tradition at every big-tech keynote these days, Google also showed off some smart glasses tech, which seems a little more likely to be seen on the street than many prior iterations from both Google and its peers.

AI News for 5/18/2026-5/19/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space.

AI Twitter Recap

Google used I/O to reposition Gemini as both a consumer AI surface and a developer/agent platform, with three core technical announcements: Gemini 3.5 Flash for fast agentic/coding workloads, Gemini Omni for multimodal generation/editing starting with video, and a broader Antigravity agent stack spanning desktop/CLI/SDK/API.
Official posts emphasized scale: Google says it now processes over 3.2 quadrillion tokens/month, up 7x YoY from 480T/month, while the Gemini app has 900M+ monthly users and is available in 230+ countries and 70+ languages (Google https://x.com/Google/status/2056783102085640252, Google https://x.com/Google/status/2056783643381543253, GeminiApp https://x.com/GeminiApp/status/2056799446684578250). The most technically substantive release was Gemini 3.5 Flash, framed by Google as its strongest agentic/coding model yet, GA immediately, with 1M-token context, 65k max output, 4 thinking levels (“minimal/low/medium/high”), and “thought preservation” across turns (GoogleDeepMind https://x.com/GoogleDeepMind/status/2056787987774816525, Google https://x.com/Google/status/2056788266872140232, philschmid https://x.com/_philschmid/status/2056794978517750165). Google paired that with Gemini Omni, a new family combining Gemini reasoning with generative media, initially via Omni Flash, capable of taking text/image/video/audio inputs and producing video edits/generation in Gemini, Flow, Shorts, and later APIs (GoogleDeepMind https://x.com/GoogleDeepMind/status/2056786446636212467, Google https://x.com/Google/status/2056786781992071172, GeminiApp https://x.com/GeminiApp/status/2056800579159216202). Around those models, Google launched or expanded Antigravity 2.0 desktop, CLI, SDK, Managed Agents in the Gemini API, Search-native generative UI/coding, Gemini Spark background agents on cloud VMs, and a long list of Gemini-app/Workspace/commerce/media integrations (Google https://x.com/Google/status/2056789045548896516, Google https://x.com/Google/status/2056838495298367773, Google https://x.com/Google/status/2056791134295273554).

Facts vs. opinions

Facts / directly claimed by official or third-party benchmark sources

- Google says it now processes 3.2 quadrillion tokens/month, up from 480 trillion a year earlier (Google https://x.com/Google/status/2056783102085640252).
- Google says Gemini has 900M+ monthly users (Google https://x.com/Google/status/2056783643381543253).
- Google says Gemini 3.5 Flash is GA today across Gemini app, Search AI Mode, Gemini API, AI Studio, Antigravity, Android Studio, and enterprise surfaces (Google https://x.com/Google/status/2056791527314387208, GeminiApp https://x.com/GeminiApp/status/2056789742910595342).
- Google says Gemini 3.5 Flash has 1M context, 65k max output, 4 thinking levels, and “thought preservation” across turns (philschmid https://x.com/_philschmid/status/2056794978517750165).
- Google says 3.5 Flash beats Gemini 3.1 Pro on Terminal-Bench 2.1, GDPval-AA, and MCP Atlas (GoogleDeepMind https://x.com/GoogleDeepMind/status/2056787990110994511, Google https://x.com/Google/status/2056788281317306466).
- Google says 3.5 Flash runs 4x faster than comparable frontier models, and up to 12x faster in Antigravity (Google https://x.com/Google/status/2056788266872140232, JeffDean https://x.com/JeffDean/status/2056793419033588091).
- Independent benchmarker Artificial Analysis reports Gemini 3.5 Flash scores 55 on its Intelligence Index, +9 vs Gemini 3 Flash, at 280 output tok/s, with MMMU-Pro 84%, GDPval-AA Elo 1656, and pricing of $1.50 / $9.00 per 1M input/output tokens; it also reports the model is 5.5x costlier to run than Gemini 3 Flash on its suite and 75% costlier than Gemini 3.1 Pro (ArtificialAnlys https://x.com/ArtificialAnlys/status/2056795055512596817).
- Arena reports Gemini 3.5 Flash reached #9 overall in Text Arena and #9 in Code Arena: Frontend, scoring 1507, a +70 jump over Gemini 3 Flash, and becoming the top score in its price tier (arena https://x.com/arena/status/2056793176720195693).
- Google says Gemini Omni Flash is available in Gemini/Flow today for paid users, in Shorts/Create starting this week for free, and via APIs in coming weeks (Google https://x.com/Google/status/2056789307856462061).
- Google says Spark runs on dedicated Google Cloud virtual machines, allowing long-running tasks while user devices are closed (Google https://x.com/Google/status/2056791134295273554).
- Google claims an Antigravity + Gemini 3.5 Flash demo built a functioning OS in 12 hours using 93 parallel sub-agents, 15k+ model requests, 2.6B tokens, and < $1K API credits (Google https://x.com/Google/status/2056789235500466273).
- Google says Search will use Antigravity + 3.5 Flash to generate custom visual tools/simulations on the fly (Google https://x.com/Google/status/2056795269694423065).

Opinions / interpretations / skepticism

- Positive takes: “Google is back,” “insane evals for a Flash model,” “world model towards AGI,” “mind blowing” for Search + Antigravity, etc. (kimmonismus https://x.com/kimmonismus/status/2056791681073316071, Kseniase https://x.com/Kseniase_/status/2056798225378783656, demishassabis https://x.com/demishassabis/status/2056831486251380783).
- Neutral caution: some posters explicitly avoided overhyping due to self-reported benchmarks and noted pricing/perf concerns (scaling01 https://x.com/scaling01/status/2056794370909593987, simonw https://x.com/simonw/status/2056867815605625172).
- Negative/skeptical takes focused on:
  - Price inflation relative to earlier Flash models (enricoros https://x.com/enricoros/status/2056816088785289481).
  - Comparisons where GPT-5.5-medium may be smarter/cheaper/faster end-to-end (scaling01 https://x.com/scaling01/status/2056803273756000721, scaling01 https://x.com/scaling01/status/2056798645983334890).
  - Benchmark caveats such as weak TerminalBench-Hard, mediocre MRCR / ARC-AGI-2, or not clearly beating Kimi/GLM on some slices (scaling01 https://x.com/scaling01/status/2056796392899645919, teortaxesTex https://x.com/teortaxesTex/status/2056794752167645653, scaling01 https://x.com/scaling01/status/2056795648742076743).
  - Product naming/UX confusion around Gemini CLI vs Antigravity CLI and broader interface design criticism (zachtratar https://x.com/zachtratar/status/2056848643580482002, kchonyc https://x.com/kchonyc/status/2056826706984337726, teortaxesTex https://x.com/teortaxesTex/status/2056788641926509010).

Gemini 3.5 Flash: the main technical release

Official positioning

Google/DeepMind repeatedly described Gemini 3.5 Flash as the company’s strongest model yet for agents and coding, not its absolute flagship intelligence model. It’s meant to sit on the high-speed, high-utility part of the Pareto frontier, powering both Google products and developer workloads (GoogleDeepMind https://x.com/GoogleDeepMind/status/2056787987774816525, Google https://x.com/Google/status/2056788266872140232, SundarPichai https://x.com/sundarpichai/status/2056796893951426705).

Technical details and metrics

From Google and affiliated posts:

- GA availability now (Google https://x.com/Google/status/2056791527314387208)
- 1M token context window
- 65k max output tokens
- Thinking levels: minimal, low, medium (new default), high
- Thought preservation across multi-turn conversations
- Text output
- Input modalities: text, image, video, speech, per Artificial Analysis (philschmid https://x.com/_philschmid/status/2056794978517750165, ArtificialAnlys https://x.com/ArtificialAnlys/status/2056795055512596817)
- Pricing: $1.50 / 1M input, $9.00 / 1M output, 90% discount on cached input (scaling01 https://x.com/scaling01/status/2056793465715822720, ArtificialAnlys https://x.com/ArtificialAnlys/status/2056795055512596817)

Official benchmark claims:

- Terminal-Bench 2.1: 76.2%
- GDPval-AA: 1656 Elo
- MCP Atlas: 83.6%
- Google-quoted multimodal result: MMMU-Pro 83.6% in one engineer post; Artificial Analysis reports 84%, highest recorded on its setup (koraykv https://x.com/koraykv/status/2056795667088204234, ArtificialAnlys https://x.com/ArtificialAnlys/status/2056795055512596817)

Speed claims:

- Google marketing claim: 4x faster than comparable frontier models (Google https://x.com/Google/status/2056788266872140232)
- In Antigravity, Google says it is up to 12x faster (JeffDean https://x.com/JeffDean/status/2056793419033588091, scaling01 https://x.com/scaling01/status/2056790573961326680)
- Artificial Analysis observed 280 output tok/s
- Some discussion cited ~867 tok/s in Antigravity-specific optimized serving (scaling01 https://x.com/scaling01/status/2056790573961326680, scaling01 https://x.com/scaling01/status/2056791726677782743)

Third-party evaluation:

Artificial Analysis says 3.5 Flash is the leader on the intelligence-vs-speed Pareto frontier, but the economics are notably worse than prior Flash (ArtificialAnlys https://x.com/ArtificialAnlys/status/2056795055512596817):

- Intelligence Index 55, +9 over Gemini 3 Flash
- Hallucination rate reduced to 61%, a 31-point drop vs Gemini 3 Flash on its omniscience setup
- GDPval-AA 1656 Elo
- 5.5x costlier than Gemini 3 Flash to run on its benchmark suite
- 75% costlier than Gemini 3.1 Pro on the same suite

Arena (arena https://x.com/arena/status/2056793176720195693, arena https://x.com/arena/status/2056803661859479812):

- #9 Text Arena
- #9 Code Arena: Frontend
- 1507 score, +70 over Gemini 3 Flash
- Better than Gemini 3.1 Pro across categories in its frontend coding eval

Implications

The notable shift is that Google appears to be using a “Flash” label for a model that, in prior cycles, would have been described more like a high-end product model optimized for deployment rather than simply a cheap lightweight tier. Several posters called this out directly, arguing Flash is becoming more expensive and possibly absorbing former Pro territory (enricoros https://x.com/enricoros/status/2056816088785289481, simonw https://x.com/simonw/status/2056867815605625172).
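To make the pricing debate concrete, here is a minimal cost sketch using only the list prices quoted in this recap ($1.50 per 1M input tokens, $9.00 per 1M output tokens, 90% off cached input). The function name `request_cost` and the `cached_fraction` parameter are hypothetical. It assumes the cache discount is a simple multiplier on the input price and ignores anything else a real bill might include, such as how thinking tokens are counted.

```python
# Prices as quoted in the recap (USD per 1M tokens). Treat them as
# illustrative inputs, not an authoritative price sheet.
INPUT_PER_M = 1.50
OUTPUT_PER_M = 9.00
CACHED_DISCOUNT = 0.90  # assumption: cached input billed at 10% of input price


def request_cost(input_tokens: int, output_tokens: int,
                 cached_fraction: float = 0.0) -> float:
    """Estimate the USD cost of one request.

    cached_fraction is the share of input tokens served from cache (0.0-1.0).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    cost = (
        fresh * INPUT_PER_M
        + cached * INPUT_PER_M * (1.0 - CACHED_DISCOUNT)
        + output_tokens * OUTPUT_PER_M
    )
    return cost / 1_000_000


if __name__ == "__main__":
    # An agent step with a 100k-token prompt and a 5k-token reply.
    print(f"no cache:   ${request_cost(100_000, 5_000):.3f}")
    print(f"80% cached: ${request_cost(100_000, 5_000, 0.8):.3f}")
```

Under these assumptions the same step costs roughly half as much once most of the prompt is cached, which is one reason cache efficiency keeps coming up in discussion of iterative agent loops.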
The strongest technical signal is not “best absolute benchmark model,” but:

- material agentic gains
- extreme serving speed
- deep integration into product surfaces
- tooling built around subagents and long-horizon execution

That makes 3.5 Flash strategically important even if some competitors still win on raw price-adjusted intelligence in certain third-party comparisons.

Gemini Omni: multimodal generation/editing as “create anything from any input”

What Google announced

Google introduced Gemini Omni as a new family merging Gemini reasoning/world knowledge with Google’s generative media stack, starting with video creation and editing. Official messaging described it as “create anything from any input,” but current rollout is narrower (GoogleDeepMind https://x.com/GoogleDeepMind/status/2056786446636212467, Google https://x.com/Google/status/2056786395067552140, Google https://x.com/Google/status/2056789307856462061):

- Inputs: text, images, audio, video
- Initial output emphasis: video
- Product availability: Gemini app, Flow, YouTube Shorts/Create, later APIs
- Current shipping model: Gemini Omni Flash

Google/DeepMind claims (Google https://x.com/Google/status/2056786888930062369, Google https://x.com/Google/status/2056786589175677089):

- Better world understanding
- More robust physics
- Multi-turn editing where scene/character consistency is retained
- Ability to “reimagine” user video footage with conversational edits

Rollout specifics (Google https://x.com/Google/status/2056789307856462061, GeminiApp https://x.com/GeminiApp/status/2056814117047132301):

- Paid Gemini users globally in app/Flow “today”
- YouTube Shorts/Create rolling out “starting this week” at no cost
- APIs for developers/enterprise in coming weeks

Perspectives

- Supportive: users and Google employees described Omni as a major quality step, especially for video editing and consistency (joshwoodward https://x.com/joshwoodward/status/2056827449556845051, fofrAI https://x.com/fofrAI/status/2056789242274259242, osanseviero https://x.com/osanseviero/status/2056863263305105424).
- Strategic interpretation: several posters framed Omni as evidence Google is investing in world models and embodied/physical priors, not just text/code competition (demishassabis https://x.com/demishassabis/status/2056831486251380783, jparkerholder https://x.com/jparkerholder/status/2056789448554062232, kimmonismus https://x.com/kimmonismus/status/2056802929957568881).
- Skepticism: some UI/output examples drew criticism for looking like “B-tier video game interface” or too polished/template-like (teortaxesTex https://x.com/teortaxesTex/status/2056787895977980172, shlomifruchter https://x.com/shlomifruchter/status/2056858151987884087).

Context

Omni matters less as “yet another video model” and more as Google’s attempt to unify multimodal understanding, media editing, world grounding, agent interfaces, and eventually any-input/any-output generation. This aligns with DeepMind’s long-running world-model agenda and Google’s product distribution advantage.

Antigravity: Google’s agent OS, not just a coding assistant

A major underappreciated I/O theme was that Google is no longer presenting agents as a thin wrapper around a chat model. Antigravity is becoming the execution substrate.
What launched / expanded

- Antigravity 2.0 desktop app: agent-first desktop with core conversations, artifacts, multi-agent orchestration (Google https://x.com/Google/status/2056788868092006891, Google https://x.com/Google/status/2056838653855650286)
- Antigravity SDK (Google https://x.com/Google/status/2056789045548896516)
- Managed Agents in Gemini API: single API call gives an agent plus hosted Linux sandbox; supports Bash/Python/Node, files, browsing, custom markdown-defined skills, repo/GCS mounts (Google https://x.com/Google/status/2056838495298367773, GoogleAIStudio https://x.com/GoogleAIStudio/status/2056836824686059616, philschmid https://x.com/_philschmid/status/2056836567470362955)
- Integrations with AI Studio, Android, Firebase, Workspace, web (Google https://x.com/Google/status/2056789045548896516, Google https://x.com/Google/status/2056837910851449177)
- One-click export from AI Studio to Antigravity (Google https://x.com/Google/status/2056838913944424469)
- Native Android app generation in AI Studio / Android support in Antigravity (Google https://x.com/Google/status/2056838230591574098, AndroidDev https://x.com/AndroidDev/status/2056841786656711077)

Technical signaling

Google’s own demos centered on parallel sub-agents, hosted execution, high-frequency iterative loops, and artifact-oriented workflows. Jeff Dean explicitly described 3.5 Flash as a strong engine for “deploy sub-agents that collaborate, run high-frequency iterative loops, and solve real-world problems at scale” (JeffDean https://x.com/JeffDean/status/2056793419033588091).

The marquee proof point (Google https://x.com/Google/status/2056789235500466273):

- OS built in 12h
- 93 parallel sub-agents
- 15k+ requests
- 2.6B tokens
- < $1K credits

Even if this is mostly a stage-managed benchmark/demo, it reveals the architecture Google wants developers to adopt: many fast agents over one slow monolithic run.
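The “many fast agents over one slow monolithic run” pattern can be sketched as a bounded fan-out. The example below uses only Python's standard `asyncio`; it is not the Antigravity SDK or the Managed Agents API, neither of which is specified in this recap. `run_subagent`, `orchestrate`, and `run_demo` are hypothetical names, and the sub-agent body is a placeholder sleep standing in for a model call plus tool use. The default cap of 93 simply echoes the demo's sub-agent count.

```python
import asyncio


async def run_subagent(task: str) -> str:
    """Hypothetical stand-in for one sub-agent (model request + tool use)."""
    await asyncio.sleep(0.01)  # simulate a short, fast call
    return f"done: {task}"


async def orchestrate(tasks: list[str], max_parallel: int = 93) -> list[str]:
    """Fan tasks out to sub-agents, with at most max_parallel in flight."""
    gate = asyncio.Semaphore(max_parallel)

    async def bounded(task: str) -> str:
        async with gate:
            return await run_subagent(task)

    # asyncio.gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(t) for t in tasks))


def run_demo(n_tasks: int = 200) -> list[str]:
    """Run n_tasks placeholder subtasks through the orchestrator."""
    tasks = [f"subtask-{i}" for i in range(n_tasks)]
    return asyncio.run(orchestrate(tasks))


if __name__ == "__main__":
    results = run_demo()
    print(len(results), results[0])
```

The semaphore is the design choice worth noting: it caps concurrency so that throughput scales with the number of fast workers while staying inside whatever rate limits the serving side imposes.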
Reactions

- Positive: this is Google’s answer to Codex/Claude Code/OpenClaw/Hermes-style workflows, with a stronger infra story (iScienceLuvr https://x.com/iScienceLuvr/status/2056792158988816767, theo https://x.com/theo/status/2056826014739890204).
- Critical: branding and product sprawl remain confusing; some users aren’t sure whether they should use Gemini CLI or Antigravity CLI, and Google’s design choices drew complaints (kchonyc https://x.com/kchonyc/status/2056826706984337726, zachtratar https://x.com/zachtratar/status/2056848643580482002, teortaxesTex https://x.com/teortaxesTex/status/2056788641926509010).

Search, Gemini app, and consumer agents

Search

Google announced a redesigned AI-powered Search box, multimodal query support, and the most ambitious consumer-facing move: Search generating custom visual tools and simulations on the fly using Antigravity + Gemini 3.5 Flash (Google https://x.com/Google/status/2056793802141044786, Google https://x.com/Google/status/2056795269694423065). It also previewed information agents in Search:

- persistent monitoring tasks
- web/news/social/real-time signals
- synthesized updates with links and actions

This is a notable strategic shift: Search moves from retrieval/ranking to background agentic monitoring + generated applets.
Gemini app

Consumer Gemini updates included:

- new “Neural Expressive” design language (Google https://x.com/Google/status/2056799862604046663)
- inline/instant Gemini Live voice (Google https://x.com/Google/status/2056800029688352988)
- Daily Brief personalized digest from inbox/calendar/tasks (Google https://x.com/Google/status/2056801159071883342, GeminiApp https://x.com/GeminiApp/status/2056800978343764238)
- Gemini Spark as a 24/7 personal AI agent on cloud VMs, checking with users before major actions (Google https://x.com/Google/status/2056791134295273554, GeminiApp https://x.com/GeminiApp/status/2056801918018564538)
- macOS app + upcoming Spark/voice desktop workflows (Google https://x.com/Google/status/2056802434303869118, GeminiApp https://x.com/GeminiApp/status/2056802363269329304)

Pricing / subscriptions

Google introduced a new pricing ladder. This reads as a more aggressive bid for premium power users, especially coders and creators.

Trust, provenance, and standards

Google pushed SynthID across Search, Gemini, Chrome, and hardware/media surfaces, and announced partnerships with OpenAI, NVIDIA, Kakao, and ElevenLabs to bring SynthID to their generated content (Google https://x.com/Google/status/2056787498676658576, Google https://x.com/Google/status/2056787749965799508). That is one of the more consequential standards moves from I/O: it gives Google a shot at owning part of the provenance layer for generative media. Notably, OpenAI separately announced support for checking OpenAI-generated images via SynthID watermark + C2PA credentials (OpenAI https://x.com/OpenAI/status/2056793648571011232). This was less flashy than Omni/3.5 Flash, but likely more durable if provenance becomes mandatory infrastructure.
Google’s science and world-model angle

Several I/O items reinforced that Google does not want to compete only on coding/chat:

- Gemini for Science: Literature Insights, Hypothesis Generation, Computational Discovery (GoogleDeepMind https://x.com/GoogleDeepMind/status/2056808869242826957, Google https://x.com/Google/status/2056809034494124118)
- Nature publication links around ERA / Co-Scientist (GoogleResearch https://x.com/GoogleResearch/status/2056797037426045105, GoogleResearch https://x.com/GoogleResearch/status/2056857494107062718)
- Project Genie + Street View grounding, using ~20 years of maps imagery to create interactive real-location simulations (Google https://x.com/Google/status/2056850758029464009, poolio https://x.com/poolio/status/2056796361987850705, bilawalsidhu https://x.com/bilawalsidhu/status/2056804315721843024)

This broader context explains why some observers interpreted Omni as “world-model progress” rather than just a content tool (demishassabis https://x.com/demishassabis/status/2056831486251380783, jparkerholder https://x.com/jparkerholder/status/2056798252264018232).

Different opinions

Bullish / supportive

- Gemini 3.5 Flash viewed as a major leap for a speed-tier model, especially on agentic coding (kimmonismus https://x.com/kimmonismus/status/2056791681073316071, SundarPichai https://x.com/sundarpichai/status/2056796893951426705).
- Search + Antigravity seen as potentially transformative because Google can deploy generated UI/tools at enormous scale (Kseniase https://x.com/Kseniase_/status/2056798225378783656, TheTuringPost https://x.com/TheTuringPost/status/2056795871098913209).
- Omni praised for editing quality and for hinting at a deeper world-model roadmap (joshwoodward https://x.com/joshwoodward/status/2056827449556845051, kimmonismus https://x.com/kimmonismus/status/2056802929957568881).
Skeptical / opposing

- Concern that Google is leaning on self-reported benchmarks, and independent comparisons still leave room for competitors (scaling01 https://x.com/scaling01/status/2056794370909593987).
- Concern that “Flash” is no longer cheap enough to justify the name; pricing has climbed sharply from prior Flash generations (enricoros https://x.com/enricoros/status/2056816088785289481, simonw https://x.com/simonw/status/2056867815605625172).
- Some believed GPT-5.5-medium still dominates on a combined smart/cheap/latency basis (scaling01 https://x.com/scaling01/status/2056803273756000721).
- Some benchmark slices imply unevenness, e.g. poor TerminalBench-Hard or middling reasoning metrics despite strong agentic numbers (scaling01 https://x.com/scaling01/status/2056796392899645919, teortaxesTex https://x.com/teortaxesTex/status/2056794752167645653).

Neutral / analytical

- Artificial Analysis gave the strongest balanced take: excellent speed-intelligence frontier position, substantial agentic gains, but materially worse cost than prior Flash and even higher than 3.1 Pro on their end-to-end suite (ArtificialAnlys https://x.com/ArtificialAnlys/status/2056795055512596817).
- Arena’s data also supports a “real improvement, not just marketing” conclusion, especially for frontend/code tasks, without claiming category dominance (arena https://x.com/arena/status/2056793176720195693).

Why this matters

Google now has a coherent deployment story. Earlier Gemini cycles often felt benchmark-heavy and product-fragmented. At I/O, Google tied model, infra, tools, APIs, consumer surfaces, and enterprise rollout together.

The center of gravity is shifting from chatbot UX to agent execution. The important primitives were not just model IQ: they were subagents, hosted sandboxes, long-running tasks, generated artifacts, and integration with Search/Workspace/Android.

Gemini 3.5 Flash suggests “fast enough to orchestrate many agents” may matter more than max benchmark score.
For coding and tool use, throughput and latency are increasingly product-defining.

Omni reveals Google’s differentiation thesis. Google is betting on multimodal/world-grounded systems rather than purely text-centric competition.

Trust/provenance is becoming platform infrastructure. SynthID partnerships with OpenAI/NVIDIA/ElevenLabs/Kakao suggest some convergence around content-auth provenance layers.

The biggest unresolved question is economics. Technically strong or not, 3.5 Flash drew substantial pushback on cost inflation. If “Flash” is no longer the cheap workhorse tier, Google may win on capability deployment while losing some developer mindshare on predictability and pricing simplicity.

Talent, Labs, and Ecosystem Moves

Karpathy joins Anthropic: The day’s most engaged AI tweet was Andrej Karpathy’s announcement (https://x.com/karpathy/status/2056753169888334312) that he has joined Anthropic to “get back to R&D.” The tweet dominated discussion, with subsequent speculation from @scaling01 (https://x.com/scaling01/status/2056773883982762114), citing Axios, that he’ll work on RSI/autoresearch and start a new pretraining-focused effort. While the details remain unconfirmed by Anthropic, the move was widely interpreted as a major talent win for the company.

OpenAI capacity products: OpenAI announced Guaranteed Capacity (https://x.com/OpenAI/status/2056823271774101907), a commercial offering that lets customers secure long-term compute access for critical workloads. Sam Altman (https://x.com/sama/status/2056827105401614656) framed it as a response to a world that will remain capacity constrained as models become more useful, offering discounted tokens for 1–3 year commits.

GitHub and coding toolchain integrations: GitHub (https://x.com/github/status/2056801675042779279) said Gemini 3.5 Flash is rolling out in Copilot, citing strong tool use, fast response times, and cache efficiency for iterative agentic coding.
Cursor (https://x.com/cursor_ai/status/2056803731367456993) launched integration with Jira, allowing cloud agents to take work items and create merge-ready PRs. Code/VS Code (https://x.com/code/status/2056803208559759447) also announced Gemini 3.5 Flash availability.

Training Algorithms, Benchmarks, and Agent Evaluation

RL/post-training discussion is shifting toward denser credit assignment: @nrehiew_ (https://x.com/nrehiew_/status/2056751826356297834) argued that the next scalable training breakthrough may build on GRPO but with denser, lower-bias credit assignment, citing directions like ECHO, Composer2, self-distillation, and OPD. @lateinteraction (https://x.com/lateinteraction/status/2056770702175318095) countered with a “pedagogical RL” framing: train a self-teacher that samples correct and easy-to-follow rollouts.

Can coding agents do research? Not yet: Intology AI (https://x.com/IntologyAI/status/2056764236668493868) released NanoGPT-Bench, an autonomous benchmark based on the NanoGPT Speedrun competition, testing whether coding agents can contribute to real AI R&D progress. Their headline result: Codex, Claude Code, and Autoresearch recover only 9.3% of human progress, mostly via hyperparameter tuning rather than algorithmic innovation.

Agent harnesses and memory are getting more formalized: @omarsar0 (https://x.com/omarsar0/status/2056764334181884158) highlighted a 100+ page survey on code-as-agent-harness, arguing future systems need to be executable, inspectable, stateful, and governed. François Chollet (https://x.com/fchollet/status/2056777649880752160) made the related point that real tasks are rarely Markovian, so agents without high-fidelity trajectory compression are dramatically less useful.
Verifier quality is emerging as a bottleneck: Threads from @Shahules786 (https://x.com/Shahules786/status/2056773476585816255) emphasized that scaling agent benchmarks now depends less on adding tasks and more on improving verifier quality, citing SWE-bench Verified, OSWorld-Verified, ComputerRL, and BenchGuard.

Science, Biology Models, and Domain-Specific Systems

Hugging Face releases Carbon DNA models: One of the most technically interesting open releases was Carbon (https://x.com/lvwerra/status/2056774820872831234), a family of generative DNA foundation models. The team says Carbon-3B matches Evo2-7B while running 250–275x faster at inference, enough to process the whole human genome on a single GPU in under two days. The key recipe changes: deterministic 6-mer tokenization, a factorized loss (FNS) replacing plain cross-entropy late in training, and curated staged mixtures of functional DNA + mRNA data, per @LoubnaBenAllal1 (https://x.com/LoubnaBenAllal1/status/2056771927570530475). The release includes models, training code, evals, data, and a demo.

Google pushes AI for science as a product category: Google introduced Gemini for Science (https://x.com/GoogleDeepMind/status/2056808869242826957), a suite of prototypes for researchers: Literature Insights (paper synthesis via NotebookLM), Hypothesis Generation (a Co-Scientist-style multi-agent “idea tournament”), and Computational Discovery (built with AlphaEvolve and ERA to generate and score thousands of code variants in parallel). Google Research also noted that ERA has now been published in Nature (Google Research https://x.com/GoogleResearch/status/2056797037426045105).
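For readers unfamiliar with k-mer tokenization, the “deterministic 6-mer tokenization” mentioned in the Carbon item can be illustrated with a small sketch. The recap does not describe Carbon's actual scheme, so everything here is an assumption for illustration: non-overlapping 6-base chunks, a fixed A/C/G/T vocabulary of 4^6 = 4096 entries, and a single catch-all id for chunks containing ambiguous bases or a short tail. `tokenize`, `VOCAB`, and `UNK_ID` are hypothetical names.

```python
from itertools import product

K = 6
ALPHABET = "ACGT"

# 4**6 = 4096 possible 6-mers, each mapped to a fixed integer id.
# The mapping depends only on the alphabet order, so it is deterministic:
# no learned merges, unlike BPE-style tokenizers.
VOCAB = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}

# Assumption: one catch-all id for chunks with ambiguous bases (e.g. N)
# or a trailing chunk shorter than K.
UNK_ID = len(VOCAB)


def tokenize(seq: str) -> list[int]:
    """Split a DNA string into non-overlapping 6-mers and map each to an id."""
    seq = seq.upper()
    chunks = [seq[i:i + K] for i in range(0, len(seq), K)]
    return [VOCAB.get(chunk, UNK_ID) for chunk in chunks]


if __name__ == "__main__":
    print(tokenize("AAAAAAAAAAAC"))  # two 6-mers: AAAAAA, AAAAAC
    print(tokenize("acgtacN"))       # lowercase input, short tail with an N
```

Whatever the real details, a fixed k-mer scheme shortens sequences by a factor of k relative to single-base tokens, which is one plausible contributor to faster inference on long genomes.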
Specialized pretraining is gaining support: @pratyushmaini (https://x.com/pratyushmaini/status/2056780651219804582) pointed to evidence that early exposure / specialized pretraining improves robustness to forgetting, arguing that enterprises serious about domain use cases should consider training custom models from scratch, not just post-training.

Safety, Governance, and Monitoring of Internal Agents

METR’s first Frontier Risk Report: METR (https://x.com/METR_Evals/status/2056800023149760666) published a major new report based on unusually deep access across Anthropic, Google, Meta, and OpenAI, including model CoTs and non-public information about capabilities, alignment, and control. The report focuses on whether labs could lose control of their own internally deployed agents and includes extensive appendices and transcripts (METR https://x.com/METR_Evals/status/2056800047258649049).

Monitoring internal agents is now an active practice: @idavidrein (https://x.com/idavidrein/status/2056800422422265897) described spending a month embedded at Anthropic stress-testing systems designed to detect whether internal AI agents could “go rogue.” A key caveat he noted is that Anthropic had discretion to redact sensitive information, so he frames the work as an exercise rather than a formal audit.

New safety standards org: Steven Adler (https://x.com/sjgadler/status/2056762703033807068) announced Guidelight, a new AI safety standards organization co-founded with Page Hedley, releasing its first two standards. While the tweet thread in the dataset is partial, the move is notable as another sign of the field professionalizing around operational standards, not just model evals.
Top tweets by engagement

- Karpathy joins Anthropic: @karpathy (https://x.com/karpathy/status/2056753169888334312)
- Google introduces the Gemini 3.5 model series: @Google (https://x.com/Google/status/2056788000546386273)
- Google DeepMind launches Gemini Omni: @GoogleDeepMind (https://x.com/GoogleDeepMind/status/2056786446636212467)
- Gemini 3.5 Flash GA for agents and coding: @Google (https://x.com/Google/status/2056788266872140232)
- OpenAI Guaranteed Capacity: @OpenAI (https://x.com/OpenAI/status/2056823271774101907)
- Google’s 24/7 personal agent, Gemini Spark: @Google (https://x.com/Google/status/2056791134295273554)

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap