[AINews] Microsoft Build: MAI-Thinking-1 and MAI Family models

Microsoft announced seven new MAI models at its Build conference, including the flagship MAI-Thinking-1 reasoning model built with clean data lineage and no third-party distillation. The company released a 109-page technical report detailing the model's architecture, drawing praise from the research community for its transparency. Microsoft also used the event to position itself as both an AI platform company and a frontier-model lab, unveiling new Windows AI capabilities, agent-native features, and a Surface RTX Spark Dev Box.

Microsoft Build recap, and new MAI model technical details

Today was a big day, not least because we caught up on the state of GitHub vs Agents https://www.latent.space/p/github , and recorded a special pod with No Priors and Satya Nadella https://x.com/TheTuringPost/status/2061901518522188251?s=20 . At MS Build, Satya and Mustafa announced 7 new MAI models. This is an impressive lineup, especially considering that the Microsoft-Inflection deal that set up MAI https://news.smol.ai/issues/24-03-20-ainews-shipping-and-dipping-inflection-stability-edition only happened 2 years ago, and that these are all from-scratch pretrains. MAI today is by no means an unqualified frontier lab, but it is a good tier 2 neolab with obvious incentives to support domain-specific finetunes, as opposed to the frontier labs, who have ~all killed finetuning https://www.latent.space/p/ainews-the-end-of-finetuning .

The star of the show was the 100+ page MAI tech report https://microsoft.ai/wp-content/uploads/2026/06/main 20260602 2.pdf , which the research community is giving glowing reviews. You can catch up on all the rest of the announcement in the excellent Verge recap, and the tweet summaries below.

AI News for 6/1/2026-6/2/2026. We checked 12 subreddits, 544 Twitters and no further Discords.

AI Twitter Recap

Top Story: Microsoft Build recap, and new MAI model technical details

What happened

Microsoft used Build to position itself as both an AI platform company and a frontier-model lab, pairing broad product launches with unusually detailed disclosures about its new MAI model family.
Microsoft AI announced seven new MAI models spanning reasoning, code, image, speech transcription, and voice, led by MAI-Thinking-1, MAI-Code-1-Flash, MAI-Image-2.5, MAI-Transcribe-1.5, and MAI-Voice-2, according to @MicrosoftAI https://x.com/MicrosoftAI/status/2061887500541366489 and @mustafasuleyman https://x.com/mustafasuleyman/status/2061880164498428188
The flagship, MAI-Thinking-1, was presented as Microsoft’s first reasoning model, built with clean data lineage and zero distillation from third-party models, in posts from @mustafasuleyman https://x.com/mustafasuleyman/status/2061880164498428188 , @baseten https://x.com/baseten/status/2061878701823066431 , @tuhinone https://x.com/tuhinone/status/2061879239817969756 , and @HannaHajishirzi https://x.com/HannaHajishirzi/status/2061901432627044430
Microsoft released a 109-page technical report for MAI-Thinking-1, which drew strong positive reactions from technically oriented readers for its level of transparency, including @eliebakouch https://x.com/eliebakouch/status/2061877335960281459 , @ethanCaballero https://x.com/ethanCaballero/status/2061920873297088723 , @nrehiew_ https://x.com/nrehiew_/status/2062013300196700395 , @yacinelearning https://x.com/yacinelearning/status/2061914159235617056 , and @stochasticchasm https://x.com/stochasticchasm/status/2061916808626815161
Microsoft also emphasized local AI and agent-native Windows: Build messaging highlighted secure execution layers for agents, a new Surface RTX Spark Dev Box, Windows AI access to the broader Windows GPU install base, and concept hardware such as Project Solara/Scout, summarized by @yusuf_i_mehdi https://x.com/yusuf_i_mehdi/status/2061882543641907528 , @TheTuringPost https://x.com/TheTuringPost/status/2061865165734506683 , @kimmonismus https://x.com/kimmonismus/status/2061860319547527191 , and @kimmonismus https://x.com/kimmonismus/status/2061875714933371220
Build also included a major GitHub Copilot app push as the “desktop
home for agent-native software development,” with canvases, cross-device continuity, and tighter GitHub agent workflows, from @pierceboggan https://x.com/pierceboggan/status/2061868635241828688 , @lukehoban https://x.com/lukehoban/status/2061905434039246939 , and reactions from @techgirl1908 https://x.com/techgirl1908/status/2061870470237164018
Microsoft introduced Web IQ, a new grounding/search API stack for AI agents, claiming the APIs already power “nearly all AI agents and chatbots in the industry today, including Copilot and ChatGPT,” via @JordiRib1 https://x.com/JordiRib1/status/2061866606670581871
Satya Nadella framed Build as an ecosystem moment rather than a single-product launch, while Mustafa Suleyman framed it as the output of Microsoft’s internal “hill-climbing machine,” in @satyanadella https://x.com/satyanadella/status/2061896503304806521 , @mustafasuleyman https://x.com/mustafasuleyman/status/2061934667096596657 , and reaction from @nrehiew_ https://x.com/nrehiew_/status/2061983583523475556

MAI model family: disclosed facts and technical details

MAI-Thinking-1

Microsoft described MAI-Thinking-1 as a 35B active parameter MoE with a 256K context window in @mustafasuleyman https://x.com/mustafasuleyman/status/2061880164498428188
A separate summary from @scaling01 https://x.com/scaling01/status/2061889624847343825 says the model has 1T total parameters with 35B active ("1T@35B"), was pre-trained on 30T tokens, and was trained using 8192 GB200 GPUs; this appears to be a reading of the technical report rather than Microsoft marketing copy
@kimmonismus https://x.com/kimmonismus/status/2061877528781025381 similarly summarized it as a mid-size MoE with 45B active params, but this conflicts with Mustafa’s own 35B active figure; the more authoritative figure in the tweet set is the official 35B active number
Microsoft claims 97% on AIME 2025 and 53% on SWE-Bench Pro, with blind human raters on Surge preferring it overall to Sonnet 4.6, from @mustafasuleyman
https://x.com/mustafasuleyman/status/2061880164498428188 and @asadovsky https://x.com/asadovsky/status/2062008312603070891
Microsoft says the model is optimized on MAIA 200, with 30% better performance per dollar and 1.4x performance-per-watt gain versus GB200 when running MAI models end-to-end, per @mustafasuleyman https://x.com/mustafasuleyman/status/2061880164498428188
Microsoft and partners repeatedly stressed no third-party distillation, “clean data lineage,” and enterprise-controlled fine-tuning with “100% eyes-off” post-training data through Baseten, in @baseten https://x.com/baseten/status/2061878701823066431 , @tuhinone https://x.com/tuhinone/status/2061879239817969756 , and @MicrosoftAI https://x.com/MicrosoftAI/status/2061923309344756043

MAI-Code-1-Flash

Microsoft introduced MAI-Code-1-Flash as a fast coding model for VS Code and GitHub Copilot CLI, first announced by @pierceboggan https://x.com/pierceboggan/status/2061877165810131297 and later highlighted by @mariorod1 https://x.com/mariorod1/status/2061914993550143513
Official Microsoft messaging via @mustafasuleyman https://x.com/mustafasuleyman/status/2061880164498428188 says Code-1-Flash achieves 51% on SWE-Bench Pro despite having just 5B parameters, positioning it near Haiku-class size/cost
A competing summary from @scaling01 https://x.com/scaling01/status/2061891478176112794 describes it as a 137B parameter MoE, 256K context, trained on 10T+ tokens, and “stronger and more efficient than Claude 4.5 Haiku.” That likely indicates 5B active parameters rather than total parameters; the tweets do not fully reconcile this distinction, but together imply a small active footprint within a much larger MoE
Availability at launch was highlighted as GitHub Copilot / VS Code-first, per @scaling01 https://x.com/scaling01/status/2061891478176112794 and @mariorod1 https://x.com/mariorod1/status/2061914993550143513

MAI-Image-2.5

Microsoft launched MAI-Image-2.5 and a Flash variant, claiming both reached #2 on
leaderboards, with @mustafasuleyman https://x.com/mustafasuleyman/status/2061880164498428188 saying they surpass Nano Banana 2 on image editing
Independent leaderboard accounts supported the high ranking: @arena https://x.com/arena/status/2061887242579382660 reported #2 in Image Edit Arena with score 1401, +10 points over Nano Banana 2, Grok Imagine, and ChatGPT Image Latest HF
@arena https://x.com/arena/status/2061894541888962712 further said MAI-Image-2.5 “advances the Pareto frontier,” meaning no other model is both cheaper and higher-scoring on that benchmark
Distribution partners quickly followed, including @OpenRouter https://x.com/OpenRouter/status/2061894672847671724 and @fal https://x.com/fal/status/2061920052664820199

MAI-Transcribe-1.5

@ArtificialAnlys https://x.com/ArtificialAnlys/status/2061878491860324402 reported MAI-Transcribe-1.5 as an unusually strong speed/accuracy point on the STT frontier: ~276x realtime, 2.4% AA-WER, #3 overall on its leaderboard
The model supports 43 languages, including English, French, Arabic, Japanese, and Chinese, and supports keyword biasing for rarer terms such as names and medical terminology, per @ArtificialAnlys https://x.com/ArtificialAnlys/status/2061878491860324402
Pricing was reported as $6 per 1,000 minutes of audio via Microsoft Foundry in @ArtificialAnlys https://x.com/ArtificialAnlys/status/2061878498609053909
OpenRouter also listed the model among the three MAI launches it brought live the same day in @OpenRouter https://x.com/OpenRouter/status/2061894672847671724

MAI-Voice-2

MAI-Voice-2 appears in Microsoft’s “seven models” umbrella and in OpenRouter’s availability post at @OpenRouter https://x.com/OpenRouter/status/2061894672847671724
The tweet set contains little technical detail on Voice-2 itself beyond launch/availability

Technical-report details that mattered to researchers

Why the report stood out

The dominant technical reaction was that Microsoft released an unusually detailed frontier-model report:
@eliebakouch https://x.com/eliebakouch/status/2061965825037254947 called it “one of the most transparent for a model at this scale,” @nrehiew_ https://x.com/nrehiew_/status/2062023547690828141 said it “could really serve as an updated textbook for LLM training today,” and @stochasticchasm https://x.com/stochasticchasm/status/2061879506139557979 called it a “gold mine”
Multiple readers highlighted that the report disclosed pipeline details, scaling ladder methodology, data curation, infra metrics, and MFU numbers; this level of specificity is what drew praise from @ethanCaballero https://x.com/ethanCaballero/status/2061920873297088723 , @eliebakouch https://x.com/eliebakouch/status/2062004670017486912 , and @nrehiew_ https://x.com/nrehiew_/status/2062013300196700395

Pretraining and data

A major technical claim repeated across commentary is that MAI-Thinking-1 used no synthetic data and no distillation, not only in post-training but throughout the disclosed pipeline, from @eliebakouch https://x.com/eliebakouch/status/2061965825037254947 , @stochasticchasm https://x.com/stochasticchasm/status/2061967095022366924 , and @HannaHajishirzi https://x.com/HannaHajishirzi/status/2061901432627044430
@eliebakouch https://x.com/eliebakouch/status/2061977834558804207 says the report explicitly notes data from Common Crawl plus private sources, with targeted sub-pipelines for different domains, heavy extraction/dedup work, and an intentional choice of no synthetic data
The report’s internal private NLL set used for scaling decisions was summarized by @eliebakouch https://x.com/eliebakouch/status/2061976608265880004 as:
50% code
17.5% STEM
17.5% math
10% general knowledge
5% multilingual
@eliebakouch https://x.com/eliebakouch/status/2061976230933496176 says architecture promotion in the scaling ladder was based on an Efficiency Gain (EG) metric: how much extra compute the baseline would need to match the candidate’s loss
The same thread notes ablations at roughly 100/200 tokens per
parameter , described as around “Chinchilla optimal” for the setup, while also remarking this differs from dense-model heuristics due to MoE structure in @eliebakouch https://x.com/eliebakouch/status/2061975730414633043 Post-training / RL The most discussed technical choice was that Microsoft appears to have started RL from a checkpoint with no prior reasoning exposure , which several readers found notable. @stochasticchasm https://x.com/stochasticchasm/status/2061879070141677615 called this a “very interesting decision,” while @stochasticchasm https://x.com/stochasticchasm/status/2061878066314645861 reacted to graphs suggesting a jump from <20% AIME25 to 95% @HannaHajishirzi https://x.com/HannaHajishirzi/status/2061901432627044430 described the “climbing from scratch” recipe as simple recipes, rigorous science, self-distillation, patience, and great infra @soldni https://x.com/soldni/status/2061882085573616003 characterized the process as “climbing with no distillation, like the big boys do”Some independent readers inferred from the report that synth data remains very valuable for agentic performance in the broader field, even if Microsoft deliberately avoided it here; see @stochasticchasm https://x.com/stochasticchasm/status/2061961874879783376 Data curation / judges / DSPy GEPA A detail that got substantial attention from the DSPy/late-interaction crowd: Microsoft reportedly used GEPA / DSPy-optimized LLM judges in pretraining data curation and quality scoringThis was highlighted by @bj2rn https://x.com/bj2rn/status/2061941109828301241 , @LakshyAAAgrawal https://x.com/LakshyAAAgrawal/status/2062013650639241403 , and @lateinteraction https://x.com/lateinteraction/status/2062015109132873852 Infra / utilization / hardware co-design Microsoft reportedly disclosed exact MFU across iterations , which multiple readers said is rarely shared at this scale, per @eliebakouch https://x.com/eliebakouch/status/2061965825037254947 @scaling01 
https://x.com/scaling01/status/2061889624847343825 summarized the run as using 8192 GB200 GPUs @eliebakouch https://x.com/eliebakouch/status/2062004120098144764 singled out a reported ~40% higher throughput per watt -type figure as “pretty impressive and bullish on microsoft chips,” though this may refer to rack-level budget or serving configuration and was not fully unpacked in-tweetMicrosoft’s official framing connected model design to MAIA 200 custom silicon and emphasized better performance-per-dollar and performance-per-watt vs NVIDIA GB200 in @mustafasuleyman https://x.com/mustafasuleyman/status/2061880164498428188 Build’s broader Windows/local-AI narrative also centered on hardware specifics such as: 1 trillion parameters running locally on DGX Station 128GB unified memory 110 TOPS AI performance 20 CPU cores 70+ PowerToys utilities from @TheTuringPost https://x.com/TheTuringPost/status/2061852480636653924 Reactions also pointed to local runs of large models, e.g. @kimmonismus https://x.com/kimmonismus/status/2061852979318427988 on RTX Spark running a 120B parameter model locally Build product/platform recap beyond the models GitHub Copilot app and agent-native development GitHub unveiled the GitHub Copilot app , pitched as a desktop surface for agent-native software development by @pierceboggan https://x.com/pierceboggan/status/2061868635241828688 Key themes included: canvases for bidirectional work between users and agents, per @Techmeme https://x.com/Techmeme/status/2061875738694062419 continuity across CLI, mobile, web, local, and cloud , per @lukehoban https://x.com/lukehoban/status/2061905448287322243 a growing role for GitHub as the center of agent workflows, reflected in @techgirl1908 https://x.com/techgirl1908/status/2061870470237164018 and @OrenMe https://x.com/OrenMe/status/2061873010664001605 Copilot CLI also got an experimental terminal UI with tabs, built-in feedback/rubber duck, prompt scheduling, and voice input , per @GHchangelog 
https://x.com/GHchangelog/status/2061870684876272123 Windows as an agent runtime Microsoft’s Windows org framed Build around “faster developer execution, a secure execution layer for agents, and unmetered intelligence that runs locally on device,” per @yusuf i mehdi https://x.com/yusuf i mehdi/status/2061882543641907528 Several posts stressed that Microsoft wants Windows to be the trusted execution platform for agents, not just Azure @TheTuringPost https://x.com/TheTuringPost/status/2061865165734506683 described Project Solara as a platform for agent-first devices , with concepts including:a desktop AI companion a wearable badge with cameras, microphones, sensors, and secure authentication @kimmonismus https://x.com/kimmonismus/status/2061860319547527191 saw these as handheld/desktop devices for controlling agents and compared them to expectations people had for standalone OpenAI hardware @kimmonismus https://x.com/kimmonismus/status/2061875714933371220 separately highlighted Microsoft Scout as an “always-on personal agent for work” Web IQ and search for agents @JordiRib1 https://x.com/JordiRib1/status/2061866606670581871 announced Microsoft Web IQ as a suite of AI-native grounding APIs for web pages, news, images, and videos His framing is important context: classic search engines were built for humans, but Microsoft believes future search demand will come from agents, potentially 1000x more queries than human search trafficHe claimed Web IQ was re-architected from Bing’s stack for quality, latency, and token efficiency , and that it already powers major chatbots including Copilot and ChatGPT Foundry and open-model distribution @jeffboudier https://x.com/jeffboudier/status/2061868927207244277 said Satya cited 11,000+ models available in Microsoft Foundry , of which 10,928 come from Hugging FaceThis supports Microsoft’s parallel identity at Build: both a first-party model builder and a large multi-model hosting/distribution platform Build messaging around 
datacenters and compute Several observers noted Build discussion around data center expansion , community backlash, and Microsoft’s argument that AI infra can expand without raising electricity costs to local communities; see @kimmonismus https://x.com/kimmonismus/status/2061854806395015316 and @kimmonismus https://x.com/kimmonismus/status/2061903253890330639 @scaling01 https://x.com/scaling01/status/2061901702324695115 highlighted Mustafa saying AI compute will grow 1000x in the next 3 years , taking today’s rough 5e27 FLOPs frontier scale to 5e30 FLOPs by 2029 @mustafasuleyman https://x.com/mustafasuleyman/status/2061880029315764256 summarized the company’s philosophical theme as “Humanist superintelligence” Facts vs. opinions Factual claims in the tweet set Microsoft launched seven new MAI models at Build: @MicrosoftAI https://x.com/MicrosoftAI/status/2061887500541366489 Official metrics for MAI-Thinking-1: 35B active MoE , 256K context , 97% AIME 2025 , 53% SWE-Bench Pro , and blind human preference over Sonnet 4.6: @mustafasuleyman https://x.com/mustafasuleyman/status/2061880164498428188 Official metrics for MAI-Code-1-Flash: 51% SWE-Bench Pro , 5B parameters as stated in tweet copy: @mustafasuleyman https://x.com/mustafasuleyman/status/2061880164498428188 MAI-Image-2.5 ranking claims were independently echoed by @arena https://x.com/arena/status/2061887242579382660 MAI-Transcribe-1.5 speed/accuracy details came from independent benchmark account @ArtificialAnlys https://x.com/ArtificialAnlys/status/2061878491860324402 Microsoft released a 109-page technical report : @eliebakouch https://x.com/eliebakouch/status/2061877335960281459 Opinions / interpretations “Microsoft is training serious models now?” from @teortaxesTex https://x.com/teortaxesTex/status/2061892492350407158 is an interpretive reaction to the model/report quality, not a standalone factClaims that the report is “one of the most transparent” or “an updated textbook” are opinions from @eliebakouch 
https://x.com/eliebakouch/status/2061965825037254947 and @nrehiew https://x.com/nrehiew /status/2062023547690828141 , albeit shared by many readers @kimmonismus https://x.com/kimmonismus/status/2061852480636653924 and @TheTuringPost https://x.com/TheTuringPost/status/2061865165734506683 framed Build as a strategic shift from cloud-only AI toward local reasoning/agents; that is analysis rather than official wordingPosts claiming Microsoft “leaked” Anthropic Mythos FLOPs, including @swyx https://x.com/swyx/status/2061878629504881151 and @scaling01 https://x.com/scaling01/status/2061897540161728791 , are speculative interpretations of a slide, later contested by the same cluster of commenters Different opinions and perspectives Supportive views Technical readers were broadly impressed by the report’s transparency and Microsoft’s willingness to publish details usually withheld at this scale: @eliebakouch https://x.com/eliebakouch/status/2061965825037254947 , @nrehiew https://x.com/nrehiew /status/2062023547690828141 , @ethanCaballero https://x.com/ethanCaballero/status/2061920873297088723 , @stochasticchasm https://x.com/stochasticchasm/status/2061916808626815161 Some saw MAI-Thinking-1 as proof Microsoft is becoming a genuine frontier lab rather than just a model reseller or application layer, e.g. 
@teortaxesTex https://x.com/teortaxesTex/status/2061892492350407158 , @echen https://x.com/echen/status/2061907282607100075 , @NandoDF https://x.com/NandoDF/status/2061901884042985728 Enterprise/platform supporters liked the clean-data-lineage , fine-tunable , eyes-off post-training data story, especially Baseten/Microsoft’s positioning around ownership and control: @baseten https://x.com/baseten/status/2061878701823066431 , @tuhinone https://x.com/tuhinone/status/2061879239817969756 Neutral / analytical views Several posts focused on reading and unpacking the report rather than cheering the launch, especially @stochasticchasm https://x.com/stochasticchasm/status/2061916808626815161 , @nrehiew https://x.com/nrehiew /status/2062013300196700395 , and @eliebakouch https://x.com/eliebakouch/status/2061965825037254947 Some commentators were careful on benchmark interpretation. @kimmonismus https://x.com/kimmonismus/status/2061918020843557110 noted Microsoft appeared to compare to Sonnet 4.6 generally, with Opus-level comparability only on SWE Pro @iScienceLuvr https://x.com/iScienceLuvr/status/2061926066453962952 specifically appreciated reporting on health benchmarks such as HealthBench Professional and MedXpertQA rather than only coding/math Skeptical / opposing views A subset questioned whether all numbers and comparisons were being interpreted correctly, especially around active params and external-model comparisons The most visible skepticism concerned the apparent Mythos FLOP “leak” . 
@iScienceLuvr https://x.com/iScienceLuvr/status/2061882397340393514 suggested it was probably just an estimate, not a leak; @scaling01 https://x.com/scaling01/status/2061989029025853757 later argued the original 6.1e27 FLOP figure was unrealistic and supplied a lower alternative estimate before posting a correction in @scaling01 https://x.com/scaling01/status/2061990840138899674 There was also implicit skepticism in the field about whether zero synth / zero distillation is the right long-term recipe for best agentic performance, as noted by readers emphasizing synth-data deltas elsewhere, e.g. @stochasticchasm https://x.com/stochasticchasm/status/2061961874879783376 Context: why this matters Build’s announcements matter because they suggest Microsoft is no longer content with being only: Azure/OpenAI’s cloud host GitHub’s developer surface Copilot’s application shell It is also trying to be a first-party frontier model developer with its own model family, silicon stack, and post-training platform The clean lineage / no distillation emphasis is strategically significant. It addresses enterprise concerns around IP provenance, future controllability, and dependence on external labsThe local AI emphasis matters because Microsoft is tying AI strategy to Windows and device distribution, not just to Azure. Build messaging repeatedly pushed the idea that reasoning models, planners, and agents can increasingly run on-device , not only in the cloud: @TheTuringPost https://x.com/TheTuringPost/status/2061852480636653924 , @yusuf i mehdi https://x.com/yusuf i mehdi/status/2061882543641907528 The 109-page report matters because frontier-model transparency has generally been shrinking, especially around data, infra, and training methodology. 
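
As a rough illustration of why disclosed training details are useful, the figures quoted earlier in this recap (35B active parameters, 1T total, 30T pre-training tokens) are enough for a back-of-the-envelope compute estimate. The sketch below assumes the common C ≈ 6·N·D rule of thumb with N taken as active parameters; that approximation, the helper name `training_flops`, and the constant names are assumptions of this example, not something taken from the MAI report, and the result covers pre-training only.

```python
"""Back-of-the-envelope pre-training compute from figures quoted in this recap."""


def training_flops(active_params: float, tokens: float) -> float:
    # Rule of thumb (an assumption here, not from the report):
    # roughly 6 FLOPs per active parameter per training token.
    return 6.0 * active_params * tokens


# Figures as quoted in the tweet set; none are independently verified.
ACTIVE_PARAMS = 35e9      # 35B active parameters (official figure)
TOTAL_PARAMS = 1e12       # 1T total parameters (third-party summary)
PRETRAIN_TOKENS = 30e12   # 30T pre-training tokens (third-party summary)
FRONTIER_FLOPS = 5e27     # "rough frontier scale" attributed to Mustafa Suleyman

mai_flops = training_flops(ACTIVE_PARAMS, PRETRAIN_TOKENS)
tokens_per_active_param = PRETRAIN_TOKENS / ACTIVE_PARAMS
tokens_per_total_param = PRETRAIN_TOKENS / TOTAL_PARAMS
frontier_ratio = FRONTIER_FLOPS / mai_flops

print(f"estimated pre-training compute: {mai_flops:.2e} FLOPs")
print(f"tokens per active parameter:    {tokens_per_active_param:.0f}")
print(f"tokens per total parameter:     {tokens_per_total_param:.0f}")
print(f"quoted frontier scale / estimate: {frontier_ratio:.0f}x")
```

Under these assumptions the estimate lands near 6.3e24 FLOPs, which is why the quoted 5e27 frontier figure and the contested Mythos numbers sit in a different compute class; the rule of thumb ignores RL, ablations, and failed runs, so treat it as an order-of-magnitude check only.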
Multiple researchers explicitly noted the disclosure level is uncommon at this scale: @eliebakouch https://x.com/eliebakouch/status/2061965825037254947 , @nrehiew https://x.com/nrehiew /status/2062023547690828141 The Build recap also showed Microsoft trying to integrate all layers of the stack: models : MAI family chips : MAIA 200 cloud : Azure + Foundry OS : Windows agent runtime developer UX : Copilot app / VS Code / CLI retrieval/grounding : Web IQ hardware form factors : Solara / Scout concepts This combination is why several observers described the event less as a normal dev conference and more as a coordinated move toward an agent platform spanning cloud, edge, OS, and custom models , e.g. @satyanadella https://x.com/satyanadella/status/2061896503304806521 , @mustafasuleyman https://x.com/mustafasuleyman/status/2061934667096596657 , and @TheTuringPost https://x.com/TheTuringPost/status/2061865165734506683 The “Mythos FLOPs leak” mini-story During/after Build, some users claimed a Microsoft slide inadvertently revealed training compute for Anthropic’s rumored Claude Mythos , with @swyx https://x.com/swyx/status/2061878629504881151 asking if Mustafa had leaked the FLOP count @scaling01 https://x.com/scaling01/status/2061897540161728791 estimated the slide implied 6.1e27 FLOPs with a confidence interval based on pixel measurement, while @kimmonismus https://x.com/kimmonismus/status/2061908067034517853 noted that would be around Gemini 3.1 Pro-scale computeThat interpretation was subsequently challenged by @iScienceLuvr https://x.com/iScienceLuvr/status/2061882397340393514 , who argued it was probably an estimate, and then by @scaling01 https://x.com/scaling01/status/2061989029025853757 , who posted a lower-range model-based estimate of 3.37e26 to 1.46e27 FLOPs and later said the original numbers were bogus in @scaling01 https://x.com/scaling01/status/2061990840138899674 The episode is useful mostly as context: Build’s compute/scaling messaging was detailed 
enough that people started trying to infer competitor training budgets from presentation materials Developer tools, agents, and coding workflows OpenAI launched Sites in Codex , letting teams turn ideas/docs/plans into deployed internal websites/apps with auth and dynamic data, first for business/enterprise users, in @OpenAI https://x.com/OpenAI/status/2061845949170045346 , @TheRohanVarma https://x.com/TheRohanVarma/status/2061872164442403139 , and @gdb https://x.com/gdb/status/2061988413105156128 OpenAI also expanded role-specific Codex plugins across sales, data analytics, creative production, product design, and public equity workflows, with access to 62 apps and 110 skills , from @OpenAI https://x.com/OpenAI/status/2061887650391625870 and @OpenAIDevs https://x.com/OpenAIDevs/status/2061888366791246071 GitHub’s Copilot app and Microsoft’s Build push around agent-native software development were central to the day’s tooling news: @pierceboggan https://x.com/pierceboggan/status/2061868635241828688 , @lukehoban https://x.com/lukehoban/status/2061905434039246939 , @GHchangelog https://x.com/GHchangelog/status/2061870684876272123 Anthropic shipped a CLI for Claude Platform and upgraded Claude Code’s /fork to run a background agent with exact context + prompt cache, in @ClaudeDevs https://x.com/ClaudeDevs/status/2061877343078244459 and @ClaudeDevs https://x.com/ClaudeDevs/status/2061947411141169494 Nous launched Hermes Desktop , a local/native desktop surface for Hermes agents, in @NousResearch https://x.com/NousResearch/status/2061843507417944552 , @Teknium https://x.com/Teknium/status/2061844602735538266 , and later Tailscale/Ollama integration notes from @Teknium https://x.com/Teknium/status/2061984430370267210 and @ollama https://x.com/ollama/status/2062011585355551231 Cognition launched Devin Desktop , positioned as an agent-neutral desktop for managing local/cloud agents and handoff between local planning and cloud execution, in @cognition 
https://x.com/cognition/status/2061889596703551926 , @ScottWu46 https://x.com/ScottWu46/status/2061998361373532187 , and @russelljkaplan https://x.com/russelljkaplan/status/2061920322325205007 Models, local inference, and routing H Company launched Holo 3.1 , a local computer-use model family based on Qwen-style architecture, with checkpoints from 0.8B to 35B and formats including NVFP4, FP8, and Q4 GGUF ; a popular summary cited 79.3% on AndroidWorld for the 35B model in @TeksEdge https://x.com/TeksEdge/status/2061825310669332818 , with launch tweet from @hcompany ai https://x.com/hcompany ai/status/2061815355341725925 Perplexity announced hybrid agentic inference for Perplexity Computer, splitting work between local models on-device and frontier cloud models for privacy and token efficiency, in @perplexity ai https://x.com/perplexity ai/status/2061861293569765847 and @AravSrinivas https://x.com/AravSrinivas/status/2061875858542096520 OpenRouter data shared by @ttunguz https://x.com/ttunguz/status/2061846636805177692 showed open-weight models at 69.1% of token volume , versus 30.9% for closed modelsCommentary around model routing as a key future abstraction came from @ClementDelangue https://x.com/ClementDelangue/status/2061871024627482964 , @garrytan https://x.com/garrytan/status/2061878212213572083 , @matanSF https://x.com/matanSF/status/2061865185527074914 , and the counterpoint from @glennko https://x.com/glennko/status/2061896887699964171 , who argued enterprise production reliability makes generic routing harder than enthusiasts suggestLocal-AI UX improvements also appeared in Hugging Face’s hardware compatibility checks and oMLX’s native macOS app release from @m newhaus https://x.com/m newhaus/status/2061824017510584630 and @jundotkim https://x.com/jundotkim/status/2061863850874634242 Research and evals Google DeepMind announced Co-Scientist , a Gemini-based multi-agent hypothesis generation system for science, claiming collaborations that helped identify 
liver fibrosis targets, ALS approaches, and genetic leads for aging, in @GoogleDeepMind https://x.com/GoogleDeepMind/status/2061857539977842793 , @GoogleDeepMind https://x.com/GoogleDeepMind/status/2061857550438392094 , and @GoogleDeepMind https://x.com/GoogleDeepMind/status/2061857553076920643 The new Crafter / CraftEditor work on editable scientific figure generation drew attention as a five-agent workflow for producing and refining figures plus raster-to-SVG conversion, in @HuggingPapers https://x.com/HuggingPapers/status/2061800325959324069 , @ akhaliq https://x.com/ akhaliq/status/2061835314599993392 , and @TheTuringPost https://x.com/TheTuringPost/status/2061883014410629400 Tilde Research introduced Wall Attention , a RoPE-free attention method with diagonal forget gates, claiming training at 4k and generalization to 200k+ tokens plus Triton kernels and strong decode throughput, in @tilderesearch https://x.com/tilderesearch/status/2061839600562409581 A robotics vision encoder claiming +22.5% real-world OOD success by encoding dynamics-awareness rather than relying on static-image pretraining was posted by @jbhuang0604 https://x.com/jbhuang0604/status/2061840469966090308 New evals/benchmarks of note: PaintBench for precise image editing, where best model reached only 17.1% , from @itskaixu https://x.com/itskaixu/status/2061827068170518956 VSTAT for video state tracking, arguing frontier MLLMs remain weak at tracking evolving world state, from @PinzhiHuang https://x.com/PinzhiHuang/status/2062004108249145442 and @sainingxie https://x.com/sainingxie/status/2062011403733512253 Data Agent Benchmark for enterprise data workflows, from @sh reya https://x.com/sh reya/status/2061984097531310378 Inference, infrastructure, and agent systems Harvey + LangChain shared work on cheap verifiers for legal agents, showing DeepSeek V4 Flash could preserve 94–96% agreement with Opus 4.7 while reducing cost 18x in per-criterion mode and ~1000x in batch mode; for 3,200 RL rollouts 
, verification cost dropped from $18,000 to $18 , in @harvey https://x.com/harvey/status/2061866491033899371 , @hwchase17 https://x.com/hwchase17/status/2061867746141356427 , and @nikogrupen https://x.com/nikogrupen/status/2061866707988431039 W&B relaunched Weave as agent-first observability with integrations across common harnesses and automated detection of failure modes, in @wandb https://x.com/wandb/status/2061894943203831996 and @neutralino1 https://x.com/neutralino1/status/2061949197851742525 Prime-RL integrated Mooncake Store with vLLM for cross-node prefix / KV cache reuse, pitched as key for agentic rollouts, in @m sirovatka https://x.com/m sirovatka/status/2061862853997465738 Together detailed serving optimizations for MiniMax-M3 , citing 81–125% throughput improvements via KV-block-major sparse attention, paged decode, optimized index scoring, and multimodal preprocessing, in @togethercompute https://x.com/togethercompute/status/2061895336486949109 MiniMax itself highlighted 1M context , native multimodality, desktop-computer operation, and MSA reducing attention’s share of decode time from ~30% to ~5% , in @MiniMax AI https://x.com/MiniMax AI/status/2061944204604101020 Ecosystem, hardware, and industrial capacity Westmag emerged from stealth to build American robot actuators and drone motors , with $11M raised led by a16z and participation from Founders Fund, Lux, NFDG, Menlo and others, in @boxcardavid https://x.com/boxcardavid/status/2061825303715123234 , @packyM https://x.com/packyM/status/2061835223470330100 , and @oyhsu https://x.com/oyhsu/status/2061837257531670864 PyTorch noted NVIDIA adoption of OpenMDW-1.1 , a permissive AI-model licensing framework, across four open-model families in @PyTorch https://x.com/PyTorch/status/2061840384817328604 Martin Scorsese publicly demonstrated narrow, preproduction use of FLUX for storyboarding with Black Forest Labs, framed as exploratory and complementary to hand-drawn work rather than generative 
replacement, in @robrombach https://x.com/robrombach/status/2061804823352086681 and @TheRundownAI https://x.com/TheRundownAI/status/2061834880917357011

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. NVIDIA Nemotron 3 Ultra and RTX Spark Specs