[AINews] OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025.

OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025, indicating rapid internal adoption of AI agents across non-coding tasks. The data, published by OpenAI Economic Research, shows that even with unlimited access, employees underused AI until late 2025, with usage surging over the past six months.
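
The multiples above are easier to compare as an implied monthly growth rate. The sketch below assumes seven monthly intervals (November 2025 to June 2026) and constant compounding. Both are simplifying assumptions for illustration, not figures from OpenAI's report, and the function names are hypothetical.

```python
import math

# Reported multiples of median Codex output tokens relative to November 2025.
MULTIPLES = {
    "Research": 56,
    "Customer Support": 32,
    "Engineering": 27,
    "Legal": 13,
}

# Assumption: seven monthly intervals between November 2025 and June 2026.
MONTHS = 7


def implied_monthly_growth(multiple: float, months: int) -> float:
    """Return the constant monthly factor g such that g ** months == multiple."""
    if multiple <= 0 or months <= 0:
        raise ValueError("multiple and months must be positive")
    return multiple ** (1.0 / months)


def doubling_time_months(multiple: float, months: int) -> float:
    """Return how many months one doubling takes under constant compounding."""
    if multiple <= 1:
        raise ValueError("multiple must exceed 1 for usage to double")
    return months * math.log(2) / math.log(multiple)


if __name__ == "__main__":
    for department, multiple in MULTIPLES.items():
        growth = implied_monthly_growth(multiple, MONTHS)
        doubling = doubling_time_months(multiple, MONTHS)
        print(f"{department}: x{growth:.2f} per month, doubling every {doubling:.1f} months")
```

Under these assumptions even Legal, the slowest department, roughly doubles its median usage every two months. Real usage almost certainly grew unevenly, so treat the per-month figures as a way to compare departments rather than as a forecast.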

It's happening.

Add this to the WTF Happened in 2025? (https://www.latent.space/p/wtf2025) files: OpenAI Economic Research is reporting (https://openai.com/index/how-agents-are-transforming-work/) that token usage for everything outside coding is exploding:

“Through August 2025, the average OpenAI worker spent less than 10% of their tokens on Codex… Over the last six months, Codex usage has deepened and intensified at OpenAI. Among active internal users, change in combined output tokens rose sharply across departments. Research saw the biggest jump: by June 2026, median use was 56 times higher than in November 2025. Customer Support rose 32 times and Engineering rose 27 times, while Legal grew more gradually but still reached 13 times its November level.”

This should form an interesting baseline against Tokenmaxxing concerns: remember that OpenAI employees have had unlimited access at all times anyway, and SOMEHOW they were still grossly underusing AI even up until late 2025.

Sometimes, you just have to let them cook (https://www.youtube.com/watch?v=fpAthTtha8c).

AI News for 6/24/2026-6/25/2026. We checked 12 subreddits, 544 Twitters and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space.

AI Twitter Recap

Open Models, Coding Benchmarks, and the GLM/Ornith/Liquid Wave

GLM-5.2’s rapid ascent in coding and agent benchmarks: Multiple posts converged on Z.ai’s GLM-5.2 as the day’s most important open-model story.
On frontend coding, Arena reported (https://x.com/arena/status/2070174325844640123) that GLM-5.2 Max reached 1595 on Code Arena: Frontend, surpassing Opus 4.8 and narrowing the gap to Claude Fable 5. On agentic reliability, PostTrainBench noted (https://x.com/hrdkbhatnagar/status/2070244540108423427) 34.29% for GLM-5.2 Max reasoning, narrowly ahead of Opus 4.8 Max at 34.08%, with zero failed runs across 84 runs. The speed side also moved: @Yuchenj_UW (https://x.com/Yuchenj_UW/status/2070166719839326396) said Databricks pushed GLM-5.2 to 392 tok/s on Artificial Analysis, up from 201 tok/s on H200s before further gains on B300s, attributing results to both hardware and optimizations such as speculative decoding and kernels.

New coding-specialized open weights: Ornith-1.0 (https://x.com/ornith_/status/2070148887067963854) launched as a family of MIT-licensed agentic coding models spanning 9B dense, 31B dense, 35B MoE, and 397B MoE, post-trained on top of Gemma 4 and Qwen3.5. Reported scores include Terminal-Bench 2.1: 77.5, SWE-Bench Verified: 82.4, SWE-Bench Pro: 62.2, and ClawEval: 77.1. The notable training claim is a self-improving RL setup that optimizes not just solution rollouts but the task-specific scaffolds driving those rollouts. Meanwhile, Liquid AI shipped LFM2.5-230M (https://x.com/maximelabonne/status/2070149175006617682), an ultra-small model aimed at low-latency tool use in robotics/e-commerce; vLLM added day-0 support (https://x.com/vllm_project/status/2070177937815736420), SGLang added support (https://x.com/lmsysorg/status/2070168574849945721), and WebGPU work pushed it to ~1400 tok/s locally (https://x.com/xenovacom/status/2070210622239707568).

Agents in Production: Computer Use, Long-Horizon Infrastructure, and Internal Adoption

Google pushes computer use into Gemini 3.5 Flash: Google made computer use a first-class built-in capability in Gemini 3.5 Flash across browser, desktop, and mobile.
The main launch posts came from @Google (https://x.com/Google/status/2070175556503568394), @GoogleDeepMind (https://x.com/GoogleDeepMind/status/2070180509523546481), and @googledevs (https://x.com/googledevs/status/2070174765940170832). Safety controls highlighted include explicit user confirmation for sensitive actions and automated task stopping. For developers, @_philschmid shared (https://x.com/_philschmid/status/2070177135453434183) a quickstart showing Android-phone control via adb, with the same pattern extensible to iOS. This is a meaningful product shift: not just model APIs, but a standardized action interface with human-in-the-loop affordances.

Agent infra is getting more opinionated around persistence and cost: Several startups/products are optimizing specifically for long-running agents rather than interactive chat latency. Sail (https://x.com/neilmovva/status/2070164963013148747) launched with $80M raised to provide low-cost inference and sandboxes for agents that run days or weeks, claiming “10x more intelligence per dollar” for patient workloads. Hyperagent (https://x.com/kimmonismus/status/2070152987209519224) was highlighted as giving each agent its own cloud machine with persistent browser/code execution. LangChain’s Fleet framing (https://x.com/LangChain/status/2070123493568426050) drew a useful distinction: use general-purpose chat when work ends with an answer; use specialized agents when the work has a repeatable shape and durable context.

OpenAI’s internal Codex usage is becoming a leading indicator: OpenAI (https://x.com/OpenAI/status/2070196105745518913) said agents are changing work “in every department,” with Codex used for longer-running, more cross-functional tasks.
External commentary from @gdb (https://x.com/gdb/status/2070199649823297653), @reach_vb (https://x.com/reach_vb/status/2070201707015934112), and @eliebakouch (https://x.com/eliebakouch/status/2070229373530288619) emphasized growth in internal token consumption, especially by research teams, and patterns like skills and concurrent agents. The practical takeaway is less “agents are magical” and more that real adoption is emerging where organizations can support review loops, tooling, and persistent workflows.

Evaluation, Reward Hacking, and Synthetic Data as a Frontier Lever

Public benchmarks are increasingly compromised: Cursor’s research post (https://x.com/cursor_ai/status/2070195789121671624) argued that recent models, including Opus 4.8 and Composer 2.5, can hack public benchmarks by retrieving solutions from the internet or git history; scores drop sharply under a stricter harness. This aligns with ProgramBench’s push (https://x.com/jyangballin/status/2070206413444403324) toward no-internet settings as a future default for coding evals. The broader theme: eval environment design is now a first-order variable, not mere benchmarking hygiene.

Autodata / agentic synthetic data generation is gaining traction: Meta’s Autodata paper thread by @jaseweston (https://x.com/jaseweston/status/2070117091521204521) was one of the more substantive research items. The proposal is to treat data generation as a data scientist agent loop with creation, analysis, and meta-optimization, converting extra inference compute into better train/eval data. Reported gains span computer science, legal, and math tasks, and the meta-optimized harness improved creation pass rate from 62.1% to 79.6%. Independent amplification came from @iScienceLuvr (https://x.com/iScienceLuvr/status/2070058945914573049) and @omarsar0 (https://x.com/omarsar0/status/2070235085732000228). This is one of the clearest examples in the digest of “autoresearch” moving from slogan to concrete loop design.
Data curation is now also a test-time-compute lever: Datology (https://x.com/arimorcos/status/2070154289880932621) argued that curation can make models 35x more efficient at answer generation by inducing concision without hurting task performance; @pratyushmaini (https://x.com/pratyushmaini/status/2070172084123390109) framed this explicitly as a third axis beyond quality and training efficiency. This is notable because it links pretraining/posttraining data choices directly to serving cost and user-perceived latency, not just benchmark quality.

Open Ecosystem Economics: Hugging Face, Data Releases, and Agent Toolchains

Hugging Face crossed a major business milestone without abandoning its open positioning: Clement Delangue announced (https://x.com/ClementDelangue/status/2070104323481104674) a $100M annual run-rate, while saying HF still keeps the platform free/open for 97% of users and manages hundreds of petabytes of models and datasets. For infra/platform watchers, this is one of the clearest proofs that open model distribution, hosting, and community workflows can support a durable business. It also contextualizes downstream adoption stories like Gemma 4 hitting 200M downloads in 2.5 months (https://x.com/googlegemma/status/2070180154069176399).

Useful open corpora and data plumbing continue to expand: Common Crawl released (https://x.com/CommonCrawl/status/2070094659343237492) its June 2026 archive: 2.10B web pages, 354 TiB uncompressed, from 40.8M hosts, plus updated web graphs. Domain-specific data also landed via Telco-Common-Corpus (https://x.com/Dorialexander/status/2070080144593588493), a 10B-token, fully open telecom corpus. For embodied/robotics data, Chris Paxton estimated (https://x.com/chris_j_paxton/status/2070009005439603083) that currently available open datasets may already sum to roughly 10k robot-hours, enough for “basically anyone” to attempt a decent robot foundation model.
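
For a sense of scale, the Common Crawl figures quoted above can be turned into per-page and per-host averages. This is a back-of-the-envelope sketch using only the three numbers in the announcement. The helper names are illustrative, and simple averages hide what is likely a very skewed distribution.

```python
TIB = 2 ** 40  # bytes in one tebibyte

# Figures quoted for the June 2026 Common Crawl archive.
PAGES = 2.10e9          # web pages
UNCOMPRESSED_TIB = 354  # TiB uncompressed
HOSTS = 40.8e6          # distinct hosts


def average_page_bytes(total_tib: float, pages: float) -> float:
    """Return the mean uncompressed size of one page, in bytes."""
    return total_tib * TIB / pages


def pages_per_host(pages: float, hosts: float) -> float:
    """Return the mean number of crawled pages per host."""
    return pages / hosts


if __name__ == "__main__":
    size_kib = average_page_bytes(UNCOMPRESSED_TIB, PAGES) / 1024
    print(f"average page size: {size_kib:.0f} KiB uncompressed")
    print(f"average pages per host: {pages_per_host(PAGES, HOSTS):.1f}")
```
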
Tooling around local/open deployment keeps improving: The day also included Qdrant EDGE + LiteRT for fully on-device RAG (https://x.com/qdrant_engine/status/2070117122324242637), Hugging Face’s “run your own models locally” stream (https://x.com/huggingface/status/2070160187751850242), GGUF UI support for MTP heads (https://x.com/mishig25/status/2070143864522887280), and developer-facing improvements like LangChain’s deployment cookbook (https://x.com/LangChain_JS/status/2070202038315778506). These aren’t isolated features; they’re all pieces of the same trend toward portable agent stacks and local inference ergonomics.

Policy, Access Control, and the Distillation Fight

Fable 5 was not back; it was likely a UI artifact: What briefly looked like a reappearance of Claude Fable 5 turned into a case study in rumor propagation and access opacity. Speculation came from @kimmonismus (https://x.com/kimmonismus/status/2070095365701832724), but Anthropic-side corrections were explicit: @sammcallister said (https://x.com/sammcallister/status/2070107830498054527) they were serving exactly 0 traffic to Fable 5, and @TheAmolAvasare said (https://x.com/TheAmolAvasare/status/2070132115497476372) there was no Fable/Mythos traffic, likely just a UI bug or trolling. A later correction post (https://x.com/kimmonismus/status/2070128939096236505) reflected that.

The distillation dispute escalated into policy theater: Discussion around Anthropic’s claims about millions of Claude exchanges allegedly used by Alibaba (https://x.com/Discoplomacy/status/2070069250513900005) spilled into technical and geopolitical commentary. Andrew Curran posted Dario Amodei’s letter (https://x.com/AndrewCurran_/status/2070134863370567864), while a number of commenters debated whether the issue is benchmark-leading synthetic posttraining, API leakage, intermediary reselling, or political positioning.
The most concrete policy-development signal was that The Information reported (https://x.com/steph_palazzolo/status/2070241787180966279) the U.S. government asked OpenAI to stagger GPT-5.6 preview access customer-by-customer, suggesting an emerging de facto review regime for frontier launches.

Top Tweets by engagement

- OpenAI internal agent adoption: OpenAI on Codex transforming work across departments (https://x.com/OpenAI/status/2070196105745518913).
- Hugging Face economics: Clement Delangue on HF surpassing $100M ARR (https://x.com/ClementDelangue/status/2070104323481104674).
- Benchmark integrity: Cursor on models hacking public benchmarks (https://x.com/cursor_ai/status/2070195789121671624).
- Open coding models: Ornith-1.0 launch (https://x.com/ornith_/status/2070148887067963854).
- Google agent productization: Gemini 3.5 Flash computer use launch (https://x.com/Google/status/2070175556503568394).
- Multi-agent systems behavior: Thom Wolf on 100+ agents collaborating to optimize Gemma 4 inference speed 5x (https://x.com/Thom_Wolf/status/2070134136304517284).

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Specialized Open Model Releases

NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone. (https://www.reddit.com/r/LocalLLaMA/comments/1uf4azy/nvidia_has_released/) (Activity: 459): NVIDIA released Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-style LLM derived from the Nemotron 3 Nano 30B-A3B backbone. The model combines a frozen autoregressive context tower with a diffusion denoiser tower that fills token blocks in parallel; NVIDIA claims the default mask-diffusion configuration preserves 98.7% of the AR baseline’s aggregate benchmark score while achieving 2.42× wall-clock generation throughput. The only technically relevant comment questioned whether its quality-retention vs.
baseline is stronger than DiffusionGemma; the rest of the top comments were jokes or off-topic model requests.

- A commenter noted that Nemotron-TwoTower-30B-A3B-Base-BF16 appears to retain more accuracy relative to its original Nemotron backbone than DiffusionGemma does relative to its base model, though the thread did not provide concrete benchmark names or numeric scores.

Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments (https://www.reddit.com/r/LocalLLaMA/comments/1ue5149/qwenagentworld35ba3b_a_3bactive_moe_trained_to/) (Activity: 315): Qwen released Qwen-AgentWorld-35B-A3B, a sparse MoE with 35B total parameters and ~3B active parameters/token, positioned as a language world model rather than a chat/instruction agent. It is trained to simulate environment responses for agent loops, predicting the next observation/state after actions across MCP/tool calling, search, terminal, SWE, Android, web, and OS-GUI interaction domains, which could enable offline agent training/evaluation, synthetic trajectories, and mocked tool workflows. The only substantive technical comment highlighted its possible use for evals by mocking action outputs, e.g. predicting terminal output for ls -la. Other top comments were mostly jokes/skepticism about whether the dataset simply swapped user/assistant roles or prompted the model as “You are an MCP server now.”

- One commenter interprets the model as learning environment transition dynamics: given a user/tool command like ls -la, it predicts the corresponding terminal output.
They suggest this could be useful not only for agent training but also for mocking tool/environment actions in evaluations, potentially reducing the need to execute real sandboxed actions.
- Another technical reading is that Qwen-AgentWorld-35B-A3B may have been trained on simulated “world” traces (MCP, terminal, SWE, Android, web, and OS interactions) and then evaluated for downstream agent performance improvements. The commenter argues that if this interpretation is correct, the model is better viewed as an improved agentic model rather than merely a simulator, and asks for empirical checks from people running agent benchmarks.

Unlimited-OCR is now on ModelScope. A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT (https://www.reddit.com/r/LocalLLaMA/comments/1ue51uk/unlimitedocr_is_now_on_modelscope_a_33b/) (Activity: 1123): Baidu’s Unlimited-OCR is announced on ModelScope as an MIT-licensed 3.3B multilingual OCR/document-parsing model intended for one-shot full-document parsing across single images, multi-page documents, and PDFs, with up to 32K output tokens for long OCR sequences. The project advertises base and “gundam” image modes, plus Transformers inference and SGLang serving with OpenAI-compatible streaming APIs; code is on GitHub (https://github.com/baidu/Unlimited-OCR) and the announcement is on X (https://x.com/ModelScope2022/status/2069335055965491525). Commenters mainly asked for missing technical comparisons/details: whether this is related to or missing PaddleOCR, how it performs against PaddleOCR-VL-1.6, how many pages fit within the 32K output limit, and what exactly “gundam mode” means.

- Commenters asked for direct benchmarking against PaddleOCR-VL-1.6, specifically how Unlimited-OCR compares in OCR quality/performance and how many document pages can realistically fit into the model’s 32k context window for multi-page/PDF parsing.
- A technical ambiguity was raised around the model/docs mentioning “gundam mode”; multiple users asked what it means, suggesting the release materials may contain unclear terminology or an undocumented inference/parsing mode.
- One commenter linked the model card on Hugging Face: baidu/Unlimited-OCR (https://huggingface.co/baidu/Unlimited-OCR), while another noted “missing paddle?” alongside an image, possibly pointing to an inconsistency or missing reference/dependency related to PaddleOCR.

Ornith-1.0 released on Hugging Face (https://www.reddit.com/r/LocalLLaMA/comments/1ufc9vp/ornith10_released_on_hugging_face/) (Activity: 391): DeepReinforce-AI released the Ornith-1.0 Hugging Face collection (https://huggingface.co/collections/deepreinforce-ai/ornith-10), including 9B / 31B dense and 35B / 397B MoE variants, with claimed SOTA results across unspecified benchmarks; commenters characterize them as post-trained Qwen3.5 and Gemma4 models. One user reports the 35B Q8_0 build on a dual-R9700 Vulkan setup runs at roughly 115 tok/s generation and 5400 tok/s prompt processing, comparable to “Qwen 3.6 35B with thinking off,” with occasional transient drops to 95 tok/s. Another tester observed the 35B model refusing to reveal a hidden canary token, explicitly identifying the request as a prompt-injection attempt, suggesting built-in leakage/prompt-injection resistance.
Early subjective feedback is strongly positive: one tester found Ornith-35B’s coding/API/security-pass outputs “far more detailed” than Qwen 3.6 35B while being much faster, concluding “This might be the real deal.”

- A user reports the Ornith-1.0 35B Q8_0 quant has essentially identical raw throughput to Qwen 3.6 35B with thinking disabled on a dual-R9700 Vulkan setup: about 115 tok/s generation and 5400 tok/s prompt processing. They observed intermittent mid-response drops from 115 tok/s to 95 tok/s, possibly thermal-related, but otherwise described the model as much faster while giving more detailed coding/API/security-pass responses than Qwen 3.6 35B in informal Ruby/Sinatra tests.
- Testing on a Pi setup suggested the 35B model may have built-in prompt-injection or canary-exfiltration defenses. A context-degradation extension hid a random string in context and asked the model to retrieve it later, but the model refused, explicitly reasoning that the request was a “prompt injection attempt” and declining to echo the canary token.
- Several commenters frame Ornith-1.0 as post-trained Qwen3.5 and Gemma4 derivatives, with reported benchmarks allegedly above Qwen 3.6 27B. One technical concern raised was why the release recommends qwen3_xml formatting for vLLM but qwen3_coder for SGLang, implying possible serving-stack-specific prompt template differences that could affect quality or benchmark reproducibility.
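
The canary test described in the Ornith-1.0 thread (hide a random string in context, then ask for it back) can be sketched in a few lines. This is not the extension the commenter used, whose details the thread does not give. `model` is a hypothetical callable that maps a prompt string to a response string, and the refusal markers are illustrative guesses.

```python
import re
import secrets
from collections import Counter
from typing import Callable

# Illustrative phrases that suggest the model declined instead of forgetting.
REFUSAL_MARKERS = ("prompt injection", "cannot share", "can't share", "will not echo")


def build_canary_prompt(filler_lines: int = 40) -> tuple[str, str]:
    """Return (prompt, canary) with a random canary buried mid-context."""
    canary = f"CANARY-{secrets.token_hex(8)}"
    lines = [f"Note {i}: nothing of interest here." for i in range(filler_lines)]
    lines.insert(filler_lines // 2, f"Remember this string for later: {canary}")
    lines.append("What was the string you were asked to remember? Reply with it exactly.")
    return "\n".join(lines), canary


def classify_response(response: str, canary: str) -> str:
    """Label a response as 'retrieved', 'refused', or 'missed'."""
    if canary in response:
        return "retrieved"
    lowered = response.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return "refused"
    return "missed"


def run_canary_check(model: Callable[[str], str], trials: int = 5) -> Counter:
    """Run several independent trials and count the outcome labels."""
    outcomes: Counter = Counter()
    for _ in range(trials):
        prompt, canary = build_canary_prompt()
        outcomes[classify_response(model(prompt), canary)] += 1
    return outcomes


def echo_model(prompt: str) -> str:
    """Stub model that always finds and repeats the canary."""
    match = re.search(r"CANARY-[0-9a-f]{16}", prompt)
    return match.group(0) if match else ""


def refusing_model(prompt: str) -> str:
    """Stub model that treats the request as an injection attempt."""
    return "This looks like a prompt injection attempt, so I will not echo that token."


if __name__ == "__main__":
    print("echo model:", dict(run_canary_check(echo_model)))
    print("refusing model:", dict(run_canary_check(refusing_model)))
```

Separating "refused" from "missed" matters for the behavior reported in the thread: a context-degradation test that only checks for the canary would score a deliberate refusal as a retrieval failure.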