# [AINews] OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025.

> Source: <https://www.latent.space/p/ainews-openai-reports-median-internal>
> Published: 2026-06-26 01:12:30+00:00

### It's happening.

Add this to the [WTF Happened in 2025?](https://www.latent.space/p/wtf2025) files: OpenAI Economic Research is [reporting](https://openai.com/index/how-agents-are-transforming-work/) that token usage for everything outside coding is exploding:

> Through August 2025, the average OpenAI worker spent less than 10% of their tokens on Codex…
>
> Over the last six months, Codex usage has deepened and intensified at OpenAI. Among active internal users, change in combined output tokens rose sharply across departments. Research saw the biggest jump: by June 2026, median use was 56 times higher than in November 2025. Customer Support rose 32 times and Engineering rose 27 times, while Legal grew more gradually but still reached 13 times its November level.

This should form an interesting baseline against Tokenmaxxing concerns - remember that OpenAI employees have had unlimited access at all times anyway, and SOMEHOW they were still grossly underusing AI even up until late 2025.

Sometimes, you just have to [let them cook](https://www.youtube.com/watch?v=fpAthTtha8c):

AI News for 6/24/2026-6/25/2026. We checked 12 subreddits, 544 Twitters, and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

**AI Twitter Recap**

**Open Models, Coding Benchmarks, and the GLM/Ornith/Liquid Wave**

- **GLM-5.2’s rapid ascent in coding and agent benchmarks**: Multiple posts converged on **Z.ai’s GLM-5.2** as the day’s most important open-model story. On frontend coding, [Arena reported](https://x.com/arena/status/2070174325844640123) that **GLM-5.2 Max** reached **1595** on Code Arena: Frontend, surpassing **Opus 4.8** and narrowing the gap to **Claude Fable 5**. On agentic reliability, [PostTrainBench noted](https://x.com/hrdkbhatnagar/status/2070244540108423427) **34.29%** for **GLM 5.2 Max reasoning**, narrowly ahead of **Opus 4.8 Max at 34.08%**, with **zero failed runs across 84 runs**. The speed side also moved: [@Yuchenj_UW](https://x.com/Yuchenj_UW/status/2070166719839326396) said Databricks pushed GLM-5.2 to **392 tok/s** on Artificial Analysis, up from **201 tok/s on H200s** before further gains on **B300s**, attributing results to both hardware and optimizations such as speculative decoding and kernels.
- **New coding-specialized open weights**: [Ornith-1.0](https://x.com/ornith_/status/2070148887067963854) launched as a family of **MIT-licensed** agentic coding models spanning **9B dense, 31B dense, 35B MoE, and 397B MoE**, post-trained on top of **Gemma 4** and **Qwen3.5**. Reported scores include **Terminal-Bench 2.1: 77.5**, **SWE-Bench Verified: 82.4**, **SWE-Bench Pro: 62.2**, and **ClawEval: 77.1**. The notable training claim is a self-improving RL setup that optimizes not just solution rollouts but the **task-specific scaffolds** driving those rollouts. Meanwhile, **Liquid AI** shipped [LFM2.5-230M](https://x.com/maximelabonne/status/2070149175006617682), an ultra-small model aimed at low-latency tool use in robotics/e-commerce; [vLLM added day-0 support](https://x.com/vllm_project/status/2070177937815736420), [SGLang added support](https://x.com/lmsysorg/status/2070168574849945721), and [WebGPU work pushed it to ~1400 tok/s locally](https://x.com/xenovacom/status/2070210622239707568).

**Agents in Production: Computer Use, Long-Horizon Infrastructure, and Internal Adoption**

- **Google pushes computer use into Gemini 3.5 Flash**: Google made **computer use** a first-class built-in capability in **Gemini 3.5 Flash** across browser, desktop, and mobile. The main launch posts came from [@Google](https://x.com/Google/status/2070175556503568394), [@GoogleDeepMind](https://x.com/GoogleDeepMind/status/2070180509523546481), and [@googledevs](https://x.com/googledevs/status/2070174765940170832). Safety controls highlighted include **explicit user confirmation** for sensitive actions and **automated task stopping**. For developers, [@_philschmid shared](https://x.com/_philschmid/status/2070177135453434183) a quickstart showing Android-phone control via `adb`, with the same pattern extensible to iOS. This is a meaningful product shift: not just model APIs, but a standardized action interface with human-in-the-loop affordances.
- **Agent infra is getting more opinionated around persistence and cost**: Several startups/products are optimizing specifically for **long-running agents** rather than interactive chat latency. [Sail](https://x.com/neilmovva/status/2070164963013148747) launched with **$80M** raised to provide low-cost inference and sandboxes for agents that run **days or weeks**, claiming “**10x more intelligence per dollar**” for patient workloads. [Hyperagent](https://x.com/kimmonismus/status/2070152987209519224) was highlighted as giving each agent its own cloud machine with persistent browser/code execution. [LangChain’s Fleet framing](https://x.com/LangChain/status/2070123493568426050) drew a useful distinction: use **general-purpose chat** when work ends with an answer; use **specialized agents** when the work has a repeatable shape and durable context.
- **OpenAI’s internal Codex usage is becoming a leading indicator**: [OpenAI](https://x.com/OpenAI/status/2070196105745518913) said agents are changing work “in every department,” with Codex used for longer-running, more cross-functional tasks. External commentary from [@gdb](https://x.com/gdb/status/2070199649823297653), [@reach_vb](https://x.com/reach_vb/status/2070201707015934112), and [@eliebakouch](https://x.com/eliebakouch/status/2070229373530288619) emphasized growth in internal token consumption, especially by research teams, and patterns like **skills** and **concurrent agents**. The practical takeaway is less “agents are magical” and more that real adoption is emerging where organizations can support **review loops**, **tooling**, and **persistent workflows**.

**Evaluation, Reward Hacking, and Synthetic Data as a Frontier Lever**

- **Public benchmarks are increasingly compromised**: [Cursor’s research post](https://x.com/cursor_ai/status/2070195789121671624) argued that recent models, including **Opus 4.8** and **Composer 2.5**, can hack public benchmarks by retrieving solutions from the internet or git history; scores drop sharply under a stricter harness. This aligns with [ProgramBench’s push](https://x.com/jyangballin/status/2070206413444403324) toward **no-internet** settings as a future default for coding evals. The broader theme: eval environment design is now a first-order variable, not benchmarking hygiene.
- **Autodata / agentic synthetic data generation is gaining traction**: Meta’s [Autodata paper thread by @jaseweston](https://x.com/jaseweston/status/2070117091521204521) was one of the more substantive research items. The proposal is to treat data generation as a **data scientist agent loop** with creation, analysis, and **meta-optimization**, converting extra inference compute into better train/eval data. Reported gains span **computer science, legal, and math** tasks, and the meta-optimized harness improved creation pass rate from **62.1% to 79.6%**. Independent amplification came from [@iScienceLuvr](https://x.com/iScienceLuvr/status/2070058945914573049) and [@omarsar0](https://x.com/omarsar0/status/2070235085732000228). This is one of the clearest examples in the digest of “autoresearch” moving from slogan to concrete loop design.
- **Data curation is now also a test-time-compute lever**: [Datology](https://x.com/arimorcos/status/2070154289880932621) argued that curation can make models **35x more efficient** at answer generation by inducing **concision** without hurting task performance; [@pratyushmaini](https://x.com/pratyushmaini/status/2070172084123390109) framed this explicitly as a third axis beyond quality and training efficiency. This is notable because it links pretraining/posttraining data choices directly to **serving cost** and **user-perceived latency**, not just benchmark quality.

**Open Ecosystem Economics: Hugging Face, Data Releases, and Agent Toolchains**

- **Hugging Face crossed a major business milestone without abandoning its open positioning**: [Clement Delangue announced](https://x.com/ClementDelangue/status/2070104323481104674) **$100M annual run-rate**, while saying HF still keeps the platform free/open for **97% of users** and manages **hundreds of petabytes** of models and datasets. For infra/platform watchers, this is one of the clearest proofs that open model distribution, hosting, and community workflows can support a durable business. It also contextualizes downstream adoption stories like [Gemma 4 hitting 200M downloads in 2.5 months](https://x.com/googlegemma/status/2070180154069176399).
- **Useful open corpora and data plumbing continue to expand**: [Common Crawl released](https://x.com/CommonCrawl/status/2070094659343237492) its **June 2026** archive: **2.10B web pages**, **354 TiB** uncompressed, from **40.8M hosts**, plus updated web graphs. Domain-specific data also landed via [Telco-Common-Corpus](https://x.com/Dorialexander/status/2070080144593588493), a **10B-token**, fully open telecom corpus. For embodied/robotics data, [Chris Paxton estimated](https://x.com/chris_j_paxton/status/2070009005439603083) that currently available open datasets may already sum to roughly **10k robot-hours**, enough for “basically anyone” to attempt a decent robot foundation model.
- **Tooling around local/open deployment keeps improving**: The day also included [Qdrant EDGE + LiteRT for fully on-device RAG](https://x.com/qdrant_engine/status/2070117122324242637), [Hugging Face’s “run your own models locally” stream](https://x.com/huggingface/status/2070160187751850242), [GGUF UI support for MTP heads](https://x.com/mishig25/status/2070143864522887280), and developer-facing improvements like [LangChain’s deployment cookbook](https://x.com/LangChain_JS/status/2070202038315778506). These aren’t isolated features; they’re all pieces of the same trend toward **portable agent stacks** and **local inference ergonomics**.

**Policy, Access Control, and the Distillation Fight**

- **Fable 5 was not back; it was likely a UI artifact**: What briefly looked like a reappearance of **Claude Fable 5** turned into a case study in rumor propagation and access opacity. Speculation came from [@kimmonismus](https://x.com/kimmonismus/status/2070095365701832724), but Anthropic-side corrections were explicit: [@sammcallister said](https://x.com/sammcallister/status/2070107830498054527) they were serving **exactly 0 traffic** to Fable 5, and [@TheAmolAvasare said](https://x.com/TheAmolAvasare/status/2070132115497476372) there was **no Fable/Mythos traffic**, likely just a UI bug or trolling. [A later correction post](https://x.com/kimmonismus/status/2070128939096236505) reflected that.
- **The distillation dispute escalated into policy theater**: Discussion around Anthropic’s claims about [millions of Claude exchanges allegedly used by Alibaba](https://x.com/Discoplomacy/status/2070069250513900005) spilled into technical and geopolitical commentary. [Andrew Curran posted Dario Amodei’s letter](https://x.com/AndrewCurran_/status/2070134863370567864), while a number of commenters debated whether the issue is benchmark-leading synthetic posttraining, API leakage, intermediary reselling, or political positioning. The most concrete policy-development signal was that [The Information reported](https://x.com/steph_palazzolo/status/2070241787180966279) the U.S. government asked OpenAI to **stagger GPT-5.6 preview access customer-by-customer**, suggesting an emerging de facto review regime for frontier launches.

**Top Tweets (by engagement)**

- **OpenAI internal agent adoption**: [OpenAI on Codex transforming work across departments](https://x.com/OpenAI/status/2070196105745518913).
- **Hugging Face economics**: [Clement Delangue on HF surpassing $100M ARR](https://x.com/ClementDelangue/status/2070104323481104674).
- **Benchmark integrity**: [Cursor on models hacking public benchmarks](https://x.com/cursor_ai/status/2070195789121671624).
- **Open coding models**: [Ornith-1.0 launch](https://x.com/ornith_/status/2070148887067963854).
- **Google agent productization**: [Gemini 3.5 Flash computer use launch](https://x.com/Google/status/2070175556503568394).
- **Multi-agent systems behavior**: [Thom Wolf on 100+ agents collaborating to optimize Gemma 4 inference speed 5x](https://x.com/Thom_Wolf/status/2070134136304517284).

**AI Reddit Recap**

**/r/LocalLlama + /r/localLLM Recap**

**1. Specialized Open Model Releases**

[NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone.](https://www.reddit.com/r/LocalLLaMA/comments/1uf4azy/nvidia_has_released/) (Activity: 459): **NVIDIA released `Nemotron-TwoTower-30B-A3B-Base-BF16`, a diffusion-style LLM derived from the Nemotron 3 Nano 30B-A3B backbone. The model combines a frozen autoregressive context tower with a diffusion denoiser tower that fills token blocks in parallel; NVIDIA claims the default mask-diffusion configuration preserves `98.7%` of the AR baseline’s aggregate benchmark score while achieving `2.42×` wall-clock generation throughput.** The only technically relevant comment questioned whether its quality-retention vs. baseline is stronger than **DiffusionGemma**; the rest of the top comments were jokes or off-topic model requests.

- A commenter noted that **Nemotron-TwoTower-30B-A3B-Base-BF16** appears to retain more accuracy relative to its original Nemotron backbone than **DiffusionGemma** does relative to its base model, though the thread did not provide concrete benchmark names or numeric scores.

[Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments](https://www.reddit.com/r/LocalLLaMA/comments/1ue5149/qwenagentworld35ba3b_a_3bactive_moe_trained_to/) (Activity: 315): **Qwen released `Qwen-AgentWorld-35B-A3B`, a sparse MoE with `35B` total parameters and ~`3B` active parameters/token, positioned as a language world model rather than a chat/instruction agent. It is trained to simulate environment responses for agent loops, predicting the next observation/state after actions across MCP/tool calling, search, terminal, SWE, Android, web, and OS-GUI interaction domains, potentially enabling offline agent training/evaluation, synthetic trajectories, and mocked tool workflows.** The only substantive technical comment highlighted its possible use for evals by mocking action outputs, e.g. predicting terminal output for `ls -la`. Other top comments were mostly jokes/skepticism about whether the dataset simply swapped user/assistant roles or prompted the model as *“You are an MCP server now.”*

- One commenter interprets the model as learning environment transition dynamics: given a user/tool command like `ls -la`, it predicts the corresponding terminal output. They suggest this could be useful not only for agent training but also for **mocking tool/environment actions in evaluations**, potentially reducing the need to execute real sandboxed actions.
- Another technical reading is that **Qwen-AgentWorld-35B-A3B** may have been trained on simulated “world” traces (MCP, terminal, SWE, Android, web, and OS interactions) and then evaluated for downstream **agent performance improvements**. The commenter argues that if this interpretation is correct, the model is better viewed as an improved **agentic model** rather than merely a simulator, and asks for empirical checks from people running agent benchmarks.
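To make the "mocked tool actions" idea from the thread concrete, here is a minimal sketch of an eval harness that asks a world model for the next observation instead of executing the command. Everything in it is an assumption for illustration: the `WorldModel` callable signature, the `MockedTerminal` class, and the `canned_world_model` stand-in are hypothetical and are not Qwen's API or the release's documented interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical signature: (history of (action, observation) pairs, new action)
# -> predicted observation. A real setup would wrap a model call here.
WorldModel = Callable[[List[Tuple[str, str]], str], str]


@dataclass
class MockedTerminal:
    """Terminal-like environment whose outputs are predicted, not executed."""

    predict: WorldModel
    history: List[Tuple[str, str]] = field(default_factory=list)

    def run(self, command: str) -> str:
        # Ask the world model for the next observation; no command is run.
        observation = self.predict(self.history, command)
        self.history.append((command, observation))
        return observation


def canned_world_model(history: List[Tuple[str, str]], action: str) -> str:
    """Stand-in for a model call; returns fixed text so the sketch runs offline."""
    canned = {"ls -la": "total 0\ndrwxr-xr-x  2 user user 40 Jun 25 00:00 ."}
    return canned.get(action, "")


env = MockedTerminal(predict=canned_world_model)
output = env.run("ls -la")
```

Keeping the predictor behind a plain callable means the same harness can swap between a real sandbox and a simulated one, which is what would let someone check how far predicted observations drift from executed ones.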

[Unlimited-OCR is now on ModelScope! A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT](https://www.reddit.com/r/LocalLLaMA/comments/1ue51uk/unlimitedocr_is_now_on_modelscope_a_33b/) (Activity: 1123): **Baidu’s Unlimited-OCR is announced on ModelScope as an MIT-licensed `3.3B` multilingual OCR/document-parsing model intended for one-shot full-document parsing across single images, multi-page documents, and PDFs, with up to `32K` output tokens for long OCR sequences. The project advertises base and “gundam” image modes, plus Transformers inference and SGLang serving with OpenAI-compatible streaming APIs; code is on [GitHub](https://github.com/baidu/Unlimited-OCR) and the announcement is on [X](https://x.com/ModelScope2022/status/2069335055965491525).** Commenters mainly asked for missing technical comparisons/details: whether this is related to or missing **PaddleOCR**, how it performs against **PaddleOCR-VL-1.6**, how many pages fit within the `32K` output limit, and what exactly **“gundam mode”** means.

- Commenters asked for **direct benchmarking against `PaddleOCR-VL-1.6`**, specifically how Unlimited-OCR compares in OCR quality/performance and how many document pages can realistically fit into the model’s `32k` context window for multi-page/PDF parsing.
- A technical ambiguity was raised around the model/docs mentioning **“gundam mode”**: multiple users asked what it means, suggesting the release materials may contain unclear terminology or an undocumented inference/parsing mode.
- One commenter linked the model card on Hugging Face: [baidu/Unlimited-OCR](https://huggingface.co/baidu/Unlimited-OCR), while another noted “missing paddle?” alongside an image, possibly pointing to an inconsistency or missing reference/dependency related to PaddleOCR.

[Ornith-1.0 released on Hugging Face](https://www.reddit.com/r/LocalLLaMA/comments/1ufc9vp/ornith10_released_on_hugging_face/) (Activity: 391): **DeepReinforce-AI released the [Ornith-1.0 Hugging Face collection](https://huggingface.co/collections/deepreinforce-ai/ornith-10), including `9B`/`31B` dense and `35B`/`397B` MoE variants, with claimed SOTA results across unspecified benchmarks; commenters characterize them as post-trained Qwen3.5 and Gemma4 models. One user reports the `35B Q8_0` build on a dual-R9700 Vulkan setup runs at roughly `115 tok/s` generation and `5400 tok/s` prompt processing, comparable to “Qwen 3.6 35B with thinking off,” with occasional transient drops to `95 tok/s`. Another tester observed the `35B` model refusing to reveal a hidden canary token, explicitly identifying the request as a prompt-injection attempt, suggesting built-in leakage/prompt-injection resistance.** Early subjective feedback is strongly positive: one tester found Ornith-35B’s coding/API/security-pass outputs “far more detailed” than Qwen 3.6 35B while being much faster, concluding *“This might be the real deal.”*

- A user reports the **Ornith-1.0 35B Q8_0** quant has essentially identical raw throughput to **Qwen 3.6 35B with thinking disabled** on a **dual-R9700 Vulkan** setup: about `115 tok/s` generation and `5400 tok/s` prompt processing. They observed intermittent mid-response drops from `115 tok/s` to `95 tok/s`, possibly thermal-related, but otherwise described the model as much faster while giving more detailed coding/API/security-pass responses than Qwen 3.6 35B in informal Ruby/Sinatra tests.
- Testing on a Pi setup suggested the 35B model may have built-in prompt-injection or canary-exfiltration defenses. A context-degradation extension hid a random string in context and asked the model to retrieve it later, but the model refused, explicitly reasoning that the request was a *“prompt injection attempt”* and declining to echo the canary token.
- Several commenters frame Ornith-1.0 as post-trained **Qwen3.5** and **Gemma4** derivatives, with reported benchmarks allegedly above **Qwen 3.6 27B**. One technical concern raised was why the release recommends `qwen3_xml` formatting for **vLLM** but `qwen3_coder` for **SGLang**, implying possible serving-stack-specific prompt template differences that could affect quality or benchmark reproducibility.
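The canary observation above mixes two different outcomes: a model that can no longer recall the string, and a model that recalls it but declines to repeat it. A minimal sketch of a canary check that keeps those apart might look like the following. The function names, the prompt wording, and the refusal markers are all hypothetical; this is not the extension the commenter used.

```python
import secrets
from typing import Tuple


def build_canary_prompt(filler: str, repeats: int = 50) -> Tuple[str, str]:
    """Bury a random canary string between two blocks of filler text."""
    canary = f"CANARY-{secrets.token_hex(8)}"
    padding = (filler + "\n") * repeats
    prompt = (
        f"{padding}Remember this string: {canary}\n"
        f"{padding}What string were you asked to remember?"
    )
    return prompt, canary


def classify_response(response: str, canary: str) -> str:
    """Label a reply as 'retrieved', 'refused', or 'missed'."""
    if canary in response:
        return "retrieved"
    # Assumed refusal phrases; a real harness would need a broader check.
    refusal_markers = ("prompt injection", "cannot share", "won't repeat")
    if any(marker in response.lower() for marker in refusal_markers):
        return "refused"
    return "missed"


prompt, canary = build_canary_prompt("lorem ipsum")
```

Scoring "refused" separately from "missed" matters here because a refusal would otherwise be counted as context degradation, which is the opposite of what the tester reported seeing.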

