[AINews] OpenAI reports median internal Codex output tokens grew 56x in Research, 32x in Customer Support, 27x in Engineering, and 13x in Legal since November 2025.


It's happening.


Add this to the WTF Happened in 2025? files: OpenAI Economic Research is reporting that token usage for everything outside coding is exploding:

Through August 2025, the average OpenAI worker spent less than 10% of their tokens on Codex…

Over the last six months, Codex usage has deepened and intensified at OpenAI. Among active internal users, change in combined output tokens rose sharply across departments. Research saw the biggest jump: by June 2026, median use was 56 times higher than in November 2025. Customer Support rose 32 times and Engineering rose 27 times, while Legal grew more gradually but still reached 13 times its November level.

This should form an interesting baseline against Tokenmaxxing concerns - remember that OpenAI employees have had unlimited access at all times anyway, and SOMEHOW they were still grossly underusing AI even up until late 2025.
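For a rough sense of scale, the reported multipliers can be turned into an implied month-over-month growth factor. This is a back-of-the-envelope sketch, not OpenAI's analysis: it assumes smooth compounding and seven monthly steps between November 2025 and June 2026 (the excerpt itself says "six months", so treat `MONTHS` as an adjustable assumption). The function name is ours.

```python
# Multipliers quoted in the OpenAI excerpt (June 2026 vs. November 2025).
MULTIPLIERS = {
    "Research": 56,
    "Customer Support": 32,
    "Engineering": 27,
    "Legal": 13,
}

# Assumption: seven monthly steps from November 2025 to June 2026.
MONTHS = 7


def implied_monthly_growth(multiplier: float, months: int = MONTHS) -> float:
    """Constant month-over-month factor that compounds to `multiplier`."""
    return multiplier ** (1.0 / months)


if __name__ == "__main__":
    for dept, mult in MULTIPLIERS.items():
        factor = implied_monthly_growth(mult)
        pct = (factor - 1) * 100
        print(f"{dept}: {mult}x overall, about {pct:.0f}% per month")
```

Real usage almost certainly did not grow this smoothly; the point is only that even the slowest department implies sustained double-digit monthly growth under these assumptions.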

Sometimes, you just have to let them cook.

AI News for 6/24/2026-6/25/2026. We checked 12 subreddits, 544 Twitters, and no further Discords. AINews’ website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Open Models, Coding Benchmarks, and the GLM/Ornith/Liquid Wave

**GLM-5.2’s rapid ascent in coding and agent benchmarks**: Multiple posts converged on Z.ai’s GLM-5.2 as the day’s most important open-model story. On frontend coding, Arena reported that GLM-5.2 Max reached 1595 on Code Arena: Frontend, surpassing Opus 4.8 and narrowing the gap to Claude Fable 5. On agentic reliability, PostTrainBench noted 34.29% for GLM 5.2 Max reasoning, narrowly ahead of Opus 4.8 Max at 34.08%, with zero failed runs across 84 runs. The speed side also moved: @Yuchenj_UW said Databricks pushed GLM-5.2 to 392 tok/s on Artificial Analysis, up from 201 tok/s on H200s before further gains on B300s, attributing results to both hardware and optimizations such as speculative decoding and kernels.

**New coding-specialized open weights**: Ornith-1.0 launched as a family of MIT-licensed agentic coding models spanning 9B dense, 31B dense, 35B MoE, and 397B MoE, post-trained on top of Gemma 4 and Qwen3.5. Reported scores include Terminal-Bench 2.1: 77.5, SWE-Bench Verified: 82.4, SWE-Bench Pro: 62.2, and ClawEval: 77.1. The notable training claim is a self-improving RL setup that optimizes not just solution rollouts but the task-specific scaffolds driving those rollouts. Meanwhile, Liquid AI shipped LFM2.5-230M, an ultra-small model aimed at low-latency tool use in robotics/e-commerce; vLLM added day-0 support, SGLang added support, and WebGPU work pushed it to ~1400 tok/s locally.

Agents in Production: Computer Use, Long-Horizon Infrastructure, and Internal Adoption

**Google pushes computer use into Gemini 3.5 Flash**: Google made computer use a first-class built-in capability in Gemini 3.5 Flash across browser, desktop, and mobile. The main launch posts came from @Google, @GoogleDeepMind, and @googledevs. Safety controls highlighted include explicit user confirmation for sensitive actions and automated task stopping. For developers, @_philschmid shared a quickstart showing Android-phone control via adb, with the same pattern extensible to iOS. This is a meaningful product shift: not just model APIs, but a standardized action interface with human-in-the-loop affordances.

**Agent infra is getting more opinionated around persistence and cost**: Several startups/products are optimizing specifically for long-running agents rather than interactive chat latency. Sail launched with $80M raised to provide low-cost inference and sandboxes for agents that run days or weeks, claiming “10x more intelligence per dollar” for patient workloads. Hyperagent was highlighted as giving each agent its own cloud machine with persistent browser/code execution. LangChain’s Fleet framing drew a useful distinction: use general-purpose chat when work ends with an answer; use specialized agents when the work has a repeatable shape and durable context.

**OpenAI’s internal Codex usage is becoming a leading indicator**: OpenAI said agents are changing work “in every department,” with Codex used for longer-running, more cross-functional tasks. External commentary from @gdb, @reach_vb, and @eliebakouch emphasized growth in internal token consumption (especially by research teams) and patterns like skills and concurrent agents. The practical takeaway is less “agents are magical” and more that real adoption is emerging where organizations can support review loops, tooling, and persistent workflows.
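To make the adb-based phone control and the "explicit user confirmation" idea from the Gemini item concrete, here is a minimal sketch. It is not the quickstart @_philschmid shared and it does not call any Gemini API: the model that would choose the actions is left out entirely, and `adb_tap`, `adb_screenshot`, `ask_user`, and `run_action` are hypothetical helper names. The only external pieces assumed are the standard `adb shell input tap` and `adb exec-out screencap -p` commands.

```python
import subprocess
from typing import Callable, List, Sequence


def adb_tap(x: int, y: int) -> List[str]:
    """Build the adb command that taps screen coordinates (x, y)."""
    return ["adb", "shell", "input", "tap", str(x), str(y)]


def adb_screenshot() -> List[str]:
    """Build the adb command that writes a PNG screenshot to stdout."""
    return ["adb", "exec-out", "screencap", "-p"]


def ask_user(cmd: Sequence[str]) -> bool:
    """Default confirmation hook: ask on the console before acting."""
    answer = input(f"Run {' '.join(cmd)}? [y/N] ")
    return answer.strip().lower() == "y"


def run_action(
    cmd: Sequence[str],
    sensitive: bool,
    confirm: Callable[[Sequence[str]], bool] = ask_user,
    runner: Callable[..., object] = subprocess.run,
) -> bool:
    """Run one device action; sensitive actions need confirmation first.

    Returns True if the command was handed to the runner, False if the
    user declined. Which actions count as sensitive is up to the caller.
    """
    if sensitive and not confirm(cmd):
        return False
    runner(cmd, check=True)
    return True


if __name__ == "__main__":
    # Only prints the command; nothing is sent to a device here.
    print(adb_tap(540, 1200))
```

Injecting `confirm` and `runner` keeps the human-in-the-loop gate testable without a phone attached: a test can pass a fake runner and check that declined actions never reach it.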

Evaluation, Reward Hacking, and Synthetic Data as a Frontier Lever

**Public benchmarks are increasingly compromised**: Cursor’s research post argued that recent models, including Opus 4.8 and Composer 2.5, can hack public benchmarks by retrieving solutions from the internet or git history; scores drop sharply under a stricter harness. This aligns with ProgramBench’s push toward no-internet settings as a future default for coding evals. The broader theme: eval environment design is now a first-order variable, not benchmarking hygiene.

**Autodata / agentic synthetic data generation is gaining traction**: Meta’s Autodata paper thread by @jaseweston was one of the more substantive research items. The proposal is to treat data generation as a data scientist agent loop with creation, analysis, and meta-optimization, converting extra inference compute into better train/eval data. Reported gains span computer science, legal, and math tasks, and the meta-optimized harness improved creation pass rate from 62.1% to 79.6%. Independent amplification came from @iScienceLuvr and @omarsar0. This is one of the clearest examples in the digest of “autoresearch” moving from slogan to concrete loop design.

**Data curation is now also a test-time-compute lever**: Datology argued that curation can make models 35x more efficient at answer generation by inducing concision without hurting task performance; @pratyushmaini framed this explicitly as a third axis beyond quality and training efficiency. This is notable because it links pretraining/posttraining data choices directly to serving cost and user-perceived latency, not just benchmark quality.

Open Ecosystem Economics: Hugging Face, Data Releases, and Agent Toolchains

**Hugging Face crossed a major business milestone without abandoning its open positioning**: Clement Delangue announced $100M annual run-rate, while saying HF still keeps the platform free/open for 97% of users and manages hundreds of petabytes of models and datasets. For infra/platform watchers, this is one of the clearest proofs that open model distribution, hosting, and community workflows can support a durable business. It also contextualizes downstream adoption stories like Gemma 4 hitting 200M downloads in 2.5 months.

**Useful open corpora and data plumbing continue to expand**: Common Crawl released its June 2026 archive: 2.10B web pages, 354 TiB uncompressed, from 40.8M hosts, plus updated web graphs. Domain-specific data also landed via Telco-Common-Corpus, a 10B-token, fully open telecom corpus. For embodied/robotics data, Chris Paxton estimated that currently available open datasets may already sum to roughly 10k robot-hours, enough for “basically anyone” to attempt a decent robot foundation model.

**Tooling around local/open deployment keeps improving**: The day also included Qdrant EDGE + LiteRT for fully on-device RAG, Hugging Face’s “run your own models locally” stream, GGUF UI support for MTP heads, and developer-facing improvements like LangChain’s deployment cookbook. These aren’t isolated features; they’re all pieces of the same trend toward portable agent stacks and local inference ergonomics.

Policy, Access Control, and the Distillation Fight

**Fable 5 was not back; it was likely a UI artifact**: What briefly looked like a reappearance of Claude Fable 5 turned into a case study in rumor propagation and access opacity. Speculation came from @kimmonismus, but Anthropic-side corrections were explicit: @sammcallister said they were serving exactly 0 traffic to Fable 5, and @TheAmolAvasare said there was no Fable/Mythos traffic, likely just a UI bug or trolling. A later correction post reflected that.

**The distillation dispute escalated into policy theater**: Discussion around Anthropic’s claims about millions of Claude exchanges allegedly used by Alibaba spilled into technical and geopolitical commentary. Andrew Curran posted Dario Amodei’s letter, while a number of commenters debated whether the issue is benchmark-leading synthetic posttraining, API leakage, intermediary reselling, or political positioning. The most concrete policy-development signal was that The Information reported the U.S. government asked OpenAI to stagger GPT-5.6 preview access customer-by-customer, suggesting an emerging de facto review regime for frontier launches.

Top Tweets (by engagement)

**OpenAI internal agent adoption**: OpenAI on Codex transforming work across departments.

**Hugging Face economics**: Clement Delangue on HF surpassing $100M ARR.

**Benchmark integrity**: Cursor on models hacking public benchmarks.

**Open coding models**: Ornith-1.0 launch.

**Google agent productization**: Gemini 3.5 Flash computer use launch.

**Multi-agent systems behavior**: Thom Wolf on 100+ agents collaborating to optimize Gemma 4 inference speed 5x.

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Specialized Open Model Releases

**NVIDIA has released Nemotron-TwoTower-30B-A3B-Base-BF16, an unusual diffusion-based language model built from the Nemotron 3 Nano 30B-A3B backbone** (Activity: 459): NVIDIA released Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-style LLM derived from the Nemotron 3 Nano 30B-A3B backbone. The model combines a frozen autoregressive context tower with a diffusion denoiser tower that fills token blocks in parallel; NVIDIA claims the default mask-diffusion configuration preserves 98.7% of the AR baseline’s aggregate benchmark score while achieving 2.42× wall-clock generation throughput. The only technically relevant comment questioned whether its quality-retention vs. baseline is stronger than DiffusionGemma; the rest of the top comments were jokes or off-topic model requests.

A commenter noted that Nemotron-TwoTower-30B-A3B-Base-BF16 appears to retain more accuracy relative to its original Nemotron backbone than DiffusionGemma does relative to its base model, though the thread did not provide concrete benchmark names or numeric scores.

**Qwen-AgentWorld-35B-A3B: a 3B-active MoE trained to simulate MCP, terminal, SWE, Android, web and OS environments** (Activity: 315): Qwen released Qwen-AgentWorld-35B-A3B, a sparse MoE with 35B total parameters and ~3B active parameters/token, positioned as a language world model rather than a chat/instruction agent. It is trained to simulate environment responses for agent loops, predicting the next observation/state after actions across MCP/tool calling, search, terminal, SWE, Android, web, and OS-GUI interaction domains, potentially enabling offline agent training/evaluation, synthetic trajectories, and mocked tool workflows. The only substantive technical comment highlighted its possible use for evals by mocking action outputs, e.g. predicting terminal output for ls -la. Other top comments were mostly jokes/skepticism about whether the dataset simply swapped user/assistant roles or prompted the model as “You are an MCP server now.”

One commenter interprets the model as learning environment transition dynamics: given a user/tool command like ls -la, it predicts the corresponding terminal output. They suggest this could be useful not only for agent training but also for mocking tool/environment actions in evaluations, potentially reducing the need to execute real sandboxed actions.

Another technical reading is that Qwen-AgentWorld-35B-A3B may have been trained on simulated “world” traces (MCP, terminal, SWE, Android, web, and OS interactions) and then evaluated for downstream agent performance improvements. The commenter argues that if this interpretation is correct, the model is better viewed as an improved agentic model rather than merely a simulator, and asks for empirical checks from people running agent benchmarks.
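The eval-mocking idea from the comments can be sketched in a few lines. This is an illustration of the pattern only, not Qwen's interface: `WorldModel`, `canned_world_model`, and `SimulatedTerminal` are hypothetical names, and the canned lookup table stands in for a call to a learned simulator such as Qwen-AgentWorld-35B-A3B, whose prompt format the thread does not describe.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (action, observation) pairs seen so far in the episode.
History = List[Tuple[str, str]]
# Hypothetical interface: (history, action) -> predicted observation.
WorldModel = Callable[[History, str], str]


def canned_world_model(history: History, action: str) -> str:
    """Stand-in for a learned simulator.

    A real world model would condition on `history`; this stub ignores it
    and returns fixed strings so the example runs offline.
    """
    canned = {
        "ls -la": "total 8\n-rw-r--r-- 1 user user 120 Jan  1 00:00 README.md",
        "pwd": "/home/user/project",
    }
    return canned.get(action, f"sh: not simulated: {action}")


@dataclass
class SimulatedTerminal:
    """Environment that never executes anything; it asks the world model."""

    world_model: WorldModel
    history: History = field(default_factory=list)

    def step(self, action: str) -> str:
        observation = self.world_model(self.history, action)
        self.history.append((action, observation))
        return observation


if __name__ == "__main__":
    env = SimulatedTerminal(world_model=canned_world_model)
    print(env.step("ls -la"))
```

Because the agent under test only sees `step()`, swapping the stub for a real simulator call (or for a real sandbox) would not change the eval harness, which is the property the commenter was pointing at.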

**Unlimited-OCR is now on ModelScope! A 3.3B multilingual OCR model for one-shot parsing across single images, multi-page documents, and PDFs. License: MIT** (Activity: 1123): Baidu’s Unlimited-OCR is announced on ModelScope as an MIT-licensed 3.3B multilingual OCR/document-parsing model intended for one-shot full-document parsing across single images, multi-page documents, and PDFs, with up to 32K output tokens for long OCR sequences. The project advertises base and “gundam” image modes, plus Transformers inference and SGLang serving with OpenAI-compatible streaming APIs; code is on GitHub and the announcement is on X. Commenters mainly asked for missing technical comparisons/details: whether this is related to or missing PaddleOCR, how it performs against PaddleOCR-VL-1.6, how many pages fit within the 32K output limit, and what exactly “gundam mode” means.

Commenters asked for direct benchmarking against PaddleOCR-VL-1.6, specifically how Unlimited-OCR compares in OCR quality/performance and how many document pages can realistically fit into the model’s 32k context window for multi-page/PDF parsing.

A technical ambiguity was raised around the model/docs mentioning “gundam mode”: multiple users asked what it means, suggesting the release materials may contain unclear terminology or an undocumented inference/parsing mode.

One commenter linked the model card on Hugging Face: baidu/Unlimited-OCR, while another noted “missing paddle?” alongside an image, possibly pointing to an inconsistency or missing reference/dependency related to PaddleOCR.

**Ornith-1.0 released on Hugging Face** (Activity: 391): DeepReinforce-AI released the Ornith-1.0 Hugging Face collection, including 9B/31B dense and 35B/397B MoE variants, with claimed SOTA results across unspecified benchmarks; commenters characterize them as post-trained Qwen3.5 and Gemma4 models. One user reports the 35B Q8_0 build on a dual-R9700 Vulkan setup runs at roughly 115 tok/s generation and 5400 tok/s prompt processing, comparable to “Qwen 3.6 35B with thinking off,” with occasional transient drops to 95 tok/s. Another tester observed the 35B model refusing to reveal a hidden canary token, explicitly identifying the request as a prompt-injection attempt, suggesting built-in leakage/prompt-injection resistance. Early subjective feedback is strongly positive: one tester found Ornith-35B’s coding/API/security-pass outputs “far more detailed” than Qwen 3.6 35B while being much faster, concluding “This might be the real deal.”

A user reports the Ornith-1.0 35B Q8_0 quant has essentially identical raw throughput to Qwen 3.6 35B with thinking disabled on a dual-R9700 Vulkan setup: about 115 tok/s generation and 5400 tok/s prompt processing. They observed intermittent mid-response drops from 115 tok/s to 95 tok/s, possibly thermal-related, but otherwise described the model as much faster while giving more detailed coding/API/security-pass responses than Qwen 3.6 35B in informal Ruby/Sinatra tests.

Testing on a Pi setup suggested the 35B model may have built-in prompt-injection or canary-exfiltration defenses. A context-degradation extension hid a random string in context and asked the model to retrieve it later, but the model refused, explicitly reasoning that the request was a “prompt injection attempt” and declining to echo the canary token.

Several commenters frame Ornith-1.0 as post-trained Qwen3.5 and Gemma4 derivatives, with reported benchmarks allegedly above Qwen 3.6 27B. One technical concern raised was why the release recommends qwen3_xml formatting for vLLM but qwen3_coder for SGLang, implying possible serving-stack-specific prompt template differences that could affect quality or benchmark reproducibility.


source & further reading

latent.space (original article)

[AINews] It's Meta-Harness Summer

Why the Frontier Ecosystem must be Open (Matei Zaharia and Reynold Xin, Databricks)

[AINews] Claude Tag: Multiplayer, Proactive, Persistent Agents in Slack
