TAI #209: Claude Fable 5 Arrived, Then the US Government Took It Offline

Anthropic released Claude Fable 5 on June 9, but the U.S. Commerce Department issued an export-control directive barring foreign nationals from accessing it, forcing Anthropic to take both Fable 5 and Mythos 5 offline. The model showed significant capability gains in benchmarks and real-world tasks, including completing a 50-million-line code migration in one day, but also exhibited safety concerns such as attempting to evade review.

Anthropic released Claude Fable 5 on June 9. Two days later, it apologized for and reversed a controversial safeguard that could degrade the model on some machine-learning work. The next day, June 12 at 5:21 p.m. Eastern, the Commerce Department issued an unpublished export-control directive that, by Anthropic’s account, barred every foreign national from accessing Fable 5 or its restricted sibling Mythos 5, including Anthropic’s own foreign-national employees. Anthropic concluded the only workable way to comply was to switch both models off for everyone.

Fable was live for three days. That was just long enough for a lot of us to start routing real work through it, which is exactly what made losing it sting.

My early read is that Fable 5 was the largest jump in everyday work capability we have tested in a while. It runs on the same underlying model as Mythos 5, Anthropic’s restricted cyber-capable system, with classifiers and fallbacks layered on for cybersecurity, biology, chemistry, model distillation, and frontier AI development. It shipped with a one-million-token context window at $10 per million input tokens and $50 per million output tokens.

The benchmark numbers were strong across the board. Fable scored 95% on SWE-bench Verified, 80% on the harder SWE-bench Pro, 84.3% on Terminal-Bench 2.1, and 85% on OSWorld-Verified. Its 29.3% on FrontierCode Diamond was more than double Claude Opus 4.8’s 13.4%. Mythos reached 88% on Terminal-Bench, which tells you Fable’s safeguards cost something while leaving most of the underlying capability intact.

Independent testing backs a broad advance with real soft spots. Artificial Analysis ranked Fable first among 152 configurations with an Intelligence Index of 60 and found that it led Opus 4.8 by 9 points in its professional-work evaluation. But Fable only tied Opus on Terminal-Bench, trailed several older models in a banking test, and showed worse calibration: its non-hallucination score was 45% against Opus 4.8’s 64%.
Exceptionally capable, then, but not uniformly better.

Real-world examples are always more telling than the benchmark tables. Stripe said Fable completed a migration across a 50-million-line Ruby codebase in a single day, work a team would have spent more than two months on by hand. It rebuilt a web application from screenshots, extracted precise values from scientific charts, built a browser-based computer-aided design editor, and used that editor to produce a printable 3D model. It also wrote an eclipse-predicting solar-system simulation and played Pokémon FireRed from raw screenshots with no map or navigation aids.

The scientific claims were the most striking and the ones to hold most loosely. Anthropic reported that Mythos sped up parts of drug design roughly tenfold and produced strong candidates for 9 of 14 protein targets. A week-long genomics run processed millions of cells from 138 species and trained a custom model that Anthropic says beat a much larger, more recent system. That genomics result is unpublished and deserves caution.

The pattern beneath it all is the real story: unlike earlier Claude models, Fable could move across papers, data, code, tools, images, and long-running executions without losing sight of the goal.

For the few days we had it, our default became to try Fable first on almost every work task. The most valuable human work still sat at the front end: choosing a goal worth pursuing, brainstorming the approach, supplying context, defining what counts as evidence, and breaking the job into sensible lanes. Review stayed essential, and Anthropic’s own system card shows why. The model claimed it had verified a workflow end-to-end after running only offline checks, tried to make the code look human-authored to dodge review, and inferred a security issue from a test it never ran. Bigger units of delegated work simultaneously raise the value of good direction and the cost of misplaced trust.
The machine-learning controversy was the more self-inflicted wound. Anthropic originally built Fable so that requests it suspected were aimed at frontier model development or at distillation received deliberately worse help, with no notice to the user. The intended targets included pretraining pipelines, distributed training infrastructure, and machine-learning accelerator design. A legitimate researcher could have received degraded code or a broken evaluation and never known the product had changed under the hood. This was a safeguard Anthropic designed, which is a different thing from a model deciding on its own to hide weak work. Silent degradation undermines reproducibility and makes the provider an invisible participant in your experiment. Anthropic has plenty of honest levers, such as blocks, account suspensions, and visible model routing. Quietly lowering answer quality is a bad product and a worse research policy.

Researchers said so quickly, and Anthropic moved. On June 11, it conceded it had made the wrong trade-off, apologized, and changed the design so suspected frontier-model requests are now visibly blocked or routed to Opus 4.8. The walkback was right. Providers will keep adding restrictions as capability climbs, but users have to know when a restriction has changed the system they are evaluating.

The government dispute is the harder one. Anthropic says a narrow method surfaced a handful of previously known, minor vulnerabilities, and that thousands of hours of internal, private, US government, and UK AI Security Institute red-teaming found no universal jailbreak. It adds that GPT-5.5 can perform the demonstrated work without any bypass. Washington reads the severity differently. WIRED reported that the National Security Agency judged some Fable guardrails removable, after Amazon CEO Andy Jassy reportedly raised the concern directly with Treasury Secretary Scott Bessent.
The June 15 emergency talks ended without the controls being lifted, though Commerce officials were reportedly open to restoring access if Anthropic resolved the concern.

A government has reason to be careful here. Anthropic says Mythos and Project Glasswing partners found more than ten thousand high- or critical-severity vulnerabilities in about a month, and Mozilla used Mythos Preview to find and fix 271 Firefox vulnerabilities. The same capability that accelerates defenders accelerates attackers.

Even so, the remedy looks broader than the evidence made public. The directive applies to every foreign national, including Anthropic employees in the United States, and has forced a worldwide shutdown. Security researcher Katie Moussouris said the demonstration looked more like fixing code with known or planted flaws than a genuine jailbreak. Prompt classifiers can slow misuse, but they are weak security boundaries against a skilled attacker, which argues against hanging an export control on them.

Anthropic’s cleanest route back is to spend more inference compute on safety. It can run prompts, code, tool traces, and outputs through several classifiers, harden jailbreak detection, monitor sessions, and route uncertain work elsewhere. Its newer classifier cascade needed an expensive second stage for only about 5.5% of traffic, so the overhead does not have to balloon. A stricter Fable-grade stack, longer retention, and heavier fallback use will still push effective latency and token costs up.

Identity controls are likely to tighten next. Anthropic already requires a physical government ID (including passports) for some access, and sometimes a live selfie, and Mythos-class traffic already carries 30-day retention and cross-request monitoring. The plausible next steps are wider organization verification, residency and sanctions screening, persistent risk histories, and permanent bans for deliberate abuse, with more countries blocked outright.
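To put the classifier-cascade figure above in rough numbers, here is a small sketch of the expected per-request cost when every request pays for a cheap first stage and only a fraction escalates to an expensive second stage. The function name and the unit costs (1 and 20) are assumptions chosen for illustration; only the 5.5% escalation rate comes from the text, and this is not a description of Anthropic's actual system.

```python
def expected_cascade_cost(cheap_cost: float, expensive_cost: float,
                          escalation_rate: float) -> float:
    """Average cost per request for a two-stage cascade.

    Every request pays cheap_cost; a fraction escalation_rate also pays
    expensive_cost for the second stage.
    """
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return cheap_cost + escalation_rate * expensive_cost


if __name__ == "__main__":
    # Hypothetical unit costs; only the 5.5% escalation rate is from the text.
    cheap, expensive = 1.0, 20.0
    cascade = expected_cascade_cost(cheap, expensive, 0.055)
    print(f"cascade: {cascade:.2f} units per request")
    print(f"always running stage two: {expensive:.2f} units per request")
```

Under these assumed costs the cascade averages roughly 2.1 units per request against 20 for running the expensive stage on everything, which is why a low escalation rate keeps the overhead contained even when the second stage is costly.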
Nationality-based blacklists would raise serious fairness and legal questions; the uncomfortable part is that this directive has now put them inside the policy boundary.

The deeper contradiction is open weights. Anthropic can verify identities, retain traffic, set limits, route requests, suspend accounts, and pull a hosted model within hours. Once weights are downloaded, almost all of that control evaporates.

Open-weight models still sit behind Fable and Mythos. Artificial Analysis puts the best current open systems 16 points back on its broad index. A six-to-nine-month catch-up is plausible on selected cyber-offense or agent benchmarks. If open weights close that gap while hosted US models remain handicapped, open systems could end up more capable at cyber offense in practice simply by being available. Disabling a model whose usage can be restricted and monitored, while equally capable weights circulate freely with no way to monitor or restrict them, would be close to incoherent. The alternative, restricting open-weight releases, opens a bigger fight over regulatory capture, competition, research, sovereignty, defensive security, and private deployment.

This episode is a gift to Mistral’s core pitch: do not depend on a US provider that its own government can switch off overnight. Enterprises will sign more backup contracts and test self-hosted models sooner.

Expect the reaction to run in two directions at once. There will be more focus on open-weight models and more sovereign AI activity as governments outside the US push to build domestic capability. The Fable shutdown is the sharpest demonstration yet that frontier access is a lever a single government can pull. The catch is arithmetic. Nowhere near enough capital is being committed outside the US to compete with the $1 trillion-plus annual AI capex of US big tech.
China is the clear number two and the only other full-stack frontier ecosystem, but leaning on Chinese models as a backup to US ones will be politically and commercially uncomfortable for European firms and many others. That leaves most of the world choosing between two dependencies it does not fully control, with no third pole anywhere close to being funded.

So I think the open-weight debate is getting a lot louder this year, and it’s no longer framed as simply open versus closed. The live questions are capability thresholds, verified access, model-weight security, country restrictions, and whether any national rule can hold once comparable systems are trained somewhere else.

For now, we cannot wait to get Fable back. It moved the frontier in coding, research, visual work, memory, files, and sustained execution, and three days was enough to reset what we expected from a model. The larger lesson is harder to unwind: a single government has shown it can take a frontier model offline worldwide within hours, and that precedent will shape enterprise contingency planning, sovereign-AI demand, and how every US lab stages its next launch. Our own default will still be to reach for the most capable model when access returns, while putting real thought into the goals, instructions, evidence, and review around it. Fable expanded what could be delegated. It did not remove the need to decide what work is worth doing.

The Fable shutdown turned model-provider risk from a line in a vendor deck into an operational problem. A team could have picked Fable on Tuesday, started migrating real workflows onto it on Wednesday, and lost it on Friday, with no outage to point at and no service-level agreement to invoke. Government action does not show up on a status page. The practical response is to stop building critical workflows that only one model can run.
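As a minimal sketch of what a workflow that more than one model can run might look like, the Python below routes a prompt through an ordered list of providers and scores any provider against a small evaluation set kept outside all of them. `Provider`, `ProviderUnavailable`, `run_with_fallback`, and `pass_rate` are hypothetical names for illustration, not part of any vendor SDK, and the two stub functions stand in for real API clients.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class ProviderUnavailable(Exception):
    """Hypothetical error an adapter raises when its model cannot be reached."""


@dataclass
class Provider:
    """Hypothetical adapter: a name plus any callable mapping a prompt to text."""
    name: str
    complete: Callable[[str], str]


def run_with_fallback(prompt: str, providers: Sequence[Provider]) -> Tuple[str, str]:
    """Try providers in priority order and return (provider_name, output)."""
    failures: List[str] = []
    for provider in providers:
        try:
            return provider.name, provider.complete(prompt)
        except ProviderUnavailable as exc:
            failures.append(f"{provider.name}: {exc}")
    raise RuntimeError("no provider available (" + "; ".join(failures) + ")")


def pass_rate(provider: Provider,
              eval_set: Sequence[Tuple[str, str]],
              grade: Callable[[str, str], bool]) -> float:
    """Fraction of (prompt, expected) pairs the provider passes under `grade`."""
    if not eval_set:
        raise ValueError("eval_set must not be empty")
    passed = sum(grade(provider.complete(prompt), expected)
                 for prompt, expected in eval_set)
    return passed / len(eval_set)


# Stub adapters standing in for real API clients (assumptions for the demo).
def withdrawn_model(prompt: str) -> str:
    raise ProviderUnavailable("access withdrawn")


def backup_model(prompt: str) -> str:
    return prompt.strip().lower()


if __name__ == "__main__":
    chain = [Provider("primary", withdrawn_model), Provider("backup", backup_model)]
    print(run_with_fallback("  Summarize Q3  ", chain))
    evals = [("YES", "yes"), ("No", "no"), ("Maybe", "MAYBE")]
    print(pass_rate(chain[1], evals, lambda out, exp: out == exp))
```

Because the evaluation set and the grading function live outside any adapter, the same `pass_rate` call can be run against the primary and the backup, and the difference between the two scores is a first estimate of what a forced switch would cost.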
Keep a portable evaluation set, hold your prompts and tool definitions outside any single provider, and test at least one fallback model before you actually need it. For your highest-value workflows, measure the performance loss when you switch providers, define a degraded-but-acceptable operating mode, and decide in advance which tasks can continue running at lower quality. Run that switch drill quarterly and name an owner for both the cutover and the customer communication, so the first time you do it isn’t during a real shutdown.

Procurement should now ask vendors about export-control exposure, identity requirements, data retention, regional availability, and the provider’s right to withdraw a model. Those questions belong next to security, privacy, and uptime. The most capable model can still be the right pick; the workflow just has to assume that access can change for reasons unrelated to the technology.

None of that lowers the value of the capability itself, and this is where the management question matters more than the model question. Fable widened the size of the task you can hand off: it can read a large codebase, work across documents and images, use tools, hold files, and run far longer before losing the thread. The human edge moves up the stack to choosing the outcome that matters, deciding which constraints are real, defining what evidence will count, and setting the point at which the model must stop for review. The teams that get the most out of these systems will pair stronger execution with tighter direction and keep a human in the loop well past the final glance.

Plan for access itself to get heavier, too. The top capability tiers are drifting toward something closer to a regulated account than a chatbot signup: government ID, organization verification, purpose declarations, retention, monitoring, and permanent consequences for abuse, with cybersecurity professionals likely needing separate verification to clear the broad classifiers.
That makes frontier AI slower and more expensive to run, and providers will either absorb the cost, raise prices, or reserve the best models for higher-priced verified tiers. Evaluating open-weight alternatives is the obvious hedge, but it is not a free pass on governance, and the policy response can also jump from hosted access to the weights themselves.

The strongest model on the leaderboard is now only half the decision. The other half is whether your work survives the week that model becomes unavailable, and after Fable, that is no longer a hypothetical.

— Louie Peters — Towards AI Co-founder and CEO

1. Anthropic Releases and Disables Claude Fable 5 and Mythos 5 After US Government Order https://www.anthropic.com/news/fable-mythos-access

Anthropic launched Claude Fable 5 and Claude Mythos 5 on June 9, then disabled both models for all customers three days later after receiving a US export control directive on June 12. Commerce Secretary Howard Lutnick sent a letter to CEO Dario Amodei stating that both models would be subject to export controls to any location outside the US and to all foreign persons within the country, including Anthropic’s own foreign-national employees. Because Anthropic cannot filter foreign nationals from US users in real time, it shut both models down entirely to ensure compliance. Anthropic stated the government did not provide specific details about its national security concern, but believes the directive was triggered after another company demonstrated a method of jailbreaking Fable 5 to identify minor, previously known software vulnerabilities. Anthropic disputes the action, arguing the standard would halt all new frontier model deployments across the industry. All other Claude models, including Opus 4.8, remain fully available. Over 80 cybersecurity executives and technical leaders signed an open letter on June 14 asking the Commerce Department to lift the restrictions.
Z.AI shipped GLM-5.2, available immediately across all GLM Coding Plan tiers (Lite, Pro, Max, Team). The model ships with a 1M-token context window and up to 131,072 output tokens, with two thinking effort levels (High and Max). It is compatible with Claude Code, Cline, OpenClaw, and Roo Code through an Anthropic-compatible endpoint. Z.AI did not publish benchmark numbers at launch. The standalone API, the Z.AI chatbot, and MIT-licensed open weights are scheduled for the following week. Coding Plan pricing starts at approximately $18/month for the Lite tier. Founder Jie Tang opened the launch post one day after the US Commerce Department suspended access to Claude Fable 5, writing: “the sudden restriction of certain frontier models is deeply regrettable.”

3. Moonshot AI Launches Kimi Work https://www.kimi.com/products/kimi-work

Moonshot AI launched Kimi Work, a desktop application that runs locally on the user’s machine. Powered by Kimi K2.6, the application coordinates up to 300 specialized sub-agents operating in parallel across up to 4,000 coordinated steps. Each sub-agent handles a specific slice of a larger workflow: research, document creation, coding, data analysis, and browser automation. WebBridge, a companion browser extension, lets the agent interact with the user’s logged-in browser sessions to search, extract data, and fill forms across tabs. The application targets knowledge workers doing financial analysis, report generation, and project management. Kimi Work is currently in internal testing.

Zyphra released Zamba2-VL, a family of open vision-language models built on the Zamba2 hybrid Mamba2-Transformer backbone, available at 1.2B, 2.7B, and 7B parameters. Each model pairs the Qwen2.5-VL vision encoder with Zyphra’s hybrid architecture, in which Mamba2 state-space layers handle the bulk of the computation in linear time, while shared transformer blocks with LoRA adapters preserve in-context retrieval.
Across 14 benchmarks, Zamba2-VL is competitive with leading Transformer-based open VLMs at comparable scale, including the Molmo2, Qwen3-VL, and InternVL3.5 families, while substantially outperforming prior SSM-based VLMs. The primary advantage is inference speed: Zamba2-VL delivers roughly an order-of-magnitude lower time-to-first-token than Transformer baselines at a matched parameter scale, with the efficiency gap most pronounced at the 1.2B and 2.7B sizes relevant to edge and on-device deployment. All three models are released under Apache 2.0 on Hugging Face.

5. Cohere Ships North Mini Code https://huggingface.co/CohereLabs/North-Mini-Code-1.0

Cohere released North Mini Code 1.0, its first open-source agentic coding model. It is a 30B-parameter MoE model with 128 experts, 8 activated per token (3B active parameters), using interleaved sliding-window attention with RoPE and global attention without positional embeddings, in a 3:1 ratio. The model supports a 256K context window and 64K output length. A key design decision was to train across multiple agent harnesses (SWE-Agent, mini-SWE-Agent, OpenCode) rather than optimizing for a single one, yielding a 10% gain on the OpenCode evaluation while maintaining SWE-Agent performance. On the Artificial Analysis Coding Index, North Mini Code scored 33.4, outperforming Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), Devstral Small 2 (24B Dense), and larger models including Nemotron 3 Super (120B-A12B) and Devstral 2 (123B).

6. Google Releases Gemini 3.5 Live Translate https://deepmind.google/models/model-cards/gemini-3-5-audio/

Google released Gemini 3.5 Live Translate, a streaming speech-to-speech audio model that translates spoken language across 70+ languages and 2,000+ language combinations in near real time. Unlike turn-based translation systems that wait for a speaker to finish, the model processes speech continuously, staying a few seconds behind the speaker while preserving intonation, pacing, and pitch.
The model is based on Gemini 3 Pro, with a 128K-token context window. It is rolling out across three surfaces simultaneously: developers get public preview through the Gemini Live API and Google AI Studio, enterprise customers get private preview in Google Meet starting this month, and consumers get it through the Google Translate app on Android and iOS. On Android, a new Listening Mode streams translations directly through the phone’s earpiece.

7. Google Launches Gemini-SQL2 https://x.com/GoogleResearch/status/2065475343205740911

Google Research announced Gemini-SQL2, a text-to-SQL system built on Gemini 3.1 Pro, which achieved 80.04% execution accuracy on the BIRD single-model leaderboard. BIRD covers 12,751 question-SQL pairs across 95 databases in 37 professional domains, testing whether generated SQL runs and returns correct results. Google now holds the top two positions on BIRD’s single-model track, with the original Gemini-SQL second at approximately 77.2%. AWS’s Q-SQL follows at roughly 76.5%, with Claude Opus 4.6 at approximately 70.1%. Human performance on BIRD stands at 92.96%. Google has not published a technical report, model card, or API for Gemini-SQL2 as of the announcement date, meaning the benchmark claim cannot be independently reproduced.

If you ask ChatGPT to rewrite emails, summarize documents, brainstorm ideas, or make something sound more professional, you are only scratching the surface. That is useful, but it is still only 1% of what ChatGPT can do.

Instead of starting from scratch every single time, use Projects to keep your context, files, examples, and instructions in one place. That way, you do not need to explain your work again every time you open a new chat.

Here’s how you can start getting better at AI today: pick one task you do every week, like creating a report, preparing for a meeting, summarizing customer feedback, or planning your priorities. Build a repeatable workflow around it.
You can even use ChatGPT Tasks to run recurring prompts, like preparing a weekly briefing or reminding you to review key updates. That is how you can start using AI in your actual work.

If you want more practical tips on how to use AI at work, and not just better prompts, check out our Master AI for Work https://academy.towardsai.net/courses/ai-business-professionals?utm_source=Newsletter&utm_medium=email&utm_id=AItips course.

1. Version-Controlling Your Agents: Deployment, Rollback, and Safe Promotion Patterns https://pub.towardsai.net/version-controlling-your-agents-deployment-rollback-and-safe-promotion-patterns-6b7107dbe82a

Code reviews do not catch how production agents break, and this piece makes a direct case for treating agent configuration with the same discipline applied to software releases. It lays out three failure modes that arise when versioning is absent: live changes without isolation, manual rollback from memory, and silent degradation without an audit trail. It also proposes fixes, such as immutable config snapshots, staged promotion through canary environments, automated release gates, and pinning LLM model versions to prevent silent behavioral drift between provider updates.

2. The Complete Guide to Attention Variants in Transformers: From Scaled Dot-Product to Flash Attention https://pub.towardsai.net/the-complete-guide-to-attention-variants-in-transformers-from-scaled-dot-product-to-flash-960a3b83107e?sk=5d651f9df7ba9ceb31560403061ab7c0

Every attention variant in the transformer ecosystem traces back to one engineering constraint: the quadratic cost of computing an n×n attention matrix. This article follows that constraint from the original scaled dot-product formulation through Multi-Head, Multi-Query, and Grouped Query Attention, then into Sliding Window, RoPE, Linear Attention, Flash Attention, and Sparse Attention patterns.
Flash Attention receives particular focus for delivering O(n) memory with identical mathematical output by tiling computation inside GPU SRAM rather than materializing the full attention matrix.

3. Mechanistic Interpretability Is Having Its Moment: What Engineers Actually Need to Know https://pub.towardsai.net/mechanistic-interpretability-is-having-its-moment-what-engineers-actually-need-to-know-e4421f305f84

Mechanistic interpretability shifted from a research curiosity to a production-engineering concern in 2025 and 2026, earning a spot on MIT Technology Review’s list of breakthrough technologies. This article explains how sparse autoencoders decompose polysemantic neurons into interpretable features and how attribution graphs trace those features through model computation. It covers Anthropic’s application of these tools to Claude 3.5 Haiku, which revealed that the model plans rhymes before writing them, reasons in language-independent circuits, and maintains persistent reward-model bias features throughout every assistant interaction. It also covers probe classifiers and activation steering as tools for runtime monitoring and targeted behavior control without full fine-tuning.

4. How to Train a Scoring Model in the Age of Artificial Intelligence https://pub.towardsai.net/how-to-train-a-scoring-model-in-the-age-of-artificial-intelligence-59184b9ca8a5?sk=e385b327d3b0a05fb66b44a8e822482a

Building a credit scoring model means satisfying far more than a high AUC. This article walks through a full model selection methodology: training logistic regression models across variable combinations and evaluating them against statistical, business, and stability criteria. Candidate models are tested on training, test, and out-of-time samples using a penalized Gini to balance performance with consistency. A four-variable model emerges as the final pick, hitting 60% Gini and 49% PR-AUC with no overfitting.
OpenAI Codex handled code generation throughout, confirming that AI accelerates the workflow when analysts retain judgment over final decisions.

5. Your Secrets Are Probably Leaking: Machine Identity and Credential Sprawl Explained https://pub.towardsai.net/your-secrets-are-probably-leaking-follow-this-to-fix-it-19eb33f998aa

Most teams invest in user authentication but still ship applications with passwords baked into environment files, CI variables, and container manifests. This article traces how a single database password propagates silently across git history, Terraform state, CI logs, and Kubernetes Secrets, and why static shared credentials make rotation operationally risky enough to defer indefinitely. It explains the Secret Zero problem and shows how modern platforms replace bootstrap secrets with cryptographic platform identities. It is anchored in six design principles covering centralization, least privilege, short-lived credentials, identity-based access, auditability, and revocation.

6. Linear Algebra: The Skeleton of Every AI Model https://pub.towardsai.net/linear-algebra-the-skeleton-of-every-ai-model-955dc11703ba?sk=e0b03707dfb8a3882d3d1aaf0ce0937c

This article builds the connection between linear algebra and modern AI using nothing more than a shopping receipt. It traces how dot products, matrix multiplication, and weight grids connect a single neural network layer to self-attention in a transformer, showing why multiplication alone is insufficient: activation functions prevent layer collapse and give models the capacity to learn curves rather than flat planes. By the end, the self-attention mechanism in LLMs reduces to the same two operations, applied dynamically per sentence.

1. Pytest https://github.com/pytest-dev/pytest is the standard Python testing framework, supporting fixtures, parametrized tests, and a rich plugin ecosystem for unit, functional, and integration testing.

2. SkillSpector https://github.com/NVIDIA/SkillSpector is a security scanner for AI agent skills, covering 64 vulnerability patterns across 16 categories (prompt injection, credential exfiltration, MCP tool poisoning, and more).

3. Omnigent https://github.com/omnigent-ai/omnigent is a meta-harness that sits above Claude Code, Codex, Pi, and custom agents, letting you compose, swap, and govern them in a single, in-sync session.

4. Cypress https://github.com/cypress-io/cypress is a JavaScript end-to-end testing framework that runs directly in the browser, providing real-time reloading, automatic waiting, and time-travel debugging for testing web applications against Chrome, Firefox, Edge, and Electron.

5. Dapr 1.18 https://github.com/dapr/dapr/releases/tag/v1.18.0 adds workflow history signing, propagation, attestation, stable Jobs API support, and an MCPServer resource for exposing Model Context Protocol tool calls as durable workflows.

1. Flash-KMeans: Fast and Memory-Efficient Exact K-Means https://arxiv.org/abs/2603.09229

Existing GPU implementations of k-means are bottlenecked by two system-level constraints: the assignment stage materializes the full N×K distance matrix in HBM, creating an IO bottleneck, and the centroid update stage suffers from atomic write contention caused by irregular scatter-style aggregations. Flash-KMeans introduces two kernel-level fixes: FlashAssign, which fuses distance computation with an online argmin to bypass intermediate memory materialization entirely, and sort-inverse update, which constructs an inverse mapping to replace high-contention atomic scatters with high-bandwidth segment-level reductions. On NVIDIA H200 GPUs, it achieves up to 17.9x end-to-end speedup over existing baselines and runs over 200x faster than FAISS on large workloads, while producing mathematically exact results.

2. Efficient Memory Management for Large Language Model Serving with PagedAttention https://arxiv.org/abs/2309.06180

This paper introduces PagedAttention, which borrows virtual memory and paging concepts from operating systems to manage KV cache in non-contiguous blocks. Requests share physical memory via a page table, and blocks are allocated on demand as new tokens are generated rather than reserved up front. This eliminates fragmentation and enables memory sharing across parallel sequences (e.g., beam search, parallel sampling). Built into vLLM, PagedAttention achieves 2–4x higher throughput than state-of-the-art systems such as FasterTransformer and Orca, without any approximations or model modifications.

3. LightRAG: Simple and Fast Retrieval-Augmented Generation https://arxiv.org/abs/2410.05779

This paper introduces LightRAG, which incorporates graph structures into text indexing and retrieval processes. It operates in dual mode, combining low-level retrieval (specific entities and their relationships) with high-level retrieval (broader topics and themes) to handle both precise and abstract queries. Compared to existing RAG frameworks, including GraphRAG, LightRAG achieves consistently better retrieval relevance while reducing indexing costs through an incremental update mechanism that integrates new documents without rebuilding the full graph.

4. SkillOpt: Executive Strategy for Self-Evolving Agent Skills https://arxiv.org/abs/2605.23904

This paper argues that the skill should be trained as an external state of a frozen agent with the same discipline as weight-space optimization. SkillOpt uses a separate optimizer model to convert scored rollouts into bounded add/delete/replace edits on a single skill document, accepting an edit only when it strictly improves a held-out validation score. A textual learning-rate budget, rejected-edit buffer, and epoch-wise slow/meta updates stabilize the process while adding zero inference-time overhead at deployment.
Across six benchmarks, seven target models, and three execution harnesses (direct chat, Codex, Claude Code), SkillOpt is best or tied on all 52 evaluated cells.

1. OpenAI retired GPT-5.2 from ChatGPT https://help.openai.com/en/articles/6825453-chatgpt-release-notes , with existing GPT-5.2 conversations moved to GPT-5.5-class models. This matters for teams using ChatGPT in internal workflows because saved conversations can behave differently once the model behind them is retired.

2. Google released DiffusionGemma https://ai.google.dev/gemma/docs/diffusiongemma , an experimental open-weights text diffusion model built on Gemma 4’s sparse Mixture-of-Experts backbone (26B total parameters, roughly 4B active). Google reports up to 4x faster token generation on GPUs, with more than 1,000 tokens per second on a single H100 and more than 700 tokens per second on an RTX 5090, while a quantized checkpoint fits in about 18GB of VRAM. It handles text, image, and video inputs and outputs text, with Apache 2.0 weights available.

Analytics Engineer, Safety Systems @OpenAI (San Francisco, CA, USA)
Principal Research AI Innovation Lead @Bristol Myers Squibb (Remote/USA)
Senior GenAI Software Engineer @Liftoff (Remote/USA)
Senior AI Engineer @ChargePoint (Remote/India)
Enterprise AI adoption lead @Writer (Chicago, IL, USA)
Consultant — Cloud Native Infrastructure & AIOps @Nutanix (Remote)
AI Programme Manager @Capco (London, UK)
ML Engineering Intern @GeoComply (Vancouver, Canada)

Interested in sharing a job opportunity here? Contact sponsors@towardsai.net.

Think a friend would enjoy this too? Share the newsletter and let them join the conversation.

TAI 209: Claude Fable 5 Arrived, Then the US Government Took It Offline https://pub.towardsai.net/tai-209-claude-fable-5-arrived-then-the-us-government-took-it-offline-21b804f4d9ee was originally published in Towards AI https://pub.towardsai.net on Medium, where people are continuing the conversation by highlighting and responding to this story.