From 732 bytes to nowhere: shutting down Copy Fail in production

[ARTICLE · art-13445] src=together.ai ↗ pub=2026-04-30T00:00Z topic=ai-infrastructure verified=true sentiment=↓ negative

Together AI disabled the vulnerable `algif_aead` kernel module across its entire infrastructure within hours of exploit details emerging for CVE-2026-31431, a Linux kernel bug that gives unprivileged local users a precise 4-byte write primitive into any readable file's page cache. The company unloaded the module and removed it from the module path to prevent silent re-enabling, treating the vulnerability as a fleet-level emergency because the exploit can turn a container compromise into root access on shared AI infrastructure hosts.

We were able to get ahead of Copy Fail (CVE‑2026‑31431) by treating it as a fleet‑level emergency, shutting off the vulnerable crypto socket interface across our infrastructure within hours and rolling in kernel patches once they were stable in our AI workloads. Before upstream fixes were widely available, we relied on a targeted kernel hardening step: Unloading the vulnerable module and removing it from the module path so it could not be silently re-enabled.

Copy Fail in one paragraph

Copy Fail (CVE‑2026‑31431) is a logic bug in the Linux kernel’s crypto subsystem in the `algif_aead` AF_ALG interface used for AEAD operations. It gives any unprivileged local user a precise 4‑byte write primitive into the page cache of any readable file on the system. In practice, public exploits flip a few bytes in shared, setuid binaries in memory and ride that to root on mainstream Linux distributions. The on‑disk file never changes, and the page is never marked dirty, which means traditional file‑integrity checks don’t see the attack even as the modified binary runs.

Why this matters for AI infrastructure

On a developer laptop, Copy Fail is just a local privilege escalation. In a modern AI platform, “local” usually means CI jobs, multi‑tenant GPU nodes, ephemeral research environments, or third‑party workloads bringing their own dependencies.

From a cloud and AI perspective, the risk looks like this:

A compromise inside a container with access to AF_ALG sockets can be turned into root on the underlying host.
Because the page cache is shared, a write from one workload can subtly corrupt binaries or libraries used by other tenants on the same node.
Once a host is rooted, access to attached storage, control planes, and adjacent workloads becomes much easier.

We already operate under the assumption that containers are not a security boundary. Copy Fail is exactly the kind of quiet, deterministic primitive that can collapse the remaining margin in shared‑kernel multi‑tenant environments if you leave the vulnerable interface exposed.
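One way to check whether a workload can reach this interface is to try binding an AEAD socket from inside it. The following is a minimal Python sketch, not the tooling described in this article; the function name `aead_socket_exposed` and the default algorithm `gcm(aes)` are illustrative assumptions. Be aware that on a host where the module file is still on disk, a bind like this may cause the kernel to load `algif_aead` on demand, so it is best run only on test machines or on hosts that are expected to be mitigated.

```python
import socket


def aead_socket_exposed(algorithm: str = "gcm(aes)") -> bool:
    """Return True if this process can bind an AF_ALG socket of type "aead".

    `algorithm` is an assumed example name; any AEAD the kernel knows works.
    """
    if not hasattr(socket, "AF_ALG"):
        # Non-Linux platform, or a Python build without AF_ALG support.
        return False
    try:
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as sock:
            # The bind fails with an OSError when the "aead" socket type is
            # unavailable, the algorithm is unknown, or a sandbox blocks AF_ALG.
            sock.bind(("aead", algorithm))
        return True
    except OSError:
        return False
```

A result of `False` from inside a container suggests the interface is not reachable from that workload; it does not by itself prove the kernel is patched.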

Our immediate response: disable `algif_aead` everywhere

As soon as working exploit details landed, we focused on the most direct lever available: Stop exposing the vulnerable AF_ALG interface.

For Together AI’s production workloads, we do not depend on userspace algif_aead sockets on inference or training hosts. That gave us the freedom to take a blunt but safe action across the fleet:

Unloading the `algif_aead` module shut down the vulnerable code path immediately in the running kernel.
Moving the module file out of the standard module directory prevented system services or automation from re‑loading it later during normal operations.

This approach had a few important properties:

Fast: No reboot required, which matters when you’re running long‑lived GPU jobs.
Low‑risk: Typical server and AI workloads don’t rely on AF_ALG AEAD sockets directly, so the operational impact was minimal.
Durable: Even if a host rebooted into the same vulnerable kernel, it came back up with `algif_aead` still disabled.

We encoded this as an idempotent compliance check in our configuration management: A host is not considered healthy until the module is unloaded and the .ko file is quarantined.
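A check of this shape could be sketched as follows. This is a hypothetical Python illustration rather than the actual configuration-management code: the function names are invented, the input is assumed to be the text of `/proc/modules`, and the search assumes module files are named `algif_aead.ko` with an optional compression suffix. It also assumes the driver is built as a loadable module; a driver compiled into the kernel never appears in `/proc/modules` and would need a different check.

```python
from pathlib import Path
from typing import List

MODULE = "algif_aead"


def module_loaded(proc_modules_text: str, name: str = MODULE) -> bool:
    """True if `name` is the first field of any line in /proc/modules text."""
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and fields[0] == name:
            return True
    return False


def module_files(modules_root: Path, name: str = MODULE) -> List[Path]:
    """Find files like algif_aead.ko, .ko.xz or .ko.zst under modules_root."""
    return sorted(p for p in modules_root.rglob(f"{name}.ko*") if p.is_file())


def host_is_compliant(proc_modules_text: str, modules_root: Path) -> bool:
    """Healthy only if the module is not loaded AND no module file remains."""
    return not module_loaded(proc_modules_text) and not module_files(modules_root)
```

On a real host the inputs would typically be `Path("/proc/modules").read_text()` and the module directory for the running kernel release. The check only reads state, so running it repeatedly is safe, and the quarantine location must sit outside the directory being searched.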

Rolling out kernel patches safely

Disabling `algif_aead` was a mitigation, not the final state. Once vendors release patches for CVE‑2026‑31431, we will move to a more traditional lifecycle:

Stage patched kernels in non‑production clusters that mirror our heaviest AI workloads, including dense multi‑tenant GPU nodes.
Run accelerated soak tests for performance, GPU driver compatibility, and stability under real inference and training loads.
Roll out patched kernels gradually by region and environment, starting with less shared clusters and moving toward heavily multi‑tenant ones as telemetry stays clean.

Even after patching, we are keeping `algif_aead` disabled in environments that do not have a clear need for it. Narrow, specialized kernel interfaces can have an ecosystem‑wide blast radius once something goes wrong; if we can safely run without them, we will.

In parallel, our detection teams added Copy Fail‑aware signals into our telemetry:

Alerts for unexpected AF_ALG usage or crypto module loading on nodes where it should never happen.
Behavioral monitoring for privileged binaries, looking for anomalies even when the on‑disk image remains unchanged.
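The first signal above could, for example, be derived from Linux audit logs. The sketch below is an assumption about how such a detector might look, not a description of the telemetry actually deployed: it presumes an audit rule is already recording `socket` system calls on x86_64 hosts, and the function name and returned fields are hypothetical. It relies on audit `SYSCALL` records printing arguments in hexadecimal, so the `AF_ALG` family (38) appears as `a0=26`.

```python
import re
from typing import Dict, Iterable, Iterator

AF_ALG = 38                  # AF_ALG address family number on Linux
SYS_SOCKET_X86_64 = "41"     # socket(2) syscall number on x86_64
ARCH_X86_64 = "c000003e"     # audit arch value for x86_64

# key=value pairs; values are either double-quoted or run to the next space.
_FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def af_alg_socket_events(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield pid, uid and exe for audit records of socket(AF_ALG, ...)."""
    for line in lines:
        if not line.startswith("type=SYSCALL"):
            continue
        fields = dict(_FIELD.findall(line))
        if fields.get("arch") != ARCH_X86_64:
            continue
        if fields.get("syscall") != SYS_SOCKET_X86_64:
            continue
        try:
            family = int(fields.get("a0", ""), 16)  # a0 is logged in hex
        except ValueError:
            continue
        if family == AF_ALG:
            yield {
                "pid": fields.get("pid", ""),
                "uid": fields.get("uid", ""),
                "exe": fields.get("exe", "").strip('"'),
            }
```

Any event on a node that is supposed to have the interface disabled is worth an alert; on nodes with a legitimate need, the `exe` field gives a starting point for an allowlist.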

Lessons for running secure AI platforms

Copy Fail is a good illustration of how small kernel bugs can have outsized impact in AI infrastructure:

Shared kernels and dense multi‑tenancy amplify local bugs into cross‑tenant risks.
Page cache tricks can bypass traditional file‑integrity‑based defenses.
Narrow interfaces that “nobody uses” can suddenly become the main attack surface.

Our takeaway at Together AI is to keep tightening our kernel exposure model: Default‑off for niche interfaces, fast fleet‑wide toggles when something goes wrong, and a validation pipeline that proves these decisions are compatible with high‑performance AI workloads.

Read original on together.ai → www.together.ai/blog/shutting-down-copy-fail-in-…
