cd /news/ai-infrastructure/from-732-bytes-to-nowhere-shutting-d… · home topics ai-infrastructure article
[ARTICLE · art-13445] src=together.ai pub= topic=ai-infrastructure verified=true sentiment=↓ negative

From 732 bytes to nowhere: shutting down Copy Fail in production

Together AI disabled the vulnerable `algif_aead` kernel module across its entire infrastructure within hours of exploit details for CVE-2026-31431, a Linux kernel bug that gives unprivileged local users a precise 4-byte write primitive into any readable file's page cache. The company unloaded the module and removed it from the module path to prevent silent re-enabling, treating the vulnerability as a fleet-level emergency because the exploit can turn a container compromise into root access on shared AI infrastructure hosts.

read4 min publishedApr 30, 2026

We were able to get ahead of Copy Fail (CVE‑2026‑31431) by treating it as a fleet‑level emergency, shutting off the vulnerable crypto socket interface across our infrastructure within hours and rolling in kernel patches once they were stable in our AI workloads. Before upstream fixes were widely available, we relied on a targeted kernel hardening step: Un the vulnerable module and removing it from the module path so it could not be silently re-enabled.

Copy Fail in one paragraph #

Copy Fail (CVE‑2026‑31431) is a logic bug in the Linux kernel’s crypto subsystem in the algif_aead

AF_ALG interface used for AEAD operations. It gives any unprivileged local user a precise 4‑byte write primitive into the page cache of any readable file on the system. In practice, public exploits flip a few bytes in shared, setuid binaries in memory and ride that to root on mainstream Linux distributions. The on‑disk file never changes, and the page is never marked dirty, which means traditional file‑integrity checks don’t see the attack even as the modified binary runs.

Why this matters for AI infrastructure #

On a developer laptop, Copy Fail is just a local privilege escalation. In a modern AI platform, “local” usually means CI jobs, multi‑tenant GPU nodes, ephemeral research environments, or third‑party workloads bringing their own dependencies.

From a cloud and AI perspective, the risk looks like this:

  • A compromise inside a container with access to AF_ALG sockets can be turned into root on the underlying host.
  • Because the page cache is shared, a write from one workload can subtly corrupt binaries or libraries used by other tenants on the same node.
  • Once a host is rooted, access to attached storage, control planes, and adjacent workloads becomes much easier.

We already operate under the assumption that containers are not a security boundary. Copy Fail is exactly the kind of quiet, deterministic primitive that can collapse the remaining margin in shared‑kernel multi‑tenant environments if you leave the vulnerable interface exposed.

Our immediate response: disable algif_aead #

everywhere

As soon as working exploit details landed, we focused on the most direct lever available: Stop exposing the vulnerable AF_ALG interface.

For Together AI’s production workloads, we do not depend on userspace algif_aead sockets on inference or training hosts. That gave us the freedom to take a blunt but safe action across the fleet:

Un the algif_aead

module shut down the vulnerable code path immediately in the running kernel. Moving the module file out of the standard module directory prevented system services or automation from re‑ it later during normal operations. This approach had a few important properties:

  • Fast: No reboot required, which matters when you’re running long‑lived GPU jobs.
  • Low‑risk: Typical server and AI workloads don’t rely on AF_ALG AEAD sockets directly, so the operational impact was minimal.
  • Durable: Even if a host rebooted into the same vulnerable kernel, it came back up with algif_aead

still disabled.

We encoded this as an idempotent compliance check in our configuration management: A host is not considered healthy until the module is unloaded and the .ko file is quarantined.

Rolling out kernel patches safely #

Disabling algif_aead

was a mitigation, not the final state. Once vendors release patches for CVE‑2026‑31431, we will move to a more traditional lifecycle:

  • Stage patched kernels in non‑production clusters that mirror our heaviest AI workloads, including dense multi‑tenant GPU nodes.
  • Run accelerated soak tests for performance, GPU driver compatibility, and stability under real inference and training loads.
  • Roll out patched kernels gradually by region and environment, starting with less shared clusters and moving toward heavily multi‑tenant ones as telemetry stayed clean.

Even after patching, we are keeping algif_aead

disabled in environments that do not have a clear need for it. Narrow, specialized kernel interfaces can have an ecosystem‑wide blast radius once something goes wrong; if we can safely run without them, we will.

In parallel, our detection teams added Copy Fail‑aware signals into our telemetry:

  • Alerts for unexpected AF_ALG usage or crypto module on nodes where it should never happen.
  • Behavioral monitoring for privileged binaries, looking for anomalies even when the on‑disk image remains unchanged.

Lessons for running secure AI platforms #

Copy Fail is a good illustration of how small kernel bugs can have outsized impact in AI infrastructure:

  • Shared kernels and dense multi‑tenancy amplify local bugs into cross‑tenant risks.
  • Page cache tricks can bypass traditional file‑integrity‑based defenses.
  • Narrow interfaces that “nobody uses” can suddenly become the main attack surface.

Our takeaway at Together AI is to keep tightening our kernel exposure model: Default‑off for niche interfaces, fast fleet‑wide toggles when something goes wrong, and a validation pipeline that proves these decisions are compatible with high‑performance AI workloads.

── more in #ai-infrastructure 4 stories · sorted by recency
── more on @copy fail 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/from-732-bytes-to-no…] indexed:0 read:4min 2026-04-30 ·