{"slug": "we-made-our-filesystem-47x-faster-by-deleting-it", "title": "We made our filesystem 47× faster by deleting it", "summary": "The article describes how the developers of microsandbox improved performance by 47× by replacing their user-space FUSE filesystem with a Linux EROFS disk image mounted directly by the VM. The original approach required every file operation to bounce between the VM and host through FUSE, causing severe slowdowns, especially for operations like Python imports. The fix eliminated this overhead by having the VM's own kernel handle filesystem operations natively, while also removing 5,300 lines of host-side code.", "body_md": "A user in our Discord said microsandbox felt slow. Listing every file in the Python standard library took 5.3 seconds inside a sandbox; in Docker it took milliseconds. We went digging.\nWe fixed it in v0.4: we replaced our user-space filesystem with a Linux disk image that the VM mounts directly. The geometric mean speedup across our mixed guest-visible filesystem suite is 47×, with the worst-case rows more than 1,000× faster, and the host filesystem code is about 5,300 lines shorter.\nWhere this started\nMy first try was monofs: a content-addressed filesystem with block-level dedup, compression, and distributed read replicas. It stored images at 1.3× their original size on disk, and microsandbox is local-first, so the long-tail dedup payoff wasn't worth the up-front cost. For v0.3 I switched to OCI plus a user-space overlay built on a libkrun hook; we got layer dedup and identical behavior on Linux and macOS, but everything still ran outside the kernel.\nWhere the time was going\nEvery file operation inside the VM had to bounce out to the host through FUSE, which is Linux's mechanism for letting an ordinary program act as a filesystem. 
To open a file, the VM hands the request to our host process, which walks every layer looking for the file and sends the answer back; the same trip happens for every stat, every readdir, and every cache miss. A single Python import triggers dozens of these round trips before your code even starts running, and a ten-layer image multiplies the cost of each one.\nWe spent the next stretch of v0.3 trying to make that path faster: better caching, fewer syscalls, smaller responses. Each change shaved a few percent. None of them changed the order of magnitude.\nDocker doesn't have this problem because Docker uses the kernel's own layered-filesystem driver (overlayfs), so file operations never leave the kernel. We were trying to match a kernel filesystem from outside the kernel; no cache could close that gap.\nSo we deleted the filesystem.\nThe new plan\nThe new plan was to stop bouncing every file operation between the VM and the host. We'd build a Linux filesystem image ahead of time, hand it to the VM as a virtual disk, and let the VM's own kernel mount it. With FUSE out of the loop, file operations inside the VM would stay inside the VM.\nThe filesystem we picked is EROFS: read-only, in-tree since the kernel needed it for Android, and easy to author. EROFS also solved the macOS problem: the VM's own kernel is Linux regardless of what's running outside it, so once the disk image is built, the host's filesystem stops mattering.\nNo mkfs, no mount, no helpers\nmicrosandbox runs on both Linux and macOS, and macOS lacks the host-side tools you'd normally use to build a filesystem image: no mkfs.ext4, no mkfs.erofs, no loopback mounts. If our image pipeline depended on any of them, we'd either have to ship a helper VM (heavy, slow to start) or live with a permanent split between platforms, and neither option fit microsandbox's \"single self-contained binary\" promise. So we wrote the image writers ourselves in Rust. 
A filesystem is a byte layout on disk; the writers just produce that layout. Three small pieces do the work:\n- An EROFS writer that emits the read-only image of an OCI layer.\n- An ext4 writer that emits the sparse, journaled scratch area each sandbox gets.\n- A VMDK descriptor that stitches everything into one virtual disk.\nNothing in the pipeline shells out, asks for root, or mounts a loopback device, and the same Rust code path builds the images on Linux and Apple Silicon without depending on host-only filesystem tools. The EROFS artifacts round-trip through a reader we also wrote, and CI boots the full stack under the real VM kernel. If a byte is wrong, two different readers tell us about it.\nThe first cut\nThe obvious way to use these writers was one EROFS image per OCI layer. The VM would get one virtual disk per layer plus one for the scratch area, and the kernel's overlayfs would merge them at boot. It worked: the first measurements landed between 10× and 175× faster than v0.3 depending on the workload, and we were ready to ship.\nThen we counted the layers. A Python image runs around ten; CUDA images more; some user-built ones push thirty or forty. microVMs cap how many devices they can carry, and we were attaching one disk per layer. We raised the cap, but the real fix was to stop using virtual disks to tell the VM \"this image has layers\" when the filesystem could carry that information itself.\nThe cleaner shape\nThe EROFS folks pointed us at a feature we hadn't been using: EROFS can build a metadata-only image, just the merged directory tree plus a pointer per file saying which underlying blob holds its bytes and at what offset. 
The kernel reads that image, treats the whole bundle as one virtual disk, and answers every lookup with a single calculation instead of a search across layers.\nThe pipeline becomes:\n- Pull the OCI layers as usual.\n- Build one small metadata image describing the merged tree.\n- Hand the VM one virtual disk that stitches the metadata and the layer blobs together.\nThe VM now only has to attach two rootfs block devices, no matter how many layers the original image had: one read-only VMDK-backed stack for the image (which internally references the merged-metadata image plus the per-layer EROFS extents), and one writable ext4 upper for the sandbox. Overlayfs only ever combines those two. This is the version we shipped, with a small libkrunfw kernel config tweak (CONFIG_EROFS_FS_XATTR + CONFIG_EROFS_FS_SECURITY) so EROFS exposes the xattrs overlayfs needs for whiteouts.\nInside the pipeline\nAt pull time, the host materializes each OCI layer into an EROFS artifact keyed by its diff ID, merges the layer metadata with provenance, writes fsmeta.erofs, and emits a VMDK descriptor over fsmeta.erofs plus the layer extents. At sandbox create time, microsandbox creates a sparse upper.ext4 for that sandbox. At boot, the guest sees /dev/vda for the read-only lower stack and /dev/vdb for the writable upper, and Linux overlayfs assembles /.\nWhat it bought us\nWe ran the same benchmark suite three times against both versions on a python image, with fresh state between runs. 
Across fourteen mixed guest-visible filesystem workloads, the geometric mean speedup is 47.18×, and the eight biggest movers are below.\nThe bars fall into two groups:\n- Rootfs path: the cleanest measure of the new OCI path; these operations now stay inside the guest kernel instead of bouncing through the host.\n- /tmp tmpfs: real guest-visible wins, but from cutting out the FUSE round-trip on guest tmpfs workloads rather than from the new EROFS lower-rootfs path.\nmetadata_scan_stdlib scans the metadata of every file in the Python standard library. It used to take half a second. It now takes about 2 milliseconds.\nWhat we stopped having to worry about\nLinux's overlayfs is a large spec, covering whiteouts, opaque directories, hardlinks across copy-up, directory renames, and a handful of xattr conventions that all have to behave exactly right. Our v0.3 reimplemented most of it in user space, and we were still chasing edge cases the day we deleted it. v0.4 doesn't reimplement any of it, because the VM's own kernel does the merging, and the bugs we used to have aren't fixed; they're gone.\nThe host still has to understand OCI layer semantics, but only once, at pull time. Whiteouts, opaque directories, hardlinks, xattrs, and case-sensitive paths get normalized into the merged metadata tree before fsmeta.erofs is written. After that, the runtime path is ordinary kernel EROFS plus overlayfs.\nmacOS's APFS is case-insensitive by default. Plenty of Linux images contain files whose names differ only by case, and extracting them onto a Mac used to collapse the second into the first. v0.4 never extracts to the host filesystem; the EROFS writer streams the tar straight into a binary image where both names live as distinct entries on disk.\nWhat this lets us build on\nBecause the rootfs is now a real disk image, the surrounding product surface gets cheaper.\n- OCI patches. 
Rootfs patches users want on top of the image get baked into upper.ext4 before boot, instead of bolted on through a runtime overlay protocol.\n- Shared lower layers. The per-layer EROFS artifacts are content-addressed by diff ID, so two sandboxes that share a base image share those bytes on disk and in cache.\n- Snapshots. A sandbox's writable state is a single ext4 file; preserving or copying it is a file copy.\n- Disk-image roots. Custom non-OCI disk-image rootfs reuses the same block-device boot machinery, minus the fsmerge step in front of it.\nWhat this doesn't fix\nOCI rootfs only. Bind volumes (host directories you share into the VM) still go through the old path. Their contents can change at any time while the VM is reading from them, which a read-only disk image cannot represent.\nFirst pulls aren't faster. We do more work at pull time now to build the images, though it is parallel across layers and bounded by tar decompression, so it lands close to where it was. Subsequent sandbox creates are faster, because we only emit a sparse scratch image.\nWrites to the image are still copy-on-write through overlayfs. Modifying a file from the image copies it up into the writable upper, exactly as in any overlay setup. The rootfs wins here are on lookup- and read-heavy paths; the /tmp lifecycle wins in the chart come from /tmp being a guest-side tmpfs by default, which is a separate runtime decision.\nWhat we would tell our past selves\nThe boring primitive in the kernel often beats the clever one in user space. Both monofs and our v0.3 overlay were ambitious designs, but EROFS is a boring, in-kernel file format, and for a sandbox rootfs the boring one won. We spent months tuning user-space code before accepting that the structural answer was to stop competing with the kernel and use it.\nNIH is fine when the existing thing breaks your design. 
Shelling out to mkfs.ext4 or mkfs.erofs would have meant either a helper VM or a Linux-only split, both of which would have undone microsandbox's \"single self-contained binary\" promise. Writing the writers ourselves was the cost of keeping that promise, and we'd make the same trade again.\nStay open to better ideas while shipping. Our first cut was already a big win, and we were tempted to ship it as is. The cleaner shape EROFS suggested looked like a nice-to-have at the time, but holding the PR open another week to absorb it turned a one-off optimization into something we are happy to support long term.\nRun benchmarks inside the VM. Timing from the host would have hidden the worst of the FUSE round-trip costs and made the win look smaller than it was. Time the thing your user actually waits on.\nTry it\nThis ships in microsandbox 0.4 and later. Install the CLI:\ncurl -sSL https://install.microsandbox.dev | sh\nOr use the SDK for your language:\nuv add microsandbox # python\nnpm install microsandbox # typescript\ncargo add microsandbox # rust\nThe benchmarks live in their own repo so they can grow into a cross-runtime comparison. 
With msb on PATH and a fresh ~/.microsandbox:\ngit clone https://github.com/superradcompany/sandbox-bench\ncd sandbox-bench/benches/fs\njust bench-quick", "url": "https://wpnews.pro/news/we-made-our-filesystem-47x-faster-by-deleting-it", "canonical_source": "https://microsandbox.dev/blog/oci-filesystem-47x-faster", "published_at": "2026-05-19 18:01:23+00:00", "updated_at": "2026-05-23 19:35:32.237267+00:00", "lang": "en", "topics": ["developer-tools", "open-source", "cloud-computing"], "entities": ["microsandbox", "Docker", "Python", "Linux", "FUSE", "OCI", "libkrun"], "alternates": {"html": "https://wpnews.pro/news/we-made-our-filesystem-47x-faster-by-deleting-it", "markdown": "https://wpnews.pro/news/we-made-our-filesystem-47x-faster-by-deleting-it.md", "text": "https://wpnews.pro/news/we-made-our-filesystem-47x-faster-by-deleting-it.txt", "jsonld": "https://wpnews.pro/news/we-made-our-filesystem-47x-faster-by-deleting-it.jsonld"}}