A debugging story about heap snapshots, native memory that --max-old-space-size
can't touch, and a WebAssembly filesystem quietly hoarding files.
I run a small service that gives a team of Claude Code users one shared memory store. Mechanically it's a Node/Express proxy that wraps a stdio MCP server (ruflo) and exposes it over HTTP. You don't need the product to follow the bug, just one fact: a long-lived Node process serves memory operations, and underneath it uses sql.js (SQLite compiled to WebAssembly) to hold the store.
One instance in production kept growing. Not spiking, creeping: ~36 GB RSS over six weeks, then the cgroup OOM-killer would reap it and the clock reset. Classic leak shape.
The proxy and the wrapped MCP child are separate processes. ps settled it fast: the proxy sat flat at ~60 MB; the ruflo mcp start child was the one ballooning. So the leak was below my code, in the wrapped process. Good: a narrower problem.
First instinct on a Node leak is the V8 heap. So I looked at process.memoryUsage() on the live child:
rss 1385 MB
heapTotal 24 MB
heapUsed 21 MB
external 1286 MB
arrayBuffers 995 MB
This is the whole story in five numbers. heapTotal, the V8 JS heap, is flat at 24 MB. The growth is entirely in external / arrayBuffers: native memory backing ArrayBuffers, which lives outside the V8 heap. That immediately kills the "obvious" fixes, --max-old-space-size first among them: that flag only caps the V8 heap, which isn't what's growing.
So: what holds ~1 GB of ArrayBuffers?
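The heap-versus-external comparison above can be turned into a quick check. This is a minimal sketch: `classifyMemory` and its "external larger than heapTotal" rule are hypothetical, chosen only to illustrate the reading of the five numbers. Note that Node reports `arrayBuffers` as part of `external`, so the two are not added together.

```javascript
// Hypothetical helper: decide whether memory sits in the V8 heap or in native buffers.
// The "external > heapTotal" rule is an illustrative assumption, not a standard threshold.
function classifyMemory(usage) {
  const mb = (bytes) => Math.round(bytes / 1024 / 1024);
  return {
    heapMB: mb(usage.heapTotal),
    externalMB: mb(usage.external), // already includes arrayBuffers
    arrayBuffersMB: mb(usage.arrayBuffers),
    suspect: usage.external > usage.heapTotal ? 'native' : 'v8-heap',
  };
}

// In a live process this would be classifyMemory(process.memoryUsage()).
// Here the figures from the post are fed in directly.
const MB = 1024 * 1024;
const report = classifyMemory({
  rss: 1385 * MB,
  heapTotal: 24 * MB,
  heapUsed: 21 * MB,
  external: 1286 * MB,
  arrayBuffers: 995 * MB,
});
console.log(report.suspect); // prints "native"
```

Sampling this on an interval and watching which field moves is usually enough to tell the two leak families apart before reaching for a snapshot.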
I opened the inspector on the live process (kill -USR1 <pid>, then connected over the WebSocket; Node 22 has a global WebSocket, so a 30-line script does it) and took a HeapProfiler.takeHeapSnapshot. The snapshot was only ~18 MB, which is itself a clue: if the leak were hundreds of thousands of small JS objects, the graph would be huge. A small graph holding a lot of bytes means a few big buffers.
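A minimal sketch of what such a script might look like; it is not the original script. It assumes Node 22+ (global `fetch` and `WebSocket`), that the inspector is already listening on the default 127.0.0.1:9229 after the SIGUSR1, and that the first listed target is the process of interest. `takeSnapshot` and the output file name are hypothetical.

```javascript
const fs = require('node:fs');

// Stream a heap snapshot from an already-open inspector into a file.
async function takeSnapshot(outFile, host = '127.0.0.1', port = 9229) {
  // The inspector's HTTP endpoint lists debuggable targets and their WebSocket URLs.
  const targets = await (await fetch(`http://${host}:${port}/json/list`)).json();
  const ws = new WebSocket(targets[0].webSocketDebuggerUrl);
  const out = fs.createWriteStream(outFile);

  await new Promise((resolve, reject) => {
    ws.addEventListener('open', () => {
      ws.send(JSON.stringify({ id: 1, method: 'HeapProfiler.takeHeapSnapshot' }));
    });
    ws.addEventListener('message', (ev) => {
      const msg = JSON.parse(ev.data);
      if (msg.method === 'HeapProfiler.addHeapSnapshotChunk') {
        out.write(msg.params.chunk); // the snapshot arrives as a series of text chunks
      } else if (msg.id === 1) {
        ws.close(); // the reply to our request arrives after the last chunk
        if (msg.error) reject(new Error(msg.error.message));
        else out.end(resolve);
      }
    });
    ws.addEventListener('error', () => reject(new Error('inspector connection failed')));
  });
}

// Usage (not run here): takeSnapshot('child.heapsnapshot').then(() => console.log('done'));
```

Taking the snapshot pauses the target while V8 walks the heap, so on a production process it is worth picking a quiet moment.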
Parsing the snapshot (the format is just nodes / edges / strings arrays), the top retained objects were unambiguous:
203 × native:system / JSArrayBufferData @ 11.0 MB = 2233 MB
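A grouping like the line above can be sketched in a few lines. This is an illustration under assumptions: `groupBySelfSize` is a hypothetical name, and it sums self size per type and name rather than computing true retained size (which needs a dominator tree). For ArrayBuffer backing stores the self size is where the bytes are, so the simpler sum is enough to surface them.

```javascript
// Group heap-snapshot nodes by "type:name" and sum their self sizes.
// `snap` is the parsed .heapsnapshot JSON; nodes is a flat array with one
// fixed-width record per node, laid out as described by meta.node_fields.
function groupBySelfSize(snap) {
  const fields = snap.snapshot.meta.node_fields;
  const typeNames = snap.snapshot.meta.node_types[0];
  const stride = fields.length;
  const iType = fields.indexOf('type');
  const iName = fields.indexOf('name');
  const iSize = fields.indexOf('self_size');

  const groups = new Map();
  for (let off = 0; off < snap.nodes.length; off += stride) {
    const type = typeNames[snap.nodes[off + iType]]; // index into the type table
    const name = snap.strings[snap.nodes[off + iName]]; // index into strings
    const key = `${type}:${name}`;
    const g = groups.get(key) ?? { key, count: 0, bytes: 0 };
    g.count += 1;
    g.bytes += snap.nodes[off + iSize];
    groups.set(key, g);
  }
  // Largest total first.
  return [...groups.values()].sort((a, b) => b.bytes - a.bytes);
}

// Usage (not run here):
// const snap = JSON.parse(require('node:fs').readFileSync('child.heapsnapshot', 'utf8'));
// console.table(groupBySelfSize(snap).slice(0, 10));
```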
203 buffers, 11 MB each. And 11 MB was exactly the size of the on-disk memory.db. The retainer chain:
JSArrayBufferData (11 MB)
<- ArrayBuffer
<- Buffer
<- (MEMFS file node).contents
<- FS.nodes (an Array)
<- Context (the sql.js Emscripten module: has WebAssembly.Memory, HEAPF32, createNode, /dev/tty…)
<- SqlJsBackend.db
That Context with createNode, /dev/tty, and a WebAssembly.Memory is the tell: it's Emscripten's in-memory filesystem (MEMFS). The file names confirmed it: each buffer was a MEMFS file called dbfile_<random>, and there were ~200 of them, each a full copy of the database.
Here's the mechanism. sql.js's Database constructor writes its input bytes into a MEMFS file (dbfile_<random>) via FS.createDataFile. Database.prototype.close() is what removes it (FS.unlink). And the sql.js module is a process-wide singleton: one MEMFS shared by every Database you ever open.
The backend opened the database like this, per operation path, with no caching:
this.db = new SQL.Database(fs.readFileSync(path)); // loads the whole 11MB image
// ...used, then the wrapper goes out of scope
When that JS Database wrapper is dropped, V8 garbage-collects the wrapper object, but GC has no idea about the MEMFS file it created inside the WASM module. Only an explicit close() unlinks it. No close() → the 11 MB dbfile_<random> lives in MEMFS forever. One leaked DB image per open. Multiply by traffic and you get 36 GB.
This is the trap in one sentence: garbage-collecting a JS handle does not free native/WASM memory it allocated. The GC sees a tiny wrapper; the cost is in a buffer the GC doesn't manage.
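The mechanism is easy to reproduce in miniature. What follows is a toy model, not sql.js: a shared `Map` stands in for the module-wide MEMFS, `ToyDatabase` is a hypothetical class that mimics only the create-on-open and unlink-on-close behaviour described above, and a counter replaces the random file suffix so the result is deterministic.

```javascript
// Toy stand-in for the module-wide MEMFS: file name -> bytes.
const memfs = new Map();
let nextId = 0;

class ToyDatabase {
  constructor(bytes) {
    this.filename = `dbfile_${nextId++}`; // sql.js uses a random suffix here
    memfs.set(this.filename, bytes); // stands in for FS.createDataFile
  }
  close() {
    memfs.delete(this.filename); // stands in for FS.unlink
  }
}

// Leaky pattern: the wrapper goes out of scope, its file stays behind.
function leakyRead(image) {
  const db = new ToyDatabase(image);
  return image.length;
}

// Safe pattern: close() runs even if the work in between throws.
function safeRead(image) {
  const db = new ToyDatabase(image);
  try {
    return image.length;
  } finally {
    db.close();
  }
}

const image = Buffer.alloc(1024);
for (let i = 0; i < 5; i++) leakyRead(image);
const afterLeaky = memfs.size;
for (let i = 0; i < 5; i++) safeRead(image);
const afterSafe = memfs.size;
console.log({ afterLeaky, afterSafe }); // prints { afterLeaky: 5, afterSafe: 5 }
```

The five leaked files never go away, and the five safe opens add nothing. Garbage collection never enters the picture because the `Map` (like MEMFS) still references every file.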
Containment (ship today). I added an RSS watchdog to the proxy: it reads the child's RSS from /proc/<pid>/status, and when it crosses a threshold it gracefully respawns the child once it's idle (reusing an existing single-flight reconnect path: kill the old child, spawn a fresh one). A respawn drops the entire bloated MEMFS at once. Symptomatic, but it bounds memory with zero dropped requests.
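The core of such a watchdog is small. This is a sketch under assumptions, not the proxy's actual code: the function names, the idle test (zero in-flight requests), and the threshold are all hypothetical. The `VmRSS` line is the resident set size in kB as Linux reports it in /proc/<pid>/status.

```javascript
const fs = require('node:fs');

// Extract VmRSS (in kB) from the text of /proc/<pid>/status; null if the line is absent.
function parseVmRssKb(statusText) {
  const m = /^VmRSS:\s+(\d+)\s+kB/m.exec(statusText);
  return m ? Number(m[1]) : null;
}

// Linux only: read the child's current RSS.
function readChildRssKb(pid) {
  return parseVmRssKb(fs.readFileSync(`/proc/${pid}/status`, 'utf8'));
}

// Hypothetical policy: respawn only when over the limit and nothing is in flight,
// so no request is cut off mid-operation.
function shouldRespawn(rssKb, limitKb, inFlight) {
  return rssKb !== null && rssKb > limitKb && inFlight === 0;
}

// Usage (not run here), polling every 30 s with a 2 GB limit:
// setInterval(() => {
//   if (shouldRespawn(readChildRssKb(child.pid), 2 * 1024 * 1024, inFlightCount)) respawnChild();
// }, 30_000);
```

Deferring the respawn until idle is what keeps this from dropping requests; a busy child simply gets checked again on the next tick.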
Root cause (fix it properly). Cache the backend per database path so the DB opens once and is reused, instead of a fresh SQL.Database per call. No repeated loads → no new dbfile_*. I baked this into the image as a build-time patch and filed it upstream with the snapshot.
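The caching idea can be sketched as a small memoizer keyed by resolved path. This is an illustration, not the patch that was filed upstream: `getBackend`, `closeBackend`, and the `open` factory are hypothetical names, and the only thing assumed of a backend is that it has a `close()` method.

```javascript
const path = require('node:path');

// One backend per database file, keyed by absolute path.
const backends = new Map();

// `open` is a hypothetical factory, e.g. (p) => new SQL.Database(fs.readFileSync(p)).
// It runs once per path; later calls reuse the cached backend.
function getBackend(dbPath, open) {
  const key = path.resolve(dbPath);
  let backend = backends.get(key);
  if (!backend) {
    backend = open(key);
    backends.set(key, backend);
  }
  return backend;
}

// Close and forget a cached backend, e.g. on shutdown.
function closeBackend(dbPath) {
  const key = path.resolve(dbPath);
  const backend = backends.get(key);
  if (backend) {
    backend.close();
    backends.delete(key);
  }
}
```

Resolving the path first means `./memory.db` and its absolute form share one entry, which matters when different call sites spell the same file differently.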
The earlier hard OOM-kills had interrupted a sql.js write mid-flight and left one memory.db corrupted: database disk image is malformed, busted overflow pages in the B-tree. Recovery turned into its own adventure:
.recover (SQLite's salvage mode) reconstructed the bulk of the rows by walking the B-tree fragments.
The remaining rows were harder: some lived only in the write-ahead log (-wal), which .recover doesn't replay, and some sat on the corrupted pages. I ended up parsing WAL frames by hand (apply page images by page number) and carving SQLite leaf-page records directly to recover the rest.
Lesson burned in: a WAL-mode SQLite backup is three files (db, -wal, -shm). Copy only the .db and you get exactly that "malformed" error, because the latest committed state is still in the WAL.
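A file-level backup that respects this can be sketched as below. The helper names are hypothetical, and the sketch assumes nothing is writing to the database during the copy; with a live writer, a plain file copy can still capture an inconsistent state, and SQLite's own backup facilities are the safer route.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// SQLite names its sidecar files by appending -wal and -shm to the database file name.
function walFileSet(dbPath) {
  return [dbPath, `${dbPath}-wal`, `${dbPath}-shm`];
}

// Copy whichever of the three files exist into destDir; returns the copied paths.
function copyWalFileSet(dbPath, destDir) {
  const copied = [];
  for (const src of walFileSet(dbPath)) {
    // -wal and -shm can be absent, e.g. after the last connection closed cleanly.
    if (!fs.existsSync(src)) continue;
    const dest = path.join(destDir, path.basename(src));
    fs.copyFileSync(src, dest);
    copied.push(dest);
  }
  return copied;
}

// Usage (not run here): copyWalFileSet('/data/memory.db', '/backups/today');
```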
heapTotal flat + external/arrayBuffers rising = native leak. Don't reach for --max-old-space-size; it can't help.
In the heap snapshot, the JSArrayBufferData nodes and their retainer chain pointed straight at the owning structure. A small snapshot holding big bytes = few large buffers.
Upstream writeup with the full retainer trace: ruvnet/ruflo#2432. The wrapper itself, if you're curious: jazz-max/ruflo-hub