Tracing Codex's 640TB-a-year SQLite writes

A developer created an eBPF-based tool called sqlite-trace to log SQLite queries from any binary, after a GitHub issue revealed OpenAI's Codex was writing 640TB of logs per year to SQLite. The tool works by hooking into SQLite functions at runtime, even for statically linked binaries, solving a long-standing difficulty in debugging SQLite write-heavy applications.

A recent Github issue https://github.com/openai/codex/issues/28224 for OpenAI's Codex on how the harness writes way way too many logs in SQLite started gaining traction the other day, and I wanted to take a shot at figuring out what's going on with it. The issue already outlines most of the problem, but I'm curious what's doing the writes. There are a lot of tools out there that can help diagnose problems like these by logging or keeping track of queries made to databases like Mysql or Postgres. Postgres has a pg stat statements extension you can enable to keep track of what queries are being made, for example. But there's really no equivalent in SQLite. There's no extension for logging queries that I'm aware of, and there's no connection to "proxy" and read incoming queries, all queries are made directly on disk. This problem is already notoriously difficult for many databases, but SQLite is the worst version of it I can think of. To get around this, I've been working on sqlite-trace https://github.com/Query-Doctor/sqlite-trace , an eBPF program for logging queries made by any unknown binary that links either dynamically or statically against SQLite. Now in this specific case, yes... Codex is open source, and we have a good idea why this logging behavior exists. But there are plenty of closed-source programs that use this database, and sometimes you need runtime instrumentation to get a better idea of what's happening regardless. And currently, that's really difficult to do from the outside. How do you log SQLite queries? Let me show how it works first with fossil , a version control system made exclusively for SQLite, and dogfoods the db to store version control data. xetera@lima-ebpf sudo ./build/sqlite trace --lib $ which fossil ... ... I run fossil new a in a different folder ... omitted extra queries ... ... INSERT OR IGNORE INTO user login, info VALUES 'xetera','' fossil pid=56296 db=/Users/xetera/projects/sqlite-trace/a rows=0 rc=101 DONE t=701.9us in: sql=59B bound=0B total=59B vars=0 scanned=0 app: service=fossil exe=/usr/bin/fossil uid=501 gid=1000 ns pid=56296 cgroup=0x1746 cmd="fossil new a" UPDATE user SET cap='s', pw='TGHWVvgdDy' WHERE login='xetera' fossil pid=56296 db=/Users/xetera/projects/sqlite-trace/a rows=0 rc=101 DONE t=5.2us in: sql=61B bound=0B total=61B vars=0 scanned=0 app: service=fossil exe=/usr/bin/fossil uid=501 gid=1000 ns pid=56296 cgroup=0x1746 cmd="fossil new a" INSERT OR IGNORE INTO user login,pw,cap,info VALUES 'anonymous',hex randomblob 8 ,'hz','Anon' ; fossil pid=56296 db=/Users/xetera/projects/sqlite-trace/a rows=0 rc=101 DONE t=4.6us in: sql=99B bound=0B total=99B vars=0 scanned=0 app: service=fossil exe=/usr/bin/fossil uid=501 gid=1000 ns pid=56296 cgroup=0x1746 cmd="fossil new a It's able to pick up queries the binary makes against the target database, neat. Normally, this functionality is easily accomplished by attaching uprobes to functions applications call from libraries like libsqlite3 . The tricky part is, when installed from apt , fossil doesn't link against it at all. xetera@lima-ebpf ldd $ which fossil linux-vdso.so.1 0x0000f0d6c613c000 libresolv.so.2 = /lib/aarch64-linux-gnu/libresolv.so.2 0x0000f0d6c5b50000 libssl.so.3 = /lib/aarch64-linux-gnu/libssl.so.3 0x0000f0d6c5a20000 libcrypto.so.3 = /lib/aarch64-linux-gnu/libcrypto.so.3 0x0000f0d6c5410000 libz.so.1 = /lib/aarch64-linux-gnu/libz.so.1 0x0000f0d6c53d0000 libm.so.6 = /lib/aarch64-linux-gnu/libm.so.6 0x0000f0d6c5300000 libc.so.6 = /lib/aarch64-linux-gnu/libc.so.6 0x0000f0d6c5100000 /lib/ld-linux-aarch64.so.1 0x0000f0d6c6100000 libzstd.so.1 = /lib/aarch64-linux-gnu/libzstd.so.1 0x0000f0d6c5030000 Which is a huge pain, because it would be so much nicer if we could just hook into a function like sqlite3 step that produces row outputs in the shared library it targets. And this is how some programs work like with the sqlite3 binary itself. Sadly, in just as many other cases, we have to deal with static linking. To make matters worse, fossil and most other production binaries you would be interested in logging strip all their symbols: xetera@lima-ebpf nm $ which fossil nm: /usr/bin/fossil: no symbols xetera@lima-ebpf nm -D $ which fossil | grep sqlite xetera@lima-ebpf To attempt to solve all this, the program looks for well-known strings that can't be optimized out like cannot commit transaction - SQL statements in progress and abort due to ROLLBACK and follows the chain of calls that reference it down to a plausible looking location. Also worth keeping in mind that SQLite's public API looks something like this: typedef struct sqlite3 stmt sqlite3 stmt; int sqlite3 step sqlite3 stmt ; // other public functions... Where the sqlite3 stmt is an opaque struct. Which means even if you manage to hook the function, you can't actually extract the query from inside the statement, because you don't know what offset that field will be at in that struct. There are other sneaky ways of going about this by hooking sqlite3 prepare functions and extracting the const char from the parameters there, but then you have to keep track of the pointer that function returns, then aliasing and copying becomes a problem you need to deal with, and it's just a headache in general. If you really wanted to solve this, you'd need to do something crazy like generate internal type definitions for every published version of SQLite, check which version the target binary is using and use those specific offsets. And that's exactly what the tool does today. Every modern version of sqlite is built from source and turned into a .btf file that includes type definitions for that release. What codex does At the time of writing this, the issue doesn't seem to be fixed so I'll go off what I'm seeing on my end by just running the tracer against my version of the binary 0.142.2 . Codex seems to perform these two queries within the same transaction as it flushes in-memory logs to its store. Which happens significantly more as it sends requests and streams responses back. INSERT INTO logs ts, ts nanos, level, target, feedback log body, thread id, process uuid, module path, file, line, estimated bytes VALUES ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? , ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? , ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? -- and so on DELETE FROM logs WHERE id IN SELECT id FROM SELECT id, SUM estimated bytes OVER PARTITION BY process uuid ORDER BY ts DESC, ts nanos DESC, id DESC AS cumulative bytes, ROW NUMBER OVER PARTITION BY process uuid ORDER BY ts DESC, ts nanos DESC, id DESC AS row number FROM logs WHERE thread id IS NULL AND process uuid IN ? WHERE cumulative bytes ? OR row number ? ; Along with this third one, they make up the overwhelming majority of the queries being run within a 5 minute light coding test session. SELECT process uuid FROM logs WHERE thread id IS NULL AND process uuid IN ? GROUP BY process uuid HAVING SUM estimated bytes ? OR COUNT ?; Many of the inserts are batched with 700+ entries at a time, most of which are TRACE logs that definitely should not have been enabled by default in production. The total amount of data sent through INSERT statements as parameters in function calls grows fast over the course of 5 minutes too. Up to almost 35MB in total. In reality this value is a lot higher than that figure, as mutliple indexes have to be updated and shuffled around for each insert. There's another interesting finding here when we introspect the logs db with .schema logs CREATE TABLE logs id INTEGER PRIMARY KEY AUTOINCREMENT, ts INTEGER NOT NULL, ts nanos INTEGER NOT NULL, level TEXT NOT NULL, target TEXT NOT NULL, feedback log body TEXT, module path TEXT, file TEXT, line INTEGER, thread id TEXT, process uuid TEXT, estimated bytes INTEGER NOT NULL DEFAULT 0 ; CREATE INDEX idx logs ts ON logs ts DESC, ts nanos DESC, id DESC ; CREATE INDEX idx logs thread id ON logs thread id ; CREATE INDEX idx logs thread id ts ON logs thread id, ts DESC, ts nanos DESC, id DESC ; CREATE INDEX idx logs process uuid threadless ts ON logs process uuid, ts DESC, ts nanos DESC, id DESC WHERE thread id IS NULL; The index idx logs thread id is unnecessary here because it shares the same prefix with idx logs thread id ts , and causes slightly more churn than necessary since the db has to do extra work to maintain an index when it could be useing the other one instead. There are some instances where an index with fewer columns could speed up the most critical queries, since fewer pages are being fetched from disk, but that's not relevant in this case. Getting rid of that index entirely would reduce writes. In my opinion, not only should the TRACE level not be enabled by default in production, but no logs should be getting persisted unless the user turns it on for debugging. The entire point of the logs table is for a single submit feedback feature. But somehow, it's being updated, deleted, and queried huh? multiple times a second without even submitting feedback. There are big indexes being maintained exclusively to speed up the deletion. This is no way to treat a database. If you enjoy database observability, we're building a more advanced version of the tool here for your production apps that use Postgres. Showing you the best index for each query, along with query rewrite opportunities and much more. Check it out here https://app.querydoctor.com?utm source=blog