{"slug": "tracing-codex-s-640tb-a-year-sqlite-writes", "title": "Tracing Codex's 640TB-a-year SQLite writes", "summary": "A developer created an eBPF-based tool called sqlite-trace to log SQLite queries from any binary, after a GitHub issue revealed OpenAI's Codex was writing 640TB of logs per year to SQLite. The tool works by hooking into SQLite functions at runtime, even for statically linked binaries, solving a long-standing difficulty in debugging SQLite write-heavy applications.", "body_md": "A recent [GitHub issue](https://github.com/openai/codex/issues/28224) about how OpenAI's Codex harness writes way, *way* too many logs to SQLite started gaining traction the other day, and I wanted to take a shot at figuring out what's going on with it. The issue already outlines most of the problem, but I'm curious what's doing the writes.\n\nThere are a lot of tools out there that can help diagnose problems like these by logging or keeping track of queries made to databases like MySQL or Postgres. Postgres, for example, has a `pg_stat_statements` extension you can enable to keep track of what queries are being made. But there's really no equivalent in SQLite. There's no extension for logging queries that I'm aware of, and there's no connection to \"proxy\" and read incoming queries from, since all queries are made directly on disk. This problem is already notoriously difficult for many databases, but SQLite is the worst version of it I can think of.\n\nTo get around this, I've been working on [sqlite-trace](https://github.com/Query-Doctor/sqlite-trace), an eBPF program for logging queries made by any unknown binary that links either dynamically or *statically* against SQLite. Now in this specific case, yes... Codex is open source, and we have a good idea why this logging behavior exists. But there are plenty of closed-source programs that use this database, and sometimes you need runtime instrumentation to get a better idea of what's happening regardless. 
And currently, that's really difficult to do from the outside.\n\n# How do you log SQLite queries?\n\nLet me first show how it works with `fossil`, a version control system made for SQLite itself that dogfoods the db to store its version control data.\n\n```\nxetera@lima-ebpf sudo ./build/sqlite_trace --lib $(which fossil)\n...\n... I run `fossil new a` in a different folder\n... omitted extra queries ...\n...\nINSERT OR IGNORE INTO user(login, info) VALUES('xetera','')\n    fossil pid=56296 db=/Users/xetera/projects/sqlite-trace/a rows=0 rc=101(DONE) t=701.9us\n    in: sql=59B bound=0B total=59B (vars=0 scanned=0)\n    app: service=fossil exe=/usr/bin/fossil uid=501 gid=1000 ns_pid=56296 cgroup=0x1746 cmd=\"fossil new a\"\n\nUPDATE user SET cap='s', pw='TGHWVvgdDy' WHERE login='xetera'\n    fossil pid=56296 db=/Users/xetera/projects/sqlite-trace/a rows=0 rc=101(DONE) t=5.2us\n    in: sql=61B bound=0B total=61B (vars=0 scanned=0)\n    app: service=fossil exe=/usr/bin/fossil uid=501 gid=1000 ns_pid=56296 cgroup=0x1746 cmd=\"fossil new a\"\n\nINSERT OR IGNORE INTO user(login,pw,cap,info)   VALUES('anonymous',hex(randomblob(8)),'hz','Anon');\n    fossil pid=56296 db=/Users/xetera/projects/sqlite-trace/a rows=0 rc=101(DONE) t=4.6us\n    in: sql=99B bound=0B total=99B (vars=0 scanned=0)\n    app: service=fossil exe=/usr/bin/fossil uid=501 gid=1000 ns_pid=56296 cgroup=0x1746 cmd=\"fossil new a\n```\n\nIt's able to pick up the queries the binary makes against the target database, neat.\n\nNormally, this is easily accomplished by attaching uprobes to the functions that applications call in shared libraries like `libsqlite3`. 
The tricky part is that, when installed from `apt`, `fossil` doesn't link against it at all.\n\n```\nxetera@lima-ebpf ldd $(which fossil)\n        linux-vdso.so.1 (0x0000f0d6c613c000)\n        libresolv.so.2 => /lib/aarch64-linux-gnu/libresolv.so.2 (0x0000f0d6c5b50000)\n        libssl.so.3 => /lib/aarch64-linux-gnu/libssl.so.3 (0x0000f0d6c5a20000)\n        libcrypto.so.3 => /lib/aarch64-linux-gnu/libcrypto.so.3 (0x0000f0d6c5410000)\n        libz.so.1 => /lib/aarch64-linux-gnu/libz.so.1 (0x0000f0d6c53d0000)\n        libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000f0d6c5300000)\n        libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000f0d6c5100000)\n        /lib/ld-linux-aarch64.so.1 (0x0000f0d6c6100000)\n        libzstd.so.1 => /lib/aarch64-linux-gnu/libzstd.so.1 (0x0000f0d6c5030000)\n```\n\nThat's a huge pain, because it would be so much nicer if we could just hook a function like `sqlite3_step`, which produces row outputs, in the shared library the binary links against. Some programs do work this way, like the `sqlite3` binary itself. Sadly, in just as many other cases, we have to deal with static linking. To make matters worse, fossil and most other production binaries you would be interested in logging strip all their symbols:\n\n```\nxetera@lima-ebpf nm $(which fossil)\nnm: /usr/bin/fossil: no symbols\nxetera@lima-ebpf nm -D $(which fossil) | grep sqlite\nxetera@lima-ebpf\n```\n\nTo attempt to solve all this, the program looks for well-known strings that can't be optimized out, like `cannot commit transaction - SQL statements in progress` and `abort due to ROLLBACK`, and follows the chain of calls that reference them down to a plausible-looking location.\n\nIt's also worth keeping in mind that SQLite's public API looks something like this:\n\n```\ntypedef struct sqlite3_stmt sqlite3_stmt;\n\nint sqlite3_step(sqlite3_stmt*);\n// other public functions...\n```\n\nHere `sqlite3_stmt` is an opaque struct. 
That means *even* if you manage to hook the function, you can't actually extract the query from inside the statement, because you don't know what offset that field will be at in that struct. There are other sneaky ways of going about this, like hooking the `sqlite3_prepare` functions and extracting the `const char*` from the parameters there, but then you have to keep track of the statement pointer that function hands back, aliasing and copying become problems you need to deal with, and it's just a headache in general.\n\nIf you really wanted to solve this, you'd need to do something crazy like generate internal type definitions for every published version of SQLite, check which version the target binary is using, and use those specific offsets. And that's exactly what the tool does today. Every modern version of SQLite is built from source and turned into a `.btf` file that includes type definitions for that release.\n\n# What Codex does\n\nAt the time of writing this, the issue doesn't seem to be fixed, so I'll go off what I'm seeing on my end by just running the tracer against my version of the binary (0.142.2). Codex seems to perform these two queries within the same transaction as it flushes in-memory logs to its store. This happens significantly more often while it sends requests and streams responses back.\n\n```\nINSERT INTO logs (ts, ts_nanos, level, target, feedback_log_body, thread_id, process_uuid, module_path, file, line, estimated_bytes)\nVALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),\n       (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?),\n       (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) 
-- and so on\n\nDELETE FROM logs\nWHERE id IN (\n    SELECT id\n    FROM (\n        SELECT\n            id,\n            SUM(estimated_bytes) OVER (\n                PARTITION BY process_uuid\n                ORDER BY ts DESC, ts_nanos DESC, id DESC\n            ) AS cumulative_bytes,\n            ROW_NUMBER() OVER (\n                PARTITION BY process_uuid\n                ORDER BY ts DESC, ts_nanos DESC, id DESC\n            ) AS row_number\n        FROM logs\n        WHERE thread_id IS NULL\n          AND process_uuid IN (?)\n    )\n    WHERE cumulative_bytes > ? OR row_number > ?\n);\n```\n\nAlong with this third one, they make up the overwhelming majority of the queries run during a 5-minute light coding test session.\n\n```\nSELECT process_uuid FROM logs\nWHERE thread_id IS NULL AND process_uuid IN (?)\nGROUP BY process_uuid\nHAVING SUM(estimated_bytes) > ? OR COUNT(*) > ?;\n```\n\nMany of the inserts are batched with 700+ entries at a time, most of which are `TRACE` logs that definitely should not have been enabled by default in production.\n\nThe total amount of data sent through `INSERT` statements as bound parameters grows fast over the course of 5 minutes too, up to almost 35MB in total. 
In reality the amount written is a lot higher than that figure, as multiple indexes have to be updated and shuffled around for each insert.\n\nThere's another interesting finding here when we introspect the logs db with `.schema logs`:\n\n```\nCREATE TABLE logs (\n    id INTEGER PRIMARY KEY AUTOINCREMENT,\n    ts INTEGER NOT NULL,\n    ts_nanos INTEGER NOT NULL,\n    level TEXT NOT NULL,\n    target TEXT NOT NULL,\n    feedback_log_body TEXT,\n    module_path TEXT,\n    file TEXT,\n    line INTEGER,\n    thread_id TEXT,\n    process_uuid TEXT,\n    estimated_bytes INTEGER NOT NULL DEFAULT 0\n);\nCREATE INDEX idx_logs_ts ON logs(ts DESC, ts_nanos DESC, id DESC);\nCREATE INDEX idx_logs_thread_id ON logs(thread_id);\nCREATE INDEX idx_logs_thread_id_ts ON logs(thread_id, ts DESC, ts_nanos DESC, id DESC);\nCREATE INDEX idx_logs_process_uuid_threadless_ts ON logs(process_uuid, ts DESC, ts_nanos DESC, id DESC)\n  WHERE thread_id IS NULL;\n```\n\nThe index `idx_logs_thread_id` is unnecessary here because its only column is the leading column of `idx_logs_thread_id_ts`, and it causes slightly more churn than necessary since the db has to do extra work to maintain an index when it could be using the other one instead.\n\nThere are some instances where an index with fewer columns could speed up the most critical queries, since fewer pages are being fetched from disk, but that's not relevant in this case. Getting rid of that index entirely would reduce writes.\n\nIn my opinion, not only should the `TRACE` level not be enabled by default in production, but no logs should be getting persisted unless the user turns that on for debugging. The entire point of the `logs` table is a single submit-feedback feature. But somehow, it's being updated, deleted, and queried (huh?) multiple times a second without anyone even submitting feedback. There are big indexes being maintained exclusively to speed up the deletion. 
This is no way to treat a database.\n\nIf you enjoy database observability, we're building a more advanced version of the tool here for your production apps that use Postgres. Showing you the best index for each query, along with query rewrite opportunities and much more. [Check it out here](https://app.querydoctor.com?utm_source=blog)", "url": "https://wpnews.pro/news/tracing-codex-s-640tb-a-year-sqlite-writes", "canonical_source": "https://querydoctor.com/blog/tracing-codexs-640tb-year-sqlite-writes", "published_at": "2026-07-01 04:21:25+00:00", "updated_at": "2026-07-01 04:50:01.481489+00:00", "lang": "en", "topics": ["developer-tools", "ai-infrastructure"], "entities": ["OpenAI", "Codex", "SQLite", "sqlite-trace", "eBPF", "fossil", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/tracing-codex-s-640tb-a-year-sqlite-writes", "markdown": "https://wpnews.pro/news/tracing-codex-s-640tb-a-year-sqlite-writes.md", "text": "https://wpnews.pro/news/tracing-codex-s-640tb-a-year-sqlite-writes.txt", "jsonld": "https://wpnews.pro/news/tracing-codex-s-640tb-a-year-sqlite-writes.jsonld"}}