Codex Chronicle was paying for every frame.

The author replaced OpenAI's cloud-based Chronicle service, which incurred per-frame costs for analyzing screen captures, with a local Gemma 4 E4B 4-bit MLX model running on a $599 Mac mini. The new setup processes input from four independent sensor streams (screen, wearable camera, security cameras, and the wearable's AI commentary) with zero outbound LLM calls and effectively no marginal inference cost. The author chose this approach because the model is natively multimodal, 16 GB of unified memory is sufficient to run it, the Apache 2.0 licensed weights cannot be deprecated or repriced, and a single model instance can serve multiple homelab vision workloads.

I built a four-sensor Gemma 4 replacement on a Mac mini.

For about a week I had OpenAI’s research-preview Chronicle running on my MacBook. Every ten minutes it screenshotted my display, uploaded frames to OpenAI for analysis, and wrote Markdown summaries on my Mac. I was crawling that folder and ingesting the data into a Postgres table on my homelab. It worked. It also cost credits for every cycle of attention.

This weekend I replaced it with a single Gemma 4 E4B 4-bit MLX instance running on a $599 Mac mini, summarizing four independent sensor streams locally with zero outbound LLM calls and effectively zero marginal inference cost.

OpenAI describes the constraints plainly in their own documentation: screen captures are uploaded to OpenAI’s servers for processing, the feature “uses rate limits quickly,” it “increases risk of prompt injection,” memories are stored as “unencrypted Markdown files” on the user’s machine, and it is unavailable in the EU, UK, and Switzerland. Chronicle is a Pro-tier feature at a Pro-tier price. The architectural choice is honest: cloud inference, per-frame cost, the model belongs to OpenAI. I wanted a different shape.

## What I built

This weekend I replaced Chronicle. Not with a better cloud service. With a single Gemma 4 E4B 4-bit MLX instance on a $599 Mac mini, summarizing video from four sensors (my screen, a wearable camera, the security cameras in my living room, and the wearable’s realtime AI commentary) and writing them all to one Postgres table, redacted at ingest, queryable in SQL. Zero outbound LLM calls. Zero per-frame cost. The same model instance also serves the rest of my homelab’s vision workloads. The marginal cost of adding the fifth sensor, which is already in a box on the way, is whatever shipping cost was paid for a Raspberry Pi Zero 2 W.

This is the sequel to a piece I published five days ago about putting Gemma 4 behind my homelab AI gateway. That one ended with: “Anvil is not just a dev box.
For some multimodal work, it is a useful inference target.” This is about Anvil graduating.

## Why Gemma 4 E4B specifically

The reasoning, in order of how much each one mattered to me:

1. **Native multimodal in one checkpoint.** Image AND video AND audio paths in the same file. The whole sensor mesh runs through one weights load. No model swap per input type.
2. **16 GB of unified memory is enough.** The 4-bit MLX build sits at about 6 GB peak resident in isolation, around 8.5 GB under co-tenant load. On a base M-series Mac mini that leaves comfortable headroom for the OS, the FastAPI daemon, and a menubar app to watch it.
3. **Apache 2.0 weights.** The model file is on my machine. Nobody can deprecate it out from under me, reprice it overnight, or restrict it by jurisdiction.
4. **It’s already loaded.** I was routing this exact model through Forge for unrelated work. Spinning up a second model for Logbook specifically would have been waste.

One Gemma 4 instance. Two production roles.

## Four sensors, one envelope

```
MacBook Screen    Looki Wearable      Blink Cameras
      │                  │                  │
      └──────────────────┼──────────────────┘
                         ▼
                 Logbook Producers
                         │
                         ▼
                Anvil / Gemma 4 E4B
                         │
                         ▼
                  Redaction Layer
                         │
                         ▼
                     Postgres
```

Every Logbook row is an `observation.event.v1` envelope. The schema fits in one paragraph: a deterministic UUIDv5, a `source` enum, a `captured_at` timestamp, a `clip_duration_s`, an optional `frame_count`, an `image_summary`, an optional `video_summary`, a `media_uri` for the staging location, an `inference_metadata` blob, and a `source_metadata` blob. Same schema, four producers.

The producers:

- **MacBook screen.** A Python capture daemon running as a LaunchAgent. It records a short screen video on a fixed cadence, pauses when HID idle exceeds 10 minutes, POSTs the clip to Anvil for analysis, then POSTs the resulting envelope to the homelab ingest endpoint.
- **Looki wearable (clips).** A worker polls the wearable’s cloud, stages new motion clips to local NVMe, and runs them through the same Anvil daemon.
- **Looki wearable (realtime).**
The wearable emits realtime AI commentary as text events. A second worker forwards those as image-summary-only observations into the same table.
- **Blink security cameras.** A continuous Node.js daemon polls Blink’s cloud, stages motion clips to NVMe, and hands them to Anvil.

Every clip lands on the same Anvil daemon, which runs one Gemma 4 E4B 4-bit MLX instance. The daemon serves two surfaces:

- `/v1/analyze` for Logbook (image pass + native video pass per clip).
- `/v1/chat/completions` and `/v1/responses` for every other Forge VLM client in the homelab.

The model does not care which surface called it. The previous standalone gemma-4-multimodal LaunchAgent was retired and its plist removed. End state: one Gemma 4 instance, dual-purpose, no duplication.

Redaction happens once, at the ingest endpoint, before the INSERT. The pass covers UUIDs, filesystem paths, IPv4 and IPv6 addresses, internal hostnames, email addresses, and API key shapes. Single pass.

## The day the model pretended to watch video

For most of the build day, Logbook produced two summaries per clip: one from a native-video call (`mlx_vlm.generate(video=path, fps=1.0)`), and one from a separate frame-extracted multi-image pass.

The image summaries were excellent. They read pixels at 1280 px width and reported real strings: Termius, Phase 9, LOGBOOK BUILD BRIEF.md. Per-capture variation. Forensic detail. Anyone reading the raw table rows could tell which IDE window was on top.

The video summaries were a different story. Every video summary for every mac screen capture, hour after hour, described “a person standing in a kitchen setting, facing a counter, holding a small dark object.” Word for word. The MacBook does not have a webcam pointed at the kitchen. The capture content was screen recordings.

I revised the prompt to be explicit (“you are observing a screen recording from a computer display”). Every video summary then described an identical Stack Overflow visit. Still word-for-word across captures.

The model was not hallucinating.
Hallucinating implies seeing something and misinterpreting it. The model was outputting the same paragraph because the same paragraph was the most likely next-token sequence given only the prompt. The video bytes were not reaching the attention layer at all.

An MD5-hash query broke the case open. Across seven consecutive mac screen captures of five different windows, every video summary collapsed to two unique hashes (one per prompt variant), perfectly correlated with the prompt text. The image summaries from the same seven captures produced seven unique hashes. Image was reading pixels. Video was reading nothing.

Running the same script against two different Blink motion clips from the living room made it worse. Identical output on E4B. Identical output on E2B. E2B’s variant of the bug was more honest than E4B’s: where E4B confabulated plausible scenes, E2B simply replied “Please provide the video or a description of what you are seeing so I can describe it for you.” The model was literally asking for the video.

Root cause was four lines deep in `anvil/server.py`. The daemon was building the formatted prompt with `apply_chat_template(processor, config, prompt, num_images=N)` and then calling `generate(video=path, ...)`. The dispatcher in mlx_vlm’s `prompt_utils.py` checks `kwargs.get("video")` on the chat template call to decide whether to insert the `<video>` placeholder. We were not passing it. The formatted prompt had no video marker. `generate`’s `video=path` argument was effectively ignored at the attention layer: the video tokens had no anchor in the prompt to attend to.

The fix is one branch:

```python
if video_path:
    # Pass the video through the template call so the placeholder is inserted.
    formatted = apply_chat_template(
        processor, config, prompt,
        video=video_path, num_images=0,
    )
else:
    formatted = apply_chat_template(
        processor, config, prompt,
        num_images=num_images,
    )
```

After the fix, the same seven captures produced seven unique video summaries. The model was watching.

The bug was masked by polite-looking output.
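
A cheap guard against this class of failure is the same hash check that exposed it: fingerprint each summary and flag any output that repeats across distinct captures for the same prompt. The sketch below is a minimal in-memory version, not the query that was actually run. The `(capture_id, prompt, summary)` row shape, the function names, and the sample rows are all hypothetical.

```python
import hashlib
from collections import defaultdict


def summary_hash(text: str) -> str:
    """MD5 of a whitespace-normalized summary, used only as a fingerprint."""
    normalized = " ".join(text.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def find_collapsed_outputs(rows, min_repeats=2):
    """Return (prompt, hash) groups whose output repeats across captures.

    `rows` is an iterable of (capture_id, prompt, summary) tuples.
    This shape is an assumption for illustration, not the Logbook schema.
    """
    groups = defaultdict(set)
    for capture_id, prompt, summary in rows:
        groups[(prompt, summary_hash(summary))].add(capture_id)
    return {key: ids for key, ids in groups.items() if len(ids) >= min_repeats}


# Made-up sample rows: two captures share one summary, one is unique.
rows = [
    ("cap-1", "describe the video", "A person standing in a kitchen."),
    ("cap-2", "describe the video", "A person standing in a kitchen."),
    ("cap-3", "describe the video", "A terminal window showing a build log."),
]
collapsed = find_collapsed_outputs(rows)
for (prompt, digest), capture_ids in collapsed.items():
    print(f"{digest[:8]} repeated for {sorted(capture_ids)} under prompt {prompt!r}")
```

Different captures that produce byte-identical summaries are a strong hint that the output depends on the prompt alone. Legitimately static scenes can also repeat, so a hit is a reason to look, not proof of a bug.
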
The summaries were grammatical, plausible, well-formed paragraphs. They just had nothing to do with the input.

## Numbers, and the redaction pass

Isolated benchmarks on a single warmed clip, no other traffic on the daemon:

- Image pass: 4.08 s latency, 17.6 tok/s, 5.89 GB peak resident.
- Video pass: 6.67 s latency, 14.1 tok/s, 6.03 GB peak resident.

Production averages across 467 ingested rows from a single day’s running, with the daemon also serving the rest of Forge’s VLM clients:

| source | avg image latency | image tok/s | avg video latency | video tok/s | peak resident |
|---|---|---|---|---|---|
| mac screen | 11.20 s | 33.7 | 20.62 s | 33.9 | 8.52 GB |
| looki clips | 8.57 s | 33.7 | 11.98 s | 33.9 | 8.50 GB |
| blink | 24.85 s | 33.7 | 27.31 s | 34.7 | 8.52 GB |

Two things shift between the bench and production. Throughput nearly doubles under load (33.7 tok/s vs. 17.6) because the model handles concurrent VLM work efficiently. Latency stretches by a factor of 2-6 depending on source because the same instance is now serving Logbook’s four producers alongside every other Forge VLM client. Peak resident memory climbs to 8.52 GB, still comfortably inside a 16 GB Mac mini.

The latency stretch is the consolidation. One model, two surfaces, shared queue.

Anvil idles at single-digit watts when the daemon is not actively inferring. Throughput is comfortable for the production cadence of all four sensors. No batching tricks required.

The redaction pass is in production. A real row from this morning’s bronze layer, image summary verbatim:

> Email visible: REDACTED. IP shown: REDACTED

The model saw both. The Postgres row holds neither.

The model is local. The data is local. The redaction is at the ingest boundary. The audit trail is a SELECT statement against a table on hardware I own.

## What this actually changes

The headline is not “I replaced OpenAI with Gemma.” The headline is that inference is no longer the bottleneck.
When Chronicle does a screen capture, the inference is a network round trip to an API the user does not own, billed per request, rate-limited by the provider, and explicitly described in the provider’s own documentation as carrying “increased risk of prompt injection,” “memories stored as unencrypted Markdown files,” and consumption that “uses rate limits quickly.” The architecture treats each sensor as a customer of a paid service.

When Logbook does a screen capture, the inference is a function call on hardware I own. The bottleneck is bytes-on-wire and bytes-on-disk, both of which are problems we already know how to solve. The model is a fixed cost. Every new sensor pays for itself in the wall clock of the moment it is added, not in the per-frame economics of the API.

What ends up running on the Mac mini is closer to a personal telemetry fabric than to an AI assistant: distributed multi-modal sensors, normalized events, local inference, append-only memory.

Chronicle did one thing competently and charged per frame. Logbook does the same thing four times over, from 360°, runs locally, and charges per electron.

## What’s next

A Raspberry Pi Zero 2 W Basic was delivered to the house on May 16. A 250 g spool of 1.75 mm PLA filament arrived the day before. The shape of those two purchases together is a fifth sensor: a tiny always-on Linux SBC in a 3D-printed enclosure, somewhere on the spectrum of ambient sensor, audio recorder, or environmental probe. The exact function is the sensor’s business. The Logbook architecture does not care.

The fifth sensor will arrive at the same ingest endpoint, in the same envelope shape, summarized by the same Gemma 4 instance that is already running. Whatever it captures will slot into raw ingest observations at its own `captured_at` and interleave with the other four sources in time order. When it lands, the work will be writing one small handler.
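
To make "one small handler" concrete, here is a minimal sketch of how a new producer might build an envelope with the fields listed earlier. It is an illustration under stated assumptions, not the Logbook code. The namespace, the `schema` and `id` key names, the `pi_sensor` source value, the sample data, and the choice of which fields feed the UUIDv5 are all hypothetical. The article only says the ID is deterministic.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
LOGBOOK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/logbook")


@dataclass
class Observation:
    source: str
    captured_at: datetime
    clip_duration_s: float
    image_summary: str
    media_uri: str
    frame_count: Optional[int] = None
    video_summary: Optional[str] = None
    inference_metadata: dict = field(default_factory=dict)
    source_metadata: dict = field(default_factory=dict)

    def event_id(self) -> str:
        # Assumption: source + capture time + media location identify a clip,
        # so re-ingesting the same clip yields the same ID.
        key = f"{self.source}|{self.captured_at.isoformat()}|{self.media_uri}"
        return str(uuid.uuid5(LOGBOOK_NAMESPACE, key))

    def to_envelope(self) -> dict:
        body = asdict(self)
        body["captured_at"] = self.captured_at.isoformat()  # JSON-friendly
        return {"schema": "observation.event.v1", "id": self.event_id(), **body}


obs = Observation(
    source="pi_sensor",  # hypothetical source name
    captured_at=datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
    clip_duration_s=10.0,
    image_summary="An empty hallway.",
    media_uri="file:///staging/pi/clip-0001.mp4",
)
envelope = obs.to_envelope()
```

Because the ID is derived from the observation itself, a producer that retries after a failed POST sends the same `id` again, which lets the ingest side treat duplicates as no-ops if it enforces uniqueness on that column.
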