Codex Chronicle was paying for every frame.

The author replaced OpenAI's cloud-based Chronicle service, which incurred per-frame costs for analyzing screen captures, with a local Gemma 4 E4B 4-bit MLX model running on a $599 Mac mini. The new setup processes four independent sensor streams (screen, wearable camera, security cameras, and the wearable's AI commentary) with zero outbound LLM calls and effectively no marginal inference cost. The author chose this approach for its native multimodal capabilities, a footprint that fits in 16 GB of unified memory, Apache 2.0 licensed weights that cannot be deprecated or repriced, and the ability to serve multiple homelab vision workloads from a single model instance.

I built a four-sensor Gemma 4 replacement on a Mac mini.

For about a week I had OpenAI’s research-preview Chronicle running on my MacBook. Every ten minutes it screenshotted my display, uploaded frames to OpenAI for analysis, and wrote Markdown summaries on my Mac. I was crawling that folder and ingesting the data into a Postgres table on my homelab. It worked. It also cost credits for every cycle of attention.

This weekend I replaced it with a single Gemma 4 E4B 4-bit MLX instance running on a $599 Mac mini, summarizing four independent sensor streams locally with zero outbound LLM calls and effectively zero marginal inference cost.

OpenAI describes the constraints plainly in their own documentation: screen captures are uploaded to OpenAI’s servers for processing, the feature “uses rate limits quickly,” it “increases risk of prompt injection,” memories are stored as “unencrypted Markdown files” on the user’s machine, and it is unavailable in the EU, UK, and Switzerland. Chronicle is a Pro-tier feature at a Pro-tier price. The architectural choice is honest: cloud inference, per-frame cost, the model belongs to OpenAI. I wanted a different shape.

What I built

This weekend I replaced Chronicle. Not with a better cloud service.
With a single Gemma 4 E4B 4-bit MLX instance on a $599 Mac mini, summarizing input from four sensors (my screen, a wearable camera, the security cameras in my living room, and the wearable’s realtime AI commentary) and writing it all to one Postgres table, redacted at ingest, queryable in SQL. Zero outbound LLM calls. Zero per-frame cost. The same model instance also serves the rest of my homelab’s vision workloads. The marginal cost of adding the fifth sensor, which is already in a box on the way, is whatever shipping was paid for a Raspberry Pi Zero 2 W.

This is the sequel to a piece I published five days ago about putting Gemma 4 behind my homelab AI gateway. That one ended with: “Anvil is not just a dev box. For some multimodal work, it is a useful inference target.” This is about Anvil graduating.

Why Gemma 4 E4B specifically

The reasoning, in order of how much each one mattered to me:

Native multimodal in one checkpoint. Image AND video AND audio paths in the same file. The whole sensor mesh runs through one weights load. No model swap per input type.

16 GB of unified memory is enough. The 4-bit MLX build sits at about 6 GB peak resident in isolation, around 8.5 GB under co-tenant load. On a base M-series Mac mini that leaves comfortable headroom for the OS, the FastAPI daemon, and a menubar app to watch it.

Apache 2.0 weights. The model file is on my machine. Nobody can deprecate it out from under me, reprice it overnight, or restrict it by jurisdiction.

It’s already loaded. I was routing this exact model through Forge for unrelated work. Spinning up a second model for Logbook specifically would have been waste.

One Gemma 4 instance. Two production roles.

Four sensors, one envelope

MacBook Screen    Looki Wearable      Blink Cameras
      │                  │                  │
      └──────────────────┼──────────────────┘
                         ▼
                 Logbook Producers
                         │
                         ▼
                Anvil / Gemma 4 E4B
                         │
                         ▼
                  Redaction Layer
                         │
                         ▼
                     Postgres

Every Logbook row is an observation.event.v1 envelope.
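To make the envelope concrete, here is a minimal Python sketch of what an observation.event.v1 record could look like. The field spellings, the source labels, the namespace, and the recipe for the UUIDv5 key are all assumptions for illustration; the real Logbook code may differ.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical fixed namespace; any constant UUID works if it never changes.
LOGBOOK_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/logbook")

# Assumed labels for the four producers; the real enum values are not given.
SOURCES = {"mac_screen", "looki_clip", "looki_realtime", "blink"}


@dataclass
class ObservationEvent:
    source: str
    captured_at: datetime
    clip_duration_s: float
    image_summary: str
    media_uri: str
    video_summary: Optional[str] = None
    frame_count: Optional[int] = None
    inference_metadata: dict = field(default_factory=dict)
    source_metadata: dict = field(default_factory=dict)
    schema: str = "observation.event.v1"

    def __post_init__(self) -> None:
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source!r}")

    @property
    def event_id(self) -> uuid.UUID:
        # Assumed key recipe: same source, timestamp, and media location
        # always hash to the same UUIDv5.
        key = f"{self.source}|{self.captured_at.isoformat()}|{self.media_uri}"
        return uuid.uuid5(LOGBOOK_NAMESPACE, key)

    def to_row(self) -> dict:
        # Flatten to a dict suitable for a parameterized INSERT.
        row = asdict(self)
        row["id"] = str(self.event_id)
        return row


if __name__ == "__main__":
    event = ObservationEvent(
        source="mac_screen",
        captured_at=datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc),
        clip_duration_s=10.0,
        image_summary="Terminal window on top of an editor.",
        media_uri="file:///tmp/staging/clip-0001.mp4",
    )
    print(event.to_row()["id"])
```

One plausible reason to derive the ID deterministically is that re-ingesting the same clip yields the same primary key, so a retry can be rejected or upserted instead of creating a duplicate row.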
The schema fits in one paragraph: a deterministic UUIDv5, a source enum, a captured_at timestamp, a clip_duration_s, an optional frame count, an image summary, an optional video summary, a media_uri for the staging location, an inference metadata blob, and a source metadata blob. Same schema, four producers.

The producers:

MacBook screen. A Python capture daemon running as a LaunchAgent. Records a short screen video on a fixed cadence, pauses when HID idle exceeds 10 minutes, POSTs the clip to Anvil for analysis, then POSTs the resulting envelope to the homelab ingest endpoint.

Looki wearable (clips). A worker polls the wearable’s cloud, stages new motion clips to local NVMe, runs them through the same Anvil daemon.

Looki wearable (realtime). The wearable emits realtime AI commentary as text events. A second worker forwards those as image-summary-only observations into the same table.

Blink security cameras. A continuous Node.js daemon polls Blink’s cloud, stages motion clips to NVMe, hands them to Anvil.

Every clip lands on the same Anvil daemon, which runs one Gemma 4 E4B 4-bit MLX instance. The daemon serves two surfaces:

/v1/analyze for Logbook (image-pass + native-video-pass per clip).
/v1/chat/completions and /v1/responses for every other Forge VLM client in the homelab.

The model does not care which surface called it. The previous standalone gemma-4-multimodal LaunchAgent was retired and its plist removed. End state: one Gemma 4 instance, dual-purpose, no duplication.

Redaction happens once, at the ingest endpoint, before the INSERT. UUIDs, filesystem paths, IPv4 and IPv6, internal hostnames, email addresses, API key shapes. Single pass.

The day the model pretended to watch video

For most of the build day, Logbook produced two summaries per clip: one from a native-video call, mlx_vlm.generate(video=path, fps=1.0), and one from a separate frame-extracted multi-image pass. The image summaries were excellent.
They read pixels at 1280 px width and reported real strings: Termius, Phase 9, LOGBOOK BUILD BRIEF.md. Per-capture variation. Forensic detail. Anyone reading the raw table rows could tell which IDE window was on top.

The video summaries were a different story. Every video summary for every mac screen capture, hour after hour, described “a person standing in a kitchen setting, facing a counter, holding a small dark object.” Word for word. The MacBook does not have a webcam pointed at the kitchen. The capture content was screen recordings.

I revised the prompt to be explicit (“you are observing a screen recording from a computer display”). Every video summary then described an identical Stack Overflow visit. Still word-for-word across captures.

The model was not hallucinating. Hallucinating implies seeing something and misinterpreting it. The model was outputting the same paragraph because the same paragraph was the most likely next-token sequence given only the prompt. The video bytes were not reaching the attention layer at all.

An MD5-hash query broke the case open. Across seven consecutive mac screen captures of five different windows, every video summary collapsed to two unique hashes (one per prompt variant), perfectly correlated with the prompt text. The image summaries from the same seven captures produced seven unique hashes. Image was reading pixels. Video was reading nothing.

Running the same script against two different Blink motion clips from the living room made it worse. Identical output on E4B. Identical output on E2B. E2B’s variant of the bug was more honest than E4B’s: where E4B confabulated plausible scenes, E2B simply replied “Please provide the video or a description of what you are seeing so I can describe it for you.” The model was literally asking for the video.

Root cause was four lines deep in anvil/server.py.
The daemon was building the formatted prompt with apply_chat_template(processor, config, prompt, num_images=N) and then calling generate(video=path, ...). The dispatcher in mlx_vlm’s prompt_utils.py checks kwargs.get("video") on the chat template call to decide whether to insert the