How we moderate a live video-chat app in real time (without going broke on AI calls)

The article describes the real-time moderation system for Camdiv, an anonymous one-to-one video chat app, where the primary engineering challenge is not video streaming but preventing harmful content. The system uses a browser to sample JPEG frames every few seconds, which are sent to a separate moderation microservice running a vision-language model (Gemini Flash Lite) that returns a verdict with booleans for unsafe or minor content. To manage costs, the app relies on a single VLM instead of multiple systems, as it provides contextual understanding and structured output, though the expense of millions of daily API calls remains a critical factor.

I work on Camdiv, an anonymous one-to-one video chat. You open the page, you get matched with a stranger, you talk. It's the Omegle-style format, and from the outside the hard part looks like the video: WebRTC, NAT traversal, keeping latency down.

It isn't. WebRTC is mostly a solved problem. The hard engineering is moderation. You're putting two anonymous strangers on a live camera together, with almost no friction, and you have a few seconds to catch it if one of them does something that gets your platform pulled from every app store on earth.

Three things shaped every decision below, and they fight each other the whole way. The first is cost: moderate live video naively and the bill alone will sink you. The second is false positives, because a wrong ban is a real person you just kicked off for nothing. The third took a near-miss to learn, so it gets the longest section here: you can't actually trust the video frame you're moderating.

Most moderation problems give you time. A user uploads a photo or writes a comment, and you can scan it before anyone else sees it. The content sits still while you decide.

Live video gives you none of that. So whatever you build has to be automated, run per frame, stay fast, and be cheap enough to run nonstop. Those goals do not sit comfortably together.

The browser samples a JPEG from the local video every few seconds and sends it over Socket.IO to our backend. The backend forwards it over HTTPS to a separate moderation microservice, a small FastAPI app on its own host, locked down with an internal shared key and an origin allowlist at the reverse proxy. The service runs the classifier and returns a compact verdict.

flowchart LR A "Browser