Thanks for putting this much into it — really useful.
On the NLLB q8 / WebGPU point: good catch, and timely. NLLB on q8 + WebGPU is my higher-quality path and exactly what I’m testing right now, so the two issues you linked are right on target. Appreciate the direct links; they save me the digging.
On offscreen vs service worker: this was a deliberate choice. The live overlay loop is latency-sensitive and needs the model and its WebGPU context to stay warm, but an MV3 service worker gets torn down when idle. It can’t reliably keep a few-hundred-MB model resident between bursts, and reloading the model mid-overlay would kill the “text changes, re-translate” feel. So inference lives in the offscreen document, and the service worker just routes and coordinates, which is basically the hybrid split you described. I’ll still read the HF service-worker guide to compare properly, but for this workload offscreen felt right.
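To make the split concrete, here is a rough TypeScript sketch of the service-worker side. It is an illustration under assumptions, not the extension's real code: `OffscreenApi`, `ensureOffscreen`, `routeTranslate`, the `offscreen.html` path, and the message shape are all hypothetical names. The only real API surface it leans on is `chrome.offscreen.hasDocument()`, `chrome.offscreen.createDocument()` and `chrome.runtime.sendMessage()`, passed in as parameters so the logic can run outside a browser. Whether `"WORKERS"` is the right `reasons` value for this use case is also an assumption.

```typescript
// Minimal slice of chrome.offscreen that this sketch relies on (hypothetical interface).
interface OffscreenApi {
  hasDocument(): Promise<boolean>;
  createDocument(params: {
    url: string;
    reasons: string[];
    justification: string;
  }): Promise<void>;
}

// Hypothetical message shape; the offscreen document would listen for it.
type TranslateRequest = { target: "offscreen"; type: "translate"; text: string };
type Send = (msg: TranslateRequest) => Promise<string>;

// Shared in-flight creation promise, so concurrent callers
// never try to create two offscreen documents.
let creating: Promise<void> | null = null;

function ensureOffscreen(api: OffscreenApi, url: string): Promise<void> {
  if (!creating) {
    creating = (async () => {
      try {
        if (!(await api.hasDocument())) {
          await api.createDocument({
            url,
            reasons: ["WORKERS"], // assumption: pick the reason that matches your use
            justification: "Keep the translation model warm between overlay updates",
          });
        }
      } finally {
        creating = null; // allow a fresh check on the next call
      }
    })();
  }
  return creating;
}

// The service worker holds no model state: it makes sure the offscreen
// document exists, then forwards the request to it.
async function routeTranslate(
  api: OffscreenApi,
  send: Send,
  text: string,
): Promise<string> {
  await ensureOffscreen(api, "offscreen.html"); // hypothetical path
  return send({ target: "offscreen", type: "translate", text });
}
```

In the extension itself the call would look something like `routeTranslate(chrome.offscreen, (m) => chrome.runtime.sendMessage(m), text)`, possibly with a cast for the `reasons` type. The point of the shape is that the service worker can be torn down and restarted freely, since the only thing it has to remember is how to find the offscreen document again.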
The rest — OCR scheduling, capture strategy, cache/offline and the privacy story — is a great checklist. Saving the whole thing. Thanks again.