I built an AI Chrome extension with zero backend cost — here's the exact architecture

A developer built three AI-powered Chrome extensions—PR summarization, risk scoring, and draft review generation—with zero backend cost by using a Bring Your Own Key (BYOK) architecture. The extensions call AI providers directly from the browser using the user's own API key, eliminating the need for a server and addressing privacy concerns. The approach supports multiple providers including OpenAI, Groq, Mistral, and local models via Ollama.

You want to add AI to your Chrome extension. The obvious path: spin up a Node.js server, hold a master API key, charge users monthly, eat the AI cost. That's what everyone does.

I didn't do that. I built three Chrome extensions with AI features (PR summarization, risk scoring, draft review generation) and my monthly infrastructure bill is $0. No server. No backend. No API key to protect.

Here's the exact architecture, the real trade-offs, and the specific places where this approach breaks down so you don't find out the hard way.

Most AI-powered extensions work like this:

User → Extension → Your server → AI provider → Your server → Extension → User

Your server holds a master API key. Users pay you. You pay the AI provider out of that margin. The problems:

- You're a proxy business now. You're paying OpenAI $X, charging users $Y, and the difference is your margin. But you're also responsible for rate limiting, uptime, abuse prevention, and GDPR compliance for every request that touches your server.
- Private code goes through your infra. For a developer tool that reads GitHub diffs, this is the question users ask first: "is my code going to your server?" With a hosted backend, the honest answer is yes.
- You're competing on price against companies with VC money. CodeRabbit, GitHub Copilot, Linear, and a dozen others are running hosted AI with economies of scale you can't match as a solo developer.

There's a different architecture. It's not new: it's called BYOK (Bring Your Own Key), and it shifts the AI provider relationship from you to the user.

User → Extension → AI provider (user's own key)

No server in the middle. No margin math. No "is my code safe" question.

The core mechanic is simple: instead of your extension calling your server, it calls the AI provider directly from the browser using the user's own API key.
```js
// The user pastes their API key during onboarding.
// You store it locally and never send it anywhere else.
await chrome.storage.local.set({
  aiApiKey: userProvidedKey,
  aiProvider: 'groq' // or 'openai', 'mistral', 'ollama'
});

// Every AI call uses their key, from their browser
async function callAI(prompt) {
  const { aiApiKey, aiProvider } = await chrome.storage.local.get(['aiApiKey', 'aiProvider']);
  const endpoint = getEndpoint(aiProvider);
  const response = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${aiApiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: getModel(aiProvider),
      messages: [{ role: 'user', content: prompt }],
      max_tokens: 500
    })
  });
  return response.json();
}
```

The API key lives in `chrome.storage.local`. It never leaves the browser except to go directly to the AI provider. You, the developer, never see it: once the user pastes it in, only their own copy of the extension reads it.

For direct API calls from a Chrome extension, declare host permissions for each provider you support:

```json
{
  "manifest_version": 3,
  "permissions": ["storage"],
  "host_permissions": [
    "https://api.openai.com/*",
    "https://api.groq.com/*",
    "https://api.mistral.ai/*",
    "http://localhost:*/*"
  ]
}
```

The localhost entry covers Ollama, for users who want a fully local model with zero API costs.

Important: In MV3, host permissions are scrutinized during review. Be specific. Don't use `<all_urls>` when you can name the exact domains. I've been through CWS review twice with this manifest, and being explicit helps.

All four providers expose an OpenAI-compatible `/v1/chat/completions` endpoint.
One implementation, four providers:

```js
const AI_PROVIDERS = {
  groq: {
    endpoint: 'https://api.groq.com/openai/v1/chat/completions',
    model: 'llama-3.3-70b-versatile',
    maxTokens: 1024,
    supportsStreaming: true,
  },
  openai: {
    endpoint: 'https://api.openai.com/v1/chat/completions',
    model: 'gpt-4o-mini',
    maxTokens: 1024,
    supportsStreaming: true,
  },
  mistral: {
    endpoint: 'https://api.mistral.ai/v1/chat/completions',
    model: 'mistral-small-latest',
    maxTokens: 1024,
    supportsStreaming: false,
  },
  ollama: {
    endpoint: 'http://localhost:11434/v1/chat/completions',
    model: 'llama3.2',
    maxTokens: 1024,
    supportsStreaming: true,
  }
};

async function getProviderConfig() {
  const { aiProvider } = await chrome.storage.local.get('aiProvider');
  // Fall back to Groq when no provider has been chosen yet
  return AI_PROVIDERS[aiProvider] || AI_PROVIDERS.groq;
}
```

Store the model name here, not hardcoded in your fetch calls. When Groq deprecated an older Llama version, I pushed one config update and every user was on the new model automatically, with no user action required.

Here's the real cost of BYOK: users have to get an API key before they can use your AI features. Some users bounce at this step. What actually reduces friction:

1. Lead with Groq. Groq's free tier covers ~14,400 requests per day (https://console.groq.com/settings/limits) for smaller models. For most individual developers, it's genuinely free. This changes the conversation from "go pay for an API key" to "go get a free API key in 2 minutes."

2. Give the exact steps, not a vague instruction:

   ```
   Step 1: Go to console.groq.com/keys
   Step 2: Click "Create API key"
   Step 3: Paste the key here → [input]
   ```

   Three lines. No ambiguity. I track where users drop off in onboarding, and the step with the most abandonment is always the one where I said "get your API key" without saying exactly where.

3. Make core features work without AI. If every feature is gated behind BYOK setup, the first session is a setup session, and many users don't return for a second.
   In PR Focus, multi-account GitHub, PR sorting, CSV export, and stale notifications all work without any API key. The AI features are additive.

If you want to stream AI responses token by token, you hit an MV3 constraint: service workers handle the API calls, but streaming requires a long-lived connection, and service workers can be terminated mid-stream.

The pattern that works: the service worker handles the fetch and sends tokens to the popup via messages.

```js
// Service worker: handles the streaming fetch
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type === 'STREAM_AI') {
    // sender.tab is only set when the request comes from a content script
    streamAIResponse(message.prompt, sender.tab?.id);
    return true; // Keep the message channel open
  }
});

// Send to the requesting tab if there is one, otherwise to extension pages (the popup)
function sendToRequester(tabId, payload) {
  const sent = tabId !== undefined
    ? chrome.tabs.sendMessage(tabId, payload)
    : chrome.runtime.sendMessage(payload);
  sent.catch(() => {}); // The receiver may have closed mid-stream
}

async function streamAIResponse(prompt, tabId) {
  const config = await getProviderConfig();
  const { aiApiKey } = await chrome.storage.local.get('aiApiKey');
  const response = await fetch(config.endpoint, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${aiApiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: config.model,
      messages: [{ role: 'user', content: prompt }],
      stream: true
    })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const chunk = decoder.decode(value, { stream: true });
    const lines = chunk.split('\n').filter(line => line.startsWith('data: '));
    for (const line of lines) {
      const data = line.slice(6);
      if (data === '[DONE]') continue;
      try {
        const parsed = JSON.parse(data);
        const token = parsed.choices[0]?.delta?.content || '';
        sendToRequester(tabId, { type: 'AI_TOKEN', token });
      } catch (e) {
        // Skip lines that fail to parse; this happens when a chunk ends mid-line
      }
    }
  }
  sendToRequester(tabId, { type: 'AI_DONE' });
}
```

As long as chunks keep arriving, the service worker stays alive for the duration of the stream. Tokens go to the popup via messages. The popup accumulates them and renders progressively.

The most common support category with BYOK: users with wrong or misconfigured keys.
Generic "AI error" messages generate follow-up tickets. Status-code-specific messages don't:

```js
async function validateApiKey(apiKey, provider) {
  try {
    const config = AI_PROVIDERS[provider];
    // Cheapest possible request: one message, one output token
    const response = await fetch(config.endpoint, {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${apiKey}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: config.model,
        messages: [{ role: 'user', content: 'test' }],
        max_tokens: 1
      })
    });

    if (response.status === 401) return { valid: false, error: 'Invalid API key: check you copied it completely, no trailing spaces.' };
    if (response.status === 429) return { valid: false, error: 'Rate limit hit: your key is valid but you\'ve hit the free tier ceiling.' };
    if (response.status === 403) return { valid: false, error: 'Permission denied: this key may not have access to this model tier.' };
    if (!response.ok) return { valid: false, error: `Provider returned ${response.status}. Try again in a moment.` };
    return { valid: true };
  } catch (e) {
    return { valid: false, error: 'Network error: check your internet connection or try a different provider.' };
  }
}
```

A typical PR summary in PR Focus: ~800 tokens input (diff context + system prompt), ~150 tokens output. That's ~950 tokens per PR.

| Provider | Tier | Cost per PR | 100 PRs/day |
|---|---|---|---|
| Groq (Llama 3.3 70B) | Free | $0 | $0 |
| OpenAI (GPT-4o-mini) | Paid | ~$0.0001 | ~$0.01 |
| Mistral Small | Paid | ~$0.00008 | ~$0.008 |
| Ollama (local) | Free | $0 | $0 |

The cost argument for BYOK isn't just privacy, it's math. A hosted model charging $10/month makes pennies after AI costs and infrastructure. Users with their own Groq key pay nothing for individual use. That's a value proposition you can't match with a hosted backend.

Here is where this approach breaks down.

Corporate users behind strict proxies. Some enterprise environments block direct browser-to-external-API calls. You can't fix this. Be upfront about it, and point to Ollama as the local workaround.

Ollama requires a separate install.
It's not "just paste a key"; it's "install Ollama, pull a model, run it locally, then configure the extension." Worth supporting for privacy-first users, but don't pitch it as the simple path.

You can't cache responses. Each user's key means each user pays for their own calls. No cross-user caching. For most use cases this doesn't matter, but if you're building something where 1000 users asking the same question is likely, hosted with caching will be cheaper for them.

Yes, if:

No, if:

```
chrome.storage.local
├── aiApiKey   ← user's own, never leaves browser except to provider
└── aiProvider ← 'groq' | 'openai' | 'mistral' | 'ollama'

Popup / content script
└── message → service worker: { type: 'RUN_AI', prompt }

Service worker
├── reads key + provider from storage
├── calls provider API directly (fetch)
└── streams tokens → popup via chrome.runtime.sendMessage
```

- Infrastructure cost: $0
- Monthly AI bill: $0
- Trust question ("does my code go to your server?"): No.

Everything in this article is running in PR Focus Pro, a Chrome extension that triages GitHub pull requests with AI summaries, hybrid risk scoring (0–100), and one-click draft reviews. Free to install; AI features activate with your own API key.

The full engineering decision log behind this architecture, including the options I rejected, what it cost in user friction, and whether I'd choose it again, is Build Log 007 (https://github.com/projekta2/build-logs/blob/main/build-logs/007-byok-chrome-extension-architecture.md) in my public Build Logs repo.

If you're building something similar and want a second pair of eyes on your implementation, the Summer Review Swap (https://github.com/projekta2/build-logs/issues/1) is open. There's a PR waiting for a reviewer right now if you want to jump straight in.

What's your approach to AI in browser extensions? Running your own backend, BYOK, or something else entirely?
Particularly curious whether anyone has found a cleaner solution to the streaming + service worker termination problem. Drop it in the comments.
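
One partial answer on the parsing side of streaming, offered as a sketch and not as what PR Focus ships: a network chunk can end in the middle of a `data:` line, so splitting each chunk on newlines independently will sometimes hand `JSON.parse` half an object and drop that token. Buffering the unfinished tail until the next chunk arrives avoids this. `createSSEParser` and `onToken` are hypothetical names introduced here for illustration, and the payload shape assumes the OpenAI-compatible `choices[0].delta.content` format used above.

```javascript
// Hypothetical helper (not from the article's codebase): buffers partial
// SSE lines so JSON split across two network chunks is parsed once complete.
function createSSEParser(onToken) {
  const decoder = new TextDecoder();
  let buffer = '';

  return {
    // Feed one Uint8Array chunk, e.g. the `value` from reader.read().
    push(bytes) {
      // { stream: true } keeps multi-byte characters intact across chunks.
      buffer += decoder.decode(bytes, { stream: true });
      const lines = buffer.split('\n');
      // The last element is an unfinished line (or ''); keep it for next time.
      buffer = lines.pop();

      for (const rawLine of lines) {
        const line = rawLine.trim();
        if (!line.startsWith('data:')) continue;
        const data = line.slice(5).trim();
        if (data === '[DONE]') continue;
        try {
          const token = JSON.parse(data).choices?.[0]?.delta?.content;
          if (token) onToken(token);
        } catch (e) {
          // A complete line that still fails to parse is skipped.
        }
      }
    },
  };
}
```

In the streaming loop, this would replace the per-chunk decode and split: create one parser per request, then call `parser.push(value)` for every chunk the reader returns.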