I built an AI Chrome extension with zero backend cost — here's the exact architecture

You want to add AI to your Chrome extension.

The obvious path: spin up a Node.js server, hold a master API key, charge users monthly, eat the AI cost. That's what everyone does.

I didn't do that. I built three Chrome extensions with AI features — PR summarization, risk scoring, draft review generation — and my monthly infrastructure bill is $0. No server. No backend. No API key to protect.

Here's the exact architecture, the real trade-offs, and the specific places where this approach breaks down so you don't find out the hard way.

Most AI-powered extensions work like this:

User → Extension → Your server → AI provider → Your server → Extension → User

Your server holds a master API key. Users pay you. You pay the AI provider out of that margin.

The problems:

You're a proxy business now. You're paying OpenAI $X, charging users $Y, and the difference is your margin. But you're also responsible for rate limiting, uptime, abuse prevention, and GDPR compliance for every request that touches your server.

Private code goes through your infra. For a developer tool that reads GitHub diffs, this is the question users ask first: "is my code going to your server?" With a hosted backend, the honest answer is yes.

You're competing on price against companies with VC money. CodeRabbit, GitHub Copilot, Linear, and a dozen others are running hosted AI with economies of scale you can't match as a solo developer.

There's a different architecture. It's not new — it's called BYOK (Bring Your Own Key), and it shifts the AI provider relationship from you to the user.

User → Extension → AI provider (user's own key)

No server in the middle. No margin math. No "is my code safe" question.

The core mechanic is simple: instead of your extension calling your server, it calls the AI provider directly from the browser using the user's own API key.

// The user pastes their API key during onboarding
// You store it locally — never send it anywhere else
await chrome.storage.local.set({ 
  aiApiKey: userProvidedKey,
  aiProvider: 'groq' // or 'openai', 'mistral', 'ollama'
});

// Every AI call uses their key, from their browser
async function callAI(prompt) {
  const { aiApiKey, aiProvider } = await chrome.storage.local.get(['aiApiKey', 'aiProvider']);

  const endpoint = getEndpoint(aiProvider);

  const response = await fetch(endpoint, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${aiApiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: getModel(aiProvider),
      messages: [{ role: 'user', content: prompt }],
      max_tokens: 500
    })
  });

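  // Surface HTTP errors instead of parsing an error body as a completion
  if (!response.ok) throw new Error(`AI provider returned ${response.status}`);
  return response.json();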
}

The API key lives in chrome.storage.local. It never leaves the browser except to go directly to the AI provider, and you as the developer never see it.

For direct API calls from a Chrome extension, declare host permissions for each provider you support:

{
  "manifest_version": 3,
  "permissions": [
    "storage"
  ],
  "host_permissions": [
    "https://api.openai.com/*",
    "https://api.groq.com/*",
    "https://api.mistral.ai/*",
    "http://localhost:*/*"
  ]
}

The localhost entry covers Ollama, for users who want a fully local model with zero API costs.

Important: In MV3, host permissions are scrutinized during review. Be specific. Don't use <all_urls> when you can name the exact domains. I've been through CWS review twice with this manifest, and being explicit helps.

All four providers here expose the OpenAI-compatible /v1/chat/completions format. One implementation, four providers:

const AI_PROVIDERS = {
  groq: {
    endpoint: 'https://api.groq.com/openai/v1/chat/completions',
    model: 'llama-3.3-70b-versatile',
    maxTokens: 1024,
    supportsStreaming: true,
  },
  openai: {
    endpoint: 'https://api.openai.com/v1/chat/completions',
    model: 'gpt-4o-mini',
    maxTokens: 1024,
    supportsStreaming: true,
  },
  mistral: {
    endpoint: 'https://api.mistral.ai/v1/chat/completions',
    model: 'mistral-small-latest',
    maxTokens: 1024,
    supportsStreaming: false,
  },
  ollama: {
    endpoint: 'http://localhost:11434/v1/chat/completions',
    model: 'llama3.2',
    maxTokens: 1024,
    supportsStreaming: true,
  }
};

async function getProviderConfig() {
  const { aiProvider } = await chrome.storage.local.get('aiProvider');
  return AI_PROVIDERS[aiProvider] || AI_PROVIDERS.groq;
}

Store the model name here, not hardcoded in your fetch calls. When Groq deprecated an older Llama version, I pushed one config update and every user was on the new model automatically — no user action required.
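
Because every provider in the table speaks the same request shape, the fetch arguments can be derived from the config entry alone. A minimal sketch, assuming a hypothetical `buildChatRequest` helper (not part of the original code) and assuming that a local Ollama server needs no `Authorization` header:

```javascript
// Hypothetical helper: build fetch() arguments from a provider config entry.
// `config` has the same shape as the AI_PROVIDERS entries above.
function buildChatRequest(config, apiKey, prompt) {
  const headers = { 'Content-Type': 'application/json' };
  // Assumption: hosted providers need a bearer token, a local server does not
  if (apiKey) headers['Authorization'] = `Bearer ${apiKey}`;

  return {
    url: config.endpoint,
    init: {
      method: 'POST',
      headers,
      body: JSON.stringify({
        model: config.model,
        messages: [{ role: 'user', content: prompt }],
        max_tokens: config.maxTokens,
      }),
    },
  };
}

// Example with a config entry copied from the provider table
const groqConfig = {
  endpoint: 'https://api.groq.com/openai/v1/chat/completions',
  model: 'llama-3.3-70b-versatile',
  maxTokens: 1024,
};
const request = buildChatRequest(groqConfig, 'gsk_example', 'Summarize this diff');
// In the extension: const response = await fetch(request.url, request.init);
```

Keeping the request builder free of `chrome.*` calls means it can be unit-tested outside the browser, and a new provider only needs a new config entry.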

Here's the real cost of BYOK: users have to get an API key before they can use your AI features. Some users bounce at this step.

What actually reduces friction:

1. Lead with Groq. Groq's free tier covers ~14,400 requests per day for smaller models. For most individual developers, it's genuinely free. This changes the conversation from "go pay for an API key" to "go get a free API key in 2 minutes."

2. Give the exact steps, not a vague instruction:

Step 1: Go to console.groq.com/keys
Step 2: Click "Create API key"
Step 3: Paste the key here → [input]

Three lines. No ambiguity. I track where users drop off in onboarding — the step with the most abandonment is always the one where I said "get your API key" without saying exactly where.

3. Make core features work without AI. If every feature is gated behind BYOK setup, the first session is a setup session — and many users don't return for a second. In PR Focus, multi-account GitHub, PR sorting, CSV export, and stale notifications all work without any API key. The AI features are additive.
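
Since the paste step is where keys most often go wrong, it may help to clean and sanity-check the input before storing it. A rough sketch: `normalizeApiKey` is a hypothetical helper, and the key prefixes (`gsk_` for Groq, `sk-` for OpenAI) are assumptions about what those consoles currently issue, so a mismatch is reported as a warning rather than a hard failure:

```javascript
// Assumed prefixes; providers can change their key formats at any time
const KEY_PREFIX_HINTS = { groq: 'gsk_', openai: 'sk-' };

// Hypothetical helper: strip copy-paste whitespace and flag unlikely keys
function normalizeApiKey(rawInput, provider) {
  // Remove spaces, tabs, and newlines picked up while copying
  const key = String(rawInput).replace(/\s+/g, '');
  if (key.length === 0) {
    return { key: '', warning: 'No key entered.' };
  }

  const prefix = KEY_PREFIX_HINTS[provider];
  if (prefix && !key.startsWith(prefix)) {
    return {
      key,
      warning: `This does not look like a ${provider} key (expected it to start with "${prefix}").`,
    };
  }
  return { key, warning: null };
}

// Usage in the onboarding form:
// const { key, warning } = normalizeApiKey(input.value, 'groq');
```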

If you want to stream AI responses token by token, you hit an MV3 constraint: service workers handle the API calls, but streaming requires a long-lived connection, and service workers can be terminated mid-stream.

The pattern that works — service worker handles the fetch, sends tokens to the popup via messages:

// Service worker: handles the streaming fetch
chrome.runtime.onMessage.addListener((message, sender) => {
  if (message.type === 'STREAM_AI') {
    // A popup has no tab (sender.tab is undefined), so reply over
    // chrome.runtime; a content script is reached through its tab ID
    const send = sender.tab
      ? (msg) => chrome.tabs.sendMessage(sender.tab.id, msg)
      : (msg) => chrome.runtime.sendMessage(msg);
    streamAIResponse(message.prompt, send);
    // sendResponse is never used, so there is no need to return true
  }
});

async function streamAIResponse(prompt, send) {
  const config = await getProviderConfig();
  const { aiApiKey } = await chrome.storage.local.get('aiApiKey');
  // The receiver may be gone (popup closed); ignore that rejection
  const emit = (msg) => send(msg).catch(() => {});

  const response = await fetch(config.endpoint, {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${aiApiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({
      model: config.model,
      messages: [{ role: 'user', content: prompt }],
      stream: true
    })
  });

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    // stream: true keeps multi-byte characters intact across chunks
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // Keep the trailing partial line for the next chunk

    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6).trim();
      if (data === '[DONE]') continue;

      try {
        const parsed = JSON.parse(data);
        const token = parsed.choices?.[0]?.delta?.content || '';
        if (token) emit({ type: 'AI_TOKEN', token });
      } catch (e) {
        // Skip malformed events
      }
    }
  }

  emit({ type: 'AI_DONE' });
}

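While tokens keep arriving, the service worker generally stays alive for the duration of the stream: in recent Chrome versions, each extension API call, including every token message, resets its idle timer. Tokens go to the popup via messages. The popup accumulates them and renders progressively.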
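
On the receiving side, the popup only needs to append each token and re-render. A minimal sketch, assuming the `AI_TOKEN` and `AI_DONE` message types from the service worker code above; `createTokenAccumulator`, the `render` callback, and the `ai-output` element id are hypothetical names:

```javascript
// Hypothetical popup-side helper: collects streamed tokens into one string
// and calls `render(text, done)` after every message it handles.
function createTokenAccumulator(render) {
  let text = '';
  return function handleMessage(message) {
    if (message.type === 'AI_TOKEN') {
      text += message.token;
      render(text, false);
    } else if (message.type === 'AI_DONE') {
      render(text, true);
      text = ''; // Reset for the next request
    }
    // Any other message type is ignored
  };
}

// Wiring inside the popup (only runs where the extension API exists)
if (typeof chrome !== 'undefined' && chrome.runtime?.onMessage) {
  const output = document.getElementById('ai-output'); // assumed element id
  chrome.runtime.onMessage.addListener(
    createTokenAccumulator((text) => { output.textContent = text; })
  );
}
```

Re-rendering the full accumulated string on each token is the simplest option; for long outputs, appending only the new token to the DOM would avoid repeated full updates.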

The most common support category with BYOK: users with wrong or misconfigured keys. Generic "AI error" messages generate follow-up tickets. Status-code-specific messages don't:

async function validateApiKey(apiKey, provider) {
  try {
    const config = AI_PROVIDERS[provider];
    const response = await fetch(config.endpoint, {
      method: 'POST',
      headers: { 
        'Authorization': `Bearer ${apiKey}`, 
        'Content-Type': 'application/json' 
      },
      body: JSON.stringify({
        model: config.model,
        messages: [{ role: 'user', content: 'test' }],
        max_tokens: 1
      })
    });

    if (response.status === 401) 
      return { valid: false, error: 'Invalid API key — check you copied it completely, no trailing spaces.' };
    if (response.status === 429) 
      // The provider accepted the key, so report it as valid with a warning
      return { valid: true, warning: 'Rate limit hit: your key is valid, but you\'ve hit the free tier ceiling. Try again shortly.' };
    if (response.status === 403) 
      return { valid: false, error: 'Permission denied — this key may not have access to this model tier.' };
    if (!response.ok) 
      return { valid: false, error: `Provider returned ${response.status} — try again in a moment.` };

    return { valid: true };
  } catch (e) {
    return { valid: false, error: 'Network error — check your internet connection or try a different provider.' };
  }
}
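
One way to use this during onboarding is to store the key only after validation succeeds, so a bad key never reaches storage. A sketch with hypothetical names: `saveKeyIfValid` takes the validator and a storage object as parameters (in the extension these would be `validateApiKey` and `chrome.storage.local`), which also makes it testable without a browser:

```javascript
// Hypothetical onboarding helper: validate first, persist only on success.
// `validate` has the signature of validateApiKey; `storage` needs a
// promise-returning set(), like chrome.storage.local.
async function saveKeyIfValid(apiKey, provider, { validate, storage }) {
  const result = await validate(apiKey, provider);
  if (!result.valid) {
    // Pass the status-specific message through to the onboarding UI
    return { saved: false, error: result.error };
  }
  await storage.set({ aiApiKey: apiKey, aiProvider: provider });
  return { saved: true };
}

// Usage in the extension:
// const outcome = await saveKeyIfValid(key, 'groq', {
//   validate: validateApiKey,
//   storage: chrome.storage.local,
// });
// if (!outcome.saved) showError(outcome.error); // showError is a placeholder
```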

A typical PR summary in PR Focus: ~800 tokens input (diff context + system prompt), ~150 tokens output. ~950 tokens per PR.

Provider	Tier	Cost per PR	100 PRs/day
Groq (Llama 3.3 70B)	Free	$0	$0
OpenAI GPT-4o-mini	Paid	~$0.0001	~$0.01
Mistral Small	Paid	~$0.00008	~$0.008
Ollama (local)	Free	$0	$0
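
The per-PR figures follow from simple token arithmetic. A small sketch, where `costPerRequest` is a hypothetical helper and the prices are placeholder values chosen for illustration only, not any provider's actual rates:

```javascript
// Cost in dollars for one request, given prices per million tokens
function costPerRequest({ inputTokens, outputTokens }, { inputPerMillion, outputPerMillion }) {
  return (inputTokens * inputPerMillion + outputTokens * outputPerMillion) / 1_000_000;
}

const prSummary = { inputTokens: 800, outputTokens: 150 }; // token counts from the text
const placeholderPrices = { inputPerMillion: 0.10, outputPerMillion: 0.40 }; // illustrative only

const perPr = costPerRequest(prSummary, placeholderPrices);
const perDay = perPr * 100; // 100 PRs per day

console.log(`per PR: $${perPr.toFixed(5)}, per 100 PRs: $${perDay.toFixed(3)}`);
```

With these placeholder prices the result is roughly $0.00014 per PR. Swapping in a provider's current per-million-token prices gives the real figure for that provider.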

The case for BYOK isn't just privacy; it's also math. A hosted product charging $10/month keeps only a thin margin after AI costs and infrastructure. Users with their own Groq key pay nothing for individual use. That's a value proposition you can't match with a hosted backend.

Corporate users behind strict proxies. Some enterprise environments block direct browser-to-external-API calls. You can't fix this. Be upfront about it, and point to Ollama as the local workaround.

Ollama requires a separate install. It's not "just paste a key" — it's "install Ollama, pull a model, run it locally, then configure the extension." Worth supporting for privacy-first users, but don't pitch it as the simple path.

You can't cache responses. Each user's key means each user pays for their own calls. No cross-user caching. For most use cases this doesn't matter, but if you're building something where 1000 users asking the same question is likely, hosted with caching will be cheaper for them.

Yes, if:

No, if:

chrome.storage.local
  ├── aiApiKey      ← user's own, never leaves browser except to provider
  └── aiProvider    ← 'groq' | 'openai' | 'mistral' | 'ollama'

Popup / content script
  └── message → service worker: { type: 'STREAM_AI', prompt }

Service worker
  ├── reads key + provider from storage
  ├── calls provider API directly (fetch)
  └── streams tokens → popup via chrome.runtime.sendMessage

Infrastructure cost: $0
Monthly AI bill: $0
Trust question ("does my code go to your server?"): No.

Everything in this article is running in PR Focus Pro, a Chrome extension that triages GitHub pull requests with AI summaries, hybrid risk scoring (0–100), and one-click draft reviews. Free to install; AI features activate with your own API key.

The full engineering decision log behind this architecture — including the options I rejected, what it cost in user friction, and whether I'd choose it again — is Build Log #007 in my public Build Logs repo.

If you're building something similar and want a second pair of eyes on your implementation, the Summer Review Swap is open — there's a PR waiting for a reviewer right now if you want to jump straight in.

What's your approach to AI in browser extensions? Running your own backend, BYOK, or something else entirely? Particularly curious whether anyone has found a cleaner solution to the streaming + service worker termination problem — drop it in the comments.

Source: dev.to (original article)