I Pointed Chrome's Prompt API at a 1.25 Million Character Memoir, and It Got Interesting Fast


Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.

A straightforward engineering question: what happens when you feed a long book to an on-device language model in Chrome and start adjusting the parameters?

To explore this, I built a small experiment called Gemini Nano Book Lab: a Chrome extension sidepanel that uses Chrome’s built-in Prompt API to answer questions about Richard Wagner’s My Life, while also exposing some of the underlying mechanics.

The response is only part of it. The experiment also captures:

If you’re an engineer interested in systems that have rough edges—and therefore teach you something—this is a useful area to explore.

Chrome’s Prompt API is part of the browser’s built-in AI features. Instead of sending prompts to a cloud endpoint, a web app or extension can request an on-device language model session and prompt it locally.

Resources:

Core capabilities:

contextoverflow

This makes it more than a simple text box—it becomes an environment for experimentation.

Long inputs expose the interesting problems. Short prompts hide a lot; a paragraph‑long demo can make any model look magical. A long corpus forces concrete decisions:

For the first version, I used Project Gutenberg’s plain text of Richard Wagner’s My Life:

That gave a corpus of about 219,572 words and 1,251,663 characters in the run shown below.

The demo is a Chrome extension sidepanel rather than a normal web app. This was a deliberate choice. Extensions provide a more reliable built‑in AI surface in Chrome, and they allow a compact benchmark UI where controls, streamed output, and telemetry live side by side.

The extension has three tasks:

The benchmark starts simple. I didn’t begin with embeddings, vector databases, or sophisticated semantic retrieval. I wanted a baseline that is easy to reason about.

The first‑version controls are:

This provides enough surface to see the tradeoffs without making the experiment too complex.

The first question isn’t “What should I prompt?” but “Is the model available here?”

Here’s the availability and session setup wrapper:

function getPromptApi(): PromptApi | null {
    const maybePromptApi = (globalThis as typeof globalThis & {
        LanguageModel?: PromptApi
    }).LanguageModel
    return maybePromptApi ?? null
}

export async function inspectPromptApi(): Promise<PromptApiCapabilities> {
    const promptApi = getPromptApi()

    if (!promptApi) {
        return {
            supported: false,
            availability: 'unavailable',
            statusMessage:
                'LanguageModel is unavailable in this browser context. Use a recent Chrome build with the Prompt API enabled.',
            defaultTemperature: null,
            maxTemperature: null,
            defaultTopK: null,
            maxTopK: null,
        }
    }

    const availability = await promptApi.availability({
        expectedInputs: [{ type: 'text', languages: ['en'] }],
        expectedOutputs: [{ type: 'text', languages: ['en'] }],
    })

    return {
        supported: true,
        availability,
        statusMessage:
            availability === 'available'
                ? 'Prompt API ready.'
                : 'Model can be downloaded or is unavailable on this device.',
        defaultTemperature: null,
        maxTemperature: null,
        defaultTopK: null,
        maxTopK: null,
    }
}

This may not look exciting, but it matters. One early lesson with built‑in AI is that availability is part of your product surface. Hardware support, model download state, and browser support determine whether your app works at all.
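One way to treat availability as product surface is to map each state to an explicit UI plan. The sketch below assumes the four availability strings Chrome documents for the Prompt API ('unavailable', 'downloadable', 'downloading', 'available'); the `AvailabilityPlan` shape, the `planForAvailability` name, and the message wording are illustrative and not taken from the extension.

```typescript
// The availability states the Prompt API is documented to report.
type Availability = 'unavailable' | 'downloadable' | 'downloading' | 'available'

// Hypothetical UI plan for each state (names and wording are assumptions).
interface AvailabilityPlan {
    canPromptNow: boolean
    needsDownload: boolean
    message: string
}

function planForAvailability(availability: Availability): AvailabilityPlan {
    switch (availability) {
        case 'available':
            return { canPromptNow: true, needsDownload: false, message: 'Prompt API ready.' }
        case 'downloadable':
            return {
                canPromptNow: false,
                needsDownload: true,
                message: 'Model is not on this device yet. Creating a session should start the download.',
            }
        case 'downloading':
            return { canPromptNow: false, needsDownload: true, message: 'Model download in progress.' }
        case 'unavailable':
            return {
                canPromptNow: false,
                needsDownload: false,
                message: 'The model cannot run on this device or browser.',
            }
    }
}
```

Splitting 'downloadable' from 'unavailable' matters for the UI: one state deserves a download button, the other a fallback path.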

After loading the book, I split it into overlapping chunks. The code tries to respect paragraph and sentence boundaries rather than slicing blindly at exactly N characters.

export function buildChunks(
    text: string,
    chunkSize: number,
    overlap: number,
): CorpusChunk[] {
    const safeChunkSize = Math.max(600, chunkSize)
    const safeOverlap = clampOverlap(safeChunkSize, overlap)
    const chunks: CorpusChunk[] = []

    let startOffset = 0
    let index = 0

    while (startOffset < text.length) {
        const desiredEnd = Math.min(text.length, startOffset + safeChunkSize)
        const endOffset =
            desiredEnd === text.length
                ? text.length
                : findBoundary(text, startOffset, desiredEnd)

        const textSlice = text.slice(startOffset, endOffset).trim()

        if (textSlice) {
            index += 1
            chunks.push({
                id: `chunk-${String(index).padStart(3, '0')}`,
                index,
                text: textSlice,
                startOffset,
                endOffset,
            })
        }

        if (endOffset >= text.length) {
            break
        }

        startOffset = Math.max(endOffset - safeOverlap, startOffset + 1)
    }

    return chunks
}

This decision changes the system’s behavior. Small chunks improve precision but can break context apart. Large chunks preserve narrative structure but use more context budget. Overlap helps with boundaries but increases repeated text and token pressure. Engineering often comes down to choosing which kind of trade‑off you can accept.
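The chunker calls two helpers, `clampOverlap` and `findBoundary`, that are not shown. Here is a minimal sketch of what they might look like. The half-chunk overlap cap, the 20% look-back window, and the paragraph, sentence, space fallback order are all assumptions, not the extension's actual logic.

```typescript
// Assumption: overlap is capped at half the chunk size so every chunk advances.
function clampOverlap(chunkSize: number, overlap: number): number {
    const maxOverlap = Math.floor(chunkSize / 2)
    return Math.min(Math.max(0, Math.floor(overlap)), maxOverlap)
}

// Assumption: search only the last 20% of the chunk for a natural break,
// preferring a paragraph break, then a sentence end, then a space.
function findBoundary(text: string, startOffset: number, desiredEnd: number): number {
    const searchFrom = startOffset + Math.floor((desiredEnd - startOffset) * 0.8)
    const tail = text.slice(searchFrom, desiredEnd)

    const paragraphBreak = tail.lastIndexOf('\n\n')
    if (paragraphBreak !== -1) {
        return searchFrom + paragraphBreak + 2
    }

    // Remember the last sentence terminator followed by whitespace in the tail.
    const sentencePattern = /[.!?]["')\]]?\s/g
    let sentenceEnd = -1
    let match: RegExpExecArray | null
    while ((match = sentencePattern.exec(tail)) !== null) {
        sentenceEnd = match.index + match[0].length
    }
    if (sentenceEnd !== -1) {
        return searchFrom + sentenceEnd
    }

    const lastSpace = tail.lastIndexOf(' ')
    if (lastSpace !== -1) {
        return searchFrom + lastSpace + 1
    }

    // No natural break found: fall back to a hard cut.
    return desiredEnd
}
```

Both helpers always return an offset greater than `startOffset`, which, together with the `startOffset + 1` guard in `buildChunks`, keeps the loop from stalling.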

The first retriever is lexical, not semantic. That keeps the failure modes visible. If retrieval is too smart too early, you skip an educational stage.

export function rankChunks(
    chunks: CorpusChunk[],
    query: string,
    maxChunks: number,
): RankedChunk[] {
    const queryTokens = tokenize(query)

    return chunks
        .map((chunk) => {
            const { score, matchedTerms } = scoreChunk(chunk, queryTokens, query)
            return {
                ...chunk,
                score,
                matchedTerms,
            }
        })
        .filter((chunk) => chunk.score > 0)
        .sort((left, right) => right.score - left.score)
        .slice(0, maxChunks)
}

This retriever scores term overlap between the question and chunk text. It is fast, explainable, and flawed—exactly what I wanted for a baseline.
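The `tokenize` and `scoreChunk` helpers are not shown either. A minimal term-overlap sketch, matching the call signature used in `rankChunks`, could look like the following. The stop-word list, the log damping, and the exact-phrase bonus are assumptions chosen for illustration, not the extension's real scoring.

```typescript
// Assumption: a small hand-picked stop-word list; the real list is not shown.
const STOP_WORDS = new Set([
    'a', 'an', 'the', 'and', 'or', 'of', 'to', 'in', 'on', 'is', 'are', 'was', 'were',
    'how', 'does', 'do', 'did', 'what', 'his', 'her', 'their', 'he', 'she', 'it',
    'with', 'for', 'as', 'by', 'at',
])

// Lowercase, split on anything that is not an ASCII letter or digit, drop stop words.
function tokenize(text: string): string[] {
    return text
        .toLowerCase()
        .split(/[^a-z0-9]+/)
        .filter((token) => token.length > 1 && !STOP_WORDS.has(token))
}

function scoreChunk(
    chunk: { text: string },
    queryTokens: string[],
    query: string,
): { score: number; matchedTerms: string[] } {
    // Count how often each token appears in the chunk.
    const counts = new Map<string, number>()
    for (const token of tokenize(chunk.text)) {
        counts.set(token, (counts.get(token) ?? 0) + 1)
    }

    const matchedTerms: string[] = []
    let score = 0
    for (const term of Array.from(new Set(queryTokens))) {
        const hits = counts.get(term) ?? 0
        if (hits > 0) {
            matchedTerms.push(term)
            // 1 point per matched term, plus a damped bonus for repeats.
            score += 1 + Math.log(hits)
        }
    }

    // Assumed bonus when the whole query appears verbatim in the chunk.
    const phrase = query.trim().toLowerCase()
    if (phrase.length > 0 && chunk.text.toLowerCase().indexOf(phrase) !== -1) {
        score += 2
    }

    return { score, matchedTerms }
}
```

A scorer like this rewards any chunk that repeats "early" or "artistic", whether or not the passage is about ambitions, which is the kind of visible failure a lexical baseline is meant to expose.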

The benchmark records more than whether the model answered correctly. It measures:

This is the core flow:

const corpus = await loadWagnerCorpus()
const chunks = buildChunks(corpus.text, config.chunkSize, config.chunkOverlap)
const selectedChunks = rankChunks(chunks, query, config.retrievedChunks)

const session = await createPromptSession({
    config,
    onDownloadProgress(progress) {
        downloadProgress.push(progress)
    },
})

const estimatedInputUsage = await measureContextUsage(session, input)

const { text, firstChunkMs } = await executePrompt({
    session,
    input,
    streaming: config.streaming,
    signal,
    onChunk: callbacks.onChunk,
})

At this point the demo becomes less a “chatbot” and more an instrument panel.
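The `executePrompt` helper in the flow above returns a `firstChunkMs` value. Here is a rough sketch of how such a measurement could work. It assumes the session's `promptStreaming()` result can be consumed with `for await` and that each chunk is a text delta. The `StreamingSession` interface, the `executePromptSketch` name, and the injectable `now` clock are hypothetical, and cancellation via `signal` is left out for brevity.

```typescript
// Minimal session shape assumed for this sketch.
interface StreamingSession {
    promptStreaming(input: string): AsyncIterable<string>
}

async function executePromptSketch(options: {
    session: StreamingSession
    input: string
    onChunk?: (chunk: string) => void
    now?: () => number // injectable clock, mainly for testing
}): Promise<{ text: string; firstChunkMs: number | null; totalMs: number }> {
    const now = options.now ?? (() => Date.now())
    const startedAt = now()
    let firstChunkMs: number | null = null
    let text = ''

    for await (const chunk of options.session.promptStreaming(options.input)) {
        // Record the latency of the first streamed chunk only once.
        if (firstChunkMs === null) {
            firstChunkMs = now() - startedAt
        }
        text += chunk
        options.onChunk?.(chunk)
    }

    return { text, firstChunkMs, totalMs: now() - startedAt }
}
```

Keeping `firstChunkMs` separate from `totalMs` lets the panel report perceived responsiveness and total cost as two different numbers.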

In the run shown in the screenshots, the app reported approximately:

Several observations stand out.

Lexical retrieval took 8.7 ms. That is tiny compared to the 17.4 second prompt time. For early‑stage RAG in the browser, this suggests a useful lesson: before over‑optimizing retrieval, understand your inference costs. In this setup, retrieval is not the bottleneck. Prompting is.

The first chunk arrived after about 7.2 seconds. That number changes the perceived feel of the product. If the first token arrives quickly, the experience feels responsive. If it takes several seconds, users may wonder if it has hung or if they asked too much. A good benchmark should capture that moment, not just the final duration.

The run used about 3417 units of a 9216 context window. That sounds comfortable, but long‑form exploration can consume budget quickly. If you increase chunk size, overlap, or retrieved chunk count, the window fills with evidence before the model answers. That’s why the demo exposes chunk controls prominently.
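In that run, 3417 of 9216 units is roughly 37% of the window. To make the budget pressure concrete, here is a sketch of picking ranked chunks until the window is full. The per-chunk costs are assumed to come from a measurement step like the one in the core flow; `selectWithinBudget`, `reservedForOutput`, and `promptOverhead` are hypothetical names, and the numbers in the usage note are made up.

```typescript
interface CostedChunk {
    id: string
    cost: number // measured context units for this chunk (assumed to be known)
}

// Walk the ranked chunks in order and stop at the first one that does not fit,
// so the selection is always a prefix of the ranking.
function selectWithinBudget(
    ranked: CostedChunk[],
    contextWindow: number,
    reservedForOutput: number,
    promptOverhead: number,
): { selected: string[]; used: number; remaining: number } {
    const budget = contextWindow - reservedForOutput
    const selected: string[] = []
    let used = promptOverhead

    for (const chunk of ranked) {
        if (used + chunk.cost > budget) {
            break
        }
        selected.push(chunk.id)
        used += chunk.cost
    }

    return { selected, used, remaining: budget - used }
}
```

With a 9216-unit window, 1024 units reserved for the answer, 200 units of instructions, and chunks costing 3000 units each, only two chunks fit. Doubling the chunk size would leave room for one.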

The total was about 32.8 seconds, notably higher than prompt time alone. That gap hides real product behavior: corpus loading, chunking, preparation work, model readiness, UI update overhead, and one‑time costs that don’t appear if you only look at prompt(). For engineers, this is an important shift: users experience the whole pipeline, not just the API call.
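The arithmetic behind that gap can be sketched with the approximate figures reported for the run. It assumes retrieval and prompting do not overlap, that both fall inside the total, and that the first-chunk time is measured from the start of the prompt call. The `RunTimings` shape and `summarizeTimings` are illustrative names.

```typescript
interface RunTimings {
    totalMs: number
    promptMs: number
    retrievalMs: number
    firstChunkMs: number
}

function summarizeTimings(t: RunTimings) {
    // Time not explained by retrieval or the prompt call itself.
    const unaccountedMs = t.totalMs - t.promptMs - t.retrievalMs
    return {
        unaccountedMs,
        promptShare: t.promptMs / t.totalMs,
        unaccountedShare: unaccountedMs / t.totalMs,
        // Time spent streaming after the first chunk arrived.
        generationAfterFirstChunkMs: t.promptMs - t.firstChunkMs,
    }
}

// Approximate figures from the run described in the article.
const runSummary = summarizeTimings({
    totalMs: 32_800,
    promptMs: 17_400,
    retrievalMs: 8.7,
    firstChunkMs: 7_200,
})
```

Under those assumptions, about 15.4 seconds (close to half the run) sits outside both retrieval and the prompt call, which is the part a prompt-only benchmark would miss.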

The Prompt API is interesting not because it’s limitless, but because its limits are visible and teach you something. Here are the main ones I encountered.

You cannot stuff an entire million‑character book into a prompt. Even when the corpus lives locally, context remains scarce. That pushes you toward retrieval, chunking, and prompt construction strategies sooner than you might expect.

The retrieved excerpts screenshot shows this clearly. Some selected chunks are relevant to the query “How does Wagner describe his early artistic ambitions?” But some are relevant mostly because they contain overlapping words like “early”, “artistic”, or “ambitions”, not because they are the best narrative evidence. That is a useful failure mode—it shows why better retrieval becomes necessary.

The Prompt API is not a universal browser primitive yet. It depends on Chrome support, device capability, model management, and the environment. Every serious app needs a plan for unsupported devices, first‑time model download, delayed readiness, and the possibility that the model is unavailable or removed.

Streaming makes the wait feel more humane after generation starts, but it does not remove the wait before generation starts. A slow first‑token experience remains an issue.

In the current version, I can measure prompt timing and context usage cleanly. What I cannot claim cleanly is exact model memory consumption, the way I might with a dedicated server‑side runtime. Some metrics are authoritative; some are approximate. Good benchmarking should label the difference honestly.

Even with those limits, building on a browser‑native AI surface has clear benefits. You ask the browser what is available. You create a session. You stream output. You inspect context pressure. You see download progress. You can build a real experiment around that.

For an engineer, that means you can learn about product design, retrieval systems, latency, UI feedback, and model constraints all within one project.

Obvious and useful extensions:

This becomes less about whether the model answered, and more about why this configuration behaved the way it did.

The Prompt API made me think less about “AI features” and more about systems behavior under constraints. That is why this experiment felt worth building. The model answered a question about Wagner—fine. But the more interesting outcome was watching the browser become a measurable inference environment with its own quirks, bottlenecks, and product tradeoffs.

If you are early in your engineering journey, this is the kind of project I would recommend: one that looks like a demo from a distance, but up close turns into a lesson about architecture. And that is usually where the real learning starts.

*AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs -- without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.*

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.


Source: dev.to (original article)