My Bookmark Engine Returned Chunks. I Added One Endpoint to Make It Answer.

wpnews.pro


[ARTICLE · art-28573] src=dev.to ↗ pub=2026-06-15T20:40Z topic=large-language-models verified=true sentiment=↑ positive


A developer built a search engine over 50k saved tweets using hybrid retrieval and reranking, then added a single endpoint to turn it into an answer engine. By wiring Gemma 4 MoE to the retrieval pipeline, the system produces direct answers grounded in saved content, though quality depends on index depth and embedding precision.

4 min read · 20 views · published Jun 15, 2026

Search returns things you have to read. An answer engine reads them for you.

I built a search engine on top of 50k saved tweets. Ask it something and it returns the five most relevant chunks — found through hybrid retrieval (BM25 keyword search plus vector search) and reranked by a cross-encoder. A Gemma 4 MoE layer already runs in the background too, writing its own reflections on how saved documents connect to each other. You get the chunks back, ranked. Then you read them and synthesise.
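The article does not say how the BM25 and vector result lists are merged before reranking. One common option is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the author's method. The function name, the sample IDs, and the constant `k = 60` are all illustrative.

```typescript
// Hypothetical sketch: merge ranked lists of chunk IDs with reciprocal rank fusion.
// Each input list is ordered best-first. A larger k flattens the gap between ranks.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Made-up IDs standing in for keyword hits and vector hits.
const bm25Ids = ["t1", "t2", "t3"];
const vectorIds = ["t3", "t1", "t4"];
const fused = reciprocalRankFusion([bm25Ids, vectorIds]);
const candidates = fused.slice(0, 5).map((entry) => entry.id);
```

In a pipeline like the one described, the fused candidates would then go to the cross-encoder, which scores each query and chunk pair and keeps the top five.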

That last step bothered me. The model already synthesises when generating reflections. The retrieval already works. The only missing piece was wiring them together at query time.

So I added POST /search?mode=answer

Same retrieval pipeline. Top 5 chunks, reranked. Then instead of returning them raw, Gemma 4 MoE reads them and produces a direct answer grounded in what you saved.

const prompt =
  `Answer the question below using only the sources provided. ` +
  `If the sources don't contain the answer, say so directly.\n\n` +
  `Question: "${query}"\n\n` +
  `Sources:\n${context}\n\n` +
  `Write a direct answer in 2–4 sentences. No preamble. No bullets.\n` +
  `Answer:`;
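The prompt interpolates `query` and `context`, but the article does not show how `context` is assembled. A minimal sketch, assuming each reranked chunk carries its text and a score: the `Chunk` shape, the helper names, and the sample data are hypothetical.

```typescript
// Hypothetical chunk shape; the article does not show the real one.
interface Chunk {
  id: string;
  text: string;
  score: number;
}

// Number each source so a claim in the answer can be traced to a chunk.
function buildContext(chunks: Chunk[]): string {
  return chunks
    .map((chunk, i) => `[${i + 1}] ${chunk.text.trim()}`)
    .join("\n");
}

// Same template as the article's prompt, wrapped in a function.
function buildPrompt(query: string, chunks: Chunk[]): string {
  const context = buildContext(chunks);
  return (
    `Answer the question below using only the sources provided. ` +
    `If the sources don't contain the answer, say so directly.\n\n` +
    `Question: "${query}"\n\n` +
    `Sources:\n${context}\n\n` +
    `Write a direct answer in 2–4 sentences. No preamble. No bullets.\n` +
    `Answer:`
  );
}

// Made-up sample data for illustration only.
const sampleChunks: Chunk[] = [
  { id: "a", text: " Consistency wins. ", score: 0.013 },
  { id: "b", text: "Show up every day.", score: 0.009 },
];
const samplePrompt = buildPrompt("What about consistency?", sampleChunks);
```

Numbering the sources is a design choice, not something the article specifies. It makes it easier to match a sentence in the answer to the chunk it came from.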

With max_tokens: 512 the endpoint returns an empty answer. Gemma 4 is a thinking model: it burns the token budget on internal reasoning before producing output. Raising the limit to max_tokens: 2048 fixes it. The reflection engine hit the same wall. Same fix.
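The arithmetic behind the empty answer can be shown with a toy model, assuming reasoning and visible output draw from the same budget. The figure of 900 reasoning tokens is invented for illustration and is not a measurement from the article. `isEmptyAnswer` is a hypothetical guard, not part of the author's code.

```typescript
// Toy model: hidden reasoning and the visible answer share one token budget.
function visibleTokens(maxTokens: number, reasoningTokens: number): number {
  return Math.max(0, maxTokens - reasoningTokens);
}

// Hypothetical guard so a blank completion is reported instead of returned as an answer.
function isEmptyAnswer(text: string | undefined): boolean {
  return text === undefined || text.trim().length === 0;
}

const reasoningTokens = 900; // invented figure, for illustration only
const smallBudget = visibleTokens(512, reasoningTokens); // 0: nothing left for the answer
const largeBudget = visibleTokens(2048, reasoningTokens); // 1148 tokens left for the answer
```

A guard of this kind would let the endpoint return an explicit error or retry with a larger budget, which is easier to debug than a silent empty string.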

Three queries against the live index.

"What do people say about consistency and showing up every day?"

"Consistency wins. Some people say all the right things but never quite measure up to their words, while others do everything right without saying a word."

Two tweets, cleanly synthesised. Nothing hallucinated from outside the index.

"What do people say about money and wealth building?"

"Discussions include frustrations over the devaluation of the Naira, declining purchasing power, and debates regarding financial expertise and social media income. Some suggest only a certain regular amount is needed to sustain a lifestyle, as anything beyond that is superfluous. Regarding wealth building, it is suggested that one should provide value and be the supply in a supply and demand relationship."

This one surfaced a reflection document alongside raw tweets: a reflection-type entry the engine had already generated from CBEX-related tweets and stored back into the index. The answer pulled from both layers: raw saved content and a previously generated insight. That's the system compounding on itself.

"What are the best ways to learn programming?"

"Search Google, find a video, watch while coding, repeat."

Thin. The retrieval matched surface-level tweets: "Did you learn programming all by yourself?", "Tips on learning how to code 🧵". Not substantive content. The model answered honestly from what it got. The answer is technically grounded. It's just not useful.

It works well on topics with substantive saved content. It returns thin answers on surface-level matches, like the programming question above. The synthesis did its job. The index just doesn't have the depth yet.

The retrieval scores on most queries are low (0.006–0.013 range). That's down to the embedding model the index was built on: bge-small, 384 dimensions, the old default, built for speed over precision. Embeddings are how the engine turns text into numbers it can compare. More dimensions means more room to capture shades of meaning. The index can't switch models without re-ingesting all 50k tweets. When I eventually migrate to qwen3-0.6b (1024 dimensions), retrieval precision improves first, and answer quality follows from that.
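To make "numbers it can compare" concrete, here is a minimal sketch of cosine similarity between two embedding vectors. The article does not say which similarity metric the index uses, so cosine is an assumption. The dimension check shows one reason a model switch forces re-ingestion: a 1024-dimension query vector cannot be compared with stored 384-dimension vectors.

```typescript
// Cosine similarity: 1 means same direction, 0 means orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero for all-zero vectors
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Placeholder vectors with the two dimension counts named in the article.
const oldVector: number[] = new Array(384).fill(0.1);
const newVector: number[] = new Array(1024).fill(0.1);

let mismatchMessage = "";
try {
  cosineSimilarity(oldVector, newVector);
} catch (err) {
  mismatchMessage = (err as Error).message;
}
```

Even if two models happened to share a dimension count, their vector spaces would not be comparable, so every stored chunk still has to be embedded again with the new model.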

For now: the endpoint works. Strong on topics the index has depth on, honest about topics it doesn't.

The sources come back with every answer. Verify the model, check the scores, read the original chunks. And it's not search over the open web. Every answer traces back to something you chose to save. The model can't hallucinate from outside the index because the prompt gives it nothing outside the index to hallucinate from. The grounding is structural, not just instructed.

Retrieval quality is the ceiling on answer quality right now. The next piece is gap detection: a weekly pass that surfaces the three most persistent unanswered questions in the index, showing where the index has depth and where it doesn't. This endpoint makes those gaps visible in real time, one query at a time. Gap detection will map them systematically, every week.
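Gap detection is not built yet, so the following is only one way such a weekly pass might look. It assumes a query log where each entry records whether the answer was grounded in substantive sources. The `QueryLogEntry` shape, the `answered` flag, and `topGaps` are all hypothetical.

```typescript
// Hypothetical log entry; the article does not describe a query log.
interface QueryLogEntry {
  query: string;
  answered: boolean; // false when the sources did not contain a useful answer
}

// Count unanswered queries and return the most frequent ones.
function topGaps(
  log: QueryLogEntry[],
  limit = 3,
): Array<{ query: string; misses: number }> {
  const misses = new Map<string, number>();
  for (const entry of log) {
    if (entry.answered) continue;
    const key = entry.query.trim().toLowerCase(); // crude normalisation
    misses.set(key, (misses.get(key) ?? 0) + 1);
  }
  return Array.from(misses.entries())
    .map(([query, count]) => ({ query, misses: count }))
    .sort((a, b) => b.misses - a.misses)
    .slice(0, limit);
}

// Made-up log for illustration only.
const weeklyLog: QueryLogEntry[] = [
  { query: "Best ways to learn programming?", answered: false },
  { query: "best ways to learn programming? ", answered: false },
  { query: "Money and wealth building", answered: true },
  { query: "How to price freelance work", answered: false },
];
const gaps = topGaps(weeklyLog);
```

Matching on lowercased text is a simplification. Grouping near-duplicate questions by embedding similarity would be a natural refinement, at the cost of more moving parts.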

The endpoint is live. Query it:

POST /search?mode=answer
{ "query": "your question here" }

Source chunks come back alongside the answer. The model used to synthesise it: @cf/google/gemma-4-26b-a4b-it. Same Worker, same $5/month.
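A client call might be assembled as follows. The article gives the path, the `mode=answer` parameter, and the `query` body field. The base URL, the response field names in `AnswerResponse`, and the helper name are assumptions, since the article does not show the response shape.

```typescript
// Hypothetical response shape: an answer plus the chunks it was built from.
interface AnswerResponse {
  answer: string;
  sources: Array<{ text: string; score: number }>;
}

interface AnswerRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Build the POST request described in the article for a given base URL.
function buildAnswerRequest(baseUrl: string, query: string): AnswerRequest {
  const trimmedBase = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return {
    url: `${trimmedBase}/search?mode=answer`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query }),
    },
  };
}

// Placeholder host; substitute the real deployment URL.
const answerRequest = buildAnswerRequest(
  "https://example.invalid/",
  "your question here",
);
// Usage, inside an async function:
//   const res = await fetch(answerRequest.url, answerRequest.init);
//   const data = (await res.json()) as AnswerResponse;
```

Keeping the sources in the typed response is what makes the answer checkable: a caller can print each source's score and text next to the generated answer.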

The index has 50k saved tweets going back to 2016. What you get back is bounded by that. Google searches the internet. This searches what you decided was worth keeping.


Read original on dev.to → dev.to/dannwaneri/my-bookmark-engine-returned-ch…
