Ask the Canon: Semantic Search Without a Vector Database

wpnews.pro

I built askthecanon.com this weekend: a semantic search over 100 public-domain books from Project Gutenberg. You ask a question in plain language and get back the passages that match its meaning, cited by author, title, and chapter. I wanted a non-AI, local solution, hence a retrieval engine built on a Hugging Face model and NumPy, with no full vector database (yet) and no external API or AI involved.

Why Ask the Canon?

I am already finding timeless wisdom using it myself (some of the best apps come from "scratching your own itch"), and I hope it offers a breath of fresh air in a world that seems to be dominated by AI-generated content and quick summaries.

My itch was wanting to read the originals, not an AI-generated summary, but also recognizing I don't have time and focus to read through a whole work (although it's still my aim, I see deep value in it). What if we can meet somewhere in the middle?

Most of what we wrestle with is not new: fear, ambition, grief, how to deal with people who wrong us. A chatbot will distill what the canon says about any of it in seconds, in one smooth, agreeable, slightly forgettable voice. Sometimes that's enough. Often I'd rather read the actual sentence Marcus Aurelius or Francis Bacon wrote and sit with it.

Thoreau, in Walden, on what real reading asks of you:

"Most men have learned to read to serve a paltry convenience ... but of reading as a noble intellectual exercise they know little or nothing; yet this only is reading, in a high sense, not that which lulls us as a luxury and suffers the nobler faculties to sleep the while, but what we have to stand on tip-toe to read and devote our most alert and wakeful hours to."

Ask the Canon does only that: it points a plain-language question at a hand-picked shelf of public-domain books and returns the real passages that answer it, cited down to the chapter. It never writes a word of its own, so nothing is invented and nothing is misattributed.

A result, rendered by the app's own "share as image" feature. Thoreau made my argument in 1854.

This is the first of three posts on how it's built. This one is the engine: how you go from a folder of messy text files to ranked, cited answers without reaching for the heavy infrastructure everyone assumes you need.

The default is over-engineered

Reach for "semantic search" and the stock answer is a vector database (Pinecone, Weaviate, pgvector) plus an embeddings API you call on every query.

That's the right shape at a billion vectors. At personal scale, tens of thousands of passages, it buys you operational weight and a network round-trip you don't need.

The whole corpus here is 79,292 passages across 100 books. As a float32 matrix at 768 dimensions that's about 240 MB, small enough to load once and keep resident. Once it's in RAM, "find the most similar passage" is a matrix multiply, and np.argsort over the result. That's the entire search:

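import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # _model() is the lazily loaded embedding model (loader not shown here);
    # normalize_embeddings=True returns unit-length vectors.
    return _model().encode(texts, normalize_embeddings=True)

def retrieve(query: str, vectors: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    scores = vectors @ embed([query])[0]  # one similarity score per passage
    top = np.argsort(scores)[::-1][:k]    # indices of the k highest scores, best first
    return [(int(i), float(scores[i])) for i in top]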

Because the vectors are L2-normalized at embed time (normalize_embeddings=True), the dot product vectors @ query is cosine similarity. No similarity function to import, no index to tune. One @.
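The equivalence is easy to check on toy data. The sketch below uses random vectors as stand-ins for real embeddings (an assumption, since no model is loaded here), and the helper names normalize and cosine are mine, not the app's:

```python
import numpy as np

def normalize(rows: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 length."""
    return rows / np.linalg.norm(rows, axis=1, keepdims=True)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook cosine similarity on unnormalized vectors, for comparison."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
raw_vectors = rng.normal(size=(5, 8))  # stand-in for 5 passage embeddings
raw_query = rng.normal(size=8)         # stand-in for one query embedding

vectors = normalize(raw_vectors)
query = normalize(raw_query[None, :])[0]

dot_scores = vectors @ query  # what the search computes
cos_scores = np.array([cosine(v, raw_query) for v in raw_vectors])

print(np.allclose(dot_scores, cos_scores))
```

Normalizing once at embed time moves the two norm computations out of the query path, which is why the search itself needs nothing beyond the multiply.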

(I later refactored that bare multiply into a _scores() helper so the index can ship as float16, halving its memory on a small box. The math is identical; how the helper keeps a float16 matmul fast is a Part 2 detail.)
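The memory figures follow directly from the matrix shape. A quick back-of-the-envelope check, using only the passage count and dimension quoted above (index_bytes is a hypothetical helper name):

```python
import numpy as np

N_PASSAGES = 79_292  # corpus size quoted in the article
DIMS = 768           # embedding width quoted in the article

def index_bytes(dtype) -> int:
    """Raw size of the library matrix for a given element type."""
    return N_PASSAGES * DIMS * np.dtype(dtype).itemsize

f32 = index_bytes(np.float32)
f16 = index_bytes(np.float16)

print(f"float32: {f32 / 1e6:.1f} MB")
print(f"float16: {f16 / 1e6:.1f} MB")
```

This counts only the array payload; the Python process, the model weights, and the chunk text all sit on top of it.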

Embed once, cache to disk

The trick that makes this cheap: you embed the corpus exactly once. The model never runs at query time except on the single short query string. I run a local all-mpnet-base-v2 model, so there's no API key and nothing leaves the machine.

Indexing a book writes the vectors straight to a .npy file next to the source:

def build_index(book_id: int, text: str) -> tuple[list[Chunk], np.ndarray]:
    chunks = chunk_text(text)                       # split into labeled passages
    vectors = embed([c.text for c in chunks])       # one normalized row per chunk
    np.save(BOOKS_DIR / f"{book_id}.npy", vectors)  # cache so the book is embedded only once
    return chunks, vectors

At startup, the per-book matrices stack into one library matrix with np.vstack, and that's what every query multiplies against.
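A minimal sketch of that startup step, under assumptions: load_library and locate are hypothetical names, the book IDs are made up, and the row bookkeeping is one possible way to map a hit in the stacked matrix back to its book, not necessarily how the app does it.

```python
import tempfile
from bisect import bisect_right
from pathlib import Path

import numpy as np

def load_library(books_dir: Path, book_ids: list[int]) -> tuple[np.ndarray, list[tuple[int, int]]]:
    """Stack per-book matrices; also return (book_id, first_row) for each book."""
    matrices, starts, row = [], [], 0
    for book_id in book_ids:
        matrix = np.load(books_dir / f"{book_id}.npy")
        matrices.append(matrix)
        starts.append((book_id, row))
        row += len(matrix)
    return np.vstack(matrices), starts

def locate(row: int, starts: list[tuple[int, int]]) -> tuple[int, int]:
    """Map a library row to (book_id, chunk index within that book)."""
    first_rows = [first for _, first in starts]
    book_id, first = starts[bisect_right(first_rows, row) - 1]
    return book_id, row - first

# Demo with two tiny fake books (IDs 1 and 2 are placeholders).
with tempfile.TemporaryDirectory() as tmp:
    books_dir = Path(tmp)
    np.save(books_dir / "1.npy", np.ones((3, 4), dtype=np.float32))
    np.save(books_dir / "2.npy", np.zeros((2, 4), dtype=np.float32))
    library, starts = load_library(books_dir, [1, 2])
    hit = locate(3, starts)  # row 3 is the first chunk of book 2

print(library.shape, hit)
```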

Embedding is the only expensive step, and it happens offline, on my laptop, never on the server. This separation also makes the deploy boring: build the .npy files locally, rsync them to the droplet.

The server never loads the model to build anything; it only embeds the incoming query. (The model itself is lazy and gated on offline env vars so there's no hub round-trip, but that's a Part 2 detail.)

Chunks carry their own citations

A vector store usually means a second store for metadata: which book, which chapter, where in the text. I don't have one. The citation rides along with the chunk.

When I split a book, I track Gutenberg's CHAPTER / BOOK / CANTO headings as I go and stamp each chunk with the section it fell in:

from typing import NamedTuple

class Chunk(NamedTuple):
    label: str  # e.g. "BOOK XI — Chapter IX"
    text: str

So a result isn't a naked paragraph. It's Marcus Aurelius · Meditations — BOOK IV, reconstructed from data that lived in the chunk all along.
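Here is a simplified sketch of the heading tracking. The regex and the label_paragraphs function are my assumptions for illustration; this version labels individual paragraphs and remembers only the most recent heading, while the real splitter also groups paragraphs into chunks and can combine nested headings.

```python
import re
from typing import NamedTuple

class Chunk(NamedTuple):
    label: str
    text: str

# Assumed pattern: a one-line paragraph that starts with one of the three keywords.
HEADING = re.compile(r"(CHAPTER|BOOK|CANTO)\b.*")

def label_paragraphs(text: str) -> list[Chunk]:
    """Stamp each paragraph with the most recent heading seen above it."""
    label = ""
    out: list[Chunk] = []
    for para in re.split(r"\n\s*\n", text.strip()):
        para = para.strip()
        if HEADING.fullmatch(para):
            label = para  # remember the section, do not emit it as a passage
            continue
        if para:
            out.append(Chunk(label, para))
    return out

sample = "BOOK IV\n\nFirst thought.\n\nSecond thought.\n\nBOOK V\n\nThird thought."
chunks = label_paragraphs(sample)

for chunk in chunks:
    print(chunk.label, "|", chunk.text)
```

Because the label is stored with the text, row i of a book's matrix and entry i of its chunk list are all a citation needs.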

The whole "database" is four kinds of file, no server:

books/<id>.npy: the embeddings for one book
books/<id>.chunks.json: the passages and their chapter labels
books/<id>.meta.json: the title and author
library.txt: the list of Gutenberg IDs I grow by hand (now committed to the repo, could split off as "config" later)

To add or remove books, I update library.txt and run sync to rebuild the index, which I then rsync to the server. No database, no migrations, no schema, no API calls.

The bug hiding in the chunk size

Chunk size is the one knob that decides whether any of this works, and my first cut got it wrong in a "silent error" way.

all-mpnet-base-v2 reads at most 384 tokens, roughly 290 words, per input. Anything longer is truncated before a single number is computed.

My first chunk_text targeted 600-word chunks, but it checked the size after appending each paragraph, so 600 was a floor it always overshot. A 599-word chunk plus one more long paragraph could land well past 1,000 words.

So the model embedded the opening third of each passage and silently dropped the rest. The text I stored and cited was the whole passage, but the vector that ranked it represented only the first few hundred words. Search was answering on text it had never read, and a long passage's real subject often sat in the part that got cut.

Nothing errored. The results looked on point, but they were quietly wrong: ranking was decided by the opening of each passage, so the later paragraphs of a long passage, which might have answered the query better, never got matched.
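To see how a target turns into a floor, here is a minimal reconstruction of the check-after-append logic. It is an assumed shape of the first version, not the original code, and it tracks only word counts; WORD_LIMIT is the rough 290-word figure from above.

```python
WORD_LIMIT = 290  # approximate words the model reads per input (from the article)

def chunk_check_after(paragraphs: list[str], target_words: int) -> list[int]:
    """Append first, test the size afterwards; return the word count of each chunk."""
    sizes, words = [], 0
    for para in paragraphs:
        words += len(para.split())
        if words >= target_words:  # checked too late: the paragraph is already in
            sizes.append(words)
            words = 0
    if words:
        sizes.append(words)
    return sizes

# Three synthetic paragraphs of 599, 450, and 100 words.
paragraphs = [" ".join(["w"] * n) for n in (599, 450, 100)]
sizes = chunk_check_after(paragraphs, target_words=600)

# Fraction of the first chunk the model would actually read (roughly 0.28 here).
seen = WORD_LIMIT / sizes[0]
print(sizes, round(seen, 2))
```

The first chunk comes out at 1,049 words even though the target was 600, and only about the first quarter of it would reach the embedding.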

The fix was two small changes. Drop the target to 250 words, a safe buffer under the token limit, and flush a chunk before a paragraph pushes it over, so target_words is a ceiling instead of a floor:

TARGET_WORDS = 250  # ~330-350 tokens, below mpnet's 384 max

para_words = len(para.split())
if current and words + para_words > target_words:
    chunks.append(Chunk(chunk_label, "\n\n".join(current)))
    current = current[-overlap:] if overlap else []
    words = sum(len(p.split()) for p in current)

Now every chunk is embedded in full. Matches got sharper and the cited passages are short enough to read at a glance. Smaller chunks also nearly tripled the corpus, from 29,435 passages to 79,292, which is why the matrix grew while the search improved.
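For context, here is how that flush-before-append check might sit inside a complete loop. This is a sketch under assumptions: chunk_paragraphs is a hypothetical name, it takes pre-split paragraphs and a single label, and the real chunk_text also tracks headings as it goes.

```python
from typing import NamedTuple

class Chunk(NamedTuple):
    label: str
    text: str

TARGET_WORDS = 250

def chunk_paragraphs(paragraphs: list[str], label: str,
                     target_words: int = TARGET_WORDS, overlap: int = 0) -> list[Chunk]:
    """Group paragraphs into chunks, flushing before the target is exceeded."""
    chunks: list[Chunk] = []
    current: list[str] = []
    words = 0
    for para in paragraphs:
        para_words = len(para.split())
        if current and words + para_words > target_words:
            chunks.append(Chunk(label, "\n\n".join(current)))
            current = current[-overlap:] if overlap else []  # optional carry-over
            words = sum(len(p.split()) for p in current)
        current.append(para)
        words += para_words
    if current:  # flush whatever is left at the end
        chunks.append(Chunk(label, "\n\n".join(current)))
    return chunks

# Synthetic paragraphs of 120, 100, 90, and 200 words.
paragraphs = [" ".join(["w"] * n) for n in (120, 100, 90, 200)]
chunks = chunk_paragraphs(paragraphs, label="BOOK IV")
sizes = [len(c.text.split()) for c in chunks]
print(sizes)
```

Two caveats apply to this sketch: a single paragraph longer than the target still produces an oversized chunk, and with overlap enabled the carried-over paragraphs plus the new one can exceed the target too.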

One follow-on: tighter chunks shifted the score distribution (spotted while doing some end-to-end testing with Claude), so I raised the noise threshold from 0.32 to 0.34 to keep false positives out.
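A threshold like this can be applied as a simple filter over the ranked hits. The sketch below is one assumed way to do it: filter_hits is a hypothetical name, the scores are made up, and whether the app compares with >= or >, or checks only the top hit, is not stated above.

```python
NOISE_THRESHOLD = 0.34  # value quoted in the article

def filter_hits(hits: list[tuple[int, float]],
                threshold: float = NOISE_THRESHOLD) -> list[tuple[int, float]]:
    """Keep only hits whose cosine score clears the threshold."""
    return [(i, score) for i, score in hits if score >= threshold]

# Made-up (row, score) pairs, best first.
hits = [(12, 0.51), (7, 0.35), (3, 0.33)]

kept_new = filter_hits(hits)                  # threshold 0.34
kept_old = filter_hits(hits, threshold=0.32)  # the earlier threshold
print(len(kept_new), len(kept_old))
```

An empty result is a useful signal in itself: it means nothing in the corpus is close enough to the question to be worth showing.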

When a vector database stops being overkill

A thoughtful skeptic would push back: this doesn't scale. Correct, and that's the point. A linear scan is O(n) per query. At 80k passages it's a few milliseconds; at 30 million it's not. The moment you outgrow a single machine's RAM, or need filtered queries, multi-tenant isolation, or sub-millisecond latency at scale, a real vector database is the right call.
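One small refinement sits between the two extremes. A full np.argsort sorts every score, which is O(n log n); np.argpartition picks out the top k without a full sort and then only those k need ordering. This is a suggestion of mine rather than something the app does, and top_k is a hypothetical helper:

```python
import numpy as np

def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k highest scores, best first, without sorting everything."""
    k = min(k, len(scores))
    if k <= 0:
        return np.empty(0, dtype=np.intp)
    part = np.argpartition(scores, -k)[-k:]       # top k, in arbitrary order
    return part[np.argsort(scores[part])[::-1]]   # order just those k

rng = np.random.default_rng(0)
scores = rng.random(1000)  # stand-in for one query's similarity scores

fast = top_k(scores, 5)
full = np.argsort(scores)[::-1][:5]
print(np.array_equal(fast, full))
```

At tens of thousands of passages the difference is unlikely to be noticeable, so the plain argsort remains a reasonable default.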

But as detailed in Build the Simplest Thing That Works, I prefer to get something working fast and validate it first by putting it in front of real users.

Two lines of NumPy sit at the root of this engine. Everything else, the cards, the PDF export, the off-domain rejection, is built on top of that.

Using Claude was awesome, both for getting the vectorization off the ground and for matching a classic design vibe so well. But 100-plus commits in, I kept having to step in with my experience and judgment to make the right call and tune things.

This judgment call is the same one I keep coming back to: match the tool to the actual size of the problem, not the size you imagine. It's the kind of engineering judgment AI doesn't change. AI is an accelerator, not a compass, and it still needs you to point it.

Tutorials teach syntax. Courses teach patterns. AI gives unvetted code. None of them review your decisions on your code. That's what 1:1 coaching is for.

source & further reading

belderbos.dev (original article)
There Is No Magic: An AI Agent in 60 Lines of Python
Python Is Not Enough: Why Pythonistas Love Rust (Podcast)
AI Is an Accelerator, Not a Compass