{"slug": "ask-the-canon-semantic-search-without-a-vector-database", "title": "Ask the Canon: Semantic Search Without a Vector Database", "summary": "Developer built Ask the Canon, a semantic search engine over 100 public-domain books, using Hugging Face and NumPy without a vector database or external AI APIs. The system loads 79,292 passages as a 240 MB float32 matrix in RAM and performs search via a single matrix multiply, avoiding over-engineered infrastructure. The project aims to provide direct access to original texts rather than AI-generated summaries.", "body_md": "# Ask the Canon: Semantic Search Without a Vector Database\n\n*Working on something challenging? I coach developers 1:1 on the judgment behind the code, not just the syntax. How it works →*\n\nI built [askthecanon.com](https://askthecanon.com) this weekend: a semantic search over 100 public-domain books from Project Gutenberg. You ask a question in plain language and get the passages that *mean* that, cited by author, title, and chapter. I wanted a non-AI, local solution, hence a retrieval engine built with Hugging Face and NumPy: no full vector database (yet), and no external API or AI involved.\n\n## Why Ask the Canon?\n\nI am already finding timeless wisdom using it myself (some of the best apps come from \"scratching your own itch\"), and I hope it offers a breath of fresh air in a world that seems to be dominated by AI-generated content and quick summaries.\n\nMy itch was wanting to read the originals, not an AI-generated summary, while recognizing I don't have the time and focus to read through every whole work (that's still my aim; I see deep value in it). What if we could meet somewhere in the middle?\n\nMost of what we wrestle with is not new: fear, ambition, grief, how to deal with people who wrong us. A chatbot will distill what the canon says about any of it in seconds, in one smooth, agreeable, slightly forgettable voice. Sometimes that's enough. 
Often I'd rather read the actual sentence Marcus Aurelius or Francis Bacon wrote and sit with it.\n\nThoreau, in *Walden*, on what real reading asks of you:\n\n\"Most men have learned to read to serve a paltry convenience ... but of reading as a noble intellectual exercise they know little or nothing; yet this only is reading, in a high sense, not that which lulls us as a luxury and suffers the nobler faculties to sleep the while, but what we have to stand on tip-toe to read and devote our most alert and wakeful hours to.\"\n\nAsk the Canon does only that: it points a plain-language question at a hand-picked shelf of public-domain books and returns the real passages that answer it, cited down to the chapter. It never writes a word of its own, so nothing is invented and nothing is misattributed.\n\n*A result, rendered by the app's own \"share as image\" feature. Thoreau made my argument in 1854.*\n\nThis is the first of three posts on how it's built. This one is the engine: how you go from a folder of messy text files to ranked, cited answers without reaching for the heavy infrastructure everyone assumes you need.\n\n## The default is over-engineered\n\nReach for \"semantic search\" and the stock answer is a vector database (Pinecone, Weaviate, pgvector) plus an embeddings API you call on every query.\n\nThat's the right shape at a billion vectors. At personal scale, tens of thousands of passages, it buys you operational weight and a network round-trip you don't need.\n\nThe whole corpus here is 79,292 passages across 100 books. As a `float32` matrix at 768 dimensions that's about 240 MB, small enough to load once and keep resident. Once it's in RAM, \"find the most similar passage\" is a matrix multiply, and `np.argsort` over the result. 
That's the entire search:\n\n```python\nimport numpy as np\n\ndef embed(texts: list[str]) -> np.ndarray:\n    # _model() is the lazy model loader, defined elsewhere in the project\n    return _model().encode(texts, normalize_embeddings=True)\n\ndef retrieve(query: str, vectors: np.ndarray, k: int = 5) -> list[tuple[int, float]]:\n    # One dot product per passage; equals cosine similarity on unit vectors\n    scores = vectors @ embed([query])[0]\n    # Indices of the k highest scores, best first\n    top = np.argsort(scores)[::-1][:k]\n    return [(int(i), float(scores[i])) for i in top]\n```\n\nBecause the vectors are L2-normalized at embed time (`normalize_embeddings=True`), the dot product `vectors @ query` *is* cosine similarity. No similarity function to import, no index to tune. One `@`.\n\n(I later refactored that bare multiply into a `_scores()` helper so the index can ship as `float16`, halving its memory on a small box. The math is identical; how the helper keeps a `float16` matmul fast is a Part 2 detail.)\n\n## Embed once, cache to disk\n\nThe trick that makes this cheap: you embed the corpus exactly once. The model never runs at query time except on the single short query string. I run a local [all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model, so there's no API key and nothing leaves the machine.\n\nIndexing a book writes the vectors straight to a `.npy` file next to the source:\n\n```python\ndef build_index(book_id: int, text: str) -> tuple[list[Chunk], np.ndarray]:\n    chunks = chunk_text(text)  # split into labeled passages\n    vectors = embed([c.text for c in chunks])  # one row per chunk\n    np.save(BOOKS_DIR / f\"{book_id}.npy\", vectors)  # cache to disk\n    return chunks, vectors\n```\n\nAt startup, the per-book matrices stack into one library matrix with `np.vstack`, and that's what every query multiplies against.\n\nEmbedding is the only expensive step, and it happens offline, on my laptop, never on the server. This separation also makes the deploy boring: build the `.npy` files locally, `rsync` them to the droplet.\n\nThe server never loads the model to *build* anything; it only embeds the incoming query. 
(The model loading itself is lazy and gated on offline env vars so there's no hub round-trip, but that's a Part 2 detail.)\n\n## Chunks carry their own citations\n\nA vector store usually means a second store for metadata: which book, which chapter, where in the text. I don't have one. The citation rides along with the chunk.\n\nWhen I split a book, I track Gutenberg's `CHAPTER` / `BOOK` / `CANTO` headings as I go and stamp each chunk with the section it fell in:\n\n```python\nfrom typing import NamedTuple\n\nclass Chunk(NamedTuple):\n    label: str  # e.g. \"BOOK XI — Chapter IX\"\n    text: str  # the passage itself\n```\n\nSo a result isn't a naked paragraph. It's `Marcus Aurelius · Meditations — BOOK IV`, reconstructed from data that lived in the chunk all along.\n\nThe whole \"database\" is four kinds of file, no server:\n\n- `books/<id>.npy`: the embeddings for one book\n- `books/<id>.chunks.json`: the passages and their chapter labels\n- `books/<id>.meta.json`: the title and author\n- `library.txt`: the list of Gutenberg IDs I grow by hand (now committed to the repo, could split off as \"config\" later)\n\nTo add or remove books, I update `library.txt` and run `sync` to rebuild the index, which I then `rsync` to the server. No database, no migrations, no schema, no API calls.\n\n## The bug hiding in the chunk size\n\nChunk size is the one knob that decides whether any of this works, and my first cut got it wrong in a \"silent error\" way.\n\n`all-mpnet-base-v2` reads at most 384 tokens, roughly 290 words, per input. Anything longer is truncated before a single number is computed.\n\nMy first `chunk_text` targeted 600-word chunks, and it checked the size *after* appending each paragraph, so 600 was a floor it always overshot. A 599-word chunk plus one long paragraph could land well past 1,000 words.\n\nSo the model embedded the opening third of each passage and silently dropped the rest. 
The text I stored and cited was the whole passage, but the vector ranking it represented only the first few hundred words. Search was answering on text it had never read, and a long passage's real subject often sat in the part that got cut.\n\nNothing errored. The results looked on point, but they were quietly wrong: ranking was decided by the opening of each passage and ignored the rest. The cited passages were often long, and search never matched the later paragraphs that might have answered the query better.\n\nThe fix was two small changes. Drop the target to 250 words, a safe buffer under the token limit, and flush a chunk *before* a paragraph pushes it over, so `target_words` is a ceiling instead of a floor:\n\n```python\nTARGET_WORDS = 250  # ~330-350 tokens, below mpnet's 384 max\n\n# Excerpt from the paragraph loop in chunk_text\npara_words = len(para.split())\nif current and words + para_words > target_words:\n    # Flush before this paragraph would push the chunk over the ceiling\n    chunks.append(Chunk(chunk_label, \"\\n\\n\".join(current)))\n    # Carry the last `overlap` paragraphs into the next chunk\n    current = current[-overlap:] if overlap else []\n    words = sum(len(p.split()) for p in current)\n```\n\nNow every chunk is embedded in full. Matches got sharper and the cited passages are short enough to read at a glance. Smaller chunks also nearly tripled the corpus, from 29,435 passages to 79,292, which is why the matrix grew while the search improved.\n\nOne follow-on: tighter chunks shifted the score distribution (spotted while doing some end-to-end testing with Claude), so I raised the noise threshold from 0.32 to 0.34 to keep false positives out.\n\n## When a vector database stops being overkill\n\nA thoughtful skeptic would push back: this doesn't scale. Correct, and that's the point. A linear scan is O(n) per query. At 80k passages it's a few milliseconds; at 30 million it's not. 
The moment you outgrow a single machine's RAM, or need filtered queries, multi-tenant isolation, or sub-millisecond latency at scale, it's time for a real vector database.\n\nBut as detailed in [Build the Simplest Thing That Works](/blog/build-the-simplest-thing-that-works/), I prefer to get something working fast and validate it first by putting it in front of real users.\n\nTwo lines of NumPy sit at the root of this engine. Everything else, the cards, the PDF export, the off-domain rejection, is built on top of that.\n\nUsing Claude was awesome, both for getting the vectorization off the ground and for matching the *classic* vibe of the design so well. But 100-plus commits in, I kept having to step in with my experience and judgment to make the right call and tune things.\n\nThis judgment call is the same one I keep coming back to: match the tool to the actual size of the problem, not the size you imagine. It's [the kind of engineering judgment AI doesn't change](/blog/ai-doesnt-change-what-software-engineering-is/). AI is an accelerator, not a compass, and [it still needs you to point it](/blog/ai-accelerator-needs-direction/).\n\nTutorials teach syntax. Courses teach patterns. AI gives unvetted code. None of them review *your* decisions on *your* code. That's what 1:1 coaching is for. 
[Here's how it works →](/coaching/)", "url": "https://wpnews.pro/news/ask-the-canon-semantic-search-without-a-vector-database", "canonical_source": "https://belderbos.dev/blog/semantic-search-without-a-vector-database/", "published_at": "2026-06-30 00:00:00+00:00", "updated_at": "2026-06-30 15:56:09.472697+00:00", "lang": "en", "topics": ["artificial-intelligence", "natural-language-processing", "ai-tools", "developer-tools"], "entities": ["Hugging Face", "NumPy", "Gutenberg project", "Marcus Aurelius", "Francis Bacon", "Thoreau", "Walden", "Ask the Canon"], "alternates": {"html": "https://wpnews.pro/news/ask-the-canon-semantic-search-without-a-vector-database", "markdown": "https://wpnews.pro/news/ask-the-canon-semantic-search-without-a-vector-database.md", "text": "https://wpnews.pro/news/ask-the-canon-semantic-search-without-a-vector-database.txt", "jsonld": "https://wpnews.pro/news/ask-the-canon-semantic-search-without-a-vector-database.jsonld"}}