Ask the Canon: Semantic Search Without a Vector Database

A developer built Ask the Canon, a semantic search engine over 100 public-domain books, using Hugging Face and NumPy without a vector database or external AI APIs. The system loads 79,292 passages as a 240 MB float32 matrix in RAM and performs search via a single matrix multiply, avoiding over-engineered infrastructure. The project aims to provide direct access to original texts rather than AI-generated summaries.

I built askthecanon.com (https://askthecanon.com) this weekend: a semantic search over 100 public-domain books from Project Gutenberg. You ask a question in plain language and get the passages that mean that, cited by author, title, and chapter. I wanted a non-AI, local solution, hence a retrieval engine using Hugging Face and NumPy, with no full vector database (yet) and no external API or AI involved.

Why Ask the Canon?

I am already finding timeless wisdom with it myself (some of the best apps come from "scratching your own itch"), and I hope it offers a breath of fresh air in a world that seems to be dominated by AI-generated content and quick summaries.

My itch was wanting to read the originals, not an AI-generated summary, while recognizing that I don't have the time and focus to read through a whole work (although that's still my aim; I see deep value in it). What if we can meet somewhere in the middle?

Most of what we wrestle with is not new: fear, ambition, grief, how to deal with people who wrong us. A chatbot will distill what the canon says about any of it in seconds, in one smooth, agreeable, slightly forgettable voice. Sometimes that's enough. Often I'd rather read the actual sentence Marcus Aurelius or Francis Bacon wrote and sit with it.

Thoreau, in Walden, on what real reading asks of you:

"Most men have learned to read to serve a paltry convenience ... but of reading as a noble intellectual exercise they know little or nothing; yet this only is reading, in a high sense, not that which lulls us as a luxury and suffers the nobler faculties to sleep the while, but what we have to stand on tip-toe to read and devote our most alert and wakeful hours to."
Ask the Canon does only that: it points a plain-language question at a hand-picked shelf of public-domain books and returns the real passages that answer it, cited down to the chapter. It never writes a word of its own, so nothing is invented and nothing is misattributed.

A result, rendered by the app's own "share as image" feature. Thoreau made my argument in 1854.

This is the first of three posts on how it's built. This one is the engine: how you go from a folder of messy text files to ranked, cited answers without reaching for the heavy infrastructure everyone assumes you need.

The default is over-engineered

Reach for "semantic search" and the stock answer is a vector database (Pinecone, Weaviate, pgvector) plus an embeddings API you call on every query. That's the right shape at a billion vectors. At personal scale, tens of thousands of passages, it buys you operational weight and a network round-trip you don't need.

The whole corpus here is 79,292 passages across 100 books. As a float32 matrix at 768 dimensions that's about 240 MB, small enough to load once and keep resident. Once it's in RAM, "find the most similar passage" is a matrix multiply followed by np.argsort over the result. That's the entire search:

```python
import numpy as np

# model() is the lazily loaded sentence-transformers model (loading is a Part 2 detail).
def embed(texts: list[str]) -> np.ndarray:
    return model().encode(texts, normalize_embeddings=True)

def retrieve(query: str, vectors: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    scores = vectors @ embed([query])[0]   # one score per passage
    top = np.argsort(scores)[::-1][:k]     # indices of the k highest scores
    return [(int(i), float(scores[i])) for i in top]
```

Because the vectors are L2-normalized at embed time (normalize_embeddings=True), the dot product vectors @ query is cosine similarity. No similarity function to import, no index to tune. One @.

I later refactored that bare multiply into a scores helper so the index can ship as float16, halving its memory on a small box. The math is identical; how the helper keeps a float16 matmul fast is a Part 2 detail.
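The size figure and the "dot product is cosine similarity" claim above are easy to check for yourself. What follows is a minimal sketch, not the app's code: it uses random vectors as a stand-in for the real embeddings (an assumption, since the actual index isn't reproduced here), and the `cosine` helper is a hypothetical name introduced only for the comparison.

```python
import numpy as np

N_PASSAGES = 79_292   # passage count quoted in the post
DIMS = 768            # embedding width quoted in the post
BYTES_FLOAT32 = 4

# Memory for the full float32 matrix, in decimal megabytes.
size_mb = N_PASSAGES * DIMS * BYTES_FLOAT32 / 1e6

# Stand-in data: random unit vectors, NOT real embeddings.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1_000, DIMS)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = rng.standard_normal(DIMS).astype(np.float32)
query /= np.linalg.norm(query)

# The shortcut from the post: a plain dot product on normalized vectors.
dot_scores = vectors @ query

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Textbook cosine similarity (hypothetical helper, for comparison only)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

explicit = np.array([cosine(v, query) for v in vectors])

# Casting to float16 halves the bytes held in memory.
half = vectors.astype(np.float16)

print(f"{size_mb:.1f} MB")                             # 243.6 MB
print(np.allclose(dot_scores, explicit, atol=1e-5))    # True
print(half.nbytes * 2 == vectors.nbytes)               # True
```

The two score arrays agree to within float32 rounding, which is why normalizing once at embed time lets every later query skip the division entirely.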
Embed once, cache to disk

The trick that makes this cheap: you embed the corpus exactly once. The model never runs at query time except on the single short query string. I run a local all-mpnet-base-v2 (https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model, so there's no API key and nothing leaves the machine. Indexing a book writes the vectors straight to a .npy file next to the source:

```python
# chunk_text, embed, and BOOKS_DIR are defined elsewhere in the project.
def build_index(book_id: int, text: str) -> tuple[list[Chunk], np.ndarray]:
    chunks = chunk_text(text)
    vectors = embed([c.text for c in chunks])
    np.save(BOOKS_DIR / f"{book_id}.npy", vectors)   # cache the embeddings on disk
    return chunks, vectors
```

At startup, the per-book matrices stack into one library matrix with np.vstack, and that's what every query multiplies against. Embedding is the only expensive step, and it happens offline, on my laptop, never on the server.

This separation also makes the deploy boring: build the .npy files locally, rsync them to the droplet. The server never loads the model to build anything; it only embeds the incoming query. The model loading itself is lazy and gated on offline env vars so there's no hub round-trip, but that's a Part 2 detail.

Chunks carry their own citations

A vector store usually means a second store for metadata: which book, which chapter, where in the text. I don't have one. The citation rides along with the chunk. When I split a book, I track Gutenberg's CHAPTER / BOOK / CANTO headings as I go and stamp each chunk with the section it fell in:

```python
from typing import NamedTuple

class Chunk(NamedTuple):
    label: str  # e.g. "BOOK XI — Chapter IX"
    text: str
```

So a result isn't a naked paragraph. It's Marcus Aurelius · Meditations — BOOK IV, reconstructed from data that lived in the chunk all along.
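To make the stacking and the "citation rides along" idea concrete, here is a small sketch under stated assumptions. The `Book` container, `stack_library`, and `cite` are hypothetical names made up for illustration (the post does not show how the app maps a matrix row back to its book), the vectors are toy 3-dimensional arrays rather than real embeddings, and the separator in the citation string is arbitrary.

```python
from typing import NamedTuple
import numpy as np

class Chunk(NamedTuple):
    label: str
    text: str

class Book(NamedTuple):      # hypothetical container, not from the post
    author: str
    title: str
    chunks: list[Chunk]
    vectors: np.ndarray      # shape: (len(chunks), dims)

def stack_library(books: list[Book]) -> tuple[np.ndarray, list[tuple[Book, Chunk]]]:
    """Stack per-book matrices and keep a parallel row -> (book, chunk) list."""
    matrix = np.vstack([b.vectors for b in books])
    rows = [(b, c) for b in books for c in b.chunks]   # same order as the matrix rows
    return matrix, rows

def cite(rows: list[tuple[Book, Chunk]], i: int) -> str:
    """Rebuild a citation from data stored with the chunk itself."""
    book, chunk = rows[i]
    return " · ".join((book.author, book.title, chunk.label))

# Toy data: placeholder text and 3-dimensional unit vectors.
meditations = Book(
    "Marcus Aurelius", "Meditations",
    [Chunk("BOOK IV", "placeholder"), Chunk("BOOK V", "placeholder")],
    np.array([[1, 0, 0], [0, 1, 0]], dtype=np.float32),
)
walden = Book(
    "Henry David Thoreau", "Walden",
    [Chunk("Reading", "placeholder")],
    np.array([[0, 0, 1]], dtype=np.float32),
)

matrix, rows = stack_library([meditations, walden])
query = np.array([0, 0, 1], dtype=np.float32)
best = int(np.argmax(matrix @ query))
print(cite(rows, best))   # Henry David Thoreau · Walden · Reading
```

Because the row list is built in the same order as the stacked matrix, a row index from the search is all that's needed to recover author, title, and section, with no separate metadata store to query.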
The whole "database" is four kinds of file, no server:

- books/<id>.npy: the embeddings for one book
- books/<id>.chunks.json: the passages and their chapter labels
- books/<id>.meta.json: the title and author
- library.txt: the list of Gutenberg IDs I grow by hand (now committed to the repo, could split off as "config" later)

To add or remove books, I update library.txt and run sync to rebuild the index, which I then rsync to the server. No database, no migrations, no schema, no API calls.

The bug hiding in the chunk size

Chunk size is the one knob that decides whether any of this works, and my first cut got it wrong in a "silent error" way.

all-mpnet-base-v2 reads at most 384 tokens, roughly 290 words, per input. Anything longer is truncated before a single number is computed. My first chunk_text targeted 600-word chunks, and it checked the size after appending each paragraph, so 600 was a floor it always overshot. A 599-word chunk plus one more long paragraph could land well past 1,000 words.

So the model embedded the opening third of each passage and silently dropped the rest. The text I stored and cited was the whole passage, but the vector that ranked it represented only the first few hundred words. Search was answering on text it had never read, and a long passage's real subject often sat in the part that got cut.

Nothing errored. The results looked on point, but they were quietly wrong: ranking was decided by the opening of each passage and ignored the rest. The cited passages were often long, and search never matched the later paragraphs that could have answered the query better.

The fix was two small changes.
Drop the target to 250 words, a safe buffer under the token limit, and flush a chunk before a paragraph pushes it over, so target_words is a ceiling instead of a floor:

```python
TARGET_WORDS = 250  # ~330-350 tokens, below mpnet's 384 max

# Excerpt from the per-paragraph loop inside chunk_text:
para_words = len(para.split())
if current and words + para_words > target_words:
    # Flush BEFORE adding the paragraph that would overshoot the ceiling.
    chunks.append(Chunk(chunk_label, "\n\n".join(current)))
    current = current[-overlap:] if overlap else []   # carry trailing paragraphs over
    words = sum(len(p.split()) for p in current)
```

Now every chunk is embedded in full. Matches got sharper and the cited passages are short enough to read at a glance. Smaller chunks also nearly tripled the corpus, from 29,435 passages to 79,292, which is why the matrix grew while the search improved.

One follow-on: tighter chunks shifted the score distribution (spotted while doing some end-to-end testing with Claude), so I raised the noise threshold from 0.32 to 0.34 to keep false positives out.

When a vector database stops being overkill

A thoughtful skeptic would push back: this doesn't scale. Correct, and that's the point. A linear scan is O(n) per query. At 80k passages it's a few milliseconds; at 30 million it's not. The moment I outgrow a single machine's RAM, or need filtered queries, multi-tenant isolation, or sub-millisecond latency at scale, I'd go with a real vector database.

But as detailed in Build the Simplest Thing That Works (/blog/build-the-simplest-thing-that-works/), I prefer to get something working fast and validate it first by putting it in front of real users.

Two lines of NumPy sit at the root of this engine. Everything else, the cards, the PDF export, the off-domain rejection, is built on top of that.

Using Claude was awesome, both for getting the vectorization off the ground and for how well it matched a classic design vibe. But 100-plus commits in, I kept having to step in with my experience and judgment to make the right call and tune things.
This judgment call is the same one I keep coming back to: match the tool to the actual size of the problem, not the size you imagine. It's the kind of engineering judgment AI doesn't change (/blog/ai-doesnt-change-what-software-engineering-is/). AI is an accelerator, not a compass, and it still needs you to point it (/blog/ai-accelerator-needs-direction/).

Tutorials teach syntax. Courses teach patterns. AI gives unvetted code. None of them review your decisions on your code. That's what 1:1 coaching is for. Here's how it works → /coaching/