Ask the Canon: Semantic Search Without a Vector Database

A developer built Ask the Canon, a semantic search engine over 100 public-domain books, using Hugging Face and NumPy without a vector database or external AI APIs. The system loads 79,292 passages as a 240 MB float32 matrix in RAM and performs search via a single matrix multiply, avoiding over-engineered infrastructure. The project aims to provide direct access to original texts rather than AI-generated summaries.

I built askthecanon.com (https://askthecanon.com) this weekend: a semantic search over 100 public-domain books from Project Gutenberg. You ask a question in plain language and get the passages that mean that, cited by author, title, and chapter.

I wanted a non-AI, local solution, hence a retrieval engine built on Hugging Face and NumPy, with no full vector database (yet) and no external API or AI involved.

Why Ask the Canon?

I am already finding timeless wisdom using it myself (some of the best apps come from "scratching your own itch"), and I hope it offers a breath of fresh air in a world that seems to be dominated by AI-generated content and quick summaries.

My itch was wanting to read the originals, not an AI-generated summary, while recognizing that I don't have the time and focus to read through a whole work (although that's still my aim; I see deep value in it). What if we can meet somewhere in the middle?

Most of what we wrestle with is not new: fear, ambition, grief, how to deal with people who wrong us. A chatbot will distill what the canon says about any of it in seconds, in one smooth, agreeable, slightly forgettable voice. Sometimes that's enough. Often I'd rather read the actual sentence Marcus Aurelius or Francis Bacon wrote and sit with it.
Thoreau, in Walden, on what real reading asks of you:

"Most men have learned to read to serve a paltry convenience ... but of reading as a noble intellectual exercise they know little or nothing; yet this only is reading, in a high sense, not that which lulls us as a luxury and suffers the nobler faculties to sleep the while, but what we have to stand on tip-toe to read and devote our most alert and wakeful hours to."

Ask the Canon does only that: it points a plain-language question at a hand-picked shelf of public-domain books and returns the real passages that answer it, cited down to the chapter. It never writes a word of its own, so nothing is invented and nothing is misattributed.

[Image: a result, rendered by the app's own "share as image" feature. Thoreau made my argument in 1854.]

This is the first of three posts on how it's built. This one is the engine: how you go from a folder of messy text files to ranked, cited answers without reaching for the heavy infrastructure everyone assumes you need.

The default is over-engineered

Reach for "semantic search" and the stock answer is a vector database (Pinecone, Weaviate, pgvector) plus an embeddings API you call on every query. That's the right shape at a billion vectors. At personal scale, with tens of thousands of passages, it buys you operational weight and a network round-trip you don't need.

The whole corpus here is 79,292 passages across 100 books. As a float32 matrix at 768 dimensions that's about 240 MB (79,292 × 768 × 4 bytes ≈ 244 MB), small enough to load once and keep resident. Once it's in RAM, "find the most similar passage" is a matrix multiply and an np.argsort over the result.
That's the entire search:

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    # model() is the lazily loaded sentence-transformers model, defined elsewhere.
    return model().encode(texts, normalize_embeddings=True)

def retrieve(query: str, vectors: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    # One matrix-vector product scores every passage against the query.
    scores = vectors @ embed([query])[0]
    # Indices of the k highest scores, best first.
    top = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in top]
```

Because the vectors are L2-normalized at embed time (normalize_embeddings=True), the dot product vectors @ query is cosine similarity. No similarity function to import, no index to tune. One @.

I later refactored that bare multiply into a scores helper so the index can ship as float16, halving its memory on a small box. The math is identical; how the helper keeps a float16 matmul fast is a Part 2 detail.

Embed once, cache to disk

The trick that makes this cheap: you embed the corpus exactly once. The model never runs at query time except on the single short query string. I run a local all-mpnet-base-v2 (https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model, so there's no API key and nothing leaves the machine.

Indexing a book writes the vectors straight to a .npy file next to the source:

```python
def build_index(book_id: int, text: str) -> tuple[list[Chunk], np.ndarray]:
    chunks = chunk_text(text)
    vectors = embed([c.text for c in chunks])
    # Cache the embeddings on disk so the book is never embedded again.
    np.save(BOOKS_DIR / f"{book_id}.npy", vectors)
    return chunks, vectors
```

At startup, the per-book matrices stack into one library matrix with np.vstack, and that's what every query multiplies against. Embedding the corpus is the only expensive step, and it happens offline, on my laptop, never on the server.

This separation also makes the deploy boring: build the .npy files locally, rsync them to the droplet. The server never loads the model to build anything; it only embeds the incoming query. The model loading itself is lazy and gated on offline env vars so there's no hub round-trip, but that's a Part 2 detail.
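The startup stacking step could look roughly like the sketch below. It is an illustration under assumptions, not the site's actual loader: the names load_library and locate, the offsets list, and the use of file stems as book ids are all hypothetical. Only np.load, np.vstack, and the one-.npy-per-book layout come from the description above.

```python
from bisect import bisect_right
from pathlib import Path

import numpy as np


def load_library(books_dir: Path) -> tuple[np.ndarray, list[int], list[str]]:
    """Stack every per-book .npy matrix into one library matrix.

    Hypothetical helper: returns the stacked matrix, the starting row of
    each book inside it, and the book ids (file stems) in the same order.
    """
    paths = sorted(books_dir.glob("*.npy"))
    if not paths:
        raise FileNotFoundError(f"no .npy files in {books_dir}")
    matrices = [np.load(p) for p in paths]
    offsets: list[int] = []
    start = 0
    for m in matrices:
        offsets.append(start)  # first row of this book in the stacked matrix
        start += m.shape[0]
    return np.vstack(matrices), offsets, [p.stem for p in paths]


def locate(row: int, offsets: list[int], book_ids: list[str]) -> tuple[str, int]:
    """Map a row of the stacked matrix back to (book id, row within that book)."""
    i = bisect_right(offsets, row) - 1
    return book_ids[i], row - offsets[i]
```

With something like this in place, a hit at global row r resolves through locate(r, offsets, book_ids) to the book and the position of the chunk inside it, which is all that is needed to look up the matching chunk.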
Chunks carry their own citations

A vector store usually means a second store for metadata: which book, which chapter, where in the text. I don't have one. The citation rides along with the chunk. When I split a book, I track Gutenberg's CHAPTER / BOOK / CANTO headings as I go and stamp each chunk with the section it fell in:

```python
from typing import NamedTuple

class Chunk(NamedTuple):
    label: str  # e.g. "BOOK XI — Chapter IX"
    text: str
```

So a result isn't a naked paragraph. It's Marcus Aurelius · Meditations — BOOK IV, reconstructed from data that lived in the chunk all along.

The whole "database" is four kinds of file, no server:

books/