# One Core, Two Interfaces, No Rewrites

Ask the Canon is built around a functional core shared by a CLI and a web interface, with caching and pre-warming so the first request doesn't pay the load cost. The architecture separates ranking logic from presentation, so both interfaces reuse the same engine without rewrites.

Building agentic AI? I co-run a 6-week cohort where you ship a production-ready agent, not another API wrapper.

When building applications, I always build the core first, then the interfaces. It was no different with [Ask the Canon](https://askthecanon.com): a `uv run main.py ask "..."` CLI for quick iteration and validation, then the web app for the MVP. Search, ranking, citations: all using the same engine.

Ask the Canon's core is a handful of pure functions in one module. Both interfaces are thin wrappers.

This is the second post in a series on how it's built. [The first one](/blog/semantic-search-without-a-vector-database/) was about the retrieval engine. This one is about the wider architecture.

## Functional core, two interfaces

I just have one module with pure functions, clear contracts, and no hidden state:

```python
def embed(texts: list[str]) -> np.ndarray: ...
def load_library(book_ids=None) -> tuple[list[Passage], np.ndarray]: ...
def search_passages(query, passages, vectors, ...) -> list[tuple[int, float]]: ...
def reflow(text: str) -> str: ...
```

`load_library` reads the cached `.npy` files off disk and hands back a list of `Passage` tuples plus the stacked matrix. `search_passages` takes those two and a query and returns ranked `(index, score)` pairs.

The web layer consumes the core functions, with no re-implementation:

```python
from main import (
    embed,
    Passage,
    humanize_author,
    load_library,
    reflow,
    search_passages,
)
```

The CLI's `ask` and the web app's `/api/ask` share the same spine: load the library, call `search_passages`, walk the ranked `(index, score)` pairs. From there each does its own thing.
The CLI prints rich panels and offers an interactive deep-read; the web app serializes to `Match` JSON and logs a bit of analytics on the way out. The ranking decision, meaning what comes back and in what order, is shared. Everything downstream is presentation, which is exactly where a CLI and a web app should differ.

We do the same in [our agentic AI program](https://pythonagenticai.com): one core engine, three interfaces (CLI, Telegram, API / web dashboard).

## I needed caching

`load_library` is not cheap. It walks `books/`, reads a JSON file and an `.npy` file per book, and stacks 80k vectors into one matrix with `np.vstack`. You don't want to pay that overhead on every HTTP request.

In the CLI that's a non-issue: the process loads once and exits. On the web side, it's one decorator away:

```python
from functools import cache

@cache
def library() -> tuple[list[Passage], np.ndarray]:
    return load_library()
```

`@cache` turns the first call into the real load and every call after that into a dictionary lookup, which is much faster.

```python
@app.get("/api/ask")
def ask(q: str, k: int = 5, per_book: int = 2, floor: float = 0.6) -> list[Match]:
    passages, vectors = library()  # cached
    ...
```

## Pre-warm on startup, not on the first visitor

There's a subtlety `@cache` doesn't solve on its own. If the first request is what triggers `library()` and wakes up PyTorch, then the first real visitor pays that tax. App restarts are rare, but making the first visitor wait still isn't acceptable.

FastAPI's `lifespan` offers a nice fix for this: do the work as soon as the app starts, before the first request:

```python
from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    init_db()
    logger.info("Pre-warming vector library and loading models into RAM...")
    _ = library()  # fills the @cache with the stacked matrix
    _ = embed(["warmup"])  # forces PyTorch to wake up and allocate
    logger.info("Ready for traffic.")
    yield

app = FastAPI(title="classics", lifespan=lifespan)
```

- I left a log line to watch the startup time. I also added some comments for possible collaborators and my future self.
- I use `_` as a throwaway variable to make it clear the return value is ignored.
- You can put shutdown logic after `yield`, similar to how pytest fixtures work.

Clean. By the time the first request lands, both the library and the model are warm.

## Lazy loading

I am a proponent of imports at the top, but lazy loading is a serious performance consideration. It's coming to the language itself:

> Lazy imports defer the loading and execution of a module until the first time the imported name is used, in contrast to ‘normal’ imports, which eagerly load and execute a module at the point of the import statement.
>
> PEP 810 – Explicit lazy imports

That's the language-level version, landing in Python 3.15. Here I do it by hand and defer the model import into the function that needs it:

```python
@cache
def model():
    import sentence_transformers as st  # lazy, so the offline env vars take effect first
    return st.SentenceTransformer(EMBED_MODEL)
```

So the model loads once, and only if something actually calls `model()`. `@cache` hands back the same instance every time after.

The "offline env vars" part refers to the second reason I need the import here. At the top of the module I have:

```python
import os

os.environ.setdefault("HF_HUB_OFFLINE", "1")
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
os.environ.setdefault("TQDM_DISABLE", "1")
```

Hugging Face reads `HF_HUB_OFFLINE` at *import time*. Import `sentence_transformers` before those are set and it will try to reach out to the internet, which is not what I want because I have the data and model cached locally. Set them first and the model stays fully offline, with no surprise network calls.

## Functions vs classes

None of this needs a class. The core is functions over plain data (`Passage` and `Chunk` are `NamedTuple`s), the only state is a memoized function, and the two interfaces are thin adapters that share common behavior.

That's the payoff. When I want a third interface tomorrow (e.g. a scheduled job or a different API), it imports the same functions and gets the same behavior for free.
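
To make the "third interface" point concrete, here is a minimal, self-contained sketch. It is not the Ask the Canon code: the word-overlap `search_passages`, the two hard-coded passages, and the `cli_ask`, `api_ask`, and `digest_job` adapters are all hypothetical stand-ins, chosen so the example runs with only the standard library. The shape is what matters: pure functions over a `NamedTuple`, one memoized loader, and adapters that differ only in presentation.

```python
from __future__ import annotations

from functools import cache
from typing import NamedTuple


class Passage(NamedTuple):
    book: str
    text: str


# --- core: pure functions over plain data (toy stand-ins, not the real engine) ---

def search_passages(query: str, passages: tuple[Passage, ...]) -> list[tuple[int, float]]:
    """Rank passages by word overlap (Jaccard); the real engine ranks by embeddings."""
    q = set(query.lower().split())
    ranked = []
    for i, passage in enumerate(passages):
        words = set(passage.text.lower().split())
        union = q | words
        score = len(q & words) / len(union) if union else 0.0
        if score > 0:
            ranked.append((i, score))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


@cache
def library() -> tuple[Passage, ...]:
    """Memoized load: the body runs once, later calls return the same object."""
    return (
        Passage("Meditations", "the obstacle is the way forward"),
        Passage("Walden", "simplify your life and live deliberately"),
    )


# --- interfaces: thin adapters that differ only in presentation ---

def cli_ask(query: str) -> str:
    """CLI-style adapter: one text line per match."""
    passages = library()
    ranked = search_passages(query, passages)
    return "\n".join(f"{score:.2f}  {passages[i].book}" for i, score in ranked)


def api_ask(query: str) -> list[dict]:
    """Web-style adapter: JSON-ready dictionaries."""
    passages = library()
    ranked = search_passages(query, passages)
    return [{"book": passages[i].book, "score": score} for i, score in ranked]


def digest_job(queries: list[str]) -> dict[str, str | None]:
    """Hypothetical third interface: top book per query for a scheduled digest."""
    passages = library()
    top: dict[str, str | None] = {}
    for query in queries:
        ranked = search_passages(query, passages)
        top[query] = passages[ranked[0][0]].book if ranked else None
    return top
```

Adding `digest_job` touched neither `search_passages` nor `library`. All three adapters call the same ranking function, so they cannot disagree about what comes back or in what order.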

Claude scaffolded a first version fast, which saved time. But the offline-import ordering, the pre-warming, the lazy loading, the thin adapters, and the split between core and interface: all that took multiple iterations and engineering judgment. The kind you only catch if you already know to look, and that a per-session agent easily writes past.

As I wrote [here](/blog/ai-accelerator-needs-direction/), AI is an accelerator, not a compass. And as I argued [here](/blog/ai-doesnt-change-what-software-engineering-is/), it's this engineering judgment that AI doesn't change. Going from prototype to production is still a complex, human job.

Next up in part 3: the small post-processing tricks that make the results actually good, no bigger model required.

Most AI tutorials end at "call the API." This cohort ends with a deployed agent: function calling, structured outputs, three interfaces, Docker, 95%+ test coverage. Six weeks of real engineering, not notebooks.

[Join the next Agentic AI cohort →](https://pythonagenticai.com)