When building applications, I always build the core first, then the interfaces. It was no different with Ask the Canon: first a uv run main.py ask "..." CLI for quick iteration and validation, then the web app for the MVP. Search, ranking, and citations all use the same engine.
Ask the Canon's core is a handful of pure functions in one module. Both interfaces are thin wrappers. This is the second post in a series on how it's built. The first one was about the retrieval engine. This one is about the wider architecture.
Functional core, two interfaces #
I just have one module with pure functions, clear contracts, and no hidden state:
def embed(texts: list[str]) -> np.ndarray: ...
def load_library(book_ids=None) -> tuple[list[Passage], np.ndarray]: ...
def search_passages(query, passages, vectors, ...) -> list[tuple[int, float]]: ...
def reflow(text: str) -> str: ...
load_library reads the cached .npy files off disk and hands back a list of Passage tuples plus the stacked matrix. search_passages takes those two and a query and returns ranked (index, score) pairs.
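To make the (index, score) contract concrete, here is a minimal, dependency-free sketch of that kind of ranking. It is not the real search_passages: the name rank_by_similarity, the way k and floor are used, and the plain Python lists standing in for the NumPy matrix are all assumptions for illustration. It also presumes the vectors are unit-normalized, so a dot product equals cosine similarity.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(
    query_vec: list[float],
    vectors: list[list[float]],
    k: int = 5,
    floor: float = 0.0,
) -> list[tuple[int, float]]:
    """Hypothetical ranking: score every row, drop weak hits, keep the top k."""
    scored = [(i, dot(query_vec, v)) for i, v in enumerate(vectors)]
    kept = [pair for pair in scored if pair[1] >= floor]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # best score first
    return kept[:k]


# Three toy "passage" vectors and a query pointing along the first axis.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
ranked = rank_by_similarity([1.0, 0.0], vectors, k=2, floor=0.5)
```

The caller gets back indices into the passage list rather than the passages themselves, which keeps the ranking step independent of how results are later displayed.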
The web layer consumes the core functions, no re-implementation:
from main import (
    embed,
    Passage,
    humanize_author,
    load_library,
    reflow,
    search_passages,
)
The CLI's ask() and the web app's /api/ask share the same spine: load the library, call search_passages, walk the ranked (index, score) pairs. From there each does its own thing. The CLI prints rich panels and offers an interactive deep-read; the web app serializes to Match JSON and logs a bit of analytics on the way out.
The ranking decision, what comes back and in what order, is shared. Everything downstream is presentation, which is exactly where a CLI and a web app should differ.
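A stripped-down sketch of that split, using only the standard library. The names here (rank, render_cli, render_json) and the Passage fields are hypothetical stand-ins, not the project's real code. The point is only that both adapters consume the same ranked pairs and differ in presentation.

```python
import json
from typing import NamedTuple


class Passage(NamedTuple):
    """Hypothetical fields; the real Passage may carry more."""
    author: str
    text: str


def rank(scores: list[float]) -> list[tuple[int, float]]:
    """Core decision: (index, score) pairs, best first."""
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


def render_cli(passages: list[Passage], ranked: list[tuple[int, float]]) -> str:
    """CLI adapter: one human-readable line per match."""
    return "\n".join(
        f"{score:.2f}  {passages[i].author}: {passages[i].text}"
        for i, score in ranked
    )


def render_json(passages: list[Passage], ranked: list[tuple[int, float]]) -> str:
    """Web adapter: the same matches, serialized for an API response."""
    return json.dumps(
        [{"author": passages[i].author, "text": passages[i].text, "score": score}
         for i, score in ranked]
    )


passages = [Passage("author-a", "first passage"), Passage("author-b", "second passage")]
ranked = rank([0.4, 0.9])
cli_out = render_cli(passages, ranked)
json_out = render_json(passages, ranked)
```

Both outputs list author-b before author-a, because the order was decided once, in rank.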
We do the same in our agentic AI program: one core engine, three interfaces (CLI, Telegram, API / web dashboard).
I needed caching #
load_library is not cheap. It walks books/, reads a JSON file and an .npy file per book, and stacks 80k vectors into one matrix with np.vstack. You don't want to pay that overhead on every HTTP request!
In the CLI that's a non-issue: the process loads once and exits. On the web side, it's one decorator away:
from functools import cache

@cache
def library() -> tuple[list[Passage], np.ndarray]:
    return load_library()
@cache turns the first call into the real load and every call after into a dictionary lookup, which is much faster.
@app.get("/api/ask")
def ask(q: str, k: int = 5, per_book: int = 2, floor: float = 0.6) -> list[Match]:
    passages, vectors = library()  # cached
    ...
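A quick way to convince yourself the loader really runs once. This is a toy stand-in: fake_load_library is hypothetical and only records that it was called. cache_info() is the standard helper that functools.cache exposes on the wrapped function.

```python
from functools import cache

load_calls: list[str] = []  # one entry per real (uncached) load


def fake_load_library() -> tuple[list[str], list[list[float]]]:
    """Hypothetical stand-in for the expensive disk load."""
    load_calls.append("load")
    return ["passage"], [[0.1, 0.2]]


@cache
def library() -> tuple[list[str], list[list[float]]]:
    return fake_load_library()


first = library()   # miss: runs fake_load_library
second = library()  # hit: returns the stored result
info = library.cache_info()
```

One consequence worth keeping in mind: every caller receives the very same objects, so the cached list and matrix should be treated as read-only.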
Pre-warm on startup, not on the first visitor #
There's a subtlety @cache doesn't solve on its own. If the first request is what triggers library() and the first embed() call (the one that "wakes up PyTorch"), then the first real visitor pays that tax. App restarts are rare, but making the first visitor wait still isn't acceptable.
FastAPI's lifespan offers a nice fix: do the loading as soon as the app starts, before the first request arrives:
from contextlib import asynccontextmanager

from fastapi import FastAPI


@asynccontextmanager
async def lifespan(app: FastAPI):
    init_db()
    logger.info("Pre-warming vector library and models into RAM...")
    _ = library()  # fills the @cache with the stacked matrix
    _ = embed(["warmup"])  # forces PyTorch to wake up and allocate
    logger.info("Ready for traffic.")
    yield


app = FastAPI(title="classics", lifespan=lifespan)
- I left a log line to watch the startup time. I also added some comments for possible collaborators and my future self.
- I use _ as a throwaway variable to make it clear the return value is ignored.
- You can put shutdown logic after yield, similar to how pytest fixtures work. Clean.
By the time the first request lands, both are warm.
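The ordering guarantee is easier to see with FastAPI taken out of the picture. The sketch below uses only contextlib.asynccontextmanager and asyncio. The warm_up function and the events list are hypothetical, and the async with block stands in for the period during which the server handles requests.

```python
import asyncio
from contextlib import asynccontextmanager

events: list[str] = []  # records the order in which things happen


def warm_up() -> None:
    """Hypothetical stand-in for library() and embed(["warmup"])."""
    events.append("warm")


@asynccontextmanager
async def lifespan(app: object):
    warm_up()                  # startup: runs before the block below is entered
    yield
    events.append("shutdown")  # teardown: runs when the block exits


async def main() -> None:
    async with lifespan(app=None):
        events.append("request")  # everything is already warm here


asyncio.run(main())
```

After the run, events reads warm, request, shutdown, which is the same startup-then-serve-then-cleanup shape the lifespan hook gives the real app.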
Lazy imports #
I am a proponent of imports at the top, but lazy import is a serious performance consideration. It's coming to Python 3.15:
Lazy imports defer the loading and execution of a module until the first time the imported name is used, in contrast to ‘normal’ imports, which eagerly load and execute a module at the point of the import statement.
- PEP 810 – Explicit lazy imports
That's the language-level version, landing in 3.15. Here I do it by hand, deferring the model import into the function that needs it:
@cache
def _model():
    import sentence_transformers as st  # lazy, so the offline env vars take effect first
    return st.SentenceTransformer(EMBED_MODEL)
So the model loads once, and only if something actually calls _model(). @cache hands back the same instance every time after.
The "offline env vars" part refers to the second reason I need the import here. At the top of the module I have:
os.environ.setdefault("HF_HUB_OFFLINE", "1")
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
os.environ.setdefault("TQDM_DISABLE", "1")
Hugging Face reads HF_HUB_OFFLINE at import time. Import sentence-transformers before those are set and it will try to reach out to the internet, which is not what I want because I have the data and model cached locally. Set them first and the model stays fully offline, no surprise network calls.
Functions vs classes #
None of this needs a class. The core is functions over plain data (Passage and Chunk are NamedTuples), the only state is a memoized function, and the two interfaces are thin adapters that share common behavior.
That's the payoff. When I want a third interface tomorrow (e.g. a scheduled job or a different API), it imports the same functions and gets the same behavior for free.
Claude scaffolded a first version fast, which saved time. But the offline-import ordering, the pre-warming, the lazy import, the thin adapters, and the split between core and interface: all that took multiple iterations and engineering judgment. The kind you only catch if you already know to look, and that a per-session agent easily writes past.
As I wrote here, AI is an accelerator, not a compass. And as I argued here, it's this engineering judgment that AI doesn't change. Going from prototype to production is still a complex, human job.
Next up in part 3: the small post-processing tricks that make the results actually good, no bigger model required.