# One Core, Two Interfaces, No Rewrites

> Source: <https://belderbos.dev/blog/two-interfaces-one-core/>
> Published: 2026-07-04 00:00:00+00:00

When building applications, I always build the core first, then the interfaces. It was no different with [Ask the Canon](https://askthecanon.com): a `uv run main.py ask "..."` CLI for quick iteration and validation, then the web app for MVP. Search, ranking, citations, all using the same engine.

Ask the Canon's core is a handful of pure functions in one module. Both interfaces are thin wrappers. This is the second post in a series on how it's built. [The first one](/blog/semantic-search-without-a-vector-database/) was about the retrieval engine. This one is about the wider architecture.

## Functional core, two interfaces

I just have one module with pure functions, clear contracts, and no hidden state:

``` python
def embed(texts: list[str]) -> np.ndarray: ...
def load_library(book_ids=None) -> tuple[list[Passage], np.ndarray]: ...
def search_passages(query, passages, vectors, ...) -> list[tuple[int, float]]: ...
def reflow(text: str) -> str: ...
```

`load_library` reads the cached `.npy` files off disk and hands back a list of `Passage` tuples plus the stacked matrix.

`search_passages` takes those two and a query and returns ranked `(index, score)` pairs.
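To make the `(index, score)` contract concrete, here is a minimal sketch of a ranking function with that shape. It is not the real `search_passages`: the `Passage` fields, the `rank_by_cosine` name, the toy vectors, and the use of plain lists instead of a NumPy matrix are all assumptions, chosen so the example runs with no dependencies.

```python
import math
from typing import NamedTuple


class Passage(NamedTuple):
    # Hypothetical fields; the real Passage tuple may differ.
    book_id: str
    text: str


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_cosine(
    query_vec: list[float], vectors: list[list[float]], k: int = 5
) -> list[tuple[int, float]]:
    """Return the top-k (index, score) pairs, best first. No I/O, no state."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


passages = [
    Passage("book-a", "first passage"),
    Passage("book-b", "second passage"),
    Passage("book-c", "third passage"),
]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # toy 2-d embeddings

ranked = rank_by_cosine([1.0, 0.0], vectors, k=2)
for index, score in ranked:
    print(f"{score:.2f}  {passages[index].book_id}")
```

The caller only ever sees indices and scores, so it can look up whatever it needs from `passages` and present it however it likes.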

The web layer consumes the core functions, no re-implementation:

``` python
from main import (
    embed,
    Passage,
    humanize_author,
    load_library,
    reflow,
    search_passages,
)
```

The CLI's `ask()` and the web app's `/api/ask` share the same spine: load the library, call `search_passages`, walk the ranked `(index, score)` pairs. From there each does its own thing. The CLI prints `rich` panels and offers an interactive deep-read; the web app serializes to `Match` JSON and logs a bit of analytics on the way out.

The *ranking* decision, what comes back and in what order, is shared. Everything downstream is presentation, which is exactly where a CLI and a web app *should* differ.
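A stripped-down sketch of that split, with two adapters over one ranking call. Everything here is illustrative: `fake_search`, `cli_render`, `api_render`, and the `Passage` fields are made-up names, and plain strings and `json` stand in for `rich` panels and the `Match` model.

```python
import json
from typing import NamedTuple


class Passage(NamedTuple):
    # Hypothetical fields for illustration only.
    author: str
    text: str


PASSAGES = [
    Passage("author-a", "first passage"),
    Passage("author-b", "second passage"),
    Passage("author-c", "third passage"),
]


def fake_search(query: str) -> list[tuple[int, float]]:
    """Stand-in for the shared core: returns ranked (index, score) pairs."""
    return [(2, 0.91), (0, 0.74)]


def cli_render(query: str) -> str:
    """CLI adapter: walks the ranked pairs and formats text for a terminal."""
    lines = [
        f"[{score:.2f}] {PASSAGES[i].author}: {PASSAGES[i].text}"
        for i, score in fake_search(query)
    ]
    return "\n".join(lines)


def api_render(query: str) -> str:
    """Web adapter: walks the same pairs and serializes them to JSON."""
    matches = [
        {"author": PASSAGES[i].author, "text": PASSAGES[i].text, "score": score}
        for i, score in fake_search(query)
    ]
    return json.dumps(matches)


print(cli_render("virtue"))
print(api_render("virtue"))
```

Both adapters return results in the same order because neither of them decides the order; they only decide how it looks.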

We do the same in our [agentic AI program](https://pythonagenticai.com): one core engine, three interfaces (CLI, Telegram, API / web dashboard).

## I needed caching

`load_library` is not cheap. It walks `books/`, reads a JSON file and an `.npy` file per book, and stacks 80k vectors into one matrix with `np.vstack`. You don't want to pay that overhead on every HTTP request!

In the CLI that's a non-issue: the process loads once and exits. On the web side, it's one decorator away:

``` python
from functools import cache

@cache
def library() -> tuple[list[Passage], np.ndarray]:
    return load_library()
```

`@cache` turns the first call into the real load and every call after into a dictionary lookup, much faster.

``` python
@app.get("/api/ask")
def ask(q: str, k: int = 5, per_book: int = 2, floor: float = 0.6) -> list[Match]:
    passages, vectors = library()  # cached
    ...
```

## Pre-warm on startup, not on the first visitor

There's a subtlety `@cache` doesn't solve on its own. If the *first* request is what triggers `library()` ("wakes up PyTorch"), then the first real visitor pays that tax. App restarts are rare, but making the first visitor wait still isn't acceptable.

FastAPI's `lifespan` offers a nice fix for this: do it as soon as the app starts, before the first request:

``` python
from contextlib import asynccontextmanager

from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    init_db()
    logger.info("Pre-warming vector library and loading models into RAM...")
    _ = library()          # fills the @cache with the stacked matrix
    _ = embed(["warmup"])   # forces PyTorch to wake up and allocate
    logger.info("Ready for traffic.")
    yield

app = FastAPI(title="classics", lifespan=lifespan)
```

- I left a log line to watch the startup time. I also added some comments for possible collaborators and my future self.
- I use `_` as a throwaway variable to make it clear the return value is ignored.
- You can put shutdown logic after `yield`, similar to how pytest fixtures work. Clean.

By the time the first request lands, both the library and the model are warm.
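The before-and-after-`yield` ordering can be checked without a web framework. This is a sketch using only the standard library; `serve_one_request` is a hypothetical stand-in for what the server does around the context manager, and the `events` list exists purely to record the order:

```python
import asyncio
from contextlib import asynccontextmanager

events: list[str] = []


@asynccontextmanager
async def lifespan(app: object):
    events.append("warm-up")   # startup: runs before any request is served
    yield
    events.append("shutdown")  # teardown: runs when the app stops


async def serve_one_request() -> None:
    # Hypothetical stand-in for the framework's run loop.
    async with lifespan(app=None):
        events.append("request")


asyncio.run(serve_one_request())
print(events)  # ['warm-up', 'request', 'shutdown']
```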

## Lazy loading

I am a proponent of imports at the top, but lazy loading is a serious performance consideration. It's coming in Python 3.15:

> Lazy imports defer the loading and execution of a module until the first time the imported name is used, in contrast to ‘normal’ imports, which eagerly load and execute a module at the point of the import statement.
>
> PEP 810 – Explicit lazy imports

That's the automatic version, landing in 3.15. Here I do it by hand: defer the model import into the function that needs it:

``` python
@cache
def _model():
    import sentence_transformers as st  # lazy, so the offline env vars take effect first
    return st.SentenceTransformer(EMBED_MODEL)
```

So the model loads once, and only if something actually calls `_model()`. `@cache` hands back the same instance every time after.

The "offline env vars" part refers to the second reason I need the import here. At the top of the module I have:

``` python
import os

os.environ.setdefault("HF_HUB_OFFLINE", "1")
os.environ.setdefault("TRANSFORMERS_OFFLINE", "1")
os.environ.setdefault("TQDM_DISABLE", "1")
```

Hugging Face reads `HF_HUB_OFFLINE` *at import time*. Import `sentence-transformers` before those are set and it will try to reach out to the internet, which is not what I want because I have the data and model cached locally. Set them first and the model stays fully offline, no surprise network calls.
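The ordering problem is general: any module that reads its configuration once at import time behaves this way. A sketch with a made-up module (`offline_demo`) and a made-up variable (`DEMO_OFFLINE`), not the Hugging Face internals:

```python
import importlib
import os
import sys
import tempfile
from pathlib import Path

# A made-up module that reads its config once, at import time.
module_dir = tempfile.mkdtemp()
Path(module_dir, "offline_demo.py").write_text(
    "import os\nOFFLINE = os.environ.get('DEMO_OFFLINE') == '1'\n"
)
sys.path.insert(0, module_dir)

os.environ.pop("DEMO_OFFLINE", None)                    # start from a clean slate
os.environ.setdefault("DEMO_OFFLINE", "1")              # 1. set the variable first
offline_demo = importlib.import_module("offline_demo")  # 2. then import

seen_at_import = offline_demo.OFFLINE

os.environ["DEMO_OFFLINE"] = "0"                        # changing it later is too late
seen_after_change = offline_demo.OFFLINE

print(seen_at_import, seen_after_change)  # True True
```

Note that `setdefault` only fills in a value when the variable is not already set, so an explicit setting from the shell still wins.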

## Functions vs classes

None of this needs a class. The core is functions over plain data (`Passage` and `Chunk` are `NamedTuple`s), the only state is a memoized function, and the two interfaces are thin adapters that share common behavior.

That's the payoff. When I want a third interface tomorrow (e.g. a scheduled job or a different API), it imports the same functions and gets the same behavior for free.
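For readers who have not used `NamedTuple` as a data carrier, here is a minimal sketch of "functions over plain data". The `Passage` fields and the `shorten` helper are hypothetical; the post does not list the real fields.

```python
from typing import NamedTuple


class Passage(NamedTuple):
    # Hypothetical fields; the post does not list the real ones.
    book_id: str
    author: str
    text: str


def shorten(passage: Passage, limit: int) -> Passage:
    """Pure function: returns a new Passage, never mutates the input."""
    return passage._replace(text=passage.text[:limit])


original = Passage("book-a", "author-a", "a fairly long passage of text")
short = shorten(original, 8)

book_id, author, text = short  # still a plain tuple, so it unpacks
print(text)                    # a fairly
print(original.text)           # a fairly long passage of text
```

Since the tuple is immutable, any interface can hold on to a `Passage` without worrying that another one will change it.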

Claude scaffolded a first version fast, which saved time. But the offline-import ordering, the pre-warming, the lazy loading, the thin adapters, and the split between core and interface: all that took multiple iterations and engineering judgment. The kind you only catch if you already know to look, and that a per-session agent easily writes past.

As [I wrote here](/blog/ai-accelerator-needs-direction/), AI is an accelerator, not a compass. And as [I argued here](/blog/ai-doesnt-change-what-software-engineering-is/), it's this engineering judgment that AI doesn't change. Going from prototype to production is still a complex, human job.

Next up in part 3: the small post-processing tricks that make the results actually good, no bigger model required.
