# lilbee: a single-executable local AI search engine you can talk to. It pulls and runs models from the Hub, indexes your files and code, and crawls the web into a private, cited library

> Source: <https://discuss.huggingface.co/t/lilbee-a-single-executable-local-ai-search-engine-you-can-talk-to-it-pulls-and-runs-models-from-the-hub-indexes-your-files-and-code-and-crawls-the-web-into-a-private-cited-library/176915#post_1>
> Published: 2026-06-18 02:48:21+00:00

A bit about me first. I’ve been into local AI since I discovered Ollama a few years ago, and I spend most of my time in the terminal. I got fed up watching people paste sensitive documents into frontier AI web apps, and I wanted a lightweight, on-demand way to talk to my own files without managing any infrastructure. So I built lilbee, and it’s open source.

lilbee is a local-first search engine you can talk to, all in a single executable (compiled with Nuitka). It runs and manages its own models, indexes your files and code (90+ file formats and 150+ programming languages), crawls and saves web pages, and answers in plain English with a citation to the exact source file and line. No separate model server, no vector database service, no containers to wire together. It runs entirely on your hardware and only touches the cloud if you point it at a cloud model.

Technical details:

Models: inference runs on llama.cpp, and every model comes from Hugging Face. The huggingface_hub Python library powers every download, and the Hub is the model catalog. lilbee reads each model’s architecture from the GGUF metadata before downloading and tags the ones the pinned runtime can’t load, so you don’t pull several gigabytes only to hit “unsupported architecture” at load time. Each role (chat, embedding, reranking, vision) runs in its own persistent subprocess, so one role holding the GIL during inference can’t stall another or freeze the UI. Cancellation goes through the abort_callback, which is checked at every token tick.
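To make the architecture check concrete, here is a minimal sketch of reading `general.architecture` from the start of a GGUF file. It is not lilbee's code: the function names and the `SUPPORTED` set are made up for illustration, and it assumes a little-endian GGUF file of version 2 or later (where the counts are 64-bit). The metadata sits at the front of the file, so only the first bytes are needed; how lilbee actually fetches them is not covered here.

```python
import io
import struct

GGUF_MAGIC = b"GGUF"
# Byte widths of the fixed-size GGUF value types (type id -> size).
SCALAR_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
TYPE_STRING, TYPE_ARRAY = 8, 9

# Hypothetical list of architectures the bundled runtime can load.
SUPPORTED = {"llama", "qwen2"}


def _read(stream, fmt):
    """Read and unpack a single fixed-size value."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF header")
    return struct.unpack(fmt, data)[0]


def _read_string(stream):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(stream, "<Q")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated GGUF header")
    return data.decode("utf-8")


def _skip_value(stream, value_type):
    """Advance past a metadata value we are not interested in."""
    if value_type == TYPE_STRING:
        _read_string(stream)
    elif value_type == TYPE_ARRAY:
        item_type = _read(stream, "<I")
        count = _read(stream, "<Q")
        for _ in range(count):
            _skip_value(stream, item_type)
    elif value_type in SCALAR_SIZES:
        _read(stream, f"<{SCALAR_SIZES[value_type]}s")
    else:
        raise ValueError(f"unknown GGUF value type {value_type}")


def read_architecture(stream):
    """Return the general.architecture string, or None if it is absent."""
    if stream.read(4) != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    version = _read(stream, "<I")
    if version < 2:
        raise ValueError("this sketch only handles GGUF v2+")
    _read(stream, "<Q")  # tensor count, unused here
    kv_count = _read(stream, "<Q")
    for _ in range(kv_count):
        key = _read_string(stream)
        value_type = _read(stream, "<I")
        if key == "general.architecture" and value_type == TYPE_STRING:
            return _read_string(stream)
        _skip_value(stream, value_type)
    return None


def is_loadable(stream):
    """True if the file's architecture is in the hypothetical SUPPORTED set."""
    return read_architecture(stream) in SUPPORTED
```

With something like this, a catalog entry can be tagged as unsupported from a few kilobytes of header instead of after a full download.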

A fix I contributed back to the Hub: downloads in lilbee run inside a Textual terminal app that captures stdout and stderr, and I hit a crash where hf_hub_download fails when stderr isn’t a real file descriptor. I fixed it upstream and it is now merged: [[BUG FIX]: hf_hub_download crashes when stderr lacks a real file descriptor by tobocop2 · Pull Request #4065 · huggingface/huggingface_hub · GitHub](https://github.com/huggingface/huggingface_hub/pull/4065). That is what lets lilbee show real-time progress bars for model downloads right inside the TUI.
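For anyone who hasn't run into this class of bug: when an app swaps `sys.stderr` for an in-memory stream, calling `fileno()` on it raises instead of returning a descriptor. The snippet below is only an illustration of that failure mode and one defensive check. It is not the code from the PR, and `has_real_fd` is a name invented for this example.

```python
import io
import os


def has_real_fd(stream):
    """Return True if the stream is backed by an OS-level file descriptor."""
    try:
        stream.fileno()
    except (AttributeError, OSError, ValueError):
        # io.UnsupportedOperation subclasses both OSError and ValueError,
        # and some replacement streams have no fileno() method at all.
        return False
    return True


captured = io.StringIO()  # stands in for a stream captured by a TUI
print(has_real_fd(captured))  # False

with open(os.devnull, "w") as sink:
    print(has_real_fd(sink))  # True
```

Code that needs the descriptor (for example to query terminal properties) can use a check like this and fall back to a safe default when the stream is captured.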

Retrieval: documents go through Kreuzberg (heading-aware markdown chunking that prepends the section hierarchy onto each chunk), and code through tree-sitter. Embeddings and chunks live in LanceDB. Search is hybrid: BM25 over LanceDB’s FTS index fused with vector cosine similarity via Reciprocal Rank Fusion, then reranked with cross-encoders or an LLM reranker. Query embeddings are asymmetric for instruction-tuned embedders.
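Reciprocal Rank Fusion is simple enough to show in a few lines. This is a generic sketch rather than lilbee's implementation: the function name is made up, the inputs are plain ranked lists of chunk ids, and `k=60` is the constant commonly used for RRF, which may not be the value lilbee uses.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first rankings into one list of ids.

    Each id scores 1 / (k + rank) per ranking it appears in (rank starts
    at 1), and the scores are summed across rankings.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]    # hypothetical full-text results
vector_hits = ["b", "d", "a"]  # hypothetical vector results
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['b', 'a', 'd', 'c']
```

Because only ranks are used, BM25 scores and cosine similarities never have to be put on the same scale, which is the main appeal of RRF for hybrid search.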

Generation: a context budget fits retrieved chunks and history to the served window, grounded refusal makes it say when the library has nothing relevant instead of guessing, and every answer carries file-and-line citations.
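Roughly, the budget and refusal logic can be pictured like the sketch below. Everything in it is an assumption for illustration: the `Chunk` type, the function names, the greedy best-first policy, and especially the whitespace token counter, which stands in for the model's real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # file the chunk came from
    line: int    # line where the chunk starts


def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_context(chunks, history, window, reserve):
    """Greedily keep best-first chunks that fit the remaining window.

    window  -- total tokens the served model accepts
    reserve -- tokens held back for the answer itself
    """
    budget = window - reserve - sum(count_tokens(msg) for msg in history)
    selected = []
    for chunk in chunks:
        cost = count_tokens(chunk.text)
        if cost > budget:
            continue  # too big, but a later, smaller chunk may still fit
        selected.append(chunk)
        budget -= cost
    return selected


def format_context(selected):
    """Return the cited context, or None to signal a grounded refusal."""
    if not selected:
        return None
    return "\n\n".join(f"[{c.source}:{c.line}]\n{c.text}" for c in selected)
```

When `format_context` returns `None`, the caller would answer that the library has nothing relevant rather than sending an ungrounded prompt to the model.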

Already on Ollama or LM Studio? Point lilbee at them and keep using them. You can drive it from a full-screen terminal app, a CLI, an MCP server for coding agents, an HTTP API, or a Python library. There’s also a free Obsidian plugin for people who don’t live in the terminal.

Try it:

Source:

Coming very soon: I’m moving from in-process inference to a llama-server and llama-swap setup, with gguf-parser sizing each load and tensor-split spreading models across cards, so the same single binary scales from a laptop to multiple GPUs. I recently ran MiniMax M2 across 3 cards, all managed by lilbee. That work is here: [Multi-GPU model serving and opencode integration by tobocop2 · Pull Request #267 · tobocop2/lilbee · GitHub](https://github.com/tobocop2/lilbee/pull/267)
