lilbee: a single-executable local AI search engine you can talk to. It pulls and runs models from the Hub, indexes your files and code, and crawls the web into a private, cited library

Developer tobocop2 released lilbee, an open-source, single-executable local AI search engine that indexes files, code, and web pages, and answers queries with citations. The tool runs entirely on local hardware, uses models from Hugging Face, and its development led to a fix in the huggingface_hub library. lilbee aims to provide a private, infrastructure-free alternative to cloud-based AI search.

A bit about me first. I’ve been into local AI since I discovered Ollama a few years ago, and I’m mostly in the terminal. I got fed up watching people paste sensitive documents into frontier AI web apps, and I wanted a lightweight, on-demand way to talk to my own stuff without worrying about managing infrastructure and all that. So I built lilbee, and it’s open source.

lilbee is a local-first search engine you can talk to, all in a single executable compiled with Nuitka. It runs and manages its own models, indexes everything you own (90+ file formats and 150+ programming languages), crawls and saves web pages, and answers in plain English with a citation to the exact source file and line. No separate model server, no vector database, no containers to wire together. It runs entirely on your hardware and only touches the cloud if you point it at a cloud model.

Technical details:

Models: inference runs on llama.cpp, and every model comes from Hugging Face. The huggingface_hub Python library powers every download, and the Hub is the model catalog. lilbee reads each model’s architecture from the GGUF metadata before downloading and tags the ones the pinned runtime can’t load, so you don’t pull multiple GB only to hit “unsupported architecture” at load time.
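To make the GGUF pre-check concrete, here is a minimal sketch of reading `general.architecture` from a GGUF header using only the standard library. It is an illustration, not lilbee's actual code: the function names and the `SUPPORTED` set are hypothetical, and it assumes a GGUF version 2 or 3 header (little-endian, 64-bit counts). The architecture key is normally near the top of the metadata, which is why only the start of a file is needed.

```python
import io
import struct

# GGUF metadata value types mapped to struct formats (scalar types only).
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9

# Hypothetical list of architectures the pinned runtime can load.
SUPPORTED = {"llama", "qwen2", "gemma"}


def _read(stream, fmt):
    """Read one fixed-size little-endian value from the stream."""
    size = struct.calcsize(fmt)
    data = stream.read(size)
    if len(data) != size:
        raise ValueError("truncated GGUF header")
    return struct.unpack(fmt, data)[0]


def _read_string(stream):
    """GGUF strings are a uint64 byte length followed by UTF-8 bytes."""
    length = _read(stream, "<Q")
    data = stream.read(length)
    if len(data) != length:
        raise ValueError("truncated GGUF string")
    return data.decode("utf-8")


def _read_value(stream, vtype):
    """Read a metadata value of the given type id."""
    if vtype in _SCALARS:
        return _read(stream, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(stream)
    if vtype == _ARRAY:
        item_type = _read(stream, "<I")
        count = _read(stream, "<Q")
        return [_read_value(stream, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_architecture(stream):
    """Return the value of general.architecture, or None if it is absent."""
    if stream.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _read(stream, "<I")             # format version
    _read(stream, "<Q")             # tensor count (unused here)
    kv_count = _read(stream, "<Q")  # number of metadata key/value pairs
    for _ in range(kv_count):
        key = _read_string(stream)
        value = _read_value(stream, _read(stream, "<I"))
        if key == "general.architecture":
            return value
    return None


def _gguf_string(text):
    """Helper for the demo: encode a string the way GGUF stores it."""
    raw = text.encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw


# Demo: a tiny synthetic header with two metadata entries and no tensors.
demo_header = (
    b"GGUF" + struct.pack("<IQQ", 3, 0, 2)
    + _gguf_string("general.alignment") + struct.pack("<II", 4, 32)
    + _gguf_string("general.architecture") + struct.pack("<I", _STRING)
    + _gguf_string("llama")
)
arch = read_architecture(io.BytesIO(demo_header))
print(arch, "loadable" if arch in SUPPORTED else "unsupported architecture")
```

Because `read_architecture` takes any binary file-like object, the same check could run against a partial read of a remote file rather than a full download.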
Each role (chat, embedding, reranking, vision) runs in its own persistent subprocess, so one role holding the GIL during inference can’t stall another or freeze the UI, and cancellation rides the abort callback at every token tick.

A fix I contributed back to the Hub: downloads in lilbee run inside a Textual terminal app that captures stdout and stderr, and I hit a crash where hf_hub_download fails when stderr isn’t a real file descriptor. I fixed it upstream and it is now merged: “BUG FIX: hf_hub_download crashes when stderr lacks a real file descriptor” by tobocop2, Pull Request #4065, huggingface/huggingface_hub, https://github.com/huggingface/huggingface_hub/pull/4065. That is what lets lilbee show real-time progress bars for model downloads right inside the TUI.

Retrieval: documents go through Kreuzberg (heading-aware markdown chunking that prepends the section hierarchy onto each chunk), and code through tree-sitter. Embeddings and chunks live in LanceDB. Search is hybrid: BM25 over LanceDB’s FTS index fused with vector cosine similarity via Reciprocal Rank Fusion, then reranked with cross-encoders or an LLM reranker. Query embeddings are asymmetric for instruction-tuned embedders.

Generation: a context budget fits retrieved chunks and history to the served window, grounded refusal makes it say when the library has nothing relevant instead of guessing, and every answer carries file-and-line citations.

Already on Ollama or LM Studio? Point lilbee at them and keep using them. You can drive it from a full-screen terminal app, a CLI, an MCP server for coding agents, an HTTP API, or a Python library. There’s also a free Obsidian plugin for people who don’t live in the terminal.

Try it:

Source:

Coming very soon: I’m moving from in-process inference to a llama-server and llama-swap setup, with gguf-parser sizing each load and tensor-split across cards, so the same one binary scales from a laptop to multiple GPUs.
I recently ran MiniMax M2 across 3 cards, all managed by lilbee. That work is here: “Multi-GPU model serving and opencode integration” by tobocop2, Pull Request #267, tobocop2/lilbee, https://github.com/tobocop2/lilbee/pull/267
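
For anyone unfamiliar with the Reciprocal Rank Fusion step mentioned under Retrieval, here is a minimal sketch of how two ranked lists (say, one from BM25 and one from vector search) can be fused. It is an illustration only, not lilbee's implementation: the function name, the `file:line` chunk IDs, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Only ranks matter, so BM25 scores and
    cosine similarities never need to be put on a common scale.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk IDs in "file:line" form, best match first.
bm25_hits = ["a.md:12", "b.py:40", "c.md:3"]
vector_hits = ["b.py:40", "d.md:7", "a.md:12"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Chunks that appear in both lists rise to the top, which is the point of the fusion. The fused list would then go to the reranker.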