Knowa – Open-Source LLM Context Optimizer

Knowa has released an open-source hybrid retrieval library and knowledge base server that reduces LLM context costs by 90–99% by indexing documents as vector chunks, full-text pages, and a named-entity graph, then extracting only relevant chunks per query. The tool, which ingests from local directories, Notion, and Confluence, aims to solve the problem of sending millions of tokens per request at high cost: a 1,000-page knowledge base can cost tens of thousands of dollars monthly at scale. Knowa's precision retrieval approach, which includes optional LLM entity enrichment for domain-specific terms, is positioned as critical for scaling AI from prototype to company-wide infrastructure without breaking budgets.

The naive approach to building AI-powered apps is to load your documents into the prompt and let the LLM figure it out. It works in demos. It breaks in production.

A 1,000-page knowledge base is roughly 2–4 million tokens. At current API pricing, sending that on every request costs dollars per query. At 10,000 queries a day that is tens of thousands of dollars a month, for context that is 95% irrelevant to the question being asked. As AI usage scales across teams and products, this becomes the dominant cost line.

Knowa's core job is to solve this. It indexes your documents once (as vector chunks, full-text pages, and a named-entity graph), then for each question extracts only the handful of chunks that are actually relevant, typically 1,000–3,000 tokens out of a large corpus. At scale that is a 90–99% reduction in context sent to the LLM, with no loss in answer quality. Every query response includes a measured token savings figure so you can track this in production. Savings are corpus-size dependent; see [Understanding token savings](#understanding-token-savings) for what to expect at different scales.

This matters now and will matter much more as AI usage goes from prototype to product to company-wide infrastructure.
Precision retrieval is not a nice-to-have: it is the difference between an AI feature that scales and one that breaks your budget.

Knowa is a hybrid retrieval library and knowledge base server. It ingests documents from local directories, Notion, Confluence, and any custom source you connect, storing them as vector chunks and full-text pages in PostgreSQL, and as a named-entity knowledge graph in PostgreSQL (default), Neo4j, or Kuzu. For each question it retrieves only the most relevant context across all three representations and returns it ready to inject into any LLM or AI pipeline you choose.

The knowledge graph is built during indexing and gives retrieval a structural dimension that pure vector search misses. It connects people, products, organisations, and concepts across your entire document corpus, so entity-centric questions ("what teams work on X?", "which pages mention both Y and Z?") get precise, targeted answers.

How the graph is populated depends on how much coverage you need:

- spaCy NER (default): runs locally at indexing time with zero API cost. Recognises standard entity types (people, organisations, locations, dates, and more). You can swap the model to improve accuracy or target a specific domain: `en_core_web_trf` (transformer-based, better on ambiguous names), scispaCy models for scientific or biomedical text, or any spaCy-compatible model.
- LLM entity enrichment (opt-in): an additive second pass using any OpenAI-compatible model (gpt-4o-mini, Qwen3, Kimi2, and others). The LLM extracts domain-specific entities that a general NER model misses: product features, internal codenames, technical standards, abstract concepts, and anything else with meaning in your organisation's language. It runs concurrently across pages to keep indexing fast at scale, and costs roughly $0.30 per 1,000 pages with gpt-4o-mini.
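To make the entity-centric questions above concrete, here is a minimal sketch of how a page-to-entity index can answer "which pages mention both Y and Z?". It is not Knowa's implementation: the `EntityGraph` class, the page IDs, and the entity names are all hypothetical, and the entities are supplied by hand where a NER pass would normally produce them.

```python
from collections import defaultdict


class EntityGraph:
    """Toy bipartite graph linking pages to the entities they mention.

    Hypothetical illustration only; not Knowa's storage model or API.
    """

    def __init__(self) -> None:
        self._pages_by_entity: dict[str, set[str]] = defaultdict(set)
        self._entities_by_page: dict[str, set[str]] = defaultdict(set)

    def add_page(self, page_id: str, entities: list[str]) -> None:
        # Entity names are lower-cased so lookups are case-insensitive.
        for name in entities:
            key = name.lower()
            self._pages_by_entity[key].add(page_id)
            self._entities_by_page[page_id].add(key)

    def pages_mentioning_all(self, *entities: str) -> set[str]:
        # Intersect the page sets of every requested entity.
        sets = [self._pages_by_entity.get(e.lower(), set()) for e in entities]
        return set.intersection(*sets) if sets else set()

    def related_entities(self, entity: str) -> set[str]:
        # Entities that co-occur with `entity` on at least one page.
        key = entity.lower()
        related: set[str] = set()
        for page in self._pages_by_entity.get(key, set()):
            related |= self._entities_by_page[page]
        related.discard(key)
        return related


# Hand-written entities standing in for the output of a NER pass.
graph = EntityGraph()
graph.add_page("roadmap.md", ["Payments Team", "Checkout", "Alice"])
graph.add_page("oncall.md", ["Payments Team", "Alice"])
graph.add_page("design.md", ["Checkout", "Search Team"])

both = graph.pages_mentioning_all("Payments Team", "Checkout")
related = graph.related_entities("Checkout")
print(sorted(both))
print(sorted(related))
```

A real graph backend would also store entity types and typed edges; the point here is only that co-occurrence lookups are set operations on an index built once at indexing time, which is why they can be answered without scanning document text.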
Prerequisites:

- Python 3.11+
- [Docker](https://docs.docker.com/get-docker/) (for PostgreSQL)
- OpenAI API key

Start the database with `docker compose up -d`. This starts a PostgreSQL 16 instance with pgvector pre-installed, exposed on `localhost:5432`. Data is persisted in a Docker volume across restarts.

Create and activate a virtual environment with `python -m venv .venv` followed by `source .venv/bin/activate` (Windows: `.venv\Scripts\activate`). Or, if you have miniconda installed, run `conda create -n knowa python=3.12` followed by `conda activate knowa`.

Install the dependencies:

- `pip install -r requirements.txt`
- `python -m spacy download en_core_web_sm`
- `pip install -e .` (registers the `knowa` CLI command)

Copy the example environment file with `cp .env.example .env`, then edit `.env` and fill in the required values:

| Variable | Required | Notes |
|---|---|---|
| `DATABASE_URL` | Yes | Pre-filled to match docker compose; no change needed |
| `OPENAI_API_KEY` | Yes | Your OpenAI API key |
| `OPENAI_MODEL` | Yes | e.g. `gpt-5.4` |
| `API_KEY` | Yes | Any random secret, used to protect the REST API |
| `NOTION_API_KEY` | No | Required only for Notion sources |
| `CONFLUENCE_*` | No | All four vars required for Confluence sources |
| `SPACY_MODEL` | No | NER model for graph extraction (default: `en_core_web_sm`) |
| `ENTITY_LLM_MODEL` | No | Enable LLM entity enrichment, e.g. `gpt-4o-mini` (see below) |

Generate a strong API key with `python3 -c "import secrets; print(secrets.token_urlsafe(32))"`.

Index a directory with `knowa index /path/to/docs --name "My Docs"`. Migrations run automatically on the first command. Supported file types: `.md`, `.txt`, `.pdf`, `.docx`.

Chat with the index using `knowa chat "What would you like to know?"`.

To use the admin UI, start the server with `uvicorn knowa.api.main:app --reload --port 8000` and open http://localhost:8000/admin/ui to search, browse sources, and trigger rebuilds from the browser.

As a Python library: embed Knowa directly in your app. You own the LLM call; Knowa handles indexing, retrieval, and context formatting. Works with Anthropic, OpenAI, Gemini, local models, LangChain, LlamaIndex, or any other framework.

As a REST API: run the FastAPI server and hit `/query` to get complete answers with citations from any language or tool, no Python required.
As a CLI tool: use the `knowa` command to index directories and chat with your knowledge base directly from the terminal, no server or code required.

- Up to 90–99% token reduction at scale: three-path hybrid retrieval (vector search, full-text search, property graph) surgically extracts only what is relevant; every query reports measured token savings so you can track efficiency in production. See [Understanding token savings](#understanding-token-savings) for realistic expectations at different corpus sizes.
- Hierarchical chunking: documents are split into small child chunks for precise vector search, then expanded to larger parent chunks for full context, maximising relevance without losing surrounding information.
- Pluggable embedders: `OpenAIEmbedder` (default) or `SentenceTransformerEmbedder` (fully local, no API key needed at query time); implement the `Embedder` protocol to add your own.
- Multiple sources: Notion, Confluence Cloud, and local directories (`.md`, `.txt`, `.pdf`, `.docx`).
- Incremental sync: only re-processes files changed since the last run.
- Zero LLM calls at index time by default: spaCy handles entity extraction, so there is no OpenAI spend during indexing regardless of knowledge base size. For richer entity coverage, an optional LLM enrichment pass can be enabled (see [Graph entity extraction](#graph-entity-extraction)).
- Admin UI: per-source sync/rebuild controls, query interface, token savings tracking, and interactive entity graph visualization.
- CLI: full index and chat management without running the server.

## Understanding token savings

The token savings figure shown after each query measures how much of your indexed corpus Knowa avoided sending to the LLM:

`savings % = 1 − (tokens retrieved by this query / all parent-chunk tokens in the DB)`

This is a corpus-size-relative metric. With a small corpus, retrieval may cover most of it on every query, and savings will be near 0%. That is expected and correct; it is not a bug or a misconfiguration.
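The formula above can be sketched in a few lines. This is an illustration rather than Knowa's code: the function name `token_savings_pct` and the token counts are assumptions chosen for the example, and the guard for an empty corpus is a choice made here, not documented behaviour.

```python
def token_savings_pct(retrieved_tokens: int, corpus_tokens: int) -> float:
    """Return savings % = (1 - retrieved / corpus) * 100.

    Hypothetical helper: `corpus_tokens` stands for all parent-chunk
    tokens in the database, `retrieved_tokens` for one query's context.
    """
    if corpus_tokens <= 0:
        # Nothing indexed, so nothing can be saved (assumed convention).
        return 0.0
    # Retrieval cannot send more than the whole corpus, so cap at 100%.
    fraction_sent = min(retrieved_tokens / corpus_tokens, 1.0)
    return (1.0 - fraction_sent) * 100.0


# Illustrative numbers only: a tiny corpus versus a large one.
small = token_savings_pct(retrieved_tokens=9_000, corpus_tokens=10_000)
large = token_savings_pct(retrieved_tokens=10_000, corpus_tokens=2_000_000)
print(f"small corpus: {small:.1f}% saved")
print(f"large corpus: {large:.1f}% saved")
```

With these made-up inputs, retrieving 9,000 of 10,000 indexed tokens saves about 10%, while retrieving 10,000 of 2,000,000 saves about 99.5%. The retrieved amount barely changes; the denominator does, which is why the metric only becomes meaningful as the corpus grows.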
By default Knowa retrieves the top 5 child chunks by vector similarity (`TOP_K_CHUNKS=5`), then expands each to its parent chunk (~2,048 tokens). That means roughly 10,000 tokens of context per query, regardless of corpus size. Savings only accumulate once your total indexed content significantly exceeds that window.

The estimates below assume:

- Default settings: `TOP_K_CHUNKS=5`, parent chunk size ~2,048 tokens.
- Typical document sizes: wiki pages, Notion pages, and Confluence pages of 500–3,000 words produce 1–4 parent chunks each. Short README-style files (under 500 words) may produce a single chunk; long PDFs or DOCX files (5,000+ words) may produce 5–15 chunks each.
- Mixed file types shift the breakpoints: a corpus of 10 long PDFs can behave like 50+ short markdown files in terms of total chunk count.

| Corpus | Approx. parent chunks | Expected savings |
|---|---|---|
| 5 short docs | 5–10 | 0–10% (retrieval covers most of the corpus) |
| 20–30 substantial pages | 30–50 | 40–60% |
| 100+ pages | 100–200 | 70–85% |
| 500+ pages | 500+ | 85–95% |
| Large wiki (1,000+ pages) | 1,000+ | 90–99% |

A 0% savings reading does not mean retrieval is broken or that Knowa is not helping. It means your corpus is small enough that the retrieval window covers essentially all of it, which is the correct behaviour. The answer quality benefit (finding the right pages and synthesising a coherent answer) is present at any corpus size. The savings badge becomes a useful production monitoring signal once your knowledge base grows large enough that retrieval is genuinely selective, typically 20–30 substantial documents or more.

If you have a large corpus but still see low savings, consider reducing `TOP_K_CHUNKS` in your `.env`. The default of 5 is conservative; dropping to 3 cuts retrieved context by roughly 40% (about 6,000 tokens instead of 10,000) and increases savings, at the cost of slightly lower recall on broad questions.

The CLI connects directly to the database; no server is needed.
- Index a directory (incremental by default, full rebuild on first run):
  - `knowa index /path/to/docs`
  - `knowa index /path/to/docs --name "Engineering Docs"`: attach a friendly label
  - `knowa index /path/to/docs --full`: force full rebuild
  - `knowa index /path/to/docs --workers 4`: parallel indexing (4 threads)
- See all indexed sources with page/chunk counts and labels:
  - `knowa list`
- Chat with the index:
  - `knowa chat`: interactive REPL
  - `knowa chat "What is our refund policy?"`: single-shot
  - `knowa chat --source "Engineering Docs"`: scoped to one source
  - `knowa chat --debug`: show retrieval path before each answer
  - `knowa chat --no-index /path/to/docs`: bypass index, read docs directly
- Inspect graph backend (entity/edge counts):
  - `knowa debug`
- Benchmark questions with and without index:
  - `knowa bench questions.json`
- Clear index for a source:
  - `knowa clear /path/to/docs`
  - `knowa clear --source-id`