{"slug": "build-a-docs-agent-without-vector-search", "title": "Build a Docs Agent Without Vector Search", "summary": "A new approach to building documentation agents replaces vector search with a virtual file system and shell commands, allowing AI agents to search and extract data by running grep, sed, and find commands. The open-source Bashkit tool provides an in-process, sandboxed shell environment for LangChain agents, enabling search as a workflow rather than a single API call. This method offers an inspectable environment where agents can iteratively refine queries without needing semantic search or embeddings.", "body_md": "When we hear about classic agentic applications, there is almost always some version of semantic search over data. In most cases it is implemented as vector search across embeddings, like OpenAI Embeddings.\n\nIn this post, let’s build an alternative approach, where data is represented as files in a virtual file system and the agent figures out how to search it.\n\n## Basic Idea\n\nImagine that all data is represented as files in a file system, and that the agent has the means to work with these files: traverse, search, grep, list, and print selected ranges. Now, when someone asks the agent a question, it just does what it is trained for: it uses its own knowledge to orchestrate reading files, extract data, and answer.\n\nOn top of that, let’s give the agent the ability to run shell commands against these files. Then semantic search is no longer needed in many cases. The agent can do what a developer does:\n\n```\nrg -i -n 'bashkit-cli' /docs/public /docs/rustdoc | head -20\ngrep -R -i -n -C 1 -m 3 -- 'readonly' /docs/public /docs/rustdoc /docs/examples\nsed -n '40,90p' /docs/public/cli.md\nfind /docs/examples -maxdepth 2 -type f -name '*.md'\n```\n\nThis looks too primitive. It is not.\n\nThe useful thing is not `grep` itself. The useful thing is that the agent has an inspectable environment. 
It can try one query, see nothing, retry with a different word, narrow down by filename, read nearby lines with `sed`, combine two searches, and only then answer.\n\nThat is search as a workflow, not search as one API call.\n\n## The Example\n\nThe example is a small console [docs search agent](https://github.com/everruns/bashkit/tree/main/examples/docs-grep-agent). It is a LangGraph agent created through LangChain 1.0, with one Bashkit tool attached.\n\nIf you build your own version outside the Bashkit repo, the first dependency is just:\n\n```\nuv add bashkit\n```\n\nFor this exact shape you also need the LangChain and OpenAI packages:\n\n```\nuv add bashkit langchain langchain-openai\n```\n\nRun the example from the Bashkit repo:\n\n```\ncd examples/docs-grep-agent\nexport OPENAI_API_KEY=sk-...\nuv run docs-grep-agent \"what is bashkit\"\nuv run docs-grep-agent \"give me example on how to use bashkit cli\"\n```\n\nIt answers from the docs, with no vector index behind it.\n\nAnd if you want to see what the agent actually runs:\n\n```\nuv run docs-grep-agent --show-tools \"how do read-only mounts work?\"\n```\n\n## The Whole Trick\n\nThere are three parts.\n\nFirst, the docs are mounted into Bashkit as read-only folders:\n\n```\ndocs_mounts = [\n    (root / \"docs\", \"/docs/public\"),\n    (root / \"crates/bashkit/docs\", \"/docs/rustdoc\"),\n    (root / \"examples/bashkit-pi\", \"/docs/examples/bashkit-pi\"),\n    (root / \"examples/browser\", \"/docs/examples/browser\"),\n]\n```\n\nThen these mounts are passed to `bashkit.langchain.create_bash_tool()`:\n\n```\nreturn create_bash_tool(\n    username=\"agent\",\n    hostname=\"docs\",\n    max_commands=120,\n    max_loop_iterations=1000,\n    timeout_seconds=3,\n    mounts=[\n        {\"host_path\": str(host_path), \"vfs_path\": vfs_path, \"writable\": False}\n        for host_path, vfs_path in docs_mounts\n    ],\n    files=build_example_files(root),\n    allowed_mount_paths=[str(host_path) for host_path, _ in docs_mounts],\n    
readonly_filesystem=True,\n    max_output_length=MAX_CONTEXT_CHARS,\n)\n```\n\nThis gives the LangChain agent one tool: run bash inside Bashkit.\n\nSecond, the tool is attached to the agent. This is the whole instantiation:\n\n```\nagent = create_agent(\n    model=ChatOpenAI(**llm_kwargs),\n    tools=[create_docs_bash_tool(root)],\n    system_prompt=SYSTEM_PROMPT,\n)\n```\n\nLangChain’s `create_agent` builds a LangGraph agent under the hood. Bashkit provides the shell tool. The model writes bash scripts, LangGraph runs the loop, and Bashkit executes each script in the virtual filesystem.\n\nBut not host bash. Bashkit bash. In-process, sandboxed, read-only filesystem, resource limits, capped tool output, no `cp docs /tmp` escape hatch. The example even blocks scratch files, because I wanted to force the agent to answer from the mounted docs and not build some private cache.\n\nThe third part is the prompt. Not long, but opinionated:\n\n```\nCall the bash tool before answering Bashkit documentation questions.\nKeep tool output compact.\nSearch progressively with rg, grep, sed, and targeted find.\nPrefer grep when you need context flags.\nUse only facts present in bash tool output.\nDo not treat a failed command as proof that something is absent;\nretry with a simpler command when needed.\nIf the bash output does not answer the question, say the docs snippets do not show it.\n```\n\nThat is more or less it.\n\nNo semantic index hidden somewhere. The model writes shell search scripts. Bashkit runs them. The model answers from the output.\n\n## Why Not Just Vector Search\n\nBecause vector search is not free architecture.\n\nOnce you add it, you have a few annoying things to deal with:\n\n- **Index freshness.** Docs changed. Did the embeddings update? Did the deployment pick the right index? Is local dev the same as prod?\n- **Chunk boundaries.** The answer is often in the lines before or after the retrieved chunk. Then you add bigger chunks. Then precision drops.\n- **Exact strings.** CLI flags, error codes, filenames, function names. 
Semantic search can find them, but shell search is much more honest here.\n- **Debuggability.** With shell search, a failed answer shows the failed command. With vectors, a failed answer often shows only that “similarity was weird”.\n- **Permissions.** Filesystem mounts are easy to reason about. This folder is mounted read-only. This one is absent. Done.\n\nI am not saying “never use vector search”. I am saying defaulting to it is lazy.\n\nFor docs, code, logs, configs, tickets exported as markdown, product manuals, generated API specs, and many internal knowledge bases, files + shell tools are already a strong baseline.\n\n## What Shell Search Gives Agent\n\nThere is a small but important shift here.\n\nWith vector search, the agent asks an external system “give me relevant stuff.” The search system decides relevance.\n\nWith a shell, the agent controls the search plan:\n\n```\n# discover likely files\nfind /docs/public /docs/rustdoc -type f -name '*cli*'\nfind /docs/public /docs/rustdoc -type f -name '*security*'\n\n# broad discovery\nrg -i -n 'mount' /docs/public /docs/rustdoc | head -20\n\n# search exact phrase\ngrep -R -i -n -C 2 -- 'read-only mount' /docs/public /docs/rustdoc\n\n# read selected range after finding the file and line\nsed -n '40,90p' /docs/public/security.md\n\n# broaden after no result\ngrep -R -i -n -C 2 -- 'readonly\\|writable\\|mount' /docs/public /docs/rustdoc\n```\n\nAnd yes, the prompt in the example says that Bashkit `rg` is intentionally simpler than full ripgrep, and that Bashkit `find` does not support every GNU expression. That is exactly the kind of local tool knowledge you want in the system prompt. The agent does not need perfect Unix. 
It needs accurate constraints for this runtime.\n\n## Safety Is Not Optional\n\nIf you give the agent a real shell over a real docs directory, it will eventually do something stupid.\n\nBashkit makes this much more boring:\n\n- docs are mounted read-only at `/docs/public`, `/docs/rustdoc`, and `/docs/examples`;\n- the full Bashkit filesystem is read-only;\n- execution has command-count, loop, timeout, and output limits;\n- a self-test checks that write attempts fail.\n\n## Where Semantic Search Still Fits\n\nI would still use semantic search when the corpus is messy, natural-language-heavy, or not file-shaped. Customer support conversations. Long prose. Lots of synonyms. No stable keywords. Then embeddings earn their place.\n\nBut for many agent systems, the better first step is different:\n\n- Put the source of truth into files.\n- Mount the files into a safe virtual environment.\n- Give the agent shell tools it already understands.\n- Let it inspect, search, and compose commands.\n- Add semantic search only if the shell-search baseline is not enough.\n\nThis also works nicely with structured data. JSON files, CSV exports, OpenAPI specs, markdown docs, logs. The agent can use `rg`, `grep`, `sed`, `awk`, and `jq` when available. It can build search as a sequence of concrete observations.\n\n## Small But Useful Rule\n\nDo not let the agent be a file browser.\n\nIn the example prompt, if the user asks to list or dump directories, the agent should refuse the raw listing and explain that it answers Bashkit docs questions. Listing files is allowed only as internal discovery for a specific question.\n\nThis avoids turning the docs bot into a slow `tree` command. It also reduces accidental data exposure in larger systems.\n\nThis is where a virtual filesystem is nice. You decide what exists for the agent. The prompt decides how it can use it. 
The runtime enforces what the prompt cannot.\n\n## The Point\n\nI like this pattern because it is stupid simple:\n\n```\nquestion -> agent -> bash search script -> docs snippets -> answer\n```\n\nAnd because it is simple, it is easy to test. Easy to inspect. Easy to constrain. Easy to run locally.\n\nSemantic search is an optimization. Sometimes a good optimization. Sometimes needed.\n\nBut if your agent can safely access data as files, start with shell search.\n\nThat is 80% of the value. Everything else is retrieval engineering.\n\n## Code\n\nFor comments or feedback, write to me at [x.com/chaliy](https://x.com/chaliy).", "url": "https://wpnews.pro/news/build-a-docs-agent-without-vector-search", "canonical_source": "https://chaliy.name/blog/docs-agent-without-vector-search/", "published_at": "2026-05-22 00:00:00+00:00", "updated_at": "2026-06-18 22:34:22.692157+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "large-language-models", "natural-language-processing", "ai-tools"], "entities": ["Bashkit", "LangChain", "OpenAI", "LangGraph", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/build-a-docs-agent-without-vector-search", "markdown": "https://wpnews.pro/news/build-a-docs-agent-without-vector-search.md", "text": "https://wpnews.pro/news/build-a-docs-agent-without-vector-search.txt", "jsonld": "https://wpnews.pro/news/build-a-docs-agent-without-vector-search.jsonld"}}