{"slug": "a-file-level-tree-that-lets-an-llm-reason-over-a-document-corpus", "title": "A file-level tree that lets an LLM reason over a document corpus", "summary": "PageIndex, the open-source retrieval framework that has amassed 26,000 GitHub stars and serves 23,000 cloud users, announced the PageIndex File System, a new layer enabling a single index to reason over millions of documents. The system, available today as part of PageIndex Enterprise with a cloud edition coming later this month, replaces traditional vector-based retrieval with a file-level tree structure that allows an LLM to navigate document hierarchies directly. The release addresses scaling limitations in classic RAG systems, which the company says break down due to embedding models' limited representation power and the disconnect between similarity and relevance in large corpora.", "body_md": "PageIndex Team\n\nPageIndex now scales to millions of documents\n\n*Available today for enterprise. Cloud rollout coming soon. (Get early access)*\n\nWe started PageIndex with one belief: **retrieval over long documents should look more like human reading than like semantic similarity search**. Since launch, the open-source PageIndex, one of the fastest-growing AI-infra repos on GitHub, has crossed **26k GitHub stars** in a few months, hit **#1 on GitHub Trending**, been selected for the **GitHub Secure Open Source Fund**, and now serves **23k+ cloud users** in production.\n\nToday we're announcing the next chapter: the **PageIndex File System**, a new layer on top of the vectorless retrieval engine that lets a single index reason over **millions of documents**. 
It ships today as part of **PageIndex Enterprise**, with a cloud edition arriving later this month.\n\nThis post is a quick tour: why classic vector-based RAG hits a ceiling, what PageIndex is, why a plain file system stops working at this scale, and what the PageIndex File System adds to get past it.\n\nWhere classic vector-based RAG breaks\n\nThe standard RAG recipe is by now familiar: chunk every document into passages, run each chunk through an embedding model to get a fixed-size vector, store those vectors in a vector database, and at query time embed the question and pull back the top-*K* nearest neighbors. It works, until it doesn't. Two things go wrong, and both get worse as the corpus grows.\n\n**1. Embeddings have limited representation power.**\nA single fixed-length vector has to summarize an entire chunk into a few hundred numbers, and embedding models cap their input length at a few hundred or a few thousand tokens. That cap forces two compromises that quietly degrade quality:\n\n- *Chunking breaks semantic continuity.* Real documents have sections, tables, footnotes, and cross-references that flow across page boundaries. Slicing them into fixed-size windows shreds those dependencies. The chunk that contains the answer is often missing the context that makes the answer make sense.\n- *Retrieval is blind to context.* Only the user's literal query gets embedded. The conversation that came before, the user's role, the evolving intent of a multi-turn dialogue: all of that has to be discarded before encoding. The retriever sees a context-stripped probe, not a real question in a real situation.\n\n**2. Similarity is not the same as relevance.**\nVector search ranks by cosine similarity to the query. But what users actually want is *relevance*, and the two come apart in both directions:\n\n- *Similar but not relevant (low accuracy).* In professional domains (legal, medical, financial), language is repetitive and small differences carry critical meaning. 
Two paragraphs can look almost identical to an embedding model and yet say opposite things about who is liable, what dose to give, or which clause applies. Vector search happily returns the wrong one because it \"looks right\".\n- *Relevant but not similar (low recall).* Conversely, the right answer is often phrased very differently from the query, or lives many sections away from the most-cited passage. Finding it takes *reasoning over the document's structure*, not surface-level word matching. Vector search has no mechanism for that, so the genuinely relevant chunk falls past rank *K* and disappears silently. You don't get an error; you just get a worse answer.\n\nThese aren't edge cases. They're the two failure modes our enterprise customers hit again and again, and they're exactly what motivated us to build a different kind of retriever.\n\nWhat is PageIndex?\n\nPageIndex is a **vectorless RAG framework**. Instead of chopping documents into chunks, embedding them into vectors, and ranking by cosine similarity, PageIndex represents each document as a **tree** (sections nest into subsections, subsections into pages, pages into content blocks) and lets an LLM **navigate the tree** to find the answer.\n\nThe shape of the tree is the table of contents you'd see in a book. The retrieval policy is an LLM that, at each node, asks a single question: *given the user's query, the conversation so far, and where I am in the document, should I look inside this subtree?* No fixed top-*K*, no embedding bottleneck, no information silently dropped for falling below a rank cutoff.\n\nThree properties fall out of this design, and each one is exactly what classic vector RAG cannot offer:\n\n- **Relevance classification, not semantic similarity.** The LLM doesn't compute a cosine score; it makes a yes/no judgment at every node (*is this subtree worth opening for this query?*) using full-document understanding, not a 768-dimensional proxy. 
The two failure modes of similarity search (similar-but-irrelevant, relevant-but-dissimilar) simply don't apply.\n- **Retrieval depends on context.** The decision at each node is conditioned on the query, the conversation history, the user's role, and the path the LLM has already walked. There's no fixed-length cap forcing context to be discarded. Context shapes every navigation step.\n- **Transparent retrieval process.** The search trace is a readable path through the tree: which sections were opened, which were skipped, which yielded the evidence. You can audit *why* an answer came back, replay the same path for a different model, and surface the citation chain to the end user. Vector search returns a list of chunks with no story; PageIndex returns a route.\n\nFrom single document to file system\n\nHere is the obvious objection. Classic vector RAG scales effortlessly to millions of documents: embeddings are pre-computed once, top-*K* lookup over a billion vectors is a well-solved engineering problem, and the index doesn't care whether you have 1k or 100M chunks. PageIndex, by contrast, is built on a document **tree**, a richer structure, but one that an LLM has to *navigate*. Won't the LLM choke when there are a million trees to walk?\n\nIt's a fair question, and the answer starts with an observation we can almost take for granted: **a file system is already a tree.** Folders contain subfolders, subfolders contain files. That is a pre-defined hierarchy over a corpus, given to us for free by every document store that has ever existed. So the natural way to scale tree search across documents is to make the file system *itself* a node-level layer: each document tree hangs off a leaf of the file system tree, and the whole corpus becomes one big tree.\n\nThis unification is the key. 
The same tree search policy that an LLM uses to navigate a section of a 100-page report can navigate the directory hierarchy of a million-document drive and then descend, without changing tools, into a specific document's internal tree.\n\nThat is the basic shape. But, as the next section shows, **a plain file system is not enough at a million-document scale**: the hierarchy you inherit from disk is rarely the hierarchy you want to search by. The rest of this post is about how to fix that without abandoning the tree search framework.\n\nWhy a simple file system isn't enough\n\nIf you have a few thousand documents that already live in a tidy folder hierarchy, you can take that folder tree, point an LLM at it, and call it a day. PageIndex has supported that \"inherit the folder structure\" mode from day one.\n\nAt a million documents, this stops working. Three reasons.\n\n**1. Often, there is no folder hierarchy at all.** Many enterprise corpora live in document management systems, S3 buckets, or SharePoint libraries that are effectively *flat*: every file in one giant pool, with nothing but a row of metadata fields (author, date, type) and sometimes not even that. A SQL query over those fields handles the easy cases, but anything that needs *content-level* understanding has no tree to navigate, because no tree was ever built.\n\n**2. The hierarchy is one-dimensional.** Even when a folder tree exists, a document is rarely \"about\" exactly one thing. A contract belongs simultaneously to a vendor, a region, a fiscal year, and a product line. A folder tree forces you to pick one axis. The other axes, the ones the user is actually querying on, are gone.\n\n**3. Folder labels are unreliable signals for an LLM.** Real corpora accumulate folders called `misc/`, `final_v3_USE_THIS_ONE/`, and `2019_legacy/`. Even tidy-looking paths like `/finance/2024/` say nothing about whether the document discusses *pricing risk* or *liquidity*. 
The LLM ends up searching folder *names*, not document *meaning*, and prunes in the wrong direction.\n\nSo the question becomes: what do you put inside the tree when no good tree exists yet?\n\nThe PageIndex File System\n\nThe **PageIndex File System** is what we built to fix all three. It's a query-time tree layer that sits above your documents and lets the same tree search policy scale from a single document to millions. Three technical pieces, all live in the enterprise release:\n\nVirtual nodes: synthesizing the structure\n\nWhen the corpus has no usable hierarchy, PageIndex builds one. Documents are clustered into **topic nodes** by topic models or LLM-driven grouping; each document also gets LLM-inferred metadata (category, summary, key entities), which become additional internal nodes in the tree. The result is a hierarchy whose internal labels are *semantic*: exactly the signal the LLM needs to prune branches early.\n\nCrucially, the same document can sit under more than one virtual ancestor (vendor *and* region *and* year). A flat file system can't express this; a PageIndex tree can.\n\nThe tree is query-dependent\n\nA traditional file system has *one* hierarchy, fixed at ingestion. That works for storage; it does not work for retrieval, because no single hierarchy is right for every question. *\"What did vendor X charge us in 2024?\"* wants a tree organized by vendor, then by year. *\"Show me all contracts up for renewal next quarter\"* wants a tree organized by status and renewal date. Same corpus, two completely different trees.\n\nPageIndex builds the file system tree **on demand, conditioned on the query**. Given a question, it picks which metadata axes to use as internal nodes, which clusters to surface, and how deep to nest them, so the LLM is always navigating a hierarchy that is informative *for this query*. 
Different queries produce different views over the same documents, without re-ingesting or re-embedding anything.\n\nThe same machinery makes the index improve over time: traversal patterns from past queries refine the virtual nodes and metadata, so the more the system is used, the better the tree it builds for the next question.\n\nTree search adapts to whether the structure is informative\n\nA static traversal (always going layer by layer) is the wrong default at scale. Sometimes a node's children carry rich, query-relevant labels (`/contracts/2024/vendor_X/`), and the LLM should descend one layer at a time, using each label to prune. Sometimes the labels are uninformative (`misc/`, `folder_1/`, an arbitrary user-uploaded directory), and walking the structure layer by layer just burns LLM calls on signal-free intermediate nodes.\n\nPageIndex picks the strategy **per node, conditioned on the query**:\n\n- **Layer-wise**: when child labels are informative for this query, return the children and prune by their labels.\n- **Recursive (dynamic flattening)**: when child labels are uninformative, collapse the subtree down to its leaves and defer the discrimination to the actual content. Uninformative levels are bypassed entirely.\n\nThis dynamic flattening is what keeps tree search *efficient* at a million-document scale. The LLM never has to read structure that doesn't help it; the depth of the search shrinks to the depth that actually carries information for the question being asked.\n\nWhat's available today\n\n**PageIndex Enterprise**, with the **PageIndex File System** included, is generally available now:\n\n- Single-index scale to **millions of documents** via the PageIndex File System\n- Virtual-node synthesis and query-dependent index construction\n- Dedicated or VPC deployment\n\nThe **PageIndex Cloud** edition (same engine and file system, fully managed) is rolling out later this month. 
Existing OSS users keep everything that's in the open-source repo, and the OSS roadmap continues unchanged.\n\nIf you're hitting the wall with your current vector RAG stack — accuracy falling short, recall gaps you can't audit, retrieval blind to context — [get in touch](https://ii2abc2jejf.typeform.com/to/gVv7qkaN). We'd love to show you what tree search at scale looks like.\n\n— *The PageIndex Team*", "url": "https://wpnews.pro/news/a-file-level-tree-that-lets-an-llm-reason-over-a-document-corpus", "canonical_source": "https://pageindex.ai/blog/pageindex-filesystem", "published_at": "2026-05-27 10:37:11+00:00", "updated_at": "2026-05-27 10:45:59.388546+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-products", "ai-tools", "ai-startups", "large-language-models"], "entities": ["PageIndex", "GitHub", "GitHub Secure Open Source Fund", "PageIndex Enterprise"], "alternates": {"html": "https://wpnews.pro/news/a-file-level-tree-that-lets-an-llm-reason-over-a-document-corpus", "markdown": "https://wpnews.pro/news/a-file-level-tree-that-lets-an-llm-reason-over-a-document-corpus.md", "text": "https://wpnews.pro/news/a-file-level-tree-that-lets-an-llm-reason-over-a-document-corpus.txt", "jsonld": "https://wpnews.pro/news/a-file-level-tree-that-lets-an-llm-reason-over-a-document-corpus.jsonld"}}