Stop Cloning Entire Repos for Your Doc Builds

Microsoft has open-sourced selective-repo-fetch, a TypeScript library that extracts only the documentation files needed for a build from large repositories instead of cloning the entire repo. The tool uses a two-step process: first matching a repository's file listing against the manifest patterns defined in documentation configs like docfx.json, then filtering resource files to only those actually referenced in content. This approach reduces a 200,000-file repository down to the roughly 50 files needed for a documentation build, which both speeds up builds and enables more efficient AI-powered documentation experiences.

Your docs live next to your code. That's the docs-as-code promise: version control, pull request reviews, CI/CD pipelines. It works beautifully. Until your repo hits 100,000 files.

Our team runs a documentation portal that pulls content from dozens of large repositories. Each doc build needs a handful of markdown files and images from repos containing hundreds of thousands of files. The naive approach, `git clone`, is painfully slow and wasteful.

We tried sparse checkout. We tried shallow clones. We tried the git provider APIs directly. Each came with its own problems.

The irony? The manifest already declares exactly which files are needed. The `docfx.json`, or whatever config your static site generator uses, lists every content glob, every resource pattern. We just weren't using that information early enough.

This isn't just a build-speed problem anymore. If you're building AI agents that answer questions about your product, help onboard developers, or assist with internal processes, they need access to your documentation. Not your code. Not your tests. The docs.

The challenge scales fast. The faster and more precisely you can extract documentation from your repositories, the fresher and more accurate your agents' knowledge becomes. Solving the selective fetch problem unlocks both faster builds and reliable AI-powered documentation experiences.

What if we flipped the order? Instead of:

```
clone everything → build → throw away 99% of the files
```

We do:

```
get the file listing → match against manifest → fetch only what matches
```

```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  Git Provider   │     │ selective-repo-fetch │     │  Doc Pipeline   │
│  file listing   │────▶│  manifest matching   │────▶│   build only    │
│                 │     │  + reference filter  │     │  matched files  │
└─────────────────┘     └──────────────────────┘     └─────────────────┘
```

A file tree listing from GitHub/Azure DevOps/GitLab is a single, cheap API call: it returns metadata, not file contents.
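
To make "metadata, not file contents" concrete, here is a minimal sketch of turning a recursive tree response into a flat file listing. The sample object mirrors the shape of a GitHub Git Trees API response (`tree`, `path`, `type`, `size`, `truncated`), but its values are invented for illustration, and `toFileListing` is a hypothetical helper, not part of selective-repo-fetch. Other providers return different shapes and would need their own adapter.

```javascript
// Hypothetical sample of a recursive tree response, trimmed to three entries.
// Note that entries carry paths and sizes only, never file contents.
const sampleTreeResponse = {
  truncated: false,
  tree: [
    { path: 'docs', type: 'tree' },
    { path: 'docs/intro.md', type: 'blob', size: 1824 },
    { path: 'src/index.ts', type: 'blob', size: 90211 },
  ],
};

// Illustrative helper (an assumption, not a library function): keep only
// files ("blob" entries) and prefix each path with "/".
function toFileListing(treeResponse) {
  // GitHub flags oversized recursive listings with `truncated: true`;
  // a partial listing would silently miss files, so fail loudly instead.
  if (treeResponse.truncated) {
    throw new Error('Tree listing was truncated; fetch subtrees individually');
  }
  return treeResponse.tree
    .filter((item) => item.type === 'blob')
    .map((item) => ({ path: '/' + item.path }));
}

const listing = toFileListing(sampleTreeResponse);
console.log(listing);
```

Failing on a truncated response is a deliberate choice in this sketch: for very large repositories a partial listing is worse than an error, because files missing from the listing can never be matched.
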
Match that listing against your manifest patterns, and you know exactly what to fetch.

We open-sourced this logic as a TypeScript library: [selective-repo-fetch](https://github.com/microsoft/selective-repo-fetch). It's MIT-licensed and provider-agnostic.

```bash
npm install github:microsoft/selective-repo-fetch
```

Here's the core workflow:

```js
import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

// Your manifest declares what your doc site needs
const manifest = {
  build: {
    content: [{ files: ['**/*.md'], src: 'docs' }],
    resource: [{ files: ['**/*.{png,jpg,svg}'], src: 'docs/images' }],
  },
};

// Step 1: Get file listing from any git API (one cheap metadata call).
// getTreeListing is your own helper, not part of the library.
const repoFiles = await getTreeListing(); // returns [{ path: '/docs/intro.md' }, ...]

// Step 2: Resolve manifest → content + resource matches
const matched = resolveFileMatches(repoFiles, manifest, '/', '/docfx.json');
// matched.contentMatches  → only the markdown files your build needs
// matched.resourceMatches → only images/videos matching resource globs
```

From 200,000 files down to the 50 that matter. One function call.

Glob matching is great, but it can be too generous. A `**/*.png` pattern in your resource section will match every image under that folder, even the ones no markdown file actually references. For large repos, this matters. Unreferenced images can be megabytes of wasted downloads. So we added a second pass:

```js
// Step 3: Fetch the content files (small text: fast and cheap).
// fetchFileContent is your own helper, not part of the library.
const contentFileTexts = {};
for (const filePath of matched.contentMatches) {
  contentFileTexts[filePath] = await fetchFileContent(filePath);
}

// Step 4: Filter resources to only those actually referenced
const referencedResources = filterReferencedResources(matched.resourceMatches, contentFileTexts);
// Scans markdown/HTML for ![](path), <img src="path">, [text](path), etc.
// Drops any resource not referenced by any content file
```

This scans your content files for markdown image references (`![](path)`), links (`[text](path)`), and HTML attributes (`src="path"`, `href="path"`). If a resource file isn't referenced anywhere in your content, it gets dropped.

Here's what it looks like end-to-end with the GitHub API:

```js
import { Octokit } from '@octokit/rest';
import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

const octokit = new Octokit({ auth: token });

// 1. One API call to get the full file tree (metadata only, no content).
//    GitHub sets data.truncated when a recursive tree exceeds its response limits.
const { data } = await octokit.git.getTree({ owner, repo, tree_sha: 'HEAD', recursive: 'true' });
const files = data.tree
  .filter((item) => item.type === 'blob')
  .map((item) => ({ path: '/' + item.path }));

// 2. Resolve manifest patterns
const manifest = JSON.parse(/* your docfx.json */);
const matched = resolveFileMatches(files, manifest, '/', '/docfx.json');

// 3. Fetch content files (small text)
const contentTexts = {};
for (const path of matched.contentMatches) {
  // Strip the leading "/" because the GitHub API expects repo-relative paths
  const { data: file } = await octokit.repos.getContent({ owner, repo, path: path.slice(1) });
  contentTexts[path] = Buffer.from(file.content, 'base64').toString('utf8');
}

// 4. Filter resources to only referenced ones
const resources = filterReferencedResources(matched.resourceMatches, contentTexts);

// 5. Fetch only referenced resources
// You now have the exact list: nothing wasted
```

The manifest matching is thorough:

- Glob patterns such as `**/*.{md,yml}`
- `src` path resolution
- Exclusions such as `exclude: ["**/draft/**"]`
- `.order` files
- Sources outside the docset folder such as `src: "../other-folder"`, discovered before you fetch

The reference filter handles:

- Markdown: `![alt](path)`, `[text](path)`
- HTML: `<img src="path">`, `<video src="path">`, `<a href="path">`
- Path forms: `~/`, leading `/`, query strings, anchors
- Non-file schemes: `mailto:`, `javascript:`

There's a downstream benefit we didn't anticipate when we first built this: making documentation efficiently available to AI agents.
If you're building agents that answer questions about your product, help onboard developers, or assist with internal processes, they need access to your documentation. But they don't need your entire codebase. They need the docs. The manifest-driven approach gives you exactly that separation:

```js
// Feed docs from multiple repos into your agent's knowledge base
for (const repo of repositories) {
  const files = await getTreeListing(repo);
  const matched = resolveFileMatches(files, repo.manifest, '/', '/docfx.json');

  // Only index documentation: not code, not tests, not configs
  for (const docPath of matched.contentMatches) {
    const content = await fetchFile(repo, docPath);
    await knowledgeBase.ingest({ path: docPath, repo: repo.name, content });
  }
}
```

The faster and more precisely you can extract documentation from your repos, the fresher and more accurate your agents' knowledge becomes. Efficient content fetching is the foundation of a reliable AI-powered docs experience.

The library is MIT-licensed and has zero opinions about your git provider: it works with any API that can give you a file listing.

```bash
npm install github:microsoft/selective-repo-fetch
```

GitHub: [microsoft/selective-repo-fetch](https://github.com/microsoft/selective-repo-fetch)

If your doc builds are slow because of large repos, give it a try. And if you have ideas for improvements, PRs are welcome.

What's the worst monorepo doc build experience you've had? I'd love to hear about it in the comments.