Stop Cloning Entire Repos for Your Doc Builds

Microsoft has open-sourced selective-repo-fetch, a TypeScript library that extracts only the documentation files needed for a build from large repositories instead of cloning the entire repo. The tool uses a two-step process: first it matches a repository's file listing against the manifest patterns defined in documentation configs like docfx.json, then it filters resource files down to those actually referenced in content. This approach reduces a 200,000-file repository to the roughly 50 files needed for a documentation build, which both speeds up builds and enables more efficient AI-powered documentation experiences.

Your docs live next to your code. That's the docs-as-code promise: version control, pull request reviews, CI/CD pipelines. It works beautifully. Until your repo hits 100,000 files.

Our team runs a documentation portal that pulls content from dozens of large repositories. Each doc build needs a handful of markdown files and images from repos containing hundreds of thousands of files. The naive approach, git clone, is painfully slow and wasteful.

We tried sparse checkout. We tried shallow clones. We tried the git provider APIs directly. Each came with its own problems.

The irony? The manifest already declares exactly which files are needed. The docfx.json, or whatever config your static site generator uses, lists every content glob and every resource pattern. We just weren't using that information early enough.

This isn't just a build-speed problem anymore. If you're building AI agents that answer questions about your product, help onboard developers, or assist with internal processes, they need access to your documentation. Not your code. Not your tests. The docs.

The challenge scales fast.

The faster and more precisely you can extract documentation from your repositories, the fresher and more accurate your agents' knowledge becomes.
Solving the selective fetch problem unlocks both faster builds and reliable AI-powered documentation experiences.

What if we flipped the order? Instead of:

clone everything → build → throw away 99% of the files

We do:

get the file listing → match against manifest → fetch only what matches

```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  Git Provider   │     │ selective-repo-fetch │     │  Doc Pipeline   │
│  file listing   │────▶│  manifest matching   │────▶│  build only     │
│                 │     │  + reference filter  │     │  matched files  │
└─────────────────┘     └──────────────────────┘     └─────────────────┘
```

A file tree listing from GitHub/Azure DevOps/GitLab is a single, cheap API call: it returns metadata, not file contents. Match that listing against your manifest patterns, and you know exactly what to fetch.

We open-sourced this logic as a TypeScript library: [selective-repo-fetch](https://github.com/microsoft/selective-repo-fetch). It's MIT-licensed and provider-agnostic.

```
npm install github:microsoft/selective-repo-fetch
```

Here's the core workflow:

```js
import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

// Your manifest declares what your doc site needs
const manifest = {
  build: {
    content: [{ files: ['**/*.md'], src: 'docs' }],
    resource: [{ files: ['**/*.{png,jpg,svg}'], src: 'docs/images' }],
  },
};

// Step 1: Get file listing from any git API (one cheap metadata call)
const repoFiles = await getTreeListing(); // returns [{ path: '/docs/intro.md' }, ...]

// Step 2: Resolve manifest → content + resource matches
const matched = resolveFileMatches(repoFiles, manifest, '/', '/docfx.json');
// matched.contentMatches → only the markdown files your build needs
// matched.resourceMatches → only images/videos matching resource globs
```

From 200,000 files down to the 50 that matter. One function call.

Glob matching is great, but it can be too generous.
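
To make the matching pass concrete, here is a minimal, dependency-free sketch of resolving glob patterns against a file listing. It is an illustration under stated assumptions, not the library's implementation: `globToRegExp`, `matchGroup`, the `RepoFile` and `FileGroup` types, and the sample listing are all hypothetical, and only a small glob subset (`**`, `*`, `{a,b}`) is handled.

```typescript
// Hypothetical types and helpers for illustration; not the selective-repo-fetch API.
interface RepoFile { path: string }
interface FileGroup { files: string[]; src: string }

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Convert a small glob subset (**, *, {a,b}) into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  let re = '';
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith('**/', i)) {
      re += '(?:.*/)?'; // zero or more directory levels
      i += 3;
    } else if (glob.startsWith('**', i)) {
      re += '.*'; // anything, including slashes
      i += 2;
    } else if (glob[i] === '*') {
      re += '[^/]*'; // anything within one path segment
      i += 1;
    } else if (glob[i] === '{' && glob.indexOf('}', i) !== -1) {
      const end = glob.indexOf('}', i);
      const alternatives = glob.slice(i + 1, end).split(',').map((a) => escapeRegExp(a));
      re += `(?:${alternatives.join('|')})`;
      i = end + 1;
    } else {
      re += escapeRegExp(glob[i]);
      i += 1;
    }
  }
  return new RegExp(`^${re}$`);
}

// Return the listing paths under group.src that match any of the group's globs.
function matchGroup(listing: RepoFile[], group: FileGroup): string[] {
  const base = '/' + group.src.replace(/^\/+|\/+$/g, '') + '/';
  const regexes = group.files.map((g) => globToRegExp(g));
  return listing
    .map((f) => f.path)
    .filter((p) => p.startsWith(base))
    .filter((p) => regexes.some((re) => re.test(p.slice(base.length))));
}

// Made-up listing: metadata only, no file contents.
const listing: RepoFile[] = [
  { path: '/docs/intro.md' },
  { path: '/docs/guide/setup.md' },
  { path: '/docs/images/logo.png' },
  { path: '/docs/images/old/unused.png' },
  { path: '/src/index.ts' },
  { path: '/README.md' },
];

const content = matchGroup(listing, { files: ['**/*.md'], src: 'docs' });
const resources = matchGroup(listing, { files: ['**/*.{png,jpg,svg}'], src: 'docs/images' });

console.log(content);   // the two markdown files under /docs
console.log(resources); // both PNGs, whether or not any markdown file uses them
```

Note that the resource glob selects purely by path: in this made-up listing, `unused.png` is matched alongside `logo.png` because the pattern knows nothing about which images the content actually uses.
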
A `**/*.png` pattern in your resource section will match every image under that folder, even the ones no markdown file actually references. For large repos, this matters. Unreferenced images can be megabytes of wasted downloads.

So we added a second pass:

```js
// Step 3: Fetch the content files (small text, fast and cheap)
const contentFileTexts = {};
for (const filePath of matched.contentMatches) {
  contentFileTexts[filePath] = await fetchFileContent(filePath);
}

// Step 4: Filter resources to only those actually referenced
const referencedResources = filterReferencedResources(matched.resourceMatches, contentFileTexts);
// Scans markdown/HTML for path ,
```