Stop Cloning Entire Repos for Your Doc Builds

wpnews.pro

Your docs live next to your code. That's the docs-as-code promise — version control, pull request reviews, CI/CD pipelines. It works beautifully.

Until your repo hits 100,000 files.

Our team runs a documentation portal that pulls content from dozens of large repositories. Each doc build needs a handful of markdown files and images from repos containing hundreds of thousands of files. The naive approach, `git clone`, is painfully slow and wasteful.

We tried sparse checkout. We tried shallow clones. We tried the git provider APIs directly. Each came with its own problems.

The irony? The manifest already declares exactly which files are needed. The `docfx.json` (or whatever config your static site generator uses) lists every content glob, every resource pattern. We just weren't using that information early enough.

This isn't just a build-speed problem anymore. If you're building AI agents that answer questions about your product, help onboard developers, or assist with internal processes — they need access to your documentation. Not your code. Not your tests. The docs.

The challenge scales fast.

The faster and more precisely you can extract documentation from your repositories, the fresher and more accurate your agents' knowledge becomes. Solving the selective fetch problem unlocks both faster builds and reliable AI-powered documentation experiences.

What if we flipped the order?

Instead of: clone everything → build → throw away 99% of the files

We do: get the file listing → match against manifest → fetch only what matches

┌──────────────────┐     ┌───────────────────────┐     ┌─────────────────┐
│  Git Provider    │     │ selective-repo-fetch  │     │  Doc Pipeline   │
│  (file listing)  │────▶│  (manifest matching   │────▶│  (build only    │
│                  │     │   + reference filter) │     │   matched files)│
└──────────────────┘     └───────────────────────┘     └─────────────────┘

A file tree listing from GitHub/Azure DevOps/GitLab is a cheap metadata request: it returns paths, not file contents (very large listings may be paginated or truncated, depending on the provider). Match that listing against your manifest patterns, and you know exactly what to fetch.
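To make the "match the listing against the manifest" step concrete, here is a minimal sketch of the idea. It is not the library's implementation: `globToRegExp` and `matchListing` are hypothetical helpers written for this illustration, and the matcher only understands `**`, `*`, and a single level of `{a,b}` alternatives.

```typescript
// Illustrative sketch only; these helpers are NOT part of selective-repo-fetch.
interface RepoFile {
  path: string;
}

// Escape characters that have special meaning inside a regular expression.
function escapeLiteral(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Convert a glob such as "**/*.{png,jpg}" into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  let source = '';
  let i = 0;
  while (i < glob.length) {
    if (glob.startsWith('**/', i)) {
      source += '(?:.*/)?'; // zero or more directories
      i += 3;
    } else if (glob.startsWith('**', i)) {
      source += '.*';
      i += 2;
    } else if (glob.charAt(i) === '*') {
      source += '[^/]*'; // anything except a path separator
      i += 1;
    } else if (glob.charAt(i) === '{' && glob.indexOf('}', i) !== -1) {
      const close = glob.indexOf('}', i);
      const alternatives = glob.slice(i + 1, close).split(',').map(escapeLiteral);
      source += '(?:' + alternatives.join('|') + ')';
      i = close + 1;
    } else {
      source += escapeLiteral(glob.charAt(i));
      i += 1;
    }
  }
  return new RegExp('^' + source + '$');
}

// Keep files under `src` whose path relative to `src` matches any glob.
function matchListing(files: RepoFile[], src: string, globs: string[]): string[] {
  const trimmed = src.replace(/^\/+|\/+$/g, '');
  const prefix = trimmed === '' ? '/' : '/' + trimmed + '/';
  const patterns = globs.map(globToRegExp);
  return files
    .map(file => file.path)
    .filter(path => path.startsWith(prefix))
    .filter(path => patterns.some(pattern => pattern.test(path.slice(prefix.length))));
}

// A made-up listing standing in for the provider's tree response.
const listing: RepoFile[] = [
  { path: '/docs/intro.md' },
  { path: '/docs/guide/setup.md' },
  { path: '/docs/images/logo.png' },
  { path: '/src/index.ts' },
  { path: '/README.md' },
];

const contentPaths = matchListing(listing, 'docs', ['**/*.md']);
const resourcePaths = matchListing(listing, 'docs/images', ['**/*.{png,jpg,svg}']);
```

Nothing here touches file contents: the whole decision is made from paths alone, which is why it can run before any download starts.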

We open-sourced this logic as a TypeScript library: selective-repo-fetch. It's MIT-licensed and provider-agnostic.

npm install github:microsoft/selective-repo-fetch

Here's the core workflow:

import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

// Your manifest declares what your doc site needs
const manifest = {
  build: {
    content: [{ files: ['**/*.md'], src: 'docs' }],
    resource: [{ files: ['**/*.{png,jpg,svg}'], src: 'docs/images' }],
  },
};

// Step 1: Get file listing from any git API (one cheap metadata call)
const repoFiles = await getTreeListing(); // returns [{ path: '/docs/intro.md' }, ...]

// Step 2: Resolve manifest → content + resource matches
const matched = resolveFileMatches(repoFiles, manifest, '/', '/docfx.json');
// matched.contentMatches → only the markdown files your build needs
// matched.resourceMatches → only images/videos matching resource globs

From 200,000 files down to the 50 that matter. One function call.

Glob matching is great, but it can be too generous. A `**/*.png` pattern in your resource section will match every image under that folder, even the ones no markdown file actually references.

For large repos, this matters. Unreferenced images can be megabytes of wasted downloads.

So we added a second pass:

// Step 3: Fetch the content files (small text — fast and cheap)
const contentFileTexts: Record<string, string> = {};
for (const filePath of matched.contentMatches) {
  contentFileTexts[filePath] = await fetchFileContent(filePath);
}

// Step 4: Filter resources to only those actually referenced
const referencedResources = filterReferencedResources(
  matched.resourceMatches,
  contentFileTexts
);
// Scans markdown/HTML for ![](path), <img src="path">, [text](path), etc.
// Drops any resource not referenced by any content file

This scans your content files for markdown image references (`![](path)`), links (`[text](path)`), and HTML attributes (`src="path"`, `href="path"`). If a resource file isn't referenced anywhere in your content, it gets dropped.
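For intuition, a reference filter along these lines can be sketched with two regular expressions and a small path resolver. This is an illustration under simplifying assumptions, not how `filterReferencedResources` is actually implemented: `extractReferences`, `resolveRef`, and `keepReferenced` are hypothetical names, and query strings and anchors are ignored here.

```typescript
// Illustrative sketch only; NOT the library's implementation.

// Collect link targets from markdown links/images and HTML src/href attributes.
function extractReferences(text: string): string[] {
  const refs: string[] = [];
  const patterns = [
    /!?\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)/g, // [text](target) and ![alt](target "title")
    /\b(?:src|href)\s*=\s*["']([^"']+)["']/gi, // src="target" or href='target'
  ];
  for (const pattern of patterns) {
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      const target = match[1];
      if (target) refs.push(target);
    }
  }
  return refs;
}

// Resolve `ref` against the directory of `fromFile`, collapsing "." and "..".
function resolveRef(fromFile: string, ref: string): string {
  const base = ref.startsWith('/') ? [] : fromFile.split('/').slice(0, -1);
  const parts = base.filter(part => part !== '');
  for (const segment of ref.split('/')) {
    if (segment === '' || segment === '.') continue;
    if (segment === '..') parts.pop();
    else parts.push(segment);
  }
  return '/' + parts.join('/');
}

// Keep only the resources that some content file points at.
// External URLs resolve to paths that match no resource, so they drop out naturally.
function keepReferenced(resources: string[], contentTexts: Record<string, string>): string[] {
  const referenced = new Set<string>();
  for (const file of Object.keys(contentTexts)) {
    for (const ref of extractReferences(contentTexts[file] ?? '')) {
      referenced.add(resolveRef(file, ref));
    }
  }
  return resources.filter(resource => referenced.has(resource));
}

// Made-up content standing in for a fetched markdown file.
const sampleTexts: Record<string, string> = {
  '/docs/intro.md':
    '# Intro\n![Logo](images/logo.png)\n<img src="./images/arch.svg">\n[Site](https://example.com)',
};

const kept = keepReferenced(
  ['/docs/images/logo.png', '/docs/images/arch.svg', '/docs/images/unused.png'],
  sampleTexts,
);
```

In this sample, `unused.png` matches the resource glob but is never referenced, so it is the one file the second pass saves you from downloading.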

Here's what it looks like end-to-end with the GitHub API:

import { Octokit } from '@octokit/rest';
import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

const octokit = new Octokit({ auth: token });

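// 1. One API call to get the full file tree (metadata only, no content)
const { data } = await octokit.git.getTree({
  owner, repo, tree_sha: 'HEAD', recursive: 'true'
});
// GitHub sets `truncated` when a recursive tree exceeds its response limits,
// so don't trust the listing on very large repos without checking it
if (data.truncated) throw new Error('Tree listing was truncated');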

const files = data.tree
  .filter(item => item.type === 'blob')
  .map(item => ({ path: '/' + item.path }));

// 2. Resolve manifest patterns
const manifest = JSON.parse(manifestText); // manifestText: the contents of your docfx.json
const matched = resolveFileMatches(files, manifest, '/', '/docfx.json');

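// 3. Fetch content files (small text)
const contentTexts: Record<string, string> = {};
for (const path of matched.contentMatches) {
  const { data: file } = await octokit.repos.getContent({ owner, repo, path: path.slice(1) });
  // getContent returns an array for directories; only single files carry base64 content
  if (Array.isArray(file) || file.type !== 'file') continue;
  contentTexts[path] = Buffer.from(file.content, 'base64').toString('utf8');
}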

// 4. Filter resources to only referenced ones
const resources = filterReferencedResources(matched.resourceMatches, contentTexts);

// 5. Fetch only referenced resources
// You now have the exact list — nothing wasted
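Step 5 is left open above. One way to do it, sketched here under assumptions, is to run the downloads through a small worker pool so that only a few requests are in flight at once. `fetchAll` is a hypothetical helper, not part of the library, and the fetch function is injected so the sketch stays provider-agnostic.

```typescript
// Illustrative sketch only; `fetchAll` is a hypothetical helper.
// Runs `fetchOne` over `paths` with at most `limit` requests in flight.
async function fetchAll<T>(
  paths: string[],
  fetchOne: (path: string) => Promise<T>,
  limit = 4,
): Promise<Map<string, T>> {
  const results = new Map<string, T>();
  let next = 0;

  // Each worker repeatedly claims the next unclaimed path until none remain.
  async function worker(): Promise<void> {
    while (next < paths.length) {
      const path = paths[next++];
      if (path === undefined) break;
      results.set(path, await fetchOne(path));
    }
  }

  const workerCount = Math.max(1, Math.min(limit, paths.length));
  const workers = Array.from({ length: workerCount }, () => worker());
  // Rejects as soon as any single fetch fails; add retries if you need them.
  await Promise.all(workers);
  return results;
}

// Usage sketch (fetchBlob is a hypothetical wrapper around your provider's API):
//   const blobs = await fetchAll(resources, fetchBlob, 8);
```

A bounded pool is a reasonable default because most provider APIs enforce rate limits; the right value for `limit` depends on your provider and token.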

The manifest matching is thorough:

- Glob patterns, including brace sets (`*.{md,yml}`)
- `src` path resolution
- Exclusions (`exclude: ["**/draft/**"]`)
- `.order` files
- `src: "../other-folder"`, discovered before you fetch

The reference filter handles:

- `![alt](path)`, `[text](path)`
- `<img src="path">`, `<video src="path">`, `<a href="path">`
- `~/`, leading `/`, query strings, anchors
- `mailto:`, `javascript:`
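The path cases in that list can be pictured as one normalization step applied to each reference before it is compared with the file listing. The sketch below is an assumption about what such a step might look like, not the library's behavior: `normalizeReference` is a hypothetical function, and treating both `~/` and a leading `/` as root-relative is a simplification chosen for the example.

```typescript
// Illustrative sketch only; the rules below are assumptions, not the library's behavior.
type NormalizedRef = { kind: 'root' | 'relative'; path: string } | null;

function normalizeReference(ref: string): NormalizedRef {
  const trimmed = ref.trim();

  // Anything with a URI scheme (https:, mailto:, javascript:) or a
  // protocol-relative "//host" prefix cannot be a file in the repo.
  if (/^[a-z][a-z0-9+.-]*:/i.test(trimmed) || trimmed.startsWith('//')) return null;

  // Drop the query string and anchor; a bare "#section" leaves nothing behind.
  const path = trimmed.split(/[?#]/)[0] ?? '';
  if (path === '') return null;

  // Assumption: "~/" and a leading "/" both mean "relative to the docs root".
  if (path.startsWith('~/')) return { kind: 'root', path: path.slice(2) };
  if (path.startsWith('/')) return { kind: 'root', path: path.replace(/^\/+/, '') };

  // Everything else is relative to the file that contains the reference.
  return { kind: 'relative', path };
}

const examples = [
  '~/images/a.png?v=2#top',
  '/images/a.png',
  'img/a.png#frag',
  'mailto:docs@example.com',
  'javascript:void(0)',
  '#top',
].map(normalizeReference);
```

Returning `null` for non-file references keeps the caller simple: anything that survives normalization is a candidate path to look up in the listing.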

There's a downstream benefit we didn't anticipate when we first built this: making documentation efficiently available to AI agents.

If you're building agents that answer questions about your product, help onboard developers, or assist with internal processes — they need access to your documentation. But they don't need your entire codebase. They need the docs.

The manifest-driven approach gives you exactly that separation:

// Feed docs from multiple repos into your agent's knowledge base
for (const repo of repositories) {
  const files = await getTreeListing(repo);
  const matched = resolveFileMatches(files, repo.manifest, '/', '/docfx.json');

  // Only index documentation — not code, not tests, not configs
  for (const docPath of matched.contentMatches) {
    const content = await fetchFile(repo, docPath);
    await knowledgeBase.ingest({ path: docPath, repo: repo.name, content });
  }
}

The faster and more precisely you can extract documentation from your repos, the fresher and more accurate your agents' knowledge becomes. Efficient content fetching is the foundation of a reliable AI-powered docs experience.

The library is MIT-licensed and has zero opinions about your git provider — it works with any API that can give you a file listing.

npm install github:microsoft/selective-repo-fetch

GitHub: microsoft/selective-repo-fetch

If your doc builds are slow because of large repos, give it a try. And if you have ideas for improvements, PRs are welcome.

What's the worst monorepo doc build experience you've had? I'd love to hear about it in the comments.

source & further reading

dev.to (original article)
