# Stop Cloning Entire Repos for Your Doc Builds

> Source: <https://dev.to/saipramod/stop-cloning-entire-repos-for-your-doc-builds-28i0>
> Published: 2026-05-27 01:27:33+00:00

Your docs live next to your code. That's the docs-as-code promise — version control, pull request reviews, CI/CD pipelines. It works beautifully.

Until your repo hits 100,000 files.

Our team runs a documentation portal that pulls content from dozens of large repositories. Each doc build needs a handful of markdown files and images from repos containing hundreds of thousands of files. The naive approach, `git clone`, is painfully slow and wasteful.

We tried sparse checkout. We tried shallow clones. We tried the git provider APIs directly. Each came with its own problems.

The irony? **The manifest already declares exactly which files are needed.** The `docfx.json` (or whatever config your static site generator uses) lists every content glob, every resource pattern. We just weren't using that information early enough.

This isn't just a build-speed problem anymore. If you're building AI agents that answer questions about your product, help onboard developers, or assist with internal processes — they need access to your documentation. Not your code. Not your tests. The *docs*.

The challenge scales fast.

The faster and more precisely you can extract documentation from your repositories, the fresher and more accurate your agents' knowledge becomes. Solving the selective fetch problem unlocks both faster builds *and* reliable AI-powered documentation experiences.

What if we flipped the order?

Instead of: clone everything → build → throw away 99% of the files

We do: get the file listing → match against manifest → fetch only what matches

```
┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐
│  Git Provider    │     │ selective-repo-fetch  │     │  Doc Pipeline   │
│  (file listing)  │────▶│  (manifest matching   │────▶│  (build only    │
│                  │     │   + reference filter) │     │   matched files)│
└─────────────────┘     └──────────────────────┘     └─────────────────┘
```

A file tree listing from GitHub/Azure DevOps/GitLab is a single, cheap API call — it returns metadata, not file contents. Match that listing against your manifest patterns, and you know exactly what to fetch.

We open-sourced this logic as a TypeScript library: [selective-repo-fetch](https://github.com/microsoft/selective-repo-fetch). It's MIT-licensed and provider-agnostic.

```
npm install github:microsoft/selective-repo-fetch
```

Here's the core workflow:

``` js
import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

// Your manifest declares what your doc site needs
const manifest = {
  build: {
    content: [{ files: ['**/*.md'], src: 'docs' }],
    resource: [{ files: ['**/*.{png,jpg,svg}'], src: 'docs/images' }],
  },
};

// Step 1: Get file listing from any git API (one cheap metadata call)
const repoFiles = await getTreeListing(); // returns [{ path: '/docs/intro.md' }, ...]

// Step 2: Resolve manifest → content + resource matches
const matched = resolveFileMatches(repoFiles, manifest, '/', '/docfx.json');
// matched.contentMatches → only the markdown files your build needs
// matched.resourceMatches → only images/videos matching resource globs
```

From 200,000 files down to the 50 that matter. One function call.
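To make the matching step less of a black box, here is a minimal sketch of what resolving a single `{ files, src }` entry against a file listing involves. This is a hand-rolled illustration, not the library's implementation: `globToRegExp`, `matchEntry`, and the two types are hypothetical names, and the glob support is deliberately limited to `**`, `*`, and `{a,b}`.

```typescript
// Hypothetical types for this sketch; not the library's public API.
type RepoFile = { path: string };
type ManifestEntry = { files: string[]; src: string };

// Convert a small glob subset (**, *, {a,b}) into an anchored RegExp.
function globToRegExp(glob: string): RegExp {
  let out = '';
  for (let i = 0; i < glob.length; i++) {
    const ch = glob.charAt(i);
    if (glob.startsWith('**/', i)) {
      out += '(?:.*/)?'; // zero or more leading directories
      i += 2;
    } else if (glob.startsWith('**', i)) {
      out += '.*'; // trailing ** matches across directories
      i += 1;
    } else if (ch === '*') {
      out += '[^/]*'; // anything except a path separator
    } else if (ch === '{' && glob.indexOf('}', i) !== -1) {
      const end = glob.indexOf('}', i);
      out += '(?:' + glob.slice(i + 1, end).split(',').map(escapeLiteral).join('|') + ')';
      i = end;
    } else {
      out += escapeLiteral(ch);
    }
  }
  return new RegExp('^' + out + '$');
}

// Escape characters that are special inside a regular expression.
function escapeLiteral(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Keep the paths under `src` whose src-relative part matches any `files` glob.
function matchEntry(files: RepoFile[], entry: ManifestEntry): string[] {
  const src = entry.src.replace(/^\/+|\/+$/g, '');
  const base = src === '' || src === '.' ? '/' : '/' + src + '/';
  const patterns = entry.files.map((glob) => globToRegExp(glob));
  return files
    .map((file) => file.path)
    .filter((path) => path.startsWith(base))
    .filter((path) => patterns.some((re) => re.test(path.slice(base.length))));
}

const listing: RepoFile[] = [
  { path: '/docs/intro.md' },
  { path: '/docs/guide/setup.md' },
  { path: '/docs/images/logo.png' },
  { path: '/src/index.ts' },
  { path: '/README.md' },
];
const contentPaths = matchEntry(listing, { files: ['**/*.md'], src: 'docs' });
// → ['/docs/intro.md', '/docs/guide/setup.md']
```

Note that `/README.md` is left out because it sits outside `src`, which is the point: the manifest scopes the match before any file content is fetched.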

Glob matching is great, but it can be too generous. A `**/*.png` pattern in your resource section will match every image under that folder, even the ones no markdown file actually references.

For large repos, this matters. Unreferenced images can be megabytes of wasted downloads.

So we added a second pass:

``` js
// Step 3: Fetch the content files (small text — fast and cheap)
const contentFileTexts = {};
for (const filePath of matched.contentMatches) {
  contentFileTexts[filePath] = await fetchFileContent(filePath);
}

// Step 4: Filter resources to only those actually referenced
const referencedResources = filterReferencedResources(
  matched.resourceMatches,
  contentFileTexts
);
// Scans markdown/HTML for ![](path), <img src="path">, [text](path), etc.
// Drops any resource not referenced by any content file
```

This scans your content files for markdown image references (`![](path)`), links (`[text](path)`), and HTML attributes (`src="path"`, `href="path"`). If a resource file isn't referenced anywhere in your content, it gets dropped.
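If you're curious what a reference scan like this boils down to, here is a rough sketch using two regular expressions. It is an illustration of the idea only, not how `filterReferencedResources` is implemented: `extractReferences`, `resolveRelative`, and `keepReferenced` are hypothetical helpers, and the sketch assumes plain relative targets with no query strings or anchors.

```typescript
// Collect link-like targets from Markdown and inline HTML (illustrative regexes only).
function extractReferences(text: string): string[] {
  const patterns = [
    /!?\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)/g, // ![alt](target "title") and [text](target)
    /\b(?:src|href)\s*=\s*["']([^"']+)["']/gi, // src="target" and href='target'
  ];
  const found: string[] = [];
  for (const pattern of patterns) {
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(text)) !== null) {
      const target = match[1];
      if (target && found.indexOf(target) === -1) found.push(target);
    }
  }
  return found;
}

// Resolve a relative target against the directory of the file that references it.
function resolveRelative(fromFile: string, target: string): string {
  const parts = fromFile.split('/').slice(0, -1).filter((seg) => seg !== '');
  for (const seg of target.split('/')) {
    if (seg === '' || seg === '.') continue;
    if (seg === '..') parts.pop();
    else parts.push(seg);
  }
  return '/' + parts.join('/');
}

// Keep only the resources that at least one content file points at.
function keepReferenced(resources: string[], contentTexts: Record<string, string>): string[] {
  const referenced = new Set<string>();
  for (const file of Object.keys(contentTexts)) {
    for (const target of extractReferences(contentTexts[file] ?? '')) {
      referenced.add(resolveRelative(file, target));
    }
  }
  return resources.filter((resource) => referenced.has(resource));
}

const sampleTexts: Record<string, string> = {
  '/docs/intro.md': '![Logo](images/logo.png)\n<img src="images/arch.svg">',
};
const kept = keepReferenced(
  ['/docs/images/logo.png', '/docs/images/arch.svg', '/docs/images/unused.png'],
  sampleTexts,
);
// → ['/docs/images/logo.png', '/docs/images/arch.svg']
```

A regex pass like this is cheap because it only reads the small text files you already fetched; a real Markdown parser would be more accurate but is rarely needed just to decide which images to download.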

Here's what it looks like end-to-end with the GitHub API:

``` ts
import { Octokit } from '@octokit/rest';
import { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';

const octokit = new Octokit({ auth: token });

// 1. One API call to get the full file tree (metadata only, no content)
const { data } = await octokit.git.getTree({
  owner, repo, tree_sha: 'HEAD', recursive: 'true'
});

const files = data.tree
  .filter(item => item.type === 'blob')
  .map(item => ({ path: '/' + item.path }));

// 2. Resolve manifest patterns
const manifest = JSON.parse(/* your docfx.json */);
const matched = resolveFileMatches(files, manifest, '/', '/docfx.json');

// 3. Fetch content files (small text)
const contentTexts: Record<string, string> = {};
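for (const path of matched.contentMatches) {
  const { data: file } = await octokit.repos.getContent({ owner, repo, path: path.slice(1) });
  // getContent returns an array for directories; only file entries carry base64 `content`
  if (Array.isArray(file) || !('content' in file)) continue;
  contentTexts[path] = Buffer.from(file.content, 'base64').toString('utf8');
}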

// 4. Filter resources to only referenced ones
const resources = filterReferencedResources(matched.resourceMatches, contentTexts);

// 5. Fetch only referenced resources
// You now have the exact list — nothing wasted
```
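One caveat for very large repos: GitHub's recursive tree listing has documented size limits, and when a tree exceeds them the response comes back with `truncated: true` and an incomplete `tree` array. The sketch below shows one way to guard against silently matching a partial listing. `toRepoFiles`, `TreeItem`, and `TreeResponse` are hypothetical names for this example, and the types cover only the response fields the sketch reads.

```typescript
// Only the fields this sketch reads from a Git Trees API response.
type TreeItem = { path?: string; type?: string };
type TreeResponse = { tree: TreeItem[]; truncated?: boolean };

// Convert a tree response into the [{ path: '/...' }] shape used above,
// refusing to continue if the listing is incomplete.
function toRepoFiles(response: TreeResponse): { path: string }[] {
  if (response.truncated) {
    throw new Error('Tree listing was truncated; fetch sub-trees one at a time instead.');
  }
  const files: { path: string }[] = [];
  for (const item of response.tree) {
    // Skip directories ("tree") and submodules ("commit"); keep files ("blob")
    if (item.type === 'blob' && item.path) files.push({ path: '/' + item.path });
  }
  return files;
}

const sampleResponse: TreeResponse = {
  truncated: false,
  tree: [
    { path: 'docs', type: 'tree' },
    { path: 'docs/intro.md', type: 'blob' },
    { path: 'docs/images/logo.png', type: 'blob' },
  ],
};
const repoFileList = toRepoFiles(sampleResponse);
// → [{ path: '/docs/intro.md' }, { path: '/docs/images/logo.png' }]
```

Failing loudly is a deliberate choice here: a truncated listing would otherwise produce a build that quietly misses pages. The fallback GitHub suggests is to request trees non-recursively and walk one sub-tree at a time, which you could scope to the folders your manifest's `src` entries point at.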

The manifest matching is thorough:

- Brace-expansion globs (`*.{md,yml}`)
- `src` path resolution
- Exclude patterns (`exclude: ["**/draft/**"]`)
- `.order` files
- `src: "../other-folder"` paths, discovered before you fetch

The reference filter handles:

- Markdown: `![alt](path)`, `[text](path)`
- HTML: `<img src="path">`, `<video src="path">`, `<a href="path">`
- Path forms: `~/`, leading `/`, query strings, anchors
- Non-file URL schemes: `mailto:`, `javascript:`
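To give a feel for the path handling involved, here is a small sketch of normalizing one reference target into a repo-rooted path. It rests on assumptions that are ours, not confirmed library behavior: `normalizeReference` is a hypothetical helper, `~/` is treated as the docset root (the folder holding the manifest), a leading `/` is treated as the repo root, and anything with a URL scheme is skipped.

```typescript
// Returns a repo-rooted path, or null when the target is not a file in the repo.
// Assumptions: "~/" means the docset root, a leading "/" means the repo root.
function normalizeReference(
  target: string,
  fromFile: string,
  docsetRoot: string = '/',
): string | null {
  // Skip URL schemes (mailto:, javascript:, https:, ...) and protocol-relative URLs
  if (/^[a-z][a-z0-9+.-]*:/i.test(target) || target.startsWith('//')) return null;

  // Drop query strings and anchors
  const clean = target.split(/[?#]/)[0] ?? '';
  if (clean === '') return null; // pure in-page anchor such as "#install"

  let base: string[];
  let rest = clean;
  if (clean.startsWith('~/')) {
    base = docsetRoot.split('/');
    rest = clean.slice(2);
  } else if (clean.startsWith('/')) {
    base = [];
  } else {
    base = fromFile.split('/').slice(0, -1); // directory of the referencing file
  }

  // Collapse ".", "..", and empty segments
  const parts = base.filter((seg) => seg !== '');
  for (const seg of rest.split('/')) {
    if (seg === '' || seg === '.') continue;
    if (seg === '..') parts.pop();
    else parts.push(seg);
  }
  return '/' + parts.join('/');
}

const fromPage = '/docs/guide/setup.md';
const a = normalizeReference('../images/a.png?raw=true#top', fromPage); // → '/docs/images/a.png'
const b = normalizeReference('~/images/a.png', fromPage, '/docs');      // → '/docs/images/a.png'
const c = normalizeReference('mailto:team@example.com', fromPage);      // → null
```

Normalizing every target to the same repo-rooted form is what lets the filter compare references against the resource list with a simple set lookup.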

There's a downstream benefit we didn't anticipate when we first built this: **making documentation efficiently available to AI agents**.

If you're building agents that answer questions about your product, help onboard developers, or assist with internal processes — they need access to your documentation. But they don't need your entire codebase. They need the *docs*.

The manifest-driven approach gives you exactly that separation:

``` js
// Feed docs from multiple repos into your agent's knowledge base
for (const repo of repositories) {
  const files = await getTreeListing(repo);
  const matched = resolveFileMatches(files, repo.manifest, '/', '/docfx.json');

  // Only index documentation — not code, not tests, not configs
  for (const docPath of matched.contentMatches) {
    const content = await fetchFile(repo, docPath);
    await knowledgeBase.ingest({ path: docPath, repo: repo.name, content });
  }
}
```

The faster and more precisely you can extract documentation from your repos, the fresher and more accurate your agents' knowledge becomes. Efficient content fetching is the foundation of a reliable AI-powered docs experience.

The library is MIT-licensed and has zero opinions about your git provider — it works with any API that can give you a file listing.

```
npm install github:microsoft/selective-repo-fetch
```

GitHub: [microsoft/selective-repo-fetch](https://github.com/microsoft/selective-repo-fetch)

If your doc builds are slow because of large repos, give it a try. And if you have ideas for improvements, PRs are welcome.

*What's the worst monorepo doc build experience you've had? I'd love to hear about it in the comments.*
