{"slug": "stop-cloning-entire-repos-for-your-doc-builds", "title": "Stop Cloning Entire Repos for Your Doc Builds", "summary": "Microsoft has open-sourced selective-repo-fetch, a TypeScript library that extracts only the documentation files needed for a build from large repositories instead of cloning the entire repo. The tool uses a two-step process: first matching a repository's file listing against the manifest patterns defined in documentation configs like docfx.json, then filtering resource files to only those actually referenced in content. This approach reduces a 200,000-file repository down to the roughly 50 files needed for a documentation build, solving both build-speed problems and enabling more efficient AI-powered documentation experiences.", "body_md": "Your docs live next to your code. That's the docs-as-code promise: version control, pull request reviews, CI/CD pipelines. It works beautifully.\n\nUntil your repo hits 100,000 files.\n\nOur team runs a documentation portal that pulls content from dozens of large repositories. Each doc build needs a handful of markdown files and images from repos containing hundreds of thousands of files. The naive approach, `git clone`, is painfully slow and wasteful.\n\nWe tried sparse checkout. We tried shallow clones. We tried the git provider APIs directly. Each came with its own problems.\n\nThe irony? **The manifest already declares exactly which files are needed.** The `docfx.json` (or whatever config your static site generator uses) lists every content glob, every resource pattern. We just weren't using that information early enough.\n\nThis isn't just a build-speed problem anymore. If you're building AI agents that answer questions about your product, help onboard developers, or assist with internal processes, they need access to your documentation. Not your code. Not your tests. 
The *docs*.\n\nThe challenge scales fast.\n\nThe faster and more precisely you can extract documentation from your repositories, the fresher and more accurate your agents' knowledge becomes. Solving the selective fetch problem unlocks both faster builds *and* reliable AI-powered documentation experiences.\n\nWhat if we flipped the order?\n\nInstead of: clone everything → build → throw away 99% of the files\n\nWe do: get the file listing → match against manifest → fetch only what matches\n\n```\n┌─────────────────┐     ┌──────────────────────┐     ┌─────────────────┐\n│  Git Provider    │     │ selective-repo-fetch  │     │  Doc Pipeline   │\n│  (file listing)  │────▶│  (manifest matching   │────▶│  (build only    │\n│                  │     │   + reference filter) │     │   matched files)│\n└─────────────────┘     └──────────────────────┘     └─────────────────┘\n```\n\nA file tree listing from GitHub/Azure DevOps/GitLab is typically a single, cheap API call: it returns metadata, not file contents. Match that listing against your manifest patterns, and you know exactly what to fetch.\n\nWe open-sourced this logic as a TypeScript library: [selective-repo-fetch](https://github.com/microsoft/selective-repo-fetch). 
It's MIT-licensed and provider-agnostic.\n\n```\nnpm install github:microsoft/selective-repo-fetch\n```\n\nHere's the core workflow:\n\n``` js\nimport { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';\n\n// Your manifest declares what your doc site needs\nconst manifest = {\n  build: {\n    content: [{ files: ['**/*.md'], src: 'docs' }],\n    resource: [{ files: ['**/*.{png,jpg,svg}'], src: 'docs/images' }],\n  },\n};\n\n// Step 1: Get file listing from any git API (one cheap metadata call)\nconst repoFiles = await getTreeListing(); // returns [{ path: '/docs/intro.md' }, ...]\n\n// Step 2: Resolve manifest → content + resource matches\nconst matched = resolveFileMatches(repoFiles, manifest, '/', '/docfx.json');\n// matched.contentMatches → only the markdown files your build needs\n// matched.resourceMatches → only images/videos matching resource globs\n```\n\nFrom 200,000 files down to the 50 that matter. One function call.\n\nGlob matching is great, but it can be too generous. A `**/*.png` pattern in your resource section will match every image under that folder, even the ones no markdown file actually references.\n\nFor large repos, this matters. 
Unreferenced images can be megabytes of wasted downloads.\n\nSo we added a second pass:\n\n``` js\n// Step 3: Fetch the content files (small text, fast and cheap)\nconst contentFileTexts = {};\nfor (const filePath of matched.contentMatches) {\n  contentFileTexts[filePath] = await fetchFileContent(filePath);\n}\n\n// Step 4: Filter resources to only those actually referenced\nconst referencedResources = filterReferencedResources(\n  matched.resourceMatches,\n  contentFileTexts\n);\n// Scans markdown/HTML for ![](path), <img src=\"path\">, [text](path), etc.\n// Drops any resource not referenced by any content file\n```\n\nThis scans your content files for markdown image references (`![](path)`), links (`[text](path)`), and HTML attributes (`src=\"path\"`, `href=\"path\"`). If a resource file isn't referenced anywhere in your content, it gets dropped.\n\nHere's what it looks like end-to-end with the GitHub API:\n\n``` js\nimport { Octokit } from '@octokit/rest';\nimport { resolveFileMatches, filterReferencedResources } from 'selective-repo-fetch';\n\nconst octokit = new Octokit({ auth: token });\n\n// 1. One API call to get the full file tree (metadata only, no content)\nconst { data } = await octokit.git.getTree({\n  owner, repo, tree_sha: 'HEAD', recursive: 'true'\n});\n// data.truncated is true if GitHub cut the listing short (very large trees)\n\nconst files = data.tree\n  .filter(item => item.type === 'blob')\n  .map(item => ({ path: '/' + item.path }));\n\n// 2. Resolve manifest patterns\nconst manifest = JSON.parse(/* your docfx.json */);\nconst matched = resolveFileMatches(files, manifest, '/', '/docfx.json');\n\n// 3. Fetch content files (small text)\nconst contentTexts = {};\nfor (const path of matched.contentMatches) {\n  const { data } = await octokit.repos.getContent({ owner, repo, path: path.slice(1) });\n  contentTexts[path] = Buffer.from(data.content, 'base64').toString();\n}\n\n// 4. 
Filter resources to only referenced ones\nconst resources = filterReferencedResources(matched.resourceMatches, contentTexts);\n\n// 5. Fetch only referenced resources\n// You now have the exact list: nothing wasted\n```\n\nThe manifest matching is thorough:\n\n- Glob patterns such as `*.{md,yml}`\n- `src` path resolution\n- Exclusions such as `exclude: [\"**/draft/**\"]`\n- `.order` files\n- `src: \"../other-folder\"`, discovered before you fetch\n\nThe reference filter handles:\n\n- `![alt](path)`, `[text](path)`\n- `<img src=\"path\">`, `<video src=\"path\">`, `<a href=\"path\">`\n- `~/`, leading `/`, query strings, anchors\n- `mailto:`, `javascript:`\n\nThere's a downstream benefit we didn't anticipate when we first built this: **making documentation efficiently available to AI agents**.\n\nIf you're building agents that answer questions about your product, help onboard developers, or assist with internal processes, they need access to your documentation. But they don't need your entire codebase. They need the *docs*.\n\nThe manifest-driven approach gives you exactly that separation:\n\n``` js\n// Feed docs from multiple repos into your agent's knowledge base\nfor (const repo of repositories) {\n  const files = await getTreeListing(repo);\n  const matched = resolveFileMatches(files, repo.manifest, '/', '/docfx.json');\n\n  // Only index documentation: not code, not tests, not configs\n  for (const docPath of matched.contentMatches) {\n    const content = await fetchFile(repo, docPath);\n    await knowledgeBase.ingest({ path: docPath, repo: repo.name, content });\n  }\n}\n```\n\nThe faster and more precisely you can extract documentation from your repos, the fresher and more accurate your agents' knowledge becomes. 
Efficient content fetching is the foundation of a reliable AI-powered docs experience.\n\nThe library is MIT-licensed and has zero opinions about your git provider — it works with any API that can give you a file listing.\n\n```\nnpm install github:microsoft/selective-repo-fetch\n```\n\nGitHub: [microsoft/selective-repo-fetch](https://github.com/microsoft/selective-repo-fetch)\n\nIf your doc builds are slow because of large repos, give it a try. And if you have ideas for improvements, PRs are welcome.\n\n*What's the worst monorepo doc build experience you've had? I'd love to hear about it in the comments.*", "url": "https://wpnews.pro/news/stop-cloning-entire-repos-for-your-doc-builds", "canonical_source": "https://dev.to/saipramod/stop-cloning-entire-repos-for-your-doc-builds-28i0", "published_at": "2026-05-27 01:27:33+00:00", "updated_at": "2026-05-27 01:51:40.322664+00:00", "lang": "en", "topics": ["ai-agents", "ai-infrastructure", "ai-tools", "mlops", "large-language-models"], "entities": ["docfx.json"], "alternates": {"html": "https://wpnews.pro/news/stop-cloning-entire-repos-for-your-doc-builds", "markdown": "https://wpnews.pro/news/stop-cloning-entire-repos-for-your-doc-builds.md", "text": "https://wpnews.pro/news/stop-cloning-entire-repos-for-your-doc-builds.txt", "jsonld": "https://wpnews.pro/news/stop-cloning-entire-repos-for-your-doc-builds.jsonld"}}