One Open Source Project a Day (No. 71): CodeGraph — Pre-Index Your Codebase for AI Agents, Save 35% Cost and 70% Tool Calls

CodeGraph is an open-source tool that pre-indexes codebases into a local semantic graph using tree-sitter and SQLite, allowing AI coding agents to access structured code knowledge with a single tool call instead of performing multiple file scans and searches. Benchmarks across seven real projects show it reduces tool calls by 70% and costs by 35%, with one architecture query on VS Code's TypeScript repository dropping from 1.4 million tokens to 393,000 tokens. The tool exposes eight query tools via the Model Context Protocol (MCP) and supports live file synchronization through native OS events.

## Introduction

"~35% cheaper · ~70% fewer tool calls · 100% local"

This is article No. 71 in the "One Open Source Project a Day" series. Today we are exploring CodeGraph.

Start with a scenario: you ask Claude Code "How is AuthService being called?" Without any assistance, Claude's approach is to glob-scan directories, run multiple greps, and read several files before it finally answers. The whole process might trigger 10–15 tool calls and consume hundreds of thousands of tokens.

CodeGraph's insight is to front-load this work: before you start, it has already parsed your codebase with tree-sitter into a semantic graph stored in a local SQLite database, and it exposes 8 query tools to AI agents via MCP. When the agent needs to understand code, a single `codegraph_context` call returns entry points, related symbols, and code snippets, with no file reading required.

The project has 9.6k stars and 588 forks. Benchmarks across 7 real open-source projects show an average of 35% cost savings, 70% fewer tool calls, and a 49% speed improvement. On VS Code's large TypeScript repository, one architecture Q&A dropped from 1.4M tokens to 393k, and its cost from $0.64 to $0.42.

### What You Will Learn

- CodeGraph's four-stage pipeline: Extract → Store → Resolve → Auto-Sync
- The 8 MCP tools and when to use each
- A detailed breakdown of benchmark results across 7 projects: why do larger codebases benefit more?
- How 19-language support and 13-framework route recognition work
- Complete setup walkthrough from installation to Claude Code integration
- `codegraph affected`: using dependency tracing for smart CI test selection

### Prerequisites

- Familiarity with Claude Code, Cursor, or similar AI coding tools
- Basic understanding of MCP (Model Context Protocol)
- Node.js experience

## Project Background

### Project Introduction

CodeGraph is a local semantic code knowledge graph tool designed specifically to improve AI coding agent efficiency.
Its core insight: AI agents spend a massive amount of tokens and time in the "discovery phase" (scanning directories, searching for symbols, reading files) rather than on the actual reasoning and generation. CodeGraph's solution is to outsource the discovery phase to a pre-built index: before you start working, the index is already ready, letting AI agents pull structured code knowledge directly instead of exploring the file system from scratch.

The technology choices are pragmatic: tree-sitter for AST parsing (mature, multi-language, high-performance), SQLite FTS5 for full-text search (zero external dependencies, fully local), and native OS file events for live sync (FSEvents/inotify/ReadDirectoryChangesW).

### Author/Team

- **Author**: Colby McHenry (GitHub: colbymchenry)
- **Repository**: [colbymchenry/codegraph](https://github.com/colbymchenry/codegraph)
- **Distribution**: npm package `@colbymchenry/codegraph`

### Project Stats

- ⭐ GitHub Stars: 9,600+
- 🍴 Forks: 588
- 📦 npm package: `@colbymchenry/codegraph`
- 🔧 Runtime: Node.js 20–24
- 💻 Platforms: Windows, macOS, Linux
- 📄 License: MIT
- 🌐 Repository: [colbymchenry/codegraph](https://github.com/colbymchenry/codegraph)

## Main Features

### Core Utility

CodeGraph inserts a pre-built index layer between AI agents and codebases:

```
Codebase (TypeScript / Python / Go / ...)
    ↓ tree-sitter parsing
Semantic graph (symbols + relationships + call chains)
    ↓ stored in SQLite FTS5
Local knowledge base
    ↓ exposed via MCP
AI coding agents (Claude Code / Cursor / Codex CLI / OpenCode)
```

Without CodeGraph:

```
User: "How is AuthService being called?"
→ Agent: glob "src/**/*.ts"            (Tool call 1)
→ Agent: grep "AuthService"            (Tool call 2)
→ Agent: read "auth.service.ts"        (Tool call 3)
→ Agent: grep "import.*Auth"           (Tool call 4)
→ Agent: read "user.controller.ts"     (Tool call 5)
→ Agent: read "app.module.ts"          (Tool call 6)
... 10–15 total tool calls, massive token consumption
```

With CodeGraph:

```
User: "How is AuthService being called?"
→ Agent: codegraph_callers "AuthService"   (Tool call 1)
→ Returns: full caller list + call sites + code snippets
→ Agent answers directly, no file reading needed
```

### Quick Start

One-command install (recommended):

```bash
# Run the interactive installer: auto-detects installed AI agents and configures them
npx @colbymchenry/codegraph

# Initialize in your project (-i for interactive)
cd your-project
codegraph init -i
```

```bash
# Auto-detect all installed agents, global install
codegraph install --yes

# Target specific agents
codegraph install --target=cursor,claude --yes

# Project-local install
codegraph install --target=auto --location=local
```

```bash
npm install -g @colbymchenry/codegraph
```

Add to `~/.claude.json` or a project-level `.claude.json`:

```json
{
  "mcpServers": {
    "codegraph": {
      "type": "stdio",
      "command": "codegraph",
      "args": ["serve", "--mcp"]
    }
  }
}
```

```bash
codegraph status               # Check index status and stats
codegraph query "UserService"  # Test symbol search
```

### The 8 MCP Tools

The complete toolset CodeGraph exposes to AI agents:

| Tool | Purpose | Typical Invocation |
|---|---|---|
| `codegraph_search` | Find symbols by name | "Find all functions called authenticate" |
| `codegraph_context` | Build code context for a task | "What code is relevant to the login flow?" |
| `codegraph_callers` | Find what calls a function | "What calls AuthService?" |
| `codegraph_callees` | Find what a function calls | "What does processPayment call internally?" |
| `codegraph_impact` | Analyze change impact radius | "What breaks if I change this function?" |
| `codegraph_node` | Get details about a specific symbol | "Show me UserController's full signature" |
| `codegraph_files` | Get indexed file structure | "What is the overall project structure?" |
| `codegraph_status` | Check index health and stats | "How many symbols are indexed? Last sync?" |

`codegraph_context` is the most important tool. It doesn't just return search results; it assembles a context package for a given task, including entry points, related symbols, and code snippets:

```bash
# Command-line equivalent
codegraph context "fix user login bug"
# → Automatically finds login-related functions, call chains, and relevant files,
#   packaged into context Claude can consume directly
```

### Project Advantages

| Dimension | CodeGraph | Native AI Agent (no assist) | Other code indexers |
|---|---|---|---|
| Tool call count | ~70% fewer | High (re-scans each task) | Partial reduction |
| Token usage | ~59% fewer | High | Partial reduction |
| Data privacy | 100% local | Depends on agent | Most require uploads |
| Real-time sync | Native OS file events | N/A | Usually polling or manual |
| Language support | 19+ languages | Depends on agent | Usually 3–5 |
| Framework route detection | 13 frameworks | None | Rare |
| Installation complexity | One npx command | N/A | Usually requires server |

## Detailed Analysis

### 1. The Four-Stage Pipeline

#### Stage 1: Extraction

tree-sitter parses source files into ASTs, extracting:

- **Symbols**: functions, classes, methods, interfaces, variable definitions
- **Relationships**: function calls, module imports, class inheritance, interface implementations

tree-sitter's key advantage is that it is a fault-tolerant parser: it can extract partial structure even when code has syntax errors. This is critical for indexing files that are actively being edited.

#### Stage 2: Storage

All data lands in a local SQLite database using the FTS5 (Full-Text Search 5) extension:

```sql
-- Symbols table (simplified)
CREATE VIRTUAL TABLE symbols USING fts5(
  name,          -- Symbol name
  kind,          -- function/class/method/...
  file_path,     -- Source file
  line_start,    -- Starting line
  signature,     -- Function signature
  docstring,     -- Documentation comment
  code_snippet   -- Code excerpt
);

-- Relationships table
CREATE TABLE edges (
  from_id INTEGER,  -- Caller symbol ID
  to_id   INTEGER,  -- Callee symbol ID
  kind    TEXT,     -- calls/imports/inherits/implements
  file    TEXT,
  line    INTEGER
);
```

#### Stage 3: Resolution

Imports and call sites in the source are resolved into edges between symbols:

```js
// Source code:
import { AuthService } from './auth.service'
...
this.authService.login(user)

// ↓ resolution

// Graph edges:
//   UserController.login → AuthService.login   (calls)
//   UserController       → AuthService         (imports)
```

#### Stage 4: Auto-Sync

CodeGraph uses native OS file events (not polling) to detect changes:

- macOS: FSEvents
- Linux: inotify
- Windows: ReadDirectoryChangesW

A 2-second debounce prevents mass rebuilds when files change rapidly: it waits for changes to settle before doing incremental updates.

### 2. Benchmark Deep Dive

Test conditions: Claude Code (headless, Opus 4.7) answering architecture questions. Each result is the median of 4 runs on the same question, across 7 real open-source repositories.

```
Project      Language     Size           Cost ↓   Token ↓   Speed ↑   Tool Calls ↓
───────────────────────────────────────────────────────────────────────────────────
VS Code      TypeScript   ~10k files     35%      73%       41%       72%
Excalidraw   TypeScript   ~600 files     47%      73%       60%       86%
Django       Python       ~2.7k files    34%      64%       59%       81%
Tokio        Rust         ~700 files     52%      81%       63%       89%
OkHttp       Java         ~640 files     17%      41%       36%       64%
Gin          Go           ~150 files     22%      23%       34%       19%
Alamofire    Swift        ~100 files     38%      59%       51%       77%
───────────────────────────────────────────────────────────────────────────────────
Average                                  35%      59%       49%       70%
```

Patterns worth noting:

**Tokio (Rust, ~700 files) sees the biggest gains** (81% token reduction, 89% fewer tool calls): Rust's type system is complex, so agents originally needed extensive file exploration to understand trait implementations and generic relationships. CodeGraph's pre-built relationships make this dramatically cheaper.
**Gin (Go, ~150 files) sees the smallest gains** (23% token reduction, 19% fewer tool calls): Small Go projects have simple file structures. Agents can already navigate them efficiently, so CodeGraph's marginal value is lower.

**VS Code's absolute numbers are the most striking**: the same question costs $0.64 (1.4M tokens) without CodeGraph and $0.42 (393k tokens) with it. A single task saves $0.22.

**Takeaway**: The larger the codebase, the more complex the dependencies, and the richer the language's type system, the greater CodeGraph's benefit. For developers using Claude Code heavily on large projects, the ROI is clear.

### 3. 19 Languages + 13 Framework Route Detection

**Language support** (via tree-sitter grammars): TypeScript, JavaScript, Python, Go, Rust, Java, C#, PHP, Ruby, C, C++, Swift, Kotlin, Dart, Svelte, Vue, Liquid, Pascal/Delphi, Scala

**Framework route detection** is a differentiating feature: CodeGraph doesn't just recognize symbols, it understands the mapping between URL routes and their handler functions:

```python
# Django
urlpatterns = [
    path('users/<int:pk>/', UserDetailView.as_view()),
]
# → CodeGraph knows GET /users/{id}/ maps to UserDetailView

# FastAPI
@app.get("/items/{item_id}")
async def read_item(item_id: int): ...
# → CodeGraph knows GET /items/{id} maps to read_item
```

The 13 supported frameworks: Django, Flask, FastAPI, Express, NestJS, Laravel, Rails, Spring, Gin/chi/gorilla/mux, Axum/actix/Rocket, ASP.NET, Vapor, React Router/SvelteKit.

This means AI agents can ask "Where is the handler for `/api/users/:id`?" and get a precise answer, without needing to scan routing config files.

### 4. `codegraph affected`: Smart CI Test Selection

An underappreciated feature: by tracing import dependencies, it identifies which test files are actually affected by changed source files.

CI scenario: only run tests affected by this change
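
To make the dependency-tracing idea concrete, here is a minimal Python sketch of how affected tests could be derived from an import graph. It is an illustration under stated assumptions, not CodeGraph's actual implementation: the `affected_tests` function, the `IMPORTS` graph, the file names, and the `tests/` naming rule are all hypothetical. The approach inverts the "file imports file" edges and walks outward from the changed files, collecting every test file reached along the way.

```python
from collections import defaultdict, deque


def affected_tests(imports, changed, is_test):
    """Return the sorted test files that transitively import any changed file.

    imports: dict mapping a file to the list of files it imports.
    changed: iterable of changed file paths.
    is_test: predicate deciding whether a path is a test file.
    """
    # Invert the graph: for each file, record which files import it.
    importers = defaultdict(set)
    for source, deps in imports.items():
        for dep in deps:
            importers[dep].add(source)

    # Breadth-first walk from the changed files through their importers.
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for parent in importers[current]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)

    return sorted(path for path in seen if is_test(path))


# Hypothetical import graph (file -> files it imports), for illustration only.
IMPORTS = {
    "src/auth.service.ts": ["src/db.ts"],
    "src/user.controller.ts": ["src/auth.service.ts"],
    "src/billing.ts": ["src/db.ts"],
    "tests/auth.test.ts": ["src/auth.service.ts"],
    "tests/user.test.ts": ["src/user.controller.ts"],
    "tests/billing.test.ts": ["src/billing.ts"],
}


def is_test_file(path):
    # Assumed convention: everything under tests/ is a test file.
    return path.startswith("tests/")


if __name__ == "__main__":
    # A change to the auth service touches the auth and user tests, not billing.
    print(affected_tests(IMPORTS, ["src/auth.service.ts"], is_test_file))
```

In this toy graph, changing `src/auth.service.ts` selects the auth and user tests and skips the billing test, while changing the shared `src/db.ts` selects all three. A CI job could pass a list like this to the test runner instead of running the full suite.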