{"slug": "making-the-context-across-46-repositories-semantically-searchable-for-ai-part-2", "title": "Making the Context Across 46 Repositories Semantically Searchable for AI (Part 2)", "summary": "Ryan Tsuji, CTO at airCloset, solved the entry-point problem for a knowledge graph spanning 46 repositories by joining it with a separate database graph (db-graph) that already had semantic search via AI-generated embeddings. This approach allowed the code graph to inherit semantic context without manual annotations, using only boundary nodes for intent. The solution builds on a pattern proven in db-graph, which covers 1,133 tables and 10,815 columns.", "body_md": "Hi, I'm [Ryan](https://x.com/ryantsuji), CTO at airCloset.\n\nIn [Part 1](https://dev.to/ryantsuji/building-one-knowledge-graph-across-46-repositories-with-static-analysis-part-1-egm), I wrote about unifying 46 repositories of production code into a single knowledge graph via static analysis. The graph itself got built, but I closed the post with **four open issues**: no semantic search, node explosion, having to open the file to actually know what a function does, and the cost of writing a new parser every time a new boundary pattern showed up.\n\nThis Part 2 is about **how I solved the first one — the entry-point problem (no semantic search).** The other three are left exactly as Part 1 described them — I'll come back to them at the end, together with the new issues that surfaced once the entry-point problem was out of the way.\n\nThe reason to start with the entry-point problem is simple: if the graph exists but the only way to reach it is grep, the model ends up inferring anyway. The whole point — **\"give the model verified facts, not inference\"** — falls apart. 
So the entry-point problem had to be solved before the others.\n\nMonths earlier, I'd already solved the same structural problem in a different domain — the [db-graph project](https://zenn.dev/aircloset/articles/2731787582881a).\n\nInternally, we had a large number of DB tables spread across many services, and **no single person had the full picture**. Different people knew different pieces well, but the whole map didn't fit in anyone's head. So I built db-graph: extract schemas statically from ORM definitions, generate per-table descriptions with Gemini, embed them as 768-dimensional vectors in the graph, and make the whole thing semantically searchable in natural language.\n\nAt the time of that article it covered 991 tables. Today it spans **21 schemas / 1,133 tables / 10,815 columns**, and finding data in natural language without knowing table names is just how people work now.\n\nThe pattern that proved out there:\n\nStatic-analysis graph + AI-generated context = natural-language semantic search works.\n\nIf it worked for db-graph, it should work for code-graph. The moment that thought landed, I noticed something:\n\n**code-graph already contains \"DB table nodes\" as boundary nodes** — they're one of the boundary node types I covered in Part 1.\n\nSo if I just **join** code-graph and db-graph, code-graph automatically inherits db-graph's semantic context. Without writing a single annotation, the existing assets alone make the graph meaningfully richer.\n\nThat's where the idea of \"joining graphs\" first came up — not treating each graph as its own island, but designing the joins between them.\n\nJoining db-graph took care of DB context. But the remaining boundaries (API / Event) and the graph's entry-point type (Page) still need meaning attached. 
Static analysis alone can't pull intent out of those, so context has to come from somewhere else.\n\nThe choice was clear: **write the intent directly into the code via annotations** (the same approach used by cortex's internal knowledge graph, which I covered in [AI Harness Series, Part 2](https://dev.to/ryantsuji/the-heart-of-the-ai-harness-a-knowledge-graph-of-the-ai-by-the-ai-for-the-ai-series-part-2-53bm)).\n\nThe catch: you can't annotate all the functions across 46 repos. There must be tens of thousands of them. Asking established teams running an existing production codebase to retroactively annotate everything is just not realistic.\n\nBut here's the second realization:\n\nWhat matters is just the boundary nodes. So if I only annotate around the boundaries, that's enough.\n\nWhen an AI agent asks \"what breaks if I change this code\" or \"what other repos call this API,\" what it needs isn't a per-function logic explanation. It needs **boundary intent** — what is this screen for, what does this API return, what milestone in the business does this Event mark.\n\n= **Minimum annotations, maximum meaning.** That became the heart of the design.\n\nPutting it together (internally we call this annotation graph **service-product-graph**, or SPG):\n\nThree graphs sit **as peers, joined by SAME_ENTITY edges**. There's no hierarchy — **you can start from any graph and reach the others**.\n\nThe annotation graph is made of `@graph-*` tags written only around boundaries.\n\nThe entry point for AI agents is a single **MCP server** that traverses all three graphs. AI agents never hit db-graph directly — the annotation graph's MCP server proxies db-graph calls on their behalf.\n\nThe annotation graph has 7 node types: Page / Section / Dialog / Field / Action / Api / Task. 
The early version was screen-focused and called `screen-graph`, but once it grew to cover backend Api / Task, it was renamed to service-product-graph.\n\nHere's what an annotation looks like (fictional, but close in shape to the real ones):\n\n```\n/**\n * @graph-page /home\n * @graph-business Main screen. Members can see what they're currently renting, buy items, and initiate returns.\n * @graph-label Home Screen\n * @graph-has-section banners, wearing-items, wearing-return, delivery-status\n * @graph-has-dialog buying-modal, return-modal\n * @graph-navigates-to /return-procedure, /checkout, /my-karte\n * @graph-calls GET /api/v1/wearing\n * @graph-reads admin_delivery_orders, admin_rental_items\n * @graph-flow styling-loop\n * @graph-status monthly-member\n */\n```\n\nTwo things matter here:\n\n- `@graph-business`\n- `@graph-flow` / `@graph-status`\n\nThere's also `@graph-case` (the conditional pattern tag that test cases derive from), but that's for another time.\n\nThis is where it gets practical.\n\nOnce I committed to building the annotation graph, here were the constraints:\n\nIn other words: **don't mix humans and AI inside the same PR**.\n\nThe solution was to physically separate annotations onto their own branch.\n\nThis is the \"every line of code passes through an AI gate\" ideal from [AI Harness Series, Part 6](https://dev.to/ryantsuji/ai-isnt-something-to-trust-its-something-to-design-series-final-30aa), adapted to the constraints of an existing organization. cortex (the internal AI platform) is a monorepo I assemble from scratch, so \"every commit passes the AI gate\" actually holds there. For the 46-repo production system, that precondition doesn't hold. 
So instead of giving up on the ideal, I split it: **engineers' workflow on one branch, AI's annotation workflow on another, both running in parallel**.\n\nJust running the annotation pipeline doesn't guarantee the **quality of the joins** between the three graphs (code-graph / db-graph / annotation graph). So there's a set of SLOs that automatically check the consistency across the entire graph.\n\nThe main rules:\n\n- `HANDLES_API` handlers must have downstream function calls (= no handlers that receive an API and then do nothing)\n\nThese are really just **a naive question — \"shouldn't the boundaries connect to each other?\" — turned into an SLO.** If anything drops below threshold, an alert fires, and the trustworthiness of the whole graph gets defended every day.\n\nThe daily boundary-analysis cron from Part 1 (5% connection-rate drop = alert) was code-graph-only. This is a **cross-graph SLO** — it guards the joins between graphs themselves. Add a parser to one repo, write a new annotation, change a schema — whatever happens, by the next morning a quality drop in any join becomes visible.\n\nI've been writing \"join\" casually, but the actual joining wasn't that straightforward.\n\nStatic-analysis API / Page / Task nodes and annotation graph API / Page / Task nodes are created as **separate nodes**. They mean the same thing, but their names / paths / identifiers don't match by themselves — there's nothing automatic about lining them up.\n\nTo connect them, we generate a separate edge type called SAME_ENTITY. There are three bridges, with path normalization along these lines:\n\n- `/console/api/` to `/api/`\n- `/v1.x/` → `/`\n- `/:id`, `/{id}` to `/:dynamic`\n- `?` → strip trailing `:dynamic?` → finally fall back to a dynamic-dispatch boundary `:dynamic`, loosening progressively\n\nThere was also one operational footgun. The first implementation used `INSERT NOT EXISTS` to avoid duplicates. 
But BigQuery's streaming-buffer visibility lag let duplicates slip in — in one repo the edges roughly doubled from 106 to 214 overnight. We fixed it by rewriting to `MERGE INTO` to make the operation idempotent.\n\nWith all of this in place, the entry-point problem from the end of Part 1 was finally solved:\n\n\"the subscription-fee calculation for members seems off\"\n\nThrow this natural-language query at the annotation graph and vector search returns the related nodes (Page / Api / Function / DB table) **as facts**. From there, SAME_ENTITY takes you over to code-graph functions, including callers and callees in other repos. From the DB boundaries in code-graph, you can cross into db-graph and pull the relevant columns.\n\nThe entry point can be anywhere — \"what calls this table?\" starts from db-graph, \"what's the blast radius of this function?\" starts from code-graph, and both walk the same connected network. From a single natural-language query, or from a specific node, **you can now traverse all three graphs and get every relevant piece of code plus every relevant DB schema**.\n\nThe Part 1 lament — **\"the graph is there but the entry point is missing\"** — could finally be put to bed.\n\nFrom 2026-04-16 (first production deployment) to the time of writing — about 2.5 months — the annotation graph's MCP server has handled **~50,000 calls from ~73 users**. The breakdown:\n\nThe interesting line is the second one. 
\"Search the codebase in natural language\" is usually an engineer's tool — but once the entry-point problem was solved, **people outside engineering** started using it too, asking things like \"how does this feature actually work?\" or \"what's in this DB?\" in their own words.\n\nThis is adjacent to the \"non-engineers writing specs with AI\" trend I covered in [AI Harness Series, Part 5](https://dev.to/ryantsuji/the-author-doesnt-have-to-be-an-engineer-how-the-harness-holds-quality-series-part-5-12e4) — **a graph that can be queried by meaning starts to matter org-wide.** Call volume is overwhelmingly dominated by engineers, of course. The interesting thing is the **range of job roles starting to pick it up**. That's the real impact of solving the entry-point problem.\n\nThe MCP server is the cross-graph entry point. It exposes six tools — service search / service detail / API detail / data-flow tracing / impact-radius tracing / business-rule full-text search — and that's the only entry point AI agents ever touch.\n\nOne design choice worth calling out: **AI agents never talk to db-graph directly.** The annotation graph's MCP proxies db-graph calls. From the agent's side, the mental model stays simple: \"ask one MCP and get everything back.\"\n\nThat makes the full chain — \"Screen → API → Code → DB → Column\" — traversable in a single MCP tool call.\n\nSame approach as Part 1 (pulling commits from Jan–Mar). 
For Part 2, the key commits are from April–May.\n\n- `refactor(graph): rename screen-graph to service-product-graph` — declaration that the scope expands from screen-only to whole-service\n- `feat(graph): add Api and Task node types to service-product-graph parser` — Api / Task node types added\n- `feat(mcp): add cross-graph tools to service-product-graph MCP`\n- `feat(graph): add SAME_ENTITY bridge edges between service-product-graph and code-graph`\n- `feat(graph): resolve Redis keys to code-graph boundary nodes` — boundary resolution through Redis\n- `feat(service-product-graph): add EventBridge EMITS_TO support + SAME_ENTITY bridge`\n- `feat(code-graph, service-product-graph): improve SAME_ENTITY boundary bridge coverage` — 4-stage fallback locked in\n- `feat(auto-review): SPG annotation auto-maintenance pipeline`\n- `feat(service-product-graph): add Task SAME_ENTITY bridge to code-graph` — all three bridges in place\n- `feat(spg): add mall repos to SPG indexing` — mall repos indexed\n- `feat(spg): add Go-aware parser`\n\nApril 15 was the day \"expansion + cross-graph tools + bridges\" landed in close succession. Over the following week, \"Redis / EventBridge / Task bridges / annotation auto-maintenance\" stacked up one after another.\n\nIn particular, **the annotation auto-maintenance pipeline on April 21** is where the \"humans alone can't do this, but AI can\" promise from Part 1 got cashed in. From that point on, annotation shifted from \"humans grind through writing them\" to \"design the whole operation assuming AI writes them.\"\n\nSolving the entry-point problem didn't make everything clean. A few issues remain.\n\nThe frontend side is annotated heavily. Backend / Go / batch are still thin. **Some nodes will always be missing annotations** — that's structural, and you can't drive it to zero. It's an ongoing operational issue.\n\nThe Page bridge in particular has cases where multiple annotation Pages map to the same boundary — that's structural and unavoidable. 
Adding more strategies got coverage to 100%, but **guaranteeing \"every join is correct\" 100% is hard**.\n\nThe graph only carries the fact that \"this edge exists statically.\" How often that edge actually **gets used in production** isn't recorded. Piping production execution counts back into the static graph and surfacing dead-code edges as a separate signal — that's still untouched.\n\nEvery time a new repo enters production, the bridge normalization rules and per-repo patterns need adjusting. This is the annotation-graph-side version of Part 1's fourth issue (the cost of adding a new parser for every new boundary pattern).\n\nIn Part 1's closing note, I touched on the fact that the cortex side (the internal AI platform) bailed out of the code-graph approach **early** and bet on an annotation-based knowledge graph instead. The bail-out was fast enough that calling it \"thrown away\" wouldn't be wrong — but looking back across this whole series, the more accurate word is **\"evolved.\"**\n\nWhat it evolved into, in the end, is **three graphs joined as peers**:\n\nJoined by SAME_ENTITY, served to the agent through MCP. 
The thing static analysis alone couldn't deliver — querying by meaning — became workable by reusing the db-graph success pattern and adding minimal annotations only at the boundaries.\n\nAnd one more framing: paired with the [AI Harness Series, Parts 1–6](https://dev.to/ryantsuji/ai-isnt-something-to-trust-its-something-to-design-series-final-30aa), this series sits as:\n\n= the same philosophy (design without trusting AI), implemented under two different sets of constraints.\n\nThanks for reading this far.", "url": "https://wpnews.pro/news/making-the-context-across-46-repositories-semantically-searchable-for-ai-part-2", "canonical_source": "https://dev.to/ryantsuji/making-the-context-across-46-repositories-semantically-searchable-for-ai-part-2-51d9", "published_at": "2026-06-29 23:50:05+00:00", "updated_at": "2026-06-30 00:19:17.632689+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "developer-tools", "ai-agents"], "entities": ["Ryan Tsuji", "airCloset", "Gemini", "db-graph", "code-graph", "AI Harness"], "alternates": {"html": "https://wpnews.pro/news/making-the-context-across-46-repositories-semantically-searchable-for-ai-part-2", "markdown": "https://wpnews.pro/news/making-the-context-across-46-repositories-semantically-searchable-for-ai-part-2.md", "text": "https://wpnews.pro/news/making-the-context-across-46-repositories-semantically-searchable-for-ai-part-2.txt", "jsonld": "https://wpnews.pro/news/making-the-context-across-46-repositories-semantically-searchable-for-ai-part-2.jsonld"}}