Stop Letting LLMs Hallucinate Your Codebase: A Graph-First Way to Summarize Repos

A new open-source Python project, code-graph-ai-summarizer, reduces LLM hallucinations in code summarization by first building a structured Code Property Graph (CPG) with the static analysis tool Joern, then feeding only graph-derived facts to the LLM for narration. The tool extracts cross-file relationships such as call graphs and entry points, so summaries are grounded in actual code structure rather than statistical guesswork.

Ask any LLM to “summarize this repository” and it will happily oblige, and it will just as happily make things up. It will mention a test suite that doesn’t exist. It will describe an API endpoint it inferred from the folder name api/. It will confidently tell you about a “data flow” it never actually traced.

Reason: LLMs are pattern completers, not code analyzers. When you dump a pile of files into a context window and ask for a summary, the model is guessing based on naming conventions and statistical priors from millions of other repos it has seen, not from an understanding of this code.

Solution: code-graph-ai-summarizer (https://github.com/shaktiwadekar9/code-graph-ai-summarizer) is a small Python project that takes a different approach: don't let the LLM look at raw code at all. Instead, build a precise, structured, graph-derived set of facts about the repo first, using real static analysis, and only then hand the LLM a curated fact sheet and ask it to write the summary. The LLM's job shrinks from “understand this codebase” to “narrate these facts I already verified,” which is a job LLMs are very good at.

That one design decision is the whole story of this repo, and it’s a pattern worth learning regardless of whether you ever run this exact tool.

The pipeline has five stages. Each stage passes forward only what the next stage needs, never the raw source code itself. Let’s walk through each one, building up from the basics.

Before we can reason about a codebase, we need a representation of it that a program can query. Reading characters in a .py file tells you nothing about which function calls which other function. You need structure. This is where Joern comes in.
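To make “you need structure” concrete, here is a minimal sketch. It is not how the tool works (the tool relies on Joern); it only uses Python's standard ast module to pull call edges out of a single source string. SOURCE and call_edges are hypothetical names invented for this illustration.

```python
import ast

# Hypothetical two-function module, used only as sample input.
SOURCE = '''
def load(path):
    return open(path).read()

def main():
    text = load("notes.txt")
    print(text)
'''

def call_edges(source: str) -> dict:
    """Map each top-level function name to the sorted names it calls."""
    tree = ast.parse(source)
    edges = {}
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        callees = set()
        for child in ast.walk(node):
            if isinstance(child, ast.Call):
                func = child.func
                if isinstance(func, ast.Name):         # plain call: load(...)
                    callees.add(func.id)
                elif isinstance(func, ast.Attribute):  # method call: x.read()
                    callees.add(func.attr)
        edges[node.name] = sorted(callees)
    return edges

EDGES = call_edges(SOURCE)
print(EDGES)
```

A parser recovers who-calls-whom where raw characters cannot, but this sketch sees only one file and matches names purely syntactically. That is the limitation a repo-wide graph is meant to remove.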
Joern is an open-source static analysis platform that parses source code into a Code Property Graph (CPG): a single graph data structure that fuses together several representations of the code, such as its syntax tree, control flow, and data dependencies. Once your repo is imported into Joern, all of that becomes queryable through CPGQL: a Scala-based query language that treats the whole codebase as one big graph you can filter, map, and traverse.

In the code-graph-ai-summarizer repo (https://github.com/shaktiwadekar9/code-graph-ai-summarizer), that connection lives in joern/client.py:

    from pathlib import Path

    from cpgqls_client import CPGQLSClient, import_code_query

    class JoernRunner:
        def __init__(self, server: str) -> None:
            self.client = CPGQLSClient(server)

        def import_repo(self, repo_path: Path, project_name: str) -> None:
            result = self.client.execute(import_code_query(str(repo_path), project_name))
            ...

JoernRunner is just a thin wrapper around cpgqls_client, which talks to a Joern server running locally (started with joern --server, listening on localhost:8080 by default). You point it at a local folder, it imports the repo, and from then on you can fire CPGQL queries at it. code-graph-ai-summarizer does all of this for you.

Why a graph and not just an AST per file? Because the interesting questions about a codebase are inherently cross-file: “what calls this function,” “where does this value end up,” “which file is the most central.” Those are graph-traversal questions, not single-file parsing questions. The CPG gives you one graph spanning the entire repo, so those questions become tractable.

A CPG by itself is just a big graph sitting in memory. The value comes from the specific queries you run against it.
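As a rough illustration of why these are graph questions, the sketch below answers “what calls this function” by inverting a toy list of call edges. The edge data, file names, and helper names are all invented for the example; they are not taken from the repo or from Joern's output format.

```python
from collections import defaultdict

# Toy call edges, invented for illustration:
# (caller_file, caller_name, callee_file, callee_name)
CALL_EDGES = [
    ("api.py", "create_order", "db.py", "save_order"),
    ("api.py", "create_order", "utils.py", "validate"),
    ("worker.py", "retry_job", "db.py", "save_order"),
    ("db.py", "save_order", "db.py", "open_connection"),
]

def build_callers_index(edges):
    """Invert the edges: callee name -> set of (caller_file, caller_name)."""
    index = defaultdict(set)
    for caller_file, caller, _callee_file, callee in edges:
        index[callee].add((caller_file, caller))
    return index

def callers_of(index, callee):
    """Return every known caller of `callee`, sorted for stable output."""
    return sorted(index.get(callee, ()))

def cross_file_edges(edges):
    """Keep only edges whose caller and callee live in different files."""
    return [edge for edge in edges if edge[0] != edge[2]]

INDEX = build_callers_index(CALL_EDGES)
print(callers_of(INDEX, "save_order"))
print(len(cross_file_edges(CALL_EDGES)))
```

No single file contains the answer to “who calls save_order”; it only appears once the edges from every file sit in one structure. In the tool itself, questions of this kind are written as CPGQL queries against the Joern graph.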
joern/queries.py defines the queries this tool runs, and reading through them is basically a mini-lesson in “what does a useful static analysis tool actually need to know about a codebase.”

Query and what it extracts:

    files: every source file in the repo
    methods: every function/method, with file, full name, signature, line
    types: every class/type declared
    call_edges: for each method, which other methods it calls (internal + external)
    calls: every individual call site, with its code text
    entry_candidates: methods that look like entry points
    source_sink_calls: calls that look like data sources or data sinks

The entry_candidates query is critical. There’s no universal CPGQL way to say “find the main function” across Python, JS, Go, etc. So the repo uses a name/filename heuristic instead:

    val entryRe = "(?i).*(main|run|start|serve|handler|handle|route|controller|command|execute|process|consume|worker|app).*"
    cpg.method
      .filterNot(_.isExternal)
      .filter(m => m.name.matches(entryRe) || m.filename.matches(entryRe))
      .take(maxItems)

One more detail worth noticing: joern/client.py wraps every single query in a try/except:

    for name, query in joern_queries(max_items).items():
        try:
            facts[name] = self.run_json_query(name, query)
        except Exception as exc:
            print(f"[warn] Joern query failed: {name}: {exc}")
            facts[name] = []

If one query fails (say, the data-flow query isn’t supported for a given language overlay), the pipeline doesn’t crash; it just records an empty result and moves on.

Joern hands back raw lists: every file, every method, every call. That’s hundreds or thousands of items: too much, too unstructured, and too noisy to throw straight at an LLM. This is where the repo’s analysis/ package comes in. Its whole job is compression with judgment: turning a flood of graph facts into a small set of ranked, labeled signals.

analysis/patterns.py defines keyword buckets for common categories: api_web, cli, storage_db, filesystem, llm, network, auth, queue_worker.
A snippet:

    CATEGORY_PATTERNS = {
        "storage_db": [
            "sqlite", "postgres", "mysql", "mongodb", "redis", "sqlalchemy",
            "save", "insert", "update", "delete", "select", "execute", "commit", "query",
        ],
        "llm": [
            "openai", "ollama", "anthropic", "gemini", "groq", "cerebras",
            "completion", "chat.completions", "llm", "model", "generate",
        ],
        ...
    }

analysis/classify.py then just checks whether any of these substrings show up in a call's name/code/target text. This is deliberately simple: no embeddings, no ML model, just substring matching. Reason: it's fast, debuggable, language-agnostic, and “good enough” because its output isn't the final answer; it's a signal that downstream ranking and the LLM will further interpret. Don't reach for a heavyweight model when a keyword list solves 90% of the problem at near-zero cost.

analysis/architecture.py turns the call-edge facts into a per-file importance score. The logic, simplified:

    file_scores[caller_file] += len(internal_callees) + len(external_callees)
    ...
    file_edge_counts[(caller_file, callee_file)] += 1
    file_scores[callee_file] += 2

Sort by score, and you get a ranked list of “central files”, a cheap but effective proxy for architectural importance.

This is the most conceptually interesting part of the repo. analysis/flows.py runs a breadth-first search (BFS) starting from each entry-point candidate, walking forward through the call graph, and scoring every path it finds:

    entry point
      -> calls method A
      -> calls method B (touches "storage_db")
      -> calls method C (touches "llm")

    queue = deque([[entry]])
    while queue and seen_paths < 80:
        path = queue.popleft()
        if len(path) >= 3:
            score, signals = path_score(path, method_to_file)
            if signals:
                candidates.append(runtime_candidate(...))
        if len(path) >= 5:
            continue
        for next_method in graph.get(path[-1], [])[:12]:
            if next_method not in path:
                queue.append(path + [next_method])

Each path’s score (analysis/graph.py) rewards length and, much more heavily, rewards touching “important” categories like api_web, storage_db, or llm:

    score = len(path) + 4 * len(set(categories))
    important = {"api_web", "storage_db", "filesystem", "llm", "network", "auth", "queue_worker"}
    score += 5 * len(important.intersection(categories))

In plain English: a path that goes from an entry point all the way to a database call or an LLM call is more “interesting” than a path that just bounces between two utility functions. That’s a simple but effective heuristic.

find_data_flows is the mirror image: instead of starting from entry points, it starts from calls that look like data sources (request, input, argv, env, ...) and BFS-searches forward until it reaches calls that look like data sinks (write, save, insert, chat, post, ...).

    source: read user input
            |
            v
    some processing methods
            |
            v
    sink: save to DB / send to LLM

Important nuance the README states explicitly and the code backs up: these are graph-derived candidates, not proven runtime traces. Joern is doing static analysis; it never executes the code. A BFS path through the call graph is a plausible flow, not a guaranteed one.

All of the analysis above gets assembled in summarization/facts_builder.py into a single summary_facts dictionary:

    def build_summary_facts(repo_path: Path, facts: dict) -> dict:
        repo_map = build_repo_map(facts.get("files", []))
        architecture = derive_architecture(facts)
        return {
            "repo_name": repo_path.name,
            "repo_map": repo_map,
            "architecture_signals": architecture,
            "entry_points": facts.get("entry_candidates", [])[:40],
            "critical_runtime_flow_candidates": find_runtime_flows(facts),
            "critical_data_flow_candidates": find_data_flows(facts),
            "important_symbols": important_symbols(facts, architecture),
            "limits": {"note": "This is static analysis. Runtime/data flows are graph-derived candidates, not guaranteed actual production traces."},
        }

Notice what’s not in here: the raw source. Even important_symbols is deliberately filtered down to only the methods/types that live in the already-identified “central files”, another compression step that keeps the eventual LLM prompt small and focused. This dictionary, not the repo itself, is what the LLM will actually see.

summarization/prompts.py builds the final prompt, and it’s worth reading closely because it shows how to constrain an LLM rather than just hope it behaves:

    return f"""You are generating a repository summary using Joern Code Property Graph facts.
    Use only the supplied graph facts.
    Do not invent files, folders, APIs, tests, classes, functions, runtime flows, or data flows.
    Separate detected facts from inferred conclusions.
    For runtime flows and data flows, include only the critical ones, not every path.
    If something is weakly supported, say "likely".
    If something is not supported, say "not detected".
    Return Markdown with exactly these sections:

    Repository Summary
    1. Repository Purpose
    2. Repository Map
    3. Architecture
    4. Critical Runtime Flows
    5. Critical Data Flows
    6. Important Files
    7. Important Symbols
    8. Not Detected / Unknown
    ..."""

llm/client.py then does the boring-but-important part: it’s a thin OpenAI-SDK wrapper that works with any OpenAI-compatible endpoint (Groq, OpenRouter, Gemini’s OpenAI-compatible endpoint, or Cerebras), controlled purely through .env config:

    LLM_PROVIDER=groq
    LLM_API_KEY=your_api_key_here
    LLM_MODEL=llama-3.3-70b-versatile

    def generate_repo_summary(summary_facts: dict, config: LLMConfig) -> str:
        client = make_client(config)
        response = client.chat.completions.create(
            model=config.model,
            temperature=config.temperature,
            max_tokens=config.max_tokens,
            messages=[
                {"role": "system", "content": "You are a precise static-analysis repo summarizer. You must not hallucinate unsupported repo facts."},
                {"role": "user", "content": build_summary_prompt(summary_facts)},
            ],
        )
        return response.choices[0].message.content or ""

Running the whole pipeline is a single command:

    uv run code-graph-ai-summarizer /path/to/local/repo

A run of the run function in cli/main.py leaves three files under outputs/. The third, repo_summary.md, is the final human-readable summary.

    outputs/<repo-name>/
    ├── joern_facts.json     <- raw, large, exact
    ├── summary_facts.json   <- compact, ranked, curated
    └── repo_summary.md      <- narrated, by the LLM

That intermediate summary_facts.json is, in practice, one of the most useful files this tool produces. It's the auditable middle layer: if the final Markdown says something surprising, you can open this file and check whether it’s actually grounded in a detected signal or whether the LLM drifted.

The specific use case here is repo summarization, but the underlying pattern is broadly applicable to anyone building tools on top of LLMs.

To try it yourself:

    git clone <your-repo-url>
    cd code-graph-ai-summarizer
    uv sync
    cp .env.example .env
    # edit .env: set LLM_PROVIDER, LLM_API_KEY, LLM_MODEL

    # in a separate terminal
    joern --server

    # in a separate terminal (if using ollama)
    ollama serve

    # back in your main terminal
    uv run code-graph-ai-summarizer /path/to/any/local/repo

Point it at a small repo first and read the resulting repo_summary.md.

Full credit where it’s due: I didn’t write this by hand, line by line, heroically, at 2 AM, fueled by coffee. I acted as the orchestrator, pointing ChatGPT and Claude at the problem, arguing with them when they hallucinated a function that didn’t exist, and stitching their outputs into something that actually runs and is a useful application. They wrote the code; I supplied the opinions, the rejections, and the “no, that’s not what I meant” loop until it converged. So consider this repo (https://github.com/shaktiwadekar9/code-graph-ai-summarizer) a small case study in human + AI pair programming, minus the part where the AI gets annoyed at my code review comments.
Stop Letting LLMs Hallucinate Your Codebase: A Graph-First Way to Summarize Repos (https://pub.towardsai.net/stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos-8a803db9c931) was originally published in Towards AI (https://pub.towardsai.net) on Medium.