{"slug": "stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos", "title": "Stop Letting LLMs Hallucinate Your Codebase: A Graph-First Way to Summarize Repos", "summary": "A new open-source Python project, code-graph-ai-summarizer, prevents LLM hallucinations in code summarization by first building a structured Code Property Graph (CPG) via static analysis tool Joern, then feeding only verified facts to the LLM for narration. The tool extracts cross-file relationships like call graphs and entry points, ensuring summaries are grounded in actual code structure rather than statistical guesswork.", "body_md": "Ask any LLM to “summarize this repository” and it will happily oblige, and it **will also** happily *make things up*. It will mention a test suite that doesn’t exist. It will describe an API endpoint it inferred from the folder name api/. It will confidently tell you about a “data flow” it never actually traced :(\n\nReason:\n\n**LLMs are pattern completers, not code analyzers.** When you dump a pile of files into a context window and ask for a summary, the model is guessing based on naming conventions and statistical priors from millions of other repos it has seen, not from understanding *this* code.\n\nSolution:\n\n[code-graph-ai-summarizer](https://github.com/shaktiwadekar9/code-graph-ai-summarizer) is a small Python project that takes a different approach: **don't let the LLM look at raw code at all.**\n\nInstead, build a precise, structured, graph-derived set of facts about the repo first, using real static analysis, and only then hand the LLM a curated fact-sheet and ask it to write a summary.\n\nThe LLM's job shrinks from “understand this codebase” to “narrate these facts I already verified,” which is a job LLMs are very good at.\n\n**That one design decision is the whole story of this repo**, and it’s a pattern worth learning regardless of whether you ever run this exact tool.\n\nFive stages. 
Each stage passes forward only what the next stage needs, never the raw source code itself.\n\n**Let’s walk through each one, building up from the basics.**\n\nBefore we can reason about a codebase, we need a representation of it that a program can query.\n\nReading characters in a .py file tells you nothing about which function calls which other function. You need structure.\n\nThis is where **Joern** comes in. Joern is an open-source static analysis platform that parses source code into a **Code Property Graph (CPG):** a single graph data structure that fuses together several representations:\n\nOnce your repo is imported into Joern, all of that becomes queryable through **CPGQL:** a Scala-based query language that treats the whole codebase as one big graph you can filter, map, and traverse.\n\nIn the [code-graph-ai-summarizer](https://github.com/shaktiwadekar9/code-graph-ai-summarizer) repo, that connection lives in joern/client.py:\n\n```python\nclass JoernRunner:\n    def __init__(self, server: str) -> None:\n        # server is the address of the running Joern server\n        self.client = CPGQLSClient(server)\n\n    def import_repo(self, repo_path: Path, project_name: str) -> None:\n        # Ask Joern to import the repo and build its CPG\n        result = self.client.execute(import_code_query(str(repo_path), project_name))\n        ...\n```\n\nJoernRunner is just a thin wrapper around cpgqls_client, which talks to a Joern server running **locally** (joern --server, listening on localhost:8080 by default).\n\nYou point it at a local folder, it imports the repo, and from then on you can fire CPGQL queries at it. 
(The [code-graph-ai-summarizer](https://github.com/shaktiwadekar9/code-graph-ai-summarizer) tool does this for you, so you don’t have to.)\n\nWhy a graph and not just an AST per file?\n\n**Because the interesting questions about a codebase are inherently cross-file: “what calls this function,” “where does this value end up,” “which file is the most central.”**\n\nThose are graph-traversal questions, not single-file parsing questions.\n\nThe CPG gives you one graph spanning the entire repo, so those questions become tractable.\n\nA CPG by itself is just a big graph sitting in memory.\n\n**The value comes from the specific queries you run against it.**\n\njoern/queries.py defines **seven** of them, and reading through them is basically a mini-lesson in “what does a useful static analysis tool actually need to know about a codebase.”\n\n| Query | What it extracts |\n| --- | --- |\n| files | Every source file in the repo |\n| methods | Every function/method, with file, full name, signature, line |\n| types | Every class/type declared |\n| call_edges | For each method, which other methods it calls (internal + external) |\n| calls | Every individual call site, with its code text |\n| entry_candidates | Methods that *look* like entry points |\n| source_sink_calls | Calls that look like data sources or data sinks |\n\nentry_candidates\n\nThe entry_candidates query is critical. 
There’s no universal CPGQL way to say “find the main function” across Python, JS, Go, etc.\n\n**So the repo uses a name/filename heuristic instead:**\n\n```\nval entryRe = \"(?i).*(main|run|start|serve|handler|handle|route|controller|command|execute|process|consume|worker|app).*\"\ncpg.method\n  .filterNot(_.isExternal)\n  .filter(m => m.name.matches(entryRe) || m.filename.matches(entryRe))\n  .take(maxItems)\n```\n\nOne more detail worth noticing:\n\njoern/client.py wraps every single query in a try/except:\n\n```\nfor name, query in joern_queries(max_items).items():\n    try:\n        facts[name] = self.run_json_query(name, query)\n    except Exception as exc:\n        print(f\"[warn] Joern query failed: {name}: {exc}\")\n        facts[name] = []\n```\n\nIf one query fails (say, the data-flow query isn’t supported for a given language overlay), the pipeline doesn’t crash; it just records an empty result and moves on.\n\nJoern hands back raw lists: every file, every method, every call. That’s hundreds or thousands of items: too much, too unstructured, and too noisy to throw straight at an LLM.\n\nThis is where the repo’s analysis/ package comes in.\n\nIts whole job is compression with judgment: turning a flood of graph facts into a small set of ranked, labeled signals.\n\nanalysis/patterns.py defines keyword buckets for common categories: api_web, cli, storage_db, filesystem, llm, network, auth, queue_worker.\n\nA snippet:\n\n```\nCATEGORY_PATTERNS = {\n    \"storage_db\": [\n        \"sqlite\", \"postgres\", \"mysql\", \"mongodb\", \"redis\", \"sqlalchemy\",\n        \"save\", \"insert\", \"update\", \"delete\", \"select\", \"execute\", \"commit\", \"query\",\n    ],\n    \"llm\": [\n        \"openai\", \"ollama\", \"anthropic\", \"gemini\", \"groq\", \"cerebras\",\n        \"completion\", \"chat.completions\", \"llm\", \"model\", \"generate\",\n    ],\n    ...\n}\n```\n\nanalysis/classify.py then just checks whether any of these substrings show up in a call's name/code/target text.\n\n**This is 
deliberately simple**: no embeddings, no ML model, just substring matching. Reason: it's fast, debuggable, language-agnostic, and “good enough” because its output isn't the final answer; it's a *signal* that downstream ranking and the LLM will further interpret.\n\nDon't reach for a heavyweight model when a keyword list solves 90% of the problem at near-zero cost.\n\nanalysis/architecture.py turns the call-edge facts into a per-file importance score.\n\nThe logic, simplified:\n\n```\nfile_scores[caller_file] += len(internal_callees) + len(external_callees)\n...\nfile_edge_counts[(caller_file, callee_file)] += 1\nfile_scores[callee_file] += 2\n```\n\n**Sort by score**, and you get a ranked list of “central files”, a cheap but effective proxy for architectural importance.\n\nThis is the most conceptually interesting part of the repo.\n\nanalysis/flows.py runs a **breadth-first search (BFS)** starting from each entry-point candidate, walking forward through the call graph, and scoring every path it finds:\n\n```\nentry_point\n   -> calls method A\n        -> calls method B  (touches \"storage_db\")\n             -> calls method C  (touches \"llm\")\n```\n\n```python\nqueue = deque([[entry]])\nwhile queue and seen_paths < 80:\n    path = queue.popleft()\n    if len(path) >= 3:\n        score, signals = path_score(path, method_to_file)\n        if signals:\n            candidates.append(runtime_candidate(...))\n    if len(path) >= 5:\n        continue\n    for next_method in graph.get(path[-1], [])[:12]:\n        if next_method not in path:\n            queue.append(path + [next_method])\n```\n\nEach path’s score (analysis/graph.py) rewards length and, much more heavily, rewards touching “important” categories like api_web, storage_db, or llm:\n\n```\nscore = len(path) + 4 * len(set(categories))\nimportant = {\"api_web\", \"storage_db\", \"filesystem\", \"llm\", \"network\", \"auth\", \"queue_worker\"}\nscore += 5 * len(important.intersection(categories))\n```\n\nIn plain English:\n\nA path that goes from an entry point 
all the way to a database call or an LLM call is more “interesting” than a path that just bounces between two utility functions.\n\nThat’s a simple but effective heuristic.\n\nfind_data_flows is the mirror image: instead of starting from entry points, it starts from calls that look like **data sources** (request, input, argv, env, ...) and BFS-searches forward until it reaches calls that look like **data sinks** (write, save, insert, chat, post, ...).\n\n```\nsource: read user input\n    |\n    v\n  [ some processing methods ]\n    |\n    v\nsink: save to DB / send to LLM\n```\n\nImportant nuance the README states explicitly and the code backs up: **these are graph-derived candidates, not proven runtime traces.**\n\nJoern is doing static analysis; it never executes the code.\n\nA BFS path through the call graph is a *plausible* flow, not a guaranteed one.\n\nAll of the analysis above gets assembled in summarization/facts_builder.py into a single summary_facts dictionary:\n\n```python\ndef build_summary_facts(repo_path: Path, facts: dict) -> dict:\n    repo_map = build_repo_map(facts.get(\"files\", []))\n    architecture = derive_architecture(facts)\n    return {\n        \"repo_name\": repo_path.name,\n        \"repo_map\": repo_map,\n        \"architecture_signals\": architecture,\n        \"entry_points\": facts.get(\"entry_candidates\", [])[:40],\n        \"critical_runtime_flow_candidates\": find_runtime_flows(facts),\n        \"critical_data_flow_candidates\": find_data_flows(facts),\n        \"important_symbols\": important_symbols(facts, architecture),\n        \"limits\": {\"note\": \"This is static analysis. Runtime/data flows are graph-derived candidates, not guaranteed actual production traces.\"},\n    }\n```\n\n**Notice what’s not in here:**\n\nimportant_symbols is deliberately filtered down to only the methods/types that live in the already-identified “central files”, another compression step that keeps the eventual LLM prompt small and focused.\n\nThis dictionary, not the repo itself, is what the LLM will actually see.\n\nsummarization/prompts.py\n\nsummarization/prompts.py builds the final prompt, and it’s worth reading closely because it shows how to constrain an LLM rather than just hope it behaves:\n\n```\nreturn f\"\"\"You are generating a repository summary using Joern Code Property Graph facts.\nUse only the supplied graph facts.\nDo not invent files, folders, APIs, tests, classes, functions, runtime flows, or data flows.\nSeparate detected facts from inferred conclusions.\nFor runtime flows and data flows, include only the critical ones, not every path.\nIf something is weakly supported, say \"likely\".\nIf something is not supported, say \"not detected\".\nReturn Markdown with exactly these sections:\n# Repository Summary\n## 1. Repository Purpose\n## 2. Repository Map\n## 3. Architecture\n## 4. Critical Runtime Flows\n## 5. Critical Data Flows\n## 6. Important Files\n## 7. Important Symbols\n## 8. 
Not Detected / Unknown\n...\"\"\"\n```\n\nA few important details:\n\nllm/client.py\n\nllm/client.py then does the boring-but-important part: it’s a thin OpenAI-SDK wrapper that works with any OpenAI-compatible endpoint (Groq, OpenRouter, Gemini’s OpenAI-compatible endpoint, or Cerebras), controlled purely through .env config:\n\n```\nLLM_PROVIDER=groq\nLLM_API_KEY=your_api_key_here\nLLM_MODEL=llama-3.3-70b-versatile\n```\n\n```python\ndef generate_repo_summary(summary_facts: dict, config: LLMConfig) -> str:\n    client = make_client(config)\n    response = client.chat.completions.create(\n        model=config.model,\n        temperature=config.temperature,\n        max_tokens=config.max_tokens,\n        messages=[\n            {\"role\": \"system\", \"content\": \"You are a precise static-analysis repo summarizer. You must not hallucinate unsupported repo facts.\"},\n            {\"role\": \"user\", \"content\": build_summary_prompt(summary_facts)},\n        ],\n    )\n    return response.choices[0].message.content or \"\"\n```\n\n```\nuv run code-graph-ai-summarizer /path/to/local/repo\n```\n\nWalking through the run() function in cli/main.py end to end:\n\n**Third, repo_summary.md: the final human-readable summary.**\n\n```\noutputs/<repo-name>/\n├── joern_facts.json     <- raw, large, exact\n├── summary_facts.json   <- compact, ranked, curated\n└── repo_summary.md      <- narrated, by the LLM\n```\n\nThat intermediate summary_facts.json is, in practice, one of the most useful files this tool produces. 
It's the auditable middle layer: **if the final Markdown says something surprising, you can open this file and check whether it’s actually grounded** in a detected signal or whether the LLM drifted.\n\nThe specific use case here is repo summarization, but the underlying pattern is broadly applicable to anyone building tools on top of LLMs.\n\n```\ngit clone <your-repo-url>\ncd code-graph-ai-summarizer\nuv sync\ncp .env.example .env\n# edit .env: set LLM_PROVIDER, LLM_API_KEY, LLM_MODEL\n\n# in a separate terminal\njoern --server\n\n# in a separate terminal (if using ollama)\nollama serve\n\n# back in your main terminal\nuv run code-graph-ai-summarizer /path/to/any/local/repo\n```\n\nPoint it at a small repo first, and you get repo_summary.md.\n\nFull credit where it’s due: I didn’t write this by hand, line by line, heroically, at 2 AM, fueled by coffee.\n\nI played orchestrator, pointing ChatGPT and Claude at the problem, arguing with them when they hallucinated a function that didn’t exist, and stitching their outputs into **something that actually runs and is a useful application**. 
They wrote the code, I supplied the opinions, the rejections, and the “no, that’s not what I meant” loop until it converged.\n\nSo consider [this repo](https://github.com/shaktiwadekar9/code-graph-ai-summarizer) a small case study in human + AI pair programming, minus the part where the AI gets annoyed at my code review comments.\n\n[Stop Letting LLMs Hallucinate Your Codebase: A Graph-First Way to Summarize Repos](https://pub.towardsai.net/stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos-8a803db9c931) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos", "canonical_source": "https://pub.towardsai.net/stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos-8a803db9c931?source=rss----98111c9905da---4", "published_at": "2026-06-26 04:02:41+00:00", "updated_at": "2026-06-26 04:11:44.223574+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-tools"], "entities": ["Joern", "code-graph-ai-summarizer", "CPGQL", "Shakti Wadekar"], "alternates": {"html": "https://wpnews.pro/news/stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos", "markdown": "https://wpnews.pro/news/stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos.md", "text": "https://wpnews.pro/news/stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos.txt", "jsonld": "https://wpnews.pro/news/stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos.jsonld"}}