# Stop Letting LLMs Hallucinate Your Codebase: A Graph-First Way to Summarize Repos

> Source: <https://pub.towardsai.net/stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos-8a803db9c931?source=rss----98111c9905da---4>
> Published: 2026-06-26 04:02:41+00:00

Ask any LLM to “summarize this repository” and it will happily oblige, and it **will also** happily *make things up*. It will mention a test suite that doesn’t exist. It will describe an API endpoint it inferred from the folder name api/. It will confidently tell you about a “data flow” it never actually traced :(

Reason:

**LLMs are pattern completers, not code analyzers.** When you dump a pile of files into a context window and ask for a summary, the model is guessing based on naming conventions and statistical priors from millions of other repos it has seen, not from understanding *this* code.

Solution:

[code-graph-ai-summarizer](https://github.com/shaktiwadekar9/code-graph-ai-summarizer) is a small Python project that takes a different approach: **don't let the LLM look at raw code at all.**

Instead, build a precise, structured, graph-derived set of facts about the repo first, using real static analysis, and only then hand the LLM a curated fact sheet and ask it to write a summary.

The LLM's job shrinks from “understand this codebase” to “narrate these facts I already verified,” which is a job LLMs are very good at.

**That one design decision is the whole story of this repo**, and it’s a pattern worth learning regardless of whether you ever run this exact tool.

Five stages. Each stage only passes forward what the next stage needs and never the raw source code itself.

**Let’s walk through each one, building up from the basics.**

Before we can reason about a codebase, we need a representation of it that a program can query.

Reading characters in a .py file tells you nothing about which function calls which other function. You need structure.

This is where **Joern** comes in. Joern is an open-source static analysis platform that parses source code into a **Code Property Graph (CPG)**: a single graph data structure that fuses together several representations:

Once your repo is imported into Joern, all of that becomes queryable through **CPGQL**: a Scala-based query language that treats the whole codebase as one big graph you can filter, map, and traverse.

In the [code-graph-ai-summarizer](https://github.com/shaktiwadekar9/code-graph-ai-summarizer) repo, that connection lives in joern/client.py:

```python
class JoernRunner:
    def __init__(self, server: str) -> None:
        self.client = CPGQLSClient(server)

    def import_repo(self, repo_path: Path, project_name: str) -> None:
        result = self.client.execute(import_code_query(str(repo_path), project_name))
        ...
```

JoernRunner is just a thin wrapper around cpgqls_client, which talks to a Joern server running **locally** (joern --server, listening on localhost:8080 by default).

You point it at a local folder, it imports the repo, and from then on you can fire CPGQL queries at it. ([code-graph-ai-summarizer](https://github.com/shaktiwadekar9/code-graph-ai-summarizer) does all of this for you, so you don't have to.)

Why a graph and not just an AST per file?

**Because the interesting questions about a codebase are inherently cross-file: “what calls this function,” “where does this value end up,” “which file is the most central.”**

Those are graph-traversal questions, not single-file parsing questions.

The CPG gives you one graph spanning the entire repo, so those questions become tractable.
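To make "cross-file question" concrete, here is a toy sketch in plain Python (this is not Joern or CPGQL). The call edges are invented for illustration. The only point is that "what calls this function" is a reverse-edge lookup over one repo-wide graph, something no single-file AST can answer.

```python
from collections import defaultdict

# Hypothetical call edges: (caller, callee), each written as "file::function".
CALL_EDGES = [
    ("cli/main.py::run", "joern/client.py::import_repo"),
    ("cli/main.py::run", "llm/client.py::generate"),
    ("api/server.py::handle", "llm/client.py::generate"),
]


def callers_of(target: str, edges: list[tuple[str, str]]) -> list[str]:
    """Return every caller of `target`, no matter which file it lives in."""
    reverse = defaultdict(list)
    for caller, callee in edges:
        reverse[callee].append(caller)  # flip each edge: callee -> callers
    return sorted(reverse[target])


print(callers_of("llm/client.py::generate", CALL_EDGES))
```

The answer spans two different files, which is exactly why the graph has to cover the whole repo.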

A CPG by itself is just a big graph sitting in memory.

**The value comes from the specific queries you run against it.**

joern/queries.py defines **seven** of them, and reading through them is basically a mini-lesson in “what does a useful static analysis tool actually need to know about a codebase.”

| Query | What it extracts |
| --- | --- |
| files | Every source file in the repo |
| methods | Every function/method, with file, full name, signature, line |
| types | Every class/type declared |
| call_edges | For each method, which other methods it calls (internal + external) |
| calls | Every individual call site, with its code text |
| entry_candidates | Methods that *look* like entry points |
| source_sink_calls | Calls that look like data sources or data sinks |

entry_candidates

The entry_candidates query is critical. There’s no universal CPGQL way to say “find the main function” across Python, JS, Go, etc.

**So the repo uses a name/filename heuristic instead:**

```scala
val entryRe = "(?i).*(main|run|start|serve|handler|handle|route|controller|command|execute|process|consume|worker|app).*"

cpg.method
  .filterNot(_.isExternal)
  .filter(m => m.name.matches(entryRe) || m.filename.matches(entryRe))
  .take(maxItems)
```
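If Scala isn't your thing, the same heuristic can be mirrored in a few lines of Python. This is only a sketch for intuition: the dict fields (`name`, `filename`, `is_external`), the sample methods, and the default `max_items` are my assumptions, not the repo's data model.

```python
import re

# Same alternation as the CPGQL snippet; re.IGNORECASE stands in for the inline (?i).
ENTRY_RE = re.compile(
    r".*(main|run|start|serve|handler|handle|route|controller|command"
    r"|execute|process|consume|worker|app).*",
    re.IGNORECASE,
)


def entry_candidates(methods: list[dict], max_items: int = 50) -> list[dict]:
    """Keep internal methods whose name or filename matches the heuristic."""
    hits = [
        m for m in methods
        if not m["is_external"]
        and (ENTRY_RE.fullmatch(m["name"]) or ENTRY_RE.fullmatch(m["filename"]))
    ]
    return hits[:max_items]


# Hypothetical method records, invented for this example.
methods = [
    {"name": "main", "filename": "cli/tool.py", "is_external": False},
    {"name": "apply_discount", "filename": "pricing.py", "is_external": False},
    {"name": "parse_row", "filename": "csv_utils.py", "is_external": False},
    {"name": "run", "filename": "subprocess.py", "is_external": True},
]

print([m["name"] for m in entry_candidates(methods)])
```

Note that `apply_discount` gets through because it contains "app". That is why these are called *candidates*: the heuristic is cheap and over-inclusive, and later stages are expected to rank the noise down.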

One more detail worth noticing:

joern/client.py wraps every single query in a try/except:

```python
for name, query in joern_queries(max_items).items():
    try:
        facts[name] = self.run_json_query(name, query)
    except Exception as exc:
        print(f"[warn] Joern query failed: {name}: {exc}")
        facts[name] = []
```

If one query fails (say, the data-flow query isn’t supported for a given language overlay), the pipeline doesn’t crash; it just records an empty result and moves on.
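You can see the effect of that try/except without a Joern server. Here is a minimal, self-contained sketch where a fake query runner fails on one query. `collect_facts`, `fake_run_query`, and the placeholder query strings are hypothetical names I made up for the demo.

```python
from typing import Callable


def collect_facts(
    queries: dict[str, str],
    run_query: Callable[[str, str], list],
) -> dict[str, list]:
    """Run every query; a failing query yields [] instead of aborting the run."""
    facts: dict[str, list] = {}
    for name, query in queries.items():
        try:
            facts[name] = run_query(name, query)
        except Exception as exc:
            print(f"[warn] query failed: {name}: {exc}")
            facts[name] = []
    return facts


def fake_run_query(name: str, query: str) -> list:
    """Stand-in for a real Joern call: one query is made to fail on purpose."""
    if name == "source_sink_calls":
        raise RuntimeError("not supported for this language")
    return [f"row from {name}"]


facts = collect_facts(
    {"files": "<query text>", "source_sink_calls": "<query text>"},
    fake_run_query,
)
print(facts)
```

Every key is always present in the result, so downstream code can use `facts.get(...)` or plain indexing without special-casing failures.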

Joern hands back raw lists: every file, every method, every call. That’s hundreds or thousands of items, too much, too unstructured, and too noisy to throw straight at an LLM.

This is where the repo’s analysis/ package comes in.

Its whole job is compression with judgment: turning a flood of graph facts into a small set of ranked, labeled signals.

analysis/patterns.py defines keyword buckets for common categories: api_web, cli, storage_db, filesystem, llm, network, auth, queue_worker.

A snippet:

```python
CATEGORY_PATTERNS = {
    "storage_db": [
        "sqlite", "postgres", "mysql", "mongodb", "redis", "sqlalchemy",
        "save", "insert", "update", "delete", "select", "execute", "commit", "query",
    ],
    "llm": [
        "openai", "ollama", "anthropic", "gemini", "groq", "cerebras",
        "completion", "chat.completions", "llm", "model", "generate",
    ],
    ...
}
```

analysis/classify.py then just checks whether any of these substrings show up in a call's name/code/target text.

**This is deliberately simple**: no embeddings, no ML model, just substring matching. Reason: it's fast, debuggable, language-agnostic, and “good enough” because its output isn't the final answer, it's a *signal* that downstream ranking and the LLM will further interpret.

Don't reach for a heavyweight model when a keyword list solves 90% of the problem at near-zero cost.
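Here is roughly what that substring check could look like. Treat it as a minimal sketch: the pattern table is a trimmed subset of the one above, and the field names (`name`, `code`, `target`) and the function name `classify_call` are my assumptions rather than the repo's exact code.

```python
# Trimmed subset of the keyword buckets shown above.
CATEGORY_PATTERNS = {
    "storage_db": ["sqlite", "postgres", "insert", "commit", "query"],
    "llm": ["openai", "anthropic", "chat.completions", "llm"],
}


def classify_call(call: dict) -> list[str]:
    """Return every category whose keywords appear in the call's text."""
    text = " ".join(str(call.get(k, "")) for k in ("name", "code", "target")).lower()
    return sorted(
        category
        for category, needles in CATEGORY_PATTERNS.items()
        if any(needle in text for needle in needles)
    )


print(classify_call({"name": "create", "code": "client.chat.completions.create(model=m)"}))
print(classify_call({"name": "execute", "code": "cursor.execute('INSERT INTO t VALUES (1)')"}))
print(classify_call({"name": "parse_query_string", "code": "parse_query_string(url)"}))
```

The third call is a false positive: a URL helper lands in `storage_db` because its name contains "query". That's acceptable here precisely because the label is a signal for later ranking, not a verdict.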

analysis/architecture.py turns the call-edge facts into a per-file importance score.

The logic, simplified:

```python
file_scores[caller_file] += len(internal_callees) + len(external_callees)
...
file_edge_counts[(caller_file, callee_file)] += 1
file_scores[callee_file] += 2
```

**Sort by score**, and you get a ranked list of “central files”, a cheap but effective proxy for architectural importance.
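A runnable toy version makes the scoring easier to reason about. This sketch assumes a simplified edge record (`caller_file`, `internal_callee_files`, `external_callee_count`) that I invented for the example, and it assumes every internal callee earns its file +2. The real module also tracks per-pair edge counts, which I've left out.

```python
from collections import Counter


def rank_central_files(call_edges: list[dict]) -> list[tuple[str, int]]:
    """Score files by outgoing calls (+1 each) and incoming internal calls (+2 each)."""
    file_scores: Counter = Counter()
    for edge in call_edges:
        callees = edge["internal_callee_files"]
        file_scores[edge["caller_file"]] += len(callees) + edge["external_callee_count"]
        for callee_file in callees:
            file_scores[callee_file] += 2  # being called counts double
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(file_scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical per-method call-edge records.
edges = [
    {"caller_file": "cli/main.py", "internal_callee_files": ["joern/client.py", "llm/client.py"], "external_callee_count": 1},
    {"caller_file": "joern/client.py", "internal_callee_files": ["joern/queries.py"], "external_callee_count": 2},
    {"caller_file": "api.py", "internal_callee_files": ["llm/client.py"], "external_callee_count": 0},
]

for path, score in rank_central_files(edges):
    print(score, path)
```

Weighting incoming calls more than outgoing ones pushes widely used files (here the two client modules) above the thin entry script that merely calls them.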

This is the most conceptually interesting part of the repo.

analysis/flows.py runs a **breadth-first search (BFS)** starting from each entry-point candidate, walking forward through the call graph, and scoring every path it finds:

```python
# entry_point
#   -> calls method A
#        -> calls method B  (touches "storage_db")
#             -> calls method C  (touches "llm")

queue = deque([[entry]])
while queue and seen_paths < 80:
    path = queue.popleft()
    if len(path) >= 3:
        score, signals = path_score(path, method_to_file)
        if signals:
            candidates.append(runtime_candidate(...))
    if len(path) >= 5:
        continue
    for next_method in graph.get(path[-1], [])[:12]:
        if next_method not in path:
            queue.append(path + [next_method])
```

Each path’s score (analysis/graph.py) rewards length and, much more heavily, rewards touching “important” categories like api_web, storage_db, or llm:

```python
score = len(path) + 4 * len(set(categories))
important = {"api_web", "storage_db", "filesystem", "llm", "network", "auth", "queue_worker"}
score += 5 * len(important.intersection(categories))
```

In plain English:

A path that goes from an entry point all the way to a database call or an LLM call is more “interesting” than a path that just bounces between two utility functions.

That’s a simple but effective heuristic.
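To watch the BFS and the scoring work together, here is a self-contained sketch on a four-node call graph. It follows the limits from the excerpts (80 paths, depth 5, fan-out 12) and the same score formula, but several things are my assumptions: the category lookup is a plain dict keyed by method name, `seen_paths` counts dequeued paths, and the candidate records are simple dicts.

```python
from collections import deque

IMPORTANT = {"api_web", "storage_db", "filesystem", "llm", "network", "auth", "queue_worker"}


def path_score(path: list[str], method_categories: dict[str, list[str]]):
    """Reward length, distinct categories, and (more heavily) important ones."""
    categories = [c for method in path for c in method_categories.get(method, [])]
    score = len(path) + 4 * len(set(categories))
    score += 5 * len(IMPORTANT.intersection(categories))
    return score, sorted(set(categories))


def find_runtime_flows(entry, graph, method_categories, max_paths=80, max_depth=5, fanout=12):
    """BFS forward from one entry point, keeping paths that touch a category."""
    candidates = []
    seen_paths = 0
    queue = deque([[entry]])
    while queue and seen_paths < max_paths:
        path = queue.popleft()
        seen_paths += 1
        if len(path) >= 3:
            score, signals = path_score(path, method_categories)
            if signals:
                candidates.append({"path": path, "score": score, "signals": signals})
        if len(path) >= max_depth:
            continue
        for next_method in graph.get(path[-1], [])[:fanout]:
            if next_method not in path:  # avoid cycles
                queue.append(path + [next_method])
    return sorted(candidates, key=lambda c: -c["score"])


# Hypothetical call graph and category labels.
graph = {
    "main": ["load_config", "handle_request"],
    "handle_request": ["save_record", "format_reply"],
    "save_record": ["db_insert"],
}
method_categories = {"handle_request": ["api_web"], "db_insert": ["storage_db"]}

flows = find_runtime_flows("main", graph, method_categories)
print(flows[0])
```

The top candidate is the four-step path ending in `db_insert`: length 4, plus 4 × 2 distinct categories, plus 5 × 2 important ones, for a score of 22. The two shorter paths that only touch `api_web` score 12 each.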

find_data_flows is the mirror image: instead of starting from entry points, it starts from calls that look like **data sources** (request, input, argv, env, ...) and BFS-searches forward until it reaches calls that look like **data sinks** (write, save, insert, chat, post, ...).

```
source: read user input
    |
    v
  [ some processing methods ]
    |
    v
sink: save to DB / send to LLM
```

Important nuance the README states explicitly and the code backs up: **these are graph-derived candidates, not proven runtime traces.**

Joern performs static analysis; it never executes the code.

A BFS path through the call graph is a *plausible* flow, not a guaranteed one.
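The source-to-sink search can be sketched the same way. This is a simplification under stated assumptions: it matches hints against method names in a call graph (the repo works from call sites), the hint tuples are just the examples quoted above, and `find_data_flow_candidates` and the sample graph are hypothetical.

```python
from collections import deque

SOURCE_HINTS = ("request", "input", "argv", "env")
SINK_HINTS = ("write", "save", "insert", "chat", "post")


def looks_like(name: str, hints: tuple[str, ...]) -> bool:
    return any(hint in name.lower() for hint in hints)


def find_data_flow_candidates(graph: dict[str, list[str]], max_depth: int = 5) -> list[list[str]]:
    """BFS forward from source-like methods until a sink-like method is reached."""
    flows = []
    for source in (m for m in graph if looks_like(m, SOURCE_HINTS)):
        queue = deque([[source]])
        while queue:
            path = queue.popleft()
            last = path[-1]
            if len(path) > 1 and looks_like(last, SINK_HINTS):
                flows.append(path)  # plausible flow, not a proven one
                continue
            if len(path) >= max_depth:
                continue
            for nxt in graph.get(last, []):
                if nxt not in path:
                    queue.append(path + [nxt])
    return flows


# Hypothetical call graph.
graph = {
    "read_request": ["normalize"],
    "normalize": ["save_record", "log_debug"],
}

print(find_data_flow_candidates(graph))
```

Nothing here checks that the value read in `read_request` is the one passed to `save_record`. The path only says the call graph allows it, which is why the output is labeled a candidate.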

All of the analysis above gets assembled in summarization/facts_builder.py into a single summary_facts dictionary:

```python
def build_summary_facts(repo_path: Path, facts: dict) -> dict:
    repo_map = build_repo_map(facts.get("files", []))
    architecture = derive_architecture(facts)
    return {
        "repo_name": repo_path.name,
        "repo_map": repo_map,
        "architecture_signals": architecture,
        "entry_points": facts.get("entry_candidates", [])[:40],
        "critical_runtime_flow_candidates": find_runtime_flows(facts),
        "critical_data_flow_candidates": find_data_flows(facts),
        "important_symbols": important_symbols(facts, architecture),
        "limits": {"note": "This is static analysis. Runtime/data flows are graph-derived candidates, not guaranteed actual production traces."},
    }
```

**Notice what’s not in here:**

important_symbols is deliberately filtered down to only the methods/types that live in the already-identified “central files”, another compression step that keeps the eventual LLM prompt small and focused.

This dictionary, not the repo itself, is what the LLM will actually see.
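The filtering idea behind important_symbols fits in a few lines. A minimal sketch, assuming each method record carries a `filename` and that the central files arrive as a plain list; the function name, its signature, and the `limit` default are mine, not the repo's.

```python
def pick_important_symbols(methods: list[dict], central_files: list[str], limit: int = 20) -> list[dict]:
    """Keep only symbols defined in files already ranked as central."""
    central = set(central_files)
    return [m for m in methods if m["filename"] in central][:limit]


# Hypothetical method records.
methods = [
    {"name": "run", "filename": "cli/main.py"},
    {"name": "slugify", "filename": "utils/text.py"},
    {"name": "generate_repo_summary", "filename": "llm/client.py"},
]

kept = pick_important_symbols(methods, central_files=["cli/main.py", "llm/client.py"])
print([m["name"] for m in kept])
```

The utility helper drops out, so the prompt spends its tokens on symbols from files the graph already flagged as mattering.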

summarization/prompts.py

This module builds the final prompt, and it’s worth reading closely because it shows how to constrain an LLM rather than just hope it behaves:

```python
return f"""You are generating a repository summary using Joern Code Property Graph facts.
Use only the supplied graph facts.
Do not invent files, folders, APIs, tests, classes, functions, runtime flows, or data flows.
Separate detected facts from inferred conclusions.
For runtime flows and data flows, include only the critical ones, not every path.
If something is weakly supported, say "likely".
If something is not supported, say "not detected".
Return Markdown with exactly these sections:
# Repository Summary
## 1. Repository Purpose
## 2. Repository Map
## 3. Architecture
## 4. Critical Runtime Flows
## 5. Critical Data Flows
## 6. Important Files
## 7. Important Symbols
## 8. Not Detected / Unknown
...
"""
```
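The excerpt elides how the facts themselves get attached to those rules. One plausible way, and this is purely my assumption rather than what the repo does, is to serialize the dictionary as JSON and append it below the instructions. `build_prompt_sketch` and the shortened rule text are hypothetical.

```python
import json

# Shortened version of the rules quoted above.
RULES = (
    "Use only the supplied graph facts.\n"
    "Do not invent files, folders, APIs, tests, classes, functions, runtime flows, or data flows.\n"
    'If something is weakly supported, say "likely".\n'
    'If something is not supported, say "not detected".'
)


def build_prompt_sketch(summary_facts: dict) -> str:
    """Put the constraints first, then the facts as deterministic JSON."""
    facts_json = json.dumps(summary_facts, indent=2, sort_keys=True)
    return f"{RULES}\n\nGraph facts (JSON):\n{facts_json}\n"


prompt = build_prompt_sketch({"repo_name": "demo", "entry_points": []})
print(prompt)
```

Sorting the keys keeps the prompt byte-for-byte stable across runs on the same facts, which makes it easier to tell a change in the facts apart from model variance.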

A few important details:

llm/client.py

llm/client.py then does the boring-but-important part: it’s a thin OpenAI-SDK wrapper that works with any OpenAI-compatible endpoint: Groq, OpenRouter, Gemini’s OpenAI-compatible endpoint, or Cerebras, controlled purely through .env config:

```
LLM_PROVIDER=groq
LLM_API_KEY=your_api_key_here
LLM_MODEL=llama-3.3-70b-versatile
```

```python
def generate_repo_summary(summary_facts: dict, config: LLMConfig) -> str:
    client = make_client(config)
    response = client.chat.completions.create(
        model=config.model,
        temperature=config.temperature,
        max_tokens=config.max_tokens,
        messages=[
            {"role": "system", "content": "You are a precise static-analysis repo summarizer. You must not hallucinate unsupported repo facts."},
            {"role": "user", "content": build_summary_prompt(summary_facts)},
        ],
    )
    return response.choices[0].message.content or ""
```

```bash
uv run code-graph-ai-summarizer /path/to/local/repo
```
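"Any OpenAI-compatible endpoint" mostly comes down to swapping a base URL. Here is a stdlib-only sketch of how the three .env values could be resolved into client settings. The base URLs are what I believe each provider documents at the time of writing, so verify them before relying on this; `resolve_llm_settings` is a hypothetical helper, not the repo's `make_client`.

```python
# Assumed base URLs; check each provider's docs, since these can change.
PROVIDER_BASE_URLS = {
    "groq": "https://api.groq.com/openai/v1",
    "openrouter": "https://openrouter.ai/api/v1",
    "gemini": "https://generativelanguage.googleapis.com/v1beta/openai/",
    "cerebras": "https://api.cerebras.ai/v1",
}


def resolve_llm_settings(env: dict[str, str]) -> dict[str, str]:
    """Map LLM_PROVIDER / LLM_API_KEY / LLM_MODEL to concrete client settings."""
    provider = env.get("LLM_PROVIDER", "").lower()
    if provider not in PROVIDER_BASE_URLS:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
    return {
        "base_url": PROVIDER_BASE_URLS[provider],
        "api_key": env["LLM_API_KEY"],
        "model": env["LLM_MODEL"],
    }


settings = resolve_llm_settings({
    "LLM_PROVIDER": "groq",
    "LLM_API_KEY": "your_api_key_here",
    "LLM_MODEL": "llama-3.3-70b-versatile",
})
print(settings["base_url"])
```

In a real run you would pass `dict(os.environ)` and hand `base_url` and `api_key` to the OpenAI SDK's client constructor, with `model` going into the chat completion call.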

Walking through the run() function in cli/main.py end to end:

**Third, repo_summary.md: the final human-readable summary.**

```
outputs/<repo-name>/
├── joern_facts.json     <- raw, large, exact
├── summary_facts.json   <- compact, ranked, curated
└── repo_summary.md      <- narrated, by the LLM
```

That intermediate summary_facts.json is, in practice, one of the most useful files this tool produces. It's the auditable middle layer: **if the final Markdown says something surprising, you can open this file and check whether it’s actually grounded** in a detected signal or whether the LLM drifted.
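That audit can even be partly automated. Below is a rough sketch that flags file paths mentioned in the Markdown but absent from the facts. It leans on two assumptions of mine: that the summary wraps paths in backticks, and that the facts expose a flat `repo_map["files"]` list. The real structure of summary_facts.json may differ, so adapt the lookup accordingly.

```python
import re


def ungrounded_paths(summary_md: str, summary_facts: dict) -> list[str]:
    """Return backticked file paths in the summary that the facts never mention."""
    known = set(summary_facts.get("repo_map", {}).get("files", []))
    mentioned = set(re.findall(r"`([\w./-]+\.\w+)`", summary_md))
    return sorted(mentioned - known)


# Hypothetical facts and summary text.
facts = {"repo_map": {"files": ["cli/main.py", "llm/client.py"]}}
summary = (
    "The entry point is `cli/main.py`, which calls `llm/client.py`. "
    "Tests live in `tests/test_cli.py`."
)

print(ungrounded_paths(summary, facts))
```

Here the check surfaces `tests/test_cli.py`, the classic invented test suite. In practice you would load both inputs from the outputs folder instead of inline strings.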

The specific use case here is repo summarization, but the underlying pattern is broadly applicable to anyone building tools on top of LLMs:

```bash
git clone <your-repo-url>
cd code-graph-ai-summarizer
uv sync
cp .env.example .env
# edit .env: set LLM_PROVIDER, LLM_API_KEY, LLM_MODEL

# in a separate terminal
joern --server

# in a separate terminal (if using ollama)
ollama serve

# back in your main terminal
uv run code-graph-ai-summarizer /path/to/any/local/repo
```

Point it at a small repo first, then open the generated repo_summary.md.

Full credit where it’s due: I didn’t write this by hand, line by line, heroically, at 2 AM, fueled by coffee.

I acted as the orchestrator, pointing ChatGPT and Claude at the problem, arguing with them when they hallucinated a function that didn’t exist, and stitching their outputs into **something that actually runs and is a useful application**. They wrote the code; I supplied the opinions, the rejections, and the “no, that’s not what I meant” loop until it converged.

So consider [this repo](https://github.com/shaktiwadekar9/code-graph-ai-summarizer) a small case study in human + AI pair programming, minus the part where the AI gets annoyed at my code review comments.

[Stop Letting LLMs Hallucinate Your Codebase: A Graph-First Way to Summarize Repos](https://pub.towardsai.net/stop-letting-llms-hallucinate-your-codebase-a-graph-first-way-to-summarize-repos-8a803db9c931) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.