Building a RAG-Powered Code Reviewer That Actually Understands Your Codebase

wpnews.pro

Most AI code review tools give you generic advice. "Add type hints." "Handle exceptions." Useful, sure — but the same advice you'd get from any linter or a quick ChatGPT prompt.

What if your AI reviewer could say: "Add type hints to the constructor — consistent with how OrderProcessor.php and OrderPlaceAfter.php already do it in your project"?

That's the difference between a generic AI tool and one that understands your codebase. I built the latter. Here's how.

I've been a Magento/PHP developer for 12+ years. Magento has complex architectural patterns — plugins, observers, dependency injection via XML, area-scoped configuration. When tools like CodeRabbit or GitHub Copilot review Magento code, they're getting better at repository-wide context — Copilot indexes your workspace, CodeRabbit reads related files. But they still treat Magento XML configurations as static text files rather than active dependency injection and event routing maps.

They don't inherently know that:

- A class declared as a <plugin> in di.xml must implement at least one interceptor method (before, after, or around) matching the target class's methods
- An observer registered for sales_order_place_after must implement Magento\Framework\Event\ObserverInterface
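To make the first rule concrete, here is a minimal sketch of a check that a plugin class declares at least one interceptor method for its target. It uses a regular expression purely for brevity; the function names and the regex are illustrative assumptions, not the tool's actual implementation (which, as described later, parses PHP with tree-sitter).

```python
import re

# Assumption: interceptors are public methods named before<Method>,
# after<Method>, or around<Method>, following Magento's plugin convention.
INTERCEPTOR_RE = re.compile(
    r"public\s+function\s+(before|after|around)([A-Z]\w*)\s*\("
)


def interceptor_methods(php_source: str) -> list[str]:
    """Return the target method names this class intercepts (afterSave -> save)."""
    return [
        name[0].lower() + name[1:]
        for _prefix, name in INTERCEPTOR_RE.findall(php_source)
    ]


def is_valid_plugin(php_source: str, target_methods: set[str]) -> bool:
    """True if at least one interceptor matches a method of the target class."""
    return any(m in target_methods for m in interceptor_methods(php_source))
```

A class with only `afterSave` passes when the target exposes `save`, and fails when the target has no matching method, which is the kind of mismatch a text-only reviewer would not notice.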

I wanted a tool that understands these things. Not because it was trained on Magento docs, but because it read my actual project and understands the patterns we follow.

RAG (Retrieval Augmented Generation) is a pattern where you search for relevant context before sending a query to an LLM. For code review, this means retrieving relevant code from the project before the review, and reading di.xml, events.xml, and module.xml to understand how classes are registered (as plugins, observers, preferences).

The key insight: the LLM doesn't need to be trained on Magento. It just needs to see how your project does things, and it can spot inconsistencies.
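The retrieve-then-ask flow can be sketched as a small prompt builder. This is a hypothetical illustration: the `Chunk` fields, the prompt wording, and the default of five chunks are assumptions chosen for the example, not the project's real prompt.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str          # file the chunk came from
    tag: str           # framework tag such as "[PLUGIN for ...]"; may be empty
    code: str          # the chunk's source text
    similarity: float  # retrieval score, higher is more relevant


def build_review_prompt(code_under_review: str, context: list[Chunk],
                        max_chunks: int = 5) -> str:
    """Put the most relevant project chunks in front of the code to review."""
    top = sorted(context, key=lambda c: c.similarity, reverse=True)[:max_chunks]
    parts = [
        "You are reviewing PHP code. Treat the project examples below "
        "as the reference for this codebase's conventions.",
        "",
    ]
    for i, chunk in enumerate(top, 1):
        parts.append(f"--- Example {i}: {chunk.path} {chunk.tag}".rstrip())
        parts.append(chunk.code)
        parts.append("")
    parts.append("--- Code to review")
    parts.append(code_under_review)
    return "\n".join(parts)
```

Because the examples carry their file paths, the model can cite them by name in its review, which is what makes the feedback project-specific.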

A fair question at this point: Gemini offers a 1-2M token context window. Why not dump the entire module in there?

For a single small module (20-50 KB), you absolutely could. But this approach is designed for a different scale: a real Magento project with vendor/magento/ (200+ core modules) plus app/code/ (dozens of custom modules). That's megabytes of code. Sending all of it into context for every review is slow, expensive, and, as my experiments showed, counterproductive.

When I tested RAG with irrelevant context (a "customer hobby" module as context for an "order processing" review), the results were worse than simple mode. The LLM got confused by unrelated code. RAG gives you surgical, relevant context — 5 precisely selected code chunks instead of 500 files where 490 are noise.

That said, for small projects, long-context is simpler and may be sufficient. RAG pays off at scale.

The system is a CLI on top of three layers:

┌─────────────────────────────────┐
│  CLI: review / index / search   │
├─────────────────────────────────┤
│  Agent Layer                    │
│  ├── RAG Review Service         │
│  ├── Search Service             │
│  └── Review Service (simple)    │
├─────────────────────────────────┤
│  Core Engine                    │
│  ├── LLM Providers (Gemini...)  │
│  ├── Embedding (BGE-small)      │
│  └── PHP Parser (tree-sitter)   │
├─────────────────────────────────┤
│  Framework Layer                │
│  ├── Magento Config Parser      │
│  └── Module Indexer             │
└─────────────────────────────────┘

Why layers matter: The core engine knows nothing about PHP or Magento. The PHP parser knows nothing about Magento. Only the framework layer has Magento-specific knowledge. Adding Symfony or Laravel support reuses ~80% of the codebase.

LLM providers are behind an abstraction — switching from Gemini to Claude is one environment variable. Same for embeddings. This was a deliberate architectural choice: provider-agnostic from day one.
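A provider abstraction of this kind could look roughly like the sketch below. The environment variable name `LLM_PROVIDER`, the class names, and the registry are assumptions for illustration; the real project may organize this differently, and the provider bodies are deliberately left as stubs rather than guessing at any vendor SDK.

```python
import os
from typing import Callable, Mapping, Optional, Protocol


class LLMProvider(Protocol):
    """What the review services need from any LLM backend."""
    name: str

    def complete(self, prompt: str) -> str: ...


class GeminiProvider:
    name = "gemini"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the Gemini API here")


class ClaudeProvider:
    name = "claude"

    def complete(self, prompt: str) -> str:
        raise NotImplementedError("call the Claude API here")


# Hypothetical registry: adding a provider means adding one entry.
PROVIDERS: dict[str, Callable[[], LLMProvider]] = {
    "gemini": GeminiProvider,
    "claude": ClaudeProvider,
}


def get_provider(env: Optional[Mapping[str, str]] = None) -> LLMProvider:
    """Pick the provider from LLM_PROVIDER (assumed name), defaulting to Gemini."""
    env = os.environ if env is None else env
    key = env.get("LLM_PROVIDER", "gemini").lower()
    if key not in PROVIDERS:
        raise ValueError(
            f"Unknown LLM_PROVIDER {key!r}; expected one of {sorted(PROVIDERS)}"
        )
    return PROVIDERS[key]()
```

The review services depend only on `LLMProvider.complete`, so switching backends never touches review logic.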

When you run mage-audit index ./my-module, the system:

1. Parses PHP files with tree-sitter, not regex or string splitting. Tree-sitter builds an AST (Abstract Syntax Tree), so we extract classes, methods, and functions at their logical boundaries. A 500-line file becomes 15-20 separate chunks, each a complete logical unit.

2. Parses Magento XML configs as structured data. di.xml becomes a map of plugins (with target classes, sort orders, disabled flags) and preferences (interface-to-implementation mappings). events.xml becomes a list of observers with event names, handler classes, and methods. This isn't text parsing; it's structured extraction that understands what these configs mean in the Magento DI/event system.

3. Enriches chunks with framework context. If a class is registered as a plugin for OrderRepositoryInterface, the chunk gets tagged: [PLUGIN for Magento\Sales\Api\OrderRepositoryInterface]. This tag is included in the embedding text, so searching for "plugin for order save" directly matches it.

4. Generates embeddings and stores them in pgvector. Each chunk becomes a 384-dimensional vector via BGE-small. This is an intentional trade-off: BGE-small is a general-purpose text embedding model, not a code-specific one. Models like jina-embeddings-v2-base-code or voyage-code-3 would likely perform better on code search. But BGE-small runs locally on CPU with zero API cost, which is critical for a zero-budget MVP. The architecture supports swapping models with one config change, so upgrading is trivial when budget allows.
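Steps 2 and 3 above can be sketched with the standard library alone. This is a simplified illustration, not the project's parser: the dataclass, function names, and the way the tag is prepended to the embedding text are assumptions, and the sketch ignores virtualType nodes and area-scoped config merging.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class PluginDecl:
    target: str                 # class or interface being intercepted
    name: str                   # plugin name attribute
    plugin_class: str           # class implementing the plugin
    sort_order: Optional[int]   # None when sortOrder is absent
    disabled: bool


def parse_di_xml(xml_text: str) -> tuple[list[PluginDecl], dict[str, str]]:
    """Extract plugin declarations and interface preferences from di.xml."""
    root = ET.fromstring(xml_text)
    plugins = []
    for type_node in root.iter("type"):
        for node in type_node.findall("plugin"):
            order = node.get("sortOrder")
            plugins.append(PluginDecl(
                target=type_node.get("name", "").lstrip("\\"),
                name=node.get("name", ""),
                plugin_class=node.get("type", "").lstrip("\\"),
                sort_order=int(order) if order is not None else None,
                disabled=node.get("disabled", "false") == "true",
            ))
    preferences = {
        node.get("for", "").lstrip("\\"): node.get("type", "").lstrip("\\")
        for node in root.iter("preference")
    }
    return plugins, preferences


def framework_tags(class_name: str, plugins: list[PluginDecl]) -> list[str]:
    """Tags for a class, e.g. '[PLUGIN for Magento\\Sales\\Api\\...]'."""
    wanted = class_name.lstrip("\\")
    return [
        f"[PLUGIN for {p.target}]"
        for p in plugins
        if p.plugin_class == wanted and not p.disabled
    ]


def embedding_text(tags: list[str], code: str) -> str:
    """Prepend tags so the embedding captures the class's framework role."""
    return "\n".join([*tags, code])
```

Putting the tag into the text that gets embedded is what lets a query phrased in framework terms land on the right chunk even when the PHP source never mentions the word "plugin".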

Simple RAG uses one search query. We use three:

- Class name search
- Code similarity search
- Import search: from use statements, find code that uses the same interfaces

Results are deduplicated and sorted by similarity. This catches context that a single strategy would miss: a class name search finds the exact interface implementation, while a code similarity search finds structurally similar methods.
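The merge step (deduplicate, then sort by similarity) might look like the following sketch. The `Hit` structure, the strategy labels, and the rule of keeping the highest score per chunk are assumptions; the article only states that results are deduplicated and sorted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    chunk_id: str      # stable identifier of the indexed chunk
    similarity: float  # score from the vector search, higher is better
    strategy: str      # which search produced the hit (label is illustrative)


def merge_hits(*result_lists: list[Hit], limit: int = 5) -> list[Hit]:
    """Combine hits from several searches into one ranked, duplicate-free list."""
    best: dict[str, Hit] = {}
    for hits in result_lists:
        for hit in hits:
            current = best.get(hit.chunk_id)
            # Assumed rule: a chunk found by several strategies keeps its best score.
            if current is None or hit.similarity > current.similarity:
                best[hit.chunk_id] = hit
    ranked = sorted(best.values(), key=lambda h: h.similarity, reverse=True)
    return ranked[:limit]
```

One caveat with this approach: scores from differently phrased queries are compared directly, which is only reasonable when all searches use the same embedding model and distance metric.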

I tested on a PHP file with 13 known issues (SQL injection, architecture violations, missing error handling, etc.) — issues I identified manually as a senior Magento developer. This hand-curated list serves as ground truth for evaluation.

Metric	Simple Mode	RAG Mode
Issues detected (recall)	54% (7/13)	69% (9/13)
Project-specific references	1	10
Input tokens	637	1,495

RAG found 15 percentage points more issues. But the real difference is qualitative.
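For reference, the recall figures follow directly from the counts in the table. The short sketch below reproduces the arithmetic; the function names are illustrative.

```python
def recall(found: int, total: int) -> float:
    """Share of known issues that the reviewer detected."""
    return found / total


def summary(simple_found: int = 7, rag_found: int = 9, total: int = 13) -> str:
    simple = recall(simple_found, total)
    rag = recall(rag_found, total)
    gain = (rag - simple) * 100  # difference in percentage points
    return f"simple: {simple:.0%}, rag: {rag:.0%}, gain: {gain:.0f} points"


print(summary())  # simple: 54%, rag: 69%, gain: 15 points
```

With 13 issues in the ground truth, each issue is worth about 7.7 percentage points, so the 15-point gain corresponds to two additional issues found on this one file.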

Simple mode says:

"Add type hints to constructor parameters."

RAG mode says:

"Add type hints to constructor parameters, consistent with Model\OrderProcessor.php and Observer\OrderPlaceAfter.php from your project."

RAG mode also found issues that Simple mode missed entirely, such as the method always returning true regardless of outcome, or the lack of error handling around repository calls. These are the kinds of issues that only become visible when you see how the rest of the project handles similar cases.

1. Context relevance is everything. When I tested RAG with an unrelated module as context (a "customer hobby" module for an "order processing" review), the results were worse than simple mode. The LLM got confused by irrelevant code. When the context was from a related module (also about orders), results improved dramatically. This confirms: retrieval quality determines review quality. Garbage context in → garbage review out.

2. LLMs don't return clean JSON. Even with explicit instructions to "return ONLY valid JSON", Gemini would add markdown fences, inconsistently escape backslashes in PHP namespaces (\Magento\Sales contains invalid JSON escapes), and sometimes swap field values (putting a category in the severity field). I built a character-by-character JSON fixer and fallback field mapping for misplaced values.

A note on this: Gemini does offer response_schema (Structured Outputs), which enforces valid JSON at the token generation level. I chose prompt-based JSON instead for a specific reason: provider agnosticism. The same prompt works with Gemini, Claude, and OpenAI without changes. response_schema is a Gemini-specific parameter, and other providers expose structured output through their own, different APIs. For a production system targeting one provider, Structured Outputs would reduce parsing issues. For a multi-provider architecture, defensive parsing is the more portable approach.

3. Embeddings find similar code, not bugs. Searching for "SQL injection vulnerability" returned the actual vulnerable function as the 3rd result, not the 1st. Embeddings measure text similarity, not security analysis. That's why you need both retrieval (find relevant code) AND generation (analyze it with LLM). Each alone is weak; together they're strong. Using code-specific embedding models (Voyage Code, Jina Code) instead of the general-purpose BGE-small would likely improve retrieval quality — that's a planned upgrade.

4. Free tiers are enough for building. The entire project runs on Google Gemini free tier (1,500 requests/day), local embeddings (BGE-small, no API cost), and PostgreSQL + pgvector in Docker. Total spend: $0. You don't need an API budget to build serious AI applications — but you do need architecture that lets you upgrade when budget appears.
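The defensive parsing from lesson 2 can be sketched as follows. This is a regex-based simplification rather than the character-by-character fixer the project uses, and the list of severity values is an assumption made for the example; only the field names severity and category come from the description above.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence, built this way to keep the example readable
# A backslash followed by either a full \uXXXX escape or any single character.
ESCAPE_RE = re.compile(r"\\(u[0-9a-fA-F]{4}|.)", re.DOTALL)
VALID_ESCAPES = set('"\\/bfnrt')
SEVERITIES = {"critical", "high", "medium", "low"}  # assumed value set


def strip_fences(text: str) -> str:
    """Remove a surrounding markdown fence such as a json-tagged code block."""
    text = text.strip()
    if text.startswith(FENCE):
        text = text.split("\n", 1)[1] if "\n" in text else ""
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[: -len(FENCE)]
    return text.strip()


def fix_backslashes(text: str) -> str:
    """Double any backslash that does not start a valid JSON escape."""
    def repl(match: re.Match) -> str:
        tail = match.group(1)
        if len(tail) == 5 or tail in VALID_ESCAPES:
            return match.group(0)   # already a valid escape, keep as-is
        return "\\\\" + tail        # e.g. \M becomes \\M
    return ESCAPE_RE.sub(repl, text)


def normalize_issue(issue: dict) -> dict:
    """Swap severity and category back when the model misplaced them."""
    severity = str(issue.get("severity", "")).lower()
    category = str(issue.get("category", "")).lower()
    if severity not in SEVERITIES and category in SEVERITIES:
        return {**issue, "severity": category, "category": severity}
    return issue


def parse_llm_json(text: str):
    return json.loads(fix_backslashes(strip_fences(text)))
```

A known limitation of this sketch: a namespace segment starting with a lowercase b, f, n, r, or t (for example \Vendor\tools) looks like a valid escape and will be decoded as a control character. Capitalized segments, the usual PHP convention, are repaired correctly.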

The project is open source: github.com/Aquarvin/mage-audit

I'm a senior PHP/Magento developer transitioning into AI Engineering. This project is both a real tool and a learning vehicle — every architectural decision, experiment, and dead end is documented in the repo. If you're hiring AI Engineers or interested in discussing RAG for code analysis — reach out on LinkedIn or Telegram.

Source: dev.to (original article)
