Building a RAG-Powered Code Reviewer That Actually Understands Your Codebase

A developer with 12+ years of Magento experience built a RAG-powered code review tool that understands project-specific patterns, not just generic advice. The tool indexes a codebase using tree-sitter for PHP parsing and BGE-small for embeddings, then retrieves relevant context before querying an LLM. It outperforms long-context approaches at scale by providing surgical, relevant code snippets instead of noise.

Most AI code review tools give you generic advice. "Add type hints." "Handle exceptions." Useful, sure, but it's the same advice you'd get from any linter or a quick ChatGPT prompt.

What if your AI reviewer could say: "Add type hints to the constructor, consistent with how `OrderProcessor.php` and `OrderPlaceAfter.php` already do it in your project"?

That's the difference between a generic AI tool and one that understands your codebase. I built the latter. Here's how.

I've been a Magento/PHP developer for 12+ years. Magento has complex architectural patterns: plugins, observers, dependency injection via XML, area-scoped configuration. Tools like CodeRabbit and GitHub Copilot are getting better at repository-wide context when they review Magento code: Copilot indexes your workspace, CodeRabbit reads related files. But they still treat Magento XML configurations as static text files rather than active dependency injection and event routing maps. They don't inherently know that:

- A class registered as a `<plugin>` in `di.xml` must implement at least one interceptor method (`before`, `after`, or `around`) matching the target class's methods
- An observer registered for `sales_order_place_after` must implement `Magento\Framework\Event\ObserverInterface`

I wanted a tool that understands these things. Not because it was trained on Magento docs, but because it read my actual project and understands the patterns we follow.

RAG (Retrieval Augmented Generation) is a pattern where you search for relevant context before sending a query to an LLM. For code review, this means, among other things, reading `di.xml`, `events.xml`, and `module.xml` to understand how classes are registered as plugins, observers, and preferences.

The key insight: the LLM doesn't need to be trained on Magento. It just needs to see how your project does things, and it can spot inconsistencies.

Gemini offers a 1-2M token context window. Why not dump the entire module in there? Fair question. For a single small module (20-50 KB), you absolutely could.
But this approach is designed for a different scale: a real Magento project with `vendor/magento/` (200+ core modules) plus `app/code/` (dozens of custom modules). That's megabytes of code. Sending all of it into context for every review is slow, expensive, and, as my experiments showed, counterproductive. When I tested RAG with irrelevant context (a "customer hobby" module as context for an "order processing" review), the results were worse than simple mode. The LLM got confused by unrelated code.

RAG gives you surgical, relevant context: 5 precisely selected code chunks instead of 500 files where 490 are noise.

That said, for small projects, long-context is simpler and may be sufficient. RAG pays off at scale.

The system has three layers:

    ┌─────────────────────────────────┐
    │ CLI: review / index / search    │
    ├─────────────────────────────────┤
    │ Agent Layer                     │
    │ ├── RAG Review Service          │
    │ ├── Search Service              │
    │ └── Review Service (simple)     │
    ├─────────────────────────────────┤
    │ Core Engine                     │
    │ ├── LLM Providers (Gemini...)   │
    │ ├── Embedding (BGE-small)       │
    │ └── PHP Parser (tree-sitter)    │
    ├─────────────────────────────────┤
    │ Framework Layer                 │
    │ ├── Magento Config Parser       │
    │ └── Module Indexer              │
    └─────────────────────────────────┘

Why layers matter: The core engine knows nothing about PHP or Magento. The PHP parser knows nothing about Magento. Only the framework layer has Magento-specific knowledge. Adding Symfony or Laravel support reuses ~80% of the codebase.

LLM providers are behind an abstraction: switching from Gemini to Claude is one environment variable. Same for embeddings. This was a deliberate architectural choice: provider-agnostic from day one.

When you run `mage-audit index ./my-module`, the system:

1. Parses PHP files with tree-sitter, not regex, not string splitting. Tree-sitter builds an AST (Abstract Syntax Tree), so we extract classes, methods, and functions at their logical boundaries. A 500-line file becomes 15-20 separate chunks, each a complete logical unit.
2. Parses Magento XML configs as structured data. `di.xml` becomes a map of plugins (with target classes, sort orders, disabled flags) and preferences (interface-to-implementation mappings). `events.xml` becomes a list of observers with event names, handler classes, and methods. This isn't text parsing; it's structured extraction that understands what these configs mean in the Magento DI/event system.
3. Enriches chunks with framework context. If a class is registered as a plugin for `OrderRepositoryInterface`, the chunk gets tagged: `PLUGIN for Magento\Sales\Api\OrderRepositoryInterface`. This tag is included in the embedding text, so searching for "plugin for order save" directly matches it.
4. Generates embeddings and stores them in pgvector. Each chunk becomes a 384-dimensional vector via BGE-small. This is an intentional trade-off: BGE-small is a general-purpose text embedding model, not a code-specific one. Models like jina-embeddings-v2-base-code or voyage-code-3 would likely perform better on code search. But BGE-small runs locally on CPU with zero API cost, which is critical for a zero-budget MVP. The architecture supports swapping models with one config change, so upgrading is trivial when budget allows.

Simple RAG uses one search query. We use three: a code similarity search, a class name search, and a dependency search that takes the file's `use` statements and finds code that uses the same interfaces. Results are deduplicated and sorted by similarity. This catches context that a single strategy would miss: a class name search finds the exact interface implementation, while a code similarity search finds structurally similar methods.

I tested on a PHP file with 13 known issues (SQL injection, architecture violations, missing error handling, etc.), all of which I identified manually as a senior Magento developer. This hand-curated list serves as ground truth for evaluation.
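To make the recall metric used in this evaluation concrete, here is a minimal Python sketch. The issue IDs are hypothetical placeholders, since the article does not list the 13 issues individually. Only the counts (7 and 9 hits out of 13) come from the reported results.

```python
def recall(found: set, ground_truth: set) -> float:
    """Fraction of hand-labelled issues that the reviewer actually reported."""
    if not ground_truth:
        raise ValueError("ground truth must contain at least one issue")
    # Only issues present in the ground truth count as hits;
    # extra findings do not raise recall.
    return len(found & ground_truth) / len(ground_truth)


# Hypothetical IDs standing in for the 13 manually identified issues.
GROUND_TRUTH = {f"issue-{n}" for n in range(1, 14)}

# Assumed hit sets, chosen only to match the reported counts (7 and 9 of 13).
simple_hits = {f"issue-{n}" for n in range(1, 8)}
rag_hits = {f"issue-{n}" for n in range(1, 10)}

print(f"simple mode recall: {recall(simple_hits, GROUND_TRUTH):.0%}")  # 54%
print(f"RAG mode recall:    {recall(rag_hits, GROUND_TRUTH):.0%}")     # 69%
```

Recall says nothing about false positives, so it is best read next to a qualitative look at what each mode actually reported.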
| Metric | Simple Mode | RAG Mode |
|---|---|---|
| Issues detected (recall) | 54% (7/13) | 69% (9/13) |
| Project-specific references | 1 | 10 |
| Input tokens | 637 | 1,495 |

RAG found 15 percentage points more issues. But the real difference is qualitative.

Simple mode says: "Add type hints to constructor parameters."

RAG mode says: "Add type hints to constructor parameters, consistent with `Model\OrderProcessor.php` and `Observer\OrderPlaceAfter.php` from your project."

RAG mode also found issues that Simple mode missed entirely, like the method always returning true regardless of outcome, or the lack of error handling around repository calls. These are the kinds of issues that only become visible when you see how the rest of the project handles similar cases.

1. Context relevance is everything. When I tested RAG with an unrelated module as context (a "customer hobby" module for an "order processing" review), the results were worse than simple mode. The LLM got confused by irrelevant code. When the context was from a related module (also about orders), results improved dramatically. This confirms: retrieval quality determines review quality. Garbage context in → garbage review out.

2. LLMs don't return clean JSON. Even with explicit instructions to "return ONLY valid JSON", Gemini would add markdown fences, inconsistently escape backslashes in PHP namespaces (`\Magento\Sales` contains invalid JSON escapes), and sometimes swap field values (putting a category in the severity field). I built a character-by-character JSON fixer and a fallback field mapping for misplaced values.

A note on this: Gemini does offer `response_schema` (Structured Outputs), which enforces valid JSON at the token generation level. I chose prompt-based JSON instead for a specific reason: provider agnosticism. The same prompt works with Gemini, Claude, and OpenAI without changes, whereas `response_schema` is a Gemini-specific API. For a production system targeting one provider, Structured Outputs would reduce parsing issues.
For a multi-provider architecture, defensive parsing is the more portable approach.

3. Embeddings find similar code, not bugs. Searching for "SQL injection vulnerability" returned the actual vulnerable function as the 3rd result, not the 1st. Embeddings measure text similarity, not security analysis. That's why you need both retrieval (find relevant code) AND generation (analyze it with an LLM). Each alone is weak; together they're strong. Using code-specific embedding models (Voyage Code, Jina Code) instead of the general-purpose BGE-small would likely improve retrieval quality; that's a planned upgrade.

4. Free tiers are enough for building. The entire project runs on the Google Gemini free tier (1,500 requests/day), local embeddings (BGE-small, no API cost), and PostgreSQL + pgvector in Docker. Total spend: $0. You don't need an API budget to build serious AI applications, but you do need architecture that lets you upgrade when budget appears.

The project is open source: https://github.com/Aquarvin/mage-audit

I'm a senior PHP/Magento developer transitioning into AI Engineering. This project is both a real tool and a learning vehicle: every architectural decision, experiment, and dead end is documented in the repo. If you're hiring AI Engineers or interested in discussing RAG for code analysis, reach out on LinkedIn or Telegram.
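For readers who want to see what the defensive JSON parsing from lesson 2 might look like, here is a minimal Python sketch. It is not the code from the repo: it uses a regular expression where the article describes a character-by-character fixer, and the `SEVERITIES` and `CATEGORIES` vocabularies and all function names are hypothetical. It covers the three failure modes mentioned: markdown fences, unescaped backslashes in PHP namespaces, and a category landing in the severity field.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence marker

# Hypothetical vocabularies; the real tool's values are not given in the article.
SEVERITIES = {"critical", "high", "medium", "low"}
CATEGORIES = {"security", "architecture", "error_handling", "style"}

# Matches either a valid JSON escape (captured, kept as-is) or a lone backslash.
_ESCAPE_RE = re.compile(r'\\(u[0-9a-fA-F]{4}|["\\/bfnrt])|\\')


def strip_fences(text: str) -> str:
    """Remove a surrounding markdown code fence, if the model added one."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag.
        text = text.split("\n", 1)[1] if "\n" in text else ""
    text = text.rstrip()
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    return text.strip()


def repair_escapes(text: str) -> str:
    """Double backslashes that do not start a valid JSON escape sequence.

    Limitation: a namespace segment starting with b, f, n, r or t (for example
    a lowercase "\\tests") already looks like a valid escape and is left alone.
    """
    return _ESCAPE_RE.sub(lambda m: m.group(0) if m.group(1) else "\\\\", text)


def parse_llm_json(raw: str):
    """Parse model output, repairing escapes only if strict parsing fails."""
    text = strip_fences(raw)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return json.loads(repair_escapes(text))


def normalize_issue(issue: dict) -> dict:
    """Swap severity and category back when the model crossed them."""
    severity = str(issue.get("severity", "")).lower()
    category = str(issue.get("category", "")).lower()
    if severity in CATEGORIES and category in SEVERITIES:
        return {**issue, "severity": category, "category": severity}
    return issue


# Example: fenced output with raw PHP namespace backslashes and swapped fields.
raw = (
    FENCE + "json\n"
    + r'{"severity": "security", "category": "high", "file": "\Magento\Sales\Model\Order.php"}'
    + "\n" + FENCE
)
print(normalize_issue(parse_llm_json(raw)))
```

Trying strict parsing first means well-formed responses are never modified, and the repair path only runs on output that would otherwise be rejected.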