{"slug": "i-built-a-private-ai-assistant-that-queries-my-git-history-and-project-data-only", "title": "I Built a Private AI Assistant That Queries My Git History and Project Management Data — Using Only Local LLMs", "summary": "A web developer built a private AI assistant that queries git history and project management data using only local LLMs, with no API keys or cloud services. The system uses a Text-to-SQL approach where a local LLM (qwen2.5-coder:7b) translates natural language questions into SQL queries against a SQLite database containing structured data from git logs and project boards. The developer created Python collectors to populate the database, implemented keyword discovery to help the LLM generate precise SQL, and included automatic retry logic for failed queries, all running locally through Ollama for privacy.", "body_md": "**No API keys. No cloud. All data stays on my machine.**\n\n## The Problem\n\nAs a web developer, I constantly need to answer questions like:\n\n- \"Who committed the most to our main repo this month?\"\n- \"What files were changed for the last campaign launch?\"\n- \"What project tasks are still in progress for the web team?\"\n\nThese answers exist — scattered across `git log`, project management boards, and my own memory. I was tired of digging through terminal output and clicking through boards manually.\n\nSo I built a **natural language interface** that lets me ask these questions in plain English and get instant answers.\n\n## The Architecture: Text-to-SQL, Not Vector RAG\n\nHere's the key insight that shaped the entire project:\n\n**My data is structured, not unstructured.** Commits have authors, dates, and repos. Project tasks have statuses, deadlines, and assignees. This isn't a pile of PDFs — it's relational data that fits naturally into a SQLite database.\n\nTraditional RAG (vector embeddings + similarity search) is built for unstructured documents. 
For structured data, there's a better approach: **Text-to-SQL**.\n\n```\nUser Question\n    ↓\nLocal LLM (generates SQL)\n    ↓\nSQLite Database (executes query)\n    ↓\nLocal LLM (summarizes results)\n    ↓\nHuman-readable Answer\n```\n\nThe LLM doesn't store or memorize my data. It just translates my question into SQL; the system runs the query against SQLite, and the LLM explains the results.\n\n## The Data Pipeline\n\n### Step 1: Collect everything into SQLite\n\nI wrote two Python collectors that populate a single SQLite database:\n\n**Git history collector** (`collect.py`):\n\n- Runs `git log` across multiple repositories\n- Stores commits, file changes, branches, and tags\n- Captures author, date, message, and insertions/deletions per file\n\n**Project management collector** (`collect_pm.py`):\n\n- Queries the project management platform's GraphQL API (Monday.com in my case, but the pattern works for Jira, Linear, etc.)\n- Stores boards, items, and subitems\n- Extracts status, assignee, department, and deadline\n- Flags web-team tasks automatically (`is_web = 1`)\n\nThe result: a single SQLite database holding everything needed to answer cross-cutting questions.\n\n### Step 2: Link git branches to project tasks\n\nThis was the crucial step. Git branches like `feature/example-promo-banner` don't obviously connect to project items like *\"Example Promo Banner — Launch\"*.\n\nI created a `branch_task_map` table that links them:\n\n``` sql\nSELECT branch_name, task_name, board_name\nFROM branch_task_map\nWHERE branch_name LIKE '%promo-banner%'\n```\n\nThis lets the system cross-reference: *\"What tasks relate to this branch?\"* or *\"What commits were made for this launch?\"*\n\n## The RAG System\n\n### Why Ollama?\n\nPrivacy was non-negotiable. Project data, commit messages, and task details shouldn't leave the machine. 
**Ollama runs the LLM entirely locally** — no internet needed once the model is downloaded, no data sent anywhere.\n\nI chose `qwen2.5-coder:7b` as the model — it's excellent at SQL generation and runs fast on Apple Silicon.\n\n### The smart prompt\n\nThe system prompt is where the magic happens. It includes:\n\n- **Full database schema** — auto-introspected at startup\n- **Sample values** — actual repo names, anonymized author identifiers, statuses from the database\n- **Few-shot SQL examples** — teaches the model the query patterns\n- **Today's date** — so \"this week\" and \"last month\" work correctly\n\n``` python\ndef build_system_prompt():\n    schema = get_schema()         # Auto-introspect SQLite tables\n    samples = get_sample_values() # Real values from the DB\n    return f\"\"\"You are a data analyst assistant...\n\n## Database Schema\n{schema}\n\n## Sample Values\n{samples}\n...\"\"\"\n```\n\n### Auto-discovery: the secret sauce\n\nBefore the LLM even sees the question, the system extracts keywords and searches across all tables:\n\n``` python\ndef discover(question):\n    keywords = extract_keywords(question)\n    # Search task_boards, task_items, commits, branches...\n    # Return matching IDs, names, values\n```\n\nThis means when you ask *\"What's happening with the example promo banner launch?\"*, the system has already found:\n\n- The matching project board\n- Related branches: `feature/example-promo-banner`\n- Recent commits referencing the same keywords\n\nThe LLM gets these **exact values**, so it writes precise SQL instead of guessing.\n\n### Self-correcting queries\n\nIf a SQL query returns 0 results, the system automatically retries with different keyword strategies:\n\n```\nAttempt 1: WHERE branch = 'feature/example-promo-banner'  → 0 results\nAttempt 2: WHERE message LIKE '%promo banner%'             → 12 results\n```\n\nThis handles the reality that commits are often on parent branches, not the feature branch itself.\n\n## The Result\n\nA CLI tool where I type 
questions and get answers:\n\n``` bash\n$ python3 main.py \"who committed the most this month?\"\n\nDeveloper A and Developer B lead this month\nwith roughly 350 commits each, followed by\nDeveloper C with around 280 commits.\n```\n\n``` bash\n$ python3 main.py \"what web tasks are pending for the next launch?\"\n\nThe upcoming launch has 8 web tasks remaining:\n3 in progress, 2 ready for review, 3 not started...\n```\n\n``` bash\n$ python3 main.py -v \"what files changed for the example promo banner?\"\n\n-- SQL: SELECT DISTINCT fc.file_path, SUM(fc.insertions)...\n-- WHERE c.message LIKE '%promo banner%'...\n\nSeveral template and snippet files were modified,\nconcentrated in the promo banner section and a few\nrelated shared components.\n```\n\n## Project Structure\n\nThe entire system is **8 files, ~400 lines of code, 2 dependencies**:\n\n```\ncustom-rag/\n  main.py           # CLI entry point — REPL + one-shot mode\n  agent.py          # LLM conversation loop (question → SQL → answer)\n  db.py             # SQLite read-only, schema introspection, auto-discovery\n  prompts.py        # System prompt with schema + few-shot examples\n  tools.py          # Tool definitions\n  formatter.py      # Rich terminal output\n  config.py         # Paths and model settings\n  requirements.txt  # rich, requests (that's it)\n```\n\nNo LangChain. No vector database. No embeddings. No cloud services.\n\n## Key Takeaways\n\n- **Not all RAG needs vectors.** If your data is structured, Text-to-SQL is simpler and more accurate than embedding everything into a vector store.\n- **Local LLMs are production-ready.** Ollama + `qwen2.5-coder:7b` runs fast on a MacBook and generates correct SQL reliably.\n- **Auto-discovery beats prompt engineering.** Instead of hoping the LLM guesses the right table values, search the database first and feed it exact matches.\n- **Privacy and simplicity can coexist.** The whole system is a few hundred lines of Python, runs offline, and handles real questions.\n
- **Cross-referencing is the real value.** Any single data source is easy to query manually. The power comes from connecting git history with project management data in one natural language interface.\n\n## What's Next\n\n- Cron job to auto-refresh data every hour\n- Adding chat message history as a third data source\n- A simple web UI for non-terminal users\n\n*Built with Python, SQLite, Ollama, and qwen2.5-coder. All code runs locally — no data leaves the machine. All examples in this article use illustrative names and rounded figures; real commit authors, project codenames, and counts have been replaced or generalized.*\n\n*If you're interested in the implementation details or want to build something similar, feel free to reach out or comment below.*", "url": "https://wpnews.pro/news/i-built-a-private-ai-assistant-that-queries-my-git-history-and-project-data-only", "canonical_source": "https://dev.to/pouria_zand/i-built-a-private-ai-assistant-that-queries-my-git-history-and-project-management-data-using-only-39mn", "published_at": "2026-05-21 22:14:40+00:00", "updated_at": "2026-05-21 22:32:22.627119+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "data", "open-source", "artificial-intelligence"], "entities": ["SQLite", "Python", "RAG", "Text-to-SQL", "Git", "LLM"], "alternates": {"html": "https://wpnews.pro/news/i-built-a-private-ai-assistant-that-queries-my-git-history-and-project-data-only", "markdown": "https://wpnews.pro/news/i-built-a-private-ai-assistant-that-queries-my-git-history-and-project-data-only.md", "text": "https://wpnews.pro/news/i-built-a-private-ai-assistant-that-queries-my-git-history-and-project-data-only.txt", "jsonld": "https://wpnews.pro/news/i-built-a-private-ai-assistant-that-queries-my-git-history-and-project-data-only.jsonld"}}