No API Keys. No Cloud Bills. No Data Leaving My Machine. Here's Exactly How — and What It Actually Costs

A developer built a local AI research assistant using a Ryzen 7 7435HS, 16GB RAM, and an RTX 3050, avoiding API keys, cloud bills, and data leaving the machine. The setup uses Ollama to run models like Qwen3 8B locally, achieving a practical sweet spot for security research and privacy. The developer emphasizes that local AI eliminates OPSEC risks from cloud APIs while noting the tradeoff of a weekend setup and hardware limits.

Most people use AI the same way: I did that too, until I started spending more time on cybersecurity research, vulnerability analysis, and application security. At some point, it felt strange researching privacy and security while sending chunks of my work to infrastructure I didn't control.

So I decided to build my own. ...

No API keys. No subscriptions. No cloud processing. Just a Ryzen 7 7435HS, 16GB RAM, an RTX 3050, and a growing curiosity about what actually happens behind the chatbot interface.

The goal was simple: create a local AI research assistant that could search my notes, help with security research, and keep everything on my machine. What I didn't expect was that building it would teach me more about AI, LLMs, RAG, agents, and AI security than years of simply using them ever could.

Here's what I built and what I learned.

Let's kill the ambiguity before we go any further. There are three distinct tiers to this:

I'm at Level 1 moving toward Level 2. That's the practical sweet spot for anyone doing real work without a datacenter.

LLMs are prediction systems. They don't think. They predict the statistically most likely next token given everything in their context window.

Input: "The capital of France is"
Output: "Paris"

Do that billions of times across a massive training corpus, and complex, reasoning-like behavior emerges. That's genuinely it. No magic. No ghost in the machine. Just very expensive statistics that happen to be incredibly useful.

Understanding this matters for security work specifically:

⚠️ The core paradox of LLMs: The model is confident. The model is sometimes wrong. These two facts coexist comfortably and will continue to cause problems for everyone who forgets them.

For general-purpose use, cloud AI is fine. For security research, the calculus is completely different.
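
A quick aside to make the "prediction system" point concrete. This is a minimal sketch, not how a real LLM works internally: a toy bigram counter that predicts the next word from the previous word only. The corpus and the names `train_bigrams`, `predict_next`, and `complete` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for this sketch. Real models train on vastly more text
# and condition on the whole context window, not just the previous word.
CORPUS = [
    "the capital of france is paris",
    "the capital of italy is rome",
    "the capital of france is paris",
    "the food of france is good",
]


def train_bigrams(sentences):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev_token):
    """Return the most frequent follower and its estimated probability."""
    followers = counts.get(prev_token)
    if not followers:
        return None, 0.0
    token, freq = followers.most_common(1)[0]
    return token, freq / sum(followers.values())


COUNTS = train_bigrams(CORPUS)


def complete(prompt):
    """Predict the next word using only the last word of the prompt."""
    tokens = prompt.split()
    if not tokens:
        return None, 0.0
    return predict_next(COUNTS, tokens[-1])


if __name__ == "__main__":
    print(complete("the capital of france is"))  # ('paris', 0.5)
    print(complete("the capital of italy is"))   # ('paris', 0.5)
```

The Italy prompt gets the same answer as the France prompt, with the same probability. The toy has no notion of being wrong; it only knows what usually comes next. Real models are far better at using context, but the failure mode has the same shape.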

When you use a cloud model:

If you're doing bug bounty, AppSec auditing, or anything involving non-public vulnerability data, feeding that into a cloud API is an OPSEC problem. Full stop.

Local solves this. Your data stays on your hardware. No terms of service to audit. No compliance risk from third-party data processing. No API costs. It works completely offline, allowing unlimited experimentation without watching a token counter.

The tradeoff? Setup takes a weekend and your hardware has strict limits. Still worth it.

[Ollama](https://ollama.com/) is the easiest on-ramp to local AI. Think of it as Docker for language models: it handles downloads, quantization, GPU acceleration, and exposes a clean REST API at http://localhost:11434.

Mac / Linux installation

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

Pull and run a model

```bash
ollama run qwen3:8b
```

That's it. The model downloads, loads into memory, and the API goes live.

Download Model Weights ── Load Into RAM/VRAM ── Tokenize Input ── Transformer Inference ── Local REST API

Ollama is a model manager, a runtime, and an API server all in one. The intelligence is in the model weights; Ollama is just the plumbing that makes those weights usable without a PhD in infrastructure.

```bash
ollama list             # See installed models
ollama pull qwen3:8b    # Download a specific model
ollama rm llama3        # Remove an unwanted model
ollama ps               # See what models are currently loaded in memory
```

Not all models fit on all machines. Here is an honest breakdown of the hardware requirements:

| RAM / VRAM | Recommended Model | Experience Notes |
|---|---|---|
| 8GB | Gemma 4 4B or Phi-4 Mini | Fits cleanly, decent quality, highly efficient |
| 16GB | Qwen3 8B or DeepSeek R1 Distilled 8B | The Sweet Spot. Fast and highly capable |
| 32GB+ | DeepSeek R1 14B–32B | High-level technical reasoning, heavily data-intensive |

My daily driver: Qwen3 8B.
It provides strong technical reasoning, handles code exceptionally well, is Apache 2.0 licensed, and runs cleanly on my laptop without fighting for VRAM.

The open-model ecosystem moves fast. Here's where things actually stand right now:

💡 The honest takeaway on hardware: 8GB of VRAM was borderline a couple of years ago. It's cramped now. 12GB is the modern floor for serious local work, while 16GB gives you room to actually experiment.

A stock model only knows what it was trained on. It doesn't know your security notes, your project internals, your custom vulnerability writeups, or yesterday's newly disclosed CVEs. That gap is the main limitation of Level 1. The fix is RAG.

RAG stands for Retrieval-Augmented Generation. The concept is simpler than the name suggests:

```
User Asks Question
        │
        ▼
Search Vector Database (ChromaDB)
        │
        ▼
Retrieve Relevant Document Chunks
        │
        ▼
Inject Context into System Prompt (Question + Source Chunks)
        │
        ▼
Local LLM Generates Grounded Answer
```

Here is how you can spin up a local RAG pipeline using LangChain and Ollama.

```bash
pip install chromadb langchain langchain-community langchain-ollama
```

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Note: these two helpers live under langchain.chains in the 0.x releases;
# newer releases may relocate them, so check your installed version.
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# 1. Your proprietary knowledge base
your_docs = [
    "BOLA (Broken Object Level Authorization) occurs when an API doesn't verify the requesting user has permission to access the specific object. Most common API vulnerability in 2026.",
    "JWT tokens must be verified server-side. Common mistakes: not checking the signature algorithm, skipping expiry validation, or accepting 'none' as a valid algorithm.",
    "Django DEBUG=True in production exposes detailed stack traces, environment variables, and raw database queries to anyone who triggers an internal server error.",
]

# 2. Split text into digestible chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.create_documents(your_docs)

# 3. Initialize local embeddings and store in ChromaDB
# The chat model is reused for embeddings here; a dedicated embedding model
# pulled through Ollama may give better retrieval quality.
embeddings = OllamaEmbeddings(model="qwen3:8b")
vector_store = Chroma.from_documents(chunks, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

# 4. Connect to local LLM
llm = ChatOllama(model="qwen3:8b", temperature=0)

# 5. Define the RAG prompt system
system_prompt = (
    "You are a security research assistant. Use the following pieces of retrieved context "
    "to answer the question. If you don't know the answer, say that you don't know.\n\n"
    "Context:\n{context}"
)
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

# 6. Create and execute the RAG chain
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)
response = rag_chain.invoke({"input": "What is BOLA and why is it dangerous?"})
print(response["answer"])
```

Point this script toward your local Markdown folders, OWASP PDFs, PortSwigger writeups, or disclosed HackerOne reports, and you instantly have a local research assistant that knows your actual data.

Beginners almost always assume they need to fine-tune a model to teach it new information. Usually, they are wrong.
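
Part of why RAG is usually the better first move is that the retrieval step is small enough to reason about directly. Here is a minimal sketch of "search, then inject context" using only the standard library. It assumes word-count vectors as a stand-in for real embeddings, and `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical names for this illustration, not part of LangChain or ChromaDB.

```python
import math
import re
from collections import Counter

# Shortened versions of the notes used in the pipeline above.
DOCS = [
    "BOLA occurs when an API does not verify the requesting user may access the specific object.",
    "JWT tokens must be verified server-side, including signature algorithm and expiry.",
    "Django DEBUG=True in production exposes stack traces and environment variables.",
]


def embed(text):
    """Stand-in embedding: lowercase word counts (a real system uses a neural model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, docs, k=1):
    """Return the k documents most similar to the question."""
    query = embed(question)
    ranked = sorted(docs, key=lambda d: cosine(query, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(question, docs, k=1):
    """Inject the retrieved chunks ahead of the question."""
    context = "\n".join(retrieve(question, docs, k))
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("Why is Django DEBUG dangerous in production?", DOCS))
```

Updating what the assistant knows means editing `DOCS` and nothing else. No weights change, which is the core of the argument for trying RAG first.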

```
┌───────────────────────────────────────┬─────────────────────────────────────────┐
│ USE RAG WHEN                          │ USE FINE-TUNING WHEN                    │
├───────────────────────────────────────┼─────────────────────────────────────────┤
│ • Knowledge changes frequently        │ • You need style/tone behavioral shifts │
│   (new CVEs, fresh writeups)          │ • You need strict output formatting     │
│ • You need explicit source citations  │ • You want deep task specialization     │
│ • You want fast, zero-cost iteration  │ • You have a vast, clean dataset        │
└───────────────────────────────────────┴─────────────────────────────────────────┘
```

The play: Always start with RAG. Fine-tune only if RAG fails to meet your structural formatting needs after extensive testing.

RAG gives a model knowledge. MCP gives a model tools. Model Context Protocol allows local LLMs to safely step outside their sandbox and interact natively with systems:

```
                          ┌── GitHub Repository
                          ├── Live CVE Databases
User ── Agent ── Tools ── ├── Burp Suite Reports
                          ├── Local Filesystem
                          └── System Logs
```

A chatbot answers questions; an agent completes tasks. Imagine an automated workflow:

Find latest Django CVEs ── read advisories ── compare against requirements.txt ── generate a remediation report ── open a local GitHub issue.

That's the power of tool integration.

Running AI locally protects your data from leaving your machine, but it shifts the application attack surface. Prompt injection isn't theoretical; it's cataloged under real CVEs (e.g., CVE-2025-53773 in GitHub Copilot, allowing remote code execution). When building local RAG and agent architectures, you must defend against:

🛡️ The Defense: Apply least-privilege principles to your agent's tools, sandbox execution environments, sanitize inputs, and treat every retrieved document chunk as potentially untrusted adversarial input.

Building a local AI system isn't about outperforming a trillion-dollar tech giant on a standard benchmark.
You cannot thoroughly audit AI-integrated applications if you treat the model as a black box. You cannot effectively reason about prompt injection vectors in an enterprise system if you have never engineered a document pipeline from scratch.

A few years ago, understanding the TCP/IP stack separated master engineers from beginners. Today, understanding LLM inference, embedding vectors, context windows, and tool integration protocols is becoming the new dividing line.

To paraphrase Richard Feynman: There is a profound difference between knowing the name of something and knowing the thing itself.

Build the thing. Then break it. Then secure it. That's where the real learning starts.

I write about API security, backend systems, and building tools from scratch.
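
As a starting point for the "secure it" step, here is a minimal sketch of the least-privilege defense described earlier: an agent that can call only allowlisted tools, a file reader that cannot escape its sandbox directory, and a wrapper that marks retrieved text as data. The names `safe_read`, `make_dispatcher`, `wrap_untrusted`, and the `read_note` tool are hypothetical, and this is not how MCP itself enforces permissions. Delimiting untrusted text reduces the risk of prompt injection but does not eliminate it.

```python
from pathlib import Path


def safe_read(root: Path, relative_path: str) -> str:
    """Read a file under root, refusing any path that resolves outside it."""
    root = root.resolve()
    target = (root / relative_path).resolve()
    if not target.is_relative_to(root):  # Python 3.9+
        raise PermissionError(f"path escapes sandbox: {relative_path}")
    return target.read_text(encoding="utf-8")


def make_dispatcher(root: Path):
    """Build a dispatcher that exposes only an explicit allowlist of tools."""
    tools = {
        # Hypothetical tool name; the agent gets read-only access to one folder.
        "read_note": lambda argument: safe_read(root, argument),
    }

    def dispatch(tool_name: str, argument: str) -> str:
        if tool_name not in tools:
            raise PermissionError(f"tool not allowed: {tool_name}")
        return tools[tool_name](argument)

    return dispatch


def wrap_untrusted(chunk: str) -> str:
    """Delimit retrieved text so the prompt can label it as data, not instructions."""
    cleaned = chunk.replace("</retrieved_document>", "")
    return f"<retrieved_document>\n{cleaned}\n</retrieved_document>"
```

A model-requested call such as `dispatch("run_shell", "id")` or `dispatch("read_note", "../secrets.txt")` raises `PermissionError` instead of executing. Anything the agent can do has to be added to the allowlist on purpose.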