No API Keys. No Cloud Bills. No Data Leaving My Machine. Here's Exactly How — and What It Actually Costs

wpnews.pro

Most people use AI the same way:

I did that too—until I started spending more time on cybersecurity research, vulnerability analysis, and application security. At some point, it felt strange researching privacy and security while sending chunks of my work to infrastructure I didn't control.

So I decided to build my own.

...

No API keys. No subscriptions. No cloud processing. Just a Ryzen 7 7435HS, 16GB RAM, an RTX 3050, and a growing curiosity about what actually happens behind the chatbot interface.

The goal was simple: create a local AI research assistant that could search my notes, help with security research, and keep everything on my machine. What I didn't expect was that building it would teach me more about AI, LLMs, RAG, agents, and AI security than years of simply using them ever could.

Here's what I built and what I learned.

Let’s kill the ambiguity before we go any further. There are three distinct tiers to this:

I'm at Level 1 moving toward Level 2. That's the practical sweet spot for anyone doing real work without a datacenter.

LLMs are prediction systems. They don't think. They predict the statistically most likely next token given everything in their context window.

Input:  "The capital of France is"
Output: "Paris"

Do that billions of times across a massive training corpus, and complex, reasoning-like behavior emerges. That's genuinely it. No magic. No ghost in the machine. Just very expensive statistics that happen to be incredibly useful.

Understanding this matters for security work specifically:

⚠️

The core paradox of LLMs:The model is confident. The model is sometimes wrong. These two facts coexist comfortably and will continue to cause problems for everyone who forgets them.

For general-purpose use, cloud AI is fine. For security research, the calculus is completely different. When you use a cloud model:

If you're doing bug bounty, AppSec auditing, or anything involving non-public vulnerability data, feeding that into a cloud API is an OPSEC problem. Full stop.

Local solves this. Your data stays on your hardware. No terms of service to audit. No compliance risk from third-party data processing. No API costs. It works completely offline, allowing unlimited experimentation without watching a token counter.

The tradeoff? Setup takes a weekend and your hardware has strict limits. Still worth it.

Ollama is the easiest on-ramp to local AI. Think of it as Docker for language models — it handles downloads, quantization, GPU acceleration, and exposes a clean REST API at http://localhost:11434

.

curl -fsSL https://ollama.sh | sh

ollama run qwen3:8b

That's it. The model downloads, loads into memory, and the API goes live.

Download Model Weights ──> Load Into RAM/VRAM ──> Tokenize Input ──> Transformer Inference ──> Local REST API

Ollama is a model manager, a runtime, and an API server all in one. The intelligence is in the model weights; Ollama is just the plumbing that makes those weights usable without a PhD in infrastructure.

ollama list         # See installed models
ollama pull qwen3:8b # Download a specific model
ollama rm llama3    # Remove an unwanted model
ollama ps           # See what models are currently loaded in memory

Not all models fit on all machines. Here is an honest breakdown of the hardware requirements:

RAM / VRAM	Recommended Model	Experience Notes
8GB
Gemma 4 4B or Phi-4 Mini	Fits cleanly, decent quality, highly efficient
16GB
Qwen3 8B or DeepSeek R1 Distilled 8B
The Sweet Spot. Fast and highly capable
32GB+
DeepSeek R1 14B–32B	High-level technical reasoning, heavily data-intensive

My daily driver: Qwen3 8B. It provides strong technical reasoning, handles code exceptionally well, is Apache 2.0 licensed, and runs cleanly on my laptop without fighting for VRAM.

The open-model ecosystem moves fast. Here's where things actually stand right now:

💡

The honest takeaway on hardware:8GB of VRAM was borderline a couple of years ago. It's cramped now.12GB is the modern floorfor serious local work, while16GB gives you room to actually experiment.

A stock model only knows what it was trained on. It doesn't know your security notes, your project internals, your custom vulnerability writeups, or yesterday's newly disclosed CVEs.

That gap is the main limitation of Level 1. The fix is RAG.

RAG stands for Retrieval-Augmented Generation. The concept is simpler than the name suggests:

User Asks Question 
       │
       ▼
Search Vector Database (ChromaDB)
       │
       ▼
Retrieve Relevant Document Chunks
       │
       ▼
Inject Context into System Prompt (Question + Source Chunks)
       │
       ▼
Local LLM Generates Grounded Answer

Here is how you can spin up a local RAG pipeline using LangChain and Ollama.

pip install chromadb langchain langchain-community langchain-ollama
python
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

your_docs = [
    "BOLA (Broken Object Level Authorization) occurs when an API doesn't verify the requesting user has permission to access the specific object. Most common API vulnerability in 2026.",
    "JWT tokens must be verified server-side. Common mistakes: not checking the signature algorithm, skipping expiry validation, or accepting 'none' as a valid algorithm.",
    "Django DEBUG=True in production exposes detailed stack traces, environment variables, and raw database queries to anyone who triggers an internal server error."
]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.create_documents(your_docs)

embeddings = OllamaEmbeddings(model="qwen3:8b")
vector_store = Chroma.from_documents(chunks, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

llm = ChatOllama(model="qwen3:8b", temperature=0)

system_prompt = (
    "You are a security research assistant. Use the following pieces of retrieved context "
    "to answer the question. If you don't know the answer, say that you don't know.\n\n"
    "Context:\n{context}"
)
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is BOLA and why is it dangerous?"})
print(response["answer"])

Point this script toward your local Markdown folders, OWASP PDFs, PortSwigger writeups, or disclosed HackerOne reports, and you instantly have a local research assistant that knows your actual data.

Beginners almost always assume they need to fine-tune a model to teach it new information. Usually, they are wrong.

┌───────────────────────────────────────┬───────────────────────────────────────┐
│               USE RAG WHEN            │            USE FINE-TUNING WHEN       │
├───────────────────────────────────────┼───────────────────────────────────────┤
│ • Knowledge changes frequently        │ • You need style/tone behavioral shifts│
│   (new CVEs, fresh writeups)          │ • You need strict output formatting   │
│ • You need explicit source citations  │ • You want deep task specialization   │
│ • You want fast, zero-cost iteration  │ • You have a vast, clean dataset      │
└───────────────────────────────────────┴───────────────────────────────────────┘

The play: Always start with RAG. Fine-tune only if RAG fails to meet your structural formatting needs after extensive testing.

RAG gives a model knowledge. MCP gives a model tools.

Model Context Protocol allows local LLMs to safely step outside their sandbox and interact natively with systems:

              ┌──> GitHub Repository
              ├──> Live CVE Databases
User ──> Agent ──> Tools ──> ├──> Burp Suite Reports
              ├──> Local Filesystem
              └──> System Logs

A chatbot answers questions; an agent completes tasks. Imagine an automated workflow: Find latest Django CVEs ──> read advisories ──> compare against requirements.txt ──> generate a remediation report ──> open a local GitHub issue. That's the power of tool integration.

Running AI locally protects your data from leaving your machine, but it shifts the application attack surface. Prompt injection isn't theoretical—it's cataloged under real CVEs (e.g., CVE-2025-53773 in GitHub Copilot allowing remote code execution).

When building local RAG and agent architectures, you must defend against:

🛡️

The Defense:Applyleast-privilege principlesto your agent's tools, sandbox execution environments, sanitize inputs, and treat every retrieved document chunk as potentially untrusted adversarial input.

Building a local AI system isn't about outperforming a trillion-dollar tech giant on a standard benchmark.

You cannot thoroughly audit AI-integrated applications if you treat the model as a black box. You cannot effectively reason about prompt injection vectors in an enterprise system if you have never engineered a document pipeline from scratch.

A few years ago, understanding the TCP/IP stack separated master engineers from beginners. Today, understanding LLM inference, embedding vectors, context windows, and tool integration protocols is becoming the new dividing line.

To paraphrase Richard Feynman:

There is a profound difference between knowing the name of something and knowing the thing itself.

Build the thing. Then break it. Then secure it. That's where the real learning starts.

I write about API security, backend systems, and building tools from scratch.

source & further reading

dev.to — original article From Software Engineer to AI Engineer - Part 3: Giving it a hand AI collapsed my job into three roles and I had to relearn all of them The Shape of Failure: Before You Blame the AI

No API Keys. No Cloud Bills. No Data Leaving My Machine. Here's Exactly How — and What It Actually Costs

Run your AI side-project on zahid.host