cd /news/artificial-intelligence/no-api-keys-no-cloud-bills-no-data-l… · home topics artificial-intelligence article
[ARTICLE · art-31278] src=dev.to ↗ pub= topic=artificial-intelligence verified=true sentiment=↑ positive

No API Keys. No Cloud Bills. No Data Leaving My Machine. Here's Exactly How — and What It Actually Costs

A developer built a local AI research assistant using a Ryzen 7 7435HS, 16GB RAM, and an RTX 3050, avoiding API keys, cloud bills, and data leaving the machine. The setup uses Ollama to run models like Qwen3 8B locally, achieving a practical sweet spot for security research and privacy. The developer emphasizes that local AI eliminates OPSEC risks from cloud APIs while noting the tradeoff of a weekend setup and hardware limits.

read7 min views1 publishedJun 17, 2026

Most people use AI the same way:

I did that too—until I started spending more time on cybersecurity research, vulnerability analysis, and application security. At some point, it felt strange researching privacy and security while sending chunks of my work to infrastructure I didn't control.

So I decided to build my own.

...

No API keys. No subscriptions. No cloud processing. Just a Ryzen 7 7435HS, 16GB RAM, an RTX 3050, and a growing curiosity about what actually happens behind the chatbot interface.

The goal was simple: create a local AI research assistant that could search my notes, help with security research, and keep everything on my machine. What I didn't expect was that building it would teach me more about AI, LLMs, RAG, agents, and AI security than years of simply using them ever could.

Here's what I built and what I learned.

Let’s kill the ambiguity before we go any further. There are three distinct tiers to this:

I'm at Level 1 moving toward Level 2. That's the practical sweet spot for anyone doing real work without a datacenter.

LLMs are prediction systems. They don't think. They predict the statistically most likely next token given everything in their context window.

Input:  "The capital of France is"
Output: "Paris"

Do that billions of times across a massive training corpus, and complex, reasoning-like behavior emerges. That's genuinely it. No magic. No ghost in the machine. Just very expensive statistics that happen to be incredibly useful.

Understanding this matters for security work specifically:

⚠️

The core paradox of LLMs:The model is confident. The model is sometimes wrong. These two facts coexist comfortably and will continue to cause problems for everyone who forgets them.

For general-purpose use, cloud AI is fine. For security research, the calculus is completely different. When you use a cloud model:

If you're doing bug bounty, AppSec auditing, or anything involving non-public vulnerability data, feeding that into a cloud API is an OPSEC problem. Full stop.

Local solves this. Your data stays on your hardware. No terms of service to audit. No compliance risk from third-party data processing. No API costs. It works completely offline, allowing unlimited experimentation without watching a token counter.

The tradeoff? Setup takes a weekend and your hardware has strict limits. Still worth it.

Ollama is the easiest on-ramp to local AI. Think of it as Docker for language models — it handles downloads, quantization, GPU acceleration, and exposes a clean REST API at http://localhost:11434

.

curl -fsSL https://ollama.sh | sh

ollama run qwen3:8b

That's it. The model downloads, loads into memory, and the API goes live.

Download Model Weights ──> Load Into RAM/VRAM ──> Tokenize Input ──> Transformer Inference ──> Local REST API

Ollama is a model manager, a runtime, and an API server all in one. The intelligence is in the model weights; Ollama is just the plumbing that makes those weights usable without a PhD in infrastructure.

ollama list         # See installed models
ollama pull qwen3:8b # Download a specific model
ollama rm llama3    # Remove an unwanted model
ollama ps           # See what models are currently loaded in memory

Not all models fit on all machines. Here is an honest breakdown of the hardware requirements:

RAM / VRAM Recommended Model Experience Notes
8GB
Gemma 4 4B or Phi-4 Mini Fits cleanly, decent quality, highly efficient
16GB
Qwen3 8B or DeepSeek R1 Distilled 8B
The Sweet Spot. Fast and highly capable
32GB+
DeepSeek R1 14B–32B High-level technical reasoning, heavily data-intensive

My daily driver: Qwen3 8B. It provides strong technical reasoning, handles code exceptionally well, is Apache 2.0 licensed, and runs cleanly on my laptop without fighting for VRAM.

The open-model ecosystem moves fast. Here's where things actually stand right now:

💡

The honest takeaway on hardware:8GB of VRAM was borderline a couple of years ago. It's cramped now.12GB is the modern floorfor serious local work, while16GB gives you room to actually experiment.

A stock model only knows what it was trained on. It doesn't know your security notes, your project internals, your custom vulnerability writeups, or yesterday's newly disclosed CVEs.

That gap is the main limitation of Level 1. The fix is RAG.

RAG stands for Retrieval-Augmented Generation. The concept is simpler than the name suggests:

User Asks Question 
       │
       ▼
Search Vector Database (ChromaDB)
       │
       ▼
Retrieve Relevant Document Chunks
       │
       ▼
Inject Context into System Prompt (Question + Source Chunks)
       │
       ▼
Local LLM Generates Grounded Answer

Here is how you can spin up a local RAG pipeline using LangChain and Ollama.

pip install chromadb langchain langchain-community langchain-ollama
python
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

your_docs = [
    "BOLA (Broken Object Level Authorization) occurs when an API doesn't verify the requesting user has permission to access the specific object. Most common API vulnerability in 2026.",
    "JWT tokens must be verified server-side. Common mistakes: not checking the signature algorithm, skipping expiry validation, or accepting 'none' as a valid algorithm.",
    "Django DEBUG=True in production exposes detailed stack traces, environment variables, and raw database queries to anyone who triggers an internal server error."
]

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.create_documents(your_docs)

embeddings = OllamaEmbeddings(model="qwen3:8b")
vector_store = Chroma.from_documents(chunks, embeddings)
retriever = vector_store.as_retriever(search_kwargs={"k": 2})

llm = ChatOllama(model="qwen3:8b", temperature=0)

system_prompt = (
    "You are a security research assistant. Use the following pieces of retrieved context "
    "to answer the question. If you don't know the answer, say that you don't know.\n\n"
    "Context:\n{context}"
)
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is BOLA and why is it dangerous?"})
print(response["answer"])

Point this script toward your local Markdown folders, OWASP PDFs, PortSwigger writeups, or disclosed HackerOne reports, and you instantly have a local research assistant that knows your actual data.

Beginners almost always assume they need to fine-tune a model to teach it new information. Usually, they are wrong.

┌───────────────────────────────────────┬───────────────────────────────────────┐
│               USE RAG WHEN            │            USE FINE-TUNING WHEN       │
├───────────────────────────────────────┼───────────────────────────────────────┤
│ • Knowledge changes frequently        │ • You need style/tone behavioral shifts│
│   (new CVEs, fresh writeups)          │ • You need strict output formatting   │
│ • You need explicit source citations  │ • You want deep task specialization   │
│ • You want fast, zero-cost iteration  │ • You have a vast, clean dataset      │
└───────────────────────────────────────┴───────────────────────────────────────┘

The play: Always start with RAG. Fine-tune only if RAG fails to meet your structural formatting needs after extensive testing.

RAG gives a model knowledge. MCP gives a model tools.

Model Context Protocol allows local LLMs to safely step outside their sandbox and interact natively with systems:

              ┌──> GitHub Repository
              ├──> Live CVE Databases
User ──> Agent ──> Tools ──> ├──> Burp Suite Reports
              ├──> Local Filesystem
              └──> System Logs

A chatbot answers questions; an agent completes tasks. Imagine an automated workflow: Find latest Django CVEs ──> read advisories ──> compare against requirements.txt ──> generate a remediation report ──> open a local GitHub issue. That's the power of tool integration.

Running AI locally protects your data from leaving your machine, but it shifts the application attack surface. Prompt injection isn't theoretical—it's cataloged under real CVEs (e.g., CVE-2025-53773 in GitHub Copilot allowing remote code execution).

When building local RAG and agent architectures, you must defend against:

🛡️

The Defense:Applyleast-privilege principlesto your agent's tools, sandbox execution environments, sanitize inputs, and treat every retrieved document chunk as potentially untrusted adversarial input.

Building a local AI system isn't about outperforming a trillion-dollar tech giant on a standard benchmark.

You cannot thoroughly audit AI-integrated applications if you treat the model as a black box. You cannot effectively reason about prompt injection vectors in an enterprise system if you have never engineered a document pipeline from scratch.

A few years ago, understanding the TCP/IP stack separated master engineers from beginners. Today, understanding LLM inference, embedding vectors, context windows, and tool integration protocols is becoming the new dividing line.

To paraphrase Richard Feynman:

There is a profound difference between knowing the name of something and knowing the thing itself.

Build the thing. Then break it. Then secure it. That's where the real learning starts.

I write about API security, backend systems, and building tools from scratch.

── more in #artificial-intelligence 4 stories · sorted by recency
── more on @ollama 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/no-api-keys-no-cloud…] indexed:0 read:7min 2026-06-17 ·