I've been experimenting with running large language models entirely on my own machine, and the setup turned out to be simpler than I expected. Here's exactly what I did to get Llama 3 running locally using Docker - no cloud API, no data leaving my machine.
The first thing I noticed after switching to local inference was the privacy gain. Every prompt I send stays on my machine. For projects involving sensitive data, internal documents, customer queries, proprietary code, that matters. There's no third-party logging, no rate limits, and no per-token cost.
Beyond privacy, running models locally gives you full control over the model version, the inference parameters, and the runtime environment. Cloud APIs abstract all of that away. Whenever tweak temperature or context length is needed for a specific task, I could do it directly without navigating a provider dashboard. Local inference also means your application keeps working even when an external API goes down — a real advantage in production workflows.
Before starting, make sure your machine meets these minimums:
Ollama is a lightweight runtime that handles model , quantization, and serving over a local HTTP API. Wrapping it in Docker makes the setup portable and isolated — the model files, config, and server all live inside a named volume, separate from your system. Docker also means you can spin this up on any machine with a single command, no manual dependency installs.
I have used Ollama inside Docker, which packages the model runtime cleanly. Created a docker-compose.yml
to make the setup reproducible:
version: "3.8"
services:
ollama:
image: ollama/ollama:latest
container_name: ollama
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
volumes:
ollama_data:
Then I pulled and ran Llama 3:
docker exec -it ollama ollama pull llama3
docker exec -it ollama ollama run llama3
I added a simple Python client to query it programmatically:
import requests
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3",
"prompt": "Summarize the key risks in this contract clause: ...",
"stream": False
})
print(response.json()["response"])
Response latency on the 8B model was 2–4 seconds per query — fast enough for interactive use.
| Model | RAM Required | Disk Space |
|---|---|---|
| Llama 3 8B | ~6 GB | ~4.7 GB |
| Llama 3 70B | ~48 GB | ~40 GB |
Running Llama locally with Docker took me under 15 minutes to configure, and it's now part of my standard dev environment for any task where keeping data private is non-negotiable.
Have you tried running llama models locally? How was your experience?