If you've ever wanted to run a large language model on your own machine (no API key, no cloud bill, no data leaving your laptop), Ollama is the easiest way to get there. It packages model weights, a runtime (built on llama.cpp), and a simple CLI/REST API into one tool that works the same way on macOS, Linux, and Windows.
This guide covers installation, running your first model, the core commands you'll actually use, picking a model for your hardware, and hooking Ollama into your own code via its API.
The tradeoff: local models are generally smaller and slightly behind frontier cloud models (GPT, Claude, Gemini) on raw capability, though the gap keeps shrinking.
On macOS, download the app from ollama.com/download, or use Homebrew:
brew install ollama
On Linux, run the install script:
curl -fsSL https://ollama.com/install.sh | sh
This installs the `ollama` binary and sets up a systemd service so it runs in the background. Check it's alive:
systemctl status ollama
On Windows, download OllamaSetup.exe from ollama.com/download and run it; no admin rights are required. Recent versions ship a full desktop app with a chat window, so you can skip the terminal entirely if you prefer. A native ARM64 build is also available for Windows-on-Arm devices.
To run Ollama in Docker instead:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Add `--gpus=all` if you have an NVIDIA GPU and the NVIDIA Container Toolkit installed.
To verify the install:
ollama --version
ollama list
An empty list is expected on a fresh install; it just confirms the daemon is up and responding.
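The same liveness check can be scripted. This is a minimal sketch, assuming the default address `http://localhost:11434` and that `GET /api/tags` returns a JSON object with a `models` array whose entries carry a `name` field. The helper names `model_names` and `installed_models` are illustrative and not part of Ollama.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def model_names(tags_payload: dict) -> list:
    """Extract model names from an /api/tags-style response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def installed_models(base_url: str = OLLAMA_URL, timeout: float = 3.0):
    """Return installed model names, or None if the daemon is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        return None


if __name__ == "__main__":
    names = installed_models()
    if names is None:
        print("Ollama is not responding")
    else:
        print(f"Ollama is up; {len(names)} model(s) installed: {names}")
```

An empty list and `None` mean different things here: the first is a healthy fresh install, the second is a daemon that is not answering.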
ollama run llama3.2
This pulls the model (a few GB, one-time download) and drops you into an interactive chat session. Type a prompt, hit Enter, get a response. Ctrl+D or `/bye` exits.
| Command | What it does |
|---|---|
| `ollama run <model>` | Pull (if needed) and chat with a model |
| `ollama pull <model>` | Download a model without starting a chat |
| `ollama list` | Show models you have installed |
| `ollama ps` | Show models currently loaded in memory |
| `ollama show <model>` | Show details/parameters for a model |
| `ollama rm <model>` | Delete a model to free disk space |
| `ollama stop <model>` | Unload a model from memory |
| `ollama create <name> -f Modelfile` | Build a custom model from a Modelfile |
Always pull with an explicit tag for anything you depend on (`ollama pull qwen2.5-coder:7b`), since `:latest` can change under you.
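If you keep a list of models a project depends on, the pinning rule is easy to check mechanically. A small sketch, assuming model references follow the `name:tag` form shown above; `has_explicit_tag` and `unpinned` are hypothetical helper names, not Ollama functions.

```python
def has_explicit_tag(model_ref: str) -> bool:
    """True if the reference names a tag other than the floating `latest`."""
    # Only the last path segment can carry a tag; a colon earlier in the
    # string would belong to a registry host:port, not to a tag.
    name = model_ref.rsplit("/", 1)[-1]
    _, sep, tag = name.partition(":")
    return bool(sep) and tag not in ("", "latest")


def unpinned(model_refs: list) -> list:
    """Return the references that would silently follow `latest`."""
    return [ref for ref in model_refs if not has_explicit_tag(ref)]


if __name__ == "__main__":
    deps = ["qwen2.5-coder:7b", "llama3.2", "nomic-embed-text:latest"]
    print(unpinned(deps))  # ['llama3.2', 'nomic-embed-text:latest']
```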
Ollama's library has hundreds of models. As a starting point:
| Use case | Try | Rough RAM/VRAM |
|---|---|---|
| General daily driver, light hardware | `llama3.2:3b` | ~4 GB |
| General daily driver, mid hardware | `llama3.1:8b` or `qwen3:8b` | ~6–8 GB |
| Coding | `qwen2.5-coder:7b` or `qwen3-coder:30b` (MoE, runs lighter than its size suggests) | 6–20 GB |
| Reasoning / math / step-by-step logic | `deepseek-r1:7b` or `:14b` | 6–12 GB |
| Best quality you can fit on a single consumer GPU | `qwen3.6:27b` or `gpt-oss:20b` | ~16–24 GB |
| Vision (images + text) | `llava` or `gemma3:12b` | 8–16 GB |
| Embeddings (for RAG / semantic search) | `nomic-embed-text` | <1 GB |
Rule of thumb for sizing: a 7–8B model at Q4 quantization needs roughly 5–6 GB of memory. Treat these as rough numbers, not gospel. Mixture-of-experts models (the ones with an "active/total" split, like `qwen3-coder:30b`) activate only a fraction of their parameters for each token, so they're often faster than their parameter count implies, but they still need the full model in memory, not just the active slice. Always check ollama.com/library for the current tag list, since model lineups change frequently.
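The rule of thumb can be worked through as arithmetic. The sketch below assumes about 4.5 bits per weight for a Q4-style quantization and a flat 1 GB allowance for the KV cache and runtime buffers. Both numbers are illustrative assumptions rather than figures from Ollama, and real usage grows with context length.

```python
def estimate_memory_gb(total_params_billions: float,
                       bits_per_weight: float = 4.5,  # assumption: Q4-style quantization
                       overhead_gb: float = 1.0) -> float:  # assumption: KV cache + buffers
    """Rough memory estimate: quantized weights plus a fixed overhead.

    For mixture-of-experts models, pass the TOTAL parameter count,
    because every expert has to be resident in memory.
    """
    weights_gb = total_params_billions * bits_per_weight / 8  # bits -> bytes
    return weights_gb + overhead_gb


if __name__ == "__main__":
    for size in (7, 8, 30):
        print(f"{size}B -> ~{estimate_memory_gb(size):.1f} GB")
```

With these assumptions a 7B model comes out near 4.9 GB and an 8B model at 5.5 GB, in line with the 5–6 GB figure above.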
If you're not sure where to start: pull a small model, use it for a week on your actual tasks, and let what it struggles with point you toward the next one.
Ollama exposes a REST API on `localhost:11434`; this is how every IDE plugin, chat UI, and framework talks to it under the hood.
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "Explain Ollama in one sentence." }],
  "stream": false
}'
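The same request can be made from Python with only the standard library. This is a sketch of the call above, assuming the non-streaming response is a JSON object whose `message.content` field holds the reply. The function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "llama3.2",
                       base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build the POST /api/chat request that the curl command sends."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reply_text(response_body: dict) -> str:
    """Pull the assistant's text out of a non-streaming /api/chat response."""
    return response_body["message"]["content"]


def chat_once(prompt: str, **kwargs) -> str:
    """Send one prompt and return the reply (needs a running Ollama daemon)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        return reply_text(json.load(resp))


if __name__ == "__main__":
    print(chat_once("Explain Ollama in one sentence."))
```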
It also exposes an OpenAI-compatible endpoint, so anything built for the OpenAI SDK can point at Ollama with a base URL change:
http://localhost:11434/v1/chat/completions
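To see what the base URL change amounts to, here is a standard-library sketch that posts an OpenAI-style request to that endpoint and reads an OpenAI-style response (`choices[0].message.content`). The helper names are hypothetical. A real OpenAI SDK client would wrap the same exchange, and SDKs that insist on an API key can usually be given any placeholder string.

```python
import json
import urllib.request

OPENAI_COMPAT_BASE = "http://localhost:11434/v1"  # the only Ollama-specific part


def completion_payload(prompt: str, model: str = "llama3.2") -> dict:
    """Request body in the OpenAI chat-completions shape."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}


def completion_text(response_body: dict) -> str:
    """Read the reply from an OpenAI-style chat-completions response."""
    return response_body["choices"][0]["message"]["content"]


def complete(prompt: str, model: str = "llama3.2",
             base_url: str = OPENAI_COMPAT_BASE) -> str:
    """Send one prompt through the compatible endpoint (needs a running daemon)."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(completion_payload(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return completion_text(json.load(resp))


if __name__ == "__main__":
    print(complete("Explain Ollama in one sentence."))
```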
pip install ollama
from ollama import chat
response = chat(model='llama3.2', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
])
print(response.message.content)
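For longer answers you may prefer to print tokens as they arrive. A sketch, assuming the `ollama` package's `chat` accepts `stream=True` and yields chunks that expose `chunk["message"]["content"]`; `text_pieces` and `stream_reply` are illustrative names.

```python
from typing import Iterable, Iterator


def text_pieces(chunks: Iterable) -> Iterator[str]:
    """Yield the text fragment carried by each streamed chat chunk."""
    for chunk in chunks:
        yield chunk["message"]["content"]


def stream_reply(prompt: str, model: str = "llama3.2") -> str:
    """Print a reply as it is generated and return the full text."""
    from ollama import chat  # imported lazily; requires `pip install ollama`

    stream = chat(model=model,
                  messages=[{"role": "user", "content": prompt}],
                  stream=True)
    pieces = []
    for piece in text_pieces(stream):
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return "".join(pieces)


if __name__ == "__main__":
    stream_reply("Why is the sky blue?")
```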
Want a model with a fixed system prompt or different default parameters? Create a `Modelfile`:
FROM llama3.2
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM """
You are a terse code reviewer. Point out bugs and style issues only: no praise, no fluff.
"""
Build it:
ollama create code-reviewer -f Modelfile
ollama run code-reviewer
Now `code-reviewer` is its own model in `ollama list`, with your settings baked in.
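If you maintain several variants, generating the Modelfile text from code keeps them consistent. A minimal sketch that only emits the three directives used above (`FROM`, `PARAMETER`, `SYSTEM`); `render_modelfile` is a hypothetical helper, not part of Ollama.

```python
def render_modelfile(base: str, system: str, **params) -> str:
    """Render Modelfile text with FROM, PARAMETER and SYSTEM directives."""
    lines = [f"FROM {base}"]
    # One PARAMETER line per keyword argument, in the order given.
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    # The system prompt goes in a triple-quoted block, as in the example above.
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = render_modelfile(
        "llama3.2",
        "You are a terse code reviewer.",
        temperature=0.7,
        num_ctx=4096,
    )
    print(text)
```

Write the result to a file named `Modelfile` and build it with the same `ollama create` command.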
By default, Ollama binds to `127.0.0.1`. Setting `OLLAMA_HOST=0.0.0.0` exposes the API to your whole network.
`OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED_MODELS` control concurrency if you're serving more than one model.
Set `num_ctx` deliberately in a Modelfile instead of leaving it at whatever default your VRAM tier triggers.
If responses are slow, run `ollama ps`; it shows whether a model is running on CPU or GPU. Driver issues (CUDA/ROCm) are the most common cause of silent CPU fallback.
Point OpenAI-compatible tools at `http://localhost:11434/v1` to swap in local models with minimal code changes.
Pair an embedding model (`nomic-embed-text`) with a chat model to build a local RAG pipeline with zero API cost.
That's the whole loop: install, pull, run, integrate. Everything else is just picking the right model for the job.