# Getting Started with Ollama: Run LLMs Locally in 10 Minutes

> Source: <https://dev.to/mohitkumar4/getting-started-with-ollama-run-llms-locally-in-10-minutes-5g98>
> Published: 2026-06-28 01:18:12+00:00

If you've ever wanted to run a large language model on your own machine (no API key, no cloud bill, no data leaving your laptop), **Ollama** is the easiest way to get there. It packages model weights, a runtime (built on `llama.cpp`), and a simple CLI/REST API into one tool that works the same way on macOS, Linux, and Windows.

This guide covers installation, running your first model, the core commands you'll actually use, picking a model for your hardware, and hooking Ollama into your own code via its API.

The tradeoff: local models are generally smaller and slightly behind frontier cloud models (GPT, Claude, Gemini) on raw capability — though the gap keeps shrinking fast.

On macOS, download the app from [ollama.com/download](https://ollama.com/download) or use Homebrew. On Linux, use the install script:

```
# macOS (Homebrew)
brew install ollama
# Linux (install script)
curl -fsSL https://ollama.com/install.sh | sh
```

On Linux, the script installs the `ollama` binary and sets up a systemd service so it runs in the background. Check it's alive:

```
systemctl status ollama
```

On Windows, download `OllamaSetup.exe` from [ollama.com/download](https://ollama.com/download) and run it; no admin rights are required. Recent versions ship a full desktop app with a chat window, so you can skip the terminal entirely if you prefer. A native ARM64 build is also available for Windows-on-Arm devices.

To run Ollama in a container instead, use the official Docker image:

```
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
```

Add `--gpus=all` to the `docker run` command if you have an NVIDIA GPU and the NVIDIA Container Toolkit installed.

Confirm the install worked:

```
ollama --version
ollama list
```

An empty list is expected on a fresh install; it just confirms the daemon is up and responding.
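The same check can be scripted. Below is a minimal sketch, assuming the server is on its default address and that the `/api/tags` endpoint returns a JSON object with a `models` array. The helper names `model_names` and `fetch_local_models` are made up for this example.

```python
import json
import urllib.error
import urllib.request


def model_names(tags_response):
    """Extract model names from a parsed /api/tags response body."""
    # "models" may be missing or null; treat both as "nothing installed".
    return [m["name"] for m in (tags_response.get("models") or [])]


def fetch_local_models(base_url="http://localhost:11434"):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(base_url + "/api/tags", timeout=5) as resp:
            return model_names(json.load(resp))
    except (urllib.error.URLError, OSError):
        return None


# Usage (needs a running Ollama server):
# print(fetch_local_models())
```

A return value of `[]` corresponds to the empty `ollama list` output on a fresh install, while `None` suggests the daemon is not running.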

```
ollama run llama3.2
```

This pulls the model (a few GB, one-time download) and drops you into an interactive chat session. Type a prompt, hit enter, get a response. `Ctrl+D` or `/bye` exits.

| Command | What it does |
|---|---|
| `ollama run <model>` | Pull (if needed) and chat with a model |
| `ollama pull <model>` | Download a model without starting a chat |
| `ollama list` | Show models you have installed |
| `ollama ps` | Show models currently loaded in memory |
| `ollama show <model>` | Show details/parameters for a model |
| `ollama rm <model>` | Delete a model to free disk space |
| `ollama stop <model>` | Unload a model from memory |
| `ollama create <name> -f Modelfile` | Build a custom model from a Modelfile |

Always pull with an explicit tag for anything you depend on (`ollama pull qwen2.5-coder:7b`), since `:latest` can change under you.
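If model names live in config files or scripts, a small check can flag unpinned references before they cause surprises. This is only a sketch: `split_model_ref` and `is_pinned` are hypothetical helpers, and the sketch assumes a reference with no tag resolves to `latest`.

```python
def split_model_ref(ref):
    """Split 'name:tag' into (name, tag); a missing tag is treated as 'latest'."""
    # Only look for the tag in the last path segment, so a registry
    # host with a port (host:5000/ns/model) is not mistaken for a tag.
    prefix, slash, last = ref.rpartition("/")
    name, colon, tag = last.partition(":")
    if not colon or not tag:
        tag = "latest"
    return prefix + slash + name, tag


def is_pinned(ref):
    """True when the reference names an explicit tag other than 'latest'."""
    return split_model_ref(ref)[1] != "latest"


for ref in ["qwen2.5-coder:7b", "llama3.2", "llama3.2:latest"]:
    print(ref, "pinned" if is_pinned(ref) else "NOT pinned")
```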

Ollama's library has hundreds of models. As a starting point:

| Use case | Try | Rough RAM/VRAM |
|---|---|---|
| General daily driver, light hardware | `llama3.2:3b` | ~4 GB |
| General daily driver, mid hardware | `llama3.1:8b` or `qwen3:8b` | ~6–8 GB |
| Coding | `qwen2.5-coder:7b` or `qwen3-coder:30b` (MoE, runs lighter than its size suggests) | 6–20 GB |
| Reasoning / math / step-by-step logic | `deepseek-r1:7b` or `:14b` | 6–12 GB |
| Best quality you can fit on a single consumer GPU | `qwen3.6:27b` or `gpt-oss:20b` | ~16–24 GB |
| Vision (images + text) | `llava` or `gemma3:12b` | 8–16 GB |
| Embeddings (for RAG / semantic search) | `nomic-embed-text` | <1 GB |

Rule of thumb for sizing: a 7–8B model at Q4 quantization needs roughly 5–6 GB of memory. Treat these as rough numbers, not gospel. Mixture-of-experts models (the ones with an "active/total" split, like `qwen3-coder:30b`) only activate a fraction of their parameters for each token, so they're often faster than their parameter count implies, but they still need the *full* model in memory, not just the active slice. Always check `ollama.com/library` for the current tag list, since model lineups change often.
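The rule of thumb can be turned into a back-of-the-envelope estimate: weight memory is roughly parameters times bits per weight divided by 8, plus some overhead. The defaults below are assumptions rather than measurements. They are about 4.5 bits per weight for a Q4-style quantization and about 1 GB for the KV cache and runtime, which grows with context length.

```python
def estimate_memory_gb(total_params_billion, bits_per_weight=4.5, overhead_gb=1.0):
    """Rough memory estimate in GB for a quantized model.

    total_params_billion: TOTAL parameter count in billions (for MoE
        models use the total, not the active slice).
    bits_per_weight: assumed average for the quantization (4.5 ~ Q4).
    overhead_gb: assumed allowance for KV cache and runtime buffers.
    """
    weights_gb = total_params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


for size in (3, 8, 30):
    print(f"{size}B -> ~{estimate_memory_gb(size):.1f} GB")
```

With these assumed defaults an 8B model comes out at about 5.5 GB, in line with the 5–6 GB figure above. Real usage depends on the exact quantization and context size.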

If you're not sure where to start: pull a small model, use it for a week on your actual tasks, and let what it struggles with point you toward the next one.

Ollama exposes a REST API on `localhost:11434`. This is how every IDE plugin, chat UI, and framework talks to it under the hood.

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "Explain Ollama in one sentence." }],
  "stream": false
}'
```
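The same request can be made from Python with only the standard library. This is a sketch under the assumption that a non-streaming `/api/chat` response carries the reply in `message.content`. `build_chat_request` and `chat_once` are illustrative names, not part of any Ollama package.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default local address


def build_chat_request(model, prompt, url=OLLAMA_CHAT_URL):
    """Build the same POST request the curl example sends."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat_once(model, prompt):
    """Send one prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        body = json.load(resp)
    return body["message"]["content"]


# Usage (needs a running server and a pulled model):
# print(chat_once("llama3.2", "Explain Ollama in one sentence."))
```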

It also exposes an **OpenAI-compatible endpoint** at `http://localhost:11434/v1/chat/completions`, so anything built for the OpenAI SDK can point at Ollama with a base URL change.

For Python, there's also an official client library (`pip install ollama`):

```
from ollama import chat

# Send one user message to a locally pulled model and print the reply.
response = chat(model='llama3.2', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
])
print(response.message.content)
```
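To illustrate the "base URL change" for the OpenAI-compatible endpoint, here is a standard-library sketch that builds an OpenAI-style chat completion request against any compatible base URL. The helper names are hypothetical. The `api_key` value is a placeholder, on the assumption that clients expect one even though a local Ollama server does not check it.

```python
import json
import urllib.request

OLLAMA_OPENAI_BASE = "http://localhost:11434/v1"  # assumed default address


def build_completion_request(base_url, model, prompt, api_key="ollama"):
    """Build an OpenAI-style chat completion request for a given base URL."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder key
        },
        method="POST",
    )


def first_reply(response_body):
    """Pull the assistant text out of an OpenAI-style response dict."""
    return response_body["choices"][0]["message"]["content"]


# Usage (needs a running server and a pulled model):
# req = build_completion_request(OLLAMA_OPENAI_BASE, "llama3.2", "Hello")
# with urllib.request.urlopen(req) as resp:
#     print(first_reply(json.load(resp)))
```

Only the `base_url` argument differs between a cloud provider and the local server, which is what makes existing OpenAI-style code easy to redirect.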

Want a model with a fixed system prompt or different default parameters? Create a `Modelfile`:

```
FROM llama3.2

PARAMETER temperature 0.7
PARAMETER num_ctx 4096

SYSTEM """
You are a terse code reviewer. Point out bugs and style issues only — no praise, no fluff.
"""
```

Build it:

```
ollama create code-reviewer -f Modelfile
ollama run code-reviewer
```

Now `code-reviewer` is its own model in `ollama list`, with your settings baked in.
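If you maintain several variants, generating the Modelfile text from a script can keep them consistent. The sketch below uses only the `FROM`, `PARAMETER`, and `SYSTEM` instructions shown above. `render_modelfile` is a hypothetical helper, and it assumes the system prompt does not itself contain triple quotes.

```python
def render_modelfile(base, system=None, parameters=None):
    """Render Modelfile text from a base model, parameters, and system prompt."""
    lines = [f"FROM {base}", ""]
    for key, value in (parameters or {}).items():
        lines.append(f"PARAMETER {key} {value}")
    if system:
        # Assumption: the prompt contains no triple quotes of its own.
        lines += ["", 'SYSTEM """', system.strip(), '"""']
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    system="You are a terse code reviewer. Point out bugs and style issues only.",
    parameters={"temperature": 0.7, "num_ctx": 4096},
)
print(text)
```

Write the result to a file named `Modelfile` and build it with `ollama create` as above.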

- By default, Ollama listens only on `127.0.0.1`. Setting `OLLAMA_HOST=0.0.0.0` exposes the API to your whole network.
- `OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED_MODELS` control concurrency if you're serving more than one model.
- Set `num_ctx` deliberately in a Modelfile instead of leaving it at whatever default your VRAM tier triggers.
- Check `ollama ps`: it shows whether a model is running on CPU or GPU. Driver issues (CUDA/ROCm) are the most common cause of silent CPU fallback.
- Point OpenAI-compatible tools at `http://localhost:11434/v1` to swap in local models with minimal code changes.
- Pair an embedding model (`nomic-embed-text`) with a chat model to build a local RAG pipeline with zero API cost.

That's the whole loop: install, pull, run, integrate. Everything else is just picking the right model for the job.
