Getting Started with Ollama: Run LLMs Locally in 10 Minutes

Ollama provides a tool for running large language models locally on macOS, Linux, and Windows without requiring an API key or cloud service. The tool packages model weights, a runtime based on llama.cpp, and a CLI/REST API, enabling users to download and run models like Llama 3.2 with a single command. Ollama's library includes hundreds of models for various use cases, and it exposes a REST API on localhost:11434 for integration with other applications.

If you've ever wanted to run a large language model on your own machine (no API key, no cloud bill, no data leaving your laptop), Ollama is the easiest way to get there. It packages model weights, a runtime built on `llama.cpp`, and a simple CLI/REST API into one tool that works the same way on macOS, Linux, and Windows. This guide covers installation, running your first model, the core commands you'll actually use, picking a model for your hardware, and hooking Ollama into your own code via its API.

The tradeoff: local models are generally smaller and slightly behind frontier cloud models (GPT, Claude, Gemini) on raw capability, though the gap keeps shrinking fast.

On macOS, download the app from [ollama.com/download](https://ollama.com/download), or use Homebrew:

    brew install ollama

On Linux, run the install script:

    curl -fsSL https://ollama.com/install.sh | sh

This installs the `ollama` binary and sets up a systemd service so it runs in the background. Check it's alive:

    systemctl status ollama

On Windows, download `OllamaSetup.exe` from [ollama.com/download](https://ollama.com/download) and run it; no admin rights are required. Recent versions ship a full desktop app with a chat window, so you can skip the terminal entirely if you prefer. A native ARM64 build is also available for Windows-on-Arm devices.

With Docker:

    docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Add `--gpus=all` if you have an NVIDIA GPU and the NVIDIA Container Toolkit installed.

To verify the install on any platform:

    ollama --version
    ollama list

An empty list is expected on a fresh install; it just confirms the daemon is up and responding.

Now run your first model:

    ollama run llama3.2

This pulls the model (a few GB, one-time download) and drops you into an interactive chat session. Type a prompt, press Enter, get a response. Ctrl+D or `/bye` exits.
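For scripts, the `ollama list` check has a rough HTTP counterpart. The sketch below assumes the daemon is on its default address (`http://localhost:11434`) and queries the `/api/tags` endpoint, which returns the locally installed models as JSON. The helper names `model_names` and `installed_models` are made up for this example.

```python
import json
import urllib.request
from typing import Any, Dict, List, Optional

# Default address used throughout this guide; change it if your daemon listens elsewhere.
OLLAMA_URL = "http://localhost:11434"


def model_names(tags_payload: Dict[str, Any]) -> List[str]:
    """Extract model names from a parsed /api/tags response."""
    # A missing or null "models" field is treated as "nothing installed".
    return [m["name"] for m in (tags_payload.get("models") or [])]


def installed_models(base_url: str = OLLAMA_URL,
                     timeout: float = 2.0) -> Optional[List[str]]:
    """Return installed model names, or None if the daemon is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return model_names(json.load(resp))
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None
```

Calling `installed_models()` should give an empty list on a fresh install and `None` when nothing is listening on the port, which makes it a convenient guard at the top of a script that depends on Ollama.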
| Command | What it does |
|---|---|
| `ollama run <model>` | Pull (if needed) and chat with a model |
| `ollama pull <model>` | Download a model without starting a chat |
| `ollama list` | Show models you have installed |
| `ollama ps` | Show models currently loaded in memory |
| `ollama show <model>` | Show details/parameters for a model |
| `ollama rm <model>` | Delete a model to free disk space |
| `ollama stop <model>` | Unload a model from memory |
| `ollama create <name> -f Modelfile` | Build a custom model from a Modelfile |

Always pull with an explicit tag for anything you depend on (`ollama pull qwen2.5-coder:7b`), since `:latest` can change under you.

Ollama's library has hundreds of models. As a starting point:

| Use case | Try | Rough RAM/VRAM |
|---|---|---|
| General daily driver, light hardware | `llama3.2:3b` | ~4 GB |
| General daily driver, mid hardware | `llama3.1:8b` or `qwen3:8b` | ~6–8 GB |
| Coding | `qwen2.5-coder:7b` or `qwen3-coder:30b` (MoE, runs lighter than its size suggests) | 6–20 GB |
| Reasoning / math / step-by-step logic | `deepseek-r1:7b` or `:14b` | 6–12 GB |
| Best quality you can fit on a single consumer GPU | `qwen3.6:27b` or `gpt-oss:20b` | ~16–24 GB |
| Vision (images + text) | `llava` or `gemma3:12b` | 8–16 GB |
| Embeddings for RAG / semantic search | `nomic-embed-text` | <1 GB |

Rule of thumb for sizing: a 7–8B model at Q4 quantization needs roughly 5–6 GB of memory; rough numbers, not gospel. Mixture-of-experts models (the ones with an "active/total" split, like `qwen3-coder:30b`) only run a fraction of their listed size at inference time, so they're often faster than their parameter count implies, but they still need the full model in memory, not just the active slice.

Always check [ollama.com/library](https://ollama.com/library) for the current tag list, since model lineups change weekly. If you're not sure where to start: pull a small model, use it for a week on your actual tasks, and let what it struggles with point you toward the next one.

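The sizing rule of thumb can be turned into back-of-envelope arithmetic. The sketch below is an estimate under stated assumptions, not a measurement: it assumes about 4.5 bits per weight for a Q4 quantization (4-bit values plus per-block scale factors) and a flat 1 GB allowance for the KV cache and runtime buffers, which in practice grows with context length. The function name is hypothetical.

```python
def estimate_memory_gb(params_billions: float,
                       bits_per_weight: float = 4.5,
                       overhead_gb: float = 1.0) -> float:
    """Rough memory estimate (in GB) for running a quantized model.

    params_billions: total parameter count in billions. For a
        mixture-of-experts model use the total, not the active count,
        because every expert has to be resident in memory.
    bits_per_weight: assumed average storage cost per parameter.
    overhead_gb: assumed flat allowance for KV cache and buffers.
    """
    # billions of parameters * bits / 8 = billions of bytes = GB
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb


# Compare a few sizes against the table above.
for name, size in [("3B", 3), ("8B", 8), ("30B MoE (total)", 30)]:
    print(f"{name}: ~{estimate_memory_gb(size):.1f} GB")
```

Under these assumptions an 8B model comes out at about 5.5 GB, in line with the 5–6 GB figure above, and a 30B mixture-of-experts model still lands near 18 GB because all of its weights must be loaded.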
Ollama exposes a REST API on `localhost:11434`; this is how every IDE plugin, chat UI, and framework talks to it under the hood.

    curl http://localhost:11434/api/chat -d '{
      "model": "llama3.2",
      "messages": [
        { "role": "user", "content": "Explain Ollama in one sentence." }
      ],
      "stream": false
    }'

It also exposes an OpenAI-compatible endpoint, so anything built for the OpenAI SDK can point at Ollama with a base URL change:

    http://localhost:11434/v1/chat/completions

For Python, install the client library:

    pip install ollama

Then call it:

    from ollama import chat

    response = chat(
        model='llama3.2',
        messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    )
    print(response.message.content)

Want a model with a fixed system prompt or different default parameters? Create a `Modelfile`:

    FROM llama3.2
    PARAMETER temperature 0.7
    PARAMETER num_ctx 4096
    SYSTEM """
    You are a terse code reviewer. Point out bugs and style issues only: no praise, no fluff.
    """

Build it:

    ollama create code-reviewer -f Modelfile
    ollama run code-reviewer

Now `code-reviewer` is its own model in `ollama list`, with your settings baked in.

A few practical notes:

- Ollama binds to `127.0.0.1` by default. Setting `OLLAMA_HOST=0.0.0.0` exposes the API to your whole network.
- `OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED_MODELS` control concurrency if you're serving more than one model.
- Set `num_ctx` deliberately in a Modelfile instead of leaving it at whatever default your VRAM tier triggers.
- Check `ollama ps`: it shows whether a model is running on CPU or GPU. Driver issues (CUDA/ROCm) are the most common cause of silent CPU fallback.
- Point existing OpenAI-SDK code at `http://localhost:11434/v1` to swap in local models with minimal code changes.
- Pair `nomic-embed-text` with a chat model to build a local RAG pipeline with zero API cost.

That's the whole loop: install, pull, run, integrate. Everything else is just picking the right model for the job.
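As a closing sketch, here is one way the OpenAI-compatible route could be exercised without installing any SDK. It uses only the Python standard library and assumes the default base URL `http://localhost:11434/v1` and the usual OpenAI chat-completions response shape (`choices[0].message.content`). The helper names are hypothetical, and `ask` only works while the daemon is running with the model already pulled.

```python
import json
import urllib.request
from typing import Any, Dict

# OpenAI-compatible prefix on the default Ollama address (assumed unchanged).
BASE_URL = "http://localhost:11434/v1"


def build_chat_request(prompt: str, model: str = "llama3.2",
                       base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request in the OpenAI chat-completions format."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(payload: Dict[str, Any]) -> str:
    """Pull the assistant text out of a chat-completions response."""
    return payload["choices"][0]["message"]["content"]


def ask(prompt: str, model: str = "llama3.2", timeout: float = 120.0) -> str:
    """Send one prompt to the local server and return the reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return extract_reply(json.load(resp))
```

With the server up, `print(ask("Explain Ollama in one sentence."))` sends the same question as the curl example. Keeping the request builder separate from the network call means that only the base URL has to change to point the same code at another OpenAI-compatible server.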