{"slug": "getting-started-with-ollama-run-llms-locally-in-10-minutes", "title": "Getting Started with Ollama: Run LLMs Locally in 10 Minutes", "summary": "Ollama provides a tool for running large language models locally on macOS, Linux, and Windows without requiring an API key or cloud service. The tool packages model weights, a runtime based on llama.cpp, and a CLI/REST API, enabling users to download and run models like Llama 3.2 with a single command. Ollama's library includes hundreds of models for various use cases, and it exposes a REST API on localhost:11434 for integration with other applications.", "body_md": "If you've ever wanted to run a large language model on your own machine — no API key, no cloud bill, no data leaving your laptop — **Ollama** is the easiest way to get there. It packages model weights, a runtime (built on `llama.cpp`), and a simple CLI/REST API into one tool that works the same way on macOS, Linux, and Windows.\n\nThis guide covers installation, running your first model, the core commands you'll actually use, picking a model for your hardware, and hooking Ollama into your own code via its API.\n\nThe tradeoff: local models are generally smaller and slightly behind frontier cloud models (GPT, Claude, Gemini) on raw capability — though the gap is shrinking fast.\n\nOn macOS, download the app from [ollama.com/download](https://ollama.com/download), or use Homebrew. On Linux, use the install script:\n\n```\n# macOS (Homebrew)\nbrew install ollama\n\n# Linux (install script)\ncurl -fsSL https://ollama.com/install.sh | sh\n```\n\nOn Linux, the script installs the `ollama` binary and sets up a systemd service so it runs in the background. Check it's alive:\n\n```\nsystemctl status ollama\n```\n\nOn Windows, download `OllamaSetup.exe` from [ollama.com/download](https://ollama.com/download) and run it — no admin rights required. Recent versions ship a full desktop app with a chat window, so you can skip the terminal entirely if you prefer. 
A native ARM64 build is also available for Windows-on-Arm devices.\n\nTo run Ollama in Docker instead:\n\n```\ndocker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama\n```\n\nAdd `--gpus=all` if you have an NVIDIA GPU and the NVIDIA Container Toolkit installed.\n\nWhichever route you took, verify the install:\n\n```\nollama --version\nollama list\n```\n\nAn empty list is expected on a fresh install — it just confirms the daemon is up and responding.\n\nNow run your first model:\n\n```\nollama run llama3.2\n```\n\nThis pulls the model (a few GB, one-time download) and drops you into an interactive chat session. Type a prompt, hit enter, get a response. `Ctrl+D` or `/bye` exits.\n\n| Command | What it does |\n|---|---|\n| `ollama run <model>` | Pull (if needed) and chat with a model |\n| `ollama pull <model>` | Download a model without starting a chat |\n| `ollama list` | Show models you have installed |\n| `ollama ps` | Show models currently loaded in memory |\n| `ollama show <model>` | Show details/parameters for a model |\n| `ollama rm <model>` | Delete a model to free disk space |\n| `ollama stop <model>` | Unload a model from memory |\n| `ollama create <name> -f Modelfile` | Build a custom model from a Modelfile |\n\nAlways pull with an explicit tag for anything you depend on (`ollama pull qwen2.5-coder:7b`), since `:latest` can change under you.\n\nOllama's library has hundreds of models. 
As a starting point:\n\n| Use case | Try | Rough RAM/VRAM |\n|---|---|---|\n| General daily driver, light hardware | `llama3.2:3b` | ~4 GB |\n| General daily driver, mid hardware | `llama3.1:8b` or `qwen3:8b` | ~6–8 GB |\n| Coding | `qwen2.5-coder:7b` or `qwen3-coder:30b` (MoE, runs lighter than its size suggests) | 6–20 GB |\n| Reasoning / math / step-by-step logic | `deepseek-r1:7b` or `:14b` | 6–12 GB |\n| Best quality you can fit on a single consumer GPU | `qwen3.6:27b` or `gpt-oss:20b` | ~16–24 GB |\n| Vision (images + text) | `llava` or `gemma3:12b` | 8–16 GB |\n| Embeddings (for RAG / semantic search) | `nomic-embed-text` | <1 GB |\n\nRule of thumb for sizing: a 7–8B model at Q4 quantization needs roughly 5–6 GB of memory. These are rough numbers, not gospel. Mixture-of-experts models (the ones with an \"active/total\" split, like `qwen3-coder:30b`) activate only a fraction of their parameters for each token at inference time, so they're often faster than their parameter count implies — but they still need the *full* model in memory, not just the active slice. 
Always check `ollama.com/library` for the current tag list, since model lineups change weekly.\n\nIf you're not sure where to start: pull a small model, use it for a week on your actual tasks, and let what it struggles with point you toward the next one.\n\nOllama exposes a REST API on `localhost:11434` — this is how every IDE plugin, chat UI, and framework talks to it under the hood.\n\n```\ncurl http://localhost:11434/api/chat -d '{\n  \"model\": \"llama3.2\",\n  \"messages\": [{ \"role\": \"user\", \"content\": \"Explain Ollama in one sentence.\" }],\n  \"stream\": false\n}'\n```\n\nIt also exposes an **OpenAI-compatible endpoint** at `http://localhost:11434/v1/chat/completions`, so anything built for the OpenAI SDK can point at Ollama with a base URL change. For Python, there is also an official client library:\n\n```\n# Install the client first: pip install ollama\nfrom ollama import chat\n\n# Send a single user message to the local llama3.2 model\nresponse = chat(model='llama3.2', messages=[\n    {'role': 'user', 'content': 'Why is the sky blue?'}\n])\nprint(response.message.content)\n```\n\nWant a model with a fixed system prompt or different default parameters? Create a `Modelfile`:\n\n```\nFROM llama3.2\n\nPARAMETER temperature 0.7\nPARAMETER num_ctx 4096\n\nSYSTEM \"\"\"\nYou are a terse code reviewer. Point out bugs and style issues only — no praise, no fluff.\n\"\"\"\n```\n\nBuild it:\n\n```\nollama create code-reviewer -f Modelfile\nollama run code-reviewer\n```\n\nNow `code-reviewer` is its own model in `ollama list`, with your settings baked in.\n\n- By default, Ollama binds to `127.0.0.1`. Setting `OLLAMA_HOST=0.0.0.0` exposes the API to your whole network with no built-in authentication.\n- `OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED_MODELS` control concurrency if you're serving more than one model.\n- Set `num_ctx` deliberately in a Modelfile instead of leaving it at whatever default your VRAM tier triggers.\n- Check `ollama ps` — it shows whether a model is running on CPU or GPU. 
Driver issues (CUDA/ROCm) are the most common cause of silent CPU fallback.\n- Point OpenAI SDK clients at `http://localhost:11434/v1` to swap in local models with minimal code changes.\n- Pair an embedding model (`nomic-embed-text`) with a chat model to build a local RAG pipeline with zero API cost.\n\nThat's the whole loop: install, pull, run, integrate. Everything else is just picking the right model for the job.", "url": "https://wpnews.pro/news/getting-started-with-ollama-run-llms-locally-in-10-minutes", "canonical_source": "https://dev.to/mohitkumar4/getting-started-with-ollama-run-llms-locally-in-10-minutes-5g98", "published_at": "2026-06-28 01:18:12+00:00", "updated_at": "2026-06-28 02:03:51.361145+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-infrastructure"], "entities": ["Ollama", "llama.cpp", "Llama 3.2", "Qwen", "DeepSeek", "Gemma", "Nomic", "NVIDIA"], "alternates": {"html": "https://wpnews.pro/news/getting-started-with-ollama-run-llms-locally-in-10-minutes", "markdown": "https://wpnews.pro/news/getting-started-with-ollama-run-llms-locally-in-10-minutes.md", "text": "https://wpnews.pro/news/getting-started-with-ollama-run-llms-locally-in-10-minutes.txt", "jsonld": "https://wpnews.pro/news/getting-started-with-ollama-run-llms-locally-in-10-minutes.jsonld"}}